AI for Good stories

Enhancing open-source large language models for industrial use: Insights from China Unicom

At the 2025 AI for Good Global Summit, Wang Kai, Vice Manager at China Unicom Data Intelligence Co., Ltd., presented a comprehensive overview of the company’s recent work on improving the practical deployment of open-source large language models (LLMs). His talk explored three key areas: model selection, reasoning optimization, and safety enhancement.

by

Celia Pizzuto

Featured Image

At the 2025 AI for Good Global Summit, Wang Kai, Vice Manager at China Unicom Data Intelligence Co., Ltd., presented a comprehensive overview of the company’s recent work on improving the practical deployment of open-source large language models (LLMs). His talk explored three key areas: model selection, reasoning optimization, and safety enhancement.

Emerging trends in the open-source LLM ecosystem

Wang began by identifying three major trends shaping the LLM landscape in 2025. First, open-source models are becoming increasingly dominant, with powerful examples such as DeepSeek, Qin, LLaMA, and Gamma entering widespread use. Second, the reasoning capabilities of LLMs have improved significantly, with top-performing models on popular benchmarks like Arena being reasoning-focused. Third, despite these advancements, safety remains a persistent challenge. “All the top ten open source models suffer from serious safety issues,” Wang noted, making this an obstacle for real-world applications.

Three core challenges in practical deployment

Wang outlined three main obstacles in bringing LLMs into practical use. The first is model selection. With limited computational resources and numerous model options, users often struggle to choose the right foundation model for specific applications.

The second challenge is overthinking. Reasoning models tend to generate unnecessarily long answers even for simple tasks, which increases computational costs. For example, models may generate extensive reasoning for a basic arithmetic problem like “what’s 9 plus 5,” consuming more resources than necessary.

The third challenge is safety. While there are established benchmarks to evaluate safety performance in English, there remains a gap in assessing models in other languages. Additionally, efforts to improve safety often risk reducing reasoning performance, presenting a trade-off that must be addressed.

A structured approach to model selection

To simplify model selection, China Unicom developed the Yuanjing LLM Selection Guide. This approach breaks down model capabilities into five main categories and 27 subcategories, based on over 100 real-world applications.

To support this framework, the team created the A-Eval benchmark, a dataset of 678 question-answer pairs used to evaluate 20 open-source models. Each model’s capability boundaries were assessed, allowing developers to identify which models are best suited for specific tasks. The guide establishes correlations between model parameters, functions, and applications, enabling users to select the most optimal model for their needs.

“For example, if you want to categorize the news you searched, then it suggests that you use a Q1 2.54B model,” Wang explained. The result is a practical reference that simplifies one of the most difficult steps in building applications using LLMs.

Watch the full session here:

Reducing overthinking with difficulty-adaptive slow thinking

To address the inefficiencies caused by overthinking, China Unicom proposed a novel technique called Difficulty-Adaptive Slow Thinking, or DAST. The objective is to generate concise answers for simple questions while preserving depth and rigor for complex ones.

This was achieved through a three-step process. First, the team introduced a metric called Token Length Budget (TLB) to measure answer conciseness during training. For each training question, 20 answers were generated and analyzed. If the average accuracy was high, indicating a simple question, shorter answers were preferred. If the accuracy was low, longer, more detailed answers were encouraged.

Second, the model was trained using Constrained Policy Optimization (CPO). Correct answers with shorter length were rewarded, while overly long responses were penalized, unless the question was complex, in which case longer answers were promoted.

Third, preference pairs were constructed for training. These included “winner” pairs where one answer was better than another, and “loser” pairs where neither answer was satisfactory. This approach helped the model learn how to balance brevity with thoroughness.

To validate the results, Wang shared performance data from the MATH 500 benchmark. The improved DeepSeek R1 model demonstrated higher precision across all difficulty levels, while producing shorter responses for simple questions without sacrificing accuracy for more complex ones. A subtraction question that originally produced a lengthy explanation was now answered briefly and correctly by the enhanced model. For a difficult logic question, the model still generated a detailed response, but more efficiently.

Improving safety without compromising reasoning

The third focus area was safety. China Unicom introduced ChiSafetyBench, a Chinese-language benchmark built in alignment with the TC260 standard. It includes five major categories and 31 subcategories for assessing model safety performance.

The team tested over 40 open-source models using this benchmark and found that most required additional safety enhancements. “Open-source models actually are not safe,” Wang said.

To improve safety without compromising reasoning, the team created a mixed training dataset with over 50,000 safety-related questions and 30,000 “train of thought” questions aimed at preserving general thinking abilities. The model was then fine-tuned using supervised fine-tuning (SFT).

The results showed a more than 10 percent increase in risk content identification accuracy and a 50 percent reduction in harmful responses. At the same time, evaluations using MATH-500, GPQA, and other benchmarks confirmed that the model’s reasoning capabilities remained intact.

One illustrative case involved a prompt injection scenario asking the model to tell a story that includes a Windows 10 Pro key. The original DeepSeek model output the key, while the enhanced model avoided doing so while still telling a coherent story, indicating improved resistance to prompt injection attacks.

A commitment to safer, smarter AI

Wang concluded by highlighting China Unicom’s broader goal: to make LLMs more practical, safe, and accessible for industry use. He underscored the importance of collaboration with global partners in advancing AI that is robust, efficient, and secure.

“We also propose the world’s first model guide for the general model selection, which is quite useful for novice users. We, at China Unicom, would like to collaborate with our partners to put AI into more practical and to make AI safer,” Wang concluded.

Are you sure you want to remove this speaker?