1. Why Doesn’t Accuracy Improve Even After Domain Specialization?

"If we train AI with our company's business data, it should become smarter."

Many companies advancing the use of generative AI probably think this way.
In fact, many have already begun building LLMs specialized for their business by feeding internal manuals, technical documents, FAQs, and past inquiry histories into the AI.

Recently, RAG (Retrieval-Augmented Generation) systems, which search internal documents to generate answers, have also become widespread, and the environment in which companies can build LLMs that leverage their own data has rapidly taken shape.

Against this backdrop, many companies have begun to consider and implement systems that combine an LLM with in-house data. However, quite a few companies that have actually undertaken this effort voice concerns such as the following:

"Even though we trained the AI with our own data, the answer accuracy did not improve as much as expected."

"Even though we introduced RAG, the answers are not accurate, making it difficult to use in actual work."

"It worked during the PoC, but the answer quality is unstable in production."

When faced with such situations, many companies tend to blame the model's performance. In reality, however, the root cause often lies not with the model but with data design.

This article organizes the background behind why the response quality of domain-specific LLMs does not improve as expected.

Reference blog: The Growing Demand for Domain-Specific LLMs and Its Background

2. Reason Why Domain-Specific LLMs Don’t Get Smarter ①

Domain Knowledge Is Not Adequately Reflected

Many companies advancing the implementation of domain-specific LLMs face the issue that "the answers seem plausible but are difficult to use in actual work."

For example, in manufacturing, even if a response about equipment trouble or quality defects is correct as general knowledge, it often cannot be applied directly on the factory floor.

Even when a quality defect occurs in a product, the cause and response methods may vary depending on factors such as the type of equipment used, the conditions of the production line, the material lot, and the history of past troubles. Additionally, there are often important points of adjustment and precautions that are not clearly described in the manuals but are empirically understood by veteran technicians.

If the LLM's responses cannot reflect such on-site conditions, it will be difficult to implement them in operations even if the content is generally correct. Corporate expertise is often treated as "common knowledge" within the company, and its specialized nature may not be fully recognized. However, that knowledge is precisely the source of the company's competitive advantage and is information with a high level of expertise from an external perspective.

If LLM responses are evaluated by personnel without specialized knowledge, content that appears fine at first glance but is unusable in actual operations may be overlooked or mistakenly judged acceptable. As a result, the AI will continue to learn and generate "responses that seem correct but are not useful in practice."

To bring domain-specific LLMs to a practical level, it is essential to appropriately incorporate specialized knowledge of the relevant field from the stages of data design and evaluation.
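To make this concrete, one way to capture such field-dependent expertise at the data-design stage is to record each answer together with the on-site conditions it depends on. The sketch below is purely illustrative; the schema, field names, and equipment names are hypothetical, not a standard format:

```python
# A hypothetical record that pairs a generally-correct answer with the
# site-specific answer and the conditions under which it applies.
expert_qa_record = {
    "question": "Scratches appear on the product surface after pressing.",
    "general_answer": "Inspect the die surface and check lubrication.",
    "site_specific_answer": (
        "On line 3, scratches after pressing are usually caused by "
        "debris on the lower die; clean it and re-check the material lot."
    ),
    "conditions": {                    # context the answer depends on
        "equipment": "Press-A200",     # hypothetical equipment name
        "line": "Line 3",
        "material_lot_sensitive": True,
    },
    "source": "veteran technician interview",
}

def applies_to(record, equipment, line):
    """Return True if the record's conditions match the current context."""
    cond = record["conditions"]
    return cond["equipment"] == equipment and cond["line"] == line

print(applies_to(expert_qa_record, "Press-A200", "Line 3"))  # True
```

Recording the conditions explicitly lets both retrieval and evaluation distinguish "correct in general" from "correct for this line and this equipment."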

3. Reason Why Domain-Specific LLMs Don’t Get Smarter ②

Domain Data Is Not in a "Format Easily Referenced by LLMs"

We sometimes receive the following question from companies considering the introduction of domain-specific LLMs.

"If we import internal documents such as PDFs, Excel, and Word files into RAG, will the AI understand the business knowledge and respond accordingly?"

Companies hold many documents, including technical documents accumulated over many years, operation manuals, FAQs, and trouble case reports. It is natural to assume that feeding such materials into AI will make them usable as business knowledge. In practice, however, simply inputting documents as they are rarely leads to a significant improvement in response quality.

Corporate documents often include PDFs with images and complex multi-column or tabular layouts, which are not structured in a way that is easy for LLMs to reference. Writing styles vary depending on the department or person in charge, and old and new information may be mixed together.

RAG is a mechanism that searches for relevant information and presents it to the LLM, but if the data being searched is in such a state, it may fail to retrieve appropriate information or generate responses based on fragmented information.

In other words, the issue with response quality is not whether there is an abundance of documents, but whether the information contained in those documents is in a form that is easy for the LLM to reference.

To improve the accuracy of domain-specific LLMs, it is important not only to input documents but also to organize and structure business knowledge in a form that is easy for the LLM to reference.
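As a minimal sketch of what "organizing and structuring" can mean, the hypothetical function below splits a plain-text manual into section-level chunks, each carrying metadata such as the source document and update date. Real pipelines must additionally handle tables, images, and multi-column PDF layouts, which plain-text splitting alone cannot:

```python
import re

def structure_document(text, doc_title, updated):
    """Split a plain-text document into section-level chunks with metadata.

    A minimal sketch: sections are assumed to be marked by "## Heading"
    lines. Attaching the update date lets retrieval prefer newer content
    when old and new information are mixed together.
    """
    chunks = []
    parts = re.split(r"^## ", text, flags=re.MULTILINE)
    for part in parts:
        if not part.strip():
            continue
        heading, _, body = part.partition("\n")
        chunks.append({
            "doc": doc_title,
            "section": heading.strip(),
            "updated": updated,
            "text": body.strip(),
        })
    return chunks

# Hypothetical manual text used only for illustration.
doc = ("## Startup procedure\nTurn on main power before the control panel.\n"
       "## Troubleshooting\nIf error E3 appears, check the lower die.")
for c in structure_document(doc, "Press-A200 Manual", "2024-05"):
    print(c["section"])
```

Because each chunk is self-describing, a RAG system can retrieve a coherent section rather than an arbitrary fragment of a page.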

Reference blog: RAG Implementation Support – Special Site for Manual Standardization Supporting AI Utilization, AI Development, RAG Implementation, and AI Proofreading Support

4. Reason Why Domain-Specific LLMs Don’t Get Smarter ③

Improvements Are Being Made While Evaluation Design Remains Ambiguous

Another issue that many companies face is the evaluation method for LLMs.

In some cases, people review AI responses and attempt improvements based on vague criteria such as "better than before" or "no major mistakes." At first glance this may seem fine, but it makes the direction of improvement difficult to pin down.

Fundamentally, LLM quality should be evaluated from multiple perspectives: the accuracy of responses, the comprehensiveness of information, suitability for business operations, and so on. If the evaluation criteria remain ambiguous, however, it becomes unclear what should be improved, and the same issues tend to recur.

In such situations, the improvement cycle may appear to be running, yet it often fails to deliver actual quality gains.

To improve the quality of domain-specific LLMs, it is important not only to tune the model but also to design the evaluation criteria themselves.

Reference blog: The Role of RLHF in Domestic LLMs — Where Does "Human Judgment" That Determines the Quality of Japanese LLMs Come Into Play?

5. Initiatives Needed to Improve the Quality of Domain-Specific LLMs

As we have seen so far, the quality of domain-specific LLMs is not determined by a single factor. Stable performance is achieved only when multiple elements such as data preparation methods, incorporation of expert knowledge, and evaluation mechanisms are combined.

The first important step is to clearly define the role the AI will play. If the task definition (what kinds of questions the AI should answer, and the scope of its responses) remains ambiguous, the AI's quality cannot be properly evaluated.

Next, what is needed is data design that reflects domain knowledge. Organizing business knowledge into a form AI can understand, and preparing it as training data that includes judgment criteria and error classifications, dramatically changes how effectively the LLM learns.

Furthermore, it is also important to prepare an evaluation dataset so that AI performance can be measured continuously. Only by measuring quality against consistent standards, and improving accordingly, do stable accuracy gains become possible.
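A fixed evaluation set, run against every model version with the same pass criteria, is what makes results comparable over time. The sketch below uses a deliberately simple keyword check as a stand-in for real scoring, and all questions, keywords, and the stub model are hypothetical:

```python
# A fixed evaluation set: the same items are re-run after every change.
eval_set = [
    {"question": "What causes error E3 on Press-A200?",
     "required_keywords": ["lower die", "debris"]},
    {"question": "How often should Line 3 lubrication be checked?",
     "required_keywords": ["daily"]},
]

def passes(answer, required_keywords):
    """An answer passes only if it contains every required keyword."""
    return all(kw in answer.lower() for kw in required_keywords)

def evaluate(answer_fn):
    """Score any model version against the same fixed set and criteria."""
    results = [passes(answer_fn(item["question"]), item["required_keywords"])
               for item in eval_set]
    return sum(results) / len(results)

# A stub in place of a real LLM call, to show the measurement loop.
stub = lambda q: "Debris on the lower die; lubrication is checked daily."
print(evaluate(stub))
```

Because the dataset and the pass criteria stay constant, a change in the score can be attributed to the system change being tested, not to a shifting yardstick.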

6. Summary

Domain-specific LLMs cannot be realized simply by feeding in a company's own data. In many cases, the reason accuracy does not improve lies not in the model's capability but in how the data is organized and how the evaluation is designed.

While companies possess vast amounts of knowledge, unless it is organized in a form that AI can understand, its value cannot be fully leveraged. The success of domain-specific LLMs depends not only on model selection but also on the efforts to structure, evaluate, and continuously improve the knowledge held by the company.

In the coming era where the use of generative AI is expanding, a company's competitiveness is not determined solely by "which model to use." Whether a company can organize its knowledge in a form that AI can utilize will greatly influence the success or failure of LLM utilization.

7. Human Science Teacher Data Creation, LLM RAG Data Structuring Outsourcing Service

Over 48 million pieces of training data created

At Human Science, we are involved in AI model development projects across various industries, including medical support, automotive, IT, manufacturing, and construction, with a focus on natural language processing. Through direct transactions with many companies, including GAFAM, we have provided over 48 million items of high-quality training data. We handle a wide range of training data creation, data labeling, and data structuring in any industry, from small-scale projects to long-term large projects with a team of 150 annotators.

Resource management without crowdsourcing

At Human Science, we do not use crowdsourcing; projects are handled by personnel directly contracted with us. Based on a solid understanding of each member's practical experience and their evaluations from previous projects, we form teams that can deliver maximum performance.

Generative AI LLM Dataset Creation and Structuring, Also Supporting "Manual Creation and Maintenance Optimized for AI"

Since our founding, manual creation has been our core business, and we now also support "the creation of AI-friendly documents to facilitate the introduction of generative AI for corporate knowledge utilization." When it comes to sharing and utilizing corporate knowledge and documents with generative AI, current technology still cannot achieve 100% accuracy with tools alone. For customers who want to make full use of their past document assets, we also provide document data structuring. We offer optimal solutions that draw on our unique expertise with a wide variety of document types.

Secure room available on-site

Within our Shinjuku office at Human Science, we have secure rooms that meet ISMS standards, so we can guarantee security even for projects involving highly confidential data. We consider the preservation of confidentiality to be extremely important for all projects. Our information security management system has also received high praise from clients for remote work, because we not only implement hardware measures but also provide continuous security training to our personnel.

In-house Support

We provide staffing services for annotation-experienced personnel and project managers tailored to your tasks and situation. It is also possible to organize a team stationed at your site. Additionally, we support the training of your operators and project managers, assist in selecting tools suited to your circumstances, and help build optimal processes such as automation and work methods to improve quality and productivity. We are here to support your challenges related to annotation and data labeling.