
 

The Role of RLHF in Domestic LLMs — Where Does "Human Judgment" That Determines the Quality of Japanese LLMs Come Into Play?


2/9/2026


1. Introduction

The evolution of LLMs (Large Language Models) has been remarkable. Attention has focused on scaling up model sizes, making models smarter by increasing the amount of pre-training data, and optimizing model design and the training and inference processes. At the same time, the importance of RLHF (Reinforcement Learning from Human Feedback) as a factor that determines how practically usable an LLM is has come to be recognized more and more widely.

In the field of domestic LLMs and Japanese LLMs in particular, how RLHF is designed and operated often directly affects how well the model is judged to understand Japanese. Even when raw model performance is high, insufficient RLHF can leave the impression of a "difficult-to-use LLM."

This article organizes the role of RLHF in LLM development and explains why RLHF is important for domestic LLMs and how it affects the quality of Japanese LLMs.

2. What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF (Reinforcement Learning from Human Feedback) is a training method in which humans provide feedback on an LLM's outputs and the model is improved based on those judgments. In many recent LLMs, it is used as a core process for achieving responses that follow instructions and behavior that holds up in practice.

Typical LLM training begins with pre-training on a large amount of text. At this stage, the model acquires the structure of the language and general knowledge, but it does not sufficiently learn value judgments such as "which answer is preferable" or "which phrasing is appropriate." RLHF is a mechanism for filling this gap with human judgment.

In RLHF, humans compare and evaluate multiple LLM outputs for the same instruction, and a model (a reward model) is trained on those results to predict which output humans would judge as better. Running reinforcement learning on the LLM with this reward model makes it possible to generate responses that better align with human expectations.
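As a rough illustration of the reward-model step, the sketch below trains a toy scoring model on pairwise preferences using a Bradley-Terry style loss. It assumes PyTorch; the model, data, and hyperparameters are all hypothetical placeholders, not anything described in this article.

```python
# Minimal sketch of reward-model training from pairwise human preferences.
# Everything here (model, data, sizes) is an illustrative placeholder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy reward model: averages token embeddings and maps them to a scalar score."""
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean pooling by default
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.score(self.embed(token_ids)).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the response humans preferred should score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative batch: token ids of the preferred and the rejected response
# to the same instruction (4 comparison pairs, 16 tokens each).
chosen = torch.randint(0, 1000, (4, 16))
rejected = torch.randint(0, 1000, (4, 16))

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

The trained scorer then stands in for human judgment during the reinforcement learning phase, where the LLM is optimized to produce responses that the reward model scores highly.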

The important point is that RLHF is not merely a post-processing step but a process that shapes the behavior of the LLM itself. Especially for Japanese LLMs and domestically produced LLMs, the design of this process can make a significant difference in how usable and how reassuring the model feels as an LLM that understands Japanese.

3. Why is it difficult for LLMs to be practical without RLHF?

An LLM that has only completed pre-training has learned language patterns from a large amount of text, and its text generation ability itself is already very high. Even in tasks such as question answering and summarization, it produces outputs that look sufficiently intelligent on the surface.

However, when you actually try to use such an LLM in business or in a service, other problems become apparent. Sometimes the intent of an instruction is subtly misunderstood, the phrasing is too strong, or the content is correct but the response still feels slightly off and hard to use as is.

This is not so much due to a lack of knowledge in the LLM, but rather because it has not learned judgments such as "which answers are desirable" and "which behaviors are preferred." Pre-training is merely a process of learning how words are used and patterns that appear in large amounts of text, and it does not sufficiently include value judgments.

RLHF is a method for bridging this gap. By having humans evaluate the LLM's outputs and compare multiple responses, the model can learn not only whether an answer is "correct" but also whether it is "desirable." RLHF is an indispensable process for bringing LLMs to a practical level.
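To make the comparison step concrete, a single piece of human feedback might be recorded as something like the record below. The instruction, responses, and field names are invented for illustration; they are not taken from any actual dataset.

```python
# Hypothetical example of one preference record produced during human comparison.
preference_record = {
    "instruction": "Summarize this meeting memo for the sales team.",
    "response_a": "Draft A: blunt, overly assertive summary ...",
    "response_b": "Draft B: concise summary with business-appropriate politeness ...",
    "preferred": "response_b",
    "reason": "Both are factually fine, but A reads too assertive for business use.",
}
```

Records like this, accumulated at scale, are what the reward model described in the previous section is trained on.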

4. Why RLHF is especially important for domestic LLMs and Japanese LLMs

The reason RLHF is even more important for Japanese LLMs lies in the characteristics of the Japanese language. Japanese frequently omits the subject and depends heavily on context. In addition, appropriate expression varies greatly with the relationship between speaker and listener, including honorifics, polite language, and indirect phrasing.

Even when a sentence is semantically correct, it is not uncommon to feel that "this way of saying it is a bit harsh" or "it is too assertive to use in business." Such judgments cannot be made from grammatical or lexical correctness alone; they rely on the intuition of people who actually use Japanese.

Domestic LLMs are increasingly developed on the assumption that they will be used in business systems, internal tools, and customer support within Japan. As a result, there are many situations that call for safe, risk-averse responses, even at the cost of some vagueness. To reflect these behaviors that users implicitly expect in the model, RLHF specialized for Japanese becomes important.

5. "Annotation Design" That Determines the Success of RLHF

RLHF is a method that relies on human feedback, but simply involving people does not automatically improve the quality of the LLM. What most strongly influences whether RLHF succeeds is annotation design.

Annotation design refers to defining in advance the perspectives from which the LLM's outputs are evaluated in RLHF and how those judgments are recorded as data. Only when the evaluation criteria, the granularity of judgments, and the shared understanding among annotators are organized can human feedback function effectively as training data.

If RLHF is conducted with ambiguous evaluation criteria, the assessments of the same LLM output will vary, resulting in a loss of consistency as training data. If one person evaluates it as "polite and good" while another sees it as "roundabout," the LLM will not know in which direction it should improve.
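One common way to check that consistency is to measure how often annotators pick the same preferred response and correct that rate for chance agreement. The sketch below is a minimal, hypothetical example; the labels and the two annotators are invented.

```python
# Checking judgment consistency between two annotators on the same comparisons.
# Labels are "A" or "B" for the preferred response; the data is illustrative.
from collections import Counter

annotator_1 = ["A", "B", "B", "A", "B", "B", "A", "A"]
annotator_2 = ["A", "B", "A", "A", "B", "B", "B", "A"]

n = len(annotator_1)
observed = sum(a == b for a, b in zip(annotator_1, annotator_2)) / n

# Cohen's kappa corrects raw agreement for the agreement expected by chance.
counts_1, counts_2 = Counter(annotator_1), Counter(annotator_2)
labels = set(annotator_1) | set(annotator_2)
expected = sum((counts_1[l] / n) * (counts_2[l] / n) for l in labels)
kappa = (observed - expected) / (1 - expected)

print(f"raw agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```

A low value is usually a signal that the evaluation criteria, rather than the annotators, need to be revisited.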

In particular, Japanese LLMs require evaluation criteria for concepts that are difficult to quantify, such as "politeness," "consideration," "naturalness," and "assertiveness." The work of putting these concepts into words and organizing them with concrete examples forms the core of RLHF.
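As one possible illustration of what that verbalization might produce, the fragment below sketches a rubric in which each criterion gets a scale and concrete anchor descriptions. The criteria names, scale, and anchor texts are assumptions made for this example, not an actual guideline.

```python
# Hypothetical fragment of an annotation guideline for Japanese outputs:
# each hard-to-quantify criterion is given a scale and concrete anchors.
rubric = {
    "politeness": {
        "scale": "1 (low) to 5 (high)",
        "anchors": {
            1: "Reads as curt or demanding toward the reader.",
            3: "Neutral wording, acceptable for internal use.",
            5: "Appropriately polite for external business correspondence.",
        },
    },
    "assertiveness": {
        "scale": "1 (low) to 5 (high)",
        "anchors": {
            1: "Hedges so much that the conclusion is unclear.",
            3: "States a conclusion while acknowledging uncertainty.",
            5: "Asserts unverified claims as fact.",
        },
    },
}
```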

Furthermore, this kind of annotation design is not something that is decided once and finished. It is revised while reviewing the LLM's outputs, and the evaluation criteria are continuously updated. RLHF is not a one-time process but an ongoing effort that continues in line with the growth of the LLM.

6. Continuous RLHF Operation Stabilizes the Quality of LLMs

As mentioned in the previous section, RLHF is not a process that is finished once it has been carried out. As an LLM approaches practical use, inputs closer to real operation increase, and unexpected responses and subtle failure cases become more noticeable.

Each time, it is necessary to work out what the problem was, review the evaluation criteria, and create additional feedback data. Only by cycling through this improvement process continuously does RLHF contribute to improving the quality of the LLM.

If steady tasks such as sharing evaluation criteria, reducing variability in judgments, and quality checks are neglected, RLHF remains a formality and does not deliver the expected effect. RLHF is not just a "process that involves humans"; it is a process that asks whether it can be operated continuously.

7. The Competitiveness of Domestic LLMs Depends on How RLHF Is Utilized

Going forward, differences in areas such as model architectures and the scale of computational resources like GPUs are expected to narrow gradually. In that context, for domestically produced LLMs and Japanese LLMs to remain competitive, what matters is how carefully the judgments and sensibilities of Japanese speakers can be reflected in the model through RLHF.

RLHF is not merely a post-processing step but a crucial development process that influences the quality and reliability of LLMs. Consistently embedding human judgment into data and continuously improving it supports the usability of Japanese LLMs.

Competition in domestic LLM development extends beyond model performance to how RLHF is designed and operated. The accumulation of this work will make the difference between LLMs that are actually used and those that are not.

8. Human Science's Training Data Creation and LLM RAG Data Structuring Outsourcing Services

Over 48 million pieces of training data created

At Human Science, we have been involved in AI model development projects in a wide range of industries, centered on natural language processing, including medical support, automotive, IT, manufacturing, and construction. Through direct transactions with many companies, including GAFAM, we have provided over 48 million pieces of high-quality training data. Regardless of industry, we handle a wide range of training data creation, data labeling, and data structuring work, from small-scale projects to long-term, large-scale projects with teams of 150 annotators.

Resource management without crowdsourcing

At Human Science, we do not use crowdsourcing. Instead, projects are handled by personnel who are contracted with us directly. Based on a solid understanding of each member's practical experience and their evaluations from previous projects, we form teams that can deliver maximum performance.

Generative AI LLM Dataset Creation and Structuring, Also Supporting "Manual Creation and Maintenance Optimized for AI"

Since our founding, our core business has been manual creation, and we now also support the creation of documents optimized for AI recognition to help companies introduce generative AI for knowledge utilization. When sharing and utilizing corporate knowledge and documents with generative AI, current technology still cannot achieve 100% accuracy with tools alone. For customers who want to make the most of their existing document assets, we also provide document data structuring. We offer optimal solutions that leverage our expertise and deep familiarity with a wide variety of documents.

Secure room available on-site

Our Shinjuku office at Human Science has secure rooms that meet ISMS standards, so we can guarantee security even for projects that involve highly confidential data. We consider maintaining confidentiality extremely important for every project. For remote work as well, our information security management system has been highly rated by clients, because in addition to hardware measures we continuously provide security training to our personnel.

In-house Support

We also provide staffing services, dispatching experienced annotators and project managers suited to our customers' tasks and circumstances. Teams stationed at the customer's site can also be arranged. In addition, we support training of your annotators and project managers, selection of tools suited to your situation, automation, work design, and the construction of optimal processes to improve quality and productivity, assisting with any issue related to annotation and data labeling.

 

 

 
