1. Introduction

Large language models (LLMs) such as ChatGPT are increasingly used in business scenarios ranging from operational efficiency to customer support. At the same time, many companies that actually deploy and operate these systems run into the question, "Is this AI really working correctly?"

LLMs do not necessarily return clear-cut correct answers, and their outputs can vary and be ambiguous. Therefore, how to evaluate their quality has become an important topic.

This article provides an easy-to-understand explanation of the basic concepts of LLM evaluation, specific evaluation methods, and practical points.

Reference blog: The Role of RLHF in Domestic LLMs — Where Does "Human Judgment" That Determines the Quality of Japanese LLMs Come Into Play?

2. What is LLM Evaluation?

LLM evaluation is an effort to measure and assess the quality of the text output by generative AI.

Traditional AI (such as image recognition and classification models) could be evaluated by its match rate against the "correct labels." LLMs, however, have the following characteristics.

・There is no single correct answer

・The output may vary each time

・The quality varies depending on context and use case

Therefore, it is necessary to evaluate from multiple perspectives, not only "how correct it is" but also "how usable it is" and "how safe it is."

Reference blog: The Growing Demand for Domain-Specific LLMs and Their Background

3. Why is LLM Evaluation Important?

In business use, "plausibility" is not enough

LLMs generate very natural and persuasive text. However, that "plausibility" does not guarantee accuracy. So-called hallucinations, in which content that differs from the facts is generated as if it were correct, pose a significant risk in practical work.

For example, providing incorrect guidance in customer support can lead to a decline in customer satisfaction, and in specialized fields such as finance and healthcare, it can result in serious judgment errors. Evaluation is essential not based on superficial naturalness, but to determine whether it can be used reliably in business operations.

Functions as a guideline for improvement

When the quality of the LLM does not improve as expected, the cause is rarely singular. The problem could lie in the prompt design, in low-quality reference data, or in insufficient accuracy of the retrieval component of a RAG pipeline; without conducting an evaluation, it is impossible to identify where the bottleneck lies.

By conducting appropriate evaluations, it becomes clear "which parts need improvement," reducing unnecessary trial and error. As a result, both development speed and quality can be enhanced.

The foundation for continuous quality management

LLM-based systems are not built once and then finished; they are designed to be improved continuously in operation. Each time the model is updated, data is added, or prompts are changed, it is necessary to verify whether quality has improved or regressed.

By designing evaluation metrics in advance, it becomes possible to quantitatively grasp the effects of improvements, enabling stable operation.

4. Main Perspectives of LLM Evaluation

LLM evaluation cannot be measured by a single metric; it requires a comprehensive judgment by combining multiple perspectives. Here, we introduce representative perspectives commonly used in practical work.

Accuracy

The most important aspect is accuracy: whether the output is grounded in fact and free of misinformation. This perspective is especially critical when handling internal knowledge or specialized information, because stable operation is difficult without ensuring it.

Relevance

This evaluates whether the response appropriately addresses the user's question or intent. Even if the content itself is correct, it is not practical if it deviates from the intent of the question. This is a particularly important metric for FAQ handling and chatbots.

Consistency

Check whether responses to the same input stay consistent in direction. Because LLMs generate text probabilistically, outputs will fluctuate, but a certain level of stability is required for business use.
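One rough way to quantify this stability is to generate multiple responses to the same prompt and measure how much their wording overlaps. The sketch below, a simplified illustration rather than any standard tool, uses mean pairwise Jaccard similarity over token sets as a consistency proxy; the sample responses are hypothetical.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (1.0 = identical vocabulary)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise Jaccard similarity across repeated generations."""
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical responses to the same customer-support prompt
samples = [
    "returns can be requested within 30 days",
    "you can request a return within 30 days",
    "returns are accepted within 30 days of purchase",
]
score = consistency_score(samples)
```

In practice, semantic similarity (e.g., via embeddings) is usually a better proxy than surface token overlap, but the same "sample repeatedly, then compare" pattern applies.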

Fluency

This refers to whether the text is natural and easy to read as a sentence, and whether it feels appropriate as Japanese. In written expressions that serve as points of contact with users, not only the content but also the quality of the expression is important.

Safety

This involves checking for inappropriate statements, harmful content, or biases. In corporate use, this is a critical aspect as it can lead to brand damage and compliance risks, making it an indispensable consideration.

5. Methods of LLM Evaluation

The evaluation methods for LLMs can be broadly classified into three categories: human evaluation, automatic evaluation, and task-based evaluation grounded in actual business operations. It is important to understand the characteristics of each and use them appropriately according to the purpose.

Human Evaluation

This method involves humans actually reviewing the output and scoring it based on pre-established criteria. For example, evaluations are conducted in forms such as "rating accuracy on a 5-point scale" or "determining whether it can be used in business operations."

The greatest strength of this method is its ability to make flexible judgments that include context and nuance. While it allows for evaluations closely aligned with actual business needs, it tends to produce variability between evaluators and can be costly. Therefore, clarifying evaluation criteria and establishing guidelines are important.
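The variability between evaluators mentioned above can itself be measured. A common check is Cohen's kappa, which corrects raw agreement for chance; the sketch below implements it in plain Python for two raters, with hypothetical 5-point accuracy ratings as input.

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Chance-corrected agreement between two raters (1.0 = perfect)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Hypothetical 5-point accuracy ratings from two evaluators on 8 outputs
ratings_a = [5, 4, 4, 3, 5, 2, 4, 5]
ratings_b = [5, 4, 3, 3, 5, 2, 4, 4]
kappa = cohens_kappa(ratings_a, ratings_b)
```

A low kappa is a signal that the evaluation criteria or guidelines need to be made more concrete before the scores can be trusted.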

Automatic Evaluation

This method mechanically evaluates using metrics or other models. Traditionally, text similarity metrics such as BLEU and ROUGE have been used, but recently, the method called "LLM-as-a-Judge," where another LLM performs the evaluation, has also become widely used.

Automatic evaluation is characterized by its ability to process large amounts of data quickly, making it easy to incorporate into continuous improvement cycles. However, since numerical scores do not always correspond to practical usefulness, it is assumed to be used in conjunction with human evaluation.
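To make the similarity-metric idea concrete, here is a minimal from-scratch sketch of ROUGE-1 F1 (unigram overlap between a model output and a reference answer). The example strings are invented; production use would typically rely on an established metrics library rather than this simplified version.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a model output and a reference answer."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the refund is issued within five business days",
    "refunds are issued within five business days",
)
```

Note the limitation this example exposes: "refund" and "refunds" score as a mismatch despite identical meaning, which is exactly why such metrics are increasingly supplemented by LLM-as-a-Judge and human review.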

Task-Based Evaluation

This method evaluates outcomes in actual business tasks. Example indicators include the accuracy rate of inquiry responses, the working time saved by the answers, and user satisfaction.

This method is the closest to real-world practice and can directly answer the question, "Is this AI actually useful?" On the other hand, since evaluation design and data collection require effort, it is common to introduce it gradually while combining it with other methods.

Reference blog: Comprehensive Comparison of Major LLMs: A Guide to Using ChatGPT, Perplexity, Grok, and Gemini

6. Common Evaluation Design Process in Practice

LLM evaluation is generally carried out through the following steps.

1. Define the evaluation objective (what do you want to improve?)

2. Prepare evaluation data (actual questions and use cases)

3. Set evaluation metrics (accuracy, relevance, etc.)

4. Conduct the evaluation (manual or automated)

5. Improve and re-evaluate

By repeating this cycle, the quality will gradually improve.
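The steps above can be sketched as a tiny evaluation harness. Everything here is illustrative: the model call is a stub standing in for a real LLM, and the question/answer pairs and function names are invented, not any particular tool's API.

```python
from statistics import mean

def fake_model(question: str) -> str:
    """Stand-in for a real LLM call (step 4 would invoke the actual system)."""
    return {"What is the return window?": "30 days"}.get(question, "I don't know")

def exact_match(output: str, expected: str) -> float:
    """Simplest possible metric; real setups would use richer scoring."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

# Step 2: evaluation data drawn from (here, invented) real use cases
eval_set = [
    {"question": "What is the return window?", "expected": "30 days"},
    {"question": "Is express shipping free?", "expected": "No"},
]

# Step 4: run the evaluation and aggregate step 3's metric
scores = [
    exact_match(fake_model(case["question"]), case["expected"])
    for case in eval_set
]
accuracy = mean(scores)
```

Because the harness is deterministic and cheap to rerun, step 5 (improve and re-evaluate) becomes a matter of changing the prompt or data and comparing the aggregate score before and after.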

7. Common Challenges in LLM Evaluation

While LLM evaluation is important, many companies commonly encounter stumbling blocks during the process of applying it to practical work. Here, we will specifically look at the challenges that tend to occur most frequently.

Evaluation criteria tend to become ambiguous

Since LLM outputs do not have a single correct answer, it is not uncommon for evaluations to begin without a clear definition of "what constitutes a good response." As a result, judgments vary among evaluators and rest on subjective impressions such as "somewhat good" or "feels slightly off."

In this state, even when a score changes, no one can explain what actually improved, and there is no basis for adjusting prompts or data. In practice, it is important to define concrete criteria and examples for each evaluation perspective, such as "Accuracy: does it match the facts?" and "Relevance: does it address the intent of the question?"

Evaluation data is disconnected from actual business practice

Insufficient evaluation data or data that diverges from actual business operations is also a major issue. Even if simple questions prepared for validation yield high scores, in real-world settings there are often complex contexts and ambiguous inquiries, causing the system to not perform as expected.

For example, in internal knowledge searches, evaluation data needs to include not only "questions written in formal terminology" but also "abbreviations," "paraphrases," and "questions with omitted assumptions." If data close to actual operation is not prepared, the evaluation results will diverge from the actual usability.
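One way to keep evaluation data close to real usage is to expand each intent into the phrasing variants described above. The sketch below is a hypothetical structure (the intent, queries, and document path are all invented for illustration), pairing every variant with the same expected answer so robustness to phrasing is measured rather than assumed.

```python
# Hypothetical internal-knowledge-search eval case with phrasing variants
eval_cases = [
    {
        "intent": "VPN connection procedure",
        "variants": [
            "What is the procedure for connecting to the corporate VPN?",  # formal terminology
            "how do i get on the vpn",                                     # casual paraphrase
            "VPN setup?",                                                  # abbreviated
            "I can't reach internal sites from home, what do I do?",       # assumptions omitted
        ],
        "expected_doc": "kb/network/vpn-setup",
    },
]

# Flatten into (expected answer, query) pairs: every variant is scored
# against the same target, so one intent yields multiple test queries.
all_queries = [
    (case["expected_doc"], variant)
    for case in eval_cases
    for variant in case["variants"]
]
```

Sourcing the variants from real query logs, rather than writing them by hand as done here, keeps the evaluation set aligned with how users actually phrase things.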

Evaluation and improvement processes are disconnected

Even when evaluations are conducted and scores are visualized, in many cases this never leads to concrete improvement actions. For example, even if the result shows "low accuracy," improvements cannot begin unless you distinguish whether the issue lies in retrieval precision, prompt design, or the quality of the reference data.

Evaluation is meant to be the starting point for improvement. Based on the evaluation results, it is necessary to break down the causes and design concrete measures such as improving retrieval accuracy, revising prompts, and adding or reorganizing data. Designing evaluation and improvement as one integrated process is a crucial point for successfully utilizing LLMs.

8. Summary

LLM evaluation is an indispensable process for leveraging generative AI in business. Unlike traditional AI, which can be measured simply by accuracy rates, it is necessary to comprehensively assess quality from multiple perspectives such as accuracy, relevance, and safety.

Moreover, evaluation is not merely a checking task but should function as a guideline for improvement. By visualizing where the issues lie and linking this to improvements in prompts, data, and system configuration, we can move closer to AI that is truly usable in practical work.

An LLM is not a technology you implement once and are done with; it presumes continuous quality improvement. In this context, evaluation serves as the foundation for maintaining quality and maximizing value. Designing evaluations tailored to your company's use cases and running improvement cycles can fairly be called the deciding factor in whether AI utilization succeeds.

9. Human Science Training Data Creation, LLM RAG Data Structuring Outsourcing Service

Over 48 million pieces of training data created

At Human Science, we have been involved in AI model development projects across a range of industries, beginning with natural language processing and extending to medical support, automotive, IT, manufacturing, and construction. Through direct transactions with many companies, including GAFAM, we have delivered over 48 million items of high-quality training data. We handle a wide range of training data creation, data labeling, and data structuring, from small projects to long-term, large-scale projects staffed by a team of 150 annotators, regardless of industry.

Resource management without crowdsourcing

At Human Science, we do not use crowdsourcing. Instead, projects are handled by personnel directly contracted with us. Based on a solid understanding of each member's practical experience and their evaluations from previous projects, we form teams that can deliver maximum performance.

Generative AI LLM Dataset Creation and Structuring, Also Supporting "Manual Creation and Maintenance Optimized for AI"

Manual creation has been our main business and service since our founding, and we now also support the creation of AI-friendly documents that ease the introduction of generative AI for corporate knowledge utilization. When sharing and utilizing corporate knowledge and documents with generative AI, current technology still cannot achieve 100% accuracy with tools alone. For customers who want to make full use of their existing document assets, we also provide document data structuring, offering optimal solutions built on our deep familiarity with many types of documents.

Secure room available on-site

Our Shinjuku office at Human Science includes secure rooms that meet ISMS standards, so we can guarantee security even for projects involving highly confidential data. We consider the preservation of confidentiality extremely important for every project. For remote work as well, our information security management system has been highly rated by clients, because in addition to hardware measures we continuously provide security training to our personnel.

In-house Support

We provide staffing services for annotation-experienced personnel and project managers tailored to your tasks and situation. It is also possible to organize a team stationed at your site. Additionally, we support the training of your operators and project managers, assist in selecting tools suited to your circumstances, and help build optimal processes such as automation and work methods to improve quality and productivity. We are here to support your challenges related to annotation and data labeling.