Evaluating the Performance of Generative AI Translation in 6 Languages [ChatGPT vs Human Translation Comparison Report]


2025.10.29


1. Background and Challenges: Is Translation Evaluation Possible with Generative AI?

“Is it possible to automate translation quality checks using generative AI like ChatGPT?” In recent years, this question has been drawing attention, and in the translation field you often hear comments like the following:

 

・"It is difficult to manually verify the quality of translations"
・"Multilingual support is required, but there are no personnel available to perform evaluations"

 

As multilingual translation of manuals and technical documents becomes commonplace, the need to evaluate translation quality quickly and objectively in-house is growing. Manual checks, however, take more time and cost as the number of languages increases, and some subjective bias on the evaluator's part is unavoidable.

This is where generative AI such as ChatGPT comes in, with the hope that AI could evaluate translations automatically against consistent criteria. If translation checks could be completed with ChatGPT, it would be possible to assess the quality of delivered translations instantly and, where necessary, narrow down the parts that need re-checking by outside experts.

That said, there are also concerns about entrusting evaluation to generative AI.

 

For example:
・Can it give feedback as intended? (Do results vary too much depending on the prompt?)
・Is it safe to use with confidential documents? (Is there any risk of input data leaking externally or being used for training?)
・Can it evaluate as accurately as a human? (Can it detect errors without missing real ones or flagging non-issues?)

 

To address these questions, Human Science held an internal seminar verifying the usefulness of translation quality evaluation with ChatGPT. This article presents an overview of that verification and its results.

2. Verification Design: Methods of Translation Evaluation Using ChatGPT

First, here is how the multilingual translation check with ChatGPT was set up. We used English technical documents and their multilingual translations to compare evaluations by ChatGPT (GPT-4) with those of professional translators.

 

・Target data: Parallel data of English technical manuals translated into French, German, Italian, Spanish, Russian, and Dutch (approximately 200 sentences and about 1,000 words per language)
・Evaluation method: For the above six languages, we collected and compared quality evaluation results from two professional translators and ChatGPT (GPT-4, as of July 2024)
・Tools used: The LLM proofreading function integrated into our in-house translation QA tool "HS XChecker." The parallel data was fed into ChatGPT sentence by sentence for automatic proofreading and evaluation (a rough sketch of this flow follows below)
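
HS XChecker's internals are not public, so purely as an illustration, a sentence-by-sentence check of the kind described above might look like the following minimal Python sketch using OpenAI's API. The model name, prompt wording, and the sample data are assumptions for this sketch, not details taken from the verification.

```python
# Minimal sketch of sentence-by-sentence LLM proofreading (illustrative only;
# HS XChecker's actual implementation is not public).
# Assumes the openai Python package (v1+) with OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def check_sentence(source: str, translation: str, target_lang: str) -> str:
    """Ask the model to proofread one source/translation pair."""
    response = client.chat.completions.create(
        model="gpt-4",   # the verification used GPT-4 (as of July 2024)
        temperature=0,   # keep evaluations as repeatable as possible
        messages=[
            {"role": "system",
             "content": "You are a translation quality reviewer. "
                        "List any errors in the translation, or reply 'OK'."},
            {"role": "user",
             "content": f"Source (English): {source}\n"
                        f"Translation ({target_lang}): {translation}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical parallel data: (source, translation) pairs for one language.
pairs = [("Read This First", "Leer antes de empezar")]
for src, tgt in pairs:
    print(check_sentence(src, tgt, "Spanish"))
```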

 

The evaluation criteria follow general translation quality standards and consist of the following five items (an illustrative prompt encoding them appears after the list).

① Mistranslation or omission (whether the original meaning is correctly conveyed)
② Grammatical errors (whether there are any grammatical mistakes in the translation)
③ Formal errors (errors in terminology consistency, spelling variations, use of symbols, and other formal aspects)
④ Unnatural expressions (whether the translation contains unnatural or hard-to-understand expressions)
⑤ Inappropriate expressions (whether the wording is unsuitable for the context or the audience)
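
The exact prompt used in the verification is not published. As one hedged illustration, the five criteria could be written into a fixed evaluation rubric like the following; the criterion labels and JSON output format are assumptions for this sketch.

```python
# Illustrative rubric prompt encoding the five evaluation criteria.
# Criterion labels and the JSON output format are assumptions, not the
# prompt actually used in the verification.
EVALUATION_PROMPT = """\
Evaluate the translation against the source sentence on five criteria:
1. mistranslation_or_omission: is the original meaning conveyed correctly?
2. grammatical_error: does the translation contain grammatical mistakes?
3. formal_error: terminology consistency, spelling variants, symbol usage.
4. unnatural_expression: unnatural or hard-to-understand wording.
5. inappropriate_expression: wording unsuited to the context or audience.

Respond with JSON: {"issues": [{"criterion": "<label>", "comment": "<why>"}]}.
Respond with {"issues": []} if the translation is acceptable.
"""
```

Feeding each sentence pair through a fixed rubric like this is one way to reduce the prompt-to-prompt variability raised as a concern in Section 1.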

 

Using these criteria, which cover both accuracy and fluency, we analyzed how well ChatGPT detects each type of error by comparing its results with the human evaluations.
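
The report does not publish its scoring method; purely as an illustration, agreement between AI flags and human flags can be summarized per criterion as precision and recall. The segment IDs below are hypothetical, not data from the report.

```python
# Illustrative comparison of AI flags vs. human flags per criterion.
# The sets below are hypothetical segment IDs, not data from the report.
human_flags = {"grammatical_error": {3, 17, 42}}
ai_flags = {"grammatical_error": {3, 17, 58}}

for criterion in human_flags:
    h, a = human_flags[criterion], ai_flags[criterion]
    true_positives = len(h & a)
    precision = true_positives / len(a) if a else 0.0  # how many AI flags were real errors
    recall = true_positives / len(h) if h else 0.0     # how many real errors the AI caught
    print(f"{criterion}: precision={precision:.2f}, recall={recall:.2f}")
```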

3. Results: Trends and Limitations of Generative AI Translation Evaluation

The conclusion of the verification is that it is difficult to complete a perfect translation check using only ChatGPT. However, it was also found that some types of errors can be detected with high accuracy, and with some ingenuity, it can become a useful auxiliary tool. Below is a summary of the accuracy and trends for each evaluation criterion.

Evaluation Criterion | ChatGPT Accuracy | Comments
Grammatical errors | ○ (High accuracy) | Detection of grammatical mistakes is relatively accurate
Mistranslations and omissions | △ (Partially detectable) | Major errors are caught, but there are many unnecessary flags
Formal errors | × (Difficult to detect) | Formal mistakes such as incorrect bracket types are overlooked
Unnatural expressions | × (Difficult to detect) | Weak at evaluating nuance; hardly detected
Inappropriate expressions | × (Difficult to detect) | Insufficient context understanding; hardly detected

(* "◎" very high accuracy, "△" partially detectable but with limitations, "×" tends to be difficult to detect)

 

As shown in the table above, ChatGPT is good at detecting "grammatical errors" and showed results almost consistent with human indications. For example, it accurately captured cases of grammatical violations such as gender agreement errors in pronouns in Russian translations. It also generally detected omissions and clear mistranslations where the meaning was completely lost.

 

However, with "mistranslations and omissions," there were cases where translations a human would consider acceptable were judged as errors. For example, for the English chapter title “Read This First,” ChatGPT flagged the Spanish translation “Leer antes de empezar” for lacking a word corresponding to “This,” even though it is a natural Spanish rendering and not a mistranslation. Be aware that perfectly acceptable expressions can sometimes be falsely flagged as errors.

 

On the other hand, ChatGPT largely failed to detect "formal errors" (deviations from rules for symbols and notation consistency) and "unnatural/inappropriate expressions." For example, in a French translation that used the wrong type of quotation marks, the human evaluators pointed out the problem, but ChatGPT overlooked it. Likewise, in a German translation where a pronoun referring to a service name did not fit the context, ChatGPT failed to flag it. These issues involve subtle nuances of expression and style rules unique to each language, and they reveal the current limitations of ChatGPT.

 

From the above, this verification concluded that "ChatGPT alone cannot complete translation checking." However, it is also true that for aspects where mechanical checks are effective, such as grammatical errors, it can be helpful with accuracy close to that of humans. The important point is to delegate the parts it excels at to AI while having humans supplement the parts it struggles with.

4. Improvement Measures: Enhancing Evaluation Accuracy through Hybrid Operation

Taking into account the limitations of ChatGPT, Human Science is exploring a hybrid translation checking method combining AI and humans. The key to this is the combination of our proprietary "HS XChecker" and the "LLM Proofreading Tool."

 

HS XChecker has traditionally been a QA tool that checks translations for terminology consistency, numbers, and formatting. We have now added an "LLM Proofreading Support Function" by integrating OpenAI's GPT engine. Specifically, this function analyzes the source and translated texts sentence by sentence and automatically detects and displays errors from the five perspectives described above.

This mechanism addresses the problems seen when using ChatGPT as-is, such as returning only representative examples when the whole text is analyzed at once, or hallucinating errors that do not exist. Detected issues are visualized in a list for each source-translation pair, making it immediately clear which sentence has a problem and where. Human reviewers then only need to decide whether each issue ChatGPT raised is truly a problem, which leads to significant efficiency gains.
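
As a rough illustration of this flow (the field names below are assumptions, not HS XChecker's actual data model), each flagged issue can be kept as a record tied to a sentence pair, so the reviewer only accepts or rejects what the model has surfaced:

```python
# Illustrative record for one issue flagged in a source/translation pair list.
# Field names are hypothetical, not HS XChecker's actual data model.
from dataclasses import dataclass

@dataclass
class FlaggedIssue:
    segment_id: int         # which sentence pair the issue belongs to
    criterion: str          # one of the five evaluation criteria
    comment: str            # the model's explanation of the problem
    accepted: bool = False  # set to True by the human reviewer

issues = [
    FlaggedIssue(12, "grammatical_error",
                 "Pronoun gender does not agree with its referent."),
]

# The reviewer works through the flagged list instead of re-reading
# every sentence, keeping only the issues that are truly problems.
confirmed = [issue for issue in issues if issue.accepted]
```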

 

The tool is also designed with security in mind: it uses OpenAI's API, and bilingual data submitted for analysis is not reused for model training.

 

We believe that a "hybrid operation," where the final judgment is made by humans while leveraging the power of generative AI, is currently the most practical and effective solution.

5. Future Prospects and Potential Applications: How Will Translation Evaluation Evolve in the AI Era?

From this verification, we have gained insight into the "current state" and "future prospects" of AI-based translation evaluation. Finally, we will summarize the knowledge obtained and the outlook going forward.

 

・AI evaluation is merely an auxiliary tool:
It became clear that ChatGPT alone cannot cover every type of error, and that human eyes remain indispensable for subtle nuances and contextual judgment. At the same time, AI can comfortably handle routine parts such as grammar checks. A smart division of labor lets AI do what it excels at while humans cover what it struggles with.

 

・Efficiency improvement of the evaluation process:
In the generative AI era, translation checking will center on human-AI collaboration rather than full automation. AI's suggestions serve as preliminary information, letting humans concentrate on more creative, higher-level judgments. Given the strengths and weaknesses of AI revealed this time, the key will be designing an optimal hybrid operation within each company's workflow.

 

・Further potential: expansion to more languages and specialized fields:
Since GPT-4 handles dozens of languages, the scope of evaluation can be expanded beyond the European languages targeted this time. There is also potential for applying it to quality checks in specialized fields such as legal and medical documentation. In highly specialized areas it is risky to take AI judgments at face value, but AI can still be useful for detecting certain patterns and performing initial checks, with final decisions left to experts. We plan to continue researching such field-specific verification and ways to improve evaluation accuracy, and to reflect the results in our services.

 

As described above, the introduction of AI into translation evaluation is expected to evolve not as a replacement for humans but as a support for humans. To meet quality requirements while improving operational efficiency, we encourage you to consider adopting a hybrid evaluation approach combining AI and human input.

6. Detailed verification materials are available for free download!

This article covers only part of the verification. We have prepared a free report summarizing the detailed results of the ChatGPT translation evaluation. It includes many graphs based on actual evaluation data along with specific correction examples, so please take a look if you are interested.

 

We also accept inquiries regarding the use of generative AI in translation work. Please feel free to contact us even if your requirements are not yet finalized.
>>Contact us here

 

Please download the report and put it to use in improving your operations.

Service Links
>>Translation Services | Translation Services for Industry and Manufacturing
>>Translation Services | Multilingual Translation Services

 

 
