Verification Report on Translation Proofreading Accuracy Using LLM Proofreading Tools|Generative AI × Translation


12/11/2025


Recently, in corporate translation work, unifying specialized terminology and maintaining quality across multiple languages have become major challenges in post-translation quality checks (proofreading). Manual proofreading requires significant time and effort, and multilingual deployment calls for a dedicated checking setup for each language. Against this background, automatic translation proofreading tools that use generative AI (large language models, LLMs) have attracted attention. With the advent of LLMs such as ChatGPT, grammar checking and rewrite suggestions have advanced dramatically, but to what extent can this technology assist in quality checking translated texts? In this article, we present the results of running actual multilingual translations through the LLM translation proofreading tool developed by our company, along with the key points for effective use that emerged. Through this verification, we consider whether such a tool can replace human proofreaders and where it can be used most effectively.

Table of Contents

1. What is an LLM Proofreading Tool
2. Verification Method
3. Verification Results
4. Strengths and Weaknesses of the LLM Proofreading Tool
5. How to Utilize in Translation Work
6. Summary and Future Outlook
7. For inquiries about AI utilization, contact Human Science

1. What is an LLM Proofreading Tool

The LLM proofreading tool is a translation proofreading support tool developed by our company that utilizes large language models (LLMs) like ChatGPT to automatically detect errors in translated texts. By inputting the source and translated texts, it identifies problematic areas from various perspectives such as omissions, mistranslations, grammatical errors, and unnatural expressions. It considers context as a human proofreader would, and it can also point out issues often missed by Microsoft Word’s document proofreading features or conventional QA tools, such as spelling mistakes in the source text and incorrect technical terms.

 

Supplement: "LLM (Large Language Model)" refers to AI models trained on vast amounts of text data, capable of advanced text generation and understanding. ChatGPT is a representative example.
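As a reference for readers who want to picture the mechanics, the following is a minimal sketch of how a source/target pair might be sent to an LLM for proofreading, here assuming the OpenAI Python SDK. The prompt wording, model name, and helper function are illustrative assumptions, not the actual implementation of our tool.

```python
# Minimal sketch of an LLM-based bilingual check (illustrative only; not the
# actual implementation of the tool described in this article).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are a translation reviewer.
Compare the source text and its translation, and list any omissions,
mistranslations, grammatical errors, terminology problems, or unnatural
expressions. Reply with one issue per line, or "OK" if there are none.

Source ({src_lang}): {source}
Translation ({tgt_lang}): {target}
"""

def check_segment(source: str, target: str, src_lang: str, tgt_lang: str) -> str:
    """Ask the model to proofread a single source/target pair."""
    response = client.chat.completions.create(
        model="gpt-4",  # the verification used a GPT-4-class model
        messages=[{
            "role": "user",
            "content": PROMPT.format(
                src_lang=src_lang, tgt_lang=tgt_lang,
                source=source, target=target,
            ),
        }],
        temperature=0,  # keep the check as repeatable as possible
    )
    return response.choices[0].message.content

# Example: a deliberately flawed Italian translation of an English sentence
# ("minutes" instead of "seconds").
print(check_segment(
    "Press the power button for three seconds.",
    "Premere il pulsante di accensione per tre minuti.",
    "EN", "IT",
))
```

In practice, a production tool would add structured output, glossary context, and batching, but even this simple pattern conveys the basic idea of giving the model both the source and the translation.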

2. Verification Method

To understand the performance of the LLM proofreading tool, verification was conducted using the following two types of data.

2-1. Verification Check File

First, we prepared short bilingual data containing intentional errors such as omissions and mistranslations (a multilingual test document). This verification check file was converted into a bilingual format and loaded into the LLM proofreading tool to confirm whether the expected errors were correctly detected. Input was provided either by loading the bilingual file or by copying and pasting the text directly, and the detection status for each language was recorded across multiple cases (Case-1 to Case-6).
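To illustrate the idea, the verification check file can be pictured as a small set of bilingual records, each carrying one known, deliberately inserted error. The sketch below shows a hypothetical representation of such cases and how one might confirm that the expected error type appears in the checker's feedback; the data, field names, and the check_segment callable are assumptions for illustration.

```python
# Hypothetical representation of a verification check file: each case pairs a
# source sentence with a target translation containing a known, intentional error.
TEST_CASES = [
    {
        "case": "Case-1",
        "lang": "IT",
        "source": "Store the device in a dry place.",
        "target": "Conservare il dispositivo in un luogo umido.",  # "humid" instead of "dry"
        "expected_error": "mistranslation",
    },
    {
        "case": "Case-2",
        "lang": "FR",
        "source": "Do not open the cover while the unit is running.",
        "target": "N'ouvrez pas le couvercle.",  # second clause omitted
        "expected_error": "omission",
    },
]

def run_verification(check_segment):
    """Feed each case to a checker (e.g. the LLM call sketched above) and record
    whether the expected error type is mentioned in its feedback."""
    for case in TEST_CASES:
        feedback = check_segment(case["source"], case["target"], "EN", case["lang"])
        detected = case["expected_error"] in feedback.lower()
        print(f'{case["case"]} ({case["lang"]}): expected {case["expected_error"]}, '
              f'detected={detected}')
```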

2-2. Actual Project File

Next, we prepared a bilingual file consisting of approximately 16 segments (195 words) extracted from translation data actually used in business operations. We ran this real-case-based bilingual data through the LLM proofreading tool to verify the extent of the feedback it could provide. Here, we compiled the number and types of errors and evaluated how useful the tool is for practical-level translations.
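As a simple illustration of how the flagged issues can be compiled, the snippet below tallies a hypothetical list of flags by category; the flag format is an assumption, not the tool's actual output.

```python
# Tallying flagged issues by category (the flag list below is illustrative,
# not actual output from the verification).
from collections import Counter

flags = [
    {"segment": 3, "category": "mistranslation"},
    {"segment": 7, "category": "omission"},
    {"segment": 11, "category": "terminology"},
    {"segment": 14, "category": "style"},
]

by_category = Counter(flag["category"] for flag in flags)
print(f"{len(flags)} issues across 16 segments:", dict(by_category))
```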

 

The verification was conducted using the LLM proofreading tool as of July 2025 (using a model equivalent to GPT-4). Additionally, the validity of the detection results was confirmed by translators as needed, and the advantages and challenges of the tool were analyzed.

3. Verification Results

3-1. Error Detection Results in Verification Check File

First, here are the results from the test file containing intentional errors. Most items in Cases 1 to 5 were detected, showing that the LLM proofreading tool identified many issues as expected. In Case 1, for example, the tool correctly detected a mistranslation in the Italian translation but missed the same error at the same location in the Slovenian translation, so detection did fail in some cases depending on the language. We also confirmed that when the Slovenian sentence alone was re-entered into the tool separately, the error could sometimes be detected. This suggests that detection results vary with context and input method, reflecting the context dependency characteristic of LLMs.

 

In the other cases, the tool detected the problematic passages in Case-2 (French) and Case-3 (English), and in Case-5 and Case-6 it detected the errors in both Italian and Slovenian. In other words, some errors were consistently detected across all languages, while others varied by language. This difference between languages is discussed again below as a weakness, but it is important to note that the detection performance of LLM-based tools is not completely uniform across languages.

3-2. Error Detection Results in Actual Project File

Next, here are the results of running the LLM proofreading tool on an actual translation file. For the bilingual text of 16 segments (195 words), the tool flagged errors at an average rate of 3.5 per 16 segments. These included all of the issues carried over from the previously mentioned test files (such as the intentionally inserted mistranslations), confirming that every known error was detected. Beyond those, the tool also identified new issues, capturing translation mistakes and expression problems that had not been anticipated beforehand.

 

A closer look at the flagged items shows that they included not only clear errors but also gray-area points requiring careful verification by translators. For example, the output included "expressions that may look like mistranslations at first glance but are acceptable depending on the context" and "notation inconsistencies that need review from the perspective of style unification", both of which call for human judgment. In other words, the LLM proofreading tool goes beyond simple mechanical rule checks and makes suggestions that touch on context and nuance. Conversely, this also means that not every suggestion can immediately be classified as an "error," and human review remains essential to assess importance and decide what to act on.

3-3. Differences in Multilingual Support Capability

The LLM proofreading tool features multilingual support, but the verification revealed differences in the number of issues detected depending on the language. Using actual project bilingual data prepared in 22 languages, mainly European, the tool was applied to each, resulting in a range of 2 to 5 detected issues per language.

Below is a partial excerpt (language name and number of flagged errors):

Language (Target): Number of flagged errors
Bulgarian (BG): 2 errors
Czech (CS): 5 errors
German (DE): 4 errors
Spanish (ES): 2 errors
French (FR): 4 errors
Hungarian (HU): 3 errors
Italian (IT): 4 errors
Lithuanian (LT): 5 errors
Slovenian (SL): 4 errors
… Others

 

It is noteworthy that every language had at least 2 and at most 5 flagged issues, and no language came back with zero detections; in other words, the tool found some potential problems in all 22 languages. The differences in counts, however, may reflect not only differences in translation quality but also the LLM's varying strengths across languages. For example, Czech and Lithuanian had a relatively high count of 5 issues each, while Spanish and Bulgarian had only 2. Determining whether this gap truly reflects differences in translation quality or differences in the model's recognition performance would require detailed analysis. As a general tendency, AI accuracy is reported to be higher for resource-rich languages (such as English and the major European languages), so less common languages may see more oversights. Even so, in this verification some issues were detected in every language, including the less common ones, so the tool is by no means useless for less familiar languages. Rather, it offers the advantage that at least a minimum level of checking can be performed even for languages with no specialized reviewer in-house.

4. Strengths and Weaknesses of the LLM Proofreading Tool

Based on the above verification results, we will organize the strengths (advantages) and weaknesses (disadvantages) unique to LLM proofreading tools.

4-1. Strengths (Advantages):

・Wide range of error detection: It can automatically detect a broad spectrum of issues, including omissions and mistranslations that human reviewers and conventional tools tend to overlook, grammatical errors, terminology inconsistencies, variations in notation, and unnatural nuances. Detection of grammatical errors is particularly accurate, and it was able to precisely point out issues such as gender agreement errors in pronouns in Russian translations. Additionally, a high detection rate of typos (misspellings) in Japanese has also been confirmed.

 

・Suggestions Based on Advanced Content Understanding: Leveraging the extensive knowledge unique to LLMs, it can even delve into verifying the accuracy of technical terms and proper nouns. For example, it has been observed to detect spelling mistakes in specialized abbreviations overlooked by translators and to attempt to verify whether product names remained correct after organizational changes.

 

・Multilingual Support: Since one tool can handle many languages, it enables basic checks even when there is no reviewer for that language within the company. In this verification, errors were detected in all 22 languages, showing coverage from major European languages to some Eastern and Northern European languages. This provides a foundation for conducting mechanical quality checks with consistent standards even in multilingual projects.

 

・Ease of Integration into Existing Workflows: The LLM proofreading tool is designed to fit easily into current translation workflows because it can take bilingual files created by various CAT tools directly as input. Since it can read and check the bilingual text exported after translation in tools such as Trados as is, little additional format conversion is needed, making it easy to add to the QA process (a parsing sketch follows after this list).


 

・Potential for Quality Improvement and Efficiency: By reducing overlooked errors, it is expected to raise the final quality, and it may be particularly useful as a simple substitute for manual QA in short-deadline projects. Additionally, accumulating feedback from proofreading results can lead to improvements in translation memories and glossaries, potentially contributing to the overall efficiency of the translation process in the long term.
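Regarding the integration point above, bilingual exchange files from CAT tools are typically XLIFF-based, so a checker can read them directly into source/target pairs. The sketch below assumes a plain XLIFF 1.2 file; real Trados SDLXLIFF exports add further namespaces and inline markup, so treat this as an illustration rather than a drop-in parser.

```python
# Reading source/target pairs from a bilingual file, assuming plain XLIFF 1.2.
# Real CAT-tool exports (e.g. SDLXLIFF) carry extra namespaces and inline tags,
# so this is an illustrative sketch, not a production parser.
import xml.etree.ElementTree as ET

XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

def read_bilingual(path: str) -> list[tuple[str, str]]:
    """Return (source, target) text pairs from an XLIFF 1.2 file."""
    tree = ET.parse(path)
    pairs = []
    for unit in tree.getroot().iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        if source is not None and target is not None:
            pairs.append(
                ("".join(source.itertext()), "".join(target.itertext()))
            )
    return pairs

# Each pair can then be passed to the LLM check sketched earlier.
```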

4-2. Weaknesses:

・Inconsistencies in Detection Depending on Language and Context: As mentioned above, there are variations in error detection across languages, and some languages may experience missed detections. For example, a certain mistranslation was detected in Italian but overlooked in Slovenian. This shows that the tool is not completely foolproof and weaknesses can appear with specific language combinations, so caution is necessary. In complex contexts or long sentences, AI suggestions may be unstable, and there have been reports of the feedback changing with each run.

 

・Existence of False Positives (Excessive Flags): Incorrect flags stemming from the LLM's own knowledge are also occasionally observed. False positives force users to triage the output, adding the work of verifying whether each flag is truly valid.

 

・Human judgment remains necessary: The final decision on how to handle the tool's suggestions rests with humans. Since not all flagged points necessarily require correction, the process of having translators or reviewers familiar with the language pair carefully examine the content cannot be omitted. In particular, it is pointed out that it is difficult for non-experts to determine whether suggestions regarding "omissions or mistranslations" are accurate, and expert judgment is needed. In short, it is difficult to complete translation checks using only LLM proofreading tools, and final confirmation by human eyes is indispensable. It is more appropriate to position these tools as "useful aids" rather than fully automated solutions.

 

・Caution in Handling Style Corrections: LLM proofreading tools may not only point out errors but also offer suggestions for improving expressions (e.g., "proposals to make Japanese expressions more concise"). However, especially for content where the overall flow of the text is important (such as blog articles or white papers), there have been reports that making corrections exactly as suggested can actually disrupt the flow of the text. In other words, it is necessary to exercise operational caution by not taking style-related suggestions at face value and treating them only as references.

5. How to Utilize in Translation Work

Given the strengths and weaknesses described above, using LLM proofreading tools appropriately as "assistive tools that amplify human capabilities" has the potential to bring significant benefits to the translation workflow. In particular, the following three approaches look promising.

5-1. Integration into the Translation QA Workflow

By incorporating LLM proofreading tools into the final check (Quality Assurance) of the translation process, you can prevent overlooking careless mistakes. First, the translator or proofreader conducts the usual review, then performs an automatic check with the tool as a final step. Since only the points flagged by the tool need to be rechecked, an efficient and thorough QA system can be established. The double-check by both humans and AI enhances the reliability of the quality assurance process and has the potential to reduce human errors before delivery.
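As an illustration of this final automated pass, the sketch below runs a checker over the reviewed segments and collects only the flagged ones for human recheck; check_segment here is a hypothetical stand-in for the LLM call sketched earlier, not our tool's actual interface.

```python
# Final automated pass after human review: collect only flagged segments so the
# reviewer rechecks those, not the whole file. check_segment is a stand-in for
# the LLM call sketched earlier (hypothetical interface).
def check_segment(source: str, target: str, src_lang: str, tgt_lang: str) -> str:
    return "OK"  # stub for illustration

def qa_pass(pairs, src_lang="EN", tgt_lang="DE"):
    """Return (segment_index, feedback) for every segment the checker flags."""
    flagged = []
    for i, (source, target) in enumerate(pairs, start=1):
        feedback = check_segment(source, target, src_lang, tgt_lang)
        if feedback.strip().upper() != "OK":
            flagged.append((i, feedback))
    return flagged

# flagged = qa_pass(read_bilingual("project_de.xlf"))  # recheck only these segments
```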

5-2. Quality Improvement of Existing Translations

This is a usage method where the LLM proofreading tool is applied in bulk to previously translated assets (such as manuals) to help improve quality. It is effective for maintaining existing content, such as detecting and correcting mistranslations in past translations, which would require enormous effort if done manually. This allows for efficient enhancement of the quality of internal documents and publicly released materials.
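One hypothetical way to organize such a bulk sweep is to run the same check over every bilingual file in a folder and write a per-file summary for triage; the folder layout, file extension, and report format below are assumptions for illustration.

```python
# Bulk sweep of legacy bilingual files with a per-file issue count (illustrative).
import csv
from pathlib import Path

def sweep_assets(folder: str, check_file, report_path: str = "legacy_qa_report.csv") -> None:
    """check_file(path) should return the list of flagged segments for one file,
    e.g. built from the read_bilingual and qa_pass sketches above (hypothetical)."""
    with open(report_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "flagged_segments"])
        for path in sorted(Path(folder).glob("*.xlf")):
            writer.writerow([path.name, len(check_file(str(path)))])
```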

5-3. On-the-Spot QA for Short-Deadline Translation Tasks

In situations where time is extremely limited, it may be unavoidable to skip the full QA process. When the quality risk of skipping checks entirely is unacceptable, however, the LLM proofreading tool can be used as a simple "spot QA" tool. Specifically, as soon as the translation is completed, it is run through the tool to automatically extract only major omissions and clear mistakes. Ideally, manual QA should still be performed, but when, say, only one hour is available for proofreading, the tool can function as a minimum quality gate and become a valuable ally for short-deadline work.
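As a purely illustrative example of such a minimum quality gate, the checker's output could be labeled by category and filtered down to only the critical issue types, such as omissions and clear mistranslations; the flag format and category names below are assumptions.

```python
# Keep only critical issues for a short-deadline spot check (flag format and
# category names are illustrative assumptions).
CRITICAL = {"omission", "mistranslation"}

def spot_check(flags: list[dict]) -> list[dict]:
    """Filter checker output down to issues worth fixing before an urgent delivery."""
    return [f for f in flags if f.get("category") in CRITICAL]

flags = [
    {"segment": 2, "category": "style", "detail": "wording could be tighter"},
    {"segment": 5, "category": "omission", "detail": "second sentence missing"},
]
print(spot_check(flags))  # -> only the segment-5 omission remains
```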

 

As described above, LLM proofreading tools can be utilized in a wide range of situations, from assisting with regular QA to asset maintenance and emergency backup. However, since it is essential for humans to be involved in selectively accepting or rejecting the tool’s suggestions rather than taking them at face value, these tools are positioned strictly as “assistants that strongly support proofreaders.”

6. Summary and Future Outlook

This verification has made the usefulness and limitations of translation proofreading tools utilizing generative AI concretely clear. While LLM proofreading tools are not a universal solution that can completely replace traditional manual checks, when used appropriately, they can become powerful assistants in translation quality management. In particular, they demonstrate accuracy almost equivalent to humans in detecting grammar errors and clear mistranslations, providing unprecedented convenience through consistent automatic checks across multiple languages. On the other hand, aspects such as judging nuances that require contextual understanding and selecting or discarding false positives still need to rely on human experience and judgment. In other words, the key to future translation quality management will be a hybrid operation of "AI + human." By incorporating AI’s highly accurate strengths while complementing them with human advantages, it will be possible to achieve both quality and efficiency—this is likely the next-generation style of translation checking.

 

Key point: By making generative AI an "ally" rather than an "enemy," translators and proofreaders can balance productivity with quality assurance. It is important to integrate it skillfully into your company's translation workflow and keep refining it with feedback from the field.

 

If you are interested in text composition with ChatGPT or ChatGPT's translation capabilities, please also take a look at our related blog articles.

>>How to Proofread Text with ChatGPT: Prompt Examples and Benefits Explained!
>>How Accurate Is the New OpenAI Model GPT-4.1 in Translation? A Comparison with DeepL!

7. For inquiries about AI utilization, contact Human Science

At Human Science, we are actively incorporating generative AI technologies, including this LLM proofreading tool, to improve the quality and efficiency of translation projects. We have established a system to provide our clients with higher-quality translations within shorter delivery times. If you are a translation manager reading this article and have concerns such as "challenges in your company's translation quality checks" or "difficulty keeping up with quality assurance in multilingual deployments," please consider our translation services. We also offer consultations for free trials utilizing the LLM proofreading tool. Would you like to practice new translation quality management methods for the generative AI era together with us? We will provide proposals tailored to your company's needs. Please experience the new form of translation quality management through our services. By combining professional translators and the latest AI technology, we offer unprecedented peace of mind and efficiency. We look forward to your inquiries!

 

 
