
"We want to make our website multilingual for overseas markets" and "We want to efficiently translate product manuals"
For companies expanding globally, machine translation has become an indispensable tool. However, there are many machine translation services available, and many people may be unsure about which service is best suited for their business. In such cases, the "BLEU (Bilingual Evaluation Understudy) score" serves as a useful objective metric to compare the performance of machine translation. This article clearly explains the basic meaning of the BLEU score, how to use it in business, and important points to keep in mind.
1. What is the BLEU Score? A Score Indicating Closeness to the "Correct Answer"
The BLEU score is an indicator that numerically evaluates how closely the text output by machine translation resembles a "reference translation" created by humans (professional translators). The name BLEU is an acronym for "Bilingual Evaluation Understudy." The score is generally expressed on a scale from 0 to 100, and the higher the number, the closer the translation is judged to be to a high-quality human translation. A score of 30 or above is generally considered to indicate an understandable translation of moderate quality.
So, how is this score calculated? The detailed formula is complex, but the basic idea is very simple.
The foundation of the evaluation is the "word match rate," which compares the machine-translated sentence with the correct sentence created by a human and measures how many of the same words are included. For example, if the correct translation is "I have a pen," and a machine translation also outputs "I have a pen," the score will be high; however, if it outputs "I have a pencil," the score will be low.
However, simply matching individual words does not guarantee a natural sentence. Therefore, the BLEU score also takes word order into account. By checking how many sequences of consecutive words (n-grams) such as "I have" or "have a pen" match the reference, it rewards translations with natural word order. In addition, a "brevity penalty" lowers the score of translations that are unnaturally shorter than the reference translation.
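The mechanics described above can be illustrated with a minimal, simplified BLEU sketch in Python. This version handles only a single reference translation and uses no smoothing, so it is far cruder than standard tools such as sacreBLEU, which you should use in practice; it is meant only to show the n-gram precision and brevity penalty at work.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all sequences of n consecutive words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Very simplified sentence-level BLEU (single reference, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        # "Clipped" matches: a candidate n-gram counts only as often
        # as it actually appears in the reference.
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(matched / total)
    if min(precisions) == 0:  # without smoothing, any zero precision zeroes the score
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: candidates shorter than the reference are marked down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * geo_mean

print(simple_bleu("I have a pen", "I have a pen"))     # exact match: 100.0
print(simple_bleu("I have a pencil", "I have a pen"))  # one word off: 0.0 here
```

Note that on a sentence this short, a single wrong word zeroes out the 4-gram precision and drags the whole score to 0; real toolkits apply smoothing so that short sentences are scored more gracefully.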
2. How to Use the BLEU Score in Business
First, the BLEU score is useful when selecting a service. For example, when you are unsure whether to adopt the machine translation service of Company A or Company B, it provides an objective basis for the decision. By using your company's product manuals and website texts as samples, translating them with both services, and comparing the BLEU scores of the results, you can numerically evaluate which service is better suited to your content.
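A head-to-head comparison of this kind can be sketched as follows. The snippet scores the outputs of two hypothetical engines against the same reference translations using a simplified corpus-level BLEU (single reference per segment, no smoothing); the sample sentences and engine outputs are invented for illustration, and a production comparison should use a standard toolkit such as sacreBLEU.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU: clipped n-gram matches are pooled
    across all segments before the precisions are computed."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            matched[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matched or 0 in total:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty for systems that under-translate.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

# Invented sample data: reference translations and two hypothetical engines.
refs = ["the meeting starts at ten", "please review the product manual"]
engine_a = ["the meeting starts at ten", "please review the product manual"]
engine_b = ["the meeting starts at nine", "please review the manual"]

print(f"Engine A: {corpus_bleu(engine_a, refs):.1f}")  # 100.0 (identical to refs)
print(f"Engine B: {corpus_bleu(engine_b, refs):.1f}")
```

Pooling the counts over the whole sample set, rather than averaging per-sentence scores, is how corpus-level BLEU is conventionally computed and gives a more stable basis for comparing two engines.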
It also helps improve the translation process after implementation. After introducing machine translation, there may be a task called "post-editing," where humans make corrections. However, if you use a machine translation engine with a high BLEU score, meaning high initial quality, the amount of human correction work decreases, which in turn leads to reductions in the time and cost required for translation. The BLEU score can also be used as a KPI (Key Performance Indicator) to measure this improvement effect.
Furthermore, it is also effective for fulfilling accountability to management. In response to the question, "How much has quality improved as a result of investing in the new translation system?", you can present the effect of the investment in concrete numbers, such as, "The BLEU score improved by 10 points after implementation, which is expected to reduce revision costs by XX%."
3. Points to Note About the BLEU Score
While the BLEU score is convenient, it is not a cure-all. It is important not to take the score at face value and to understand its limitations.
First, even if the meaning matches, the score will be low when the wording differs from the reference translation. For example, if the reference is "I need to attend that meeting" and the machine translation outputs "I have to participate in that meeting," the meaning is perfectly acceptable, but because the wording differs, as in "need to → have to" and "attend → participate," the BLEU score will be low.
It is also important to be aware that grammatical errors and unnatural nuances can be overlooked. As long as the words or phrases match the reference translation, the score may be high even if the overall flow of the sentence is unnatural.
It should also not be forgotten that the reliability of the score depends heavily on the quality of the human reference translation used for comparison. If the reference translation itself is of low quality, the resulting scores cannot be trusted no matter how carefully they are measured.
4. Beyond the BLEU Score: A New Era Where AI Evaluates Quality
Because BLEU scores have these caveats, new evaluation technologies have emerged in recent years to compensate for their weaknesses. These are automatic evaluation methods that utilize AI. AI-based evaluation assesses translation quality not only by surface-level word matches but also by considering the context and the semantic closeness of words. As a result, translations that BLEU scores tend to rate low—such as those with correct meaning but different expressions—can now be fairly evaluated in a way that is closer to human judgment.
What is particularly noteworthy in this field is the new evaluation metric represented by "MetricX." AI-based metrics like MetricX incorporate recent advances in AI technology to achieve more precise automatic evaluation. By using BLEU scores and these new AI evaluation metrics appropriately, it becomes possible to grasp the quality of machine translation more comprehensively and accurately.
For those who want to learn more about AI-based translation evaluation and MetricX, please see the following article.
5. Summary
The BLEU score is a benchmark that lets you easily measure the basic performance of machine translation, but it cannot evaluate correctness of meaning or richness of expression. AI-based evaluation, represented by MetricX, compensates for these weaknesses by assessing semantic accuracy in a way that is closer to human judgment. While making good use of advanced AI evaluation, it is ultimately important to verify with human judgment whether a translation can truly be used in a business setting. Combining objective data with real-world judgment is the surest path to finding the optimal translation solution for your company.
Human Science offers MTrans for Office, which utilizes LLM and machine translation. Please try the quality and usability with a 14-day free trial.
Features of MTrans for Office
- ① No limit on the number of files that can be translated or on the glossary, with a flat-rate system
- ② Translate with one click from Office products!
- ③ API connection ensures security
・For customers who want further enhancement, we also offer SSO, IP restrictions, and more.
- ④ Support in Japanese by Japanese companies
・Support for security check sheets is also available
・Payment via bank transfer is available
MTrans for Office is an easy-to-use translation software for Office.