
In today's global business environment, machine translation (MT) has become an indispensable tool. Its range of applications is expanding daily, from communicating with overseas customers and gathering the latest information to deploying content in multiple languages. In recent years in particular, with the advent of large language models (LLMs) such as ChatGPT, the quality of machine translation has improved dramatically, enabling more natural and contextually appropriate translations.
At the same time, however, a new challenge has emerged: how can we objectively evaluate the "quality" of this evolved machine translation?
The metrics that were once mainstream for evaluating machine translation quality (such as BLEU) can no longer accurately measure the advanced translations generated by LLMs. Unless we reconsider the "measuring stick" itself, it becomes difficult to choose the translation tool best suited to our business or to correctly assess the return on investment.
This article introduces the latest trends in machine translation evaluation based on the paper "Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task."
Table of Contents
- 1. What is the international conference "WMT" that competes on machine translation quality?
- 2. Why can't the previous "measuring sticks" measure it?
- 3. AI evaluating AI? The latest evaluation metrics
- 4. Business utilization points suggested by the WMT24 results
- 4-1. Check the "evaluation metrics" of the translation tools you are considering implementing
- 4-2. The Role of "Humans" in the Final Decision
- 5. Summary
1. What is the international conference "WMT" that competes on machine translation quality?
"WMT (Workshop on Machine Translation)" is the world's premier international workshop on machine translation held annually. Here, companies and research institutions from around the world compete in the performance of the translation systems they have developed.
Within WMT, a subtask called the "Metrics Shared Task," the theme of this article, is also held. Rather than competing on translation systems themselves, it is a kind of contest of evaluation metrics, comparing how well different metrics assess translation quality. The biggest focus in 2024 was the question, "Can existing evaluation metrics correctly assess translations generated by LLMs?"
2. Why can't the previous "measuring sticks" measure it?
One of the most widely used metrics for evaluating machine translation to date is "BLEU." This metric compares the machine-generated translation with a reference translation created by humans and scores how closely the words and phrases match. While simple and easy to understand, it has a fundamental issue: it does not understand meaning and only looks at superficial word matches. Because of this, BLEU has two major weaknesses.
The first point is that even if the meaning is correct, it is unfairly scored low if the words are different.
For example,
Reference translation: The cat sat on the mat
Machine translation: The cat perched on the rug
This machine translation is semantically perfect, but because its words barely overlap with the reference translation, the BLEU score ends up low.
The second point is the opposite phenomenon. Even if the meaning is critically wrong, it can be rated highly just because some words are similar.
For example,
Reference translation: The president visits Japan to discuss economic policy
Machine translation (hallucination): The president visits Japan to discuss military policy
This translation incorrectly renders "economic" as "military," producing a completely different meaning. However, since most of the other words match, the BLEU score comes out unfairly high.
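To make these two weaknesses concrete, here is a minimal sketch using the open-source sacrebleu Python library to score the two examples above. The exact numbers are illustrative and may vary by version, but the semantically correct paraphrase scores low while the hallucinated translation scores high.

```python
# A minimal sketch with the sacrebleu library, reproducing the two failure
# cases above. Exact scores are illustrative and may vary by version.
import sacrebleu

# Case 1: meaning preserved, words different -> BLEU is low
reference_1 = "The cat sat on the mat"
paraphrase = "The cat perched on the rug"

# Case 2: meaning broken ("economic" -> "military"), words mostly match -> BLEU is high
reference_2 = "The president visits Japan to discuss economic policy"
hallucination = "The president visits Japan to discuss military policy"

# sentence_bleu takes a hypothesis string and a list of reference strings
print(sacrebleu.sentence_bleu(paraphrase, [reference_1]).score)     # low despite correct meaning
print(sacrebleu.sentence_bleu(hallucination, [reference_2]).score)  # high despite wrong meaning
```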
In particular, recent LLMs excel at generating very fluent sentences. Therefore, even if they make mistakes like the second example above, the text tends to appear natural to human readers. As a result, when evaluated with BLEU, semantic errors are overlooked and high scores are given, making it increasingly difficult to accurately assess the true quality.
3. AI Evaluating AI? The Latest Evaluation Metrics
New evaluation metrics known as "neural metrics," which utilize AI technology, were developed to overcome the limitations of BLEU. At WMT24, these neural metrics again demonstrated high performance, with "MetaMetrics-MT," "MetricX-24-Hybrid," and "XCOMET," an evolved version of COMET, receiving the highest ratings.
For more details about MetricX-24, please see the article below.
What Is Google's AI Translation Evaluation Metric "MetricX-24"? Explaining Its Functions and Features
These metrics do not look at word matches like BLEU does. Instead, they convert sentences into vectors (collections of numerical values) that capture the "meaning," and evaluate the semantic closeness between the source text, machine-translated text, and reference translation. This allows for a deeper assessment of semantic validity without being influenced by superficial differences in wording.
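As a simplified illustration of this vector-based idea, the sketch below uses the sentence-transformers library to embed sentences and compare their cosine similarity. Note that actual metrics such as COMET, XCOMET, or MetricX-24 are trained neural models, not plain embedding similarity, and the model name below is an assumption; the sketch only shows how meaning, rather than word overlap, drives the comparison.

```python
# A simplified illustration: embed sentences as vectors and compare cosine
# similarity. Real metrics such as COMET or MetricX-24 are trained neural
# models, not plain embedding similarity; the model name below is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose embedding model

reference = "The cat sat on the mat"
paraphrase = "The cat perched on the rug"        # different words, same meaning
unrelated = "The president visits Japan today"   # some overlap possible, different meaning

embeddings = model.encode([reference, paraphrase, unrelated])

# Similarity in embedding space tracks meaning rather than surface word overlap
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # expected to be relatively high
print(util.cos_sim(embeddings[0], embeddings[2]).item())  # expected to be lower
```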
For more details about the BLEU score, please see the article below.
What Is the BLEU Score? Introducing Everything from Its Basic Meaning to Applications
Furthermore, at WMT24, the method of "meta-evaluation" for assessing these evaluation metrics has also evolved. Evaluations are now conducted in a manner closer to actual business use cases (e.g., determining which of multiple translation systems is superior), resulting in the selection of more practical metrics.
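As a toy illustration of this kind of meta-evaluation, the sketch below checks, for every pair of translation systems, whether a metric agrees with human judgment about which system is better (pairwise ranking accuracy). The WMT24 task uses a more refined variant of this idea, and all scores below are made-up placeholders.

```python
# A toy sketch of system-level meta-evaluation: for every pair of translation
# systems, check whether the metric agrees with human judgment about which one
# is better. All scores below are made-up placeholders.
from itertools import combinations

human_scores  = {"system_A": 82.0, "system_B": 78.5, "system_C": 74.0}  # hypothetical human ratings
metric_scores = {"system_A": 0.86, "system_B": 0.79, "system_C": 0.81}  # hypothetical metric scores

pairs = list(combinations(human_scores, 2))
agreements = sum(
    (human_scores[a] > human_scores[b]) == (metric_scores[a] > metric_scores[b])
    for a, b in pairs
)
print(f"Pairwise agreement with humans: {agreements}/{len(pairs)}")
```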
4. Business Utilization Points Suggested by the WMT24 Results
As a result of numerous evaluation metrics competing, WMT24 presented the following conclusion.
"Properly fine-tuned and additionally trained neural metrics continue to demonstrate high performance in evaluating LLM-based translation systems."
This indicates that, for business users, a reliable "measuring stick" exists even for the new technology of LLM-based translation. Based on this result, here are two points on what perspectives to adopt.
4-1. Check the "evaluation metrics" of the translation tools you are considering implementing
When comparing machine translation services, check not only the claim of "high accuracy" but also which evaluation metrics were used to produce those high scores. The phrase "No. 1 BLEU score" alone does not necessarily guarantee quality. Whether a vendor discloses evaluation results using the latest neural metrics can serve as a useful indicator of its technical reliability.
4-2. The Role of "Humans" in the Final Decision
Automatic evaluation metrics are not a panacea. The report also points out cases where evaluation becomes difficult, such as highly specialized domains or deliberately crafted inaccurate translations. Ultimately, it is important to have a process in which humans verify quality against your company's business domain and usage scenarios. Automatic metrics work best as objective reference information that supports, rather than replaces, the final human decision.
5. Summary
With the advent of LLMs, machine translation has advanced to a new stage. Along with this, the "measuring sticks" used to assess its quality have continued to evolve into more practical ones. In the business field, keeping up with these latest trends and selecting appropriate evaluation metrics leads directly to maintaining and improving translation quality. In an era where AI evaluates AI, combining human perspectives with technology is essential to building an optimal translation environment.
Human Science offers MTrans for Office, which utilizes LLMs and machine translation. Try the quality and usability for yourself with a 14-day free trial.
Features of MTrans for Office
- ① Unlimited number of file translations and glossary integration for a fixed fee
- ② One-click translation from Office products!
- ③ API connection ensures security
・For customers who want further enhancement, we also offer SSO, IP restrictions, and more.
- ④ Support in Japanese by Japanese companies
・Support for security check sheets is also available
・Payment via bank transfer is available
MTrans for Office is easy-to-use translation software for Office.