
AI translation is rapidly gaining ground in business, but objectively evaluating translation quality remains a challenge. This article explains the functions and features of "MetricX-24," the latest AI translation evaluation metric developed by Google.
Table of Contents
- 1. Current Status and Challenges of AI Translation Evaluation
- 2. Overview and Key Features of MetricX-24
- 2-1. Flexible Evaluation Modes
- 2-2. High-Precision Error Detection Function
- 2-3. Support for a Wide Range of Language Pairs and High Evaluation Accuracy
- 3. How to Interpret MetricX-24 Scores: From 0 to 25
- 4. Comparison with Conventional Evaluation Metrics
- 5. Summary
1. Current Status and Challenges of AI Translation Evaluation
Automatic evaluation metrics have traditionally been used to assess the quality of AI translation. Most of them compare an AI-generated translation with a human-created reference translation and quantify the degree of word or character overlap (BLEU is a typical example). However, as AI translation systems have advanced, especially those based on large language models (LLMs), it has been pointed out that such surface-matching metrics struggle to adequately evaluate semantic accuracy and contextual appropriateness.
To address this issue, "learning-based" automatic evaluation metrics have been developed: trained on human evaluation data, they achieve higher-precision quality assessment.
For more details about the BLEU score, please see the following article.
What Is the BLEU Score? Introducing Everything from Its Basic Meaning to Applications
2. Overview and Key Features of MetricX-24
MetricX-24 is a learning-based AI translation evaluation metric developed by Google. It has the following three features.
2-1. Flexible Evaluation Modes
MetricX-24 is designed as a hybrid evaluation system.
・Reference-based mode: when reference translations are available, quality is evaluated by comparing the translation output against them.
・Quality Estimation (QE) mode: when no reference translation exists, quality is evaluated from the source text and the translation output alone.
This hybrid design makes quality evaluation possible even when preparing reference translations is difficult; a usage sketch follows below.
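As a concrete illustration, Google publishes MetricX-24 checkpoints on Hugging Face (for example, google/metricx-24-hybrid-xl-v2p6), which are run through the prediction script in the google-research/metricx repository on GitHub. The Python sketch below only prepares the JSON Lines input for each mode; the field names, the --qe flag, and the exact command are taken from that repository's README and should be verified against its current version.

```python
import json

# Minimal sketch: preparing MetricX-24 input files for both evaluation modes.
# Field names and the predict command follow the google-research/metricx
# README; treat them as assumptions to check against the current repo.

segment = {
    "source": "明日の会議は午前10時に始まります。",
    "hypothesis": "Tomorrow's meeting starts at 10 a.m.",
    "reference": "The meeting tomorrow begins at 10:00 a.m.",
}

# Reference-based mode: source, hypothesis, and reference are all supplied.
with open("input_ref.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(segment, ensure_ascii=False) + "\n")

# QE mode: only the source and the translation output are needed.
qe_segment = {k: segment[k] for k in ("source", "hypothesis")}
with open("input_qe.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(qe_segment, ensure_ascii=False) + "\n")

# The repository's script is then invoked along these lines (add --qe for
# quality estimation mode, per the README):
#   python -m metricx24.predict \
#       --tokenizer google/mt5-xl \
#       --model_name_or_path google/metricx-24-hybrid-xl-v2p6 \
#       --max_input_length 1536 --batch_size 1 \
#       --input_file input_qe.jsonl --output_file scores.jsonl --qe
```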
2-2. High-Precision Error Detection Function
MetricX-24 is designed to accurately detect the following error patterns, which are characteristic of AI translation. These errors tend to be overlooked by conventional evaluation methods and can pose significant problems in business documents; each is illustrated with a hypothetical example after the list.
・Translation omission: part of the source information is left untranslated.
・Duplication: the same content is translated more than once.
・Missing punctuation: punctuation marks (commas, periods, question marks, etc.) are dropped in the translation.
・Fluent but irrelevant translation: the output reads grammatically natural and fluent, but its meaning differs from or is unrelated to the source content.
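To make these categories concrete, here are hypothetical examples of each error pattern for a short Japanese source sentence; they are invented for illustration and are not drawn from MetricX-24 itself.

```python
# Hypothetical examples of the four error patterns (invented for illustration).
# Source (Japanese): 報告書は金曜日までに提出してください。よろしくお願いします。
# A faithful translation: "Please submit the report by Friday. Thank you."
error_examples = {
    # Translation omission: the second sentence is dropped entirely.
    "omission": "Please submit the report by Friday.",
    # Duplication: the same phrase is rendered twice.
    "duplication": "Please submit the report by Friday by Friday. Thank you.",
    # Missing punctuation: the periods are lost in the output.
    "missing_punctuation": "Please submit the report by Friday Thank you",
    # Fluent but irrelevant: grammatical English, unrelated to the source.
    "fluent_but_irrelevant": "The weather should be lovely this weekend.",
}
```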
2-3. Support for a Wide Range of Language Pairs and High Evaluation Accuracy
MetricX-24 is trained on human evaluation data covering diverse language pairs and delivers stable performance across them. In the WMT24 Metrics Shared Task, where traditional metrics such as BLEU ranked below 20th place, MetricX-24 placed at the top, demonstrating its effectiveness.
3. How to Interpret MetricX-24 Scores: From 0 to 25
MetricX-24 scores machine translation quality on a floating-point scale from 0 to 25. A score of 0 indicates a perfect translation with no detected errors, while scores approaching 25 indicate more numerous and more serious errors. A translation scoring below 5 is generally considered free of major problems. Because this is the opposite of common metrics such as BLEU, where higher scores are better, caution is needed when reading the results.
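As a quick illustration of this inverted scale, the helper below maps a raw MetricX-24 score to a rough quality band. The below-5 cutoff comes from the guidance above; the other boundary is an arbitrary assumption added for illustration, not part of the metric's definition.

```python
def interpret_metricx24(score: float) -> str:
    """Map a MetricX-24 score (0 = perfect, 25 = worst) to a rough band.

    The < 5 cutoff follows the guidance in the text above; the 15 cutoff
    is an illustrative assumption, not part of the metric's definition.
    """
    if not 0.0 <= score <= 25.0:
        raise ValueError("MetricX-24 scores lie in the range [0, 25]")
    if score < 5.0:
        return "no major problems"
    if score < 15.0:  # assumed boundary, for illustration only
        return "noticeable errors; human review recommended"
    return "serious errors; likely unusable as-is"

print(interpret_metricx24(2.3))   # -> no major problems
print(interpret_metricx24(18.7))  # -> serious errors; likely unusable as-is
```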
4. Comparison with Conventional Evaluation Metrics
A long-standing issue with the conventional BLEU metric is that its score barely moves even when translation quality changes in ways humans notice. MetricX-24, by contrast, picks up these subtle quality differences and reflects them in the score, making it possible to judge accurately whether translation quality has improved.
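This insensitivity is easy to reproduce with the sacrebleu library: a one-word change that reverses the meaning of a sentence barely dents the BLEU score, because most n-grams still match the reference. The sentences below are invented for illustration.

```python
import sacrebleu  # pip install sacrebleu

reference = ["The new firewall blocks all inbound traffic by default."]

good = "The new firewall blocks all inbound traffic by default."
bad = "The new firewall allows all inbound traffic by default."  # meaning reversed

# Sentence-level BLEU is driven by surface n-gram overlap (higher = better).
print(sacrebleu.sentence_bleu(good, reference).score)  # 100.0 (exact match)
print(sacrebleu.sentence_bleu(bad, reference).score)   # still high despite the error
```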
5. Summary
AI translation quality evaluation has traditionally relied on automatic metrics such as BLEU, which struggle to adequately measure semantic and contextual accuracy. MetricX-24, developed by Google, is a learning-based metric trained on human evaluation data. It offers flexible evaluation modes depending on whether reference translations are available, detects errors such as translation omissions, duplications, missing punctuation, and irrelevant translations, and delivers highly accurate evaluations across a wide range of language pairs. With the advent of MetricX-24, quality management of AI translation has become more practical and reliable. Going forward, leveraging advanced evaluation metrics like MetricX-24 in the introduction and operation of AI translation is expected to further ensure and improve translation quality.
Human Science offers MTrans for Office, which combines LLMs and machine translation. You can try its quality and usability with a 14-day free trial.
Features of MTrans for Office
- ① Unlimited number of file translations and glossary integration for a fixed fee
- ② One-click translation from Office products!
- ③ API connection ensures security
・For customers who want further security enhancements, SSO, IP restrictions, and more are also available.
- ④ Support in Japanese by a Japanese company
・Support for security check sheets is also available
・Payment via bank transfer is available
MTrans for Office is easy-to-use translation software for Office.