How large should a corpus be?

A statistical machine translation engine can read bilingual data (a corpus), such as past translation memories, and use it to customize the expressions and terminology of its output.

The engine statistically analyzes the loaded corpus to find the words and phrases that follow with the highest probability,
and generates translations from them. It is generally said that the more corpus data you load into the engine,
the higher the quality of the translation will be.

So, how much corpus data should be loaded into the engine?
It depends on the engine and the languages used, but generally speaking,
a corpus of about 200,000 to 1,000,000 words is said to be necessary
to achieve a certain level of translation quality.

Upon hearing this,
some of you may think, "Preparing a corpus of over 200,000 words is a high hurdle."

However, in reality, even a small corpus
can deliver sufficient quality with a statistical engine.
The key in this case is how specialized the corpus is.
If the corpus consists of translations of documents from the same series as the target document,
such as product documentation,
and the terms and expressions used are similar,
it is possible to achieve good quality even with a corpus of around 100,000 words.

For example, let's consider the case of translating a printer driver manual from English to Japanese.

Suppose the source sentence and its correct translation are as follows.
——————————————————————
Remove the data
(Japanese translation meaning "Delete the data")
——————————————————————

Here, let's assume that a corpus covering printers in general has been loaded into the engine.
In that case, in addition to the driver-related translations that are actually needed,
the corpus may also include translations related to the printer itself (the hardware).
Consequently, machine translation may output a translation like the following.

——————————————————————
Remove the data
(Japanese translation meaning "Take off the data")
——————————————————————

The word "remove", which should have been translated in the sense of "delete",
was incorrectly translated in the sense of "take off".
This is because the corpus contained more sentences describing the printer itself than the driver,
so the more frequent translation of "remove" as "take off" was output.

In this case, it would probably have been better to load only a corpus specialized in "printer drivers", even a small one,
to achieve a more accurate translation.
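
The effect described above can be sketched with a toy frequency count. This is only a simplified illustration, not how an actual statistical engine is implemented: the corpus rows, the domain labels, and the function name `most_frequent_sense` are all hypothetical, and the English labels "take off" and "delete" stand in for the Japanese target words.

```python
from collections import Counter

# Hypothetical toy corpus: (domain, English source, sense chosen by the translator).
# The rows and their counts are invented purely for illustration.
TOY_CORPUS = [
    ("hardware", "Remove the toner cartridge", "take off"),
    ("hardware", "Remove the paper tray", "take off"),
    ("hardware", "Remove the rear cover", "take off"),
    ("driver", "Remove the print job", "delete"),
    ("driver", "Remove the old driver", "delete"),
]


def most_frequent_sense(corpus, word):
    """Return the sense most often paired with `word`, or None if it never occurs."""
    counts = Counter(
        sense
        for _domain, source, sense in corpus
        if word.lower() in source.lower().split()
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# General printer corpus: hardware sentences outnumber driver sentences.
general = TOY_CORPUS
# Specialized corpus: only the driver sentences are loaded.
driver_only = [row for row in TOY_CORPUS if row[0] == "driver"]

print(most_frequent_sense(general, "remove"))      # take off
print(most_frequent_sense(driver_only, "remove"))  # delete
```

With the general corpus the hardware sense wins simply because it occurs more often, while the smaller driver-only corpus yields the sense that the driver manual needs.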

This is a deliberately simple example, but a small, highly specialized corpus often yields better quality than a large one.
A corpus of over 200,000 words is not a strict requirement
for using a statistical engine.

I have heard of a case in which an engine trained on an 8-million-word corpus for a certain field and an engine trained on a 400,000-word corpus specialized for a single product showed no difference in quality.

Therefore, I recommend starting by checking the quality of a machine translation engine using your existing translation memories and other bilingual data.

We can evaluate machine translation quality using your existing data
and calculate cost reduction rates and return on investment.
Please feel free to contact us.
