Some parts of this page may be machine-translated.

 

How large should the corpus be?

With a statistical machine translation engine, you can customize the expressions and terminology of the machine translation output by loading parallel data (a corpus), such as past translation memories.

The engine statistically analyzes the loaded corpus, selects the words and phrases with the highest probability, and generates a translation from them. It is generally said that the more corpus data is loaded into the engine, the better the translation quality.

So, how much corpus data should be loaded into the engine?
It depends on the engine and the languages involved, but it is generally said that
"a corpus of about 200,000 to 1 million words is necessary" to achieve a certain level of translation quality.

When you hear this, some of you may think,
"It's difficult to prepare a corpus of over 200,000 words."

However, a statistical engine can deliver sufficient quality even with a small corpus.
The key is how specialized the loaded corpus is.
If the corpus consists of translations of documents in the same series as the target document,
such as documentation for the same product,
and the terminology and expressions used are similar,
good quality can be achieved even with a corpus of only 100,000 words.
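
One rough way to gauge whether a corpus is "in the same series" as the target document is to check how much of the document's vocabulary the corpus already contains. The sketch below is only an illustration under simple assumptions: the function names `vocabulary` and `coverage` and the sample sentences are hypothetical, the tokenizer handles plain English words only, and real engines do not necessarily use this measure.

```python
import re


def vocabulary(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def coverage(document, corpus):
    """Fraction of the document's distinct words that also occur in the corpus."""
    doc_words = vocabulary(document)
    if not doc_words:
        return 0.0
    return len(doc_words & vocabulary(corpus)) / len(doc_words)


# Hypothetical sample texts, invented for illustration only.
target = "Remove the data from the print queue"
driver_corpus = "Select the print queue and remove the data"
hardware_corpus = "Open the cover and remove the toner cartridge"

print(coverage(target, driver_corpus))    # higher: same product area
print(coverage(target, hardware_corpus))  # lower: different subject matter
```

A higher coverage value suggests that the corpus shares more terminology with the document to be translated, which is the property the paragraph above describes.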

For example, let's consider translating a printer driver user manual from English to Japanese.

Suppose the correct translation is as follows (the second line gives the meaning of the Japanese translation).
——————————————————————
Remove the data
Delete the data
——————————————————————

Now, let's assume that we have loaded a printer-related corpus into the engine.
Such a corpus contains not only the driver-related translations we need,
but also translations related to the printer body (hardware).
As a result, machine translation may output the following translation.

——————————————————————
Remove the data
Take out the data
——————————————————————

"Remove" should have been translated as "delete",
but it was translated with a Japanese word that means "take out".
This is because the corpus contained more sentences describing the printer itself than sentences describing the driver,
so the more frequently used translation, "take out", was output as the translation for "remove".

In this case, loading only a corpus specialized in "printer drivers" may give more accurate results, even if that corpus is small.
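
The effect can be reproduced with a toy model that simply picks the most frequent translation seen for a phrase. This is a minimal sketch, not how a production engine works: real statistical engines combine several probability models, and the corpora, counts, and the function name `most_frequent_translation` here are hypothetical. The English glosses "take out" and "delete" stand in for the Japanese target words.

```python
from collections import Counter

# Hypothetical aligned phrase pairs: (source phrase, gloss of the translation).
hardware_corpus = [("remove", "take out")] * 5  # e.g. "remove the cartridge"
driver_corpus = [("remove", "delete")] * 2      # e.g. "remove the data"


def most_frequent_translation(corpus, source_phrase):
    """Return the translation seen most often for source_phrase, or None."""
    counts = Counter(target for source, target in corpus if source == source_phrase)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Mixed corpus: hardware sentences outnumber driver sentences.
mixed = most_frequent_translation(hardware_corpus + driver_corpus, "remove")
# Specialized corpus: driver sentences only.
specialized = most_frequent_translation(driver_corpus, "remove")

print(mixed)        # take out
print(specialized)  # delete
```

Even though the specialized corpus is smaller, it yields the translation that is correct for a driver manual, because no competing hardware usage outvotes it.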

This is a deliberately simple example, but in many cases a smaller, more specialized corpus gives better quality than a larger one.
Having a corpus of over 200,000 words does not by itself mean that a statistical engine will be usable.

We have heard of a case in which an engine loaded with 8 million words of corpus from a certain field was compared with an engine loaded with 400,000 words of product-specific corpus, and there was no difference in quality.

Therefore, we recommend first checking the quality of a machine translation engine using parallel data you already have, such as your current translation memories.

Using your current data, we can evaluate machine translation quality and calculate the cost reduction rate and return on investment.
Please feel free to contact us for more information.


Blog Writing Team

Tokuda Ai

・As a machine translation consultant, I provide consulting services for Japanese companies on machine translation implementation and process building.
・I place importance on the quality of the source text, which affects translation into every target language, and also provide consulting on writing Japanese manuals that are well suited to machine translation.
・I also give presentations on the following topics related to machine translation.
- Presentation at the 23rd JTF (Japan Translation Federation) Translation Festival in 2013
"Approaches to Machine Translation in Multilingual Environments - From the Perspectives of Evaluation and Process"
- Presentation at the 2014 AAMT (Asia-Pacific Association for Machine Translation) Machine Translation Fair
"Mastering Machine Translation - Improving Quality and Productivity"
