As I have discussed in previous blog posts, improving the accuracy of machine translation also depends on the quality of the original document.
The quality of the original matters not only for the document being translated but also for the "corpus" that is fed into a statistics-based engine.
If sentences are long or contain many complex grammatical structures,
training a statistics-based engine may fail to improve translation accuracy,
and the training process itself can also become lengthy.
This time, we present verification results on processing time, using project data from Japanese-English translation.
●Verification Results Regarding Training Time
Training involves many processes, and among them,
the one that takes the most time is syntactic analysis (parsing), which determines the part of speech and dependency of each word.
Therefore, the longer a sentence is and the more complex its grammatical structure,
the longer this analysis takes.
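Why sentence length has such a strong effect can be illustrated with a rough count. As an assumption not stated in this article, suppose the parser fills a chart the way a textbook CKY parser does: for every span of words it tries every split point inside that span. The function name `cky_split_count` is hypothetical, and real parsers add grammar-dependent costs and pruning, so this only illustrates the growth trend and is not a model of Ckylark's actual running time.

```python
def cky_split_count(n_words: int) -> int:
    """Count the (start, split, end) combinations a textbook CKY chart
    visits for a sentence of n_words words (assumed model, see text)."""
    count = 0
    for span in range(2, n_words + 1):            # width of the span being built
        for _start in range(n_words - span + 1):  # left edge of the span
            count += span - 1                     # split points inside the span
    return count


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(f"{n} words: {cky_split_count(n)} combinations")
```

Under this assumption the count grows roughly with the cube of the sentence length, so doubling a sentence from 20 to 40 words multiplies the work by about eight. A handful of very long sentences can therefore dominate the total parsing time.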
The table below summarizes the parsing times for Japanese-English translation corpora used in actual projects.
It compares the corpus from "Project A," which contained many short sentences, with the corpus from "Project B," which contained many long sentences.
Number of sentences | Project A (Processing Time) | Project B (Processing Time) |
1 sentence | 6.72 seconds | 6.38 seconds |
100 sentences | 15 minutes | 41 minutes |
1000 sentences | 1 minute 10 seconds | 7 minutes 53 seconds |
3000 sentences | 6 minutes 27 seconds | 1 hour 5 minutes |
10,000 sentences | 4 hours 9 minutes | 5 hours 46 minutes |
- Verification Environment
Parser: Ckylark
PC used: iMac
Processor: Core i5
Processor Speed: 2.8GHz
Memory: 12GB 1,333MHz DDR3
As the table shows, for every corpus size beyond a single sentence, B takes far more time than A.
Note that parsing time does not simply scale with the number of sentences.
B contained more long sentences than A, and as the number of sentences increases,
the difference in processing time becomes substantial.
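To compare the two columns on a common scale, the durations in the table can be converted to seconds and B divided by A. The following is a minimal sketch: the helper names `to_seconds` and `compare` are hypothetical, and the values are copied as-is from the table above.

```python
import re

UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def to_seconds(text: str) -> float:
    """Convert a duration such as '1 hour 5 minutes' or '6.72 seconds' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(hour|minute|second)s?", text):
        total += float(value) * UNIT_SECONDS[unit]
    return total


# (number of sentences, Project A, Project B), copied from the table above
TABLE = [
    (1, "6.72 seconds", "6.38 seconds"),
    (100, "15 minutes", "41 minutes"),
    (1000, "1 minute 10 seconds", "7 minutes 53 seconds"),
    (3000, "6 minutes 27 seconds", "1 hour 5 minutes"),
    (10000, "4 hours 9 minutes", "5 hours 46 minutes"),
]


def compare(table):
    """Return (sentences, A seconds, B seconds, B/A ratio) for each row."""
    rows = []
    for n, a_text, b_text in table:
        a, b = to_seconds(a_text), to_seconds(b_text)
        rows.append((n, a, b, b / a))
    return rows


if __name__ == "__main__":
    for n, a, b, ratio in compare(TABLE):
        print(f"{n:>6} sentences: A={a:9.2f}s  B={b:9.2f}s  B/A={ratio:.2f}")
```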
In this way, the length of Japanese sentences significantly affects
the processing time of training.
This verification used at most 10,000 sentences, but in research and development that deals with vast corpora, training often takes one to two weeks.
●Shortening Sentences Reduces Processing Time
To reduce the processing time for training,
we recommend shortening the sentences used in the corpus.
In addition, shorter Japanese sentences allow the engine to
be trained more accurately, leading to higher-quality machine translation.
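As a starting point for shortening, it can help to list which sentences in a corpus are long. The sketch below is only an illustration under stated assumptions: it splits Japanese text at the full stop "。" (a simplification), measures length in characters, and uses a hypothetical threshold `max_chars=50` that should be tuned for each project. The function names are also hypothetical.

```python
import re


def split_sentences(text: str) -> list:
    """Split Japanese text into sentences after each '。' (simplified rule)."""
    pieces = re.split(r"(?<=。)", text)
    return [p.strip() for p in pieces if p.strip()]


def flag_long_sentences(sentences, max_chars: int = 50):
    """Return (index, length, sentence) for every sentence longer than max_chars.

    max_chars is an assumed threshold, not a value taken from this article.
    """
    return [
        (i, len(s), s)
        for i, s in enumerate(sentences)
        if len(s) > max_chars
    ]


if __name__ == "__main__":
    corpus = "短い文です。これは二つ目の文です。"
    for index, length, sentence in flag_long_sentences(split_sentences(corpus), max_chars=10):
        print(f"sentence {index} has {length} characters: {sentence}")
```

The flagged sentences are candidates for manual rewriting into shorter ones before the corpus is used for training.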
●Summary
A statistics-based engine takes time to train,
but simplifying the Japanese sentences in the corpus can reduce the processing time.
Simpler Japanese sentences can also lead to better machine translation.
At Human Science, we offer analysis services for corpora and for the documents to be translated.
We also provide advice on the introduction of machine translation, so please feel free to contact us!
If the form is not available, please send your inquiry by email to hsweb_inquiry@science.co.jp.
Alternatively, please feel free to contact us by phone at TEL: 03-5321-3111.
Thank you.