Tuesday, October 23, 2007

Machine translation has a long way to go.

Recently, Google switched from Systran's translation system to its own. There is an interesting article that compares the two by translating between English and German.

Honestly, I'm quite pessimistic about machine translation. I have never had a good experience with any translator. Just out of curiosity, I compared Google and Systran by translating an English snippet into Chinese. It turns out that neither of them did a good job and nobody could understand their results, though Google did slightly better. Everyone knows machine translation is a hard problem, and so do I. But judging by their results, I think they could have done much better. They just didn't put in enough effort.

Let's take a look at the translation result.

English to Chinese Translation Comparison

English snippet:

Apple infuriated developers and some iPhone users when it issued a software update September 27 that disabled unofficial programs installed on the handsets.

Until Wednesday, Apple had tried to control which applications consumers had on their iPhones.

Now, Jobs said the company intends to release a software development kit in February that will let coders create applications to work directly on the iPhone and the iPod Touch. The Touch is the new iPod portable player that resembles the iPhone but lacks the function of a cell phone.

"We are excited about creating a vibrant third-party developer community around the iPhone and enabling hundreds of new applications for our users," Jobs said in the posting.

Google Translator:

苹果激怒了开发商和一些iphone的用户时,它发出的一个软件更新9月27日伤残非官方程式安装于手机。

直到周三,苹果公司曾试图控制哪些申请消费者对它们iphones 。

现在,乔布斯表示,该公司打算发布一个软件开发工具包,在2月,可以让编码员创造应用工作,直接就对iphone与ipod的触摸。触摸是新的ipod便携式播放相似iphone的,但欠缺的功能手机。

"我们很高兴看到,创造一个充满活力的第三方开发社区周围iphone的和有利的数百名新申请用户, "乔布斯说,在张贴。

Systran translation:

苹果计算机公司激怒了开发商和一些iPhone 用户当它发布了残疾非官方的节目安装在手机的软件更新9月27 日。

直到星期三, 应用消费者有在他们的iPhones 的苹果计算机公司设法控制。

现在, 工作认为公司打算发布软件开发成套工具在将让编码人创造应用工作直接地在iPhone 和iPod 接触的2月。接触是类似iPhone 的新iPod 便携式的球员但缺乏手机的作用。

"我们被激发关于创造一个充满活力的第三方开发商社区在iPhone 附近和使能上百新应用为我们的用户," 工作说在0N 投稿。

In the first paragraph, both Google and Systran translate "disabled" as an adjective, as in "a disabled veteran", which garbles the meaning of the whole sentence, because Chinese uses different words for "disabled" in "a disabled veteran" and for "disabled" in "disabled a function". As is well known, verbs are usually very important for understanding a sentence, so they should be handled very carefully. In fact, better NLP technology could determine that "disabled" here is a verb whose object is "programs", and then it could be translated much better. However, Systran handles the adverbial clause of time, "when it issued a ...", better. A Chinese reader will misread Google's translation as "When Apple infuriated developers and some iPhone users, it issued a software update ...".
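To make the "disabled" point concrete, here is a minimal Python sketch of choosing a Chinese word from the grammatical role of the English one. It is only a toy: the function name, the two word lists and the fallback rule are assumptions made for illustration, and a real system would rely on a proper part-of-speech tagger or parser rather than a look at the previous word.

```python
# Toy sense chooser for the English word "disabled".
# The word lists and rules below are illustrative assumptions,
# not how Google or Systran actually work.
RELATIVE_PRONOUNS = {"that", "which", "who"}
DETERMINERS = {"a", "an", "the", "this", "these", "those"}


def translate_disabled(tokens, index):
    """Pick a Chinese rendering for tokens[index], assumed to be "disabled"."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    if previous in RELATIVE_PRONOUNS:
        # "... an update that disabled programs": a verb, "to turn off"
        return "禁用"
    if previous in DETERMINERS:
        # "a disabled veteran": an adjective describing a person
        return "残疾"
    # Assumed fallback: treat unknown contexts as the verb sense
    return "禁用"


update = "a software update that disabled unofficial programs".split()
veteran = "a disabled veteran".split()
print(translate_disabled(update, update.index("disabled")))
print(translate_disabled(veteran, veteran.index("disabled")))
```

Even a rule this crude separates the two cases in the snippet, which suggests that the verb reading was within reach of both systems.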

It gets even worse in the second paragraph. The translations don't make any sense at all. The problems are the word order and the meaning of "applications", which both systems translate as "the act of applying" instead of "computer software".

In the third paragraph, Google did a better job. It recognized "Jobs" as the name of Steve Jobs, but Systran translated "Jobs" as "job" in "look for a job". From the capital "J" alone, Systran should have done better. Systran also translates "player" as in "football player"... Moreover, Google did a slightly better job on word order, but still, the result is very hard to understand.
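The capital-letter hint can be sketched in a few lines of Python. This is a hypothetical heuristic, not a description of either system: the two small dictionaries and the function name are made up for the example, and the choice to distrust capitals at the start of a sentence is an assumption.

```python
# Toy proper-noun check based on capitalization.
# Both dictionaries are tiny illustrative stand-ins for real lexicons.
NAME_TRANSLATIONS = {"Jobs": "乔布斯"}   # known surnames
COMMON_TRANSLATIONS = {"jobs": "工作"}   # ordinary dictionary entries


def translate_token(token, is_sentence_initial):
    """Prefer the name reading when a known name is capitalized mid-sentence."""
    looks_like_name = token[:1].isupper() and not is_sentence_initial
    if looks_like_name and token in NAME_TRANSLATIONS:
        return NAME_TRANSLATIONS[token]
    # Sentence-initial capitals carry no information, so fall back
    # to the ordinary dictionary (an assumption of this sketch).
    return COMMON_TRANSLATIONS.get(token.lower(), token)


# "Now, Jobs said ...": "Jobs" is capitalized but not sentence-initial
print(translate_token("Jobs", is_sentence_initial=False))
print(translate_token("jobs", is_sentence_initial=False))
```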

In the last paragraph, Google did a perfect job of translating "We are excited about" into a beautiful Chinese sentence, but Systran thought it meant "we are inspired". I think Google benefits from a huge collection of common phrases, even whole sentences, paired with accurate translations. I have actually used Google Translation a lot for phrases and short sentences.
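My guess about a phrase collection can be illustrated with a greedy longest-match lookup. The sketch below is only an assumption about how such a table might be used: the table entries, the function name and the `max_len` parameter are invented for the example, and real statistical systems score many candidate phrases instead of taking the first match.

```python
# Toy phrase table: longer entries capture idiomatic translations
# that word-by-word lookup would miss. Entries are illustrative only.
PHRASE_TABLE = {
    ("we", "are", "excited", "about"): "我们很高兴",
    ("excited",): "激动",
    ("we",): "我们",
}


def translate_greedy(tokens, table, max_len=4):
    """Translate left to right, always taking the longest known phrase."""
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in table:
                output.append(table[key])
                i += n
                break
        else:
            # Unknown word: pass it through untranslated
            output.append(tokens[i])
            i += 1
    return output


print(translate_greedy("We are excited about".split(), PHRASE_TABLE))
print(translate_greedy("We excited".split(), PHRASE_TABLE))
```

When the whole phrase is in the table it is translated as one unit. When it is not, the lookup falls back to single words, which is roughly the word-by-word behaviour seen in the rest of the output.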

To sum up, the two biggest problems I saw are:
(1) how to figure out the meaning of a word in its particular context and choose the accurate translation in the target language.
(2) how to arrange the translated words in the order that is natural in the target language. (more difficult)

Both Google and Systran try to leverage huge dictionaries, NLP knowledge and statistical models, but it seems to me that in most cases they are still translating text simply word by word. The translated text is very hard to understand, and even misleading. So I would say they really have a long, long way to go before people can understand their translations.
