Tuesday, October 23, 2007

Machine translation has a long way to go.

Recently, Google switched from the Systran translation system to its own. There is an interesting article that compares the two by translating between English and German.

Honestly, I'm quite pessimistic about machine translation. I have never had a good experience with any translator. Out of curiosity, I compared Google and Systran by translating an English snippet into Chinese. It turns out that neither of them did a good job and nobody could understand their output, though Google did slightly better. Everyone knows machine translation is a hard problem, and so do I. But judging from these results, I think they could have done much better. They just didn't put in enough effort.

Let's take a look at the translation results.

English to Chinese Translation Comparison

English snippet:

Apple infuriated developers and some iPhone users when it issued a software update September 27 that disabled unofficial programs installed on the handsets.

Until Wednesday, Apple had tried to control which applications consumers had on their iPhones.

Now, Jobs said the company intends to release a software development kit in February that will let coders create applications to work directly on the iPhone and the iPod Touch. The Touch is the new iPod portable player that resembles the iPhone but lacks the function of a cell phone.

"We are excited about creating a vibrant third-party developer community around the iPhone and enabling hundreds of new applications for our users," Jobs said in the posting.

Google translation:

苹果激怒了开发商和一些iphone的用户时,它发出的一个软件更新9月27日伤残非官方程式安装于手机。

直到周三,苹果公司曾试图控制哪些申请消费者对它们iphones 。

现在,乔布斯表示,该公司打算发布一个软件开发工具包,在2月,可以让编码员创造应用工作,直接就对iphone与ipod的触摸。触摸是新的ipod便携式播放相似iphone的,但欠缺的功能手机。

"我们很高兴看到,创造一个充满活力的第三方开发社区周围iphone的和有利的数百名新申请用户, "乔布斯说,在张贴。

Systran translation:

苹果计算机公司激怒了开发商和一些iPhone 用户当它发布了残疾非官方的节目安装在手机的软件更新9月27 日。

直到星期三, 应用消费者有在他们的iPhones 的苹果计算机公司设法控制。

现在, 工作认为公司打算发布软件开发成套工具在将让编码人创造应用工作直接地在iPhone 和iPod 接触的2月。接触是类似iPhone 的新iPod 便携式的球员但缺乏手机的作用。

"我们被激发关于创造一个充满活力的第三方开发商社区在iPhone 附近和使能上百新应用为我们的用户," 工作说在0N 投稿。

In the first paragraph, both Google and Systran translate "disabled" as an adjective, as in "a disabled veteran", which messes up the meaning of the whole sentence, because Chinese uses different words for "disabled" in "a disabled veteran" and for "disabled" in "disabled a feature". Verbs are usually very important for understanding a sentence, so they should be handled very carefully. With better NLP techniques, it is possible to tell that "disabled" here is a verb whose object is "programs", and then it can be translated much better. However, Systran handles the adverbial clause of time, "when it issued a ...", better. A Chinese reader will misinterpret Google's translation as "When Apple infuriated developers and some iPhone users, it issued a software update ...".
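
To make the "disabled" point concrete, here is a toy sketch of picking a translation by part of speech. The lexicon entries, the tag names and the `translate_word` helper are all my own illustrative assumptions; this is not how Google or Systran work internally, and a real system would get the tag from a tagger or parser rather than by hand.

```python
# Toy lexicon keyed by (word, part of speech). The entries and tag names
# are illustrative assumptions, not data from Google or Systran.
LEXICON = {
    ("disabled", "ADJ"): "残疾的",   # as in "a disabled veteran"
    ("disabled", "VERB"): "禁用了",  # as in "disabled a program"
}


def translate_word(word, pos):
    """Pick a translation using the part of speech as context.

    Unknown (word, pos) pairs are returned unchanged.
    """
    return LEXICON.get((word.lower(), pos), word)


# The tag is supplied by hand here; a tagger or parser would provide it.
print(translate_word("disabled", "VERB"))
print(translate_word("disabled", "ADJ"))
```

Even this crude lookup separates the two senses once the part of speech is known, which is the piece both translators seem to have missed.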

It gets even worse in the second paragraph. Neither translation makes any sense at all. The problems are the word order and the meaning of "applications", which both systems translate as "the act of applying" instead of "computer software".

In the third paragraph, Google did a better job. It recognized "Jobs" as the name of Steve Jobs, while Systran translated "Jobs" as "job" in "look for a job". Simply from the fact that the "J" is capitalized, Systran should have done better. Systran also translates "player" as in "football player"... Moreover, Google did a slightly better job on word order, but the result is still very hard to understand.

In the last paragraph, Google did a perfect job of translating "We are excited about" into a beautiful Chinese phrase, but Systran thought it meant "we are inspired". I think Google benefits from a huge collection of common phrases, even whole sentences, paired with accurate translations. I have actually used Google's translator a lot for phrases and short sentences.

To sum up, the two biggest problems I saw are:
(1) how to figure out the meaning of a word in a particular context and pick the accurate translation in the target language.
(2) how to arrange the translated words in the order the target language naturally uses. (more difficult)

Both Google and Systran try to leverage huge dictionaries, NLP knowledge and statistical models, but it seems to me that in most cases they are still translating text simply word by word. The translated text is very hard to understand, and sometimes even misleading. So I would say they really have a long, long way to go before people can understand their translations.

Wednesday, October 17, 2007

Google is on its way to a to-do list.

I have been waiting for a to-do list from Google for quite a long time. Not surprisingly, so have many others. Fortunately, we won't have to wait much longer. Google is "working to add our special Google secret sauce to the to-do lists space".

Friday, October 05, 2007

Updated SOSP/OSDI HOF with SOSP07 papers.

I just updated the SOSP/OSDI Hall of Fame with the SOSP 07 papers.

Check it out!

Thursday, October 04, 2007

Two more things I want from Google Docs.

I have been using Google Docs since its debut. It is a great product, but I wouldn't say it is perfect for me because it is missing two things: a to-do list and a wiki. Now at least one of them is solved by "Remember The Milk" (RTM), at least for me :-)

As everyone knows, a calendar is not a good replacement for a to-do list. A calendar is more about "having to do something during a period of time". On the other hand, a to-do list organizes things that should be done, but not necessarily during a particular period of time. I have some expectations for a to-do list. First of all, it should be on-line and easy to use, like everything Google does. It should be integrated with common calendar tools, because I don't want to open another page for the to-do list. Of course, it should have the full functionality of a to-do list, such as prioritizing tasks, organizing tasks into groups and so on. Given all of the above, RTM is a perfect fit for me!

RTM is an on-line tool. It can also run in an off-line mode, so you are able to manage your tasks anywhere you are and any time you want.

RTM is easy to use and powerful. It has a clear and intuitive interface that anyone can master in five minutes. You can organize tasks by priority and put them into different groups. You can also add notes and tags to a task, assign an estimated time to it, keep track of how many times you have postponed it, share tasks with others, and a lot more.

More importantly, RTM does a wonderful job of working with other tools. It uses a Google Calendar widget to put the to-do list into Google Cal, which is extremely convenient for me. It is quite easy to add RTM to Google Cal. After that, when you open Google Cal, there will be a small icon for each day. Clicking it pops up a window that displays the tasks for that day, where you can create and modify tasks. I don't know if you can see it in GCal on the iPhone. RTM can also communicate with other tools through the standard iCalendar format. Their new MilkSync, which is not free though, can synchronize the RTM to-do list with Windows Mobile devices. What else can you expect?

Well, it is time to forget about your sticky notes and try RTM.
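
For readers who have not seen iCalendar, here is a rough sketch of what a single task looks like in that format. The `make_vtodo` helper, the UID and the PRODID value are made up for illustration, and I am not claiming this is exactly what RTM emits. It also skips the escaping that iCalendar requires for commas and semicolons in text values.

```python
from datetime import date, datetime, timezone


def make_vtodo(uid, summary, priority, due):
    """Render one task as an iCalendar VTODO inside a VCALENDAR wrapper."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//todo sketch//EN",  # made-up product identifier
        "BEGIN:VTODO",
        f"UID:{uid}",
        f"DTSTAMP:{stamp}",
        f"SUMMARY:{summary}",
        f"PRIORITY:{priority}",  # 1 is highest, 9 is lowest
        f"DUE;VALUE=DATE:{due.strftime('%Y%m%d')}",
        "END:VTODO",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"


print(make_vtodo("task-1@example.com", "Buy milk", 1, date(2007, 10, 31)))
```

Any tool that understands VTODO entries can, in principle, read a task written this way, which is what makes the format useful for exchanging to-do lists.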

As for the wiki, Google bought JotSpot a year ago and planned to launch a wiki this September, but it still doesn't seem to be ready, in either Google Docs or Google Apps. But I believe it is worth waiting for. A wiki is definitely a good way to organize notes, even for a single user. So ... my eyes are peeled!

Tuesday, October 02, 2007

Mercurial - not "hg"

After a long debate over this list, we decided to use Mercurial as the new revision control system.

Personally, I neglected to look for a better revision control system for a long time. CVS has been my choice ever since I started using revision control, although it has been criticized for many things, such as poor support for binary files, a complicated structure, no atomic commits and so on. However, you know, people just don't want to change, which is a bad habit.

Mercurial was a breath of fresh air, and I couldn't help going through its manual. It is cool!

First things first: Mercurial is distributed, which means there is essentially no "centralized" repository. Everyone can have a local repository and do development in it. You can freely create tags, branches and unstable commits, and even "push" your temporary changes to other people without polluting the main repository. After everything is done and tested, you "push" your changes into the main repository. This is extremely convenient for distributed development.

Mercurial is easy. It is easy to learn and has a clear structure. I especially love the idea of the changeset, whose ID serves both identification and integrity checking.
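
Here is a minimal sketch of how one hash can serve both purposes, assuming the ID is a SHA-1 over the parent IDs followed by the content. The names `node_id` and `verify` are hypothetical, and this is a simplification rather than Mercurial's actual code.

```python
import hashlib

NULL_ID = b"\0" * 20  # stand-in parent ID for a revision with no parent


def node_id(text, p1=NULL_ID, p2=NULL_ID):
    """Hash the two parent IDs (in sorted order) followed by the content."""
    a, b = sorted([p1, p2])
    return hashlib.sha1(a + b + text).digest()


def verify(text, p1, p2, expected):
    """Integrity check: recompute the hash and compare it with the ID."""
    return node_id(text, p1, p2) == expected


root = node_id(b"first version\n")
child = node_id(b"second version\n", p1=root)

print(child.hex())                                       # the ID names the revision
print(verify(b"second version\n", root, NULL_ID, child))
print(verify(b"tampered\n", root, NULL_ID, child))
```

Because the parents are part of the hash, the same content committed on top of a different history gets a different ID, and any corruption of stored content shows up as a mismatch.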

Mercurial is fast. Its key parts are written in C, though most of it is written in Python. It borrows the idea of I and P frames from video compression. For each revision, it stores only the delta instead of a complete copy of the file, to save space. After a period of time, it stores a full snapshot to keep retrieval fast. It also uses "copy-on-write" to save space when cloning a repository.
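
The delta-plus-snapshot idea can be sketched in a few lines. This is only an illustration under my own assumptions: the fixed `SNAPSHOT_EVERY` interval, the `RevisionLog` class and the copy/insert delta format are invented here and are not Mercurial's real storage format or snapshot policy.

```python
import difflib

SNAPSHOT_EVERY = 4  # assumed fixed interval, chosen only for illustration


def make_delta(old, new):
    """Describe `new` as copy/insert operations against `old` (lists of lines)."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))
        # "delete" needs no entry: the old lines are simply not copied
    return ops


def apply_delta(old, ops):
    """Rebuild the new revision from the old one and a delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class RevisionLog:
    def __init__(self):
        self.entries = []  # each entry is ("full", lines) or ("delta", ops)

    def add(self, lines):
        if len(self.entries) % SNAPSHOT_EVERY == 0:
            self.entries.append(("full", list(lines)))  # like an I frame
        else:
            prev = self.get(len(self.entries) - 1)
            self.entries.append(("delta", make_delta(prev, lines)))  # like a P frame

    def get(self, rev):
        # Walk back to the nearest full snapshot, then replay deltas forward.
        base = rev
        while self.entries[base][0] != "full":
            base -= 1
        lines = list(self.entries[base][1])
        for _kind, ops in self.entries[base + 1:rev + 1]:
            lines = apply_delta(lines, ops)
        return lines
```

The snapshots bound how many deltas have to be replayed to rebuild any revision, which is the trade between space and retrieval speed described above.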

Mercurial makes commits atomic. It uses a simple trick to achieve this: all writes follow the same order, first the real file changes, then the manifest data and finally the change log. So there will never be a change log entry that refers to partial manifest data or file changes.
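
The write-ordering trick can be illustrated with a small in-memory model. The `Repo` class, its three lists and the `fail_before_changelog` switch are hypothetical stand-ins I made up for this sketch; real Mercurial writes to files on disk and has its own recovery logic.

```python
class Repo:
    """In-memory stand-in for three append-only stores (hypothetical layout)."""

    def __init__(self):
        self.filelog = []    # file contents
        self.manifest = []   # one list of filelog indexes per commit attempt
        self.changelog = []  # one (message, manifest index) per finished commit

    def commit(self, message, files, fail_before_changelog=False):
        # 1. Write the file data first.
        file_ids = []
        for data in files:
            self.filelog.append(data)
            file_ids.append(len(self.filelog) - 1)
        # 2. Then write the manifest that points at the file data.
        self.manifest.append(file_ids)
        manifest_id = len(self.manifest) - 1
        if fail_before_changelog:
            raise RuntimeError("simulated crash")
        # 3. The change log entry goes last and makes the commit visible.
        self.changelog.append((message, manifest_id))

    def read_commit(self, rev):
        # Readers always start from the change log, so they never reach
        # data from a commit whose change log entry was not written.
        message, manifest_id = self.changelog[rev]
        return message, [self.filelog[i] for i in self.manifest[manifest_id]]


repo = Repo()
repo.commit("first", ["hello"])
try:
    repo.commit("second", ["half written"], fail_before_changelog=True)
except RuntimeError:
    pass
print(len(repo.changelog))   # the interrupted commit is not visible
print(repo.read_commit(0))
```

In this model the interrupted commit leaves some orphaned data behind, but since nothing in the change log points to it, readers never see a half-finished commit.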

Mercurial has a queue. MQ is a powerful Mercurial extension for solving patch management problems. It outperforms previous tools because MQ is aware of revisions.

To sum up, Mercurial looks very powerful and natural. If you are still using CVS, try Mercurial and you're gonna like it.

For those who are new to Mercurial, I would recommend "Distributed revision control with Mercurial" as your first read.