I am trying to consolidate the different types of language and interaction tools required to make computers better at understanding and interacting in human languages. I have captured them as a mind map in XMind. This will be a living document, and I welcome any suggestions on tools or technologies that I have missed.
Interestingly, most of these technologies, such as NLP, speech recognition and speech translation, are still at a very nascent stage. Sadly, there is hardly any research being done for Indian (Indic) languages like Hindi and Tamil. Some technologies are relevant only to Indic languages, for example symbol translation. A language like Tamil, owing to its ancient nature, has used different scripts at different stages of its evolution, namely the Tamil-Brahmi script, Vatteluttu (வட்டெழுத்து) and the modern script. Symbol translators convert text from an ancient script to the modern script of the same language.
Interacting with computers in one's native language brings the technology closer to the masses. There are many ways to type in native languages. On Windows, Microsoft Input Method Editors (IMEs) can be used to type in non-Latin scripts such as Indic or CJK. A newer alternative is the Google IMEs, though they support only transliteration. On Linux there are several alternatives for typing in non-Latin scripts, such as SCIM, XIM and uim. SCIM was the most popular IME on Linux until recently. However, SCIM is older and has its own disadvantages, so a newer architecture called IBus was developed.
The Intelligent Input Bus (IBus, pronounced as I-Bus) is an input method (IM) framework for multilingual input in Unix-like operating systems. It’s called “Bus” because it has a bus-like architecture.
The latest Linux releases, including Ubuntu 11.04, come with IBus installed. Below are the steps to configure Indic languages like Tamil, Hindi and Kannada on a KDE or GNOME desktop on Ubuntu Linux 11.04.
Install IBus if it is not already there: open a terminal and type the following commands. Alternatively, you can select these packages from the Synaptic package manager on (K)Ubuntu.
sudo apt-get install ibus
sudo apt-get install ibus-m17n # this package contains tables for Indic languages
sudo apt-get install ibus-qt4 # if you are using the KDE desktop
sudo apt-get install ibus-gtk # if you are using the GNOME desktop
sudo apt-get install im-config
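The desktop-specific choice in the commands above can be folded into a small helper. This is only a sketch: the function name pick_ibus_packages and its kde/gnome argument are made up for illustration, and the package names are simply the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the package list for a given desktop.
# The package names come from the commands in this post; the function
# name and the "kde"/"gnome" argument are assumptions for illustration.
pick_ibus_packages() {
    common="ibus ibus-m17n im-config"
    case "$1" in
        kde)   echo "$common ibus-qt4" ;;   # Qt input module for KDE
        gnome) echo "$common ibus-gtk" ;;   # GTK input module for GNOME
        *)     echo "usage: pick_ibus_packages kde|gnome" >&2
               return 1 ;;
    esac
}
```

With this defined, something like sudo apt-get install $(pick_ibus_packages kde) should install everything in one go.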
Now run im-config from the command line or from your favorite app launcher. Select ibus as the input method and confirm the pop-up dialogs that follow.
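If you prefer to stay in the terminal, the same selection can probably be made without the dialog. The sketch below assumes that your im-config version accepts the -n option to set the input method by name (check man im-config on your release); the wrapper function select_ibus is my own naming.

```shell
#!/bin/sh
# Hypothetical wrapper: make ibus the active input method without the dialog.
# Assumes im-config supports "-n <name>"; verify with "man im-config".
select_ibus() {
    if command -v im-config >/dev/null 2>&1; then
        im-config -n ibus && echo "ibus selected; log out and back in to apply"
    else
        echo "im-config is not installed" >&2
        return 1
    fi
}
```

Calling select_ibus as your normal user (not root) should write the choice to your own configuration.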
Restart the PC and log in to your desktop. You should see a keyboard icon in the taskbar. If not, type ‘ibus’ in the terminal and press Enter. You can now add the input methods you want by right-clicking the icon and selecting Preferences.
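The check for a missing keyboard icon can also be scripted. This is a minimal sketch, assuming the daemon binary is called ibus-daemon and that pgrep is available; the function name ensure_ibus_running is hypothetical. The -d, -r and -x flags ask the daemon to run in the background, replace any old instance and start its XIM server.

```shell
#!/bin/sh
# Hypothetical helper: start the IBus daemon only if it is not running yet.
# Assumes the daemon binary is named "ibus-daemon".
ensure_ibus_running() {
    if pgrep -x ibus-daemon >/dev/null 2>&1; then
        echo "ibus-daemon is already running"
    elif command -v ibus-daemon >/dev/null 2>&1; then
        # -d: daemonize, -r: replace an old instance, -x: enable XIM server
        ibus-daemon -d -r -x
        echo "ibus-daemon started"
    else
        echo "ibus-daemon is not installed"
        return 1
    fi
}
```

Run ensure_ibus_running from a terminal inside your desktop session so the daemon attaches to the right display.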
Now press Ctrl+Space to enable the IME, select the language you want to type in, and start typing. ibus-m17n supports transliteration for a few Indic languages, such as Tamil and Hindi.
I had been waiting for this good news from Google for a long time. Finally, it’s here. Google Translate now supports five Indian languages: Bengali, Gujarati, Kannada, Tamil and Telugu. Mind you, this is still an alpha release.
Beginning today, you can explore the linguistic diversity of the Indian sub-continent with Google Translate, which now supports five new experimental alpha languages: Bengali, Gujarati, Kannada, Tamil and Telugu. In India and Bangladesh alone, more than 500 million people speak these five languages. Since 2009, we’ve launched a total of 11 alpha languages, bringing the current number of languages supported by Google Translate to 63.
More details can be found in the post by Ashish Venugopal, Research Scientist, Google. Curious to check the accuracy of the translation, I tried translating the above quote from the Google blog into Tamil.
I would say the accuracy is almost 70%. That is certainly not bad for an alpha release, and Google’s statistical machine translation approach should help improve the accuracy as more web content becomes available in these languages. Now I am waiting for the update to Google Translate for Android.