Universal Translator

From GuardianiUS

This is a chronicle of some of my ongoing personal research and work toward implementing a Universal Translator and my adventures with the great folks at VoxForge.

Background

In late March 2007 my family and I took a trip to Myrtle Beach. Luckily, my work had supplied me with a loaner laptop and a Verizon Wireless cellular broadband access card. While my wife drove the car, I was happily researching various subjects on Wikipedia, as is my custom when I have some spare time, when I suddenly realized that the open source community had bits and pieces of software that could be glued together and improved into a basic Universal Translator. My wife is a grade school teacher and frequently deals with language problems when interacting with Spanish-speaking students. Fascinated, and encouraged by my wife's enthusiasm, I spent every spare moment during the rest of the vacation researching this subject. My research and work toward this goal continued after the vacation. You can find most of what I've learned in the content below.

Software Components

Through my research, I discovered that a working version of the fictional Universal Translator could probably be assembled from the following software components:

Speech Recognition Engine

A Speech Recognition Engine is software that converts spoken language into written text. The concept has been around for nearly as long as computers, and it is featured in many movies, TV shows, and books (e.g. HAL 9000, Star Trek's ship computer). Speech Recognition software has existed for nearly as long. I remember playing with a useful copy of Dragon NaturallySpeaking on my first Pentium Pro 200 MHz around 1995. Before that, I'm pretty sure I bought a DOS speech recognizer for my 486 computer.

Generally speaking, software concepts that old usually have an Open Source implementation. So why isn't Open Source Speech Recognition more common? The answer has to do with how Speech Recognition works. It has a few discrete components:

To understand why Speech Recognition isn't more common in the Open Source community, you have to understand the Acoustic Model and how it is created.

A number of Open Source Speech Recognition Engines exist, and I've attempted to list them below. Of those listed, only Julius has been built with dictation in mind. Dictation is undoubtedly an essential feature for Universal Translation.

Machine Language Translation Engine

A Machine Language Translation Engine is the software that converts text strings in one language into text strings in another language. AltaVista's Babel Fish is probably the first free, publicly available example of such a system. I was made aware of it sometime around the year 2000. I believe it uses SYSTRAN under the hood. Since then many other free online translation services have popped up, with perhaps the most well known being Google's Language Tools.

There are a few Open Source Machine Language Translators out there:

The leader seems to be Moses, which is designed as a drop-in replacement for Pharaoh.

Moses requires high quality Language Models to operate, in the same way that Sphinx and Julius require Acoustic and Language Models to operate.

Text to Speech Engine

Text to Speech, sometimes abbreviated TTS, is software that converts written text into audio. The only Open Source TTS Engine I am currently aware of is Festival. Fortunately for the purpose of Universal Translation, Festival is designed to be multi-lingual. I believe recent versions can actually be trained using an Acoustic Model for more natural voice production, and can use a Language Model for more accurate speech (e.g. saying 'record' as 'ree-cord' or 'rec-ord' based on context). Currently the default Festival voices are rather difficult to understand, but that might be greatly improved with better Open Source acoustic and language models.

Acoustic Model

Shortly after the vacation (see Background), in early April 2007, I discovered VoxForge. I had been trying to figure out why Speech Recognition wasn't more common in the Open Source community. It seemed like all the pieces were there, but few working products had been assembled. I discovered the answer to my question while studying the VoxForge forums and documentation. It's pretty simple, really. The fact is that Speech Recognition software needs an Acoustic Model, or AM, in order to work properly. An acoustic model is basically a statistical model of speech sounds trained from a large collection of transcribed audio. The problem with acoustic models is that you need many different people to speak for you and agree to have their voices recorded. Many acoustic models have been created in the past. After all, Speech Recognition has been around for a few decades at least. But all of these acoustic models are copyrighted, proprietary, closed source, and most certainly not Free. VoxForge aims to change that by creating an Open Source acoustic model.

One of the tools VoxForge uses to collect audio for their Acoustic Model is the VoxForge IVR, which I helped develop.

Language Model

A Language Model is basically a massive collection of sentences, known as a Text Corpus, that has been statistically analyzed and broken down into word and/or phrase units, with each unit being assigned a probability. Useful language models will probably be built from millions of sentences.
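To make the idea of "units with probabilities" concrete, here is a minimal sketch of a bigram (word pair) model built by simple counting. The function name `train_bigram_model` and the three-sentence corpus are made up for illustration; real toolkits work on millions of sentences and apply smoothing, which this sketch leaves out.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(next_word | word) from raw sentences by counting word pairs."""
    first_word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(words, words[1:]):
            first_word_counts[prev] += 1
            pair_counts[(prev, curr)] += 1
    # Probability of a pair = times the pair was seen / times its first word was seen.
    return {
        pair: count / first_word_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy corpus, invented for this example.
corpus = ["this is a test", "this is a house", "that is a test"]
model = train_bigram_model(corpus)
```

With this toy corpus, `model[("this", "is")]` is 1.0 because "this" is always followed by "is", while `model[("a", "test")]` is 2/3 because "a" is followed by "test" in two of its three occurrences.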

Speech Recognition and Text to Speech engines will have one Language Model per language heard or spoken, respectively. In these engines the Language Model is usually used to improve accuracy and reduce CPU requirements. Machine Language Translation engines, by contrast, will have one model per language pair and direction (as far as I can tell, Moses calls this a translation model, or phrase table, and trains it from parallel text). For example, English -> German would be a single model, and German -> English would be another. This model then becomes the translation database. For example, the German phrase 'das ist' can be directly mapped to the English phrase 'this is' through the probability scoring contained in the model. Each of these language pairs has to be trained individually. However, the source for these models, the Text Corpus for each language, can be used to individually train each engine type.
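The 'das ist' to 'this is' mapping can be sketched as a lookup in a table of scored phrase pairs. This is only a toy illustration: `PHRASE_TABLE`, its probabilities, and the `translate` function are invented here, and a real engine such as Moses also reorders phrases and weighs several models together rather than greedily taking the best entry.

```python
# Hypothetical German -> English table; the probabilities are invented.
PHRASE_TABLE = {
    "das ist": [("this is", 0.7), ("that is", 0.3)],
    "ein haus": [("a house", 0.9), ("a home", 0.1)],
}


def translate(sentence, table, max_phrase_len=3):
    """Greedy left-to-right translation: longest known phrase, best-scoring option."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_phrase_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in table:
                best, _score = max(table[phrase], key=lambda option: option[1])
                output.append(best)
                i += n
                break
        else:
            # No phrase matched: pass the unknown word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


result = translate("Das ist ein Haus", PHRASE_TABLE)
```

Here `result` is "this is a house". A separate table would be needed for the English -> German direction, which is why each language pair and direction has to be trained on its own.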

So far I've found two useful Open Source raw Text Corpora:

Work Needed

The three software engines, Speech Recognition, Machine Language Translation, and Text to Speech, exist today. They are not ideal. They are not perfect. But they work and are usable as a starting point. However, in order to use these engines we *must have* high quality Open Source Acoustic Models and Language Models. These are the fuel for the machine. Without these two components we cannot perform useful work.
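How the three engines chain together can be sketched as a simple pipeline. Everything in this block is hypothetical: the `UniversalTranslator` class and the stand-in functions are placeholders that only show the data flow (audio in, text, translated text, audio out), not the actual interfaces of Julius, Moses, or Festival.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UniversalTranslator:
    """Chains the three engines; each field is a placeholder for a real engine."""
    recognize: Callable[[bytes], str]    # speech recognition: audio -> text
    translate: Callable[[str], str]      # machine translation: text -> text
    synthesize: Callable[[str], bytes]   # text to speech: text -> audio

    def run(self, audio_in: bytes) -> bytes:
        source_text = self.recognize(audio_in)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Stand-in engines so the sketch runs: "audio" here is just UTF-8 encoded text.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str) -> str:
    return {"das ist": "this is"}.get(text, text)


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


translator = UniversalTranslator(fake_recognize, fake_translate, fake_synthesize)
audio_out = translator.run(b"das ist")
```

Keeping each stage behind a plain function makes it possible to swap in a better engine, or a better-trained model, without touching the other two stages.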

Acoustic Model Work

The VoxForge project is working on an Open Source Acoustic Model in English. It is slowly growing through volunteer audio submissions. Once it is complete, we will still need parallel acoustic models in other languages. This is currently beyond the scope of the VoxForge project, but it's the next logical progression once their English corpus has been completed.

Language Model Work

Scripts need to be written to format, compile, and train the existing corpora (see Language Model for existing corpora) for each engine (e.g. Julius, Festival, Moses). Although the two existing corpora together offer more than 800,000 sentences, we will probably still need more data for optimal performance and accuracy. There is not currently a formal Open Source project working on an Open Source Language Model, but the folks at VoxForge and I are both interested in the subject. A sister project to VoxForge, designed to expand the existing Open Source multi-lingual Language Model with the help of volunteers, might be a viable solution.
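As a rough idea of what the formatting step of such a script might look like, here is a minimal sketch that cleans raw corpus lines into one normalized sentence per line. The function names and the cleaning rules (lowercasing, stripping punctuation other than apostrophes) are assumptions made for illustration; the exact input format each engine expects differs and would need to be checked against its documentation.

```python
import re


def normalize_sentence(line):
    """Lowercase, replace punctuation (except apostrophes) with spaces, collapse whitespace."""
    line = line.lower()
    line = re.sub(r"[^\w\s']", " ", line)
    return " ".join(line.split())


def prepare_corpus(lines):
    """Return cleaned sentences, dropping lines that end up empty."""
    cleaned = (normalize_sentence(line) for line in lines)
    return [sentence for sentence in cleaned if sentence]


# Toy input, invented for this example.
raw_lines = ["Hello, World!", "   ", "Das ist ein Haus.", "It's a test..."]
sentences = prepare_corpus(raw_lines)
```

The output of a step like this could then be handed to whatever training tool each engine provides.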

I've got a page detailing my efforts to build a language model for Moses.