TechCentral

Tuesday July 19, 2005

Recognising speech with Sphinx-4

BY LOKE KAR SENG



LIKE the Great Sphinx, Sphinx-4 is mute but it doesn't have any secrets to keep. Sphinx-4 is a state-of-the-art speech recognition system written entirely in Java.

It was created as a research tool through a collaboration between Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard.

Sphinx-4 is released under a BSD licence, which places no restrictions on commercial use.

Downloading and installation
Before downloading the Sphinx-4 package, you have to make sure that your computer can record sound with a microphone.

If you use Windows XP, you can test it via the Sounds and Audio Devices panel that can be found in the Control Panel. Choose the Audio tab, then select Properties from the menu, and make sure the microphone option is checked.

Back in the Sounds and Audio Devices window, click on the Voice tab and run the Test Hardware wizard.

On Linux, you need to use the mixer application to enable the microphone. On Fedora Core 4 (FC4), select Volume Control from the Applications/Sound & Video menu.

Click on the Capture tab, and select the microphone icon, then turn the volume control up. You can check to see if it is recording correctly with a recording program, like Audacity.

Since this is a Java application, you will need the J2SE (Java 2 Standard Edition) JRE (Java Runtime Environment) or the JDK, version 1.4 or later. You can get these tools from java.sun.com/j2se/corejava/index.jsp. Sphinx-4 can be downloaded from cmusphinx.sourceforge.net/sphinx4. If you do not want to mess around with the actual engine, you can just download the binary package.

You can build applications around it with the bin package. All you have to do is unzip the package into any suitable directory.

How it works
Speech that comes through the microphone is first encoded according to a suitable acoustic model.

Then it is chopped into pieces only milliseconds long. Each of these pieces has some unique features that can be matched against other pieces stored in a library.

When they match up, we have found something that identifies the piece as belonging to some word. You can imagine that there are a lot of pieces to match up, given that each piece is only part of a phoneme, which itself is only part of a word.
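The chopping step can be pictured with a small sketch. This is not Sphinx-4's own front end; the class name Framer, the 16 kHz sampling rate, and the 25-millisecond frames taken every 10 milliseconds are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits audio samples into short, overlapping frames. */
class Framer {
    /**
     * @param samples   raw audio samples
     * @param frameSize number of samples in each frame
     * @param hop       number of samples to advance between frames
     */
    static List<short[]> split(short[] samples, int frameSize, int hop) {
        List<short[]> frames = new ArrayList<>();
        // Stop as soon as a full frame no longer fits in the remaining audio.
        for (int start = 0; start + frameSize <= samples.length; start += hop) {
            short[] frame = new short[frameSize];
            System.arraycopy(samples, start, frame, 0, frameSize);
            frames.add(frame);
        }
        return frames;
    }

    public static void main(String[] args) {
        // One second of silence at an assumed 16 kHz sampling rate.
        short[] audio = new short[16000];
        // Assumed values: 25 ms frames (400 samples) every 10 ms (160 samples).
        List<short[]> frames = split(audio, 400, 160);
        System.out.println(frames.size() + " frames"); // prints "98 frames"
    }
}
```

Even one second of sound yields close to a hundred pieces here, which gives a feel for how much matching the recogniser has to do.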

To improve accuracy, statistics are collected beforehand in a training process; these indicate how likely each feature is to follow another.

This is rather like knowing that in English, the letter ‘r’ is more likely to follow the letter ‘p’ than ‘l’ is. The statistics are modelled using a mathematical technique called the Hidden Markov Model (HMM). The statistics guide the matching of the unknown spoken word (or phrase) against the stored word (or phrase).
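The letter analogy can be tried out in a few lines. The sketch below counts letter pairs in a tiny, made-up word list and estimates how likely one letter is to follow another; the class name LetterBigrams and the word list are hypothetical, and a real recogniser is trained on acoustic features rather than on letters.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: estimates how likely one letter is to follow another. */
class LetterBigrams {
    /** Counts how often each pair of adjacent letters occurs inside the given words. */
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            for (int i = 0; i + 1 < w.length(); i++) {
                counts.merge(w.substring(i, i + 2), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Estimates P(next | prev): pairs "prev next" divided by all pairs starting with prev. */
    static double probability(Map<String, Integer> counts, char prev, char next) {
        int total = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().charAt(0) == prev) {
                total += e.getValue();
            }
        }
        if (total == 0) {
            return 0.0; // prev was never seen with a following letter
        }
        return counts.getOrDefault("" + prev + next, 0) / (double) total;
    }

    public static void main(String[] args) {
        // A tiny, made-up "training set"; real training uses far more data.
        String[] training = {"print", "press", "proper", "plan", "speech"};
        Map<String, Integer> counts = count(training);
        System.out.println("P(r | p) = " + probability(counts, 'p', 'r'));
        System.out.println("P(l | p) = " + probability(counts, 'p', 'l'));
    }
}
```

With this word list, ‘p’ is followed by ‘r’ three times out of six and by ‘l’ only once, so the estimate favours ‘r’.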

It is like trying to align one long length of text with another so that it gives the best result. The search for the best match is very time-consuming as there are many combinations to check. A common approach is to use dynamic programming, which reuses work that has already been done. The well-known Viterbi algorithm is often used.
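To make the idea concrete, here is a minimal sketch of the Viterbi algorithm on a toy model with two hidden states and three possible observations. It is not taken from the Sphinx-4 source, and every probability in it is made up; in a recogniser the hidden states would correspond to pieces of phonemes and the observations to acoustic features.

```java
import java.util.Arrays;

/** Illustrative only: finds the most likely hidden-state sequence for a toy HMM. */
class Viterbi {
    /**
     * @param obs   observed symbols, as indices into the columns of emit
     * @param start start[s] = probability of beginning in state s
     * @param trans trans[p][s] = probability of moving from state p to state s
     * @param emit  emit[s][o] = probability that state s produces observation o
     */
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = start.length;
        int len = obs.length;
        if (len == 0) {
            return new int[0];
        }
        double[][] best = new double[len][n]; // best score of any path ending in state s at time t
        int[][] back = new int[len][n];       // previous state on that best path
        for (int s = 0; s < n; s++) {
            best[0][s] = start[s] * emit[s][obs[0]];
        }
        for (int t = 1; t < len; t++) {
            for (int s = 0; s < n; s++) {
                best[t][s] = -1.0;
                for (int p = 0; p < n; p++) {
                    // Reuse the best score already computed for time t - 1.
                    double score = best[t - 1][p] * trans[p][s] * emit[s][obs[t]];
                    if (score > best[t][s]) {
                        best[t][s] = score;
                        back[t][s] = p;
                    }
                }
            }
        }
        // Pick the best final state, then walk the back-pointers to recover the path.
        int last = 0;
        for (int s = 1; s < n; s++) {
            if (best[len - 1][s] > best[len - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[len];
        path[len - 1] = last;
        for (int t = len - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // All numbers below are made up for illustration.
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        int[] path = decode(new int[] {0, 1, 2}, start, trans, emit);
        System.out.println(Arrays.toString(path)); // prints "[0, 0, 1]"
    }
}
```

The saving comes from the table of best scores: each time step looks only at the previous step's scores instead of re-examining every complete path. The sketch multiplies raw probabilities to stay readable; over long inputs these become vanishingly small, so practical implementations usually add log-probabilities instead.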

Running the demo
First, you need to enable the JSAPI (Java Speech API) by accepting its licence. On Windows, run the lib/jsapi.exe program; on Linux, run the lib/jsapi.sh script.

There are a number of demos in the package, all to be run from the command line. Try the Hello Digits demo, a command line program that recognises spoken connected digits.

This means that you can say the digits continuously without pausing. Here’s the command: java -jar bin/HelloDigits.jar

This actually works quite well with a recognition latency of one to three seconds on FC4.

Surprisingly, the same program runs significantly slower under Windows XP on the same dual-boot machine. It sometimes takes in excess of 10 seconds to do the same thing. Perhaps this is a software issue, but it makes the demo rather unusable.

The ZipCity demo attempts to locate a US city from a spoken five-digit zip code. When you run it, a map of the US is shown.

As each digit is recognised, the cities whose zip codes match the digits recognised so far are lit up, until only one match remains.

This works rather well on both Windows XP and FC4.
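The narrowing-down behaviour can be imitated with a simple prefix filter. This is only a guess at the idea, not the demo's actual code; the class name ZipNarrower and the four-entry table are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: narrows a list of cities as zip code digits arrive. */
class ZipNarrower {
    /** Returns the cities whose zip code starts with the digits recognised so far. */
    static List<String> matches(Map<String, String> zipToCity, String digitsSoFar) {
        List<String> cities = new ArrayList<>();
        for (Map.Entry<String, String> e : zipToCity.entrySet()) {
            if (e.getKey().startsWith(digitsSoFar)) {
                cities.add(e.getValue());
            }
        }
        return cities;
    }

    public static void main(String[] args) {
        // A tiny sample table; a real application would load a full zip code database.
        Map<String, String> zips = new LinkedHashMap<>();
        zips.put("15213", "Pittsburgh");
        zips.put("10001", "New York");
        zips.put("19104", "Philadelphia");
        zips.put("90210", "Beverly Hills");

        // Pretend the recogniser has heard "one", then "five".
        System.out.println(matches(zips, "1"));
        System.out.println(matches(zips, "15"));
    }
}
```

After the first digit, three of the four sample cities are still candidates; after the second, only Pittsburgh remains.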

The HelloWorld demo recognises a limited set of simple phrases, like “Hello Paul,” “Good morning Phillip,” and so on. It also works well if the words are articulated in the manner the program expects. With the large-vocabulary N-Gram demo, however, the program had difficulty recognising the correct words.

This is to be expected because the data was trained using spoken North American English, which differs somewhat from British English.

In order for the application to be able to recognise words in an acceptable manner, it has to be trained with locally spoken data.

This can be done through SphinxTrain, but it is easier said than done. At the moment, documentation on how to do this is rather sparse, although it would really be an interesting thing to do.

Perhaps someone can step up to the task in the spirit of the open-source movement. And because there is nothing in the engine itself that is inherently language specific, a Bahasa Malaysia application would be a great thing to have too.

Applications and uses
The beauty of the Sphinx-4 system is its architectural design. Other open-source recognisers are available, but Sphinx-4 has a nicely documented modular object-oriented framework that allows plug-in classes.

Another highlight is that Sphinx-4 has already done all the hard work by implementing a state-of-the-art algorithm.

And of course, being written in Java, it can run without modification on many systems, such as Solaris, Linux, Windows XP, and Mac OS X.

Sphinx-4 has found uses in universities. McGill University uses Sphinx-4 in its Automated Door Attendant system (ah, shades of The Hitchhiker’s Guide to the Galaxy?), which allows visitors to leave video messages, schedule appointments, or review web-based documents and demos. The MIT Media Lab uses it for robotics research.

Unfortunately, Sphinx-4 does not yet support speaker adaptation or dictation. Its accuracy and speed on large vocabularies are also currently poor compared to commercial products.

Some comparisons between Sphinx-4 and commercial engines have shown, however, that with equivalent training data Sphinx-4 performs as well as the commercial engines.

According to its developers, Sphinx-4 has most of the hooks in place in the engine for this, but the final touches are not there yet. Rumours of Java being slow are also dispelled, because Sphinx-4 runs better and faster than the previous versions of Sphinx, which were written in C.

Conclusion
With the availability of localised training data, Sphinx-4 can be a useful system, for example for hands-free computing.

If you look at the demo source code provided, you will find it fairly easy to write programs and integrate the recogniser into your own applications. This is an amazing package for those who want to get their feet wet with state-of-the-art technology.

·The writer can be reached at Loke.Kar.Seng@infotech.monash.edu.my
