Searching for other possible solutions, I found documentation for speech recognition engines from private companies, big ones like IBM or Nuance, that usually ship their products with a developer's API. These engines usually implement an interface known as SAPI (Speech Application Programming Interface), developed by Microsoft to provide speech recognition capabilities to Windows applications. Along the same lines there is a JSAPI specification for the Java programming language. Microsoft not only developed the SAPI specification but also includes its own speech recognizer with some versions of the Windows operating system. So I downloaded the SAPI Software Development Kit and wrote a simple command line utility that reads a raw PCM wav file and outputs a text file with the transcription of what was said in the audio. Results were not great, mainly because the recognition engine is not very speaker independent and some of the audio files I tried had noise or music in the background.
This hacking went beyond modifying one of the SDK code examples to write the command line utility: I also wrote a JNI (Java Native Interface) wrapper to use the recognition engine from Java, although I have to stress that I did it mostly as an exercise, because it is still limited to reading from a wav file. Of course this will only work on Windows, but portability is something you will have to give up this time if you want all the things I mentioned at the beginning of this post. I'm including a link here to the command line utility with its source code, the Java interface, and one example of using the interface from Java.
One limitation of these tools is that the wav file must be a raw PCM audio file with a 22 kHz sampling rate, 16 bits per sample and stereo sound. To get a file into that shape I recommend SoX, an open source library and command line tool that can change any of those parameters of a wav file. It would also be great to be able to feed in mp3 files or even video files. The problem with mp3 is that, due to patents, there are not many conversion solutions that work out of the box. You will have to download and compile LAME and integrate this encoder with SoX in order to get SoX to convert raw PCM wav files to mp3 encoded files. For video files the best solution is to use mplayer from the command line: mplayer -vo null -ao pcm:file=%FILE_PATH% followed by the name of the video file. This will extract the audio from the video into the wav file given by %FILE_PATH%.
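As a rough illustration of the preparation step described above, here is a minimal Python sketch. It checks a wav file against the required format using the standard wave module and builds the SoX and mplayer command lines as argument lists. The function names are my own, and I am assuming "22KHz" means 22050 Hz; adjust the constant if your setup expects something else.

```python
import wave

# Assumption: "22KHz" in the post is read as the common 22050 Hz rate.
TARGET_RATE = 22050
TARGET_CHANNELS = 2      # stereo
TARGET_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16 bits


def wav_format_problems(path):
    """Return a list describing how the file differs from the target format.

    An empty list means the file already matches. The wave module only
    reads PCM wav files and raises wave.Error for anything else.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()

    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE}")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channel(s), expected {TARGET_CHANNELS}")
    if width != TARGET_SAMPLE_WIDTH:
        problems.append(f"{width * 8} bits per sample, expected 16")
    return problems


def sox_convert_command(src, dst):
    """Build a SoX command that resamples src into the target format."""
    return [
        "sox", src,
        "-r", str(TARGET_RATE),      # output sample rate
        "-b", "16",                  # output bits per sample
        "-c", str(TARGET_CHANNELS),  # output channel count
        "-e", "signed-integer",      # plain signed PCM
        dst,
    ]


def mplayer_extract_command(video, wav_out):
    """Build the mplayer command from the post, with the input file added."""
    return ["mplayer", "-vo", "null", "-ao", f"pcm:file={wav_out}", video]
```

With SoX installed, something like subprocess.run(sox_convert_command("in.wav", "out.wav"), check=True) would run the conversion. Building the commands as lists avoids shell quoting problems with file names that contain spaces.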
Getting to hack on audio and video formats and speech recognition development was an interesting experience. It also exposed me to other, higher level technologies like VoiceXML, and later I had the chance to meet one of the developers of Firevox, a Firefox extension that focuses on accessibility. Coincidentally, a research project here at Stony Brook also works on accessibility using voice technologies: HearSay.
This book about the history of vocoders is titled "How to wreck a nice beach", a phrase commonly used as an example of the difficulty of speech recognition, because it sounds like "How to recognize speech".