As part of my ongoing research I needed a quick and easy way to recognize speech. After seeing how effortlessly products like Siri handle recognition, I naively thought that the technology had been developing nicely, and that I was a few short clicks away from glorious, well-supported recognition with moderate accuracy. The reality was not quite that. Carnegie Mellon currently puts out the best open-source speech recognition toolkit, CMUSphinx. It’s great, but poorly documented for the beginner. When I visited the page I had one task I wanted to accomplish: recognize arbitrary English quickly, preferably from within a language like Python. While this is certainly possible with Sphinx, it’s not intuitive.
So many options... which one to choose?
By far the hardest aspect of using Sphinx is installing it. It seems the authors, in an effort to cut down on support requests, have actively tried to make it unintuitive.
We must install Sphinx, but which one? On the downloads page the maintainer helpfully points out that it’s tough to know which package to install. We have a good half-dozen available to us: SphinxBase; Sphinx1-3, written in C; Sphinx4, which has been rewritten in Java; and PocketSphinx, which sounds as if it’s designed for a mobile platform.
Which one of these to install is not obvious. At first Sphinx4 seems like the natural choice, but because it’s written in Java and relatively new, it has no language bindings for Python, and it seems very beta-ish.
Looking back, Sphinx3 was written in C and seems decent, so I tried that next. No dice. It’s a mess, and there’s a blurb hidden in a wiki page somewhere noting that it’s for research use only.
Finally, an obscure forum post somewhere mentioned that PocketSphinx is actually intended for desktop usage too, and has Python bindings! This makes a lot of sense. After face-palming for missing the connection, made obvious by the name, I decided PocketSphinx was the application I needed to install!
Luckily for us, Ubuntu has packages available. Pulling out my apt-get shotgun, a quick command installed everything I needed (and more).
sudo apt-get install sphinx*
Actually Doing Recognition
After installing things, life started looking up. Throwing together a quick Python script, using the documentation found here (buried in the CMUSphinx labyrinth), actually wasn’t too difficult.
You’ll need a test audio file. Raw 16-bit audio, formatted as a binary stream of signed integers, works really well. A freely available utility called sox comes with Ubuntu and will help you convert almost anything into raw audio. I’d also suggest looking into Python Audio Tools for on-the-fly conversions; however, don’t try to use PCMConverter, it’s a pile of garbage.
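If the recording is already a 16-bit mono WAV file, a rough alternative to sox is to strip the header with Python's standard wave module, since the sample data inside such a file is already raw PCM. This is only a sketch: the function name wav_to_raw is made up for illustration, and the 16-bit mono checks reflect my assumptions about what the decoder wants, not anything taken from the Sphinx documentation.

```python
import wave


def wav_to_raw(wav_path, raw_path):
    """Copy the PCM samples of a WAV file into a headerless raw file.

    Hypothetical helper: it assumes the input is 16-bit mono and
    refuses anything else rather than trying to resample it.
    """
    wav = wave.open(wav_path, 'rb')
    try:
        if wav.getsampwidth() != 2:
            raise ValueError('expected 16-bit samples')
        if wav.getnchannels() != 1:
            raise ValueError('expected mono audio')
        # readframes returns the sample bytes without the WAV header
        frames = wav.readframes(wav.getnframes())
    finally:
        wav.close()

    with open(raw_path, 'wb') as raw:
        raw.write(frames)
    return len(frames)  # number of bytes written
```

Calling wav_to_raw('speech.wav', 'tmp1.raw') would produce a file you can open in the script below. Anything that needs resampling or channel mixing is still a job for sox.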
Just open up a raw binary audio file, and invoke the decoder:
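import pocketsphinx as ps
import audiotools as at  # Python Audio Tools; only needed for on-the-fly conversion

# Acoustic model, language model, and dictionary installed by the Ubuntu packages
hmmd = '/usr/share/pocketsphinx/model/hmm/wsj1'
lmd = '/usr/share/pocketsphinx/model/lm/wsj/wlist5o.3e-7.vp.tg.lm.DMP'
dictd = '/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic'

# Raw audio is binary data, so open the file in binary mode
fRaw1 = open('tmp1.raw', 'rb')

speechRec = ps.Decoder(hmm=hmmd, lm=lmd, dict=dictd)
speechRec.decode_raw(fRaw1)
result = speechRec.get_hyp()
fRaw1.close()
print(result)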
hmmd, lmd, and dictd point to the acoustic model, language model, and pronunciation dictionary that the Decoder uses to get the sense of the language necessary to decode words. By default PocketSphinx comes with a corpus of general text that works alright. If you’re using Sphinx for domain-specific work, I’d highly recommend creating your own dictionary with a limited number of words. You’ll achieve much greater accuracy that way.
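One rough way to get a limited dictionary is to filter an existing one down to the words you care about. The sketch below assumes the usual CMU dictionary layout, where each line is a word followed by its phones and alternate pronunciations are written as WORD(2). The name filter_dictionary and the sample entries are made up for illustration, and you would most likely also need a language model restricted to the same words, which this does not cover.

```python
def filter_dictionary(dict_lines, vocabulary):
    """Keep only the dictionary entries whose word is in vocabulary.

    Hypothetical helper; matching is case-insensitive and alternate
    pronunciations such as HELLO(2) are kept along with the base word.
    """
    wanted = set(word.lower() for word in vocabulary)
    kept = []
    for line in dict_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Strip the "(2)" style suffix used for alternate pronunciations
        base = parts[0].split('(')[0].lower()
        if base in wanted:
            kept.append(line.rstrip('\n'))
    return kept


# Illustrative entries in CMU dictionary format
sample = [
    'HELLO HH AH L OW',
    'HELLO(2) HH EH L OW',
    'WORLD W ER L D',
    'ZEBRA Z IY B R AH',
]

for entry in filter_dictionary(sample, ['hello', 'zebra']):
    print(entry)
```

In practice you would read the lines from the installed .dic file, write the filtered entries to a new file, and pass that path as dictd.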
And we’re done!
So hopefully by now, if you followed these steps loosely, you’ll have a working speech recognizer. Playing around with my own voice, I’ve found the accuracy to be alright, but not great. Training it to your voice apparently yields better results. From what I’ve read, commercial recognizers use slightly more advanced algorithms than Sphinx currently does, and more community time is needed to bring open-source recognition up to speed with something like Siri.