If you want to try high-quality voice recognition without buying anything, good luck. Sure, you can borrow the speech recognition on your phone or coerce some virtual assistants on a Raspberry Pi into doing the processing for you, but those aren’t good for serious work where you don’t want to be tied to a closed-source solution. OpenAI has introduced Whisper, which they claim is an open source neural network that “approaches human-level robustness and accuracy in English speech recognition.” It appears to work on at least some other languages, too.
If you try the demos, you’ll see that speaking fast or with a strong accent doesn’t seem to affect the results. The post mentions that it was trained on 680,000 hours of supervised data. If you had to talk to an AI for that long, it would take you more than 77 years without sleep!
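If you want to check that figure, the arithmetic is short. This is just a back-of-the-envelope sketch; the variable names are ours, and it assumes 365-day years with no breaks at all.

```python
# Convert the training-set size quoted in the post into years of nonstop talking.
HOURS_OF_AUDIO = 680_000       # figure quoted from the post
HOURS_PER_YEAR = 24 * 365      # assumption: no sleep, no leap years

years = HOURS_OF_AUDIO / HOURS_PER_YEAR
print(f"{years:.1f} years")    # prints "77.6 years"
```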
Internally, speech is split into 30-second chunks that are converted into a spectrogram. An encoder processes the spectrogram, and a decoder digests the result, using some prediction and other heuristics to produce text. About a third of the data was from non-English-speaking sources and then translated. You can read the paper for details on how the generalized training underperforms some specifically trained models on standard benchmarks, but the authors believe Whisper does better on random speech beyond those particular benchmarks.
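To make the 30-second chunking concrete, here is a minimal sketch in plain Python. It is not Whisper’s actual code: the function name `split_into_chunks` is hypothetical, the fixed non-overlapping windows and zero padding are simplifying assumptions, and the 16 kHz rate is the sample rate Whisper resamples audio to. The spectrogram step that follows is left out.

```python
SAMPLE_RATE = 16_000                     # samples per second (Whisper resamples to 16 kHz)
CHUNK_SECONDS = 30                       # window length described above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Split mono audio samples into fixed-length chunks.

    Hypothetical helper for illustration only: the last chunk is
    padded with zeros (silence) so every chunk has the same length.
    """
    chunks = []
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        # Pad a short final chunk with silence up to the full window.
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        chunks.append(chunk)
    return chunks


# 70 seconds of silence stands in for a real recording.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), "chunks of", len(chunks[0]), "samples")
# prints "3 chunks of 480000 samples"
```

In a real pipeline, each chunk would then be turned into a spectrogram before it reaches the encoder.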
Even the “tiny” variant of the model weighs in at 39 million parameters, and the “large” variant has over a billion and a half. So this probably won’t run on your Arduino any time soon. If you do want to code, though, it’s all on GitHub.
There are other solutions, but none as robust. If you want to go the assistant-based route, here’s some inspiration.