OpenAI has announced that third-party developers can now incorporate ChatGPT and Whisper into their apps and services through newly available APIs, at prices substantially lower than using its existing language models.
The Whisper API is a hosted version of the company’s open-source Whisper speech-to-text model, which launched in September 2022. Whisper is an automatic speech recognition (ASR) system that OpenAI says enables large-scale transcription in multiple languages for $0.006 per minute. It accepts a variety of file types, including M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM.
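As a rough illustration, a minimal sketch of calling the hosted API might look like the following. It assumes the documented `/v1/audio/transcriptions` REST endpoint, an `OPENAI_API_KEY` environment variable, and a local audio file; the exact client interface can differ depending on the SDK or version you use.

```python
import os
import requests

# Minimal sketch: send an audio file to the hosted Whisper API for transcription.
# Assumes OPENAI_API_KEY is set in the environment; error handling is kept minimal.
API_URL = "https://api.openai.com/v1/audio/transcriptions"


def transcribe(path: str) -> str:
    with open(path, "rb") as audio_file:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            data={"model": "whisper-1"},   # the hosted Whisper model
            files={"file": audio_file},    # M4A, MP3, MP4, MPEG, MPGA, WAV, or WEBM
        )
    response.raise_for_status()
    return response.json()["text"]


if __name__ == "__main__":
    print(transcribe("meeting.mp3"))  # hypothetical input file
```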
Although competitors such as Google, Amazon, and Meta have built high-quality speech recognition systems, Whisper stands apart because it was trained on 680,000 hours of multilingual and “multitask” data collected from the web. According to Greg Brockman, the president and chairman of OpenAI, this allows it to better handle distinctive accents, background noise, and technical jargon.
“We released a model, but that actually was not enough to cause the whole developer ecosystem to build around it,” Brockman said in a video call with TechCrunch yesterday afternoon. “The Whisper API is the same large model that you can get open source, but we’ve optimized to the extreme. It’s much, much faster and extremely convenient.”
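For comparison, the open-source model Brockman refers to can be run locally with the `openai-whisper` Python package. A minimal sketch, assuming the package is installed and the hardware can accommodate the full-size checkpoint, looks like this:

```python
import whisper  # the open-source package, installable as openai-whisper

# Minimal local sketch of the open-source model Brockman refers to.
# "large" is the full-size checkpoint; smaller variants ("base", "small",
# "medium") trade accuracy for speed on modest hardware.
model = whisper.load_model("large")
result = model.transcribe("meeting.mp3")  # hypothetical input file
print(result["text"])
```

The hosted API removes the need to download and serve a checkpoint like this yourself, which is where the claimed speed and convenience gains come in.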
To Brockman’s point, there are barriers keeping businesses from adopting voice transcription technology. A 2020 Statista survey cites accuracy, accent- or dialect-related recognition problems, and cost as the primary impediments to adopting technology such as speech-to-text.
One of Whisper’s limitations lies in “next-word” prediction. Because the system was trained on a large amount of noisy data, OpenAI warns that Whisper’s transcriptions may include words that were never actually spoken, possibly because it is both trying to predict the next word in the audio and to transcribe the recording itself.
Furthermore, Whisper’s performance varies by language: speakers of languages that are less well represented in its training data see higher error rates.
Expanding on that point, even the best systems exhibit bias: a 2020 Stanford study found that systems from Amazon, Apple, Google, IBM, and Microsoft produced considerably fewer errors, roughly 19% fewer, with white users than with Black users.