Three Ways In Which Whisper Is Advancing ChatGPT
Whisper will advance ChatGPT in three ways: through the disfluency of speech, through the creation of more data for ChatGPT to access, and through the shift of LLMs towards Multi-Modal Foundation Models.
In this article I want to consider why Whisper, OpenAI’s ASR model, will benefit ChatGPT, and why OpenAI releasing the two APIs simultaneously is not a coincidence…
Speech Disfluency
Conversations transcribed from speech via ASR are vastly different from conversations generated by user text input.
Speech input differs so much from text input because of the disfluency of speech, as opposed to typed conversations.
Disfluency is what makes developing a successful voicebot so hard.
Disfluency can be described as the breaks, irregularities and non-lexical vocables which occur within the flow of otherwise fluent speech: speakers self-correct, repeat words, restate prior context, and so on.
Disfluencies are interruptions in the regular flow of speech, such as saying uh and um, pausing silently, repeating words, or interrupting oneself to correct something said previously.
Hence speech disfluency will introduce a new paradigm for ChatGPT in terms of the text data submitted to it.
I hasten to add that transcribed audio data is not necessarily low-quality language data, but merely different. A key indicator of the general quality of transcribed audio is Word Error Rate (WER).
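As a concrete illustration, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The example transcripts are hypothetical.

```python
# Minimal WER sketch: word-level Levenshtein distance between a reference
# transcript and an ASR hypothesis, divided by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            insertion = dp[i][j - 1] + 1
            deletion = dp[i - 1][j] + 1
            dp[i][j] = min(substitution, insertion, deletion)

    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution ("weather" -> "whether") plus one inserted filler ("um"):
print(wer("what is the weather today", "what is um the whether today"))
# -> 0.4 (2 errors over 5 reference words)
```

Note how a single filler word and one misrecognised word already push the WER to 40% on a five-word utterance, even though the transcript is still perfectly usable as conversational data.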
[In any conversation] meaning is established through turns.
~ John Taylor
Access to More Data
There is an interesting paper which investigates the growth in training-data usage against the total stock of unlabelled data available on the internet.
The paper concludes that high-quality language data will be exhausted by 2026, while low-quality language data and images will be exhausted much later.
Coupling the Whisper API with the ChatGPT API provides a whole new source of data to the OpenAI GPT models. The vast amount of existing audio data can now be accessed and put to work, with a low barrier to entry in terms of technical expertise and cost.
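To illustrate this coupling, below is a minimal sketch using the OpenAI Python library as it stood at the release of the two APIs: an audio file is transcribed with the Whisper endpoint and the transcript is then passed to the ChatGPT endpoint. The file name, prompt wording and model choices are assumptions for illustration.

```python
import openai

# Assumes the OPENAI_API_KEY environment variable is set, or set it here:
# openai.api_key = "sk-..."

# Step 1: transcribe an audio file with the Whisper API ("whisper-1").
with open("customer_call.mp3", "rb") as audio_file:  # hypothetical file
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

# Step 2: pass the (possibly disfluent) transcript to the ChatGPT API.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You summarise call transcripts."},
        {"role": "user", "content": f"Summarise this call:\n{transcript['text']}"},
    ],
)

print(response["choices"][0]["message"]["content"])
```

A few lines of code turn any recorded conversation into text ChatGPT can reason over, which is exactly why the low barrier to entry matters.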
Large Language Models ➡️ Foundation Models ➡️ Multi-Modal AI
There has been a shift in Large Language Models towards including additional modalities, and audio was the first step in this multi-modal approach.
Large Models are sprawling into other, non-language-related tasks, but they still remain the foundation of many applications and services.
A Foundation Model can be language-only in terms of functionality, or it can also cover other modalities like voice, images and video.
Read more here:
Large Language Models, Foundation Models & Multi-Modal Models
⭐️ Please follow me on LinkedIn for updates on Conversational AI ⭐️
I’m currently the Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces and more.