A key focus of this release is to make AI voice interactions feel more natural and human-like, with significantly reduced latency during conversations
OpenAI has unveiled its latest version of the ChatGPT bot, marking a significant advancement in the field of conversational artificial intelligence.
On Tuesday, OpenAI rolled out an advanced voice mode for ChatGPT, offering users their first experience with GPT-4o’s hyperrealistic audio capabilities. Initially, the enhanced version will be accessible to a limited group of ChatGPT Plus users, with a subscription priced at $20 (Dh74 approx.) per month.
The company plans to extend the feature gradually to all premium users between September and November.
The new release promises enhanced capabilities, increased accuracy, and a more human-like interaction experience. It is set to change the way users interact with AI through real-time, voice-driven conversations.
OpenAI's use of hyperrealistic voice synthesis means that ChatGPT can produce speech that closely mimics human intonation, rhythm, and emotion. Users will find the AI's voice interactions to be engaging and intuitive, with responses that sound remarkably human. This development marks a significant step forward in making AI more accessible and user-friendly.
You might already be familiar with the Voice Mode currently available in ChatGPT, but OpenAI's new Advanced Voice Mode offers a notable upgrade.
A significant focus of this release is on making interactions with the ChatGPT bot feel more natural and human-like. OpenAI has worked on refining the conversational tone of the bot, making it capable of understanding and replicating various styles of communication. Whether the user prefers a formal tone for business interactions or a casual, friendly chat, the new voice mode will be able to adapt accordingly.
Previously, ChatGPT relied on three separate models for its voice feature: one to transcribe your voice to text, GPT-4 to process the input, and another to convert the text back into speech. In contrast, GPT-4o is a multimodal model that handles all these tasks internally, resulting in significantly lower latency during conversations. Responses arrive much faster as a result, bringing the exchange closer to real-life human interaction.
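To see why chaining models adds delay, consider a minimal sketch in Python. It assumes every model call costs one arbitrary unit of time and that calls run strictly one after another. The `Stage` class, the stage names, and the unit costs are illustrative assumptions, not OpenAI's code or measured figures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One model call in a voice pipeline (hypothetical, for illustration)."""
    name: str
    cost_units: float  # arbitrary placeholder unit, NOT a real measurement


def total_delay(stages: List[Stage]) -> float:
    """Stages run in sequence, so the user waits for the sum of their costs."""
    return sum(stage.cost_units for stage in stages)


# Older Voice Mode as the article describes it: three models chained together.
cascaded = [
    Stage("speech-to-text", 1.0),
    Stage("GPT-4 text processing", 1.0),
    Stage("text-to-speech", 1.0),
]

# GPT-4o as the article describes it: one multimodal model, audio in and audio out.
# Assumption: that single call also costs one unit.
end_to_end = [Stage("GPT-4o audio-to-audio", 1.0)]

print(total_delay(cascaded))
print(total_delay(end_to_end))
```

Under these assumptions the chained setup waits on three calls where the single model waits on one. Real timings depend on the models, the network, and audio length.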
Additionally, OpenAI asserts that GPT-4o can detect emotional intonations in your voice, such as sadness, excitement, or even singing.
Initially announced in May, the new voice feature has launched a month later than planned. OpenAI delayed the release to enhance safety measures, ensuring the model can effectively detect and reject inappropriate content.
As with any AI advancement, the introduction of voice capabilities brings ethical considerations and security challenges. OpenAI says it has implemented safeguards to prevent misuse of the voice feature, which include measures to detect and mitigate inappropriate content, as well as systems to ensure that voice data is handled securely and privately.
“We tested GPT-4o's voice capabilities with over 100 external red teamers across 45 languages,” OpenAI announced on X. “To protect people's privacy, we’ve trained the model to only use the four preset voices and developed systems to block any outputs that deviate from those voices. Additionally, we’ve implemented guardrails to prevent requests for violent or copyrighted content.”
In an effort to prevent the model from being misused for creating audio deepfakes, which has become a significant threat to the information economy in recent times, OpenAI has developed four preset voices in collaboration with voice actors. The advanced voice options are designed in a way that avoids impersonating other individuals.
When OpenAI first demonstrated GPT-4o's voice capabilities in May, the voice named Sky drew significant criticism for its close resemblance to that of actress Scarlett Johansson. The actress publicly stated that OpenAI had sought her permission to use her voice, which she had declined. Upon hearing the similarity in the model's demo, she engaged legal counsel to protect her rights.
OpenAI is also committed to transparency and user consent. Users are informed when interacting with AI-generated voices, ensuring that they are aware of when they are communicating with an artificial entity.
However, challenges remain. The potential for misuse of conversational AI, such as generating misleading or harmful information, requires continuous monitoring and improvement of the technology.
As with any new feature, the potential pitfalls in safety and security will become clear only once the advanced voice mode has been rolled out widely and user feedback has been gathered.
Somya Mehta is a Senior Features Writer at Khaleej Times, who contributes extensively to the UAE's arts, culture, and lifestyle scene. When not engrossed in writing, you'll find her on the hunt for the next best solo travel destination or indulging in podcast binges.