New Developments for ChatGPT: Multimodal Abilities and Security Challenges

May 14, 2024 ChatGPT

OpenAI recently rolled out a series of upgrades to ChatGPT that extend its understanding to audio and visual data. ChatGPT can now comprehend images and audio in addition to text, building on multimodal GPT-3.5 and GPT-4 models. These models apply their language reasoning skills to a wide range of images, such as photos, screenshots, and documents containing both text and images.

OpenAI has stated that it is adopting a phased deployment strategy aimed at keeping artificial general intelligence (AGI) safe and beneficial. This strategy has become more important now that advanced voice and vision models are involved. For instance, the new audio technology can generate lifelike synthetic voices from short snippets of real speech, which opens up many applications in creative work and accessibility. However, the same capability introduces new risks, such as impersonation of public figures and fraud.

OpenAI is also watching how users apply these new features and is working with multiple teams to test them before deployment in order to mitigate risks. For example, it is partnering with Be My Eyes, a free mobile app for blind and low-vision people, to understand the practical uses and limitations of these features.
