OpenAI’s advanced chatbot, ChatGPT, has recently come under scrutiny after researchers from Stanford and UC Berkeley documented a puzzling decline in its performance over just a few months. A study released on July 18 found that ChatGPT’s latest models had become noticeably worse at answering the same questions they had previously handled accurately.
Despite extensive analysis, the researchers were unable to pinpoint the exact cause of the chatbot’s deteriorating capabilities. To compare the models over time, researchers Lingjiao Chen, Matei Zaharia, and James Zou tested GPT-3.5 and GPT-4 on a range of tasks, including solving math problems, answering sensitive questions, generating code, and visual reasoning.
The study found a dramatic drop in accuracy for GPT-4. In March, the model identified prime numbers with an impressive 97.6% accuracy; when the same test was run in June, accuracy plummeted to a mere 2.4%. By contrast, the older GPT-3.5 model actually improved at prime-number identification over the same period.
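To make the task concrete: an evaluation like this needs a deterministic ground truth against which the model’s yes/no answers are scored. The sketch below is illustrative (the function names and answer format are assumptions, not the study’s actual harness), showing a simple primality check and a scoring helper.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, sufficient for benchmark-sized inputs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def score_answers(answers: dict[int, str]) -> float:
    """Fraction of model answers ('yes'/'no') that match the ground truth.

    `answers` maps each tested number to the model's reply; the 'yes'/'no'
    format is a simplifying assumption for illustration.
    """
    correct = sum(
        (reply.strip().lower() == "yes") == is_prime(n)
        for n, reply in answers.items()
    )
    return correct / len(answers)
```

Measured this way, a model that answers "yes" to every prompt can still look accurate on a benchmark dominated by primes, which is one reason such headline accuracy figures deserve careful reading.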
The decline was not limited to prime-number identification. Both GPT-3.5 and GPT-4 also became significantly worse at generating working code between March and June. ChatGPT’s handling of sensitive questions changed as well: earlier versions offered detailed reasoning for declining to answer, but by June the models simply apologized and refused, including on prompts touching on ethnicity and gender.
The authors of the study emphasized that the behavior of large language models like ChatGPT can change considerably within a relatively short period, underscoring the need for continuous monitoring of model quality. They advised users and companies that rely on these models in their workflows to put monitoring in place to verify that the chatbot’s performance remains reliable.
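In practice, the monitoring the authors recommend can be as simple as replaying a fixed benchmark against the model on a schedule and flagging accuracy drift. The sketch below assumes a `query_model` callable standing in for whatever API client a workflow actually uses; it is a minimal illustration, not a production harness.

```python
from typing import Callable

def run_benchmark(query_model: Callable[[str], str],
                  benchmark: list[tuple[str, str]]) -> float:
    """Return the fraction of benchmark prompts answered correctly.

    `query_model` is a hypothetical stand-in for a real API call;
    `benchmark` pairs each prompt with its expected answer.
    """
    correct = sum(query_model(prompt).strip().lower() == expected
                  for prompt, expected in benchmark)
    return correct / len(benchmark)

def has_drifted(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True if accuracy has dropped more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance
```

Running the same benchmark after every model update, and alerting when `has_drifted` fires, is one straightforward way to catch the kind of silent regression the study describes.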
In a separate development, OpenAI took proactive measures by announcing plans on June 6 to establish a dedicated team to manage the potential risks associated with superintelligent AI systems, which they anticipate could emerge within the next decade. This step showcases OpenAI’s commitment to addressing the challenges posed by the advancement of AI.
The findings raise concerns about the reliability and consistency of AI models like ChatGPT. As these systems continue to advance, developers and users will need to stay vigilant, evaluate them regularly, and take proactive steps to mitigate potential risks.
It is expected that OpenAI will thoroughly investigate the reasons behind ChatGPT’s deteriorating performance and work towards improving its capabilities. As users and companies rely on AI language models for various applications, it is essential to maintain their trust by addressing any issues and continuously striving for enhanced performance and reliability.