Claims have surfaced in the last few days that OpenAI's artificial intelligence-powered chatbot ChatGPT seems to be getting worse as time goes on, and researchers can't figure out why. In the latest such claim, Fortune magazine screamed in a headline, "Over just a few months, ChatGPT went from correctly answering a simple math problem 98% of the time to just 2%."
They are not alone in the assertion that ChatGPT, hailed as a “revolutionary innovation” that will change society, has been losing credibility and trust.
In a July 18 study, researchers from Stanford and UC Berkeley found ChatGPT’s newest models had become far less capable of providing accurate answers to an identical series of questions within the span of a few months.
Multiple problems had already been associated with the chatbot, the most concerning being its propensity for "hallucinations," which can range from minor inaccuracies to completely erroneous responses. For example, ChatGPT might generate a plausible-sounding answer to a factual question that is in fact wrong and essentially gibberish.
This latest study’s authors couldn’t provide a clear answer as to why the AI chatbot’s capabilities had deteriorated.
To test how reliable the different models of ChatGPT were, researchers Lingjiao Chen, Matei Zaharia and James Zou asked the GPT-3.5 and GPT-4 models to solve a series of math problems, answer sensitive questions, write new lines of code and perform visual reasoning from prompts.
According to the research, GPT-4 could identify prime numbers with 97.6% accuracy in March. When the same test was run in June, GPT-4's accuracy had plummeted to just 2.4%.
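To make the nature of that test concrete, the following is a minimal sketch, not the study's actual code, of how such an evaluation can be scored: a ground-truth primality check is compared against a batch of yes/no answers attributed to the chatbot. The numbers and answers in model_answers are hypothetical placeholders rather than real GPT-4 output.

```python
# Minimal sketch (not the study's code) of scoring a model's answers to
# "Is N prime?" questions against ground truth. The numbers and answers
# below are hypothetical placeholders, not real chatbot output.

def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

# Hypothetical (number, chatbot's yes/no answer) pairs.
model_answers = [(17077, "yes"), (17078, "yes"), (20011, "no"), (104729, "yes")]

correct = sum(
    1 for n, answer in model_answers
    if (answer.lower() == "yes") == is_prime(n)
)
accuracy = correct / len(model_answers)
print(f"Accuracy: {accuracy:.1%}")
```

Scored over a large batch of such questions, an accuracy computed this way is what headline figures like 97.6% versus 2.4% refer to.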
In contrast, the earlier GPT-3.5 model had improved on prime number identification within the same time frame.
When it came to generating lines of new code, the abilities of both models deteriorated substantially between March and June.
Another problem concerned sensitive questions about race and gender. The study found that ChatGPT's responses in this area became less informative over time, with the models increasingly giving terse refusals to answer.
Earlier iterations of the chatbot provided extensive reasoning for why they couldn't answer certain sensitive questions. By June, however, the models simply apologized to the user and refused to answer.
“The behavior of the ‘same’ [large language model] service can change substantially in a relatively short amount of time,” the researchers wrote, noting the need for continuous monitoring of AI model quality.
The researchers recommended that users and companies who rely on LLM services as a component of their workflows implement some form of ongoing monitoring to ensure the chatbot remains up to speed. But that advice implies a lack of trust in the technology and a substantial reduction in its usefulness, since the oversight it requires may neutralize the benefits it purportedly offers.
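As a rough illustration of what such monitoring could look like in practice, here is a minimal sketch under stated assumptions: a small fixed benchmark of question-and-answer pairs, a hypothetical query_model function standing in for whatever LLM API a workflow actually calls, and an arbitrary alert threshold. The idea is simply to re-run the same benchmark periodically and flag when accuracy drifts downward.

```python
# Rough sketch of drift monitoring for an LLM-backed workflow.
# `query_model` is a hypothetical placeholder for whatever API call the
# workflow actually makes; the benchmark and threshold are illustrative.
from typing import Callable

BENCHMARK = [
    ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ("Is 17078 a prime number? Answer yes or no.", "no"),
]
ALERT_THRESHOLD = 0.9  # flag a problem if accuracy falls below 90%


def benchmark_accuracy(query_model: Callable[[str], str]) -> float:
    """Re-run the fixed benchmark and return the fraction answered correctly."""
    correct = sum(
        1 for prompt, expected in BENCHMARK
        if query_model(prompt).strip().lower().startswith(expected)
    )
    return correct / len(BENCHMARK)


def check_for_drift(query_model: Callable[[str], str]) -> None:
    """Print a warning if the model's benchmark accuracy has degraded."""
    accuracy = benchmark_accuracy(query_model)
    if accuracy < ALERT_THRESHOLD:
        print(f"WARNING: accuracy dropped to {accuracy:.0%}; model behavior may have drifted")
    else:
        print(f"OK: accuracy {accuracy:.0%}")
```

Scheduling something like check_for_drift to run daily against the production model would provide the kind of continuous quality monitoring the study calls for, though it is exactly this extra layer of oversight that cuts into the technology's promised convenience.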
On June 6, OpenAI unveiled plans to create a team that will help manage the risks that could emerge from a superintelligent AI system, something it expects to arrive within the decade. Yet if inaccuracy continues to increase, there may be even more problems to consider.
While the experts cannot explain this alarming degradation of ChatGPT's abilities, one speculation is that the chatbot is a victim of its own initial success: it is simply being overworked. It may be experiencing capacity issues because of the heavy traffic its website has drawn since it became an internet sensation.