Smart Software: Caution Advised

March 26, 2026

Another dinobaby post. No AI unless it is an image. This dinobaby is not Grandma Moses, just Grandpa Arnold.

A Washington State University (Spokane, Washington) professor reported some information about ChatGPT’s accuracy. ScienceDaily summarized the professor’s research in “Study Finds ChatGPT Gets Science Wrong More Often Than You Think.” The “you” troubled me because I know that when factual information is required, one must have an answer or sufficient knowledge about a topic before creating a prompt. Therefore, the “you” might be considered overly broad.


Sam, who works in the field of artificial intelligence, is outraged that his friends will not count his horseshoe on the roof as a point. His three friends find his arguments better than a Jimmy Kimmel quip. Thanks, Venice.ai. Good enough, but why is Sam fat and the other three more svelte?

What did the good professor set out to learn? The WSU luminary wanted to test the ChatGPT large language model. The approach was to create scientific hypotheses and then prompt the model to label each one “true” or “false.” I am using quotes because I have learned, as I marched toward my present status of dinobaby, that truth and falsity are mercurial.
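
To make the protocol concrete, here is a rough sketch of the evaluation loop as I understand it. The ask_model function is a stand-in of my own invention, not the study’s code; a coin flip substitutes for an actual ChatGPT call so the snippet runs:

```python
import random
from collections import Counter

def ask_model(hypothesis: str) -> str:
    """Placeholder for a real LLM call; a coin flip here so the sketch runs."""
    return random.choice(["true", "false"])

def evaluate(hypothesis: str, ground_truth: str, runs: int = 10):
    """Ask the same question repeatedly; score accuracy and consistency."""
    answers = [ask_model(hypothesis) for _ in range(runs)]
    accuracy = answers.count(ground_truth) / runs               # share matching the label
    consistency = Counter(answers).most_common(1)[0][1] / runs  # share agreeing with the modal answer
    return accuracy, consistency

print(evaluate("Water boils at 100 C at sea level.", "true"))
```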

Let’s jump to the findings. Please read the full article in ScienceDaily.

I noted this statement:

In total, the team evaluated more than 700 hypotheses and asked the same question 10 times for each one to measure consistency. When the experiment was first conducted in 2024, ChatGPT answered correctly 76.5% of the time. In a follow-up test in 2025, accuracy rose slightly to 80%. However, once the researchers adjusted for random guessing, the results looked far less impressive. The AI performed only about 60% better than chance, a level closer to a low D than to strong reliability.
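
The “adjusted for random guessing” step is worth a second look. On a binary true/false task, blind guessing scores 50 percent, so a kappa-style correction (my assumption about the adjustment; the article does not spell out the formula) reproduces the reported 60 percent figure:

```python
def chance_adjusted(observed: float, chance: float = 0.5) -> float:
    """Kappa-style correction: distance above chance, rescaled to 0..1.
    Assumes a binary task where guessing yields 50% accuracy."""
    return (observed - chance) / (1.0 - chance)

print(chance_adjusted(0.765))  # 2024 run -> 0.53
print(chance_adjusted(0.80))   # 2025 run -> 0.60, the "60% better than chance"
```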

But the killer comment, in my opinion, was this one:

The system had the most difficulty identifying false statements, correctly labeling them only 16.4% of the time. It also showed notable inconsistency. Even when given the exact same prompt 10 times, ChatGPT produced consistent answers only about 73% of the time.

Knowing what’s wrong strikes me as an important mental or knowledge value operation. The relevant scores run from about 16 percent (correctly flagging false statements) to 73 percent (answering the same prompt the same way). But what about the other 84 percent or 27 percent? How does that work out for decisions that involve medical treatments, stress analyses for alternative nuclear reactors, or smart weapons? I know the answer, and most people who interact with me don’t like how I respond to this question. But here it is: Today’s smart software is essentially close enough for horseshoes. Stated another way, one might as well get a chimpanzee to throw darts at a target with “answers” written on Post-it Notes.
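
For the record, those two complements are simple arithmetic on the reported figures:

```python
false_flagged = 0.164  # false statements correctly labeled
consistent    = 0.73   # identical prompts producing identical answers

print(f"False statements missed: {1 - false_flagged:.0%}")      # ~84%
print(f"Prompts answered inconsistently: {1 - consistent:.0%}")  # 27%
```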

The ScienceDaily article pointed out:

The findings … highlight the importance of using caution when relying on AI for important decisions, especially those that require nuanced or complex reasoning. While generative AI can produce smooth, convincing language, it does not yet demonstrate the same level of conceptual understanding. According to [Professor] Cicek, these results suggest that artificial general intelligence capable of truly “thinking” may still be further away than many expect. “Current AI tools don’t understand the world the way we do — they don’t have a ‘brain,’” Cicek said. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about.”

Several observations:

  1. Studies like the one from the Washington State University professor suggest that smart software makes mistakes…frequently. The so-called improvements touted in news releases and marketing collateral are not in line with smart software’s actual performance on factual tasks.
  2. The need for a next big thing has created a situation in which developers disseminate a fictional description of what they believe their probabilistic word prediction systems can deliver. Belief is good. Failure to recognize and articulate limitations is bad. We are in a bad information space, in my opinion.
  3. The money pumped into smart software is notable. Furthermore, the relatively small number of organizations investing tens of billions of dollars want to “own” the market. The idea would earn a student in an MBA program a high mark and maybe a grant. In real life, the mismatch between what is marketed and what the systems can do in a fact-centric setting is wide.

Net net: One hopes that existing smart software can be juiced up with additional methods. Until then, keep those accuracy figures in mind: 16 percent on false statements and 73 percent on consistency. Getting to 100 percent matters to some. On the other hand, most users of smart software don’t know what’s fact or fiction in the output. Furthermore, good enough is the new norm for excellence for many people and organizations.

Stephen E Arnold, March 26, 2026
