
I set out to catalog how six leading chatbots have handled straightforward factual questions, and every single one has confidently hallucinated. Each failure is different, but together they show how easily polished systems can invent court cases, election quotes, and even space history while sounding utterly sure of themselves.
1. ChatGPT’s Fabricated Court Case
ChatGPT’s most striking misstep came when it was asked about a 1996 aviation incident involving China Southern Airlines. In a detailed response described in a 2023 test, it confidently cited a non-existent federal appeals court case, “Varghese v. China Southern Airlines,” and claimed the ruling blamed pilot error. No such case appears in any court’s records, yet the model supplied docket-style details as if it were summarizing real precedent.
For lawyers, regulators, or journalists, this kind of invented authority is not a harmless glitch. A fabricated citation can derail a legal brief, distort risk assessments, or pollute public debate about aviation safety. The episode shows how language models can mimic the tone of legal expertise while bypassing the discipline of actual legal research, leaving users with the burden of verifying every impressive-sounding answer.
2. Google Bard’s Telescope Misstep
Google’s Bard stumbled in a high-profile way when it was asked about the James Webb Space Telescope during a launch demo. In that showcase, Bard asserted that Webb had captured the “very first pictures” of a planet outside the solar system, a claim astronomers quickly flagged as wrong. Ground-based observatories had already imaged exoplanets nearly two decades earlier, beginning in 2004, so the chatbot’s confident statement rewrote basic space history.
The stakes here go beyond a single astronomy fact. Bard’s error surfaced in a polished marketing moment meant to showcase reliability, not in a casual lab test. When a system misrepresents scientific milestones, it risks confusing students, policymakers, and the general public about how discoveries actually unfold. It also highlights how easily a slick interface can mask the absence of rigorous fact-checking behind the scenes.
3. GPT-4’s Election Quote Invention
GPT-4’s hallucination problem showed up clearly in a 2024 Stanford study that probed its handling of the 2020 U.S. election. When researchers asked about specific campaign events, the model invented a quote from then-President Donald Trump, claiming he said “I won the election, big league” in a particular October 2020 speech that never took place. The phrasing sounded plausible, but no transcript or recording matched the description.
In an era when CNN, the New York Times, and other outlets scrutinize every word from Trump, fabricating a speech risks feeding misinformation into already polarized debates. Election narratives shape how citizens view legitimacy and turnout, so invented quotes can be weaponized or misunderstood. The Stanford findings underline that even advanced systems like GPT-4 can blend real political figures with fictional events in ways that are hard for casual users to untangle.
4. Bing Chat’s Fictional Mayor Bio
Microsoft’s Bing Chat veered into pure fiction when asked about local politics in Australia. In early interactions later described in a major report, the chatbot produced a detailed biography for a supposed Australian mayor named Elizabeth Holmes, unrelated to the Theranos founder of the same name. It confidently listed invented milestones, including an election on “March 15, 2019,” as if summarizing a real civic career.
The surreal overlap with the real Elizabeth Holmes of Theranos shows how models can mash together names and roles. For voters and local journalists, a fabricated mayor is not just amusing; it can distort searches about actual candidates and undermine trust in digital tools that claim to explain public life. It also raises questions about liability when AI systems invent officeholders out of thin air.
5. Claude’s Moon Landing Mix-Up
Anthropic’s Claude faltered on one of the most documented events in modern history. In a controlled experiment, it asserted that the 1969 moon landing was a “Soviet-American joint mission” led by cosmonaut Yuri Gagarin. In reality, Apollo 11 was a NASA mission commanded by Neil Armstrong, and Gagarin had died in 1968, more than a year before the landing.
Mislabeling Apollo 11 as a joint mission erases the geopolitical context that defined the space race. For educators and students, such an answer could blur the distinction between Soviet and American programs at a time when history podcasts, classroom materials, and even shows like the Pivot podcast rely on accurate timelines. Claude’s error illustrates how models can remix iconic narratives into alternate histories that feel authoritative but collapse under basic scrutiny.
6. Llama 2’s Berlin Wall Date Error
Meta’s Llama 2 showed how even simple factual prompts can trip up a model. In a benchmark run by Hugging Face, evaluators asked for key Cold War events, and the system replied that the Berlin Wall fell on “November 9, 1988.” The correct date is November 9, 1989, a one-year difference that nonetheless shifts the event’s place in the cascade of Eastern European revolutions.
For historians, civic educators, or anyone studying democratic transitions, that single digit matters. Misdating the fall of the Berlin Wall can confuse timelines that connect protests in East Germany to broader changes in Europe and to later debates about NATO and European Union expansion. Llama 2’s slip shows that even when answers look precise, users still need to cross-check basic numbers before treating them as settled fact.