
Elon Musk’s chatbot Grok has landed at the bottom of a new benchmark that measures how well leading AI systems recognize and push back on antisemitic and extremist content. The Anti-Defamation League’s researchers say Grok is the weakest of six major models they tested, a finding that cuts against Musk’s repeated claims that his xAI products are safer and more “truth-seeking” than rivals. The results sharpen a broader question for the industry: if one of the most heavily hyped chatbots struggles this badly with antisemitism, what does that mean for the millions of people now relying on AI for information and moderation?

The ADL’s new AI Index is not a casual scorecard, but a structured attempt to quantify how different systems handle anti-Jewish hate, anti-Zionist rhetoric, and extremist propaganda. Its verdict on Grok is stark, and it arrives at a moment when antisemitism is rising offline and online alike. I see the study as both a warning about the current state of AI safety and a roadmap for how quickly the field needs to mature.

The ADL’s AI Index and what it measures

The Anti-Defamation League created the ADL AI Index to move debates about AI safety away from vague assurances and toward measurable performance. Instead of relying on marketing claims from tech companies, the organization built a standardized battery of tests that probe how models respond to prompts about antisemitism, anti-Zionism, and broader extremism. The goal is to see whether these systems can reliably recognize hateful content, refuse to amplify it, and, when appropriate, offer corrective information. That kind of rigor is essential when chatbots are increasingly embedded in search engines, productivity tools, and social platforms.

To give the scores real weight, ADL researchers evaluated more than 25,000 distinct interactions across 37 topical subcategories, spanning three primary content types. That scale matters because it reduces the chance that a model looks good or bad based on a handful of cherry-picked examples. The organization’s methodology also leans on domain experts who understand how antisemitic narratives mutate over time, which makes the Index a rare attempt to fuse technical evaluation with deep subject-matter knowledge.
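For readers who want a concrete sense of how a benchmark like this might turn thousands of graded interactions into a single overall score, here is a minimal, purely illustrative sketch. The ADL has not published its scoring code, and the subcategory names, the 0-100 grading scale, and the averaging scheme below are assumptions chosen only for demonstration.

```python
# Minimal illustrative sketch only: NOT the ADL's published methodology.
# It assumes each model response is graded 0-100 by content type and
# subcategory, then averaged so no single subcategory dominates the total.
from collections import defaultdict
from statistics import mean

# Hypothetical graded interactions: (content_type, subcategory, score)
graded = [
    ("anti-Jewish", "conspiracy tropes", 90),
    ("anti-Zionist", "delegitimization", 70),
    ("extremism", "violent propaganda", 80),
]

def overall_score(interactions):
    """Average the per-subcategory means into a single 0-100 score."""
    buckets = defaultdict(list)
    for content_type, subcategory, score in interactions:
        buckets[(content_type, subcategory)].append(score)
    return mean(mean(scores) for scores in buckets.values())

print(round(overall_score(graded)))  # prints 80 for this toy data
```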

Grok’s last-place ranking and what the numbers show

Within that framework, Grok’s performance stands out for the wrong reasons. In the Index’s overall ranking of six leading large language models, Grok landed in last place with an overall score of 21, far behind its peers. That figure reflects how the model handled a mix of direct antisemitic slurs, coded language, and more subtle forms of bias. When I look at a score that low, I see a system that is not just missing edge cases, but failing at the basics of content moderation in a domain where mistakes can have serious real-world consequences.

The same pattern shows up when the results are sliced by task. Across the six top large language models in the test suite, xAI’s Grok performed the worst at identifying and countering antisemitic content, lagging behind competitors like Gemini and Llama. Another breakdown of the Index notes that Grok’s strengths are limited and inconsistent, with the ADL highlighting that its low aggregate score reflects repeated failures to flag or challenge harmful narratives. For a product that Elon Musk has pitched as a more “honest” alternative to other chatbots, the numbers suggest a system that is undertrained or underaligned on one of the most sensitive categories of online speech.

Claude’s contrasting performance and what “good” looks like

The same evaluation that punished Grok also showed what better performance can look like. The ADL reports that Claude topped the list with an overall score of 80 across the various chat formats and the three categories of prompts: anti-Jewish, anti-Zionist, and extremist content. The gap between 80 and 21 is not a marginal difference; it is the difference between a model that usually gets it right and one that routinely stumbles. When I compare those scores, I see evidence that better safeguards are not just theoretically possible; they are already deployed in production systems.

The Index also digs into how models behave in specific testing modes. The ADL notes that, within the survey modality, one model scored a perfect 100 in detecting and responding to anti-Jewish bias while scoring lowest in the extremist categories. That 100 shows that, under controlled conditions, an AI system can be tuned to spot and address anti-Jewish bias with near-total reliability. The challenge is extending that level of performance to more open-ended, real-world conversations, where prompts are messy, context shifts quickly, and users may be actively trying to evade filters.

Why Grok’s failure matters for Musk, xAI, and users

Elon Musk has framed Grok as a bold alternative to what he portrays as overly censored mainstream AI, yet the ADL’s findings suggest that this looser approach is coming at a cost. The ADL found that Musk’s chatbot Grok performed the worst at countering antisemitic content compared with the five other leading AI systems it tested. For users on X and other platforms where Grok is integrated, that means a higher risk of receiving unchallenged or even subtly reinforced antisemitic narratives when they ask about Jewish history, Israel, or conspiracy theories that target Jewish communities.

The ADL’s senior vice president of counter-extremism and intelligence, Oren Segal, has described the Index as a way to fill a critical gap in AI safety research by applying domain expertise and standardized testing to antisemitic and extremist content. When a model like Grok scores at the bottom of that benchmark, it is not just a reputational problem for xAI; it is a signal that the product may be unfit for use in contexts like education, news discovery, or campus tools where antisemitism is already a live concern. For Musk, who has positioned himself as a defender of free speech, the challenge now is to show that his commitment to open expression does not translate into an AI that shrugs at hate.

How the Index could reshape AI safety debates

Beyond any single model, the ADL’s work is likely to influence how regulators, universities, and tech buyers think about AI procurement and oversight. The ADL emphasizes that the Index fills a critical gap in AI safety research by giving institutions a way to compare systems and address antisemitism on campus and beyond. I expect that kind of standardized scoring to become a reference point for school districts choosing tutoring bots, companies integrating AI into HR tools, and platforms deciding which models to plug into their recommendation engines.

For developers, the message is equally clear. The ADL’s detailed breakdown of model strengths and weaknesses shows that safety is not a monolith but a set of specific competencies that can be measured and improved. When the ADL reports that Grok ranked last and Claude scored 80 overall, it creates competitive pressure to close the gap, not just in aggregate scores but in sensitive areas like anti-Jewish and anti-Zionist content. If companies take that challenge seriously, the next generation of AI systems could be far better at recognizing and countering antisemitism than the tools many people are using today.
