IT Brief Australia - Technology news for CIOs & IT decision-makers

Mainstream AI fails key Australia, NZ curriculum test

Wed, 18th Feb 2026

A new benchmark of about 13,500 question-and-response pairs has found that mainstream AI models struggle to recall and interpret detailed content from Australian and New Zealand K-10 curriculum frameworks, despite strong performance on broader, open-ended education questions.

Researchers assessed seven widely used frontier models from OpenAI, Google and Anthropic against the Australian Curriculum v9, the Victorian Curriculum, the Western Australian Curriculum and the New Zealand Curriculum. They also tested CurricuLLM, an education technology product built specifically around those documents.

Across all four curricula and all question types, no mainstream model achieved an overall pass rate above 41%. Performance dropped sharply on the structured knowledge teachers often need for planning, curriculum mapping and compliance, such as the meaning of specific outcome entries and recall of content points.

Structured recall

The benchmark split questions into five categories. It included open-ended prompts that mirror how teachers might query a chatbot, alongside structured tasks such as interpreting an outcome statement, matching outcomes to subject and year level, and reverse look-ups from content to the relevant curriculum entry.
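The category breakdown above can be sketched as a simple data model. This is an illustrative reconstruction only: the benchmark's actual category labels, item schema, and outcome codes are not published in this article, so every name below is an assumption.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical category names inferred from the article's description;
# the benchmark's real labels may differ.
class Category(Enum):
    OPEN_ENDED = "open-ended teacher-style question"
    OUTCOME_INTERPRETATION = "interpret a specific outcome statement"
    OUTCOME_IDENTIFICATION = "match an outcome to subject and year level"
    REVERSE_LOOKUP = "map content back to the relevant curriculum entry"
    CONTENT_RECALL = "recall specific content points"

@dataclass
class BenchmarkItem:
    category: Category
    curriculum: str   # e.g. "Australian Curriculum v9"
    prompt: str
    expected: str     # reference answer or outcome code

# Illustrative item; the outcome code is a placeholder, not a real entry.
item = BenchmarkItem(
    category=Category.REVERSE_LOOKUP,
    curriculum="Victorian Curriculum",
    prompt="Which outcome covers this content at this year level?",
    expected="OUTCOME-PLACEHOLDER-01",
)
```

Structuring items this way lets pass rates be reported per category, which is how the benchmark surfaces the gap between open-ended performance and structured recall.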

On open-ended, conceptual questions, results were much higher: Google's Gemini 3 Pro reached 80% in that category. Performance fell on detailed curriculum elements written in formal frameworks and outcome codes.

Two of the seven mainstream models scored 0% when asked to interpret specific curriculum outcome entries. Most scored below 17% on content recall questions.

Among the Australian frameworks, the Victorian Curriculum produced the lowest baseline pass rates, dropping as low as 15% for some models. The benchmark also found a recurring tendency for models to draw on older curriculum versions that have since been superseded.

Curriculum drift

The study linked part of the problem to a mismatch between general training data and the specificity of regional education frameworks. It also flagged operational risk from ongoing revisions to curriculum documents in both countries.

All four curricula assessed are undergoing multi-year revision processes. Models frequently answered with outdated information, raising the risk that teachers using general-purpose chatbots will plan lessons against superseded requirements.

Dan Hart, founder of CurricuLLM, said the work aimed to test whether tools already used in classrooms reflect local curriculum reality.

"We wanted to answer a simple question: do the AI tools teachers are already using actually know our curriculum?" Hart said.

He argued that strong results on standardised tests and general knowledge benchmarks do not translate into reliable recall of curriculum frameworks used in Australia and New Zealand.

"These are incredibly capable models. They score 95% on American SAT exams. But when it comes to the specific curriculum content that structures teaching and learning in Australia and New Zealand, there's a very significant knowledge gap. The data makes that clear," Hart said.

Different behaviours

The benchmark also reported differences in how models handled uncertainty. Anthropic's Claude models were more likely to decline to answer when they lacked confidence. By contrast, OpenAI and Google models rarely refused to respond and often returned confident but incorrect answers.

The study framed this pattern as a potential classroom risk. Teachers may treat an assertive response as reliable even when it contains outdated curriculum references or inaccurate interpretations of outcome statements.

Hart said the findings were not an argument against AI in schools. Instead, he said they show general-purpose chatbots have blind spots on regional curriculum materials.

"This isn't a criticism of AI in education, quite the opposite. We believe AI will be transformative for teaching," Hart said. "But the tools teachers use need to actually know the curriculum they're working with. A general-purpose chatbot trained predominantly on American content is going to have blind spots when it comes to the Australian Curriculum or the New Zealand Curriculum. That's not a controversial claim, it's just a data problem, and it's one that can be solved."

Purpose-built model

CurricuLLM was evaluated alongside the mainstream models using the same benchmark. It recorded an overall pass rate of 89%, outperforming the strongest mainstream model by 48 percentage points.

CurricuLLM scored 83% on outcome identification and 89% on reverse look-ups. On open-ended teacher-style questions, it reached 98%.

Its system grounds responses in version-controlled curriculum data. The benchmark presented this approach as a way to reduce reliance on what a model has memorised from general web-scale training.
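The grounding approach described above can be sketched as a retrieval store keyed by framework and version, so that answers are drawn from a specific curriculum release rather than from whatever the model memorised during training. This is a minimal illustration under that assumption; CurricuLLM's actual architecture is not detailed in the article, and all names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CurriculumEntry:
    code: str      # outcome or content code
    version: str   # e.g. "v9"
    text: str

class CurriculumStore:
    """In-memory store keyed by (framework, version), so a response can
    only be grounded in the release the school is actually required to
    teach against."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], dict[str, CurriculumEntry]] = {}

    def add(self, framework: str, entry: CurriculumEntry) -> None:
        bucket = self._entries.setdefault((framework, entry.version), {})
        bucket[entry.code] = entry

    def lookup(self, framework: str, version: str, code: str):
        return self._entries.get((framework, version), {}).get(code)

store = CurriculumStore()
store.add(
    "Australian Curriculum",
    CurriculumEntry(code="EX-01", version="v9", text="Example outcome text"),
)

# The retrieved entry, not the model's parametric memory, is what gets
# passed to the LLM as context.
hit = store.lookup("Australian Curriculum", "v9", "EX-01")
# Requests against a superseded version return nothing rather than
# silently falling back to stale content.
stale = store.lookup("Australian Curriculum", "v8.4", "EX-01")
```

The design choice worth noting is the version key: when a framework is revised, the store serves the new release explicitly instead of letting outdated memorised text leak into answers, which is the "curriculum drift" failure mode the benchmark observed in mainstream models.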

School adoption

The findings come amid growing use of AI tools by teachers. The research cited survey data indicating that 69% of New Zealand primary school teachers use AI weekly for lesson planning and assessment, and pointed to interest among Australian policymakers in AI risk management in education.

Hart said schools should consider the reliability and currency of curriculum information as adoption increases.

"Teachers are already stretched thin. If an AI tool is going to be part of their workflow, it has to give them confidence that the curriculum information it provides is accurate and up to date," Hart said.

The benchmark methodology used deterministic matching and an independent AI judge. Human validation assessed the judge as about 80% accurate. CurricuLLM said it expects curriculum revisions in both countries to widen the gap between mainstream model recall and what schools are required to teach.
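The two-stage scoring described above can be sketched as deterministic matching first, with an AI judge as a fallback for free-text answers. The judge is stubbed here because the benchmark's judge model and prompts are not specified in the article; the matching logic and the stub are assumptions for illustration.

```python
import re

def normalise(s: str) -> str:
    """Collapse whitespace and case so trivially different strings match."""
    return re.sub(r"\s+", " ", s.strip().lower())

def deterministic_match(answer: str, expected: str) -> bool:
    return normalise(answer) == normalise(expected)

def judge_with_llm(answer: str, expected: str) -> bool:
    # Placeholder: in the real benchmark, an independent AI judge scores
    # free-text answers (human validation put it at roughly 80% accuracy).
    # Stubbed conservatively to "no match" in this sketch.
    return False

def score(answer: str, expected: str) -> bool:
    # Cheap exact check first; only escalate ambiguous answers to the judge.
    return deterministic_match(answer, expected) or judge_with_llm(answer, expected)
```

Running deterministic matching first keeps judge error (and cost) confined to the answers that genuinely need interpretation, which matters when the judge itself is only about 80% accurate.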