AIED 2026 · Long Paper · Seoul

Can We Trust AI’s
Self-Assessment?

When a language model codes a student’s message and reports “confidence: 0.9,” can we believe it? Across three studies on 633 expert-coded student–AI dialogues, all three models we tested were consistently overconfident. We then look at why prompting struggles to fix this, a pattern we call plausibility-gated anchoring, and explore whether adjusting a model’s internal activations can help its confidence track correctness a little better.

Open PDF Talk slides Open-source toolkit DOI

Data633 student–AI dialogues, expert-coded into 9 categories

Core findingVerbalized confidence looks more like a learned prior than a genuine uncertainty estimate

TakeawayRead confidence as a ranking rather than a probability, and consider steering on open weights

Research Story

LLMs are becoming everyday tools for qualitative coding in educational research. They scale, they stay consistent, and they reduce human labor. But one basic question shapes how far we can rely on them. When can an AI-generated code be trusted? Traditional classifiers emit calibrated probabilities. LLMs do not. They write a number as just another token, without direct access to their own internal state.

Promise accept high-confidence outputs, and flag low-confidence ones for human review.Catch that workflow only helps if the confidence score is calibrated.Stakes highest for open-weight models, often the FERPA-compliant choice.

The Setup

We use 633 student prompts submitted to ChatGPT in a graduate introductory CS course, independently coded by two expert qualitative researchers into nine mutually exclusive categories (Cohen’s κ from 0.65 to 0.94). Every model receives the same prompt. It predicts a category, reports a confidence from 0 to 1, and gives a brief rationale, which is how practitioners often elicit confidence in practice.

Prompt template with system context, coding scheme, instructions, output format, and input sections — The shared prompt template. The same structure, with the coding scheme, instructions, and a JSON output carrying a self-reported confidence, was used across all models.

We evaluate three models across the accessibility spectrum. gpt-5-mini and gemini-3-flash-preview are commercial APIs, and llama-3.1-8b-instruct is open-weight and run locally. We measure calibration with three metrics, ECE, AUC, and Spearman’s ρ, which separate how overconfident a model is from how well it ranks its own correctness. Runs use greedy decoding for reproducibility.

Three Questions, Three Studies

RQ1 · How calibrated?Compare verbalized confidence against true accuracy across three model families.

RQ2 · What governs it?Manipulate the confidence value shown in the prompt’s format example, and watch what the model does.

RQ3 · Can we fix it?Intervene on internal activations to amplify the directions tied to well-calibrated predictions.

Study 1 · Consistent Overconfidence

All three models reported mean confidence between 0.89 and 0.92 while being correct only 34% to 52% of the time, leaving confidence–accuracy gaps of 0.40 to 0.55. Nine-way coding over semantically overlapping categories is genuinely hard, and chance is about 0.11. The more useful story is in discrimination, whether higher confidence at least means more likely correct.

Closed-source: usable ranking

gpt-5-mini reached AUC 0.69 and gemini-3-flash 0.67, moderate and good enough to help prioritize items for human review.

Open-weight: closer to chance

llama-3.1-8b reached AUC 0.57, so confidence-based triage offers limited value over random ordering.

Accuracy ≠ calibration

gemini was the most accurate model yet ranked its correctness slightly worse than gpt-5-mini. Knowing the answer and knowing when you know it look like different abilities.

The confusion matrices below show where each model concentrates its predictions and which categories it tends to conflate.

Scope. One dataset and one coding scheme. These specific values may not transfer to other domains or to coding schemes of different granularity.

Figures · Error Patterns and the Anchoring Effect

Browse the per-model confusion matrices and the anchoring density plot from Study 2.

Normalized confusion matrix for gpt-5-mini across nine coding categories — gpt-5-mini. Predictions lean toward a few categories. example and collaboration are recognized well, while finer distinctions are confused.

Study 2 · Plausibility-Gated Anchoring

Why can we not just tell the model to be less confident? We tried, by changing the example confidence shown in the prompt’s format spec. The result was notably asymmetric.

High anchor (0.95) → near mode collapse

96.5% of outputs landed at exactly 0.95. Variance dropped sharply (kurtosis 37.7), and discrimination got a little worse, as the spread that carried the signal narrowed.

Low anchor (0.05) → not adopted

Zero outputs adopted 0.05. The distribution returned to its default high-confidence range (around 0.80 to 0.95).

The model tends to adopt a shown value only when it falls inside a range it already treats as plausible for the task. Anchoring here is not symmetric, it appears filtered through a learned prior. One reading is that verbalized confidence reproduces what confident classification text looks like more than it estimates uncertainty, which would help explain why demonstrating low-confidence examples, on its own, does not recalibrate the model.

Scope. This anchoring analysis is on the open-weight model. We did not test whether larger open models share the same plausibility range.

Study 3 · A Light Touch on the Activations

If prompting is limited by the prior, can we work a level below it? Using representation engineering, we recorded activations from one middle layer on all 633 samples, grouped them into well- and poorly-calibrated sets, and took the difference of their mean activations as a single direction inside the model. Adding a small multiple of that direction back during inference nudges the model toward better-calibrated behavior, with no retraining.

Small, consistent gains

AUC moved from about 0.565 to 0.585 in both settings we tried. Spearman’s ρ rose from about 0.13 to 0.16, a small absolute change that looks larger in relative terms because the base is low.

Held out

The direction was built on one half of the data and measured on the other half, so the gains are not simply fitting the training split.

What it does not do

The overconfidence gap stays large. This shifts the ordering of confidence slightly, it does not turn the number into a probability.

A cautious reading is that some calibration-relevant signal sits in the representations and can be partly surfaced without retraining, as a complement to prompt-based strategies rather than a replacement.

Scope. The gains are modest and evaluated within a single dataset. We have not yet confirmed them across other datasets or larger models.

What This Means for Your Workflow

Triage depends on AUC, not accuracy

Around AUC 0.68, directing review toward low-confidence items can help. Near 0.56 it offers little, so check the discrimination on your own data before relying on it.

Rankings, not probabilities

With ECE above 0.40, a reported 0.9 is not a 90% chance of being right. Confidence is safer for ordering items than for setting acceptance thresholds.

Be careful demonstrating confidence

A shown value tended to either collapse the distribution or be ignored, so fixing explicit confidences in few-shot examples can quietly affect calibration.

One option for open weights

For FERPA-bound local deployments, activation steering is one approach that gave small gains here. Our implementation is open if you would like to try it.

Steering code on GitHub Read the paper

Conference Talk

The talk follows the same arc as the paper. It opens with the trust problem, walks through three studies that build on one another, and closes with what this changes for anyone using LLMs to code educational data.

AIED 2026 talk

Open the slides to walk through the setup, the three studies, and the practical implications.

Open slides Toolkit on GitHub

Limitations & Next Steps

We want to be clear about scope. This is a single dataset of introductory CS prompts with one nine-category scheme, so the specific numbers may not transfer to other domains or granularities. Studies 2 and 3 use one open-weight model, and larger models may shift where the plausibility range sits. The steering gains are modest and evaluated within one dataset, so we present them as a promising direction rather than a settled result. We also used single-turn, zero-shot prompting.

The natural next steps are per-category calibration, larger models, richer elicitation such as multi-turn or chain-of-thought, and replication across datasets. Per-category metrics interest us most, since they would tell researchers which codes are reasonable to automate and which still need a person.