How 24 questions can pinpoint a learner — and why static tests don’t.
Item Response Theory, Computerized Adaptive Testing, and the math of information. How a smart 24-item diagnostic delivers more precision than a 100-item static test — and why this changes what “placement” can mean.
The finding: A well-designed Computerized Adaptive Test (CAT) using Item Response Theory (IRT) can estimate a learner’s ability with a standard error of about 0.3 on the ability (θ) scale in around 20–25 items. The same precision requires 80–100 items from a static test, because most of those items carry almost no information about that particular learner.
The mechanism: Every test item has a difficulty parameter (how able you need to be to have a 50% chance of getting it right) and a discrimination parameter (how sharply the item separates above-ability from below-ability learners). CAT picks the next item that adds maximum information about this learner’s ability — usually one near their current estimated level.
The product: The AI Skill Diagnostic uses IRT-2PL with periodic re-calibration of the item bank, runs CAT per concept with a Bloom-aware item selection layer, and places a new learner in roughly 20 minutes — per concept, per cognitive level.
Walk into a typical corporate L&D rollout. Every new hire takes the same 100-question placement test. The strong engineer is bored for the first 60 questions, which are too easy. The new-to-the-domain hire is overwhelmed by the last 40, which are way too hard. After two hours, the system places each of them — but the test that determined their starting point delivered almost no useful information about either.
This is the static-test problem. The theory behind the fix dates to the 1950s, yet the fix itself is still rare in enterprise learning platforms: an adaptive test that figures out who you are while you’re taking it, and uses each answer to pick the question that will tell it the most about you next.
The starting point: Lord, 1952
Frederic Lord’s 1952 monograph [1] introduced what became Item Response Theory. The core insight: rather than treating a test as a sum of items, treat each item as a probabilistic function of the test-taker’s underlying ability.
Concretely: for each test item, there is a probability of correct response that varies with the test-taker’s ability. For an easy item, almost everyone gets it right, with the probability rising sharply around a low ability level. For a hard item, almost nobody gets it right at low ability, but the probability climbs steeply once ability exceeds a threshold. Plotting this probability against ability gives the Item Characteristic Curve (ICC) — a logistic curve characterized by its difficulty (b-parameter) and its discrimination (a-parameter).
Once you have an item bank with calibrated ICCs, you can run the test as follows: start with an item of average difficulty. Observe whether the learner gets it right. Update your estimate of their ability (Bayesian or Maximum Likelihood — both work). Pick the next item to maximize information about ability at the current estimate. Repeat until the precision crosses a threshold.
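The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production engine: the item bank, the simulated learner, the ability grid, the standard normal prior, and the names `run_cat`, `se_target` and `max_items` are all hypothetical. It uses a Bayesian (posterior mean) update on a grid; a Maximum Likelihood update would slot into the same place.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def run_cat(bank, answer, se_target=0.3, max_items=40):
    """Run one adaptive test.

    bank   -- list of calibrated (a, b) pairs
    answer -- callable taking an item index, returning True if answered correctly
    Returns (ability estimate, standard error, number of items used).
    """
    grid = [g / 10.0 for g in range(-40, 41)]        # candidate abilities, -4.0 to 4.0
    post = [math.exp(-0.5 * t * t) for t in grid]    # standard normal prior (unnormalised)
    used = []
    theta, se = 0.0, 1.0                             # start at average ability
    while len(used) < min(max_items, len(bank)):
        def info(i):
            # Fisher information of item i at the current estimate
            a, b = bank[i]
            p = p_correct(theta, a, b)
            return a * a * p * (1.0 - p)
        # Pick the most informative item not yet shown
        item = max((i for i in range(len(bank)) if i not in used), key=info)
        used.append(item)
        a, b = bank[item]
        correct = answer(item)
        # Bayesian update: multiply the posterior by the likelihood of the response
        post = [w * (p_correct(t, a, b) if correct else 1.0 - p_correct(t, a, b))
                for w, t in zip(post, grid)]
        total = sum(post)
        post = [w / total for w in post]
        theta = sum(w * t for w, t in zip(post, grid))                          # posterior mean
        se = math.sqrt(sum(w * (t - theta) ** 2 for w, t in zip(post, grid)))   # posterior SD
        if se <= se_target:                          # precision threshold reached
            break
    return theta, se, len(used)

# Hypothetical bank: 61 items, discrimination 1.5, difficulties from -3.0 to 3.0
bank = [(1.5, b / 10.0) for b in range(-30, 31)]
rng = random.Random(0)
true_theta = 1.0  # assumed ability of the simulated learner

def simulated_learner(i):
    a, b = bank[i]
    return rng.random() < p_correct(true_theta, a, b)

theta_hat, se, n_items = run_cat(bank, simulated_learner)
print(f"estimate={theta_hat:.2f}  se={se:.2f}  items={n_items}")
```

Raising `se_target` stops the test earlier for a rough placement; lowering it runs longer for a more precise one.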
This is Computerized Adaptive Testing, and it’s been established in high-stakes psychometrics since the 1980s. The GMAT and licensure exams such as the NCLEX adapt item by item, and the GRE adapts section by section. Most learning platforms, somehow, still don’t.
The 2-parameter logistic model (2PL), informally
The probability that a learner with ability θ gets item i correct:
P(X_i = 1 | θ) = 1 / (1 + exp(−a_i × (θ − b_i)))
Where:
- θ is the latent ability of the learner.
- b_i is the difficulty of item i (the ability at which P = 0.5).
- a_i is the discrimination of item i (how sharply the item separates ability levels; the slope of the ICC at its midpoint is a_i / 4).
CAT selects item i* to maximize Fisher information at the current θ estimate. Items near the learner’s current ability with high discrimination are the most informative.
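For the 2PL model, the Fisher information of item i at ability θ is I_i(θ) = a_i² × P_i(θ) × (1 − P_i(θ)), which peaks at θ = b_i with value a_i² / 4. A minimal Python sketch of the model and the selection rule follows; the four-item bank and the function names are made up for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def most_informative(theta, bank, used=()):
    """Index of the not-yet-used item with the most information at theta."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

# Hypothetical item bank as (a, b) pairs
bank = [(1.0, -2.0), (1.5, 0.0), (0.8, 0.1), (1.2, 2.0)]

print(most_informative(0.0, bank))  # item 1: close to theta and highly discriminating
print(most_informative(2.0, bank))  # item 3: the only item near theta = 2
```

Item 2 sits almost as close to θ = 0 as item 1, but its lower discrimination makes it much less informative, which is why selection looks at both parameters.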
Why 24 items beats 100
The number “24” isn’t arbitrary. It comes from the math of Fisher information.
Each item answer reduces uncertainty in your ability estimate by an amount that depends on how well-matched the item was to your current estimate. A perfectly matched item (its difficulty equal to your ability) contributes maximally; an item that’s way too easy or way too hard contributes nearly nothing — you got it right (or wrong), but you would have gotten it right (or wrong) at most ability levels.
In a 100-item static test, perhaps 15–20 items happen to be at your ability level. The other 80 carry almost no information about you. They add testing time and frustration, not precision. A CAT draws almost all of its items from that informative range in the first place.
Practically, this means CAT can deliver:
- Equivalent precision in ~25% of the time.
- Adjustable precision — stop early for rough placement, run longer for high-stakes.
- A reasonable test experience: you’re never bored, never crushed; almost every question feels “right at the edge.”
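The item count can be checked with a back-of-the-envelope calculation. For a maximum-likelihood estimate, the standard error is approximately 1 / sqrt(total information), and a perfectly targeted 2PL item contributes a² / 4. The sketch below assumes a typical discrimination of 1.4, an illustrative value rather than a figure from the item bank, and idealised items that sit exactly at the learner's ability.

```python
import math

def relative_information(offset, a):
    """Information of a 2PL item whose difficulty is `offset` away from the
    learner's ability, as a fraction of the item's maximum (a^2 / 4)."""
    p = 1.0 / (1.0 + math.exp(-a * offset))
    return 4.0 * p * (1.0 - p)

def items_needed(se_target, a):
    """Perfectly targeted items needed to reach se_target, using
    SE = 1 / sqrt(total information) with a^2 / 4 information per item."""
    return math.ceil(1.0 / (se_target ** 2 * a * a / 4.0))

a = 1.4  # assumed typical discrimination
print(items_needed(0.3, a))  # 23

# How quickly an item loses value as it moves away from the learner's ability
for offset in (0.0, 1.0, 2.0, 3.0):
    print(offset, round(relative_information(offset, a), 2))
```

Under these assumptions the count lands in the low twenties, consistent with the 20–25 range quoted earlier, and an item two ability units away from the learner carries only about a fifth of the information of a well-matched one.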
The right next item is the one that, if you got it right, would surprise me, and if you got it wrong, would also surprise me.
Frederic Lord, paraphrased
Going beyond 1D ability: Bloom-aware CAT
Classical IRT assumes a single ability dimension (θ) per content area. This is a useful simplification for math placement or vocabulary tests — but it falls apart when you ask the deeper question that L&D leaders actually care about: at what level of Bloom’s taxonomy [7] is this learner functioning for this concept?
A learner might score well on “Remember” items (define a function) but fail on “Apply” items (write a function for a novel problem). One-dimensional CAT averages these and reports a single ability — which loses the most important signal in the data.
Future Proof’s diagnostic runs a separate CAT per Bloom level, with shared priors across levels (a strong “Remember” score raises the prior on “Understand” — there’s a partial-credit Bayesian network running underneath). The result: not a single placement score, but a Bloom profile per concept. This is what makes the “weak on Apply” classroom view (visible to teachers and L&D admins) possible.
What it takes to do this well
Most of the difficulty in real-world CAT isn’t in the math — the math is settled — but in the operational disciplines around the item bank:
- Item calibration. Each item’s a and b parameters must be estimated from a sufficient sample (typically 200–500 responses per item) before it’s safe to use in adaptive selection.
- Item exposure control. If one item is “optimal” for many learners at the same ability, it gets shown too often and leaks. We use Sympson-Hetter or similar randomization to spread exposure.
- Calibration drift. Item parameters change over time as content meaning shifts (e.g., a question about cloud computing had a different difficulty in 2010 than it has in 2024). We re-calibrate quarterly.
- Item bank size. CAT needs many items per ability range. Generating these manually is a bottleneck — which is why the AI Question Studio exists.
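To make the exposure-control item above concrete, here is a minimal Python sketch of a Sympson-Hetter style selection step. Each item carries an exposure parameter k between 0 and 1; the most informative item is shown only with probability k, otherwise it is set aside for this learner and the next-best item is tried. The bank, the k values and the function names are hypothetical, and in practice the k values are tuned through repeated simulation, which is omitted here.

```python
import math
import random

def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_with_exposure_control(theta, bank, k, used, rng):
    """Return the index of the item to administer, or None if all are rejected.

    k    -- exposure parameter per item: probability of showing it once selected
    used -- set of item indices unavailable to this learner (mutated in place)
    """
    ranked = sorted((i for i in range(len(bank)) if i not in used),
                    key=lambda i: item_information(theta, *bank[i]),
                    reverse=True)
    for i in ranked:
        used.add(i)                  # shown or rejected, the item leaves this learner's pool
        if rng.random() < k[i]:      # pass the exposure lottery
            return i
    return None

# Hypothetical bank and exposure parameters: item 1 is popular, so it is throttled
bank = [(1.0, -2.0), (1.5, 0.0), (0.8, 0.1), (1.2, 2.0)]
k = [1.0, 0.4, 1.0, 1.0]
rng = random.Random(42)
chosen = select_with_exposure_control(0.0, bank, k, set(), rng)
print(chosen)
```

With k = 0.4, item 1 reaches only about 40% of the learners for whom it is the best match; the rest see the next most informative item instead.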
What we don’t yet know
Three questions our research team is actively studying:
- How do item parameters drift across language editions? The same item, translated to Hindi, doesn’t necessarily preserve its a and b parameters. We’re building cross-language calibration models with a state education department.
- Can LLM-generated items match human-authored items in IRT fit? Empirically, LLM-generated items have higher variance in discrimination — some are excellent, some are noise. Our QC pipeline is built to detect and prune the noise, but the underlying question is open.
- What’s the right CAT termination criterion in a learning (not assessment) context? For grading, a fixed standard error works. For diagnostic placement before adaptive practice, the answer is less clear — we’re testing several rules.
How the AI Skill Diagnostic uses this.
A new learner takes a 20-minute diagnostic. The CAT engine runs per concept, with a Bloom-aware layer that estimates ability separately at each cognitive level. The item bank is calibrated quarterly using fresh response data; LLM-generated items pass through an automated QC pipeline before entering circulation. The output isn’t a single score — it’s a Bloom profile per concept, which the AI Memory Coach uses to schedule the very first session.
Selected papers.
The IRT literature is enormous. These are the canonical entry points and the modern revisions worth knowing.
[1] Lord, F.M. (1952). A theory of test scores. Psychometric Monograph No. 7. Iowa City: Psychometric Society.
[2] Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F.M. Lord & M.R. Novick, Statistical Theories of Mental Test Scores (pp. 397–479).
[3] Wainer, H. (Ed.) (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
[4] van der Linden, W.J., & Glas, C.A.W. (Eds.) (2010). Elements of Adaptive Testing. New York: Springer.
[5] Sympson, J.B., & Hetter, R.D. (1985). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th Annual Conference of the Military Testing Association.
[6] Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum.
[7] Anderson, L.W., & Krathwohl, D.R. (Eds.) (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy. New York: Longman.
[8] Reckase, M.D. (2009). Multidimensional Item Response Theory. New York: Springer.
Want to see the diagnostic on your team?
A 20-minute walkthrough. We run the AI Skill Diagnostic on a sample of your content, with a real learner profile. You’ll see the Bloom profile the AI builds — and decide whether you’d want that for every learner.