Quality benchmark
Studyly scores 81.3 on a held-out eval. Field average: 67.9.
Leaderboard · 3 documents · 4 criteria
1. Studyly — 81.3
2. Unattle — 78.0
3. Gauntlet — 68.0
4. Turbolearn — 57.8
Criteria
Factual correctness
Every question's correct answer is grounded in the source document. We verify against the actual PDF / slide content, not the model's pretrained knowledge.
Clarity
The stem is unambiguous. A well-prepared student should be able to identify the correct option without re-reading the question three times.
Distractor quality
Wrong answers are plausible and similar in length to the correct one. No 'all of the above', no obviously wrong throwaway options. The question rewards understanding, not test-taking heuristics.
Question type coverage
A good deck mixes recall, application, comparison, and case-based questions. We score how varied the output is across a single source.
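The four criteria above could be combined into a single deck score. As one minimal sketch (the function names, the unweighted mean, and the entropy-based variety measure are illustrative assumptions, not the benchmark's actual formula):

```python
from collections import Counter
from math import log

# Hypothetical criterion names; each is assumed to be scored 0-100.
CRITERIA = ("factual", "clarity", "distractors", "coverage")

def deck_score(scores: dict) -> float:
    """Unweighted mean of the four criterion scores (an assumption --
    the benchmark may weight criteria differently)."""
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)

def coverage_score(question_types: list) -> float:
    """One way to score question-type coverage: normalized entropy of
    the type distribution, scaled to 0-100. A deck spread evenly over
    all four types (recall, application, comparison, case) scores 100;
    a deck of only one type scores 0."""
    counts = Counter(question_types)
    n = len(question_types)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return 100 * entropy / log(4)  # log(4): max entropy over 4 types

types = ["recall", "recall", "application", "comparison", "case"]
print(round(coverage_score(types), 1))
print(deck_score({"factual": 90, "clarity": 85,
                  "distractors": 70, "coverage": 80}))
```

The entropy form rewards decks that spread questions across types rather than merely touching each type once, which matches the stated goal of scoring how varied the output is across a single source.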