Quality benchmark
Studyly scores 81.3 on a held-out eval. Field average: 67.9.
Leaderboard · 3 documents · 4 criteria
1. Studyly — 81.3
2. Unattle — 78.0
3. Gauntlet — 68.0
4. Turbolearn — 57.8
Criteria
Factual correctness
Every question's correct answer is grounded in the source document. We verify against the actual PDF / slide content, not the model's pretrained knowledge.
Clarity
The stem is unambiguous. A well-prepared student should be able to identify the correct option without re-reading the question three times.
Distractor quality
Wrong answers are plausible and similar in length to the correct one. No 'all of the above', no obviously wrong throwaway options. The question rewards understanding, not test-taking heuristics.
Question type coverage
A good deck mixes recall, application, comparison, and case-based questions. We score how varied the output is across a single source.
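The four criteria above could be combined into a single deck score. As one minimal sketch (the function names, the unweighted mean, and the entropy-based variety measure are illustrative assumptions, not the benchmark's actual formula):

```python
from collections import Counter
from math import log

# Hypothetical criterion names; each is assumed to be scored 0-100.
CRITERIA = ("factual", "clarity", "distractors", "coverage")

def deck_score(scores: dict) -> float:
    """Unweighted mean of the four criterion scores (an assumption --
    the benchmark may weight criteria differently)."""
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)

def coverage_score(question_types: list) -> float:
    """One way to score question-type coverage: normalized entropy of
    the type distribution, scaled to 0-100. A deck spread evenly over
    all four types (recall, application, comparison, case) scores 100;
    a deck of only one type scores 0."""
    counts = Counter(question_types)
    n = len(question_types)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return 100 * entropy / log(4)  # log(4): max entropy over 4 types

types = ["recall", "recall", "application", "comparison", "case"]
print(round(coverage_score(types), 1))
print(deck_score({"factual": 90, "clarity": 85,
                  "distractors": 70, "coverage": 80}))
```

The entropy form rewards decks that spread questions across types rather than merely touching each type once, which matches the stated goal of scoring how varied the output is across a single source.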