From psychophysics to boardrooms
A method formalised in 1927 — and replicated ever since.

The intellectual history of pairwise comparison begins with a practical problem: how do you measure things that have no natural unit?

The answer, proposed by psychologist Louis Leon Thurstone at the University of Chicago, was to compare things directly, in pairs. His 1927 paper, A Law of Comparative Judgment, formalised this insight into a precise mathematical model.

"The method of paired comparisons yields a scale of psychological values that is more reliable than any other method of scaling."
— Louis Leon Thurstone, Psychological Review, 1927

Thurstone's insight was grounded in an empirical observation: people are significantly better at answering "which of these two is louder?" than "how loud is this, on a scale of one to ten?"

Why pairs work better than scales
The cognitive load of rating is higher than it appears.

When you ask someone to rate the importance of a strategic priority on a scale of one to ten, you are asking them to perform several cognitive operations at once: forming a judgement, interpreting what each point on the scale means, and translating one into the other. Each of these steps can introduce systematic error.

The most damaging of these errors is anchoring bias. The first item rated tends to influence all subsequent ratings. Research by Tversky and Kahneman (1974) demonstrated that anchoring effects are both large and remarkably resistant to correction.

A second source of error is scale compression. In practice, most respondents avoid the extreme ends of scales, clustering responses in the middle.

"Comparative judgment is not merely more convenient than absolute judgment — it is more accurate. The brain is built for comparison, not calibration."
— Adapted from Thurstone's theoretical framework, 1927
Research note
The superiority of pairwise comparison over rating scales has been demonstrated repeatedly across fourteen independent studies (Dillon, Frederick & Tangpanichdee, 1985).
How preferences become a ranking
The aggregation method is simple, transparent, and mathematically grounded.

Given n items to rank, each item is compared against every other item exactly once. The total number of comparisons required is n(n−1)/2.
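As a quick check on the formula, the sketch below enumerates every pair for a small list and confirms the count. The item names and the `comparison_pairs` helper are illustrative assumptions, not part of CleanRanking.

```python
from itertools import combinations

def comparison_pairs(items):
    """Return every unordered pair of items exactly once."""
    return list(combinations(items, 2))

# Hypothetical strategic priorities, used only for illustration.
priorities = ["Hiring", "Pricing", "Retention", "Expansion"]
pairs = comparison_pairs(priorities)

n = len(priorities)
# 4 items give 4 * 3 / 2 = 6 comparisons.
assert len(pairs) == n * (n - 1) // 2
```

The count grows quadratically: 10 items need 45 comparisons and 20 items need 190, which is worth keeping in mind when choosing how many items to put in a session.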

Aggregating individual responses

For individual ranking sessions, the result is derived directly from the win count: each item receives one point for each comparison it wins. The final ranked list orders items from most wins to fewest.
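The win-count rule described above can be sketched in a few lines. This is a minimal illustration with made-up data; the `rank_by_wins` name and the tie-breaking behaviour (ties keep their input order) are assumptions, since the text does not say how ties are resolved.

```python
from collections import Counter

def rank_by_wins(items, winners):
    """Order items from most wins to fewest.

    `winners` holds the winning item of each comparison, one entry per pair.
    Tied items keep their original order in `items` (an assumption).
    """
    wins = Counter(winners)
    # sorted() is stable, so equal win counts preserve input order.
    return sorted(items, key=lambda item: -wins[item])

items = ["A", "B", "C", "D"]
# Made-up winners of the six comparisons A-B, A-C, A-D, B-C, B-D, C-D.
winners = ["A", "A", "A", "B", "B", "C"]
ranking = rank_by_wins(items, winners)
```

With this data A wins three comparisons, B two, C one and D none, so the ranking comes out as A, B, C, D.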

Aggregating group responses

For group sessions, every participant's comparisons are pooled before the final ranking is calculated. For each pair, an item's group total is the number of participants who chose it.

Crucially, the aggregated result also exposes the shape of disagreement. Items that divide the group show up in the data as closely split votes, and this is often the most valuable output of a group session.
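One way to picture group aggregation and the "shape of disagreement" is sketched below. Everything here is an assumption made for illustration: the ballot format, the function names, ranking by total votes received, and the `split` measure are not taken from CleanRanking.

```python
from collections import Counter, defaultdict

def aggregate_votes(ballots):
    """Sum individual votes per pair.

    Each ballot maps a pair (a, b) to the item that participant preferred.
    Returns {pair: Counter({item: votes})}.
    """
    tallies = defaultdict(Counter)
    for ballot in ballots:
        for pair, winner in ballot.items():
            tallies[pair][winner] += 1
    return dict(tallies)

def group_ranking(items, tallies):
    """Rank items by total votes received across all pairs (ties keep input order)."""
    totals = Counter()
    for votes in tallies.values():
        totals.update(votes)
    return sorted(items, key=lambda item: -totals[item])

def split(votes, pair):
    """Share of votes for the less popular item: 0.0 is unanimous, 0.5 an even split."""
    a, b = pair
    return min(votes[a], votes[b]) / (votes[a] + votes[b])

ballots = [  # four hypothetical participants
    {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"},
    {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "C"},
    {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"},
    {("A", "B"): "B", ("A", "C"): "A", ("B", "C"): "C"},
]
tallies = aggregate_votes(ballots)
ranking = group_ranking(["A", "B", "C"], tallies)
most_contested = max(tallies, key=lambda pair: split(tallies[pair], pair))
```

In this made-up session the group ranks A first without much dispute, while the B versus C comparison splits two votes to two. That pair is where a facilitator would likely want to focus the discussion.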

On intransitivity
In rare cases, pairwise comparison can produce intransitive results, where A beats B, B beats C, but C beats A. When such a cycle emerges from aggregated group votes, it is known as the Condorcet paradox. CleanRanking surfaces these cases explicitly in the facilitator view.
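To make the idea concrete, a cycle of this kind can be found by checking every group of three items. The sketch below is a hypothetical illustration with invented data, not a description of how CleanRanking detects these cases.

```python
from itertools import combinations

def intransitive_triples(items, beats):
    """Return triples (x, y, z) where x beats y, y beats z, and z beats x.

    `beats` is a set of (winner, loser) tuples, one per compared pair.
    """
    cycles = []
    for a, b, c in combinations(items, 3):
        # A triple can cycle in two directions, so check both.
        for x, y, z in ((a, b, c), (a, c, b)):
            if (x, y) in beats and (y, z) in beats and (z, x) in beats:
                cycles.append((x, y, z))
    return cycles

# Made-up results: A, B and C form a cycle, and all three beat D.
beats = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")}
cycles = intransitive_triples(["A", "B", "C", "D"], beats)
```

In this example A, B and C each finish with two wins, so the win count alone cannot separate them. That is the kind of result worth flagging rather than hiding behind an arbitrary tie-break.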
The group session as a diagnostic tool
The ranked list is not the only output. The shape of disagreement is equally important.

Conventional group decision-making methods share a fundamental flaw: they are sensitive to the order in which opinions are expressed. The first person to speak anchors the group.

This is not a failure of individuals; it is a predictable consequence of group dynamics. Research has consistently found that group discussions can produce less accurate collective judgements than the aggregation of independent private judgements. A related finding, known as the hidden profile problem, is that groups tend to discuss information everyone already holds and neglect what only one member knows.

"Groups consistently fail to share uniquely held information, and this failure is systematic rather than random."
— Stasser & Titus, Journal of Personality and Social Psychology, 1985
Facilitator note
The most effective use of a CleanRanking group session is to reveal the result before discussion begins, not after. Showing the group what they collectively believe — before anyone has spoken — changes the quality of the conversation that follows.
Where it is used
From psychoacoustics to AI training — the method scales.

Strategic planning and management

The Analytic Hierarchy Process (AHP), introduced by Saaty (1977), derives priority weights from pairwise comparisons. It has been used in high-stakes strategic decisions in defence, infrastructure, healthcare, and government. A 2008 review identified over 900 published applications across 19 different fields.

Machine learning and AI

Reinforcement learning from human feedback (RLHF), a technique widely used to fine-tune modern large language models, typically relies on pairwise comparison. Human evaluators are shown two model outputs and asked to select the better one.
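A common way to turn those choices into a training signal is to fit a reward model with a Bradley–Terry style likelihood. The sketch below shows that formulation for a single comparison. It is a simplified illustration: the function names are assumptions, and real systems compute the rewards with a neural network rather than passing in plain numbers.

```python
import math

def preference_probability(reward_a, reward_b):
    """Bradley-Terry style probability that output A is preferred to output B."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the evaluator's choice under that model."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Equal rewards mean the model treats the comparison as a coin flip.
assert abs(preference_probability(0.0, 0.0) - 0.5) < 1e-12
```

The loss shrinks as the reward for the chosen output rises above the reward for the rejected one, so minimising it pushes the reward model to agree with the human comparisons.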

Sports and competitive ranking

The Elo rating system, used in chess, football, and dozens of other competitive domains, is closely related to the Bradley–Terry pairwise model: both estimate the probability that one competitor beats another from the difference in their strengths.
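For readers who want to see the mechanics, here is a minimal sketch of the standard logistic Elo update. The K-factor of 32 is a common convention used here as an assumption; federations and leagues choose their own values.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, score_a, k=32):
    """Return both ratings after one game; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    change = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + change, rating_b - change

# Two equally rated players: the winner gains 16 points and the loser drops 16.
new_a, new_b = update(1500, 1500, 1)
assert (new_a, new_b) == (1516.0, 1484.0)
```

Each game is a single pairwise comparison, and the rating difference plays the same role as the difference in item strengths in a Bradley–Terry model.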

Further reading
Primary sources and key texts.
01
Thurstone, L. L. (1927)
A Law of Comparative Judgment. Psychological Review, 34(4), 273–286.
02
Bradley, R. A. & Terry, M. E. (1952)
Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324–345.
03
Saaty, T. L. (1977)
A Scaling Method for Priorities in Hierarchical Structures. Journal of Mathematical Psychology, 15(3), 234–281.
04
Tversky, A. & Kahneman, D. (1974)
Judgment under Uncertainty: Heuristics and Biases. Science, 185(4157), 1124–1131.
05
Stasser, G. & Titus, W. (1985)
Pooling of Unshared Information in Group Decision Making: Biased Information Sampling During Discussion. Journal of Personality and Social Psychology, 48(6), 1467–1478.
06
Christiano, P. et al. (2017)
Deep Reinforcement Learning from Human Preferences. NeurIPS 2017.