The Method — CleanRanking

About Pairwise Ranking

The science of better decisions

Pairwise comparison is not a new idea. It has been studied, tested, and validated across nearly a century of research in psychology, economics, and decision science.

This page sets out the intellectual foundations of the method: where it came from, why it works, and why it tends to produce more reliable results than rating scales or open discussion when ranking and prioritising.

Theory · History · Evidence

Contents

01 · Origins
02 · Cognitive basis
03 · The mathematics
04 · Group dynamics
05 · Applications
06 · References

01 · Origins

From psychophysics to boardrooms

A method formalised in 1927 — and replicated ever since.

The intellectual history of pairwise comparison begins with a practical problem: how do you measure things that have no natural unit?

The answer, proposed by psychologist Louis Leon Thurstone at the University of Chicago, was to compare things directly, in pairs. His 1927 paper, A Law of Comparative Judgment, formalised this insight into a precise mathematical model.

"The method of paired comparisons yields a scale of psychological values that is more reliable than any other method of scaling."

— Louis Leon Thurstone, Psychological Review, 1927

Thurstone's insight was grounded in an empirical observation: people are significantly better at answering "which of these two is louder?" than "how loud is this, on a scale of one to ten?"

02 · Cognitive basis

Why pairs work better than scales

The cognitive load of rating is higher than it appears.

When you ask someone to rate the importance of a strategic priority on a scale of one to ten, you are asking them to perform several cognitive operations simultaneously. This process introduces systematic error at every step.

The most damaging of these errors is anchoring bias. The first item rated tends to influence all subsequent ratings. Research by Tversky and Kahneman (1974) demonstrated that anchoring effects are both large and remarkably resistant to correction.

A second source of error is scale compression. In practice, most respondents avoid the extreme ends of scales, clustering responses in the middle.

"Comparative judgment is not merely more convenient than absolute judgment — it is more accurate. The brain is built for comparison, not calibration."

— Adapted from Thurstone's theoretical framework, 1927

Research note

The superiority of pairwise comparison over rating scales has been demonstrated repeatedly across fourteen independent studies (Dillon, Frederick & Tangpanichdee, 1985).

03 · The mathematics

How preferences become a ranking

The aggregation method is simple, transparent, and mathematically grounded.

Given n items to rank, each item is compared against every other item exactly once. The total number of comparisons required is n(n−1)/2.
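
To make the n(n−1)/2 count concrete, here is a minimal Python sketch that computes it and lists every pair exactly once. The function names and the sample items are illustrative assumptions, not part of CleanRanking.

```python
from itertools import combinations


def comparison_count(n: int) -> int:
    """Number of unique pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


def all_pairs(items):
    """Every unordered pair of items, each appearing exactly once."""
    return list(combinations(items, 2))


# Hypothetical items, used only for illustration.
items = ["A", "B", "C", "D", "E"]
pairs = all_pairs(items)

print(comparison_count(len(items)))  # 10
print(len(pairs))                    # 10
```

The count grows quadratically: 5 items need 10 comparisons, 10 items need 45, and 20 items need 190.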

Aggregating individual responses

For individual ranking sessions, the result is derived directly from the win count: each item receives one point for each comparison it wins. The final ranked list orders items from most wins to fewest.
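
The win-count rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: results are represented as (winner, loser) tuples, and ties keep the original item order, a tie-break the source does not specify.

```python
def rank_by_wins(items, results):
    """Order items from most wins to fewest.

    results: list of (winner, loser) tuples, one per comparison.
    Ties keep the input order of `items` (an assumed tie-break).
    """
    wins = {item: 0 for item in items}
    for winner, _loser in results:
        wins[winner] += 1  # one point per comparison won
    # sorted() is stable, so equal win counts preserve input order.
    return sorted(items, key=lambda item: -wins[item])


# Hypothetical session with three priorities.
items = ["Cost", "Speed", "Quality"]
results = [("Quality", "Cost"), ("Quality", "Speed"), ("Speed", "Cost")]

print(rank_by_wins(items, results))  # ['Quality', 'Speed', 'Cost']
```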

Aggregating group responses

For group sessions, every participant's comparisons are pooled before the final ranking is calculated. For each comparison, an item's group win total is the number of participants who chose it over the other item.
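
One way to picture the group aggregation is the sketch below. It assumes each participant's ballot is a list of (winner, loser) tuples, and it assumes the final group ranking orders items by total votes received across all comparisons. The source does not spell out that last step, so treat it as one plausible reading rather than CleanRanking's actual rule.

```python
from collections import defaultdict


def aggregate_votes(ballots):
    """Sum individual votes across participants.

    ballots: one list of (winner, loser) tuples per participant.
    Returns votes[(a, b)] = number of participants who chose a over b.
    """
    votes = defaultdict(int)
    for ballot in ballots:
        for winner, loser in ballot:
            votes[(winner, loser)] += 1
    return dict(votes)


def group_ranking(items, votes):
    """Rank items by total votes received (assumed aggregation rule)."""
    totals = {item: 0 for item in items}
    for (winner, _loser), count in votes.items():
        totals[winner] += count
    return sorted(items, key=lambda item: -totals[item])


# Three hypothetical participants comparing three items.
ballots = [
    [("A", "B"), ("A", "C"), ("B", "C")],
    [("A", "B"), ("C", "A"), ("B", "C")],
    [("B", "A"), ("A", "C"), ("C", "B")],
]
votes = aggregate_votes(ballots)

print(votes[("A", "B")])                     # 2
print(group_ranking(["A", "B", "C"], votes))  # ['A', 'B', 'C']
```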

Crucially, the aggregated result also exposes the shape of disagreement. Items that divide the group are visible in the data, and this is often the most valuable output of a group session.
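
As a rough illustration of how divisive pairs could be picked out of aggregated votes, the sketch below flags any pair where the minority side holds at least a given share of the votes. The "minority share" measure, the 0.4 threshold, and the function name are assumptions made for this example, not CleanRanking's own metric.

```python
def contested_pairs(votes, threshold=0.4):
    """Return pairs where the group is closely split.

    votes[(a, b)] = number of participants who chose a over b.
    A pair is flagged when the losing side holds at least `threshold`
    of the votes (hypothetical measure and cut-off).
    """
    seen = set()
    contested = []
    for (a, b), a_over_b in votes.items():
        pair = tuple(sorted((a, b)))
        if pair in seen:
            continue  # already handled from the opposite direction
        seen.add(pair)
        b_over_a = votes.get((b, a), 0)
        total = a_over_b + b_over_a
        if total == 0:
            continue  # nobody voted on this pair
        minority_share = min(a_over_b, b_over_a) / total
        if minority_share >= threshold:
            contested.append((pair, minority_share))
    # Most evenly split pairs first.
    return sorted(contested, key=lambda entry: -entry[1])


# Hypothetical votes from ten participants.
sample_votes = {
    ("A", "B"): 5, ("B", "A"): 5,   # evenly split
    ("A", "C"): 9, ("C", "A"): 1,   # near consensus
    ("B", "C"): 6, ("C", "B"): 4,   # moderately split
}

for pair, share in contested_pairs(sample_votes):
    print(pair, share)
```

In this sample, A versus B and B versus C are flagged, while A versus C is not because the group nearly agrees on it.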

On intransitivity

Pairwise comparison can occasionally produce intransitive results, where A beats B, B beats C, but C beats A. When such a cycle emerges from aggregated group votes, it is known as the Condorcet paradox. CleanRanking surfaces these cases explicitly in the facilitator view.
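
A cycle of this kind can be detected mechanically. The sketch below checks every triple of items for a three-way cycle. When every pair has a winner, any longer cycle implies at least one such triple, so this check is sufficient. The data layout (a set of (winner, loser) tuples) and the function name are assumptions for illustration and say nothing about how CleanRanking implements its facilitator view.

```python
from itertools import combinations


def find_cycles(items, beats):
    """Return every three-item cycle in a set of pairwise outcomes.

    beats: set of (winner, loser) tuples covering the compared pairs.
    Each cycle is reported as (x, y, z), meaning x beats y,
    y beats z, and z beats x.
    """
    cycles = []
    for a, b, c in combinations(items, 3):
        if (a, b) in beats and (b, c) in beats and (c, a) in beats:
            cycles.append((a, b, c))
        elif (a, c) in beats and (c, b) in beats and (b, a) in beats:
            cycles.append((a, c, b))
    return cycles


# Intransitive outcome: A beats B, B beats C, C beats A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
print(find_cycles(["A", "B", "C"], cyclic))      # [('A', 'B', 'C')]

# Transitive outcome: no cycle to report.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}
print(find_cycles(["A", "B", "C"], transitive))  # []
```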

04 · Group dynamics

The group session as a diagnostic tool

The ranked list is not the only output. The shape of disagreement is equally important.

Conventional group decision-making methods share a fundamental flaw: they are sensitive to the order in which opinions are expressed. The first person to speak anchors the group.

This is not a failure of individuals. It is a predictable consequence of group dynamics. Research has often found that group discussions produce less accurate collective judgements than the aggregation of independent private judgements. A related effect, the hidden profile problem, describes the tendency of groups to dwell on information everyone already holds while information known to only one member goes unshared.

"Groups consistently fail to share uniquely held information, and this failure is systematic rather than random."

— Stasser & Titus, Journal of Personality and Social Psychology, 1985

Facilitator note

The most effective use of a CleanRanking group session is to reveal the result before discussion begins, not after. Showing the group what they collectively believe — before anyone has spoken — changes the quality of the conversation that follows.

05 · Applications

Where it is used

From psychoacoustics to AI training — the method scales.

Strategic planning and management

The Analytic Hierarchy Process (AHP), a decision method developed by Thomas Saaty that is built on pairwise comparisons, has been used in high-stakes strategic decisions in defence, infrastructure, healthcare, and government. A 2008 review identified over 900 published applications across 19 different fields.

Machine learning and AI

Reinforcement learning from human feedback (RLHF), a technique widely used to fine-tune modern large language models, relies heavily on pairwise comparison. In a typical setup, human evaluators are shown two model outputs and asked to select the better one.

Sports and competitive ranking

The Elo rating system, used in chess, football, and dozens of other competitive domains, is closely related to the Bradley–Terry pairwise model: in its common logistic form, Elo's expected score is a Bradley–Terry win probability.
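
The link between Elo and Bradley–Terry can be checked numerically. The sketch below uses the widely published logistic Elo formula and shows that it matches a Bradley–Terry win probability when each player's strength is taken as 10^(rating/400). The K-factor of 32 is a common convention assumed here for illustration, since different federations and sports use different values.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the logistic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def bradley_terry_prob(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry probability that A beats B."""
    return strength_a / (strength_a + strength_b)


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return both new ratings after one game.

    score_a: 1 for an A win, 0.5 for a draw, 0 for an A loss.
    k: update step size (assumed value, varies by organisation).
    """
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b


# Same probability from both formulations.
elo_p = elo_expected(1600, 1400)
bt_p = bradley_terry_prob(10 ** (1600 / 400), 10 ** (1400 / 400))
print(abs(elo_p - bt_p) < 1e-9)     # True

# Two equally rated players; A wins.
print(elo_update(1500, 1500, 1))    # (1516.0, 1484.0)
```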

06 · References