What is Pairwise Ranking Prompting and why should SEOs care?

Welcome to a new Hamsterdam Research post.
We’ll be taking a look at a paper from Google Research titled, “Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting.”
The paper’s authors include Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Don Metzler, Xuanhui Wang and Michael Bendersky.
We’ll start with the paper’s abstract and then get into more of its details.
But first, why should SEOs and marketers care about this research paper?
Pairwise ranking prompting (PRP) is a technique used to rank documents using LLMs. It works by presenting the LLM with a query and a pair of documents and asking the model to determine which is more relevant to the query. This is done repeatedly for all document pairs, then the results are aggregated to produce a final ranking.
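As a rough illustration of the mechanics, here is a minimal sketch of all-pair PRP. The `ask_llm` function is a hypothetical stand-in for a real LLM call; to keep the sketch runnable, it fakes relevance judgments with simple keyword overlap. In practice, you would replace it with a call to an actual model.

```python
from itertools import permutations

def ask_llm(query, doc_a, doc_b):
    """Hypothetical LLM call: answers 'A' if doc_a seems more relevant.
    Stubbed here with keyword overlap; a real system would prompt an LLM."""
    score = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
    return "A" if score(doc_a) >= score(doc_b) else "B"

def prp_allpairs(query, docs):
    """Rank documents by how many pairwise comparisons each one wins.
    Every pair is compared in both orders, then wins are aggregated."""
    wins = {i: 0 for i in range(len(docs))}
    for i, j in permutations(range(len(docs)), 2):
        if ask_llm(query, docs[i], docs[j]) == "A":
            wins[i] += 1
        else:
            wins[j] += 1
    # Sort document indices by win count, most wins first.
    return sorted(range(len(docs)), key=lambda i: wins[i], reverse=True)
```

Comparing every pair costs O(N²) LLM calls, which is why the paper also proposes cheaper variants.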
Since PRP has been shown to achieve state-of-the-art ranking performance on standard benchmarks using open-source LLMs, it’s important for SEOs to be aware of this technique. It may come in handy when doing competitive research, for example. Furthermore, this technique is a scalable solution that could apply to real-world search engines.
Let’s get into the abstract.
“Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL 2019&2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably with the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, while outperforming other LLM-based solutions, such as InstructGPT which has 175B parameters, by over 10% for all ranking metrics. By using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines and outperforms the blackbox commercial ChatGPT solution by 4.2% and pointwise LLM-based solutions by more than 10% on average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity.”
The researchers mention that it’s “difficult to outperform fine-tuned baseline rankers on benchmark datasets.” A fine-tuned baseline ranker is a smaller language model that has been specifically trained (fine-tuned) on a large dataset of labeled examples for the task of ranking documents.
Fine-tuned baseline rankers are often used as benchmarks to evaluate the performance of newer ranking models, like PRP. Examples of fine-tuned baseline rankers include monoBERT, monoT5, and RankT5.
The researchers mention how they “analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations.”
These are different approaches to using LLMs for ranking documents based on relevance to a given query.
Pointwise ranking prompts present the LLM with a query and a single document to assess the relevance. They typically involve asking the model to generate a score or a label (relevant or not relevant) for the document.
Listwise ranking prompts give the LLM the query and a list of documents and ask the model to rank the documents based on their relevance. However, the researchers note there is sensitivity to the input order.
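To make the three formulations concrete, here are illustrative prompt templates for each. The wording is my own paraphrase of the general approach, not the paper's exact prompts.

```python
def pointwise_prompt(query, doc):
    """One document per prompt; the model labels or scores it."""
    return (f"Query: {query}\nDocument: {doc}\n"
            "Is the document relevant to the query? Answer Yes or No.")

def listwise_prompt(query, docs):
    """All candidates in one prompt; the model outputs a full ordering."""
    listing = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (f"Query: {query}\n{listing}\n"
            "Rank the documents above from most to least relevant, "
            "e.g. [2] > [1] > [3].")

def pairwise_prompt(query, doc_a, doc_b):
    """Two candidates per prompt; the model picks the more relevant one."""
    return (f"Query: {query}\nPassage A: {doc_a}\nPassage B: {doc_b}\n"
            "Which passage is more relevant to the query? Answer A or B.")
```

The pairwise template asks for a single binary choice, which is a much simpler generation task than producing a calibrated score or a complete, well-formed ordering.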
The researchers “propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP).”
The authors argue that pointwise and listwise prompting methods are too challenging for LLMs to understand, but PRP overcomes these challenges by simplifying the task. PRP asks the LLM to compare only two documents at a time, which is easier for the model to comprehend.
In their work, the researchers further explain why it is difficult for LLMs to perform ranking tasks with pointwise and listwise ranking.
Pointwise approaches require LLMs to output calibrated prediction probabilities before sorting, which is difficult, while listwise approaches can generate conflicting or even useless outputs, even with clear instructions.
In general, the researchers find that “existing popular LLMs do not fully understand ranking tasks, potentially due to the lack of ranking awareness during their pre-training and (instruction) fine-tuning procedures.”
In terms of ranking awareness, LLMs may not be sufficiently exposed to the specific challenges and nuances of ranking documents by relevance, given the way they are trained.
LLMs may struggle to understand the concept of relevance, and thus to compare the relevance of different documents to the same query, producing inconsistent results.
Figure 1 shows examples of the pointwise and listwise approaches:

To reduce task complexity for LLMs, PRP includes the query and a pair of documents in the prompt.
Using moderate-sized, open-sourced LLMs on standard benchmark datasets, PRP can achieve state-of-the-art results. The researchers also studied several efficiency improvements that show promising empirical performance.
Figure 2 shows an example of PRP in action:

To avoid the sensitivity to input order seen with listwise prompting, each pair of documents is sent to the LLM twice, once in each order, and the two answers are checked for consistency.
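A sketch of this order-swap, assuming an `ask_llm` callable (hypothetical; any function that answers "A" or "B" for a query and two passages). When the model's preference flips with the presentation order, the comparison is treated as inconclusive:

```python
def compare_debiased(ask_llm, query, doc_a, doc_b):
    """Return 1 if doc_a wins in both orders, -1 if doc_b does, 0 on a tie."""
    first = ask_llm(query, doc_a, doc_b)   # doc_a presented first
    second = ask_llm(query, doc_b, doc_a)  # doc_b presented first
    if first == "A" and second == "B":
        return 1    # both calls prefer doc_a
    if first == "B" and second == "A":
        return -1   # both calls prefer doc_b
    return 0        # the answer flipped with the order: inconclusive
```

A model that simply prefers whichever passage comes first would produce ties everywhere, so only genuine, order-independent preferences affect the ranking.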
They explain three variations of PRP, including all pair comparisons, sorting-based, and sliding window, noting that pairwise comparisons can serve as the basic computation unit of many algorithms.
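For instance, the sorting-based variant can be sketched by plugging the pairwise LLM comparison into an ordinary comparison sort, reducing the cost from O(N²) calls for all pairs to O(N log N). Again, `ask_llm` is a hypothetical stand-in for a real LLM call:

```python
from functools import cmp_to_key

def prp_sort(ask_llm, query, docs):
    """Sort documents using the pairwise LLM judgment as the comparator."""
    def cmp(a, b):
        # Negative return value means `a` ranks ahead of `b`.
        return -1 if ask_llm(query, a, b) == "A" else 1
    return sorted(docs, key=cmp_to_key(cmp))
```

The sliding-window variant goes further still, making a fixed number of passes over the list for linear complexity, which is what makes the technique plausible at real-world search-engine scale.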
This is an important point, because pairwise ranking prompting is a versatile approach that could be potentially used in a variety of contexts, including search engines.
So, why should SEOs and marketers care?
As the researchers point out, “There has been a strong recent interest in exploring information retrieval in general with LLMs based approaches, due to the importance of the applications and the power of LLMs to understand textual queries and documents.”
Beyond possible applications within search engines, I believe PRP is an interesting technique for competitive research. Upon creating a piece of content, an LLM can be used to judge its relevance to a target query compared to the currently top-ranking result.
Thanks for reading. Happy marketing!