Epistemic vs. Aleatoric Uncertainty in LLMs, via a Google DeepMind Paper, “To Believe or Not to Believe Your LLM,” & Why SEOs Should Care (Likely)

By Ethan Lazuk

An aleatorically uncertain robot in a magical realm.

Welcome to another week of Hamsterdam Research, where we look at recent AI research papers to learn just what the heck they’re talking about and explore their possible implications for the future of search and SEO.

This week, we’ll look at a paper from Google DeepMind called “To Believe or Not to Believe Your LLM.”

This paper deals with distinguishing between different types of uncertainty in LLMs, including when one of them qualifies as a hallucination.

It was published on June 5th, 2024, and its authors are Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári.

If you’re an SEO professional, I’d say the value of this topic is about understanding the challenges LLMs (and, relatedly, search engines) confront when answering queries that have inherently uncertain answers.

Here’s the first page of the paper, but don’t worry just yet about reading the contents. We’ll do a deep dive below!

To Believe or Not to Believe Your LLM page 1.

Since we’re talking about large language models, I first want to take you on a little journey about words and their origins.

I didn’t plan this part, but once I got started, I couldn’t stop myself.

If you’re in a hurry and here for the research paper or SEO discussion, jump down to the abstract or why the topic matters for SEO.

If you’re still with me, let’s take a little history detour!

Etymology of “epistemic,” “aleatoric” & more fun topics …

Are you familiar with the term “epistemic”?

What about “aleatoric”?

By the end of this article, you’ll be fluent in both. 😉

These terms are central to the Google DeepMind research paper we’ll be reviewing.

They’re explained in its abstract below (I first saw it on Hugging Face, when it appeared in my Google Discover feed):

To Believe or Not to Believe Your LLM paper abstract on Hugging Face.

Here are the abstract’s first few sentences, which mention epistemic and aleatoric in context:

“We explore uncertainty quantification in large language models (LLMs), with the goal to identify when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers).”

– To Believe or Not to Believe Your LLM abstract

So there we have the overall context of the paper as quantifying uncertainty in LLMs.

“Epistemic” refers to uncertainty from a lack of knowledge about facts (the ground truth), whereas “aleatoric” refers to uncertainty due to multiple possible answers (irreducible randomness).

The researchers’ goal is to distinguish between them.

We’ll delve more into the rest of the abstract shortly.

But first, word origins (etymology) can be a fascinating topic, right?

For example, my last name is Lazuk, which we might tokenize as “Laz” + “-uk.” 😉

What does “Laz” mean?

In college, I studied abroad in the Republic of Georgia, a Caucasus nation at the historical crossroads of European and Islamic civilizations. That’s where I learned about the Laz ethnic group, which is also found in Turkey, among other places.

As for “-uk,” that’s a suffix in Ukrainian that can mean “son of.”

I had a theory that “Lazuk” meant “little Laz” and referred to a diaspora community of Laz people in Ukraine, Russia, and other former Soviet lands.

To the point of our conversation about trusting LLMs, Google Gemini seemed to think that scenario was a possibility:

Lazuk etymology in Gemini.
Laz people in Ukraine explanation from Gemini.

I should note that I also fed Gemini a Wikipedia page and my own theories as part of the prompts.

To trust or not to trust our LLM, that is the question.

In fairness, Gemini qualified both of its answers, telling us to do more research, as this is “one possible interpretation” of the facts.

Now, is that epistemic or aleatoric uncertainty on its part?

One possible interpretation would imply aleatoric, right?

I lean toward epistemic, though: we’re asking a question that likely has one right answer, but that answer most likely wasn’t in Gemini’s training data or even available online.

See how challenging this can be!

We’ll get back to that whole topic shortly, though.

Because I also want to share with you some more history.

The reason I did that study abroad in Tbilisi, Georgia, was that I majored in cultural anthropology with a focus on the Islamic world and Arabic language, and that was the best spot my school could send me.

Parts of Georgia are Europeanized, but you also see influences of Central Asia and beyond.

In the Western world, we tend to see things in a European context.

Western civilization is important.

Greek and Latin influences abound in math and science — the etymology of the word “calculus,” for example, is Latin for a “small pebble” used in an abacus.

That’s also just one piece of the pie.

The Islamic world has played a large role in the development of modern math and science.

Al-gebra. Al-gorithm.

Ok, those aren’t Arabic words exactly, but their origins are:

Featured snippet showing the Arabic roots of algebra and algorithm.

As a result of the spread of Islam, dynastic rulers, and early world commerce, the Arabic language came to influence our vocabulary in ways many of us don’t realize.

Do you wear shirts made of cotton (qutn)?

How about put sugar (sukkar) in your coffee (qahwa)?

Some of those Arabic words came from other parts of the world, too.

Sukkar ultimately comes from Sanskrit (sarkara).

And instead of coffee, many people enjoy tea, or maybe call it chai.

Both names come from Chinese words: te and cha — which traveled through Central Asia to the Middle East and beyond.

In Arabic, tea is called shay.

To that point, Al-Khwarizmi, after whom the word “algorithm” is named, was a Persian from the Khwarazm region:

Wikipedia page about Khwarazm.

Uzbekistan’s language has long been written in a modified Cyrillic (Russian) alphabet, though it is shifting to a Latin one, while Turkmenistan uses a Roman alphabet. Both countries speak Turkic languages.

Turkey itself, where Turkish is spoken, uses a Latin script alphabet today, but formerly used a modified Arabic alphabet during the Ottoman Empire.

Persia includes today’s Iran, where Farsi is spoken.

Farsi still uses a modified Arabic alphabet, but the language isn’t Semitic the way Hebrew and Arabic are.

Persian is an Indo-Iranian (Aryan) language:

Indo-Iranian language family snippet.

If “Indo” sounds familiar, it’s likely because you’ve heard of Indo-European languages.

In fact, you’re reading one now. 😉

Well, Persian is kind of similar to that:

Arabic and Farsi language family differences.

How might Iranians say mother in Farsi?

“Madar.”

How about brother?

“Baradar.”

And let’s not forget our South Asian friends who speak Urdu in Pakistan and parts of India.

Urdu likewise uses a modified Arabic alphabet today and is an Indo-Aryan language.

So we have all sorts of languages, using all sorts of alphabets, for all sorts of historical reasons.

How did this linguistic exchange come about?

It was a mixture of commerce, conquest, and cultural diffusion.

You might have heard of the Silk Road, for example.

Its trading routes spanned the Middle East, China, Central Asia, the Indian subcontinent, Southeast Asia, Europe, and North and East Africa.

It wasn’t only fabrics, spices, and other goods that were spread. It was also knowledge!

Map of the Silk Road.

There were also plenty of dynasties that ruled these regions in the past, like the Abbasid Caliphate, which had “suzerainties.”

Map of the Abbasid Caliphate.

Why was the Arabic language so impactful for intellectual pursuits, though?

Well, anyone who’s tried to learn English knows it’s a messy language with tons of exceptions.

On the other hand, Arabic has a very precise grammar:

Arabic Sarf featured snippet.

Don’t misunderstand me; today’s colloquial Arabic has many localized varieties:

Map of Arabic Dialects.

But given the morphology of Classical Arabic, scholars of all backgrounds could use the language to share their scholarship with nuanced context.

That’s why Arabic was so effective in the fields of science and mathematics, especially during the Islamic Golden Age:

Google AI Overview answering about the influence of Arabic during the Golden Age of Islam.

At that time, Arabic became a unified language of international scholars.

This might sound like a stretch, but I see parallels with today’s multimodal and multilingual transformer-based AI models.

We again have a unified language of international scholarship: translation.

We mentioned algebra (al-jabr) was Arabic. We also mentioned calculus was Latin (albeit later on).

The larger point is that during the Islamic Golden Age, many translated works in Arabic came from foundational knowledge from earlier global scholarship, including Greek and Latin works.

That brings us back to the terms “epistemic” and “aleatoric.”

Here’s the etymology of “epistemic” from Merriam-Webster:

“Wherever it is used, epistemic traces back to the knowledge of the Greeks. It comes from epistēmē, Greek for ‘knowledge.’ That Greek word is from the verb epistanai, meaning ‘to know or understand,’ a word formed from the prefix epi- (meaning ‘upon’ or ‘attached to’) and histanai (meaning ‘to cause to stand’). The study of the nature and grounds of knowledge is called epistemology, and one who engages in such study is an epistemologist.”

– Merriam-Webster Dictionary

It’s all Greek to me.

That dictionary didn’t have etymological information about “aleatoric,” but it did give a definition: “characterized by chance or indeterminate elements.”

Google Search did surface something cool, though: “aleatoric” comes from the Latin root “aleator,” which means “dice player”:

Aleatoric etymology in Google Search.

So a word meaning “characterized by chance” comes from the Latin for “dice player” — how awesome is that?!

This being a Hamsterdam Research post, we’d be remiss not to quote The Wire here:

McNulty: “Let me understand. Every Friday night, you and your boys are shooting craps, right? And every Friday night, your pal Snot Boogie… he’d wait til there’s cash on the ground and he’d grab it and run away? You let him do that?”

Man On Stoop: “We’d catch him and beat his ass but ain’t nobody ever go past that.”

McNulty: “I gotta ask ya: if every time Snot Boogie would grab the money and run away, why’d you even let him in the game?”

Man On Stoop: “What?”

McNulty: “If Snot Boogie always stole the money, why’d you let him play?”

Man On Stoop: “Got to. This America, man.”

– The Wire

All I’m saying is that knowledge at its best is global.

The capacity for information sharing we have today across different modalities and languages using LLMs is like a new age of unified global scholarship.

Writers like Shakespeare are great.

So are figures like Omar Khayyam.

He not only wrote poetry like this:

“With them the seed of Wisdom did I sow,
And with mine own hand wrought to make it grow;
And this was all the Harvest that I reap’d–
‘I came like Water, and like Wind I go.’”

– The Rubaiyat

But he also solved algebraic problems like this:

Omar Khayyam's cubic solution.

We have a great chance here for a new platinum age of global scholarship. Got to take advantage.

Got to this America man GIF from The Wire.

It’s even bigger, in fact. 😉

Ok, that’s my historical tangent!

Now onto our Google DeepMind paper.

But first, why it matters to SEO (maybe)

Wait, another detour?!

Just a quick one …

You’re probably wondering how the topic of epistemic and aleatoric uncertainty in LLMs relates to SEO.

Well, as mentioned in the intro, the same fundamental challenges that epistemic and aleatoric uncertainty pose for LLMs also apply to search engines’ results, including featured snippets and AI Overviews.

Understanding these challenges from an engineering perspective can better inform how we as SEO professionals conceptualize aspects of E-E-A-T and content creation relative to how search systems might work.

We’ll explore these hypothetical implications for SEO more at the end.

Now, without further ado …

Let’s start with the paper’s abstract 😉

If you’d like to read along, you can view a PDF or HTML version of “To Believe or Not to Believe Your LLM” on arXiv.

We’ll begin with a breakdown of the abstract and its vocabulary.

Abstract of To Believe or Not to Believe Your LLM on ArXiv.

Here is the text version of the abstract:

“We explore uncertainty quantification in large language models (LLMs), with the goal to identify when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that allows to reliably detect when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model obtained simply by some special iterative prompting based on the previous responses. Such quantification, for instance, allows to detect hallucinations (cases when epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response) where hallucinations in the multi-answer case cannot be detected. We conduct a series of experiments which demonstrate the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest.” [Highlights added to all quotes.]

If any of that sounds daunting so far, never fear!

We’ll explore the main concepts now before doing the rest of our analysis.

Note: I’ll be using Gemini Advanced, grounded with the research paper’s contents in the prompt, to help with interpreting some of the vocabulary and concepts. All quotes will be attributed to the Google DeepMind paper, unless otherwise noted.

The researchers start off by mentioning uncertainty quantification in LLMs as the focus of their exploration.

Large language models (LLMs) are deep learning models designed to process and generate human-like text in response to prompts and questions.

Uncertainty quantification involves assessing the confidence or reliability of an AI model’s predictions or outputs, including the model’s own confidence in an answer (how sure it is). It’s particularly important in areas where decisions have significant consequences, like healthcare or autonomous vehicles.

When an LLM has large uncertainty in response to a query, that means it’s not confident in its answer. Identifying these instances helps to avoid unreliable information from hallucinations, where an LLM generates a response that sounds plausible but is factually incorrect or nonsensical.

Epistemic uncertainty refers to a lack of knowledge, like when an LLM hasn’t seen enough examples of a particular type of question or language pattern during training. This can also occur when a model’s architecture or parameters aren’t complex enough to capture the nuances of the problem.

An example of epistemic uncertainty might be asking an LLM who won the 2024 NBA Finals, given that that information (at the time of this writing) wouldn’t have been in its training data (or available online). Another example might be posing a subjective question like “What’s the best restaurant?” which leads to epistemic uncertainty about which answer qualifies as “best.”

Aleatoric uncertainty stems from the inherent randomness or variability in a problem itself, even if the LLM has all of the information about the topic. Coincidentally — or perhaps given its training data 😉 — Gemini referenced the etymology of the word “aleatoric” implicitly in its response, explaining that “It’s like rolling a die – even with perfect knowledge of the die’s properties, you can’t predict the exact outcome.”

In the context of LLMs, examples of aleatoric uncertainty are when a question has multiple valid answers, such as a creative writing prompt where there are endless possibilities (like asking for a story about a dragon) or an open-ended question (like asking about the benefits of exercise).
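
To make that dice idea concrete, here’s a trivial Python sketch (my own illustration, not anything from the paper): even with perfect knowledge of the die, each roll is still a surprise.

    import random

    # Aleatoric uncertainty in miniature: we know the die's distribution
    # exactly (a fair six-sided die), yet each individual roll is still random.
    die = [1, 2, 3, 4, 5, 6]
    rolls = [random.choice(die) for _ in range(5)]
    print(rolls)  # e.g., [3, 1, 6, 6, 2]; full knowledge doesn't fix the outcome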

The researchers’ goal is to distinguish between these types of uncertainty, because epistemic uncertainty could result in a higher risk of hallucination, whereas aleatoric uncertainty simply reflects variability in the possible answers.

To accomplish their goal, they derived an information-theoretic metric: a measurement tool based on the principles of information theory (a branch of mathematics that deals with quantifying information) and designed to assess epistemic uncertainty in the LLM.

The researchers detect when an LLM has high epistemic uncertainty by looking at the output of the model (the text it generates). They used “special iterative prompting based on the previous responses,” meaning that, rather than asking the LLM a question once, they asked it multiple times in specific ways, building on the previous responses and creating a chain of interactions. This iterative process (where subsequent prompts depended on how the LLM answered the previous question) helped reveal whether the model’s confidence or uncertainty changed as it generated more text.
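
Here’s a minimal sketch of what that chain of interactions might look like in code. It’s my own illustration, not the paper’s implementation, and ask_llm is a hypothetical stand-in for a real model call:

    # Minimal sketch of the iterative prompting loop; ask_llm() is a
    # hypothetical stand-in for a real LLM API call.
    def ask_llm(prompt):
        return "Paris"  # canned reply so the sketch runs standalone

    def iterative_prompting(query, num_rounds=3):
        responses = []
        for _ in range(num_rounds):
            if not responses:
                prompt = f"Q: {query}\nA:"  # first round: just ask the question
            else:
                # Later rounds feed the earlier responses back into the prompt.
                prompt = f"Consider the following question:\nQ: {query}\n"
                for r in responses:
                    prompt += f"Another answer to question Q is {r}.\n"
                prompt += "Another answer to question Q is:"
            responses.append(ask_llm(prompt))
        return responses

    print(iterative_prompting("What is the capital of France?"))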

Iterative prompting allowed the researchers to detect hallucinations (instances of high epistemic uncertainty) in both single-answer responses — when there’s one correct answer, like “Name the capital of Turkey” — and multi-answer responses — when there are multiple valid answers, like “Name a type of fruit.”

Phew!

That takes us through the paper’s abstract.

Feel free to take a breather before we explore the other sections.

Now it’s time for a deep dive into the paper’s research!

The full paper has seven sections, including an introduction, preliminaries, sections on probability amplification by iteratively prompting, metric of epistemic uncertainty and its estimation, and score-based hallucination tests, as well as experiments and a conclusion.

We’ll summarize each of these below.

1. Introduction

The paper starts out with a quote from “Monday Starts on Saturday” by Arkady and Boris Strugatsky, Soviet-era science fiction writers.

This is where the paper’s title, “To Believe or Not to Believe,” gets its reference.

Excerpt from Monday Starts on Saturday.

The context of the story is quite interesting.

The story is set in a fictional research institution in the Soviet Union where magic and science coexist.

The context of that excerpt above is that Privalov, the main character, who is a programmer, is experiencing strange phenomena and is unsure whether they’re real or hallucinations.

The technique he uses, from the referenced book, is meant to confirm he is not hallucinating.

In the story’s plot, this moment signifies Privalov’s growing acceptance of the magical reality he finds himself in, trusting his senses and experiences more, despite them seeming impossible in nature.

As the researchers explain in the opening to the introduction:

“Like the protagonist of the novel, language models too occasionally suffer from hallucinations, or responses with low truthfulness, that do not match our own common or textbook knowledge. At the same time, since LLMs work by modeling a probability distribution over texts, it is natural to view the problem of truthfulness through the lens of statistical uncertainty. In this paper we explore uncertainty quantification in LLMs.”

When they say LLMs are “modeling probability distribution over texts,” they’re referring to the process of assigning probabilities to different text sequences.

An LLM works by estimating the probability of various possible continuations of a prompt (input text).

Rather than saying, “This is the next word,” an LLM says, “There’s an XYZ% chance it’s this word, an ABC% chance it’s this word,” etc.

Since these large language models operate based on probabilities, there’s always a degree of uncertainty associated with the output.
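
As a toy illustration (mine, not the paper’s), here’s that idea at the level of a single next-word prediction. The model assigns a score (logit) to each candidate, and a softmax turns those scores into probabilities that sum to 1:

    import math

    # Hypothetical scores (logits) a model might assign to candidate
    # next words after the prompt "The capital of the UK is".
    logits = {"London": 9.1, "Paris": 4.3, "Birmingham": 2.7}

    # Softmax: exponentiate and normalize so the scores form a probability distribution.
    total = sum(math.exp(v) for v in logits.values())
    probs = {word: math.exp(v) / total for word, v in logits.items()}
    print(probs)  # roughly {'London': 0.99, 'Paris': 0.008, 'Birmingham': 0.002}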

It’s through this “lens of statistical uncertainty” that the researchers view truthfulness and reliability.

They go on to describe epistemic and aleatoric as the “two sources of uncertainty,” and how low epistemic uncertainty means the model is closer to ground truth (facts).

“Epistemic uncertainty arises from the lack of knowledge about the ground truth (e.g., facts or grammar in the language), stemming from various reasons such as insufficient amount of training data or model capacity. Aleatoric uncertainty comes from irreducible randomness in the prediction problem, such as multiple valid answers to the same query. Hence, truthfulness can be directly analyzed via looking at the epistemic uncertainty of a model in the sense that when the epistemic uncertainty is low, the model predictions must be close to the ground truth.”

They go on to explain how it’s hard to identify uncertainty in deep neural networks due to a lack of guarantees about the model learning the ground truth (consistency) or how large the variance of a learning algorithm is (meaning its stability across different datasets).

Meanwhile, existing heuristic approaches (meaning practical but not optimal) for uncertainty quantification — log-likelihood of responses, estimating entropy, ensembling, conformal prediction — are limited because they apply only to problems where there’s a single correct response (single label). If multiple responses (multi label) are correct (aleatoric uncertainty), the approaches don’t apply.

The researchers next talk about the contributions of their work, including “methods to decouple epistemic and aleatoric uncertainty, allowing us to effectively deal with multi-response queries.”

They’re not concerned with quantifying small epistemic uncertainty, but rather they “aim to identify when only the epistemic uncertainty is large, in which case we can suspect that the response is hallucinated.”

This next part confused me a little, so I had Gemini help break it down after:

“As a starting point we make a simple observation: If multiple responses are obtained to the same query from the ground truth (the language), they should be independent from each other, that is, in probabilistic interpretation, the joint distribution of these multiple responses, for a fixed query, must be a product distribution.

This observation can be used to measure how far the language model can be from the ground truth.”

Gemini gave this follow-up example, which helped me:

“Imagine asking a perfect language model ‘Name a city in the UK.’ If the model truly understands the distribution of UK cities, each answer it gives (e.g., London, Manchester, Birmingham) should be independent of the previous answers. The fact that it said ‘London’ first shouldn’t make ‘Manchester’ more or less likely as the next response.”

– Gemini Advanced

As for the second part of the excerpt regarding the distance of a language model from the ground truth, Gemini had this to say:

“Real language has this property of independence. If you ask a human the same question multiple times, their answers will generally be independent (unless there’s a very limited set of possible answers).

However, language models are not perfect. They might get stuck in certain patterns or be overly influenced by the way a question is phrased. By checking how far the model’s responses deviate from this ideal of independence, the authors can measure how well the model captures the true underlying patterns of language. If the model’s responses are highly dependent on each other, it suggests the model isn’t fully understanding the nuances and possibilities of the language, and thus, is further from the ground truth.”

– Gemini Advanced

The researchers explain this in more detail:

“The sequential model implemented by a language model allows us to construct a joint distribution over multiple responses, which is done through iterative prompting of an LLM based on its previous responses and the application of the chain rule of probability: first we ask the model to provide a response given a query, then to provide another response given the query and the first response, then a third one given the query and the first two responses, and so on.

So, if the response to a prompt containing the query and previous responses is insensitive to the previous responses, we have the desired independence and the LLM-derived joint distribution can be arbitrarily close to the ground truth. On the other hand, if the responses within the context heavily influence new responses from the model then, intuitively speaking, the LLM has low confidence about the knowledge stored in its parameters, and so the LLM-derived joint distribution cannot be close to the ground truth. As more responses are added to the prompt, this dependence can be made more apparent, allowing to detect epistemic uncertainty via our iterative prompting procedure.”
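
In symbols (my rendering of that description, not notation copied from the paper), with x the query, Y_1 through Y_k the successive responses, and Q the language model, the chain rule construction is:

    Q(Y_1, Y_2, \dots, Y_k \mid x) \;=\; Q(Y_1 \mid x)\, Q(Y_2 \mid x, Y_1) \cdots Q(Y_k \mid x, Y_1, \dots, Y_{k-1})

If the ground-truth independence property held, this would factor into a simple product Q(Y_1 \mid x)\, Q(Y_2 \mid x) \cdots Q(Y_k \mid x); how far the LLM-derived joint distribution deviates from that product form is the signal the researchers exploit.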

The iterative prompting procedure led to several contributions:

  • An information-theoretic metric of epistemic uncertainty in LLMs: this “quantifies the gap between the LLM-derived distribution over responses and the ground truth,” and since the “gap is insensitive to aleatoric uncertainty” they can “quantify epistemic uncertainty even in cases where there are multiple valid responses.”
  • A computable lower bound for the epistemic uncertainty metric: they found a quantity called mutual information (MI) — a concept from information theory that measures the relationship between variables, in this case quantifying how much knowing one response from the LLM tells about the other potential responses it could have given — that is always less than or equal to the true uncertainty of the model, giving a concrete way to estimate epistemic uncertainty (see the definition sketched just after this list).
  • An algorithm designed to detect hallucinations: the algorithm uses a finite-sample MI estimator to set a threshold for the MI score to flag potential hallucinations.
  • Iterative prompting technique for LLM information processing manipulation: in short, the model’s behavior depends on whether or not a query is similar to the principal components (patterns in the data).
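
Since mutual information does so much of the heavy lifting above, here’s its textbook definition (standard information theory, not a formula lifted from the paper). The MI between two responses is the KL divergence between their joint distribution and the product of their marginals, so it’s exactly zero when the responses are independent:

    I(Y_1; Y_2) \;=\; \mathrm{KL}\!\left(P(Y_1, Y_2) \,\|\, P(Y_1)\,P(Y_2)\right) \;=\; \sum_{y_1, y_2} P(y_1, y_2) \log \frac{P(y_1, y_2)}{P(y_1)\,P(y_2)}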

2. Preliminaries

In this brief section, the researchers provide basic definitions used throughout the paper.

The first term is “conditional distributions and prompting.”

A conditional distribution refers to the probability of an LLM generating a specific text sequence (response) given a particular input prompt.

The authors also introduce their notations, as well as this diagram:

Conditional distributions and prompting example.

The iterative prompting procedure starts with a query (x) and generates a response (Y1) using the language model (Q). The original query and first response are used in a new prompt to generate another response (Y2), and so on.

The second term is “information-theoretic notions.”

Information theory provides tools to quantify the uncertainty and information content (how surprising or unexpected an event is) in probability distributions (likelihood of different outcomes in random events).

The information-theoretic notions are used to analyze the language model’s behavior during the iterative prompting process.

Examining the MI between the generated responses helps the researchers assess how well the model captures the underlying patterns of language (and how much epistemic uncertainty is present).

In short, they want to measure the language model’s uncertainty about its responses to understand the reliability and truthfulness of the output.

3. Probability amplification by iteratively prompting

This section is where the researchers demonstrate how “repeating possible responses several times in a prompt can have pronounced effects on the output of a language model.”

They ask the LLM for the capital of the UK but then repeat, “Another answer to question Q is Paris.”

“Although the number of repetitions changes the behavior of the LLM, the correct response maintains a significant probability: as Figure 2 shows, the conditional normalized probability of the correct response, ‘London’, reduces from approximately 1 to about 96% as we increase the number of repetitions of the incorrect response to 100. Figure 2 shows 3 more examples where, with initially low epistemic uncertainty in the response to the query (the aleatoric uncertainty is also low as we consider single-response queries), the correct response maintains a significant or non-negligible probability even in the presence of repetitions of incorrect information, while the probability of predicting the latter is increased.”
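
Here’s a rough sketch of how such a repetition prompt could be built (my illustration; the repetition count is the knob being turned in Figure 2, and the paper’s exact wording may differ):

    # Sketch of building the repetition prompt from the amplification experiment.
    def repetition_prompt(query, wrong_answer, num_repetitions):
        prompt = f"Consider the following question:\nQ: {query}\n"
        # Repeat the incorrect response to try to sway the model.
        prompt += f"Another answer to question Q is {wrong_answer}.\n" * num_repetitions
        prompt += "The answer to question Q is:"
        return prompt

    print(repetition_prompt("What is the capital of the UK?", "Paris", 3))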

Here are figures 1 and 2 from the paper, showing examples of low and high epistemic uncertainty, respectively:

Figures 1 and 2 from the paper.

The researchers next talk about “In-context learning vs. in-weight learning,” which discusses the inner workings of a single attention head within a transformer-based LLM.

The in-context information (Z/X) comprises the input matrix (Z) — where each row represents a complete statement (question or answer) in the form of a numerical vector — and a query (X), the input statement.

A familiar query (similar to patterns seen during training) aligns with the principal components of the key-query matrix product, triggering in-weight learning, where the model generates a response based on its learned knowledge.

An unfamiliar query (dissimilar to training patterns) can lead to in-context learning, where the model might copy a repeated element from the prompt.

In figure 3, we see multi-label queries with aleatoric uncertainty:

Figure 3 from the paper.

A multi-label query simply means a question with multiple correct answers.

  1. Name a city in the UK.
  2. Name a yellow fruit.
  3. Name an alcoholic drink.
  4. Name a ball game that is played by more than five players.

My American bias told me the best answer to prompt four was baseball. (“Take me out to the ball game.”)

To that point, this is a funny video to check out before we proceed:

@tay9h on TikTok: “This is the USA’s biggest sporting achievement of all time” #usa #pakistan #cricket #cricketworldcup

“We got beat by an IT guy.” Haha.

But yes … overall, figure 3 shows that repeated presentations of one correct answer in a prompt can influence the likelihood of an LLM generating another valid answer, which relates to the concept of aleatoric uncertainty (multiple correct answers).

4. Metric of epistemic uncertainty and its estimation

In this section, the researchers “apply iterative prompting to estimate the epistemic uncertainty of a language model about responding to some query.”

They use the behavior patterns observed in the previous section “to differentiate between two modes of high uncertainty: when the aleatoric uncertainty is high vs. when only the epistemic uncertainty is high.”

They next apply this “new uncertainty metric to design a score-based hallucination detection algorithm.”

Here’s some vocabulary to bear in mind:

  • Ground truth independence assumption: the assumption that in a perfect language model (the ground truth), multiple responses to the same query are independent of each other (no influence).
  • Pseudo joint distribution: a way to model the probability of getting a sequence of responses from the LLM given a query, called “pseudo” because it uses prompting functions to incorporate previous responses into the calculation.
  • Epistemic uncertainty metric: the Kullback-Leibler (KL) divergence, used to measure epistemic uncertainty by quantifying how much one probability distribution differs from another (the standard definition appears just after this list).
  • Lower bound and mutual information: since it’s impossible to calculate the KL divergence between the LLM and the ground truth directly, a lower bound is established using MI to quantify how much the LLM’s responses depend on each other.
  • Excess risk: a measure of the difference in performance between the LLM and the ground truth.
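
As noted in the list above, here’s the standard definition of the KL divergence between two distributions P and Q (textbook notation, not the paper’s):

    \mathrm{KL}(P \,\|\, Q) \;=\; \sum_{y} P(y) \log \frac{P(y)}{Q(y)}

It equals zero exactly when the two distributions match, which is what makes it a natural measure of how far the LLM’s response distribution sits from the ground truth.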

The above lays the groundwork for the researchers’ method of estimating epistemic uncertainty in LLMs.

We also see in figure 4 a scenario where the LLM’s response deviates from the ground truth:

Figure 4 from the paper.

In the above, the red line is the LLM’s probability distribution (over text sequences given a prompt), and the green line is the ground truth.

The paper’s next section discusses “A computable lower bound on epistemic uncertainty,” which is about the challenge of computing the MI metric (the lower bound for epistemic uncertainty in the LLM).

In short, the MI metric quantifies the dependence between the LLM’s responses to a given query.

One challenge with this is infinite support, where the theoretical calculation of MI would mean evaluating the LLM’s output distribution over all possible responses, which is infinite or very large.

As a solution, the researchers use a finite sample of responses.

Another challenge is estimation error. Because a finite sample is used, errors creep in; for example, the sample won’t cover the entire range of possible answers.

As a solution, the researchers define a quantity called the “missing mass” to represent the probability of responses not included in the finite sample.

The researchers next present an algorithm (MI estimator) to estimate MI from a finite sample, which accounts for missing mass by adding a stabilization parameter and only considering unique responses in the sample.

Here’s how the MI estimator algorithm looks:

MI estimator algorithm.
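
To make the idea tangible, here’s a loose Python sketch in the spirit of a plug-in MI estimate with a stabilization constant. It’s my simplification for illustration, not the paper’s exact algorithm:

    from collections import Counter
    from math import log

    def estimate_mi(pairs, gamma=1e-9):
        # Loose plug-in estimate of mutual information between paired responses.
        n = len(pairs)
        joint = Counter(pairs)                 # empirical joint over (first, second)
        first = Counter(a for a, _ in pairs)   # empirical marginal of first responses
        second = Counter(b for _, b in pairs)  # empirical marginal of second responses
        mi = 0.0
        for (a, b), count in joint.items():    # only unique observed pairs contribute
            p_ab = count / n
            p_a, p_b = first[a] / n, second[b] / n
            mi += p_ab * log(p_ab / (p_a * p_b + gamma))  # gamma stabilizes the ratio
        return mi

    # Strongly dependent response pairs score high; independent-looking pairs score ~0.
    dependent = [("London", "London")] * 8 + [("Paris", "Paris")] * 2
    mixed = [("London", "London"), ("London", "Paris"),
             ("Paris", "London"), ("Paris", "Paris")]
    print(estimate_mi(dependent), estimate_mi(mixed))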

The researchers also provide two bounds on the estimation error:

  • Pessimistic bound: assumes the LLM’s responses are uniformly spread over all possible strings (which is unlikely in practice).
  • Optimistic bound: assumes the LLM’s responses are concentrated on a smaller set of likely responses (more realistic).

In short, this section presents a method to estimate MI of an LLM’s output distribution from a finite sample along with theoretical guarantees of this estimation, providing a practical tool for quantifying epistemic uncertainty, even given vast spaces of possible responses.

5. Score-based hallucination tests

This section introduces a method for using the MI estimator algorithm to detect hallucinations in LLMs.

It’s a bit technical, but in short, by setting an appropriate threshold, the model can be made to abstain from answering when the epistemic uncertainty is too high, thus reducing the risk of hallucination.

The MI score acts like a proxy for epistemic uncertainty, within set thresholds.
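
Conceptually, the test boils down to something like this sketch (my illustration; the threshold here is a made-up placeholder, and choosing it well is much of the paper’s contribution):

    # Conceptual sketch of score-based abstention; the threshold is a placeholder.
    MI_THRESHOLD = 0.5  # hypothetical value, not a number from the paper

    def answer_or_abstain(answer, mi_score):
        # A high MI score means the responses lean heavily on each other,
        # signaling high epistemic uncertainty, so the safer move is to abstain.
        if mi_score > MI_THRESHOLD:
            return "[abstain: epistemic uncertainty too high]"
        return answer

    print(answer_or_abstain("London", mi_score=0.04))  # answers "London"
    print(answer_or_abstain("London", mi_score=0.92))  # abstains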

6. Experiments

To evaluate the “abstention method derived on the MI estimate” in the previous section, the researchers used “a variety of closed-book open-domain question-answering tasks”:

  • Closed-book means the LLM doesn’t have access to external knowledge sources, like a database or the internet, so it must rely on information learned during pre-training.
  • Open-domain means the questions can be about any topic.
  • Question-answering is just that, asking the LLM a question and evaluating the accuracy and relevance of its answers.

The model used for generating outputs and scores was Gemini 1.0 Pro.

[Note: current Gemini products mostly use Gemini 1.5 models, which have a Mixture-of-Experts (MoE) architecture.]

The datasets included “a random subset of 50,000 datapoints from the TriviaQA dataset, and the entire AmbigQA dataset (with 12038 datapoints).”

TriviaQA is a reading comprehension dataset authored by trivia enthusiasts, while AmbigQA is a dataset where every plausible answer is paired with a rewrite of the original question.

However, these “mostly contain single-label queries, and only contain a few multi-label ones.”

Therefore, the researchers also “created a multi-label dataset based on the WordNet dataset,” extracting “all (6015) datapoints from WordNet at depth 4 or more of the physical_entity subtree.”

Furthermore, for “each datapoint (entity, children) in WordNet,” the researchers “constructed a query of the form ‘Name a type of entity.’ and children are considered target labels.”

WordNet is a lexical database of English, in which words are grouped into sets of cognitive synonyms called “synsets,” each expressing a distinct concept.

Figure 5 shows the precision-recall (PR) curves for different methods of detecting hallucinations in LLMs:

Figure 5 from the paper.

For the first two datasets, TriviaQA and AmbigQA, the MI-based method has almost identical performance to another method called Semantic Entropy (SE). Both are superior to the T0 and SV baselines.

For the latter two datasets, which have more multi-label queries, the MI score outperforms the SE score.

Figure 6 shows the performance of the MI and SE methods in detecting hallucinations for single-label and multi-label questions:

Figure 6 from the paper.

The x-axis of entropy represents the measure of uncertainty or diversity in the model’s responses, with higher entropy equating to less certainty and more diverse answers.
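
For reference, entropy here is the standard Shannon measure over the model’s response distribution (textbook definition, not the paper’s notation):

    H(Y) \;=\; -\sum_{y} P(y) \log P(y)

A model that puts all of its probability on one answer has entropy 0; spreading probability across many answers drives it up.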

The y-axis shows recall or error: the proportion of queries where the method provides an answer (doesn’t abstain), and the proportion of incorrect answers among the queries where the method does answer, respectively.

The error bars are confidence intervals.

For low entropy queries (the LLM is more confident), both MI and SE methods had similar recall and error rates.

For high entropy queries (LLM is less certain and generates more diverse responses), the MI method had a higher recall rate than the SE method. In other words, the MI method was more willing to answer questions even when the LLM was uncertain, yet it maintained a low error rate.

In short, the MI method was better at distinguishing between epistemic and aleatoric uncertainty. Particularly in multi-answer questions, the SE method focused on overall uncertainty, while the MI method could better abstain from answering when there was a lack of knowledge as compared to situations where there were multiple valid answers (aleatoric uncertainty).

7. Conclusion

Let’s now look at the paper’s full conclusion:

“In this paper we considered epistemic uncertainty as a proxy for the truthfulness of LLMs. We proposed a mutual-information-based uncertainty estimator that admits a provable lower bound on the epistemic uncertainty of the LLM’s response to a query. That we consider joint distributions of multiple answers allows us to disentangle epistemic and aleatoric uncertainty, which makes it possible to better detect hallucination than first order methods, which can only tackle uncertainty as a whole, not epistemic uncertainty alone. This approach yielded an abstention method that performs significantly better on mixed single-label/multi-label datasets than first-order methods. While earlier methods for classification that aim to quantify epistemic uncertainty are usually based on a modified training method using response-tuples, utilizing the sequential nature of LLMs, our method does not need to change the training procedure, but needs to prompt the model iteratively with multiple responses generated by the LLM for the same query.”

We can now understand better what using epistemic uncertainty as a proxy for truthfulness means.

We can also understand what an uncertainty estimator admitting a lower bound on the epistemic uncertainty of an LLM’s response to a query means, as well as disentangling epistemic from aleatoric uncertainty and a better-performing abstention method on single-label and multi-label datasets.

We can also understand what is meant by prompting the model iteratively with multiple responses generated for the same query.

Admittedly, this is still a challenging topic, at least in my mind.

Here are a couple more examples to help us grasp it all, compliments of our trusty LLM, Gemini. 😉

Bonus: iterative prompting example

Here’s how the iterative prompting might go:

Prompt 1:

Q: What is the capital of France?

A:

Response 1:

Paris

Prompt 2:

Consider the following question:

Q: What is the capital of France?

One answer to question Q is Paris.

Another answer to question Q is:

Response 2:

(It’s unlikely the model will produce another valid answer, but it might try something like “Marseille” or even hallucinate a response)

Prompt 3:

Consider the following question:

Q: What is the capital of France?

One answer to question Q is Paris.

Another answer to question Q is Marseille.

Another answer to question Q is:

– Gemini Advanced

And here’s an explanation for it:

  • “Each prompt builds upon the previous one, incorporating the model’s generated responses.
  • This iterative process allows the model to explore the space of possible answers more thoroughly.
  • By analyzing the relationships between the generated responses (using mutual information), the authors can assess the model’s epistemic uncertainty (uncertainty due to lack of knowledge).
  • If the model keeps generating the same or similar responses, it indicates low epistemic uncertainty (high confidence).
  • If the model generates diverse or nonsensical responses, it indicates high epistemic uncertainty (low confidence), which could signal a hallucination.”
– Gemini Advanced

Bonus: epistemic vs. aleatoric uncertainty example

Here’s an example of epistemic versus aleatoric uncertainty for both single-label (one-answer) and multi-label (multiple-answer) questions:

“Single-Label Question:

  • Question: ‘What is the capital of France?’
  • Epistemic Uncertainty (High): The LLM has not been trained on enough data about France or has a flaw in its architecture, so it’s unsure and might respond with “Marseille” or “Lyon.”
  • Epistemic Uncertainty (Low): The LLM has been trained on ample data and confidently responds with ‘Paris.’
  • Aleatoric Uncertainty: This type of uncertainty doesn’t apply to single-label questions, as there’s only one objectively correct answer.

Multi-Label Question:

  • Question: ‘Name a type of fruit.’
  • Epistemic Uncertainty (High): The LLM hasn’t learned enough about fruits and responds with ‘Carrot’ or ‘Broccoli.’
  • Epistemic Uncertainty (Low): The LLM has learned about fruits and can confidently generate a valid response.
  • Aleatoric Uncertainty (High): The LLM knows many fruits and could respond with ‘Apple,’ ‘Banana,’ ‘Orange,’ etc. There’s inherent randomness in which valid answer it chooses.
  • Aleatoric Uncertainty (Low): This is less common in open-ended questions like this. It might occur if the LLM’s training heavily favored a few specific fruits, making them disproportionately likely to be generated.”
– Gemini Advanced

Of course, we’d be remiss not to include a bonus answer from Gemini:

“Here’s an example using information from the Middle East …

Question: ‘What is the capital of Egypt?’

Epistemic Uncertainty (High): The LLM has limited knowledge of Middle Eastern geography and responds with ‘Istanbul’ or ‘Beirut.’

Epistemic Uncertainty (Low): The LLM has been trained on sufficient data and confidently responds with ‘Cairo.’

Aleatoric Uncertainty: This type of uncertainty doesn’t apply here, as there’s only one correct answer.”

– Gemini Advanced

So why should SEOs care about this topic (likely)?

The first thoughts that came to mind while reading this paper were how search engines like Google and Bing might treat featured snippets or AI-generated answers.

Here’s an excerpt from a 2018 blog post Danny Sullivan wrote in The Keyword about featured snippets, for example:

“Sometimes, a single featured snippet isn’t right for every question. For example, ‘how to setup call forwarding’ varies by carrier. That’s where a recent feature we launched lets you interactively select a featured snippet specific to your situation.”

– The Keyword (2018)

That’s an example of aleatoric uncertainty, or multiple right answers.

Bing introduced DNN-based intelligent search features around that same time in December of 2017:

“Bing now uses deep neural networks to validate answers by aggregating across multiple reputable sources, rather than just one, so you can feel more confident about the answer you’re getting. …

Of course, not every question has just one answer. Sometimes you might be looking for expert opinions, different perspectives or collective knowledge. …

If there are multiple ways to answer a question, you’ll get a carousel of intelligent answers, saving you time searching from one blue link to another.”

– Microsoft Bing Blogs (2017)

That excerpt speaks to examples of both epistemic uncertainty (one correct answer) and aleatoric uncertainty.

As for the recent hullabaloo over Google’s AI Overviews producing odd results, Liz Reid mentioned these as some of the improvements made in a blog post in The Keyword:

  • We added triggering restrictions for queries where AI Overviews were not proving to be as helpful.
  • For topics like news and health, we already have strong guardrails in place. For example, we aim to not show AI Overviews for hard news topics, where freshness and factuality are important. In the case of health, we launched additional triggering refinements to enhance our quality protections.
– The Keyword (2024)

That sounds like an abstention method, such as for instances of high epistemic uncertainty.

In general, thinking about the challenges that epistemic and aleatoric uncertainty pose not only for LLMs but also for search engines using deep learning and RAG models can influence how we think about our own work as SEOs.

For example, we’ve been discussing these topics from an engineering perspective so far, talking about MI estimator algorithms and thresholds.

Let’s now revisit Google’s helpful content guidance:

“Does the content present information in a way that makes you want to trust it, such as clear sourcing, evidence of the expertise involved, background about the author or the site that publishes it, such as through links to an author page or a site’s About page? …

Google’s automated systems are designed to use many different factors to rank great content. After identifying relevant content, our systems aim to prioritize those that seem most helpful. To do this, they identify a mix of factors that can help determine which content demonstrates aspects of experience, expertise, authoritativeness, and trustworthiness, or what we call E-E-A-T.

Reading the guidelines may help you self-assess how your content is doing from an E-E-A-T perspective, improvements to consider, and help align it conceptually with the different signals that our automated systems use to rank content.”

– Creating helpful, reliable, people-first content (Google Search Central)

As SEOs, we can conceptualize topics like “helpfulness” or “E-E-A-T” in the context of more concrete goals for content creation around meeting thresholds for epistemic uncertainty or even addressing opportunities from aleatoric uncertainty to share multiple perspectives.

Of course, this wouldn’t be complete without asking Gemini for its thoughts, based on the conversation we’ve had.

*Note: this is a theoretical exercise now, not predictions or instructions. 😉

Here’s how Gemini thought the MI-based methods in “To Believe or Not to Believe Your LLM” could apply to search and SEO:

1. Hallucination detection and mitigation

This pertains to abstention for featured snippets and AI Overviews, based on epistemic uncertainty.

“Featured snippets and AI Overviews are often generated using LLMs,” Gemini says, and the “MI-based method proposed in the paper could be used to assess the epistemic uncertainty of these generated responses.”

If uncertainty is high, for example, the search engine could abstain from displaying a featured snippet or AI Overview, or it could present a disclaimer about the answer’s potential unreliability.

2. Aleatoric uncertainty

On the flip side, in cases where multiple valid answers are detected for a query (aleatoric uncertainty), an MI-based method could “help the search engine present a more diverse set of results,” says Gemini.

This could include showing multiple featured snippets or perspectives in an AI Overview.

I also wonder if this could extend to SERP diversity based on the ambiguity (aleatoric uncertainty) of search intent for certain queries.

3. E-E-A-T and uncertainty thresholds

This is Gemini’s response so don’t @ me. 😉

The research we reviewed here could be used to establish “uncertainty thresholds for different levels of EEAT,” says Gemini.

For example, a higher uncertainty threshold might apply where expertise is less critical, while a lower threshold might be needed for topics that require higher levels of authoritativeness and trustworthiness, or what we might call YMYL.

4. Search engine understanding and content strategy

For SEO professionals, a knowledge of how search engines might use uncertainty quantification could inform content creation strategies.

Focusing on content that is likely to have low epistemic uncertainty as perceived by LLMs (meaning well-sourced, factual information) could increase its chances of appearing in search results.

I also wonder if acknowledging cases of high aleatoric uncertainty and responding in kind with content that offers a variety of perspectives — such as from multiple experts or UGC, like reviews — could be beneficial.

Relatedly, if we can measure the levels of epistemic or aleatoric uncertainty of queries ourselves, that could be beneficial for informing content strategies and individual page outlines.

Till next time …

Phew! I hope you’ve enjoyed this week’s article from Hamsterdam Research!

Feel free to comment or contact me with your feedback.

Since I didn’t have as much time to copy-edit this as I’d like, I’ll probably return to improve the writing in the coming days.

Also, stay tuned for another new article, likely next week, or check out related posts below.

Until next time, enjoy the vibes.

Thanks for reading. Happy optimizing! 🙂

