PLEDGE (via Google DeepMind), Content Planning for Navigating Trade-Offs of Specificity & Attribution in KGD Systems, & Why SEOs Should Care (Maybe)
By Ethan Lazuk

Welcome to another week of Hamsterdam Research, where we look at recent AI research papers to learn just what the heck they’re talking about as well as explore their possible implications for the future of SEO and search.
This week, we’ll be looking at “Investigating Content Planning for Navigating Trade-offs in Knowledge-Grounded Dialogue.”
This paper gets at a fundamental “trade-off” that humans and machines alike confront with their content — balancing “attribution” (information that can be traced to a source) and “specificity” (relevance to the ongoing conversation).
This paper speaks about this trade-off in a specific context, but I think it also has general takeaways for SEO professionals, as well as good lessons for learning more advanced AI concepts.
The paper is from Google DeepMind and introduces a framework called PLEDGE (PLan-EDit-GEnerate). In short, PLEDGE helps researchers understand how different content planning techniques can affect the quality (attribution and specificity) of responses in knowledge-grounded dialogue systems.
We can think of knowledge-grounded dialogue systems as a category of AI chatbots that have LLMs as the foundation but also the ability to retrieve and incorporate factual information from structured or unstructured sources.
If that sounds like retrieval-augmented generation (RAG) — a term SEOs are increasingly familiar with — it’s not quite the same.
A RAG model is designed more for individual question answering, whereas a knowledge-grounded dialogue system is designed more for back-and-forth exchanges, or “maintaining a coherent conversation over multiple turns,” as Gemini put it.
Examples of knowledge-grounded dialogue systems could include customer service bots (like the Google Cloud customer agent we looked at in Hamsterdam Part 53), voice assistants (like Google Assistant or Siri), or specialized systems for specific knowledge domains (like medical or legal knowledge).
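To make the distinction concrete, here’s a minimal Python sketch of my own (the class and function names are hypothetical, not from the paper) contrasting a single-turn RAG-style call with a dialogue system that carries conversation state across turns:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Running state for a knowledge-grounded dialogue system."""
    history: list[str] = field(default_factory=list)  # alternating user/bot turns

def rag_answer(question: str, passages: list[str]) -> str:
    # Single-turn: only the current question and retrieved passages matter.
    return f"Answer grounded in {len(passages)} retrieved passage(s)..."

def kgd_respond(state: DialogueState, evidence: str) -> str:
    # Multi-turn: recent turns condition the reply, so it must stay specific
    # to the conversation AND attributable to the evidence span.
    context = " | ".join(state.history[-4:])
    return f"Reply conditioned on [{context}] and the evidence span..."

state = DialogueState()
state.history += [
    "User: Are you a fan of comic books?",
    "Bot: I enjoy talking about them! Ask me anything.",
    "User: Who created Spider-Man?",
]
print(kgd_respond(state, evidence="Stan Lee co-created Spider-Man with Steve Ditko."))
```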
But what’s most relevant about this paper, in my mind, isn’t so much the specific use cases but rather how the researchers describe the challenge of balancing specificity and attribution in AI chat model responses, generally.
As SEOs, we care about whether our documents (webpages, for example) are referenced accurately as sources for chatbot answers. At the same time, we also confront the challenge of making content both accurate and engaging for audiences when it’s human-generated.
This paper highlights the inherent challenge of attribution in natural-sounding conversational AI responses.
We’ll learn all about PLEDGE and relevant AI concepts in this article. We’ll also explore the hypothetical implications of this topic for SEO professionals (at the end).
Quick story on selecting this topic, and why it matters to SEO (maybe)
Per usual in Hamsterdam Research, I selected this week’s topic by taking a list of recent publications from Google Research and consulting with ChatGPT and Gemini to pick one that’s relevant to SEO professionals.
This time, I filtered the publications for Natural Language Processing (NLP) topics. That’s what led to selecting this paper, “Investigating Content Planning for Navigating Trade-offs in Knowledge-Grounded Dialogue.”
The paper was authored by Kushal Chawla of the University of Southern California and Hannah Rashkin, Gaurav Singh Tomar, and David Reitter of Google DeepMind.
It was submitted to arXiv on February 3rd, 2024, and was set to appear at EACL 2024 last month.

As for why the topic of content planning in knowledge-grounded dialogue systems is relevant to SEOs (maybe), here were our initial thoughts:
ChatGPT said: “The paper … offers valuable insights for SEO professionals by addressing the balance between creating specific, engaging content and ensuring it aligns with credible sources, a core challenge in SEO. Exploring this topic in a blog post would be beneficial for SEO professionals as it delves into using AI to enhance content authenticity and user engagement—key factors in improving search rankings and user experience. Additionally, the research introduces the PLEDGE framework, providing a glimpse into future AI tools that could automate and optimize SEO content strategies. This blend of theoretical knowledge and practical application makes it a compelling topic, promising to equip SEO experts with advanced strategies for leveraging AI in content planning and execution.” (My bolding) (*LLM disclaimers apply.)
Gemini said: “This research directly explores the challenges search engines and AI assistants face when trying to provide answers that are both conversational and rooted in reliable sources. Understanding these challenges will help SEO professionals stay ahead of the curve as search becomes more conversational. It emphasizes the ongoing need for high-quality, well-sourced content. Furthermore, the paper’s findings about metrics highlight the importance of optimizing for human judgment as well as for algorithms, which underpins Google’s focus on E-A-T principles.” (My bolding) (*LLM disclaimers apply.)
And I said: “As users engage with more conversational AI chat interfaces, understanding how researchers confront the need to properly attribute source documents while maintaining engaging and relevant answers can help us understand how the content we create is leveraged by these chat experiences. More and more, SEO is becoming as much about influencing the information that users see during their journeys with AI chatbots as it is about achieving rankings in search results. The more we understand what happens behind the scenes for how those answers are generated — in this case for knowledge-grounded dialogue systems — the more we can contextualize why our content is referenced the way it is (or not). Plus, the idea of well-attributed yet specific and engaging content applies to human-generated content as well!”
Let’s start with the paper’s abstract 😉
Here is a screenshot of the abstract taken from the Google Research website:

And here is the abstract text:
“Knowledge-grounded dialogue generation is a challenging task because it requires satisfying two fundamental yet often competing constraints: being responsive in a manner that is specific to what the conversation partner has said while also being attributable to an underlying source document. In this work, we bring this trade-off between these two objectives (specificity and attribution) to light and ask the question: Can explicit content planning before the response generation help the model to address this challenge? To answer this question, we design a framework called PLEDGE, which allows us to experiment with various plan variables explored in prior work, supporting both metric-agnostic and metric-aware approaches. While content planning shows promise, our results on whether it can actually help to navigate this trade-off are mixed — planning mechanisms that are metric-aware (use automatic metrics during training) are better at automatic evaluations but underperform in human judgment compared to metric-agnostic mechanisms. We discuss how this may be caused by over-fitting to automatic metrics and the need for future work to better calibrate these metrics towards human judgment. We hope the observations from our analysis will inform future work that aims to apply content planning in this context.”
There are probably some unfamiliar concepts in there — there were for me!
So let’s start by reviewing the basic vocabulary.
*Note: I’ll use Gemini 1.5 Pro in Google AI Studio (Gemini Pro) and Gemini Advanced (Gemini), each prompted with the research paper for context, to help with this analysis. All quotations will be attributed and refer to the Google Research paper by default unless otherwise specified. (*LLM disclaimers apply.)
“Knowledge-grounded dialogue generation” refers to AI models that generate conversational responses to user prompts that are both “specific” (relevant to the ongoing dialogue and conversation history) and “attributable” (where the information can be traced back to a reliable source or knowledge base to prevent hallucinations).
As the researchers explain, “specificity and attribution” can be “competing constraints,” which makes balancing them a “challenging task.” In other words, it’s easier for AI models to generate responses that sound natural or attribute information well, but it’s harder to do both.
This made more sense as I read the paper. After all, if the source document is highly technical, the more the AI model explains it in natural language, the further it departs from the source’s precise information. It’s the same issue we humans have. 😉
But early on, to be honest, I wasn’t really clear on how this AI model was being used, or why it differed from a RAG model.
But Figure 1 of the paper provides helpful context:

We can see above how the user isn’t just looking for an answer to a question. They are actually engaging in a conversation with the AI chatbot, asking it if it’s a fan of comic books.
[Aside: In my view, the above example represents a type of anthropomorphism — assigning human qualities to objects — that we discussed in Hamsterdam Part 54 as a risk of advanced AI assistants. I could see that type of interaction being prevalent with certain users who are less familiar with how generative AI works, though. This paper also has an ethics section at the end, which I appreciated.]
In order to reconcile the need for specificity and attribution in generated AI responses, the researchers pose the question, “Can explicit content planning before the response generation help the model to address this challenge?”
“Explicit content planning,” as Gemini Pro explains, refers to a process of defining the features and characteristics of a response before generating it.
In other words, content planning involves generating an intermediate representation, called a plan, that outlines key aspects of the response, including structural features (emotion, dialogue act such as inform or question, level of formality, or similarity to the source document) as well as keywords (important words or concepts to include in the answer).
The hypothesis is that explicitly planning the content before it’s generated will guide the AI model toward a response that’s more specific to the conversation and attributable to the source document.
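To illustrate, here’s a toy sketch of what a content plan might look like as a data structure. The field names are my own hypothetical stand-ins, not the paper’s exact plan variables:

```python
from dataclasses import dataclass

@dataclass
class ContentPlan:
    """Intermediate representation generated before the final response.
    Field names are illustrative; the paper explores several plan variants."""
    dialogue_act: str           # e.g., "inform" or "question"
    formality: str              # e.g., "casual" or "formal"
    evidence_similarity: float  # 0-1: how closely to mirror the source document
    keywords: list[str]         # concepts the response should include

plan = ContentPlan(
    dialogue_act="inform",
    formality="casual",
    evidence_similarity=0.7,
    keywords=["Stan Lee", "Spider-Man", "co-created"],
)
# The generator is then conditioned on this plan plus the conversation
# history and evidence, rather than generating the response directly.
print(plan)
```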
Looking back at Figure 1 above, we can see how the AI model’s responses fall into four quadrants along two dimensions: attribution and conversation specificity. (We’ll revisit this in a bit.)
Meanwhile, to answer the research question, the DeepMind researchers “design a framework called PLEDGE, which allows us to experiment with various plan variables explored in prior work, supporting both metric-agnostic and metric-aware approaches.”
As mentioned, PLEDGE is short for PLan-EDit-GEnerate, a framework that systematically tests different content planning techniques, as Gemini Pro worded it.
The “plan variables” the researchers mentioned refer to features and attributes in the content plan that serve as the intermediate representation of the desired response (the structural features and keywords).
A “metric-agnostic approach” involves planning the content without explicitly considering automatic evaluation metrics (like specificity or attribution scores) that assess the quality of the generated response.
Meanwhile, a “metric-aware approach” uses these automatic evaluation metrics in the planning process. This approach can involve optimizing the model for specificity or attribution score metrics during the training process or using them to guide the selection of plan variables.
The purpose of PLEDGE is to allow the researchers to test both metric-agnostic and metric-aware approaches.
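Here’s a simplified sketch of the difference between the two approaches. The scoring functions below are toy word-overlap proxies I made up for illustration (the paper uses trained NLI- and DialoGPT-based metrics, as we’ll see):

```python
def attribution_score(response: str, evidence: str) -> float:
    # Toy word-overlap proxy; the paper uses NLI entailment scores.
    overlap = set(response.lower().split()) & set(evidence.lower().split())
    return len(overlap) / max(len(evidence.split()), 1)

def specificity_score(response: str, history: list[str]) -> float:
    # Toy proxy; the paper uses a DialoGPT-based coherency score.
    context = set(" ".join(history).lower().split())
    overlap = set(response.lower().split()) & context
    return len(overlap) / max(len(response.split()), 1)

def pick_plan_metric_agnostic(candidate_plans: list) -> object:
    # No metrics consulted: trust the planner's own top-ranked plan.
    return candidate_plans[0]

def pick_plan_metric_aware(candidate_plans, realize, evidence, history):
    # Metrics in the loop: realize each plan, score it, keep the best trade-off.
    def combined(plan):
        resp = realize(plan)
        return attribution_score(resp, evidence) + specificity_score(resp, history)
    return max(candidate_plans, key=combined)
```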
Jumping ahead, the researchers describe their results as being “mixed,” which means the results aren’t entirely conclusive and depend on the approach used.
The researchers used both automatic metrics and human evaluations of the results, as we’ll learn. They explain the mixed results could be “caused by over-fitting to automatic metrics.”
While the metric-aware results performed well (based on the automatic metrics), the human evaluators preferred metric-agnostic results. In short, there was a discrepancy.
The metric-aware model may have been over-fitted to the automatic metrics, which means it was learning to generate responses that scored well (automatic metrics) but didn’t necessarily reflect true quality or human preferences, per Gemini Pro.
However, the researchers hope their results can “inform future work.”
This is an interesting paper for our Hamsterdam Research project. Typically, the AI research papers we investigate conclude with results that either exceed past benchmarks or are described as cutting-edge. But here, we have an example of research that wasn’t conclusive.
That said, I think learning about these types of findings can still give us interesting takeaways for understanding how generated AI responses work, as well as help us appreciate the challenges of attribution and specificity for content generation of all kinds. 😉
Now for a deep dive into the research!
The paper is available in an HTML format from arXiv, but since the formatting isn’t perfect, I’ll be referencing the PDF version, if you wish to follow along.

The paper has nine main sections: an introduction; evaluation metrics for grounded dialogue response generation; whether content planning can help; the PLEDGE framework; experiments; related work; and a conclusion, followed by sections on ethical considerations and limitations.
We’ll summarize each of these below.
1. The introduction
The researchers start by giving an overview of the paper’s main themes.
They first explain how a “knowledge-grounded dialogue system that aims to address a user’s information needs must meet two fundamental requirements,” including being credible — sharing information that’s attributable to the retrieved document (attribution) — and specific — or “make sense in the context of the conversation” (specificity).
They reference Figure 1 as showing “how responses can fail along either of these dimensions independently of each other.”
Looking at Figure 1 again below, we can see how the top-right answer has the most attribution and specificity, both answering the question and providing information from the retrieved document (evidence span):

We can also see how the bottom two answers lack attribution, while the top-left answer lacks specificity.
“There is a scarcity of research explicitly investigating how to navigate the trade-off between these objectives,” the researchers explain.
We also get further clarity on the “automatic metrics,” which “can serve as a proxy for these dimensions.”
Then the researchers explain their rationale for using content planning:
“Content planning approaches add an intermediate step of generating the desirable features in the response, referred to as a plan, before actually generating the final surface realization conditioned on this plan. Prior work showed that splitting the generation into guided steps could be effective in indirectly encouraging the model to be more grounded to commonsense (Zhou et al., 2022b) and source documents (Narayan et al., 2021, 2023; Hua and Wang, 2019), or to be more coherent (Yao et al., 2019; Hu et al., 2022; Zheng et al., 2022; Tan et al., 2021). Hence, it is only natural to hypothesize that content planning can also help to handle the trade-off between these two objectives as well.”
With regard to the “prior work” referenced above, “splitting the generation into guided steps” refers to an approach where, instead of going straight from understanding the conversation to generating the response, the AI model takes intermediate steps that help it consider what’s more generally known (reasonable), steer closer to the information from retrieved sources, and be more coherent, per Gemini’s summary.
As a result, the researchers hypothesize that such an approach — splitting the AI-generated response into guided steps — could work for balancing credibility and specificity in knowledge-grounded dialogue systems.
They next introduce their “framework called PLEDGE,” which is shown below in Figure 2:

Let’s break down the PLEDGE diagram with added context from Gemini.
The gray “Inputs” box refers to evidence (relevant passages retrieved from a knowledge source, like a document).
[Aside: I like to think of the “evidence” or “passages retrieved” in terms of chunks or candidate passages in documents, like how Google SGE answers contain jumplinks to highlighted sections of webpages.
For example, this SGE citation:

Goes to this “relevant passage”:

This has influenced how I think about the construction of my content.]
The PLEDGE inputs box also contains conversation history (between the user and the dialogue system, which informs the context of the current query) and keywords (important terms or concepts from the retrieved evidence and conversation history).
The purple “Plan generation stage” box is the first main stage of the framework, which uses the evidence, conversation history, and keywords to generate a content plan, either structural or keyword-based.
Within this stage is also the “Plan-editing model” (yellow box), which takes the initial content plan and refines it to balance specificity and attribution. It’s here that any metrics are used to make decisions to modify the plan.
The purple “Response generation stage” box is the second main stage of the framework, where the content plan is used to generate an actual text response. Factors like grammar and fluency may also be considered by the model in this stage.
The gray “Generated Response” box represents the output, or the final text response from the model to answer the user’s query.
Now let’s look a little more at the purpose of the framework overall.
When it comes to balancing trade-offs of attribution and specificity in AI-generated answers, the PLEDGE framework lets the researchers “explore the utility of planning in navigating this trade-off, as well as the effects of structural vs keyword-based plans for this task.”
The mention of “structural” plans here refers to the overall structure of the response, while “keyword-based plans” refers to key terms or concepts a response should contain.
In terms of the results, the researchers find that “content planning shows promise in general,” but their results on whether it can “actually help to navigate this tradeoff” are “mixed.”
Here are their conclusions and discussion of the discrepancy between the automatic metrics and human evaluators:
“We observe that planning mechanisms that use automatic metrics during training are better at automatic evaluations but underperform in human judgments compared to mechanisms that do not rely on these metrics explicitly. We discuss how metrics that are better calibrated towards human judgment might help to address this misalignment. We provide insights from our analysis with the hope of informing future work that aims to apply content planning in this context.”
This gets back to the over-fitting of the model to automatic metrics with a suggested solution of using “metrics that are better calibrated towards human judgment.”
2. Evaluation metrics for grounded dialogue response generation
This second section has multiple subsections, which we’ll explore with assistance from Gemini.
The researchers first explain what a knowledge-grounded dialogue system aims to achieve.
In short, the system is “given a sequence of previous conversation turns … and an evidence span … selected from a knowledge corpus, and must generate a response” with maximized quality.
Good responses from the system must be “conversationally appropriate in the context of the rest of the dialogue” and accurately represent “the information from the knowledge evidence.” In other words, specificity and attribution.
The first subsection talks about prior efforts in this type of modeling.
The researchers explain how past efforts “often focused on evaluating the faithfulness of responses to evidence,” or what they refer to as attribution. “This is often estimated by entailment scores from a trained Natural Language Inference (NLI) model.”
A Natural Language Inference (or NLI) score measures how well a sentence (hypothesis) follows logically from another sentence (premise). But as the researchers point out, “when looked at in isolation from other metrics, maximizing the NLI score is in fact, trivial.”
The first reason maximizing the NLI score alone is trivial is that the model can simply output the entire evidence passage as the response (copying the evidence) — this reminded me of when people call out SGE responses as being verbatim to the source page’s content.
Another issue with NLI scores is that the AI model’s response might be factually accurate but not clear or engaging for the user. I find this interesting — the goal of these knowledge-grounded dialogue systems isn’t just to retrieve the correct source document information, but also to make it easy to understand and engaging to read. That’s a concept we can apply to human-generated content, as well. 😉
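Before moving on, here’s a hedged sketch of how entailment-as-attribution scoring might look in practice, using the publicly available roberta-large-mnli checkpoint from Hugging Face (the paper’s exact NLI model may differ, so treat this as a stand-in):

```python
from transformers import pipeline

# Illustrative only: a public NLI checkpoint as a stand-in for the
# paper's attribution metric.
nli = pipeline("text-classification", model="roberta-large-mnli")

evidence = "Stan Lee co-created Spider-Man with artist Steve Ditko in 1962."
response = "Stan Lee co-created Spider-Man back in 1962."

# Premise = evidence span, hypothesis = generated response. The entailment
# probability acts as a proxy for how attributable the response is.
scores = nli({"text": evidence, "text_pair": response}, top_k=None)
entailment = next(s["score"] for s in scores if s["label"] == "ENTAILMENT")
print(f"Attribution (entailment) proxy: {entailment:.2f}")
```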
The next subsection goes into more detail about dialogue system requirements.
The researchers explain how a generated response needs to be “conversationally relevant to the previous conversation turns,” which is about “more than topical relevance” but a logical flow. This is what the researchers call specificity.
They chose the term specificity to be consistent with past work that used LaMDA, but note how it “refers to how specific the response is to the conversational history” and “not how concrete the language is.”
The researchers used an “out-of-the-box DialoGPT model” as “the most suitable metric to measure coherency,” which is “similar to how coherence was measured for long text generation” in another study that “used next sentence prediction probabilities from BERT as a proxy.”
The DialoGPT model is based on the GPT-2 model but is specifically trained on multi-turn dialogue from Reddit discussion threads, according to Gemini.
I find this interesting, given Google’s recent partnership with Reddit. (I also recently learned that Bing did their own Reddit partnership back in 2017.)
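Getting back to the metric itself, here’s a rough sketch of how a DialoGPT-style coherency score could be computed, measuring how probable a candidate response is given the conversation history. This is my illustrative approximation, not the paper’s actual scoring code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative specificity/coherency proxy: how likely is the response
# given the conversation history under DialoGPT?
tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
model.eval()

history = "Are you a fan of comic books?" + tok.eos_token
response = "I love them! Stan Lee co-created some of my favorites." + tok.eos_token

ctx_ids = tok(history, return_tensors="pt").input_ids
resp_ids = tok(response, return_tensors="pt").input_ids
input_ids = torch.cat([ctx_ids, resp_ids], dim=-1)

# Compute loss only on the response tokens (-100 labels are ignored).
labels = input_ids.clone()
labels[:, : ctx_ids.shape[-1]] = -100

with torch.no_grad():
    nll = model(input_ids=input_ids, labels=labels).loss
print(f"Specificity proxy (lower NLL = more coherent reply): {nll.item():.3f}")
```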
The final subsection explains the researchers’ hypothesis that they can increase either the attribution or specificity metrics “trivially by forcing a model to attend more to either the evidence or the conversation history.”
To test their hypothesis quantitatively, the researchers use “T5-base fine-tuned on Wizard of Wikipedia,” applying “different levels of dropout on the input words to either the evidence or the conversation history.”
T5 (Text-To-Text Transfer Transformer) is a type of neural network architecture designed for text generation and understanding tasks. T5-“base” means it’s the standard-sized model. “Fine-tuned” refers to the fact that T5 is initially pre-trained on a massive dataset of general text and then further trained on a specialized dataset for a specific task.
Wizard of Wikipedia is a dataset specifically designed for training knowledge-grounded dialogue systems. The researchers also tested on a validation set (a separate dataset).
Dropout is a technique used during model training to prevent the model from overfitting. It makes it harder for the model to see either the evidence or conversation history, depending on how much attention the researchers want the model to pay to each.
The researchers’ prediction is that high dropout on evidence might reduce the attribution score but increase the specificity score, and vice versa.
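Here’s a simplified, illustrative version of that probe. (The paper applies dropout to input words inside the model; this toy version just drops words from the raw text to show the idea.)

```python
import random

def drop_words(text: str, p: float, seed: int = 0) -> str:
    """Drop roughly a fraction p of the input words at random."""
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() > p)

evidence = "Stan Lee co-created Spider-Man with Steve Ditko in 1962."
history = "User: Are you a fan of comic books?"

# Starving the model of evidence words should hurt attribution while
# letting specificity rise, and vice versa when history is dropped.
print("Evidence after 70% dropout:", drop_words(evidence, p=0.7))
print("History kept intact:      ", drop_words(history, p=0.0))
```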
They reference Figure 3 as showing how they “can increase either the attribution or specificity scores by simply dropping portions of the conversation history or evidence respectively,” which “causes the opposite metric to decrease”:

“This demonstrates the importance of optimizing for both when designing new knowledge-grounded response generation models,” the researchers explain. “Otherwise, when looking at either metric in isolation, it is much easier to game the metric with trivial solutions.”
In short, improving one metric (attribution or specificity) often came at the cost of the other metric, which highlights the fundamental “trade-off” of the paper.
3. Can content planning help?
This is a brief section where the researchers mention how their “goal is to explore whether improved content planning can help with the attribution-specificity trade-off.”
They also list four research questions:
“RQ 1: How helpful is planning out-of-the-box, i.e. without being directly aware of the attribution and specificity metrics that are being optimized?
RQ 2: How do these metric-agnostic approaches compare with metric-aware methods, where the latter allow explicit optimization towards the desirable quality metrics?
RQ 3: What kind of structural attributes are useful in the planning stages for this task?
RQ 4: And finally, is content planning helpful to handle the attribution-specificity trade-off?”
For context on these questions, we can refer back to the beginning of this article where we reviewed the key concepts from the abstract.
Meanwhile, the researchers’ means of answering these research questions is “a framework called PLEDGE (PLan-EDit-GEnerate),” which provides “an explainable and controllable way to test out various kinds of planning variables explored in prior work, and hence, enables the analysis presented in later sections.”
4. The PLEDGE framework
This section has a bunch of fun math formulas that are over my head. (Perhaps we’ll get into those types of details in future Hamsterdam Research articles.) 😉
But this section also contains Figure 4, which shows the PLEDGE framework.
Let’s look at Figure 4 below and explain how the PLEDGE framework works, using Gemini’s help:

The purple “Generation Model” box is the first part of the system that’s responsible for taking the “dialogue history” and “evidence” and generating a “candidate plan” (the red box) — a blueprint for what the final answer should cover.
We see this elaborated on as the larger purple “Autoregressive Encoder-Decoder Generation Model” box below. “Autoregressive” means the model generates text one word (or token) at a time, which then becomes part of the input for predicting the next word. “Encoder-decoder” refers to a type of neural network architecture used for sequence-to-sequence tasks, such as knowledge-grounded dialogue. “Encoders” process the input, while “decoders” take the encoder’s output and generate the response text.
As Gemini points out, this sequence-to-sequence (Seq2Seq) architecture likely used a transformer. This is especially the case given the references to past work that involved LaMDA and BERT, both language models built on the transformer architecture.
[Aside: Transformers are a foundational element of many Natural Language Processing (NLP) tasks. Unlike previous sequence-to-sequence models — a type of neural network that transforms one sequence of data (input) into another sequence (output) — transformers rely on an “attention mechanism” that lets the model focus on specific parts of an input sequence (here it’s the conversation history and evidence) when generating an output (here it’s a candidate plan). The attention mechanism enables the model to capture long-range dependencies in the input sequence, understanding the context across longer stretches of text. Transformers are used in various AI models, including LLMs (like Gemini or ChatGPT).]
The red and green “Plan Editor” box then takes the candidate plan, the conversation history, and the evidence as input and refines the plan to better align with desired attribution and specificity qualities. This is the “modified plan” in the green box. The change from red to green indicates improvements in the quality of the plan.
The plan editor is elaborated on with the larger red and green “Plan Editor: Non-autoregressive model” box below. Being “non-autoregressive” means the model analyzes the entire plan, dialogue history, and evidence all at once, rather than token by token. It temporarily masks parts of the candidate plan (a technique related to the dropout idea mentioned earlier) and rewrites them to push toward the desired specificity or attribution metrics, per Gemini’s explanation.
The final (modified) plan, conversation history, and evidence then get fed to the second purple “Generation Model” box, which uses the information to produce a final response.
Based on this, we can see what the researchers mean by earlier references to “splitting the generation into guided steps” and using a “candidate plan.”
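Putting it all together, here’s a schematic sketch of the plan-edit-generate control flow. The three callables stand in for the paper’s trained models; this is an outline of the loop, not an implementation:

```python
def pledge_respond(history, evidence, planner, editor, generator):
    """Schematic PLan-EDit-GEnerate control flow (simplified from Figure 4).
    planner, editor, and generator stand in for the paper's trained models."""
    # 1. PLan: an autoregressive encoder-decoder drafts a candidate plan.
    candidate_plan = planner(history, evidence)
    # 2. EDit: a non-autoregressive editor masks and rewrites plan tokens,
    #    optionally steered by attribution/specificity metrics (metric-aware).
    modified_plan = editor(candidate_plan, history, evidence)
    # 3. GEnerate: the final response is conditioned on the modified plan.
    return generator(modified_plan, history, evidence)
```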
5. Experiments
This is where the researchers test out the PLEDGE framework.
In addition to automatic metrics, they also “ran a human evaluation over different model outputs.” These human annotators rated the specificity of outputs on a scale of 1 to 5 and then gave a binary rating as to whether the response was attributable to the evidence.
The researchers found that “metric-aware edits may be useful for improving the automatic metrics they are trained on, but these improvements do not transfer well to human judgments.”
In other words, metric-aware edits were made to the response plan (the planning stage), but the researchers found a mismatch where the automated metrics often showed an improvement, but that didn’t necessarily translate to better scores for human judgments based on the same qualities, as Gemini explains.
As Gemini helpfully elaborates, automated metrics are imperfect approximations of complex qualities, which AI systems can “game” by making superficial changes, but human judgments are subjective.
Meanwhile, the mismatch between automatic metrics and human judgments was discovered during the editing phase of PLEDGE, thus implying that content planning can be helpful, but the way edits were currently performed (guided by metrics) may need refinement, as Gemini explains.
6. Related work
This section explores related work for topics like knowledge-grounded dialogue evaluation, improving different aspects of quality, planning for text generation, and ongoing challenges, which we’ll summarize with Gemini’s help.
The paper explains how evaluating dialogue systems is a complex task because good responses require specificity and attribution (as we’ve been discussing) but also qualities like being interesting or sensible to users.
Efforts to improve metrics like attribution may involve trade-offs, such as reducing the quality of how engaging or fluent responses appear.
As mentioned, this makes sense. The less accurate (or beholden to factual information from retrieved documents) the AI model is, the more liberty it can take in its response.
Past work has focused on improving attribution or specificity individually, but jointly optimizing these metrics is part of newer work, like what the researchers performed here. Their approach (metric-aware editing) was also novel, specifically within the context of knowledge-grounded dialogue systems.
Interestingly, the researchers’ findings, although obtained with a T5-base model, could be relevant to LLMs in general. This, I think, is an important takeaway, suggesting how the balance of attribution and specificity is a universal challenge that we can appreciate in many contexts, including human-generated content. 😉
Another good concept we can learn from the paper is that the PLEDGE framework’s planning approach (the intermediate steps of using a candidate plan) bears conceptual resemblances to chain-of-thought reasoning, which is a technique used in LLMs that encourages them to be transparent in their reasoning process during text generation.
In chain-of-thought reasoning, rather than jumping to an answer, the AI model goes through a series of steps to explain its thought process. This approach improves interpretability (a topic we explored in a past Hamsterdam Research article about ELM) as well as control.
PLEDGE’s planning stage similarly doesn’t generate responses directly but instead creates a candidate plan with key points the response should cover.
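To make the resemblance concrete, here’s a loose prompting analogy. To be clear, PLEDGE trains dedicated planning and editing models rather than relying on prompting, but the “plan first, then respond” shape is similar:

```python
# A loose "plan first, then respond" prompting analogy of my own,
# NOT the paper's method (PLEDGE trains dedicated models).
prompt = """Conversation so far: "Are you a fan of comic books?"
Evidence: "Stan Lee co-created Spider-Man with Steve Ditko in 1962."

Step 1 (Plan): list the dialogue act, tone, and keywords to cover.
Step 2 (Respond): write the reply, following the plan from Step 1."""
print(prompt)
```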
7. Conclusion
The researchers’ conclusion is a single paragraph:
“We investigated the trade-off between attribution and specificity for knowledge-grounded dialogue, analyzing whether content planning prior to final output generation can help to navigate this trade-off. We find that although content planning shows promise in general, we observe differences in the trends in automated and human evaluations. Hence, whether content planning can help to handle the trade-off remains an open question and more efforts are needed to answer it, with automated metrics that are potentially better calibrated with human judgment. We hope that the insights gained in this work inform future efforts on exploiting content planning in similar contexts.”
Based on our review of this paper, we now can understand what this conclusion means by the “trade-off between attribution and specificity,” what “knowledge-grounded dialogue” refers to, and what “content planning prior to final output generation” means.
We can also understand the reference to “automated and human evaluations,” why the benefit of “content planning” remains an “open question,” and why automated metrics “calibrated with human judgment” would be important.
And that’s pretty cool!
8. Broader impact and ethical considerations
In terms of broader impact, the researchers note how the datasets used are both “popular and publicly available for dialogue research.”
They next explain the ethical considerations of their work, which I’ll include in full:
“The primary goal of a knowledge-grounded dialogue system is to be able to converse with a user about the external world, providing the user with important new information. This could lead to dangers of spreading misinformation if a model hallucinates or shares information from untrusted sources. In this work, we put forth attribution metrics as a way of quantifying whether a system hallucinates compared to what was written in the grounding document. However, we make the assumption that the document itself is trustworthy by only using pre-selected document examples from Wikipedia. For more general-purpose systems, more work is needed to quantify the trustworthiness of underlying sources. Additionally, in this paper, we do not evaluate for other important dialogue complications, such as toxic or offensive language, which would need to be taken into account for a real-world dialogue system.”
As we see, the researchers mention risks of misinformation spreading from model hallucinations or even untrusted sources — as SEOs, we might contextualize this as E-E-A-T and reliability considerations.
To that end, general-purpose systems would need more work “to quantify the trustworthiness of underlying sources.” Other risks include “toxic or offensive language.”
[Aside: In the introduction to Hamsterdam Part 54, we explored a recent Google DeepMind paper about advanced AI assistant ethics. I also recently watched a YouTube video from Machine Learning Street Talk featuring an interview with Professor Simon Prince that included a great discussion on AI ethics. So this topic of AI ethics has been top of my mind lately, and I’m glad to see it merited a section in this paper.]
9. Limitations
The researchers explain that while specificity and attribution are “an important set of qualities that a dialogue system must ensure,” they’re not the only “set of qualities” it should have. Other aspects of quality that deserve further consideration may include “interestingness or different aspects of fluency.” Consequently, they explain that “Future work may need to extend to exploring complex multi-dimensional trade-offs that go beyond the scope of this work.”
They further explain how, beyond the planning mechanisms they used in this research, “there are other forms of planning and guiding structured output that are still largely unexplored for this task.”
In short, this is the tip of the iceberg of a fascinating and increasingly consequential topic.
So why should SEOs care about PLEDGE (maybe)?
Truth be told, I don’t think SEOs should care about PLEDGE.
Having said that, I do think we should care about the “trade-off” between attribution and specificity that PLEDGE investigates.
Meta AI with Llama 3 incorporating Bing and Google search results as sources is just the latest example of how our universe of SEO work is expanding beyond rankings to what I call “cumulative organic visibility.”
Maybe we can’t optimize content to get mentioned or cited in an AI-generated response from a RAG model or knowledge-grounded dialogue system as easily as we would for ranked search results, but, just like with Google Discover, we can at least ensure our content (and the brand it represents) has a shot to earn visibility there.
AI chatbot mentions can build brand awareness and user trust, which can translate to better organic search performance, including higher CTR for non-brand queries and increased branded navigational searches and clicks.
Additionally, the quality of referral traffic that comes from AI-generated response citations will likely be more targeted and higher-converting than normal SERP traffic.
But that’s not all.
Being aware of the challenge of balancing attribution and specificity in AI-generated responses could also influence how we approach SEO content. For example, framing our content in chunks or candidate passages that are informative yet engaging could improve their eligibility to be used and cited as knowledge sources in conversational AI model responses.
More broadly, we can also consider how the balance of accuracy (attribution) and engagement (specificity) for answering user queries is fundamental to all forms of SEO content, especially human generated.
It’s for these reasons that I think we as SEOs should (maybe) care about this Google research paper.
Of course, we’re still talking about Google DeepMind here, not Google Search.
Before we conclude, though, let’s see what Gemini thinks the research’s implications for SEO are, based on the discussion we’ve had about PLEDGE so far.
*Note: this is a theoretical exercise, not a set of predictions or instructions. 😉
1. The need to balance specificity and attribution
Gemini points out that the PLEDGE framework highlights a larger point about how content (including in search results, like SGE overviews) shouldn’t be just generic answers to queries.
Search engines, like Google, need to find ways to ensure their AI responses directly address a user’s query but in a conversational manner. (The same can arguably be said for the content we create to rank or surface in SGE for those queries.)
As for attribution, specifically, AI-generated responses (and content, in general) need to be well-supported by authoritative sources to ensure trustworthiness and avoid factual errors. (Again, this bears similarities to E-E-A-T principles.)
2. Why metrics matter for high-quality results
This quote from Gemini made me smile, so I’m going to give it full props:
“PLEDGE shows that the reliance on automated metrics can lead to models (or content creators) ‘gaming’ the metrics and creating content that scores high without truly being good.”
In my helpful content guide, I expressed my belief that an over-reliance on traditional, formulaic approaches to SEO content creation has led to an abundance of bland, search engine-first content, the kind that gives SEO a bad name with some of the public today.
At the heart of such formulaic content-creation approaches is usually a reliance on tools and metrics of all kinds, from keyword “difficulty” to “entity” scores to “reading level” scores and even “link” placements, to say nothing of ridiculous stuff like “keyword density,” all in place of human judgments.
Helpful content for users, and that has SEO value today, doesn’t happen because it scores well based on tools, in my opinion. It — arguably 😉 — manifests because of the creator’s imagination, expertise, and understanding of the audience’s goals.
That’s not to say I wasn’t guilty of creating such bland content years ago, though as an independent consultant, I’m trying to make up for it now. 🙂
3. Understanding human judgment
Related to the previous point, this other point from Gemini aligns with my belief that content creation should be a collaborative process involving SEO strategists, subject matter experts (SMEs), and content specialists.
“Content should be written in a way that’s easily understandable, helpful, and aligns with how users would naturally discuss or express ideas. The disconnect between automated metrics and human perception that PLEDGE highlights reinforces the need to create for humans, not algorithms.”
While SEO-first content isn’t great, sometimes things get taken in the opposite direction, relying solely on experts with the idea of meeting some type of “E-E-A-T” score (which doesn’t exist).
This type of content may have expertise, but that doesn’t mean it satisfies a user’s search intent, especially if it’s not engaging or easy to comprehend.
That’s why I like to have the SEO strategist identify content opportunities based on data (like search intent), and then have an SME and content specialist work together to create content that’s both informative and engaging, what we might call attribution and specificity. 😉
4. Search ranking systems
Although I tend to rely on Gemini for these SEO implication sections, because it has the context from the whole conversation — hey, that’s a relevant point for this discussion of PLEDGE 😉 — I do like this point from ChatGPT:
“Search engines like Google could incorporate similar models to PLEDGE to refine their algorithms, potentially focusing more on the balance between delivering relevant (specific) and accurate (attributable) content. This could affect how websites are ranked, pushing webmasters and SEO professionals to adopt more sophisticated content planning and creation strategies.”
I don’t think an integration of PLEDGE into Google Search is likely, but I do believe we can extrapolate, based on seeing how the DeepMind researchers dealt with a “trade-off” between attribution and specificity, that Google’s search ranking systems have to balance a lot of competing considerations, likely involving trade-offs.
It’s quite possible, for example, that just as the DeepMind researchers used dropout on certain inputs to achieve different metrics, we can envision, bigger picture, how Google’s search algorithms might make adjustments (or trade-offs) that cause the ranking fluctuations and SERP volatility that we regularly observe or associate with announced updates.
That also makes me think, in reference to the limitations section of the PLEDGE research paper, about just how complex the process of judging content quality must be.
The qualities used in the PLEDGE research paper for knowledge-grounded dialogue systems (attribution and specificity) were just some of many that could matter to good responses, as the researchers pointed out.
Now just imagine what it’s like for Google Search to determine the quality (helpfulness, reliability, relevance, etc.) of the content across its web index for all user queries.
Till next time …
I hope you’ve had some fun reading this week’s Hamsterdam Research article!
Feel free to comment with your thoughts or feedback or contact me.
So far, we’ve had a nice mix of topics, from interpretability to time series forecasting, active speaker detection, understanding text in images, and now conversational AI-model responses.
Taken all together, I think we’re building toward a more holistic knowledge of AI concepts that can help us better understand the future of search and SEO.
Stay tuned for another article next week (or check out more posts below).
Until next time, enjoy the vibes:
Thanks for reading. Happy optimizing! 🙂
Previous Hamsterdam Research articles
Could Google Soon Understand Text in Your Images? Exploring Hierarchical Text Spotter (HTS) (via Google Research) & Why SEOs Should Care (Maybe)
We’ll look at Google Research’s “Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis” and why SEOs should care (maybe).
How Large Scale Self-Supervised Pretraining for Active Speaker Detection Works (via Google Research) & Why SEOs Should Care (Maybe)
In this rendition of Hamsterdam Research, we look at a Google Research paper on large scale self-supervised pretraining for active speaker detection (ASD) and its (possible) implications for SEO work and video-based search results.
How AutoBNN Automates the Discovery of Interpretable Time Series Forecasting Models & Why SEOs Should Care (Maybe)
This is the second Hamsterdam Research article, which covers AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks.