What I Learned from Jo Bergum’s “Back to Basics for RAG” Talk That’s Useful Context for SEO Today
By Ethan Lazuk

Truth be told, I thought all of “Back to Basics for RAG w/ Jo Bergum” (👈 YouTube video) was useful context. But I had to mention “SEO” in the main heading, right? 😅
Whether you know RAG inside and out or you associate it foremost with doing the dishes 🧽 or wiping grease from your hands after working under the hood of a car 🧑🔧, I go into a lot of detail in this post, so I think you’ll find something worthwhile.
Or just ask an LLM to summarize it. 😂
RAG (retrieval augmented generation) is a main component in many aspects of today’s AI chatbots, agents, and other (LLM-driven) tools.
It’s also an important concept for modern SEO — given how it powers results in AI Overviews, Copilot with Bing, and Perplexity. 👈
By learning about the “R” part of RAG, we can retrieve some helpful information — see what I did there — for contextualizing our understanding of search engine results, including rankings in Google and Bing.
If you’re not very familiar with this topic or the speaker, here’s some helpful context. 🥱
🏃➡️ Runs to a thesaurus (aka ChatGPT 🤖):

If you’re not familiar with this topic or the speaker, here’s some helpful background. 🙌
While scrolling on social last week, I came across this tweet by Hamel Husain featuring a talk on RAG Basics by Jo Kristian Bergum:
Feel free to watch that video now, but we’ll also go over it shortly. 🫡
If you’re a member of SEO Twitter (either now or historically), you may or may not know about Hamel (@HamelHusain) and Jo (@jobergum) already.
They’re active in what I’d call “ML Twitter” — but please, correct me if that label is wrong. 😇
In full disclosure, I’ve only recently discovered their content, but I’m finding it helpful to learn and get ideas from. 🙌
That above tweet, for example, led me to discover that “ICL” refers to in-context learning, where LLMs learn to perform tasks by being given examples within the prompt, rather than needing explicit training on the tasks. (And thanks to Discover, I also just read Jo’s latest blog post on it.)
Jo is a Distinguished Engineer at Vespa.ai, “the open source platform for combining data and AI, online” (per X bios). (You’ll learn more about his background in the talk, as well.)
Hamel is a researcher and independent developer focusing on LLMs. He’s also featured in the early parts of the talk, based on the full transcript (stay tuned).
For the talk (which we’ll be discussing shortly, I promise), Hamel included this in the abstract:
“This talk will cover the fundamentals of information retrieval (IR) and the failure modes of vector embeddings for retrieval and provide practical solutions to avoid them.”
– Excerpt from the abstract. (My highlights.)
The YouTube description also links to related educational resources from Parlance Labs, where there’s a transcript and slides, as well.
Unformatted YouTube transcripts are a pain. Sharing all of these resources for a talk, this is how it’s done. 🤘
⚡️ In short, Jo’s RAG Basics talk covers information retrieval (IR) topics that I found interesting from an SEO’s perspective.
He even references the API Content Warehouse leak. 📚 🌊
Oh, so now I’ve got your attention? 😆
📢 PSA: If you’re not an SEO or familiar with the field today, let me state, on the record, that I consider SEO to be about satisfying users in support of business goals.
It’s about enhancing your content’s relevance, quality, and technical delivery to (justifiably) increase its organic visibility along qualified user journeys. 🔥
Attempts at manipulating search rankings or relevance signals to earn unwarranted traffic (let alone creating plain old spam) ultimately will just disappoint users and diminish your brand’s value long term. 🙅
Feel better?
😌
Cool. Now that we have that context … background …

With that foundation, let’s now go through “Back to Basics for RAG w/ Jo Bergum” to extract helpful info from an SEO perspective. 🕺
The video runs about 45 minutes, and its transcript has 15 thematic chapters.
I’ve consolidated those (well, ChatGPT helped) into six topics:
- Foundations
- Retrieval and labeling techniques
- Performance metrics and industry standards
- Advanced evaluation with LLMs
- Representational approaches and practical applications
- Conclusion
We’ll review each of these now. 🙌
After these quick notes:
*Attribution: all of the quotes and images you’ll see below are from the transcript, slides, or YouTube video, unless otherwise stated. All highlights are mine. I’ll cite everything and link to the resources again at the end, as well. 🤝
*I’ll also add related notes (📝) with SEO or other context where helpful.
*Lastly, while this is my summary of the talk, from an SEO’s perspective, I encourage you to watch the full video (before or after this) to hear it directly from the speaker. ✌️
1. Foundations. 👷
This section includes the first 1m and 19s of the talk (starting at 8:56 in the full transcript).
It covers the 1st of the 15 chapters, “Introduction and Background.”
To begin, Jo introduces himself and mentions that he’s worked at Vespa.ai for 18 years and been in the search and recommendation space for about 20 years.
He explains that Vespa.ai is a serving platform spun out of Yahoo! and has been open source since 2017.
He’s also a fan of sharing memes on Twitter:

I’m not fluent in ML yet, but I got a kick out of seeing Gemini mention “Pooh in a suit” — Pooh was my first stuffed animal as a kid, which, by the way, I still have 🧸 🖍️:

Jo also introduces the key topics of the talk:
- Stuffing text into the language model prompt.
- Information retrieval, with a focus on the “R” in RAG.
- Evaluating IR systems and building custom evaluations.
- Representational approaches for IR, including BM25, vector embeddings, and baselines.
So, those are what you have to look forward to learning more about. 🙌
2. Retrieval and labeling techniques. 🏷️
This section goes from around 1:19 to 8:22 in the talk.
It covers three chapters, including “RAG and Labeling with Retrieval,” “Evaluating Information Retrieval Systems,” and “Evaluating Document Relevance.”
Jo explains that RAG is popular for question-answering and search, but it can also be used to build labelers or classifiers by retrieving relevant examples from training data.
Retrieval (the “R” in RAG) can involve fetching relevant annotated training examples that LLMs can reason around and predict labels from.
It can also fetch relevant context for open-ended questions — like in the models used by chatbots — to improve the accuracy of generated responses (although, it’s unlikely to be entirely hallucination free).
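📝 To make that concrete, here’s a minimal sketch of retrieval-backed labeling (my own illustration, not code from the talk). The retrieve_similar_examples and call_llm functions are hypothetical stand-ins for whatever retrieval stack and LLM client you use:

```python
def label_with_retrieval(text, retrieve_similar_examples, call_llm, k=5):
    """Predict a label for `text` by showing the LLM similar annotated examples.

    retrieve_similar_examples and call_llm are hypothetical stand-ins for
    your own retrieval stack and LLM client.
    """
    examples = retrieve_similar_examples(text, k=k)  # [(example_text, label), ...]
    shots = "\n\n".join(f"Text: {ex}\nLabel: {lab}" for ex, lab in examples)
    prompt = (
        "Using the labeled examples as guidance, label the final text.\n\n"
        f"{shots}\n\nText: {text}\nLabel:"
    )
    return call_llm(prompt).strip()
```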

He also lists the multiple components of a RAG system:
- Orchestration
- Input
- Output
- Evaluation
- Prompting
- State management (files, search engines, vector databases)
📝 The RAG model behind Copilot with Bing is Prometheus, featuring the “Bing Orchestrator”:

Jo explains that while there’s a lot of hype around RAG and new methods, it’s important to evaluate the effectiveness on specific datasets. 👈
A successful RAG solution requires more than vector embeddings and language models; it also needs a sophisticated retrieval stack.
“I think there’s been like, RAG is only about taking a language model from, for example, OpenAI, and then you use their embeddings, and then you have a magical search experience, and that’s all you need. And I think that’s naive to think that you can build a great product or a great RAG solution in that way just by using vector embeddings and the language models.”
– 4:28
This is where the evaluation of IR systems comes into play. 🧑⚖️
“There are a lot of people, the brightest minds, that have actually spent a lot of time on retrieval and search, right? Because it’s so relevant across many kind of multi-billion companies like recommendation services, search like Google, Bing, and whatnot. So this has kind of always been a very hot and interesting topic. And it’s much deeper than encoding your text into one vector representation and then that’s it.”
– 5:10
📝 Perplexity’s CEO mentioned the limitations of vector embeddings and the enduring value of traditional IR methods recently in an interview with Lex Fridman.
Evaluating an IR system includes querying the system, retrieving a ranked list of documents, and then judging their quality based on relevance to the query.
“And this is kind of independent if you’re using what kind of retrieval method you’re using or combination or hybrids or face ranking or Colbert or Spade (sic) or whatnot. You can evaluate any type of system. If it’s using NumPy or files or whatnot, it doesn’t really matter.”
– 6:20
Some of that terminology might sound foreign to the SEO lexicon.
Let’s break it down. 🙌
Retrieval can refer to search engines, but it also applies to information systems, in general, including databases.
A retrieval method is the technique used to find and access specific information from the system.
Jo mentions different retrieval methods:
- Combinations: multiple methods used together.
- Hybrids: similar to a combination but with a more complex blending of techniques.
- “Face” ranking, which is almost certainly “phased ranking” (the transcript later renders “ranking phases” as “ranking faces” too): multi-stage ranking where a fast first pass retrieves candidates that more expensive rankers then re-score.
He also mentions a couple of specific retrieval frameworks. 🧱
ColBERT (Contextualized Late Interaction over BERT) is a neural network-based method for finding relevant info in large datasets. 🤖
SPLADE (SParse Lexical AnD Expansion), meanwhile, is a neural information retrieval model that offers another way to represent text as a sparse vector, with learned weights instead of TF-IDF-style ones. (H/T Jo on X for the correction.)
Jo also mentioned NumPy, which is a popular Python library used for numerical calculations and working with arrays (collections of data). 🧮
“The basic idea of building a such (sic) system is that you take a query and you retrieve those documents, and then you can have a human annotator, for example, to judge the quality of each of the documents. And there are different ways of doing this.”
– 6:39
Human annotators. ✍️ Sound familiar?

Here’s where it gets extra interesting …
Jo mentions different types of annotation (binary and graded), datasets for retrieved results (TREC, MS Marco, and BEIR Benchmark), as well as metrics for document relevance. 👈
This is a tough part to summarize. 🤕 🥊
Because it has a lot of great info. 🤗
I’ll paste in chunks from the transcript, add details in a few places, and include notes (📝) where helpful.
Let’s start with human annotators and evaluating IR systems.

“We can do it by the binary judgment, saying that, OK, this document is relevant for the query or not. Or we can have a graded judgment where you say, okay, zero means that the document is irrelevant for the query, and one, it’s slightly relevant, or two is highly relevant. And we can also use this to judge the rank list that are coming out of recommendation systems or personalization and many different systems that are producing a rank list. And in information retrieval, this is going back decades. And there are a lot of researchers working on this.”
– 6:39
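📝 These judgments are often stored as a simple qrels-style lookup of (query, document) pairs. Here’s a hypothetical sketch (the IDs and grades below are made up):

```python
# Graded judgments: 0 = irrelevant, 1 = slightly relevant, 2 = highly relevant.
# (For binary judgments, you'd just use 0 and 1.)
judgments = {
    ("q1", "doc_12"): 2,
    ("q1", "doc_47"): 1,
    ("q1", "doc_93"): 0,
    ("q2", "doc_05"): 1,
}

def grade(query_id, doc_id):
    # Unjudged documents are commonly treated as irrelevant (an assumption, not a rule).
    return judgments.get((query_id, doc_id), 0)
```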
Jo next explains how to assess the effectiveness of IR systems on relevancy datasets and then discusses ranking metrics (in the next section). 👀

In case you’re on mobile (🤳) and squinting at that slide, never fear! 😮💨
Let’s break down the common IR datasets on the left. 🙌
TREC (Text Retrieval Conference) “spans multiple different topics each year, news retrieval, all kinds of different retrieval tasks.”
MS Marco (for passage/document ranking) is “from Bing, actually real world data, which is annotated.” It’s a large dataset that “a lot of these embedding models are trained on.”
BEIR (Benchmarking Information Retrieval) is a collection of diverse datasets that covers many domains (like news, biomedical research, social media, etc.) and tasks (like fact-checking, question-answering, and document ranking) to “evaluate types of models without actually using the training data, but this is like in the zero-shot setting.”
On that last point, zero-shot refers to evaluating models on tasks they haven’t been specifically trained on yet. Instead of fine-tuning pre-trained models, researchers can test how well the models generalize to new data and tasks, similar to what’s encountered in real-world scenarios.
📝 Google Translate recently added 110 new languages “using Zero-Shot Machine Translation, where a machine learning model learns to translate into another language without ever seeing an example.”
3. Performance metrics and industry standards 🎯
This section covers from around 8:01 to 12:41 of the talk.
It includes the chapters “Metrics for Retrieval System Performance” and “Reciprocal Rank and Industry Metrics.”
Jo first gets into ranking metrics (the right-hand info in Slide 12 above).
“So there are many different collections and then there are metrics that can measure how well the retrieval system is actually working.”
– 8:01
He first mentions Recall@k, where “k” means “a position in the in the ranking list.” 👈
Recall@k measures the proportion of relevant items retrieved within the top k results, especially for recommender systems.
“So K could be for example 10 or 20 or 100 or 1000 and it’s a metric that is focusing about, you know, you know that there are like six items that are relevant for this query. And are we actually retrieving those six relevant documents into the to the top K?
In most systems, you don’t actually know how many relevant documents there are in the collection. In a web scale, it might be millions of documents that are relevant to the query. So unless you have a really good control of your corpus, it’s really difficult to know what are the actually relevant documents.“
– 8:33
He then mentions Precision@k, which “is much easier because we can look at those results and say, ‘Are there any irrelevant hits in the top K?’,” albeit, “it’s not really rank aware.”
As Slide 12 pointed out, Recall@k focuses on the system’s ability to find all relevant documents, while Precision@k measures its ability to return only relevant documents.
“So it’s not bothering if the missing or irrelevant hit is placed at position one or 10. The precision at 10 would be the same. It doesn’t necessarily depend on the position.“
– 9:16
📝 Kind of sounds like how AI Overviews (SGE) sources get selected, in terms of how they don’t typically align with the rankings of traditional organic results for the same query.
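📝 Here’s a minimal sketch of both metrics (my own code, not from the talk):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """What share of all relevant docs made it into the top k?"""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """What share of the top k is relevant? (Position within the top k is ignored.)"""
    return sum(1 for d in ranked_ids[:k] if d in set(relevant_ids)) / k

ranked = ["d9", "d2", "d7", "d4", "d1"]    # system output, best first
relevant = ["d2", "d4", "d8"]              # judged relevant for the query
print(recall_at_k(ranked, relevant, 5))    # 0.666... (2 of the 3 relevant docs retrieved)
print(precision_at_k(ranked, relevant, 5)) # 0.4 (2 of the top 5 are relevant)
```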
He also mentions nDCG@k (Normalized Discounted Cumulative Gain at k), which is a “very complicated metric, but it tries to incorporate the labels, so the graded labels, and also awareness of the rank position.”
In sum, nDCG@k considers both the (graded) relevance of retrieved items and their position in the ranking, rewarding systems that place the most relevant items near the top, on the assumption that position drives overall user satisfaction.
📝 That kind of sounds more like traditional search results (SERP rankings).
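📝 And a rough nDCG@k sketch (using the common formulation with the raw grade in the numerator; a 2^grade - 1 variant also exists):

```python
import math

def dcg_at_k(grades, k):
    """grades: graded labels (e.g., 0/1/2) in the order the system ranked them."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    ideal = dcg_at_k(sorted(grades, reverse=True), k)  # best possible ordering
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# A highly relevant doc (grade 2) buried at position 3 costs us versus the ideal ordering.
print(ndcg_at_k([1, 0, 2, 0, 1], k=5))  # ≈ 0.76
```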
He then mentions Reciprocal rank, which focuses solely on the position of the first relevant item in a ranked list of results.
In other words, “Where is the first relevant hit in the position?”
“So if you place the relevant hit at position 1, you have a reciprocal rank of 1. If you place the relevant hit at position 2, you have a reciprocal rank of 0.5.“
– 10:02
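📝 Reciprocal rank in a few lines (again, my own sketch):

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant hit (0 if none show up)."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / position
    return 0.0

print(reciprocal_rank(["d3", "d8", "d1"], {"d8"}))  # first relevant hit at position 2 -> 0.5
```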
Jo also mentions LGTM@10, where LGTM is “looks good to me.”
He called it, “Maybe the most common metric used in the industry,” which puzzled me (🧩) because I was struggling to figure out why Gemini hadn’t heard of it. (Dare I say, epistemic uncertainty?)
After some good old-fashioned Googling, though, I realized LGTM is a joke.
Gemini was impressed by my information retrieval skills. 😂

And so, “even in the age of advanced algorithms and machine learning models,” as Gemini calls it, human judgment and manual review (and, occasionally, searching ability) still play a crucial role. ✌️
Lastly, Jo mentions industry ranking metrics, including a few words familiar to the SEO lexicon, like “click” and “dwell.”
“And of course, also in industry, you have other evaluation metrics like engagement, click, if you’re measuring what actually users are interacting with the search, dwell time or e-commerce, add to chart (sic), all these kind of signals that you can feedback.
Of course, revenue, e-commerce, search, for example, it’s not only about the relevancy, but also you have some objectives for your business. I also like to point out that most of the benchmarks are comparing just a flat list. And then when you’re evaluating each of these queries, you get a score for each query. And then you take the average to kind of come up with an average number for the whole kind of retrieval method. But in practice, in production systems, you will see that maybe 20% of the queries actually is contributed like 80% of the volume.”
– 10:18
📝 The mention of 20% of queries contributing 80% of the volume is very interesting, because Perplexity’s head of search, Alexandr Yarats, essentially said the same thing in a recent interview: “This is a power law distribution, where you can achieve significant results with an 80/20 approach.”
4. Advanced evaluation with LLMs. 🏆
This section covers from around 11:28 to 17:04 of the talk.
It includes two chapters, “Using Large Language Models for Judging Relevance” and “Microsoft’s Research on LLMs for Evaluation.”
It’s important to note, Jo explains, that RAG and IR evaluations can be situational.
“You really have to measure how you’re doing. And since we have all these benchmarks, MTAB and whatnot, they don’t necessarily transfer to your domain or your use case. If you’re building a RAG application or retrieval application over code, or documentation, or a specific health domain, or products, because there are different domains, different use cases. Your data, your queries.
The solution to do better is to measure and building your own relevance to dataset. … look at what actually users are searching for, and look at the results, and put in a few hours, and judge the results. Is it actually relevant? …
Or you can also ask a large language model to present in some of your content, and then you can ask it, okay, what’s a question that will be natural for a user to retrieve this kind of passage? So you can kind of bootstrap even before you have any kind of user queries. And as I said, it doesn’t need to be fancy.”
– 11:28
📝 I mention this excerpt from Google’s anti-trust post-trial debrief quite often, but that last “bootstrap” remark made me think about the reference to Neeva and how “advances in the ability of computers to understand language ‘could be used as a short circuit to make ranking better.’”
“Preferably, you will have like a static collection,” Jo explains:
“When you are judging the results and you’re saying that for this query, for instance, query ID 3 and the document ID 5, you say that, oh, this is a relevant one. When we are judging the kind of, or computing the metric for the query, if there’s a new document that is suddenly appearing, which is irrelevant or relevant, it might actually change how we display things in the ranking without being able to pick it up. So that’s why you preferably have these kind of static collections.”
– 14:02
He also discusses using “language models to judge the results.”
Specifically, Jo mentions “research coming out of Microsoft and Bing team for over the last year, where they find that with some prompting techniques that they actually can have the large language models be… pretty good at judging query and passages.”
📝 A bit of a long note here, but a good one: Matt McGee wrote about Bing’s human search quality raters for SEL in 2012. Fast forward to 2023, and Dawn Anderson wrote in SEL about, “The crowd is made of people: Observations from large-scale crowd labelling,” a 2022 Microsoft Research paper that delves into the problems of “crowdsourced labels,” like how they “are prone to disagreements, spam, and various biases which appear to be unexplained ‘noise’ or ‘error.’” In that same article, she referenced, “Large Language Models can Accurately Predict Searcher Preferences,” a 2023 Microsoft paper that explores “an alternative approach” to human-annotated relevance labels. Instead, they “take careful feedback from real searchers and use this to select a large language model,” where “the LLM can then produce labels at scale” and perhaps be “as accurate as human labellers and as useful for finding the best systems and hardest queries.”
Jumping back to the present, Jo referenced “Large Language Models can Accurately Predict Searcher Preferences,” a “very recent paper” published in May 2024 by Microsoft where “they also demonstrated that this prompt could actually work very well to assess the relevancy of queries.”

📝 That’s an image, so the play button won’t work. 😅 But “Source” will take you to the video.
That research’s findings could “free us from having this kind of static golden data sets,” Jo explains, “because we could start instead sampling real user queries, and then ask the language model to evaluate the results.”
Golden datasets 👑 are manually curated and labeled, making them a gold standard for evaluating IR systems. Involving LLMs could bypass the need for them.
“I think this is a very interesting direction,” Jo remarks. 👈
He also reiterates that the results of a retrieval system can be evaluated independently of the retrieval method used.
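📝 Purely as an illustration (this is not the prompt from the Microsoft paper), an LLM-as-judge loop can be as simple as the sketch below, where call_llm is a hypothetical stand-in for your LLM client:

```python
JUDGE_PROMPT = """You are judging search relevance.
Query: {query}
Passage: {passage}
On a scale of 0 (irrelevant), 1 (slightly relevant), 2 (highly relevant),
how relevant is the passage to the query? Answer with a single digit."""

def judge_relevance(query, passage, call_llm):
    answer = call_llm(JUDGE_PROMPT.format(query=query, passage=passage))
    return int(answer.strip()[0])  # naive parsing, fine for a sketch

# Sample real user queries, run them through your retrieval stack, and feed each
# (query, passage) pair through judge_relevance to build relevance labels at scale.
```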
5. Representational approaches and practical applications. 🏌️
This section spans from around 17:04 until 29:10.
It covers five chapters, including “Representational Approaches for Efficient Retrieval,” “Sparse and Dense Representations,” “Importance of Chunking for High Precision Search,” “Comparison of Retrieval Models,” and “Real World Retrieval: Beyond Text Similarity.”

Jo first expands on the “representational approaches and scoring functions that can be used for efficient retrieval.”
A representational approach refers to how information (text, images, etc.) is transformed or encoded into a format that computers can easily process and compare. 🦾
The goal is to create representations that capture the meaning and relationships within the data while still allowing the system to find relevant info easily.
📝 One example is sparse vs. dense vector representations. We often think of vectors as pertaining only to semantic search, but that’s a misconception. 🤯 BM25, for example, works over sparse vectors (term weights based on frequency statistics like TF-IDF, not meaning) for lexical search (generally). In contrast, BERT creates dense vector embeddings (capturing underlying word meanings and relationships) for semantic (or vector) search (generally). (I also just realized Jo’s talk touches on this a bit later. 😅 Stay tuned.)
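📝 As a toy illustration (mine, not Jo’s), here’s what the two representations look like as data for the same snippet of text:

```python
# Sparse (BM25/TF-IDF style): a weight only for terms that actually occur.
sparse = {
    "brake": 2.1,
    "pads": 1.8,
    "squeaking": 3.4,
}

# Dense (embedding model output): a fixed-length vector of floats encoding meaning.
dense = [0.12, -0.87, 0.45, 0.03]  # real embeddings have hundreds of dimensions
```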
On the “efficient retrieval” aspect, Jo mentions how the motivation for using a representational approach is to “try to avoid scoring all the documents in the collection.”
“[S]ome of you might heard about Cohere re-ranking service or this kind of ranking services where you basically input the query and all the documents and they go and score everything, but then you have everything in memory, already retrieved the documents. And imagine doing that at the web scale or if you have 100 million documents, it’s not possible, right? And it’s also similar to doing a grep.
So instead, we would like to have some kind of technique for representing these documents so that we can index them so that when the query comes in, that we efficiently can retrieve over this representation and that we efficiently in sublinear time can retrieve the kind of top ranked docs. And then we can feed that into subsequent ranking faces.”
– 17:18
There are a few terms in that explanation that I wasn’t familiar with.
In case you’re in the same boat (🛥️), let’s break them down. 🙌
Cohere is a company that offers a re-ranking service for search results. In short, you provide a query and a set of candidate documents, and their system scores each document’s relevance so you can re-order the results. The catch Jo points to is scale: every candidate has to be retrieved and held in memory so it can be scored, which isn’t feasible across a whole corpus of millions of documents, so an efficient retrieval step still has to come first.
Grep is a command-line tool common among developers and system administrators for searching plain-text data for lines that match a specific pattern. While useful for smaller tasks, it would be inefficient at scale to scan through all documents to find relevant ones.
Jo also mentioned sublinear time. An algorithm runs in sublinear time when its runtime grows more slowly than the input size, so, roughly speaking, doubling the corpus doesn’t double the query time. ⏱️ That’s essential for IR systems that need to scale, especially web search engines.
And here we are, where Jo gets into representational approaches (sparse and dense).
“And there are two primary representations, and that is the sparse representation, where we basically have the total vocabulary is kind of the whole sparse vector representation that you potentially take, but for a given query or a given document, only the words that are actually occurring in that document or in that query have a non-zero weight. And this can be efficiently retrieved over using algorithms like Weekend (sic) or MaxScore and inverted indexes. You’re familiar with Elasticsearch or other kind of keyword search technologies. They build on this.”
– 18:33
I’ll quickly explain some of the terms he mentioned.
WAND (short for “Weak AND,” sometimes “Weighted AND”) is a query-evaluation algorithm for inverted indexes. It uses upper bounds on how much each query term can contribute to a document’s score to skip over documents that can’t possibly make the top results.
MaxScore is a related dynamic-pruning technique. Using each query term’s maximum possible score contribution, it avoids fully scoring documents (and portions of posting lists) that can’t reach the current top-k threshold.
An inverted index is the fundamental data structure used in most search engines that maps terms (words or phrases) to the documents that contain them. For each term, the index stores a list of document IDs where that term appears, plus extra info like term frequency and positions. 👈
Lastly, Elasticsearch is an open-source search and analytics engine built on Apache Lucene. It relies heavily on inverted indexes for scalable search capabilities, and can also implement WAND and MaxScore as optimizations to improve performance.
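📝 A tiny inverted-index sketch (my own illustration; real indexes also store term frequencies and positions):

```python
from collections import defaultdict

docs = {
    1: "vespa is a serving platform",
    2: "bm25 is a strong baseline",
    3: "vespa supports hybrid retrieval",
}

index = defaultdict(list)  # term -> list of doc IDs containing it
for doc_id, text in docs.items():
    for term in set(text.split()):
        index[term].append(doc_id)

print(index["vespa"])  # [1, 3] -> only these docs ever need scoring for "vespa"
```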
Is this too much vocabulary? 🤷
Don’t worry, I’ll get briefer, except for this next one, because it’s important. 😂
Enter dense representations. 🚪
“More recently, we also have using neural or kind of embedding or sparse embedding models so that instead of having an unsupervised weight that is just based on your corpus statistics, you can also use transformer models to learn the weights of the words in the queries and the documents. And then you have dense representations, and this is where you have text embedding models, where you take some text and you encode it into this latent embedding space, and you compare queries and documents in this latent space using some kind of distance metric.”
– 18:53
📝 We mentioned BERT earlier. The “use of transformer models,” the “T” in BERT, T5, or GPT, has revolutionized natural language processing (NLP). And the point about “compare queries and documents in this latent space using some kind of distance metric” is super important. In short, your content doesn’t need to be full of keywordese or stylistically bland. 🛑 Write naturally, and artfully, and if the context is clear, the information is high quality, and you satisfy the search intent, then you’ll likely rank for those queries. Redditors aren’t paying attention to keywords (the real ones, anyway). I guarantee. 🙂↕️
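📝 To make “compare queries and documents in this latent space using some kind of distance metric” concrete, here’s a toy cosine-similarity comparison (the vectors are made up, and real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.7, 0.1, 0.5])
doc_vecs = {
    "doc_a": np.array([0.1, 0.8, 0.0, 0.4]),
    "doc_b": np.array([0.9, 0.1, 0.6, 0.0]),
}

scores = {doc: cosine_similarity(query_vec, vec) for doc, vec in doc_vecs.items()}
print(max(scores, key=scores.get))  # doc_a, the closest document in the latent space
```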
The concept of tradeoffs, I think, is also critical to appreciate when it comes to IR.
Jo touches on tradeoffs in the talk:
“And there you can build indexes using different techniques, vector databases, different types of algorithms. And in this case, also, you can accelerate search quite significantly so that you can search even billion scale data sets in milliseconds, single credit. But the downside is that there are a lot of tradeoffs related to that the actual search is not exact. It’s an approximate search. So you might not retrieve exactly. The ones that you would do if you did a brute force search over all the vectors in the collection.”
– 19:30
📝 Aravind Srinivas also mentioned tradeoffs in the context of making the model or retrieval stage better in his interview with Lex Fridman.
Jo also mentions the democratization of quality embeddings given today’s language models:
“And it’s no longer like a zero-shot or transfer learning, but it’s still like a learned representation. And I think these representations and the whole ChatGPT OpenAI, ChatGPT language model, OpenAI embeddings really opened the world of embeddings to a lot of developers.”
– 20:22
That said, he explains how “there are some challenges with these text embedding models, especially because of the way they work.”

“Most of them are based on a kind of encoder style transformer model where you take the input text, you tokenize it into a fixed vocabulary. And then you have previously in the pre-training stage and the fine tuning stage, you have learned representations of each of these fixed tokens. Then you feed them through the encoder network. And for each of the input tokens, you have an output vector. And then there’s a pooling step, typically averaging into a single vector representation. So this is how you represent not only one word, but a full sentence.”
– 21:47
So what’s problematic here? 🤨
Given that the model relies on a pre-defined vocabulary, unrecognized words can lead to errors or misunderstandings (think technical jargon, new slang, or even misspellings).
The datasets used for training also might not fully represent specific domain knowledge, and while fine-tuning can help, it’s still limited by the quality and quantity of available data, leading to biases in representations.
The pooling operation (like averaging tokens to represent sentences) can also oversimplify complex sentence structures, losing nuance and creating less accurate representations.
Or as Jo explains it in the talk:
“The issue with this is that the representation becomes quite diluted when you kind of average everything into one vector, which has proven not to work that well for high precision search. So you have to have some kind of shunking mechanism in order to have a better representation for search. And this fixed vocabulary, especially for BERT-based models, you’re basing it off a vocabulary that was trained in 2018. So there are a lot of words that it doesn’t know.
So we had one issue here with a user that was searching for our recently announced support for running inference with GGF models in Vespa. And this has a lot of out-of-word, oh sorry, out-of-vocabulary words. So it gets maps to different concepts, and this might produce quite weird results when you are mapping this into the latent embedding space. And then there’s the final question is, does this actually transfer to your data, to your queries?“
– 22:23
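📝 The pooling step Jo describes is often just an average over the per-token output vectors. A minimal sketch (the sizes are made up):

```python
import numpy as np

token_vectors = np.random.rand(128, 384)      # 128 tokens x 384-dim encoder outputs
sentence_vector = token_vectors.mean(axis=0)  # mean pooling: one 384-dim vector for the whole passage

# Handy for classification, but detail gets diluted: very different passages
# can average out to similar vectors, which hurts high-precision search.
```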
Jo then mentions “evaluation routines” to “actually test if they’re working or not,” as well as “baselines.”
In the IR community, he explains, “the kind of de facto baseline is BM25,” which is “this scoring function where you tokenize the text, linguistic processing, and so forth.”
He mentions a library for BM25 in Python that “builds a model, kind of model, unsupervised from your data, looking at the words that are occurring in the collection, how many times it’s occurring in the data, and how frequent the word is in the total collection.”
📝 I’ve only started using Python, but I believe the library referenced is “Rank-BM25: A two line search engine.”
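📝 If that’s the one, basic usage looks roughly like this (my own toy corpus):

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "vespa is a serving platform for search and recommendation",
    "bm25 is a strong baseline for retrieval",
    "dense embeddings map text into a latent space",
]
tokenized = [doc.split() for doc in corpus]  # naive whitespace tokenization

bm25 = BM25Okapi(tokenized)  # unsupervised: the weights come from corpus statistics
query = "strong retrieval baseline".split()
print(bm25.get_scores(query))              # one BM25 score per document
print(bm25.get_top_n(query, corpus, n=1))  # ['bm25 is a strong baseline for retrieval']
```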
He also notes how researchers using a “vanilla BM-25 implementation” were beating OpenAI embeddings.
“BM-25 can be a strong baseline, I think that’s an important takeaway,” he says. 👈
Jo also mentions “a hybrid alternative … where you can combine these representations, and it can overcome this kind of fixed vocabulary issue with regular embedding models.”
That said, there’s “not also a single silver bullet,” he explains, as “It really depends on the data and the type of queries.”
That’s why it’s important to “actually evaluate and test things out, and you can iterate on it.”
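📝 One simple way to blend keyword and embedding signals, purely as an illustration (not Jo’s recipe), is to normalize each score list and take a weighted sum:

```python
def normalize(scores):
    """Min-max normalize a doc_id -> score dict to the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(bm25_scores, embedding_scores, alpha=0.5):
    """alpha weights the keyword side; both inputs map doc_id -> score."""
    b, e = normalize(bm25_scores), normalize(embedding_scores)
    return {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * e.get(doc, 0.0)
            for doc in set(b) | set(e)}
```

Reciprocal rank fusion is another popular way to do the same kind of blending, and, as Jo says, what actually works depends on your data and your queries.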
Chunking means splitting documents into smaller, more manageable pieces of text so a language model can process them within its context-window and computational limits. Despite everyone wanting “to get rid of chunking,” Jo says, it’s still necessary:
“So I think that’s really critical to be able to build a better rag to improve the quality of the retrieval phase. Yeah, and of course, I talked about long context and that the long context models, we all want to get rid of chunking. We all want to get rid of all the videos about how to chunk.
But the basic kind of short answer to this is that you do need to chunk in order to have meaningful representations of text for high precision search.”
– 26:53
Jo references Nils Reimers, “the de facto embedding expert,” as saying “that if you go about 250, so 256 tokens, you’re starting to lose a lot of precision, right?”

📝 Nils Reimers is the Director of Machine Learning at cohere.ai. Not to be confused with Niels Reimers, the founder and former director of Stanford’s Office of Technology Licensing (OTL). 🖐️ 😅
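📝 For what it’s worth, a naive chunker is only a few lines. This sketch splits on words as a stand-in for real model tokens and defaults to roughly the 256-token ballpark Reimers mentions:

```python
def chunk(text, max_tokens=256, overlap=32):
    """Split text into overlapping chunks (word counts stand in for model tokens here)."""
    words = text.split()
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), step)]
```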
Jo also mentions how there are “other use cases that you can use these embeddings for, like classification,” but that for high precision search, “it becomes very diluted because of these pooling operations.”
With that, we’re getting close to the finish line. 🏁
And now for the part I’m sure many SEO readers have been waiting for, the leak.
Oops. Not that leak.
This one:

📝 If you want to read about that from the SEO side, check out Hamsterdam Part 60 🐹 for related tweets, or here’s a link to the SER article shown in Jo’s slide above.
In this post, though, we’re more concerned with Jo’s interpretations of the leak in the context of real-world RAG.
In short, it demonstrates that search is complicated:
“In the real-world search, it’s not only about the text similarity. It’s not only about BM25 or a single vector cosine similarity. There are things like freshness, authority, quality, page rank you heard about, and also revenue. So there are a lot of different features. And GBDT is still a simple, straightforward method. And it’s still the kind of king of tabular features where you have… specific name features and you have values for them. So combining GBDT with these kind of new neural features is quite effective when you’re starting to actually operate in the real world.“
– 28:35
And what, pray tell, is GBDT? 👇
It stands for Gradient Boosted Decision Trees, an ML algorithm used for ranking or classification tasks. 🌲
In the real world of search, there are many factors at play. 👈
📝 As Roger Montti wrote in SEJ this week, “search (and SEO) is multidimensional.”
Jo points out that GBDT is good at handling these diverse factors, which can be represented as tabular features (with specific names and values).
It combines the predictions of multiple decision trees, which each focus on different aspects of the data, to create a more accurate ranking overall. 🙌
Combining GBDT with neural network-based features (neural features) can be powerful: traditional and modern techniques blended for real-world search ranking. 🌪️
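📝 As a loose illustration of that blend (assuming LightGBM’s LGBMRanker API; every feature name and number below is invented):

```python
# pip install lightgbm
import numpy as np
from lightgbm import LGBMRanker

# One row per (query, document) pair. Columns mix traditional tabular signals
# with a neural feature (embedding similarity):
# [bm25_score, freshness_days, pagerank, embedding_similarity]
X = np.array([
    [12.3,  2.0, 0.71, 0.92],
    [ 8.1, 40.0, 0.55, 0.61],
    [ 3.4,  5.0, 0.12, 0.35],
    [10.9,  1.0, 0.64, 0.88],
    [ 2.2, 90.0, 0.08, 0.20],
    [ 7.7, 12.0, 0.43, 0.57],
])
y = np.array([2, 1, 0, 2, 0, 1])  # graded relevance labels
group = [3, 3]                    # first 3 rows belong to query 1, next 3 to query 2

ranker = LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)
print(ranker.predict(X[:3]))      # ranking scores for query 1's candidate documents
```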
6. Conclusion. 🥲
Could it be that we’ve reached the end of our journey?
Yes, indeed. Here’s how Jo wrapped up the talk:
“So quick summary, I think that information retrieval is more than just a single vector representation. And if you want to improve your retrieval stage, you should look at building your own evals. And please don’t ignore the BM25 baseline. And choosing some technology that has hybrid capabilities, meaning that you can have exact search for exact tokens and still have matches, and also combine the signals from text search via keywords and text search via embeddings, can avoid some of these failure modes that I talked about. And yeah, and finally, real-world search is more than text similarity.”
– 29:30
Don’t set aside the 5-hour energy just yet. 🏃
There were also some audience questions. 🙋
This section was about 15 minutes long.
However, I think you’ve heard enough summarizing from me.
And I haven’t brought Claude into the mix yet. 🤔
I saw Darwin Santos share a cool thing he did with it on LinkedIn recently.
Let me try a version of that using a simple prompt: