How Google Understands (Comprehends) Synonyms: Matt Cutts’ 2010 Blog Post Compared to (Contrasted with) Search Today (A Hamsterdam History Lesson)
By Ethan Lazuk

Welcome to another week of Hamsterdam History, where we look at vintage SEO articles to celebrate their contributors, get historical context, and learn how things have changed since.
This week, our plan is to explore a post from Matt Cutts’ blog, “Gadgets, Google, and SEO.”

How come?
Well, I largely missed the Cutts era at Google.
However, I appreciate when people in the tech space today share insights. (One of my favorite articles in this week’s Hamsterdam recap (Part 58) was an interview with Perplexity’s head of search, Alexandr Yarats.)
In that spirit, I’d like to learn more about the Cutts era, which I’m sure others will have fun exploring or re-living, as well.
My first thought was to write about the most popular post on his blog, according to third-party traffic estimates.
After all, who doesn’t love what’s popular?
As it turned out, though, the most-clicked post (allegedly) was about reindeer car antlers:

No, seriously:

What does it mean when the most popular article in the blog of a notable search and SEO figure is about reindeer car antlers?
I’m not sure … perhaps we should confirm with GSC before drawing any conclusions …
That said, it did make me think of this classic song (best played at high volume, preferably in a residential area):
And that led me to create this image:

Yes, that’s how I spent part of my day.
Blame caffeine …

Those are good burgers, Walter.
But getting back to our goal of reviewing one of Matt Cutts’ posts to learn something useful about SEO history, I saw one that stood out.
It was about synonyms:

It wasn’t nearly as highly visited as reindeer car antlers, but it struck a chord with me for a particular reason …
You see, last night I watched this great transformers video on YouTube:

Oops, not that one.
It was this one about these transformers:

As you can see from the tiny text at the bottom of that screenshot, the video was about attention in transformers.

Sorry, back on track …
The video is well-made. It got comments like this:

What I found really cool about it, watching as an SEO, is the discussion of vector embeddings in high-dimensional spaces.

Think of it like this.
If we mentioned “reindeer antlers” in our page’s text early on, that word or phrase’s contextual embedding (its position in the vector space) might change based on other words that add context, even if they occur much later.

That might mean our “reindeer antlers” topic starts in the contextual domain of “wildlife” but later moves to “automotive.” (We can use our imagination with the example below.)

Just don’t feed the word embeddings after midnight.
In other words, the SAME WORDS OR PHRASES can have TOTALLY DIFFERENT MEANINGS based on their SEMANTIC CONTEXT.

Caffeine …
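To see that idea in a more hands-on way, here’s a minimal sketch (my own illustration, not anything Google has published) that compares contextual embeddings for the same phrase in two different sentences. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint, and the sentences are invented for the demo.

```python
# A toy comparison of contextual embeddings (illustration only, not a
# Google Search system). Assumes the Hugging Face `transformers` library
# and the public `bert-base-uncased` checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def phrase_embedding(sentence: str, phrase: str) -> torch.Tensor:
    """Average the contextual vectors of the tokens that make up `phrase`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    phrase_ids = tokenizer(phrase, add_special_tokens=False)["input_ids"]
    token_ids = inputs["input_ids"][0].tolist()
    for i in range(len(token_ids) - len(phrase_ids) + 1):
        if token_ids[i:i + len(phrase_ids)] == phrase_ids:
            return hidden[i:i + len(phrase_ids)].mean(dim=0)
    raise ValueError(f"'{phrase}' not found in sentence")

wildlife = phrase_embedding(
    "The caribou shed their reindeer antlers in the spring forest.",
    "reindeer antlers",
)
automotive = phrase_embedding(
    "I clipped plush reindeer antlers onto my car windows for the holidays.",
    "reindeer antlers",
)

# Same words, different vectors: the cosine similarity lands below 1.0
# because the surrounding words shift each occurrence's position in space.
print(torch.nn.functional.cosine_similarity(wildlife, automotive, dim=0))
```

The exact similarity score doesn’t matter much; the point is simply that the identical phrase lands in two different spots in the vector space depending on its neighbors.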
The point is, for SEOs who are thinking, “I’ll just put this keyword on the page more often.”
Um, maybe …
But the transformer architecture we just looked at isn’t just seeing your words at face value like that.
It’s contextualizing all the details that give those words meaning.
Where have we heard that kind of thing before?
“… we introduced and open-sourced a neural network-based technique for natural language processing (NLP) pre-training called Bidirectional Encoder Representations from Transformers, or as we call it–BERT, for short. … This breakthrough was the result of Google research on transformers: models that process words in relation to all the other words in a sentence, rather than one-by-one in order. BERT models can therefore consider the full context of a word by looking at the words that come before and after it—particularly useful for understanding the intent behind search queries.” (Highlights and bolding added to all quotes.)
– Pandu Nayak in 2019
After all, what is a synonym, anyway?

It’s a word or phrase that has (nearly) the same meaning, concept, or quality.
Isn’t that sort of like the idea of a high-dimensional vector space capturing contextual meaning?

Yet, while transformers enabled these more contextualized embeddings, word embeddings themselves existed long before.
In our Microsoft search engine history post, for example, we had a quote from Google’s Peter Norvig about how his company’s approach to semantic search was at the word or phrase level, mapping words to concepts (circa 2007-2008):

Then in last week’s Hamsterdam History lesson, we reviewed a 2013 interview with Jeff Dean (then of Google Research and now of Google DeepMind), where he spoke about neural networks and mentioned word vectors (embeddings):

That was the same year (2013) that word2vec, “a family of model architectures and optimizations that can be used to learn word embeddings from large datasets,” was released.
It’s also the year that Jeff contributed to a research paper with other Google researchers called “Efficient Estimation of Word Representations in Vector Space.”
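As a quick aside, here is roughly what experimenting with word2vec-style embeddings looks like today, sketched with the open-source gensim library (my own choice for illustration; the original 2013 release was a standalone toolkit). The tiny corpus below is invented, so the similarities it learns will be noisy, but it shows the basic idea of words that share contexts ending up with nearby vectors.

```python
# A toy word2vec sketch using gensim, for illustration only. The corpus is
# invented and far too small to learn reliable embeddings.
from gensim.models import Word2Vec

corpus = [
    ["usb", "drive", "storage", "portable"],
    ["flash", "drive", "storage", "portable"],
    ["thumb", "drive", "storage", "portable"],
    ["reindeer", "antlers", "car", "holiday", "decoration"],
]

# sg=1 selects the skip-gram architecture described in the 2013 paper;
# vector_size is kept tiny because this is a demo, not a real model.
model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, sg=1)

# Words that appear in similar contexts should end up with nearby vectors.
print(model.wv.most_similar("usb", topn=3))
```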
Transformers then hit the scene in 2017.
So to put everything in a timeline, we had Google exploring words to concepts in 2007, Matt Cutts writing about synonyms in 2010 (which we’ll explore here), Jeff Dean speaking about word vectors in 2013, transformers appearing in 2017, and models like BERT appearing in 2018-2019.
Let’s check out Matt’s blog post on synonyms from 2010, follow a few breadcrumbs, and then compare our learnings to our present day.
“More info about synonyms at Google” – Matt Cutts (2010)
What I find cool about Matt’s blog is that today’s version is nearly identical to how it looked 14 years ago.
Here’s the current version:

And here’s how it looked the week it was published in January of 2010:

Both screenshots show the full text.
However, if you’re on mobile, you’d probably be angry if I left you to read that!
So let’s explore the text in more detail. 🙂
The introduction links to another blog post by Google engineer Steve Baker. (We’ll explore that post in a minute.)
It also tells us that Google uses “semantics” and understands meaning at the document and query level:
“Steve Baker, an engineer in the search quality group at Google, just did a nice post about synonyms on the Google blog. A lot of people seem to think that Google only does simple-minded matching of the users’ keywords with words that we indexed. The truth is that Google does a lot more sophisticated stuff than most people realize. I’d say that Google does more with ‘semantics’ and both document and query understanding than almost any other search engine.“
– Matt Cutts (2010)
It’s not clear if that’s a general statement about Google Search or specific to the context of synonyms.
Matt mentions a few examples of synonyms from Steve’s post, including “arm reduction” vs. “arms reduction” and how “bb” could mean bottom bracket or blackberry, depending on surrounding words in the query.
He also includes a stat from Steve’s article “that we haven’t made public before,” which says “synonyms affect 70 percent of user searches.”
He calls Steve a “smart engineer” and links to a 2009 article about a Google patent for “query synonyms in query context.” (We’ll look at that post in a sec.)
Matt then offers advice for webmasters, which is pretty much in line with “focus on your users first”:
“As far as concrete advice for webmasters, the same advice still holds that we’ve always said: think about the different words that searchers might use when looking for your content. Don’t just use technical terms–think about real-world terms and slang that users will type. For example, if you’re talking about a ‘usb drive,’ some people might call it a flash drive or a thumb drive. Bear in mind the terms that people will type and think about synonyms that can fit naturally into your content. Don’t stuff an article with keywords or make it awkward, but if you can incorporate different ways of talking about a subject in a natural way, that can help users.”
I can’t speak for 2010, but if we think about the idea of words to concepts in light of high-dimensional vector spaces (post-transformers in 2017), we can start to imagine what “write for your users” means.
If I reference a hammer as a “thingamajig” randomly on a page, for example, Google probably wouldn’t know what I’m saying.
But if I say, “I used my large thingamajig to nail two boards together, then I smashed my finger with it, so I put the thingamajig in my toolbelt and called over my contractor friend to drive me from the job site to the hospital. Along the way, he said, ‘You should try a ball pein thingamajig. After all, it’s not the size of your thingamajig, it’s knowing how to use it.’”
That provides deeper context for mapping the word “thingamajig” to its meaning as a hammer.
But if my audience prefers the word thingamajig to hammer (they might …), then I’ll be creating helpful content for them by writing that way, plus Google will still understand the meaning, in theory.
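Out of curiosity, we can also poke at that idea with a masked language model. Here is a small sketch (again, my own illustration, not how Google processes pages) that hides the tool’s name and lets the public bert-base-uncased checkpoint guess it purely from the surrounding context, via the Hugging Face transformers library.

```python
# A toy fill-in-the-blank demo: the model only sees the context, never the
# hidden word itself. Illustration only, not a Google Search system.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

sentence = (
    "I used my large [MASK] to nail two boards together, "
    "then I smashed my finger with it and put it back in my toolbelt."
)

# The surrounding words ("nail," "boards," "toolbelt") do the heavy lifting;
# a tool like "hammer" should rank near the top of the guesses.
for prediction in fill(sentence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```

If the context is rich enough, the model should fill the blank with something hammer-like, which is the “mapping words to concepts” idea playing out at the level of a single sentence.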
Matt included a few links in his post that we can explore for more context around 2010’s search systems, as well.
Breadcrumbs of links
Reviewing the links chronologically, the first link, which appears in reference to Steve Baker, goes to a 2009 blog post on ArnoldIT.com called “Google Nails Patent for Query Synonyms in Query Context.”

As you can see, the author discusses the patent and then quotes one part:
“A method is applied to search terms for determining synonyms or other replacement terms used in an information retrieval system. User queries are first sorted by user identity and session. For each user query, a plurality of pseudo-queries is determined, each pseudo-query derived from a user query by replacing a phrase of the user query with a token. For each phrase, at least one candidate synonym is determined. The candidate synonym is a term that was used within a user query in place of the phrase, and in the context of a pseudo-query. The strength or quality of candidate synonyms is evaluated. Validated synonyms may be either suggested to the user or automatically added to user search strings.“
– Part of the Google synonyms patent quoted in ArnoldIT.com blog post (2009)
Here’s the patent itself, which I looked up at patents.google.com.

Interestingly, it was filed in 2005 and granted in 2009. In other words, it was filed five years before Matt’s blog post. (Remember that for later.)
What’s in the patent? Well, here’s the first figure:

Of course, we’re in the Gemini era today. 😉
So I uploaded the 15-page PDF to a prompt (a mere 15k tokens out of 1 million available) and had Gemini 1.5 Flash create a summary:
“This patent describes a method for Google Search to find synonyms for search terms. Here’s how it works:
- Collect Data: Google analyzes a huge database of past search queries. This data includes the queries themselves, the user who made them, and the time they were made.
- Identify Query Pairs: The system looks for pairs of queries that are very similar but differ in one or more words or phrases. These differing phrases become potential synonyms.
- Evaluate Synonyms: The system uses several tests to determine the quality of the potential synonyms. These tests include:
  - Frequency: How often do users use both the original phrase and the potential synonym in their searches?
  - Common Results: Do the search results for the original phrase and the potential synonym share a lot of the same web pages?
  - Sequential Searches: Do users often search for the original phrase and the potential synonym within a short time frame?
- Score Synonyms: The system assigns a score to each potential synonym based on how well it passes the tests.
- Use Synonyms: Google can then use these synonyms in various ways:
  - Suggest Alternatives: When a user searches, Google might suggest alternative searches using the synonyms.
  - Automatic Substitution: Google might automatically substitute the synonym for the original phrase in the search, so the user doesn’t have to change their query.
In short, this patent helps Google find better synonyms to improve search results and make it easier for users to find what they’re looking for.”
– Gemini’s summary of the 2009 synonym patent
Nothing in the patent speaks to vector embeddings, as I understand them.
That makes more sense if this was originally from 2005 — so far, our first reference to semantic search by Google (at least in Hamsterdam History) was 2007-ish.
Based on the 2009-granted patent, Google’s synonym system is more about statistical analysis of query logs and user substitution behaviors with search queries, i.e., co-occurrence analysis and query-result similarity. (Ok, maybe Gemini helped me with some of that vocabulary, too.)
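To make the pseudo-query idea from the summary above a bit more concrete, here is a toy sketch I wrote purely for illustration (definitely not Google’s implementation). It only covers the frequency-style signal: it replaces one word of each logged query with a placeholder token, then treats words that fill the same slot within the same user session as candidate synonyms, ranked by a crude count. The query log is hypothetical.

```python
# A toy version of pseudo-query candidate synonym mining, for illustration
# only. Real systems would also check shared results, timing, and more.
from collections import defaultdict
from itertools import combinations

# Hypothetical query log: (user_id, session_id, query) tuples.
query_log = [
    ("u1", "s1", "change laptop brightness"),
    ("u1", "s1", "adjust laptop brightness"),
    ("u2", "s1", "change screen brightness"),
    ("u2", "s1", "adjust screen brightness"),
    ("u3", "s1", "usb drive not detected"),
    ("u3", "s1", "flash drive not detected"),
]

# Map each pseudo-query (the query with one word swapped for a token),
# keyed by user and session, to the set of words that filled that slot.
slot_to_words = defaultdict(set)
for user, session, query in query_log:
    words = query.split()
    for i, word in enumerate(words):
        pseudo = " ".join(words[:i] + ["<TOKEN>"] + words[i + 1:])
        slot_to_words[(user, session, pseudo)].add(word)

# Count how often two different words fill the same slot: a crude stand-in
# for the patent's frequency and sequential-search signals.
candidate_counts = defaultdict(int)
for slot_words in slot_to_words.values():
    for a, b in combinations(sorted(slot_words), 2):
        candidate_counts[(a, b)] += 1

for pair, count in sorted(candidate_counts.items(), key=lambda kv: -kv[1]):
    print(pair, count)
# Expected toy output: ('adjust', 'change') twice, ('flash', 'usb') once.
```

Even this crude version surfaces “adjust” as a candidate synonym for “change” and “flash” for “usb,” which is the same flavor of query-log statistics the patent describes, just without the scoring and validation steps.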
The second link, which appears in Matt’s intro, is a Google blog post from January 19th, 2010 (the same day he published his own post), called “Helping computers understand language.”
Interestingly, it mentions “artificial intelligence” in the context of NLP (or more accurately NLU):
“Enabling computers to understand language remains one of the hardest problems in artificial intelligence. The goal of a search engine is to return the best results for your search, and understanding language is crucial to returning the best results. A key part of this is our system for understanding synonyms.”
– Google Blog (2010)
Super interestingly, the words artificial intelligence were linked to … a Wikipedia page for “AI-complete.”

The Wikipedia page’s first sentence says:
“In the field of artificial intelligence (AI), tasks that are hypothesised to require artificial general intelligence to solve are informally known as AI-complete or AI-hard.”
– Wikipedia page on AI-complete
So even back in 2010, Google had its sights set on AGI.
This is notably the stated ambition of Google DeepMind — the original DeepMind started in 2010 — as Demis Hassabis even said in his I/O 2024 keynote.

That said, the 2010 Google blog post notes how “Our synonyms system is the result of more than five years of research within our web search ranking team.”
That likely references the patent from 2005, which is based on statistical models (not neural networks).
On a separate note, I also find the mention of a “bad synonym,” or a wrong association, interesting.
It happened 2% of the time at first — “For every 50 queries where synonyms significantly improved the search results, we had only one truly bad synonym.”
However, the issue wasn’t addressed one instance at a time, but rather system-wide:
“Note that you can still see that on Google today, because while we know it’s a bad synonym, we don’t typically fix bad synonyms by hand. Instead, we try to discover general improvements to our algorithms to fix the problems. We hope it will be fixed automatically in some future changes.”
– Google Blog (2010)
I think it’s important to understand that for search, it’s usually the aggregate level that counts.
We also get a mention of Google bolding synonyms in snippets:
“Historically, we have bolded synonyms such as stemming variants — like the word ‘picture’ for a search with the word ‘pictures.’ Now, we’ve extended this to words that our algorithms very confidently think mean the same thing, even if they are spelled nothing like the original term. This helps you to understand why that result is shown, especially if it doesn’t contain your original search term.”
The post then notes that Google uses “many techniques to extract synonyms” and links to a 2008 post called “Making search better in Catalonia, Estonia, and everywhere else.”
We’ve covered that post before in Hamsterdam!
It was the focus of the “This week in SEO history” section in our March 31st recap (Part 51):

TLDR: The main takeaway is that the post discusses using language models for personalization, but this was still accomplished through keyword matching, not vector embeddings (presumably).
So, based on the evidence so far, I think we can say that Google intended to apply AI to NLP by 2010, but the synonym systems of that era were still based on statistical analysis.
The third link came when Matt updated his post three days later, referencing another Google blog post from January 22nd, 2010, called “Understanding the web to make search more relevant.”
That post’s opening sentence references “Google Squared” and “Rich Snippets,” both from Searchology 2009.
As it turns out, we also wrote about the Searchology 2009 event in Hamsterdam History, including Google Squared.
Basically, Google Squared highlighted answers from structured data (and it didn’t last long):

However, that led Google to directly answer queries in the SERP by highlighting relevant text in search snippets from unstructured data:
“Unstructured data is difficult for a computer to interpret, which means that we humans still have to do a fair amount of work to synthesize and understand information on the web.
Google Squared is one of our early efforts to automatically identify and extract structured data from across the Internet. We’ve been making progress, and today the research behind Google Squared is, for the first time, making search better for everyone with a new feature called ‘answer highlighting.’
Answer highlighting helps you get to information more quickly by seeking out and bolding the likely answer to your question right in search results.”
– Google Blog, “Understanding the web to make search more relevant” (2010)
In summary, the synonym systems for Google Search in 2010 were likely based on statistical models and not neural networks (AI).
Word embeddings did exist in the early 2000s, and breakthroughs started to happen around 2010.
However, it wasn’t until 2013 when Google introduced word2vec and 2017 when transformers advanced NLP with embeddings that captured deeper semantic context from surrounding words and long-range dependencies.
How does Google Search handle synonyms today?
We saw in our video about transformers how vector embeddings in high-dimensional spaces can capture rich context.

We also referenced BERT from 2018-2019.

But how does it all fit together in Google’s current synonym systems?
Google’s How Search Works page has a section on ranking results with a subsection called “Meaning of your query,” where it appears to reference the same system as Matt’s post in 2010 (based on the “five years” reference):
“This involves steps as seemingly simple as recognizing and correcting spelling mistakes, and extends to trying our sophisticated synonym system that allows us to find relevant documents even if they don’t contain the exact words you used. For example, you might have searched for ‘change laptop brightness’ but the manufacturer has written ‘adjust laptop brightness.’ Our systems understand the words and intent are related and so connect you with the right content. This system took over five years to develop and significantly improves results in over 30% of searches across languages.”
– Google How Search Works
I do wonder, based on this page, if the synonym system is still based (in part) on statistical models (like in 2010), but the related aspects of ranking systems, like for query understanding (intent), are driven by AI systems:
“Our systems also try to understand what type of information you are looking for. If you used words in your query like ‘cooking’ or ‘pictures,’ our systems figure out that showing recipes or images may best match your intent. If you search in French, most results displayed will be in that language, as it’s likely you want. Our systems can also recognize many queries have a local intent, so that when you search for ‘pizza,’ you get results about nearby businesses that deliver.”
RankBrain launched in 2015 as the “first deep learning system deployed in Search,” according to a blog post Pandu Nayak wrote in February 2022.
We saw earlier how Jeff Dean referenced neural networks and vector embeddings in 2013, yet if RankBrain was the first deep learning system deployed in Search, this would mean deep neural networks (DNNs) weren’t used in Search for vector embeddings pre-2015.
Nayak also described RankBrain as the “first AI system” — but since machine learning was long a part of Google products, this likely refers to DNNs — and most importantly, his quote speaks to the implied use of vector embeddings (or how words relate to concepts):
“At the time, [RankBrain] was groundbreaking — not only because it was our first AI system, but because it helped us understand how words relate to concepts. Humans understand this instinctively, but it’s a complex challenge for a computer. RankBrain helps us find information we weren’t able to before by more broadly understanding how words in a search relate to real-world concepts. For example, if you search for ‘what’s the title of the consumer at the highest level of a food chain,’ our systems learn from seeing those words on various pages that the concept of a food chain may have to do with animals, and not human consumers.”
– Pandu Nayak (2022)
Next came neural matching in 2018, likely expanding vector embeddings to the document level:
“But it wasn’t until 2018, when we introduced neural matching to Search, that we could use them to better understand how queries relate to pages. Neural matching helps us understand fuzzier representations of concepts in queries and pages, and match them to one another. It looks at an entire query or page rather than just keywords, developing a better understanding of the underlying concepts represented in them.”
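As a rough analogy for “how queries relate to pages,” here is a minimal sketch of embedding-based query-to-page matching. To be clear, this is not Google’s neural matching system; it just uses the open-source sentence-transformers library and the public all-MiniLM-L6-v2 model to score whole texts against a query.

```python
# A toy query-to-page matching demo with whole-text embeddings. Illustration
# only, not Google's neural matching system.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "change laptop brightness"
pages = [
    "How to adjust the screen brightness on your notebook computer.",
    "Recipes for holiday reindeer-shaped cookies.",
]

query_vec = model.encode(query, convert_to_tensor=True)
page_vecs = model.encode(pages, convert_to_tensor=True)

# Cosine similarity over whole-text embeddings: the first page never uses
# the words "change" or "laptop," yet its concepts overlap with the query.
scores = util.cos_sim(query_vec, page_vecs)
for page, score in zip(pages, scores[0]):
    print(round(float(score), 3), page)
```

Even with zero keyword overlap, the notebook page should score well above the cookie page, which is the “fuzzier representations of concepts” idea in miniature.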
What happened the previous year that might have enabled neural matching?
Transformer architectures were introduced.
And the year after neural matching?
BERT rolled out:
“Launched in 2019, BERT was a huge step change in natural language understanding, helping us understand how combinations of words express different meanings and intents. Rather than simply searching for content that matches individual words, BERT comprehends how a combination of words expresses a complex idea. BERT understands words in a sequence and how they relate to each other, so it ensures we don’t drop important words from your query — no matter how small they are.”
Then in 2021 came MUM, and unlike BERT (an encoder-only model that understood language), MUM was multimodal (text, images, video, etc.) and “capable of both understanding and generating language,” likely using an encoder-decoder model similar to T5 and today’s LLMs.
MUM is also reportedly a “thousand times more powerful than BERT,” likely implying its vector embeddings are based on even longer-range dependencies, multimodal inputs, and maybe the knowledge graph:
“[MUM is] trained across 75 languages and many different tasks at once, allowing it to develop a more comprehensive understanding of information and world knowledge.”
We can see the advancements of “words to concepts” by Google’s systems every few years, extending now to generative AI summaries in SERPs.
Of course, whether synonym systems use those AI models or still depend, to some degree, on the statistical models unveiled in 2009-2010 isn’t certain.
My guess is it’s not so straightforward, and pretty much wherever you turn, deep neural networks (led by BERT) are there. To what degree, though? I’m sure it depends. 🙂
Come back next week!
I appreciate you reading this week’s Hamsterdam History lesson!
That was quite a journey, and I’m sure we’ve barely (hardly) scratched the surface (explored the topic) of synonyms and Google Search.
I do think, however, that we’re beginning to see how these different topics build context in our own high-dimensional vector space of SEO history knowledge.
Drop by next week for a new lesson, or check out past articles below.
Until next time, enjoy the vibes:
Thanks for reading. Happy optimizing! 🙂
Related history posts:
“Book Me A Trip To Washington, DC”: Revisiting Google Research’s Ambitions with Neural Networks in 2013 a Decade Later, After Google I/O 2024 (Hamsterdam History)
In this Hamsterdam History article, we’ll take a look at Google Research’s AI-driven goals in 2013 to see if they’ve come true after Google I/O…
Revisiting the Cre8tive Flow Blog in 2005 to Learn about Jared Spool and the Timelessness of Usability (for SEO), A Hamsterdam History Lesson
Jared Spool was called the “bad boy” of usability. Looking back at a 2005 post from Cre8tive Flow Blog, we can see the timelessness of…
“Hello, World”: Exploring the History of Microsoft’s Search Engines, from MSN to Bing Sources in Today’s AI Chats (Hamsterdam History)
This Hamsterdam History lesson examines the history of search engines from Microsoft, from MSN Search in 1998 to Copilot with Bing and other AI chats…