Google’s Search Quality: Blog Scrapers & Content Farms in 2011 vs. AI Spam in 2024

By Ethan Lazuk


Hamsterdam History Lesson #3 with Fargo Ice Scraper Scene.

Welcome to another rendition of Hamsterdam History!

This will be a shorter post, but it’s a big topic, so I hope we do it justice.

How did we land on the topic of Google’s search quality?

Well, I chose a random article on Search Engine Land via Internet Archive’s Wayback Machine and explored its topics and modern-day parallels.

This is where we landed, and I think it’s quite timely.

In short, the chain of events around Google’s statements and ranking system updates on scraped content, content farms, and webspam from early 2011 into 2012 feels a lot like the state of affairs today.

But you be the judge.

The backstory: Google signals algorithm changes (2011)

On August 26th, 2011, Matt McGee wrote an article for Search Engine Land called, “Google Signals Upcoming Algorithm Change, Asks For Help With Scraper Sites.”

Here’s how it looked contemporaneously on the SEL website (via the Wayback Machine):

Screenshot of “Google Signals Upcoming Algorithm Change, Asks For Help With Scraper Sites” on Search Engine Land via the Wayback Machine.

The article references a tweet by Matt Cutts asking people in the webmaster community to share examples of blog scraper pages to provide data points for testing.

The speculation was that these would be used for a future algorithm update, per the article’s title.

That conclusion was derived from the linked Google Doc form, which “explains that Google ‘may use data you submit to test and improve our algorithms.’”

This was a “long-running problem,” McGee explains: not just scraper sites stealing content, but “particularly scraper sites that are ranking higher than the original page.”

He also notes that “the issue of scraper sites has been particularly prominent in recent months.”

Sound familiar?

The first sentence of the SEL article’s section called “Google vs Scraper Sites” reads eerily like today:

“Google has always had critics but, within the last year, many of them grew more vocal about what they perceived as a decline in quality of Google’s search results.”

– Matt McGee, SEL (2011)

Google had penned its own blog post earlier that year, in January 2011, responding to this criticism and saying it disagreed that search results had grown worse.

Here’s a link to that blog post on the Wayback Machine. It was called “Google search and search engine spam” and was authored by Matt Cutts, Principal Engineer.

“Google search and search engine spam” article on the Google Blog from 2011 on the Wayback Machine.

The post starts off referencing how January of 2011 “brought a spate of stories about Google’s search quality.”

I can’t help but relate this to January of 2024 — before the mighty sword of the March 2024 core update and spam update fell — when spam sites in Google Search were an ongoing concern, and claims of search quality degradation were in the news:

Google Search Quality Worse query on Google News tab.

Back in 2011, Cutts wrote:

“according to the evaluation metrics that we’ve refined over more than a decade, Google’s search quality is better than it has ever been in terms of relevance, freshness and comprehensiveness.”

– Matt Cutts, Google Blog (2011)

Similarly, in 2024, Google stated (in its post-trial debrief) that its search quality had improved:

We can also draw parallels between the spam problems of 2011 and those of 2024.

In the 2011 Google Blog post, Cutts related spam to cheating (what we might call “SEO spam” today) but also to “off-topic webspam” (what I call “gross spam”):

“Just as a reminder, webspam is junk you see in search results when websites try to cheat their way into higher positions in search results or otherwise violate search engine quality guidelines. A decade ago, the spam situation was so bad that search engines would regularly return off-topic webspam for many different searches. For the most part, Google has successfully beaten back that type of “pure webspam”—even while some spammers resort to sneakier or even illegal tactics such as hacking websites.”

– Matt Cutts, Google Blog (2011)

Here’s something else that’s interesting.

Cutts seemingly makes a connection between spam and “freshness”:

“As we’ve increased both our size and freshness in recent months, we’ve naturally indexed a lot of good content and some spam as well.”

– Matt Cutts, Google Blog (2011)

Freshness has also been an alleged loophole for spammers in Google’s systems in 2024, as Roger Montti pointed out in Search Engine Journal here and again here (excerpt below):

“It’s my hypothesis that the reason these spam sites rank is that they’re taking advantage of a loophole in Google’s algorithms that allows new content to receive an initial boost, what Google’s John Mueller has described as Google testing the website or the webpages out.”

– Roger Montti, SEJ (2024)
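To make that hypothesis concrete, here’s a minimal toy sketch of what a decaying “new content boost” could look like in principle. Every name and number here (the base relevance score, the boost size, the half-life) is my own invented assumption for illustration; Google has never published such a formula.

```python
import math

def freshness_boosted_score(base_relevance: float, age_days: float,
                            boost: float = 0.3, half_life_days: float = 14.0) -> float:
    """Toy ranking score: base relevance plus a freshness bonus that
    decays exponentially as the document ages.

    All parameters are invented for illustration; this is NOT Google's formula.
    """
    decay = math.exp(-math.log(2) * age_days / half_life_days)  # halves every half_life_days
    return base_relevance * (1.0 + boost * decay)

# A mediocre brand-new page can briefly outscore a better page that's months old:
print(freshness_boosted_score(0.60, age_days=0))    # ~0.78
print(freshness_boosted_score(0.70, age_days=120))  # ~0.70
```

In a model like this, the window where a new page outranks better, older pages is exactly the “testing out” period Montti suggests spammers exploit at scale before the algorithms catch up.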

Another interesting parallel is Google’s use of classifiers.

Elsewhere in the 2011 blog post, Cutts references a “classifier” at the page (document) level, as well as site-wide efforts against hacked sites:

“To respond to that challenge, we recently launched a redesigned document-level classifier that makes it harder for spammy on-page content to rank highly. The new classifier is better at detecting spam on individual web pages, e.g., repeated spammy words—the sort of phrases you tend to see in junky, automated, self-promoting blog comments. We’ve also radically improved our ability to detect hacked sites, which were a major source of spam in 2010.”

– Matt Cutts, Google Blog (2011)

The use of the word “classifier” is reminiscent of the helpful content system, which launched in August of 2022 and used a “classifier process” that was “entirely automated, using a machine-learning model.” That classifier was site-wide, however, and the helpful content system was later absorbed into Google’s core ranking systems in March of 2024. It appears the site-wide classifier remains, but the old system’s signals (now incorporated among others) also work at a page level.
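For intuition only, here’s a toy sketch of the difference between a document-level signal and a site-wide aggregate, using the kind of “repeated spammy words” heuristic Cutts described. The phrase list, scoring, and normalization are all my own invented assumptions; Google’s actual classifiers are learned machine-learning models, not hand-coded rules like these.

```python
from collections import Counter

# Hypothetical phrase list; a real system would learn its features, not hard-code them.
SPAMMY_PHRASES = ["buy cheap", "click here", "free download", "best price"]

def page_spam_score(text: str) -> float:
    """Toy document-level signal: how much of one page is repeated spammy boilerplate."""
    words = text.lower().split()
    if not words:
        return 0.0
    bigrams = Counter(" ".join(pair) for pair in zip(words, words[1:]))
    spam_hits = sum(bigrams[p] for p in SPAMMY_PHRASES)
    return min(1.0, spam_hits / max(1.0, len(words) / 10))

def site_spam_score(pages: list[str]) -> float:
    """Toy site-wide aggregate, roughly how the old helpful content classifier
    was described: one score applied across the whole site."""
    return sum(page_spam_score(p) for p in pages) / len(pages) if pages else 0.0
```

The split between the two functions is the part that changed in March 2024: a page-level signal can demote an individual spammy page without a site-wide score dragging down the entire domain.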

Another parallel is the issue of content quality.

Cutts goes on to say “changes” were being evaluated for dealing with “content farms” (or what we might call “unhelpful” content today):

“And we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content. We’ll continue to explore ways to reduce spam, including new ways for users to give more explicit feedback about spammy and low-quality sites.

As ‘pure webspam’ has decreased over time, attention has shifted instead to ‘content farms,’ which are sites with shallow or low-quality content. In 2010, we launched two major algorithmic changes focused on low-quality sites. Nonetheless, we hear the feedback from the web loud and clear: people are asking for even stronger action on content farms and sites that consist primarily of spammy or low-quality content.”

– Matt Cutts, Google Blog (2011)

I’m not familiar with the reference to the two major algorithmic changes in 2010.

In doing some research, I found that SEJ mentions the MayDay Update from April 28th, 2010. (This seems like one of the updates Cutts mentioned.) HubSpot also mentions the Caffeine Update from June 8th, 2010. (I’m not sure Caffeine would be called a “major algorithmic change,” unless Cutts meant the document freshness it enabled.) Those are the two largest updates I found.

But that aside, what I find interesting is that the rollout of those major changes in 2010 (which, in parallel, I think of kind of like the third helpful content update of September 2023) was followed by calls for another round of significant updates targeting low-quality content and spam, which Google heeded (and which, in parallel, I think of kind of like the March 2024 core and spam updates).

Here’s the timeline of events:

A week after publishing that Google Blog post, on January 28th, 2011, Matt Cutts wrote a new post on his personal site announcing that Google had launched one of the “changes” referenced in the blog post:

Matt Cutts’ “Algorithm change launched” blog post from 2011 on the Wayback Machine.

“My post mentioned that ‘we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.’ That change was approved at our weekly quality launch meeting last Thursday and launched earlier this week.

This was a pretty targeted launch: slightly over 2% of queries change in some way, but less than half a percent of search results change enough that someone might really notice. The net effect is that searchers are more likely to see the sites that wrote the original content rather than a site that scraped or copied the original site’s content.”

– Matt Cutts, Personal Blog (2011)
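Conceptually, the change Cutts describes boils down to: when two results are near-duplicates, prefer the one that published the content first. Here’s a minimal sketch of that idea using word shingles, a textbook near-duplicate detection technique. It’s my own illustration under heavy assumptions (the first_seen field and the similarity threshold are hypothetical), not how Google actually implemented it.

```python
def shingles(text: str, k: int = 5) -> set:
    """Break text into overlapping k-word shingles for near-duplicate detection."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Similarity between two shingle sets: 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def prefer_original(results: list[dict], threshold: float = 0.8) -> list[dict]:
    """Among near-duplicate results, keep only the earliest-published copy.

    'first_seen' is a hypothetical field standing in for however a search
    engine decides which copy came first."""
    kept: list[dict] = []
    for doc in sorted(results, key=lambda d: d["first_seen"]):
        if all(jaccard(shingles(doc["text"]), shingles(k["text"])) < threshold for k in kept):
            kept.append(doc)
    return kept
```

The duplicate detection is the easy part; the hard part Google faced (then and now) is the first_seen step, i.e., reliably attributing which site wrote the original content.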

The two blog posts Matt Cutts wrote were both published in late January of 2011.

The following month, on February 23rd, Google rolled out the first Panda update (then called the Farmer Update).

Search Engine Roundtable Farmer Update story from 2011 via the Wayback Machine.

A couple of months later, on April 11th, 2011, Google rolled out the second Panda update (Panda 2.0).

A month later, on May 9th, 2011, it rolled out the third Panda update (Panda 2.1).

Then, on June 21st, 2011, it rolled out the fourth Panda update (Panda 2.2).

Then, on July 23rd, 2011, it rolled out the fifth Panda update (Panda 2.3).

And again, on August 12th, 2011, it rolled out the sixth Panda update (Panda 2.4).

There had now been six roughly monthly Panda updates, and here was Search Engine Land on August 26th reporting that Google was asking for help with scraper sites in an effort to improve search quality.

Why was this necessary?

Here’s how Matt McGee explains that portion of the timeline:

“after the Panda update rolled out in February, many webmasters flooded Google’s help forums with reports that it had gotten worse.

A few months later, during our SMX Advanced conference in Seattle, Cutts confirmed that the newest Panda update would target scraper sites. That update — Panda 2.2 if you’re scoring at home — rolled out in mid-June.

And that pretty much brings us up to today — where Google is ‘testing algorithmic changes for scraper sites (especially blog scrapers),’ and apparently looking for some examples that it thinks it may have missed.”

– Matt McGee, SEL (2011)

Here’s where things get interesting.

A month later, on September 28th, 2011, a new Panda update came out, the seventh (Panda 2.5). According to Moz’s algorithm change history page, “Specific details of what changed were unclear, but some sites reported large-scale losses.”

OK, so we have Panda updates happening monthly for most of 2011, and in the middle of all that, in August, Google’s Matt Cutts asks for examples of scraper sites.

Then, seven months later, on April 24th, 2012, Google rolled out Penguin.

Cold as ice

Importantly, the Google Blog post announcing the first Penguin update was called “Another step to reward high-quality sites.” It was published by Matt Cutts.

Here’s how it looked on the Webmaster Central Blog (where it was cross-published) at that time:

“Another step to reward high-quality sites” on the Google Webmaster Central Blog via the Wayback Machine in 2012.

I wonder which version ranked first, Google Blog or Webmaster Central Blog? 😉

It’s also amusing to read the comments (yes, Google allowed comments on their Webmaster Central Blog posts) in light of the Penguin update.

Why? They’re largely positive and optimistic.

As for the tofu and potatoes of the Penguin post’s content, its sentiments really make sense in light of the timeline we’ve just been examining.

Thinking back to the context of Matt Cutts’ blog posts from January of 2011, this is an interesting excerpt, where he contrasts “white hat” SEO with “black hat webspam”:

“The opposite of ‘white hat’ SEO is something called ‘black hat webspam’ (we say ‘webspam’ to distinguish it from email spam). In the pursuit of higher rankings or traffic, a few sites use techniques that don’t benefit users, where the intent is to look for shortcuts or loopholes that would rank pages higher than they deserve to be ranked.”

– Matt Cutts, Google Blog (2012)

But the part that stands out most to me is the conclusion:

“Sites affected by this change might not be easily recognizable as spamming without deep analysis or expertise, but the common thread is that these sites are doing much more than white hat SEO; we believe they are engaging in webspam tactics to manipulate search engine rankings. …

We want people doing white hat search engine optimization (or even no search engine optimization at all) to be free to focus on creating amazing, compelling web sites.”

– Matt Cutts, Google Blog (2012)

There’s a lot to unpack here, but I think we can swap a few words, like “shallow or low-quality content” for “unhelpful content,” “webspam tactics” for “search engine-first tactics,” and “amazing, compelling web sites” for “helpful, reliable, and people-first content,” and we’d be hard-pressed not to think this was discussing our present 2023-2024 era of Search.

What about AI?

Ah, yes. The essence of the 2011 SEL post (and Matt’s tweet) was scraper sites.

Fast forward to 2023, where we all remember this:

Which then resulted in this:

But that was in December of 2023.

A few months later, we also had this:

Then again, we also have instances like this:

There’s a lot there, but it generally speaks to the continuation of a classic, timeless story: one of “copied content,” shortcuts for ranking manipulation, and Google’s webspam and search quality efforts (both manual and algorithmic), with all the nuances therein.

But we can also see evolutions in that nuance.

I wasn’t doing SEO prior to 2015, but it’s been interesting to follow the opinions of those who have watched these systems evolve, like what Glenn Gabe talks about here:

Jumping to today, it does seem like the signals of the old helpful content system are now active at both the page level and the site-wide level, per Danny Sullivan here:

And John Mueller here:

But I think the takeaway from this post (which has grown longer than I expected; remember in the intro where I said it’d be short?) is that these updates can feel like one thing while they’re happening, but stepping back, we can see how timelines of statements and updates from Google, coupled with sentiment from the SEO community and the larger public, contribute to a longer series of events.

As Matt Cutts said in January of 2011, Google already had “two major algorithmic changes focused on low-quality sites. Nonetheless, we hear the feedback from the web loud and clear: people are asking for even stronger action on content farms and sites that consist primarily of spammy or low-quality content.”

Then came Panda and Penguin.

You can bet that however the March 2024 core update (and related changes to search ranking systems) shakes out, and especially with what awaits us in May via the site reputation abuse policy update, not to mention the unannounced global rollout of AI Overviews/SGE, the saga will continue.

A parting reminder

Whether you’re creating “amazing, compelling web sites” (2012) or “helpful, reliable, and people-first content” (2024), if you want to stay ahead of the curve, just follow this reminder:

That tweet came out 3 months before the third helpful content update … and so did the SEO replies to it. 😉

I hope you’ve enjoyed this rendition of Hamsterdam History!

Check out the other history articles available or sign up for the newsletter, and stay tuned for more!

Until next time, enjoy the vibes:

Thanks for reading. Happy optimizing! 🙂


Need a hand with SEO audits or content strategy?

I’m an independent strategist and consultant. Learn about my SEO services or contact me for more information!
