
Doing the Global Minimum: Thinking About SEO More in the Context of Neural Network Architectures (A Hamsterdam Research Opinion Piece)

By Ethan Lazuk

Japanese style wave art with faint neural network background.

SEO is 24/7.

It’s non-stop learning, implementing, iterating, evolving, reflecting, testing, screwing up, learning again, succeeding, celebrating, sharing, and learning some more.

Even when we’re not doing SEO, we’re seeing the world through that lens, laughing about staying in hotel room “404” or crying when our loved ones click on search ads instead of organic results when looking up info online … just me?

I even catch myself dreaming about site audits occasionally.

But the point is …

Our world of search evolves constantly, and sometimes abruptly.

We can’t get away with doing the minimum.

What I think we should do instead is the global minimum.

In this Hamsterdam Research opinion piece, I’ll introduce more discussion of neural networks to broaden our SEO lexicon.

I started writing this before Google I/O, but now we have 121 additional reasons why it’s relevant.

Google I/O 2024 AI count.
Source

The goal is to expand our thinking and inspire new ways of analyzing problems and discussing solutions.

The goal is to do the global minimum.

The global minimum is a mathematics term used in machine learning that refers to the absolute lowest point on a loss function’s graph.

It’s the B dot below.

Global minimum.
Source

Finding the global minimum is the ideal scenario in machine learning.

It represents the point where a model’s parameters achieve the lowest possible loss on the training data, and optimization algorithms can’t minimize the error further.

In other words, the global minimum is where a model’s output is as close as possible to its target.

It’s also often impossible to reach in practice, given model complexity and dataset sizes.

I think there are two general takeaways from this.

The first is that we can create some figurative SEO analogies from it. (I’ll explain those shortly.)

The second is that we can use this and related concepts to broaden our lexicon and worldview.

What I mean is, the more I delve into the world of neural network architectures, the less recognizable the vocabulary becomes.

That’s probably fair, in part.

After all, so much of how AI is used by search engines is proprietary, and SEO has enough trouble keeping out myths and misinformation.

So I’m sensitive to that aspect.

At the same time, I believe there are benefits to expanding our SEO vocabulary to include more AI terms.

This won’t be easy.

It took me years before I felt comfortable discussing even SEO basics, and I still see understanding them as a life-long journey.

Neural networks (and AI generally) are just as complex, and they evolve just as fast.

The more I learn, the less I feel I know.

That said, studying deep neural networks (DNNs) over the last few months has expanded my worldview and impacted how I think about search.

That’s why I want to help make ML terms a part of everyday SEO dialogue.

While some terminology applies directly to situations or topics within SEO, I think learning the lingo of AI can help us think about search a bit differently in general, with a more robust context.

We mentioned the global minimum, the lowest point of a loss function’s graph. Let’s start with what all of that means.

A loss function, also called a cost function, quantifies the difference (error) between a model’s predicted output and the actual (target) value.

The regression graph below charts the relationship between an independent and dependent variable. The black dots show the model’s output, while the blue diagonal line shows the target value. The distance between them (the short vertical blue lines) is the error.

Loss function.
Source
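
To make this concrete, here’s a minimal sketch of one common loss function, mean squared error (MSE), computed over hypothetical predictions and targets:

```python
# A minimal sketch of a loss function: mean squared error (MSE)
# between a model's predictions and the actual target values.
# The numbers below are made up purely for illustration.

def mse(predictions, targets):
    """Average of the squared errors between predicted and actual values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

predictions = [2.5, 0.0, 2.1, 7.8]
targets     = [3.0, -0.5, 2.0, 7.5]

print(mse(predictions, targets))  # a small value means predictions are close to targets
```

The lower this number, the closer the model’s output is to its target — driving it as low as possible is exactly the search for the global minimum.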

Before we get too far into loss functions, though, let’s review types of machine learning models, with an emphasis on neural networks.

Basic categories of machine learning include supervised learning (labeled training data), unsupervised learning (unlabeled training data), and reinforcement learning (updated with feedback).

Types of machine learning.
Source

Within these, there are various models.

Linear regression models find the relationship between a dependent variable and one or more independent variables, where the relationship is approximately linear.

Linear regression.
Source

In the above example, we see how the dependent variable increases roughly linearly as temperature (the independent variable) increases.
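
A simple linear regression can be fit with the closed-form least-squares solution. The temperature/sales numbers below are hypothetical stand-ins for the kind of data in the chart:

```python
# A minimal sketch of simple linear regression via least squares.
# The temperature (independent) and sales (dependent) data are made up.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

temps = [10, 15, 20, 25, 30]   # independent variable
sales = [12, 18, 22, 28, 32]   # dependent variable, roughly linear in temps

slope, intercept = fit_line(temps, sales)
print(slope, intercept)        # the fitted line: sales ≈ slope * temp + intercept
```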

Logistic regression is a classification algorithm that models relationships for binary outcomes, like true or false.

Logistic Regression.
Source

As we see above, the data above the threshold value would be positive (true), while data below it would be negative (false).
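
The decision step can be sketched in a few lines: a sigmoid squashes a linear score into a probability, and a 0.5 threshold turns that into a binary (true/false) classification. The weight and bias here are hand-picked for illustration, not learned:

```python
# A minimal sketch of logistic regression's classification step.
# The weight and bias are illustrative values, not trained parameters.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def classify(x, weight=2.0, bias=-5.0, threshold=0.5):
    probability = sigmoid(weight * x + bias)  # squash score into (0, 1)
    return probability >= threshold           # above threshold -> positive (True)

print(classify(1.0))  # low score, below threshold -> False
print(classify(4.0))  # high score, above threshold -> True
```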

Decision trees can be used for classification or regression tasks. They use a tree-like model with internal nodes (decisions based on features), branches (decision outcomes), and leaf nodes (class labels or numerical values).

Decision tree example.
Source

As we see above, the root node asks whether the object is alive, and the tree works through a series of yes/no questions to land at the final answer.
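
A tree like that maps naturally onto nested yes/no questions in code. The exact questions and labels below are hypothetical stand-ins, not the figure’s literal tree:

```python
# A hypothetical "is it alive?" decision tree sketched as nested yes/no
# questions. Each if-statement is an internal node, each return a leaf node.

def classify(is_alive, can_fly=False, lives_in_water=False):
    if not is_alive:            # root node: is the object alive?
        return "object"
    if can_fly:                 # branch: decision outcome "yes"
        return "bird"           # leaf node: class label
    if lives_in_water:
        return "fish"
    return "land animal"

print(classify(is_alive=False))               # "object"
print(classify(is_alive=True, can_fly=True))  # "bird"
```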

Support vector machines (SVMs) are classification algorithms that work in both low- and high-dimensional spaces and excel at finding the hyperplane that separates data points into different classes with the maximum margin, even when the relationships are complex.

Support Vector Machine.
Source

In the example above, we see how the SVM is used for classification, noting a maximum margin between positive and negative data points.

Naive Bayes involves probabilistic classifiers suitable for large datasets, like text classification or spam filtering (like by calculating the probability based on the words an email contains).

Naive Bayes.
Source

As we see above, the probabilistic model is “naively” classifying data by assuming the presence of a feature in a class is unrelated to other features. Such assumptions don’t always hold true in real-world scenarios, but the model is still useful for its simplicity and computational efficiency.
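
The “naive” independence assumption is easy to see in code: word probabilities are simply multiplied, as if each word’s presence were unrelated to the others. The tiny word tables below are made-up illustrative values, not trained estimates:

```python
# A minimal sketch of Naive Bayes spam filtering. The per-word
# probabilities and the prior are made-up values for illustration.

SPAM_WORD_PROB = {"free": 0.8, "winner": 0.7, "meeting": 0.1}
HAM_WORD_PROB  = {"free": 0.1, "winner": 0.05, "meeting": 0.6}

def spam_score(words, prior_spam=0.5):
    spam = prior_spam
    ham = 1 - prior_spam
    for w in words:
        # the "naive" step: multiply as if words were independent
        spam *= SPAM_WORD_PROB.get(w, 0.5)
        ham *= HAM_WORD_PROB.get(w, 0.5)
    return spam / (spam + ham)   # normalized probability of spam

print(spam_score(["free", "winner"]))  # high -> likely spam
print(spam_score(["meeting"]))         # low  -> likely ham
```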

K-nearest neighbor (KNN) can be used for classification and regression. In regression, it predicts the value of a new data point by averaging the numerical values of its ‘k’ nearest neighbors, while in classification, it assigns a data point to the majority class among its ‘k’ nearest neighbors.

K-NN before and after.
Source

As we see above, the algorithm predicts the category of the new data point based on the majority class among its ‘k’ nearest neighbors, where Euclidean distance is used to calculate the distances between the new data point and all other data points in the existing categories.
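
Here’s that classification step as a sketch, using Euclidean distance over hypothetical 2-D points:

```python
# A minimal sketch of k-nearest-neighbor classification with Euclidean
# distance. The training points and labels are made-up 2-D data.

import math
from collections import Counter

def knn_classify(point, data, k=3):
    """data: list of ((x, y), label). Returns the majority label of the k nearest."""
    by_distance = sorted(data, key=lambda d: math.dist(point, d[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_classify((2, 2), training))  # its nearest neighbors are all "A"
```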

Clustering algorithms are unsupervised learning algorithms (like DBSCAN, K-means, hierarchical clustering, and recently TeraHAC) that group similar data points.

Clustering algorithm.
Source

In the above example, the model groups an unlabelled dataset using a clustering algorithm (K-means).
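
The K-means loop itself is short: assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat. The 1-D data below is made up to keep the sketch readable:

```python
# A minimal sketch of K-means clustering on 1-D data. Real K-means
# works the same way in higher dimensions, with vector means.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)          # assignment step
        centroids = [sum(members) / len(members) # update step: move to mean
                     for members in clusters.values() if members]
    return sorted(centroids)

data = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
print(kmeans_1d(data, centroids=[1.0, 9.0]))  # two well-separated groups
```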

Ensemble methods are combinations of multiple individual machine-learning models.

Ensemble method of learning.
Source

As we see above, multiple pre-trained models are used to classify data, then the ensemble method aggregates their outputs.

Neural networks are machine-learning models built from layers of neurons whose weighted connections transform the input, through one or more hidden layers, into the output.

This chart is a bit older (2016) but gives you a general idea of different neural network architectures.

Neural networks chart.
Source

You may have also heard the term “deep” in reference to neural networks.

There’s a distinction between shallow neural networks and deep neural networks, which have multiple hidden layers and are more adept at learning complex relationships (patterns) and hierarchical representations of data.

Simple neural network vs. deep learning neural network.
Source

Deep neural networks can also benefit from transfer learning, where pre-trained models are fine-tuned for specific tasks.

Deep learning is the subfield of machine learning that focuses on deep neural networks, as illustrated in the figure below.

Hierarchy of AI chart from machine learning to deep learning.
Source

If you’re feeling overwhelmed at this point, let me give you some hope.

That figure above was originally from my article (first published last year) on Google’s rankings volatility. At the time when I added it, I didn’t know much beyond the main labels. Looking at it again, I now have a more solid understanding of its different components. It’s taken a lot of reading AI research papers and watching YouTube videos to get there — and like I said, I’ve barely scratched the surface — but there is a way forward!

On that note, let’s dig in a bit, because within deep neural networks, there’s a lot to know.

We’ll start with some common examples of DNNs.

Feedforward neural networks (FNNs) pass information in one direction, from input to output. They’re ideal for simple tasks where the relationships between inputs aren’t complex.

Feed Forward Neural Network.
Source

You can see above how the input goes unidirectionally through the model to create the output.
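
That single forward pass can be sketched in a few lines. The weights here are hand-picked for illustration; in a real network they would be learned during training:

```python
# A minimal sketch of one forward pass through a tiny feedforward
# network: one hidden layer with ReLU activation, then a linear output.
# All weights are illustrative, not learned.

def relu(x):
    return max(0.0, x)

def forward(inputs, hidden_weights, output_weights):
    # hidden layer: weighted sums of the inputs, passed through ReLU
    hidden = [relu(sum(w * i for w, i in zip(neuron, inputs)))
              for neuron in hidden_weights]
    # output layer: weighted sum of the hidden activations
    return sum(w * h for w, h in zip(output_weights, hidden))

inputs = [1.0, 2.0]
hidden_weights = [[0.5, -0.2], [0.3, 0.8]]  # one row per hidden neuron
output_weights = [1.0, -1.0]

print(forward(inputs, hidden_weights, output_weights))
```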

Convolutional neural networks (CNN) are designed for grid-like data (such as images or videos) and are good for pattern recognition and feature extraction.

Convolutional Neural Network.
Source

You can see how the model uses a grid to analyze parts of the input image to classify it in the output.

Recurrent neural networks (RNNs) process sequential data (time series, text, etc.) and have the ability to maintain an internal state or memory that captures information (dependencies) from previous steps in the sequence.

Recurrent Neural Network.
Source

You can see how the output gets reevaluated through the hidden layers, unlike a single forward pass.

Long short-term memory networks (LSTMs) are an evolution of RNNs designed to mitigate the vanishing gradient problem (where gradients of early layers in a deep network become small, slowing the learning process), making them better for long-range dependencies.

Long short-term memory neural network.
Source

This is a little harder to interpret, but essentially, the model selects important parts of the input to remember (or forget) to arrive at the correct output.

Transformers are a newer architecture that has largely replaced RNNs for natural language processing (NLP) tasks. They excel at capturing long-range dependencies in sequential data using an attention mechanism (focusing on relevant parts of the input sequence), which has made them instrumental in LLM development.

Transformer architecture.
Source

You can see both an encoder and decoder above. These are fundamental components of many neural network architectures, especially ones designed for sequence-to-sequence tasks (like translation or text summarization).
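
The attention computation at the heart of transformers can be sketched in a few lines. The query, key, and value vectors below are tiny hand-made examples; real models learn these and add scaling, multiple heads, and much more:

```python
# A minimal sketch of the core attention computation: similarity scores
# between a query and each key are softmax-normalized into weights that
# decide how much each value contributes to the output.

import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # dot-product similarity between the query and each key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # weighted sum of values: the most relevant positions contribute more
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

query  = [1.0, 0.0]
keys   = [[1.0, 0.0], [0.0, 1.0]]      # the first key matches the query
values = [[10.0, 0.0], [0.0, 10.0]]

print(attention(query, keys, values))  # output leans toward the first value
```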

Let’s dig into some of these concepts related to transformers a bit more.

Encoders take the input sequence (like a sentence or image) and transform it through multiple layers (like of an RNN or transformer) into fixed-length vector representations called context vectors (or latent representations), which encapsulate the meaning.

Latent representation example.
Source

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model that takes input sequences (like sentences) and processes them bidirectionally, considering the context of each word from both sides.

BERT model.
Source

Decoders take the context vectors from encoders and generate output sequences (like translated sentences or summaries).

Decoder
Source

GPT (Generative Pre-trained Transformer), in its earlier versions, was a decoder-only LLM.

GPT model.
Source

The T5 (Text-to-Text Transfer Transformer) framework is an encoder-decoder model that can both understand input and generate output, handling a wide range of tasks by framing them as text-to-text problems.

T5 Model.
Source

MUM (Multitask Unified Model) is a proprietary model but likely built in part upon T5. Modern GPT models (like GPT-4) are likewise capable of encoding and decoding.

Transformers also became central to NLP.

Natural language processing is a field that focuses on enabling computers to understand, interpret, and generate human language, like with tokenization, part-of-speech tagging, and named entity recognition.

NLP system architecture
Source

Natural language understanding (NLU) is a subset within NLP that focuses on machines understanding the meaning and intent of human language, a critical aspect of search.

NLU architecture.
Source

Natural language generation (NLG) is another subset within NLP where machines generate human-like text.

NLG architecture.
Source

Large language models (LLMs) are DNNs trained on massive amounts of text data, enabling them to understand, generate, and translate human language.

LLM architecture example.
Source

You can see below how foundational the transformer architecture was for LLM development.

Transformer based architectures.
Source

Getting back to our examples of DNN models …

Generative adversarial networks (GANs) pit a generator and a discriminator against each other to produce realistic synthetic data.

Generative Adversarial Network.
Source

You can see how the generator tries to create synthetic images so realistic the discriminator can’t tell the difference.

Graph neural networks (GNN) are a specialized type of neural network for non-Euclidean data that learn from the relationships between nodes and edges, making them powerful for real-world applications and complex relationships.

Graph neural network examples.
Source

If you remember nothing else from this article, remember GNNs. They not only apply to knowledge graphs, but graphs can also be powerful ways to visualize many types of datasets, as shown above.

We also have to touch on multimodal AI models, such as GPT-4o or Gemini 1.5, which are natively designed to process and understand multiple modalities, or content types, such as text, images, audio, and video.

These multimodal models likely employ different techniques to fuse different data types, but one possible component may be graph representations (as mentioned above), which capture interactions and dependencies between different modalities.

Multimodal graph learning.
Source

Phew! That was a lot.

And keep in mind, this field develops incredibly fast, so we’re only getting at some basic concepts here.

But now that we have a better idea of machine learning and neural network architectures, let’s turn our attention back to our core idea of the global minimum.

It starts with choosing a loss function.

The choice of loss function depends on its mathematical properties, the specific machine-learning task (regression might use mean squared error, classification might use cross-entropy loss, etc.), the characteristics of the data, and the desired behavior of the model during training.

Loss function types.
Source
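
Here’s a sketch of that task-dependent choice: MSE measures numeric distance for regression, while binary cross-entropy penalizes confident wrong probabilities heavily in classification. The values are illustrative:

```python
# A minimal sketch of two common loss functions and the tasks they suit.
# The prediction/label values are made up for illustration.

import math

def mse(pred, target):
    """Squared error: typical for regression."""
    return (pred - target) ** 2

def binary_cross_entropy(prob, label):
    """Log loss: typical for binary classification, where prob is the
    model's predicted probability of the positive class."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Regression: predicting a number
print(mse(2.5, 3.0))                    # small numeric miss, small loss

# Classification: predicting a probability for the true class (label 1)
print(binary_cross_entropy(0.9, 1))     # confident and right -> low loss
print(binary_cross_entropy(0.1, 1))     # confident and wrong -> high loss
```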

Something else to bear in mind is that, unlike hand-coded or rule-based systems, where weights are manually assigned, neural networks learn optimal weights automatically through a process of training and optimization.

Rule-based vs. Machine learning algorithm example chart for NLP.
Source

The above chart is in reference to NLP.

Here are more examples of neural network applications:

  • The input might be a picture of a dog. The output might be the label, “Dog.” (Image classification.)
  • The input might be a link-graph. The output might be a determination of a PBN or not. (Link analysis.)
  • The input might be a document. The output might be vectors embedded in a high-dimensional space. (Vector embeddings.)
  • The input might be a query. The output might be a vector embedding approximating its nearest (most relevant) documents. (Semantic search.)

In a neural network, the input data gets fed forward (forward pass or propagation) through the hidden layers of neurons to produce the output.

Forward propagation.
Source

The loss function quantifies the difference (error) between the model’s actual output and its target value.

Loss function example from classification model.
Source

The graph of the loss function visualizes the relationship between the model’s parameters and the error it produced.

Loss function graph.
Source

The method used to calculate the gradients (the direction and magnitude of change) required for each parameter is backpropagation, which happens by working backward through the network, starting with the output.

Forward pass and backpropagation.
Source
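
For the smallest possible case — a single linear neuron with squared-error loss — backpropagation is just the chain rule applied backward from the loss to the weight. The values are illustrative:

```python
# A minimal sketch of backpropagation for one linear neuron with
# squared-error loss: work backward from the loss, multiplying local
# derivatives (the chain rule) to get the weight's gradient.

def forward(w, x):
    return w * x                          # the neuron's prediction

def backward(w, x, target):
    pred = forward(w, x)
    d_loss_d_pred = 2 * (pred - target)   # derivative of (pred - target)^2
    d_pred_d_w = x                        # derivative of w*x with respect to w
    return d_loss_d_pred * d_pred_d_w     # chain rule: dL/dw

w, x, target = 0.5, 2.0, 3.0
print(backward(w, x, target))  # negative gradient -> increase w to reduce loss
```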

An optimization algorithm called gradient descent then uses the gradients determined by backpropagation to update the model’s weights and biases iteratively (step by step) in the direction that minimizes the loss.

Gradient descent 2d gif.
Source

Here’s another example:

Gradient descent 3d gif.
Source
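
The descent loop itself is tiny. Here’s a sketch on a deliberately simple loss, f(x) = (x − 3)², whose global minimum sits at x = 3:

```python
# A minimal sketch of gradient descent on f(x) = (x - 3)^2.
# Each step moves the parameter against the gradient, scaled by the
# learning rate, until the loss can't be minimized further.

def gradient(x):
    return 2 * (x - 3)      # derivative of (x - 3)^2

x = 0.0                     # initial parameter value
learning_rate = 0.1

for step in range(100):
    x -= learning_rate * gradient(x)

print(round(x, 4))          # converges toward the global minimum at x = 3
```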

Along the way, there can be local minima, or the lowest points of a function within a specific interval, where gradient descent can get stuck.

Local minimum on a graph.
Source

One way to avoid getting stuck in a local minimum is to use techniques like stochastic gradient descent (SGD).

Stochastic gradient descent.
Source

Another is to add momentum.

SGD with momentum.
Source
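
Here’s a sketch of how momentum helps. The function, starting point, and rates below are hand-picked for illustration: f(x) = x⁴ − 3x² + x has a local minimum near x ≈ 1.13 and a global minimum near x ≈ −1.30, and the accumulated velocity carries the parameter through the shallow local minimum that traps plain gradient descent:

```python
# Comparing plain gradient descent with momentum on an illustrative
# function with both a local minimum (~1.13) and a global one (~-1.30).
# All hyperparameters here are hand-picked for the demonstration.

def grad(x):
    return 4 * x**3 - 6 * x + 1    # derivative of x^4 - 3x^2 + x

def plain_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def momentum_descent(x, lr=0.01, mu=0.9, steps=1000):
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(x)  # velocity accumulates past gradients
        x += v
    return x

print(plain_descent(2.0))      # settles in the local minimum (~1.13)
print(momentum_descent(2.0))   # carries through to the global minimum (~-1.30)
```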

The global minimum, meanwhile, is the lowest point overall, or the optimal solution.

Global minimum as optimal solution.
Source

It’s not always possible to reach a global minimum, however, such as in complicated functions. (Yikes!)

Complicated function in a graph.
Source

Hyperparameter tuning can help improve a model’s performance by finding better configurations for the learning process.

Hyperparameter tuning example.
Source
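
One simple tuning strategy is a grid search: run the same training loop under each candidate setting and keep the one with the lowest final loss. The quadratic “model” below is purely illustrative — a stand-in for a full training run:

```python
# A minimal sketch of hyperparameter tuning as a grid search over
# learning rates. The quadratic loss (x - 3)^2 stands in for a real
# model's training run.

def final_loss(learning_rate, steps=50):
    x = 0.0                               # parameter being trained
    for _ in range(steps):
        x -= learning_rate * 2 * (x - 3)  # gradient step on (x - 3)^2
    return (x - 3) ** 2

grid = [0.001, 0.01, 0.1, 0.5, 1.1]  # too small = slow, too large = diverges
best = min(grid, key=final_loss)
print(best)
```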

However, the goal is often to find a “good enough” solution.

I think it’s helpful to think about SEO in terms of optimizing for the global minimum.

There are at least three different ways to approach this in a figurative sense. (These are metaphorical examples. Not literal interpretations.) 😉

The most obvious is search engine systems.

Guide to Google Search ranking systems.
Source

Let’s say, for example, that Google had a site-wide classifier for content helpfulness.

That classifier might use a transformer model like BERT to understand semantic relevance or depth of context.

That might be complemented by GNNs to analyze the website’s structure, relationships between pages, and link graph, identifying topical clusters or internal and external linking patterns to better assess topical authority and overall quality.

It might use logistic regression to make a binary classification of extracted content features as helpful or not.

Or it might use a more complex classification algorithm or decision tree to answer a series of yes/no questions regarding helpfulness.

It could use support vector machines to set the boundary between helpful and unhelpful content based on a set of features.

Or it might use Naive Bayes, if the classifier primarily analyzes textual content on the website.

Convolutional neural networks might also be used if the classifier considers visual elements of the website, such as images or video.

Recurrent neural networks or LSTMs could be used as well to classify the sequential flow of content or user behavior on the site, like if users are finding content engaging or helpful based on their reading patterns.

Loss functions would be important for penalizing the model when it incorrectly assesses a website’s helpfulness (what we might call HCU casualty recoveries in the future).

Gradient descent would be critical for iteratively adjusting the classifier’s parameters and avoiding local minima while better optimizing its accuracy.

Hyperparameter tuning would also be needed to find the best settings for the algorithms and architecture within the classifier to achieve the most accurate classifications, like exploring different learning rates, number of hidden layers, or ‘k’ values in a KNN algorithm.

Considering the scale of the web (dataset), the classifier would likely also need to strike a “good enough” balance between accuracy and computational efficiency, involving techniques like pruning and quantization.

Ensemble methods might also apply to combining multiple models to improve the robustness and accuracy of the classifier, such as the earlier example of BERT and GNNs combined with a regression model.

Multimodal AI models would likely be useful as well for analyzing not just text but also images, videos, and user interaction patterns for a more comprehensive assessment of content helpfulness.

All of the applicable models would then require constant evolution to refine and update the algorithms for new trends and challenges.

And that’s just one individual system.

Now imagine if that system were incorporated within larger core systems.

Then imagine how many different types of models and systems are at work in those core systems, and how they might interact or depend on each other and other minor systems.

Imagine next the scale of the web and the data points those systems operate with, factoring in refreshes of the index and changes for user preferences and also technology.

The plural of minimum is “minima.”

If we think about the global minimum in search as all of the individual components in the system working together optimally to serve the best results for a user’s query, consider all of the local and global minima within the systems contributing to that goal, and how they may evolve day to day.

Then consider the SERPs themselves.

Google desktop SERP.

Some of those systems above we just reviewed would no doubt analyze content for indexing, using techniques like NLP to extract entities, topics, and sentiments from text or computer vision algorithms to analyze images and videos.

Others might determine the content’s quality or relevance to topics, assessing factors like semantic similarity between documents and queries or even alignment (conceptually) with E-E-A-T criteria.

While other systems might pare down listings of results for ranking, using algorithms that weigh hundreds of factors and predict optimal orders of results based on historical click data.

Then there’d be systems to build search result pages, dynamically assembling them.

Those pages might be different based on the user’s query, search history, preferences, device, or location.

Then there are different types of elements, like AI Overviews (formerly SGE), text-based web results, shopping graph or knowledge graph results, multimedia results, and social media results that each likely have individual underlying systems as well as overarching systems that fit them all together.

Imagine then how many systems are at play for a single person’s search, each one with its own algorithms and data inputs, all interacting and coordinating.

Then think about system-wide responsiveness to dynamic inputs, like user interaction data, where clicks, scrolls, and query refinements send back signals that can be used to update models for future results, creating a feedback loop where user behavior influences search system evolution.

We might say (in simplified thinking) the global minimum for a SERP is the point where the user finds the absolute best result for their query in the shortest amount of time — immediate satisfaction.

Every extra second of consideration, superfluous click, scroll, or hover, or query refinement is a parameter (a hint of misalignment with the user’s needs and intent) in a SERP’s loss function, so to speak (again, simply) — it’s the error between what it’s showing and what that user ideally wanted.

Local minima might be less-than-ideal SERPs tied to ambiguous queries, evolving user expectations, limitations of ranking algorithms, or content quality issues.

Every tweak, experiment, or fluctuation is an iterative step toward a global minimum, a perfect click.

Now imagine those dynamics at scale, across billions of users, queries, and documents, and how many local and global minima stand in between.

Looking at these examples from that point of view, I think this gives new perspective to complex topics like “ranking volatility” and even “HCU-impacted sites,” as well as understated phrases like “Google uses machine learning.”

Figuratively or literally, the global minimum is the optimal result for every situation.

Meanwhile, the journey to get there is staggeringly complex.

We can also think about our website content this way.

My blog.

If we’re creating content for users to find organically online that satisfies their search intent or passive interest and furthers our business goals, there’s a point at which the content is the most helpful.

It could be a blog post that thoroughly answers a complex question or a product page that gives all the information needed to lead to a purchase.

That’s the global minimum, where the content has the right information, style, authority, and other parameters (so to speak) to where we can’t optimize it further.

Of course, that global minimum might be different for every person within our target audience, which means it’s a moving target.

To hit the global minimum would be to satisfy every target user, which is impossible.

That’s why, realistically, we must embrace a “good enough” balance, pleasing as much of our ideal audience as possible, while also iteratively (and constantly) moving toward the global minimum (as the optimal solution) — finding opportunities for improvement based on our own loss function (i.e., rankings, freshness, relevance, CTR, engagement rates, user feedback, conversions, etc.).

Final thoughts

Thinking figuratively about our content or a search engine’s systems and results in this context of neural networks and global minima can help us see SEO from a different perspective, sparking new ideas or frameworks for discussion.

Maybe keyword rankings are a local minimum, when the true global minimum is an optimal user experience fostering conversions and long-term engagement.

Maybe a knowledge of neural network vocabularies can enrich the way we discuss algorithm updates, going beyond “machine learning” to thinking about what types of models might be used.

Maybe we can use GNNs to optimize website structures or NLP to understand semantic relevance and user intent.

Maybe by speaking the language of neural networks, we can better collaborate with the professionals moving the AI field forward, contributing to the conversation, like why AI Overviews are or aren’t helpful.

Outro

My hope is this article has contributed to the vocabulary and knowledge of neural networks, as well as helped broaden a few worldviews of search and what it means to optimize websites holistically in light of a figurative global minimum.

One final vocabulary word I’ll leave you with is interpretability.

This refers to the ability to explain the reasoning behind a model’s predictions or decisions.

Sometimes algorithms in neural networks are described as a “black box.”

After all, they literally use “hidden” layers.

This means it’s not always easy to explain why outcomes are the way they are.

I think having more knowledge of neural network architectures can help us here, as well.

The next time a core update rolls out, maybe we can’t say, “This update seemed to have targeted ABC sites that are doing XYZ.”

But we can say, “The systems have moved closer to a new global minimum, accounting for updated data sources, user behavior, and model capabilities.”

And when sites have negative rankings impacts, maybe we can start by acknowledging that rankings probably aren’t the true global minimum, anyway.

Especially now that we’re in the Gemini era. 😉

I’ll continue to work on this article to improve its information and readability, as well as add more AI-themed articles to my blog and through its Hamsterdam Research project.

Until next time, enjoy the vibes:

Thanks for reading. Happy optimizing! 🙂

