
Could Anthropic’s Identification of Millions of Features (Concepts) Activated Inside Its LLM (Claude 3 Sonnet) Influence Our Semantic SEO Strategies? (Maybe)

By Ethan Lazuk


Scaling monosemanticity artistic rendering of feature neighborhood.

Welcome to another week of Hamsterdam Research, where we look at recent AI research papers to learn just what the heck they’re talking about and explore their possible implications for the future of search and SEO.

While we typically look at papers from Google Research, this is a special edition where we’ll be investigating an Anthropic paper called “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.”

As SEOs of the modern age, we commonly optimize for concepts over keywords.

These semantic SEO strategies typically involve knowledge graph entities and vector embeddings, i.e., the meanings behind words.

Anthropic’s latest research paper reveals millions of concepts that get at a similar idea, but with a twist.

Rather than entities in a knowledge graph, the research shows neighborhoods of related features activated inside the neural network of its large language model, Claude.

A-wickety-whaaat?!

That’s right.

The paper lets us peek behind the curtain at how LLMs compute answers.

For AI researchers, it helps move past the “black box” problem of interpreting model behavior, which matters for things like improving safety.

Meanwhile, for us SEOs, it offers further insight into how AI models might encode real-world concepts as vector embeddings based on semantic relationships, which we could potentially use to push semantic SEO strategies forward.

In other words, more concepts, fewer keywords.

In terms of actionable SEO insights, this is still theoretical and based on early-stage research.

But these concepts in feature neighborhoods could be translated to pillar-cluster groups (like we’ve historically done with keywords) for even more holistic and semantically informed content strategies, site architectures, and buyer’s journey models.

Quick story on selecting the topic, and why it matters to SEO (probably)

I discovered the Anthropic paper when I came across a thread from Emmanuel Ameisen, a research engineer at Anthropic.

I wasn’t following Emmanuel yet, so I also want to give credit to Darwin Santos for retweeting another thread by Muratcan Koylan on this topic.

The paper itself is quite large, as you can see from its contents overview:

Content outline of Anthropic research paper.

I read half of it on my phone and was planning to summarize it all here.

Fortuitously, though, I saw an Ars Technica article by Kyle Orland in Google Discover the other morning that already did a nice job of that:

Ars Technica article: “Here’s what’s really going on inside an LLM’s neural network.”

I then checked Google’s People also view results and saw Anthropic’s own post with a summary:

Anthropic post: “Mapping the Mind of a Large Language Model.”

Here are some choice excerpts from the Anthropic post:

“Today we report a significant advance in understanding the inner workings of AI models. We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model. This interpretability discovery could, in future, help us make AI models safer. …

It turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts.

Previously, we made some progress matching patterns of neuron activations, called features, to human-interpretable concepts. We used a technique called ‘dictionary learning’, borrowed from classical machine learning, which isolates patterns of neuron activations that recur across many different contexts. In turn, any internal state of the model can be represented in terms of a few active features instead of many active neurons. Just as every English word in a dictionary is made by combining letters, and every sentence is made by combining words, every feature in an AI model is made by combining neurons, and every internal state is made by combining features. …

Luckily, the engineering and scientific expertise we’ve developed training large language models for Claude actually transferred to helping us do these large dictionary learning experiments. We used the same scaling law philosophy that predicts the performance of larger models from smaller ones to tune our methods at an affordable scale before launching on Sonnet. …

We successfully extracted millions of features from the middle layer of Claude 3.0 Sonnet, (a member of our current, state-of-the-art model family, currently available on claude.ai), providing a rough conceptual map of its internal states halfway through its computation. This is the first ever detailed look inside a modern, production-grade large language model.

Whereas the features we found in the toy language model were rather superficial, the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet’s advanced capabilities. …

We were able to measure a kind of ‘distance’ between features based on which neurons appeared in their activation patterns. This allowed us to look for features that are ‘close’ to each other. …

This holds at a higher level of conceptual abstraction: looking near a feature related to the concept of ‘inner conflict’, we find features related to relationship breakups, conflicting allegiances, logical inconsistencies, as well as the phrase ‘catch-22’. This shows that the internal organization of concepts in the AI model corresponds, at least somewhat, to our human notions of similarity. …

The fact that manipulating these features causes corresponding changes to behavior validates that they aren’t just correlated with the presence of concepts in input text, but also causally shape the model’s behavior. In other words, the features are likely to be a faithful part of how the model internally represents the world, and how it uses these representations in its behavior. …

We hope that we and others can use these discoveries to make models safer. For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors (such as deceiving the user), to steer them towards desirable outcomes (debiasing), or to remove certain dangerous subject matter entirely. …

The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive (the computation required by our current approach would vastly exceed the compute used to train the model in the first place). Understanding the representations the model uses doesn’t tell us how it uses them; even though we have the features, we still need to find the circuits they are involved in.” [Highlights added.]

– Anthropic blog (Mapping the Mind of a Large Language Model)

Since these summaries are available already, I decided to focus on two aspects of the paper that I think are relevant to SEO professionals:

  1. Key vocabulary and concepts in the Anthropic paper that can expand our knowledge of AI models, in general.
  2. Insights about features (concepts) as they relate to vector embeddings and semantic relationships.

I’ve mentioned feature neighborhoods a couple of times already, but what does that mean?

Well, basically, features are concepts (patterns of neuron activations inside the LLM) that are grouped into neighborhoods by distance, based on their semantic similarity.

Here’s an example of the immunology feature:

Nearest neighbors to the immunology feature graphic from Anthropic paper.

As you can see, it’s surrounded by related concepts (like the lymphatic system, immune system functioning, and inflammation), which are grouped together within larger categories (like common diseases, vaccines and immunizations, and autoimmunity).

Here’s a zoomed-in example from another feature, “inner conflict”:

Inner conflict feature from Anthropic paper.

We can see how related concepts can be abstract, like romantic struggles, hesitation detection, and mixed emotions.

So how might SEOs interpret this information?

In addition to my own thoughts, I asked my colleagues — ChatGPT (GPT-4o) and Gemini Advanced — and seeing as we’re looking at an Anthropic paper, Claude was part of the mix, too.

Here’s what everyone thought, specifically about the paper’s feature neighborhood sections:

ChatGPT said:

“SEOs might find the information about feature neighborhoods and monosemanticity relevant as it mirrors how search engines interpret and rank content based on semantic similarity and context. Understanding how closely related features or topics are grouped together can inform content clustering, keyword research, and the creation of comprehensive content hubs. This approach enhances semantic SEO by covering related subtopics, improving internal linking, and ensuring content completeness. Additionally, insights into feature splitting and concept coverage can guide SEOs in targeting both broad and specific keywords, analyzing competitors, and better understanding search engine algorithms, ultimately boosting site visibility and relevance in search results.”

Gemini said:

“Research on AI language models reveals how they process and connect concepts, offering SEOs valuable insights for keyword research, content clustering, and internal linking strategies. The study’s findings on feature completeness and frequency highlight content gaps and the importance of content prioritization based on frequency in training data. Additionally, understanding how AI models scale and infer meaning can help SEOs predict future trends in AI and search, while also tailoring content to better align with user intent and search engine algorithms.”

Claude said:

“SEOs might care about this information detailing the inner workings and representations learned by large language models like Claude because it provides insights into how advanced AI systems extract, represent, and relate different concepts and entities like people, places, code syntax, etc. As search engines integrate more AI capabilities for understanding content semantics and relationships beyond just keyword matching, having a window into the feature spaces and computations used by cutting-edge language models could help SEOs anticipate future shifts in search relevance algorithms. This could inform strategizing content creation and optimization to ensure entities, concepts, and semantic connections are modeled in a way that will be recognized as maximally relevant by AI-powered search engines.”

And I said:

“Given the breadth of training data that these feature embeddings are based on, we as SEOs may be able to find unique associations between concepts to inform new content topics, site architectures, or user journeys that we may not have identified through other methods, such as keyword or competitor research. There may be instances where this type of feature interpretability can be applied to specific datasets, say a corpus of relevant websites or customer reviews. Visually seeing how LLMs activate features and make sense of the world’s information semantically can also enhance our understanding of how AI-driven search systems may understand natural language.”

Still intrigued?

Let’s start with the paper’s introduction 😉

“Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” was published on May 21st, 2024, by Anthropic.

It has several contributors, including Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.

It opens with a few introductory paragraphs and a first section called “Scaling Dictionary Learning to Claude 3 Sonnet.”

We’ll focus on defining the key vocabulary from these early parts first to get a sense of the paper’s content, as well as learn foundational knowledge about AI models, in general.

Note: I’ll be using Gemini Advanced and Claude 3 Sonnet to break down some of the vocabulary and concepts from the paper. All quotes will be attributed to the paper unless otherwise stated.

The opening paragraph explains how this research builds on prior efforts, scaling from a one-layer transformer to a production-grade LLM:

“Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we’re pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic’s medium-sized production model.”

Let’s go through the key concepts in this section, in hierarchical order.

First off, neural networks are a type of machine learning model that have interconnected nodes (artificial neurons) that receive input data and process it using weighted connections, producing an output signal.

These weights are updated iteratively, based on training data and backpropagation (a process known as learning), to achieve the highest accuracy (minimal loss). Because they can learn complex patterns or relationships in data, neural networks are great for natural language processing tasks.
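
To make that a bit more concrete, here's a hypothetical Python sketch of a single artificial neuron's computation. (The numbers are made up, and real models chain together many thousands of these neurons across many layers.)

```python
import numpy as np

# A toy artificial neuron: a weighted sum of inputs plus a bias,
# passed through an activation function. (Purely illustrative values.)
inputs = np.array([0.2, 0.7, 0.1])     # signals arriving from the previous layer
weights = np.array([0.9, -0.3, 0.5])   # connection strengths learned during training
bias = 0.1

weighted_sum = np.dot(inputs, weights) + bias
output = max(0.0, weighted_sum)        # a simple activation: keep positives, zero out negatives
print(output)                          # this value becomes an input to the next layer
```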

Transformers are a type of neural network originally designed as a sequence-to-sequence model. They take a sequence of input data (like words in a sentence) and produce a sequence of output data (like a response).

The original transformer used an encoder-decoder structure, where the encoder processes the input data and the decoder generates the output. (Many modern LLMs use a decoder-only variant of that architecture.)

Sparse autoencoders (SAEs) are a type of neural network that uses unsupervised learning to re-express input data as encodings. The encodings, as these representations are called, are numerical vectors that capture the most salient features of the original input.

Unlike dense embeddings (where most dimensions hold nonzero values in order to capture nuanced relationships), the sparse representations from SAEs associate each input with only a small subset of active features.
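
Here's a small, hypothetical illustration of that dense-versus-sparse difference (the vectors are invented, not pulled from the paper):

```python
import numpy as np

# A dense embedding: nearly every dimension carries some nonzero value,
# so no single dimension has an obvious standalone meaning.
dense = np.array([0.12, -0.87, 0.44, 0.05, -0.31, 0.66, -0.09, 0.23])

# A sparse representation: only a couple of dimensions are "on" at once,
# which makes it easier to ask what each active dimension stands for.
sparse = np.zeros(8)
sparse[[2, 5]] = [1.4, 0.8]

print(np.count_nonzero(dense))   # 8 -> every dimension contributes
print(np.count_nonzero(sparse))  # 2 -> only two features are active
```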

The monosemantic features recovered by the sparse autoencoders are individual units (directions in the model’s activation space, rather than single neurons) that activate only in response to a particular concept, making them easier to understand and interpret than polysemantic neurons, which fire for many unrelated concepts.

The small one-layer transformer the researchers initially used had one layer of self-attention and could only capture simple relationships in data.

The attention mechanism is the core of a transformer, allowing the model to weigh the importance of different parts of the input sequence and create dense embeddings with contextual meaning captured from long-range dependencies, or relationships between words that are far apart.
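
If you're curious what attention looks like mechanically, here's a minimal sketch of the standard scaled dot-product formulation in Python (a textbook version, not Claude's actual implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each token's query scores every token's key; the scores become
    weights over the values, letting distant words influence each other."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how relevant is each token to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # weighted mix of value vectors

# Three toy tokens with 4-dimensional representations (made-up numbers).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)    # (3, 4): one contextualized vector per token
```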

More attention layers lead to state-of-the-art results in language modeling. The concern was whether the interpretability method would scale to a production-grade LLM, which uses many attention layers. The researchers found, however, that their scaling-law approach applied, and they could identify features in Claude 3 Sonnet, a medium-sized production model.

Now let’s talk about the details of features.

“We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).”

Features are patterns of neuron activations that can represent both abstract and concrete versions of a concept. When we talk about entities in SEO, we see them as invariable to language or form. (An iPhone is the same in any language, whether we’ve typed the word or shown a picture.) This is what’s meant by multilingual and multimodal features.

That takes us through the vocabulary in the paper’s introductory paragraphs.

Now let’s explore the section called “Scaling Dictionary Learning to Claude 3 Sonnet.”

“Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis and the superposition hypothesis. … At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as features – as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.

If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning. Recently, several papers have suggested that this can be quite effective for transformer language models. In particular, a specific approximation of dictionary learning called a sparse autoencoder appears to be very effective.” [References removed.]

In a neural network, each neuron receives input signals that it processes and then produces an output signal called an activation.

In transformer architectures, these activations are intermediate computations that happen at various stages of the model.

Each neuron’s output is a single value, but since there are many neurons in each layer, their combined activations form a vector in a high-dimensional space.

The linear representation hypothesis says the directions of these vectors correlate with meaningful concepts, or features.

Another way to say it is that each neuron in a model is a dimension in a vast space, and certain combinations of dimensions (creating vectors moving in a specific direction) correspond to particular concepts.

The superposition hypothesis then says that neural networks can efficiently represent a vast number of concepts, even more than the number of their individual neurons. In other words, the number of concepts derived from the dimensions is greater than the sum of their parts.

This happens because the directions of dimensions in the high-dimensional space can be nearly orthogonal (independent) to each other, which means multiple concepts can be superimposed (combined) in the same activation space without interfering with each other.

[Analogy: Think of each neuron (dimension) as a musical note, and each concept as a chord (multiple notes). A musician can play multiple chords using a limited number of notes. (Now we’ll have to end this article with some Adam Jones guitar … stay tuned!)]
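
Here's a quick, hypothetical numpy experiment that shows why this is plausible: in a high-dimensional space, random directions are already nearly orthogonal, so far more concept directions can coexist than there are neurons (dimensions).

```python
import numpy as np

rng = np.random.default_rng(42)
dims = 1000          # pretend each dimension is a neuron
n_concepts = 5000    # far more concepts than neurons

# Random unit vectors standing in for concept directions.
directions = rng.normal(size=(n_concepts, dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# The cosine similarity between two random directions is tiny on average,
# so the concepts barely interfere with one another.
sample = directions[:200]
sims = sample @ sample.T
np.fill_diagonal(sims, 0)
print(np.abs(sims).max())   # typically well under 0.2 in 1,000 dimensions
```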

Dictionary learning is a technique used in machine learning to represent complex data in more efficient and interpretable ways. It finds a set of atoms or basis vectors (a.k.a. the dictionary) that can be linearly combined to reconstruct the original data. (Think of atoms as building blocks, like corners, edges, or textures of an image.)

This is why sparse autoencoders are used, because the idea is to use as few atoms as possible while still retaining accuracy.

The sparse autoencoder introduces a sparsity constraint into the learning process, so only a few atoms from the dictionary are used to represent each input — the monosemantic features.

“SAEs are an instance of a family of ‘sparse dictionary learning’ algorithms that seek to decompose data into a weighted sum of sparsely active components.

Our SAE consists of two layers. The first layer (‘encoder’) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as ‘features.’ The second layer (‘decoder’) attempts to reconstruct the model activations via a linear transformation of the feature activations.

Once the SAE is trained, it provides us with an approximate decomposition of the model’s activations into a linear combination of ‘feature directions’ (SAE decoder weights) with coefficients equal to the feature activations. The sparsity penalty ensures that, for many given inputs to the model, a very small fraction of features will have nonzero activations.”

Let’s break this down starting with the first layer.

The encoder takes the raw activation values from a layer in the transformer (presumably the middle layer) and applies a learned linear transformation. This essentially rotates and stretches the activation space into a new, higher-dimensional space where the underlying features are more easily separable and interpretable.

ReLU nonlinearity is a type of activation function. (Recall from last week’s Hamsterdam news recap, Part 58, that an activation function is a mathematical operation applied to the weighted sum of the inputs to neurons in the hidden layers of a neural network, helping the model learn nonlinear relationships.) ReLU sets negative values to zero and keeps positive values as they are, allowing the encoder to capture more complex (nonlinear) relationships between features.

The decoder then takes the feature activations and applies another learned linear transformation, mapping the features back into the original dimension of the transformer activations.

The goal is to learn a transformation that can accurately reproduce the original activations from the sparse set of features generated by the encoder, capturing the essential information (the “features”) present in the activations.

After training, the sparse autoencoder has seen a large amount of data and now knows how to efficiently represent a transformer’s activations as a combination of specific directions (features) in the activation space.

Each decoder weight, as they’re known, is a vector in the space, its direction corresponding to a concept the sparse autoencoder has learned.

The sparsity penalty ensures only a small fraction of features have “nonzero activations,” which means only a few feature directions are used to represent the input (sparse representation). (This is in contrast to dense representations (embeddings) where most features are nonzero.)

[Analogy: If a painter wants to recreate a complex scene with a limited palette of colors, each feature direction would be like a color, and each feature activation (coefficient) would be the amount of that color. Multiple colors can be mixed or blended in different proportions to recreate the original scene.]
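
Pulling those pieces together, here's a minimal, hypothetical sparse autoencoder sketch in Python. It mirrors the structure described above (a linear encoder, a ReLU, a linear decoder, and a sparsity penalty on the feature activations), but it's a toy illustration under my own assumptions, not Anthropic's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 64, 512        # features live in a higher-dimensional space than the activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

def encode(activations):
    # Linear transformation followed by ReLU: only some features switch on.
    return np.maximum(0.0, activations @ W_enc + b_enc)

def decode(features):
    # Reconstruct the original activations from the sparse feature activations.
    return features @ W_dec + b_dec

def loss(activations, l1_coeff=1e-3):
    features = encode(activations)
    reconstruction = decode(features)
    mse = np.mean((activations - reconstruction) ** 2)   # stay faithful to the original activations
    sparsity = l1_coeff * np.abs(features).sum()         # but use as few active features as possible
    return mse + sparsity

x = rng.normal(size=(d_model,))       # a stand-in for one activation vector from the model
print(loss(x))
```

In the real paper, the activations come from the middle layer of Claude 3 Sonnet, and the dictionaries scale up to millions of features.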

Phew!

That was a lot of vocabulary, but I think it sets us up to understand what features actually are in AI models.

Now we’ll dig into examples of features, which is where it’ll get fun.

There’s a lot left in the remainder of the paper that’s relevant to AI engineers, specifically around how features influence model behavior. I’m going to skip over much of that, though.

Interpretable features in AI models

In the section called “Assessing Feature Interpretability,” the researchers look at four examples of features.

These include two more straightforward features (The Golden Gate Bridge and brain sciences) and two that are more complex and abstract (monuments and popular tourist attractions, and transit infrastructure).

In these examples, the researchers show the top 20 text inputs from the sparse autoencoder’s dataset for each feature, with darker orange highlighting indicating stronger activation.

The examples below show the degree of activation for features at the token level (words or parts of words):

Activated features at token level from Anthropic research paper.

Does that remind you of anything?

It made me think of the Google Cloud natural language API demo tool, which I think SEOs have been playing with for many years now:

Google Cloud natural language API demo example.

Of course, we’re talking about fundamentally different topics here (knowledge graph entities vs. LLM feature activations), but I see cross-over — like they’re both relevant to semantics (concepts over keywords).

How reliably an activated feature corresponds to its concept (its relevance, in a sense) is known as specificity. (You might recall in our article on PLEDGE we talked about trade-offs between attribution (correctness) and specificity (relevance).)

Here we see an example of the feature activation distribution for the Golden Gate Bridge feature, where the higher a feature’s activation level was, the higher its specificity score, per the researchers’ rubric:

Feature activation distribution for Golden Gate Bridge.

Here’s an interesting correlation they noted:

“… we see that these features become less specific as the activation strength weakens. This could be due to the model using activation strengths to represent confidence in a concept being present. Or it may be that the feature activates most strongly for central examples of the feature, but weakly for related ideas …”

The word “confidence” is also used in the natural language demo tool, where confidence scores are shown for certain aspects of passages, like overall categories:

Confidence scores for categories in Google Cloud natural language API demo.

Confidence scores used to be shown for the entities themselves, but no more.

Anyway, that reference to model confidence just reminded me of confidence scores, which you’ll find mentioned in Google patents in various contexts.

Now that we’ve seen what activated features look like as tokens, let’s explore the “local structure of features, which are organized in geometrically-related clusters that share a semantic relationship.”

This is where we get into the domain of feature neighborhoods.

In these examples, you’ll see circles in reference to 1M (yellow), 4M (green), and 34M (blue) runs. These refer to the number of features in the sparse autoencoders used for the dictionary learning.

The closeness between features is “measured by the cosine similarity of the feature vectors,” where neighborhoods include “features that share a related meaning or context.”

Cosine similarity is a measure of similarity between two nonzero vectors in a high-dimensional space. It’s calculated as the cosine of the angle between the two vectors, ranging from -1 (pointing in opposite directions) through 0 (unrelated, or orthogonal) to 1 (pointing in the same direction). (It’s also preferred when Euclidean distance (a straight line) isn’t useful due to “the curse of dimensionality,” which impacts computational efficiency and overfitting.)
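
Here's a hypothetical sketch of that measurement in Python: compute cosine similarity between feature direction vectors and rank the nearest neighbors. (The vectors and labels are invented for illustration; in the paper, the real vectors are the SAE decoder weights.)

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up feature directions standing in for learned feature vectors.
features = {
    "immunology":        np.array([0.9, 0.1, 0.0, 0.2]),
    "lymphatic system":  np.array([0.8, 0.2, 0.1, 0.3]),
    "inflammation":      np.array([0.7, 0.3, 0.0, 0.4]),
    "guitar solos":      np.array([0.0, 0.1, 0.9, 0.1]),
}

query = features["immunology"]
neighbors = sorted(
    ((name, cosine_similarity(query, vec)) for name, vec in features.items() if name != "immunology"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(neighbors)   # biology-related features score near 1, unrelated ones near 0
```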

We saw examples of feature neighborhoods earlier in this post. Those were taken from screenshots in the paper.

However, the researchers also created an interactive feature UMAP with additional neighborhoods to explore!

You might recall from our post about the embedding language model (ELM) that UMAP (uniform manifold approximation and projection) is helpful for visualizing data that follows patterns not easily captured by flat, linear projections.

The following examples are taken from the feature UMAP.

This first neighborhood shows the highlighted feature of “Golden State Warriors team.” Check out some of the concepts around it:

Golden State Warriors team feature neighborhood.

A few that jump out to me are “Grateful Dead band” and “Low clouds and morning fog.”

Let’s say we’re creating website content about San Francisco. Those are the types of related topics our audience might find relevant but that we may not think to incorporate based on keyword research.

Another interesting neighborhood was from the “Conflicting perspectives” feature, which includes “Cultural identity” as a neighbor.

Conflicting perspectives feature neighborhood.

In addition to related topics, this also made me think about how these multilingual concepts could be used for creating more localized or culturally relevant content, basing translations on concepts as opposed to words.

But features aren’t only useful for making topical inferences of relevance.

We could even use the insights to make content appealing with emotional empathy.

Based on a person’s motivation to solve a problem (search intent), we might better address their state of mind through the tone of our content, informed by activated features related to emotion or sentiment:

Activated features related to human emotions.

One caveat is that drawing out insights from features for SEO work would need to take place at scale.

As the researchers noted in the paper:

“Often the top-activating features on a prompt are related to syntax, punctuation, specific words, or other details of the prompt unrelated to the concept of interest. In such cases, we found it useful to select for features using sets of prompts, filtering for features active for all the prompts in the set.”
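
A hypothetical sketch of that filtering step: collect the feature IDs that activate for each prompt in a themed set, then keep only the ones that fire for every prompt. (The prompts and feature IDs below are made up.)

```python
# Hypothetical active-feature IDs for three related prompts.
active_features_per_prompt = [
    {101, 2048, 3307, 9912},   # prompt 1: "best time to visit San Francisco"
    {101, 77, 3307, 5120},     # prompt 2: "fog season in the Bay Area"
    {101, 3307, 640, 8881},    # prompt 3: "Golden Gate Bridge weather"
]

# Intersecting the sets filters out syntax/punctuation features that fire
# on only one prompt, leaving features tied to the shared concept.
shared = set.intersection(*active_features_per_prompt)
print(shared)   # {101, 3307}
```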

On another note, I’ve focused on possible applications of features for SEO strategies, namely around content research, site architecture, and buyer’s journeys.

More generally, seeing these features activated at a token level and embedded in dictionary vector spaces can also help us understand how AI-driven search engine systems might interpret natural language or semantic relationships between concepts.

The rest of the Anthropic paper focuses largely on safety, like related to model bias and security.

There is also a section about other approaches to identifying meaningful directions in model activation spaces.

One such approach involves linear probes, a technique for understanding internal representations in neural networks (particularly for NLP applications) that involves training a linear classifier to make predictions on top of the activations in a neural network layer.

However, the researchers note the advantages of the dictionary learning method are that it’s a one-time cost that produces millions of features, and because it’s an unsupervised method, it can uncover abstract concepts or associations that may not have been predicted in advance.

That latter advantage gets to some of why I think features are an intriguing topic for SEO, in general.

We can use feature neighborhoods as one more source of information for building more semantic SEO strategies that focus on related concepts over individual keywords.

Of course, this is just a tentative idea in light of research that’s in its early stages. But keep an eye out! 😉

Till next time …

Thanks for checking out this week’s Hamsterdam Research article!

It’s a big topic, so I hope we’ve done it some justice.

Feel free to comment or contact me with your feedback.

For next steps, I’d suggest checking out the summary articles or X (Twitter) threads linked earlier in this post.

Stay tuned for a new research article next week, or check out related Hamsterdam Research posts below.

Until next time, enjoy the vibes (as promised, some Adam Jones guitar):

Thanks for reading. Happy optimizing! 🙂




Need a hand with a brand audit or marketing strategy?

I’m an independent brand strategist and marketing consultant. Learn about my services or contact me for more information!
