USER-LLM: Contextualizing LLMs with User Embeddings for Enhanced Personalization (via Google Research), & Why SEOs Should Care (Maybe)
By Ethan Lazuk

Welcome to another week of Hamsterdam Research, where we look at recent AI research to explain just what the heck it’s talking about and explore possible implications for the future of search and SEO.
Are you a fan of watching movies, reviewing restaurants, or clicking around websites?
Then you’re in luck!
Because this week is all about user interaction data, vector embeddings, and personalizing LLMs.
We’ll be looking at a Google Research paper called “USER-LLM: Efficient LLM Contextualization with User Embeddings.”
Why is this relevant to SEO?
Well, there’s always been speculation about how Google might employ user interaction data for ranking its search results.
We also saw in Google’s antitrust trial debrief that such data is purportedly less needed now, or “in the midst of a dramatic change,” as they put it, given how LLMs are “predicting the usefulness of documents”:
“And the testimony of Google engineers confirmed that Google’s use of user interaction data has always been but one of many inputs into Google’s systems, has decreased over time, and that the reliance on that data is in the midst of a dramatic change given the rise of large language models that employ fundamentally different techniques to solve the same task, i.e., predicting the usefulness of documents given the query.”
– Google Post-Trial Debrief (Excerpt from my post)
But getting beyond rankings, there’s another aspect of search engines that impacts visibility a lot, but doesn’t get nearly the same attention.
It’s called personalization.
User interaction data is noisy and complex, but what if Google could get beyond the noisiness of clicks, hovers, views, etc. by encoding the data as dense vector embeddings for improved personalization?
These “user embeddings” could be used to dynamically contextualize language-based services and apps with a user’s preferences, maybe even including search.
Well, Google Research is working on a framework called USER-LLM that could do just that.
Instead of inputting user interaction data directly into prompts, this novel framework uses dense embeddings to contextualize LLMs through cross-attention and soft-prompting.
It’s pretty fascinating stuff that helps large language models understand their users’ behavior in much deeper ways.
I also think the general ideas in the paper are helpful for advancing modern semantic SEO practices.
We’ll explore the research in detail throughout this article.
First, here’s a little context …
Quick story on selecting this topic, and why it matters to SEO (maybe)
Google Research’s USER-LLM research paper was published on February 21st, 2024.
Its contributors include Lin Ning, Luyang Liu, Jiaxing Wu, Neo Wu, Devora Berlowitz, Sushant Prakash, Bradley Green, Shawn O’Banion, and Jun Xie.
The topic came to my attention this week, though, when Google AI shared a new blog post about it on X (Twitter):
Why is this topic of interest to SEOs?
The short answer is that frameworks like USER-LLM could lead to more personalized search results (hypothetically) and AI assistant responses (likely).
The longer answer is that user embeddings are part of a larger body of knowledge about vector embeddings that’s helpful for SEO professionals to understand.
In my opinion, vector embeddings and knowledge graphs are more important than keyword research for today’s SEO strategies.
Vector embeddings get to the heart of targeting concepts over keywords, for example.
This helps with creating more people-first content based on search journeys (addressing the deeper context behind user intent) rather than targeting the literal meanings of keywords (which can lead to search engine-first content).
Not all vector embeddings are created equal, though.
I was watching a video the other night about LLMs where an engineer spoke about word embeddings using Word2Vec.
Word2Vec is a computationally efficient architecture for creating word embeddings, but it also produces static representations that don’t capture the nuances of word meanings in different contexts.
For example, the word “bank” would be the same vector whether it refers to a financial institution or the edge of a river.
On the other hand, embeddings from a transformer-based model like BERT are more dynamic and contextually rich. This means they can capture deeper and more nuanced representations of semantic meaning.
“Bank” would be placed into a different vector space, for example, depending on the context of its surrounding words.
The benefit of transformer-based architectures is that they use the attention mechanism, which allows for dense embeddings that incorporate more context.
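To make the static-vs-contextual distinction concrete, here’s a toy NumPy sketch (my own illustration, not from any paper: the vectors are random and the “attention” is a single simplified mixing step). A static lookup returns the same “bank” vector in every sentence, while the context-mixing version lands “bank” in different spots depending on its neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy static "Word2Vec-style" lookup table: one fixed vector per word.
vocab = ["bank", "river", "money", "deposit", "edge"]
static = {w: rng.normal(size=4) for w in vocab}

def static_embed(word, context):
    # The context is ignored entirely, so the vector never changes.
    return static[word]

def contextual_embed(word, context):
    # Toy stand-in for attention: mix the word's vector with its
    # neighbors', so the same word lands in a different spot per context.
    vectors = np.stack([static[w] for w in [word] + context])
    scores = vectors @ static[word]                   # similarity to the target word
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax weights
    return weights @ vectors                          # context-weighted mixture

bank_finance = contextual_embed("bank", ["money", "deposit"])
bank_river = contextual_embed("bank", ["river", "edge"])
```

Real transformers do this mixing across many attention heads and layers, but the core idea is the same: the output vector for a word depends on the words around it.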
Transformers have revolutionized natural language processing (NLP), and by extension search. They’ve also led to modern LLMs.
If you’d like to learn more on this topic, there’s a great video about attention in transformers. I also included a few screenshots from it in my recent Hamsterdam History post about how Google handles synonyms in 2010 vs. today, like this one:

The above shows a high-dimensional vector space with word embeddings, where their locations denote semantic relationships.
Embeddings can represent more than individual tokens (words), though. They can describe passages, documents, entities, and much more.
It helps me to conceptualize an embedding as starting out in one place and then relocating as its vector gets updated with context from surrounding words or features (attention).
Here’s another screenshot, where the vector embedding of “king” changes as context from a Shakespearean play gets added:

I say this encourages concepts over keywords because it’s what makes it possible for us to go to Google Search, type a query like, “which Shakespearean character murdered their predecessor and lived in Scotland,” and get the right answer:

Speaking of Shakespeare, one of my favorite articles (because it introduced me to vectors) was published in 2020 by Bill Slawski (RIP): “Author Vectors: Google Knows Who Wrote Which Articles.” It mentions Shakespeare in the context of how author vectors could be created from writing styles. (Full patent here.)
Mike King also referenced embeddings in the context of authors in his recent blog post about the Google API Content Warehouse.
The same leak led to discussions of “site2vec,” for example, as pointed out in Andrew Ansley’s summary.
This is a reminder that vectors can be for websites, like “website representation vectors” (also from a 2020 Google patent), where features like URLs, titles, and HTML structures could be used to classify websites based on the theme or quality of their content or other attributes. (Full patent here.)
So, we see how Google might use vectors for words, authors, or entire websites.
The USER-LLM framework introduces us to “user embeddings,” or dense vectors of interaction data used to contextualize LLMs “to improve personalization services.”
An interesting point about those patents mentioned above is they were granted in 2020 but appear to have been filed in 2018. That’s the same year BERT was released. It’s also shortly after the transformer architecture was introduced in 2017 in the seminal paper, “Attention is All You Need.”
USER-LLM likewise uses a transformer-based architecture, and it focuses on multimodal inputs.
The big news coming out of Google I/O 2024 was the rollout of AI Overviews (formerly SGE). However, I imagine the future of search will look more like the AI-organized results pages that were announced.
These SERPs are designed for users to explore the full context of a topic, rather than showing results about a specific query. This is likely enabled by the “multi-step reasoning capabilities” of the multimodal Gemini model used in Search.
User embeddings from user interaction data likewise could be used to further personalize Search experiences, just as the USER-LLM research shows they can for LLMs.
The implications of more personalized search for us SEOs would be that users see even more unique SERPs that align with their preferences.
Keyword rankings would thus become even less meaningful, because results for a query would be based on predicted relevance for a user’s location, interests, device type, and so on.
Consequently, targeting search journeys depends more on understanding personas and underlying concepts of search intent.
In a way, we face this reality already with AI Overviews and web citations in AI summaries, like in Copilot, Gemini, or Perplexity. User embeddings would be just another step in that direction, part of Google’s long history of personalization efforts.
But those are just my initial thoughts.
Let’s also see what ChatGPT and Gemini think about USER-LLM’s implications for the future of search.
ChatGPT said: “This research paper on “User-LLM: Efficient LLM Contextualization with User Embeddings” is significant for SEOs because it outlines a framework that enhances personalization by leveraging user embeddings to capture and contextualize user preferences and behaviors. This leads to more relevant search results and deeper user understanding, which can improve SEO strategies by aligning content with user intents and optimizing for specific keywords. Additionally, the paper’s approach to efficiently handle complex and noisy user data, as well as its adaptability and performance gains in long sequence processing, can help SEOs create more targeted, effective, and scalable strategies. The integration of multimodal data further supports the optimization of diverse content formats, essential for modern SEO practices.”
Gemini said: “The research paper “User-LLM” is of significant interest to SEOs as it proposes a novel way for language models like Google’s to better understand and personalize search results. By utilizing “user embeddings,” which are compact representations of user behavior and preferences, Google could potentially deliver more relevant results, especially for long-tail keywords and multimodal searches. This could impact SEO strategies in several ways, from optimizing for local SEO to understanding long-term user trends. If implemented, the findings could give Google a competitive edge, prompting SEOs to adapt their strategies to these potential algorithmic changes. However, SEOs should also be mindful of the potential privacy concerns raised by this technology.”
We’ll explore the SEO implications of USER-LLM in more detail at the end, after reading the full paper. 🙂
Now without further ado, let’s dig into the research!
We’ll begin with the paper’s abstract 😉
If you wish to read about USER-LLM from Google, their blog post has a nice summary.
For this analysis, though, we’ll concentrate on the full research paper. Go big or go home!
Let’s start with a breakdown of the abstract.

Here is the text version of the abstract:
“Large language models (LLMs) have revolutionized natural language processing. However, effectively incorporating complex and potentially noisy user interaction data remains a challenge. To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs. These embeddings, distilled from diverse user interactions using self-supervised pretraining, capture latent user preferences and their evolution over time. We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context. Our comprehensive experiments on MovieLens, Amazon Review, and Google Local Review datasets demonstrate significant performance gains across various tasks. Notably, our approach outperforms text-prompt-based contextualization on long sequence tasks and tasks that require deep user understanding while being computationally efficient. We further incorporate Perceiver layers to streamline the integration between user encoders and LLMs, reducing computational demands.” [Highlights added to all quotes.]
If that sounds a bit daunting, never fear!
We’ll explore the main vocabulary to establish context for the rest of our analysis.
Note: I’ll be using Gemini Advanced and Gemini 1.5 Flash, grounded with the research paper’s contents, to help with interpreting some of the vocabulary and concepts. All quotes will be attributed to the Google Research paper, unless otherwise noted.
The researchers mention that large language models have revolutionized natural language processing. Large language models (LLMs) are a type of AI model based on neural networks (specifically an architecture called transformers) that are trained on massive amounts of text data.
These models (LLMs) learn patterns, relationships, and representations of human language from the training data, making them useful for natural language processing (NLP), specifically tasks like text generation, question answering, translation, and summarization.
LLMs can also be multimodal and handle different input modalities besides text (like images, video, audio recordings, etc.). This is often done within a single framework using techniques like cross-attention and soft prompts. (More on these in a moment.)
User interaction data is the raw information collected about how users interact with digital systems, including actions such as text inputs (like prompts or queries), clicks, views, purchases, ratings, or location data. The researchers mention that incorporating this data with LLMs is a challenge because it’s complex and noisy.
To incorporate user interaction data with LLMs would be to make it part of the model’s training or operation. What makes this data complex is that it comes in many forms from various sources, such as what a user watches, what they buy, where they go, or how they interact with websites or apps. User interaction data is also noisy because it’s not always consistent or reliable: people’s behavior changes over time, and collection methods can be prone to errors.
The researchers’ solution is USER-LLM, a novel (or unique) framework that leverages user embeddings to contextualize the LLM with user interaction data.
By contextualizing the LLM, the researchers mean providing additional information about a user to tailor the model’s responses or predictions. For example, if you’ve ever given ChatGPT custom instructions, it might have answered prompts or behaved differently from that. Now imagine that happening organically and dynamically based on your behavior.
The user embeddings themselves are dense vector embeddings (representations) of an individual user’s preferences and behavior. Since LLMs can more easily process information when it’s encoded as vector embeddings, this format is beneficial for contextualizing the model so it generates more personalized and relevant outputs.
The researchers next explain how user embeddings are distilled from diverse interactions using self-supervised pretraining to capture latent user preferences and their evolution over time.
Self-supervised pretraining is a machine learning (ML) technique where the model learns from the data itself without needing explicit labels. The USER-LLM framework has a user encoder that processes the raw user interaction data, learning to predict certain aspects of interactions (like the next item a user might click or their favorite genre) based on the large dataset it was trained on, and then distills those into user embeddings, integrating with the LLM.
User embeddings capture latent user preferences, which means underlying or hidden preferences that were not directly expressed. For example, a user might have a history of watching action movies but not explicitly state that’s their preference when asking for a recommendation, yet the model could still return those types of suggestions after identifying the pattern in the user’s interaction data.
These user preferences can also evolve over time, so the USER-LLM framework is designed to capture those changes and continuously update its embeddings as new interaction data becomes available. This allows the model to learn and dynamically adapt how it personalizes recommendations or responses based on the user’s context.
Getting to an important part about how the LLM uses the embeddings, the researchers mention they integrated them through cross-attention and soft-prompting.
Cross-attention is a mechanism that lets the LLM directly access and consider the user embeddings while it processes text. In other words, rather than focusing on the text and embeddings separately, the model can give attention to both simultaneously, dynamically adjusting its understanding and text generation based on the user’s context.
Soft-prompting is where the user embeddings are added as a prefix to the text input of the LLM, which subtly guides its response with information (hints or suggestions) about a user’s interests and behaviors. It’s a more subtle and indirect mechanism than cross-attention, since the LLM doesn’t directly attend to the user embeddings during text processing.
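Here’s a toy NumPy sketch of the two integration mechanisms (my own simplified illustration, not the paper’s implementation: real models use learned projections and multiple attention heads). Soft-prompting just prepends the user embeddings as extra “tokens,” while cross-attention lets each text position pull in user context directly:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                     # embedding width (toy size)
text_tokens = rng.normal(size=(5, d))     # intermediate text representations
user_embs = rng.normal(size=(3, d))       # user embeddings from the encoder

def soft_prompt(text, user):
    # Soft-prompting: user embeddings are prepended as extra "tokens"
    # ahead of the text input.
    return np.concatenate([user, text], axis=0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(text, user):
    # Cross-attention: each text position (query) attends directly to
    # the user embeddings (keys/values) and folds them back in.
    weights = softmax(text @ user.T / np.sqrt(d))
    return text + weights @ user          # residual update with user context

prompted = soft_prompt(text_tokens, user_embs)   # 3 user tokens + 5 text tokens
attended = cross_attend(text_tokens, user_embs)  # text length unchanged
```

Notice the difference in shape: soft-prompting lengthens the input sequence, while cross-attention enriches the existing text positions without adding tokens.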
To test the effectiveness of USER-LLM, the researchers used three datasets. MovieLens is a dataset of 20 million movie ratings used for evaluating recommendation systems. Amazon Review is a collection of nearly 35 million product reviews, though the researchers focused on the movie and TV reviews. The last was Google Local Review, which includes reviews and ratings for places on Google Maps, though the researchers focused on New York locations.
To gauge its performance, USER-LLM was compared to text-prompt-based contextualization, a method where raw text from a user’s interaction history is fed directly into the LLM as a prompt. This is how LLM responses are most commonly personalized, but it’s computationally inefficient and less effective for long sequences of user interactions or for tasks that require a deeper understanding of a user’s preferences.
Finally, the researchers mention that they improved computational efficiency by incorporating Perceiver layers to streamline the integration between the user encoders and LLMs.
A Perceiver is a type of transformer architecture designed to handle different types of input data. In USER-LLM, the Perceiver layers compressed the user embeddings into a compact format. To do this, they used a trainable latent query (like a filter) to extract the most relevant information from the embeddings. This reduced the number of tokens needed to represent a user’s history, thus streamlining the integration and reducing computational demands.
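Here’s a toy NumPy sketch of the latent-query idea (my own simplified illustration, with random stand-ins for trained weights): a small, fixed set of queries attends over a long event history and boils it down to a handful of summary vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
events = rng.normal(size=(50, d))     # one embedding per user event (long history)
queries = rng.normal(size=(4, d))     # trainable latent queries (random here)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def perceiver_compress(events, queries):
    # Each latent query attends over the full event history and pulls
    # out the information most relevant to it.
    weights = softmax(queries @ events.T / np.sqrt(d))
    return weights @ events           # 50 event vectors -> 4 summary vectors

compressed = perceiver_compress(events, queries)
```

The payoff is that the LLM only has to attend to 4 summary vectors instead of 50 event embeddings, no matter how long the user’s history grows.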
To see the USER-LLM framework in action, here’s Figure 2 from the paper:

We’ll revisit this diagram in the next section and break it down step by step.
That takes us through the paper’s abstract.
Feel free to grab a snack before we hit the depths of the main sections.
Now for a deep dive into the research!
The full paper, “USER-LLM: Efficient LLM Contextualization with User Embeddings,” is available from arXiv if you wish to follow along.
I’ll be referencing the PDF, but there’s also an HTML version that’s more mobile-friendly.

The paper has five sections, including an introduction, related work, USER-LLM framework, experiments, and a conclusion.
We’ll summarize each below.
1. The introduction
The researchers start by mentioning how LLMs have revolutionized the field of natural language processing, but beyond that, given “their ability to learn and adapt from massive amounts of textual data, LLMs offer significant opportunities for user modeling and personalization.”
By “analyzing user interactions and understanding user preferences,” they explain, “LLMs can be leveraged to power recommendations, language generation, summarization, and question answering in ways that are highly relevant and engaging to users.”
These improvements could pave the way for “creating more personalized and context-aware language-based applications and services.”
Being “context aware” means an LLM could understand and adapt to the specific context of a user’s interaction, like fitting its response to their preferences, history, or current situation.
The mention of “language-based applications and services” is also interesting. These could include AI chatbots and virtual assistants, translation tools, or content creation tools, but also potentially search engines.
I wasn’t sure on the last point, so I asked Gemini a few times, but it reaffirmed my suspicions:
“Yes, the language-based services and applications mentioned in this context could include search engines. The authors mention that LLMs can be used to power recommendations, language generation, summarization, and question answering. These are all tasks that search engines can perform. For example, a search engine could use an LLM to generate personalized recommendations for websites or products based on a user’s search history. It could also use an LLM to summarize web pages or answer questions posed by users in natural language.“
– Gemini’s response
The researchers next provide examples of user interactions, which “represent a rich source of behavioral data generated from a user’s engagement with digital systems” and “hold valuable insights for user modeling.”
They span a “wide range from textual input, search queries, media consumption (e.g., videos watched or rated), to social media activities, navigation patterns, location visits, and more.”
A lot of these activities seem relevant to search engines.
Next, the researchers explain the context of their work.
They expound on several themes from the abstract, including why fine-tuning LLMs with user interaction data in prompts directly (“text-prompt-based contextualization”) won’t unlock “the full potential of LLMs in user modeling and personalization”:
“One straightforward approach to leveraging this data with LLMs is to directly finetune LLMs on textual components, using a user’s interaction history as the text prompt. However, user interaction data is often complex, spanning multiple journeys with sparse data points, various interaction types (multimodal), and potential noise or inconsistencies. This complexity can hinder an LLM’s ability to identify and focus on the most relevant patterns. Moreover, effective personalization often requires a deep understanding of the context and latent intent behind user actions, which can pose difficulties for LLMs trained predominantly on vast, surface-level language corpora. Additionally, user interaction data, such as extended histories, can be very lengthy. Processing and modeling such long sequences (e.g., a year’s worth of history) with LLMs can strain computational resources, making it practically infeasible.”
To get around these issues, the researchers propose USER-LLM, which is centered around user embeddings.
The difference between the embeddings and text prompting is represented in Figure 1 in the paper:

As “compressed representations, distilled from diverse and noisy user interactions,” user embeddings can “effectively capture the essence of a user’s behavioral patterns and preferences across various interaction modalities,” the researchers explain.
Their goals are to contextualize the LLM with user embeddings during fine-tuning or inference to:
- “Enhance its ability to identify relevant patterns,” despite the complexity and noisiness of the data.
- “Facilitate understanding and adaptation to the latent intent, dynamic context, and temporal evolution behind user interactions.”
- Reduce the computational demands from “processing extensive interaction histories by working with condensed representations.”
In short, their framework enables the LLM “to tailor responses and generate personalized outcomes” based on a “deeper understanding of users’ historical patterns and latent intent.”
There are two key phases in the USER-LLM approach.
“In phase one, we pretrain a Transformer-based encoder on user interaction data, utilizing self-supervised learning to capture behavioral patterns across multiple interaction modalities. We use a multi-feature autoregressive Transformer to generate embeddings that capture long-range dependencies and contextual relationships within sequential data while handling multimodal user data effectively.“
To break that down a bit, transformers are particularly effective at modeling sequential data, which often means words in a sentence, but in this case refers to a user’s actions over time.
The transformer-based encoder can capture “contextual relationships” (time of day, device used, etc.) between data elements, even if they’re far apart (“long-range dependencies”) or in different modalities (text, clicks, ratings, watch history, location visits, etc.).
The “multi-feature autoregressive Transformer” refers to a specific type of model that can handle different types of user interaction data (“multi-feature”) and can predict the next item in a sequence based on previous items (“autoregressive”).
“In phase two, we integrate user embeddings with an LLM during finetuning using cross-attention, where the LLM’s intermediate text representations attend to the output embeddings from the pretrained user encoder, enabling dynamic context injection (similar to Flamingo Alayrac et al. (2022)).”
In other words, once the embeddings are created by the autoregressive encoder (phase one), the LLM uses cross-attention to “attend to” (focus on) relevant parts of the embeddings for context during an “intermediate text representation” (one of multiple stages of processing), so the LLM can tailor its response or prediction dynamically based on a user’s preferences or behaviors.
The researchers mention their experiments across “three public datasets and various tasks,” which showed how “the embedding-based approach successfully mitigates the limitations associated with directly finetuning LLMs on raw textual user interaction data.”
They then mention “several application domains,” where this framework could enhance performance, including:
- User understanding: More accurate predictions of user preferences and better personalization of responses.
- Personalized recommendations: Generating recommendations for products, services, or content tailored to an individual user based on a deeper understanding of user preferences and interests.
- Text generation: Text that is more coherent, relevant, and engaging in alignment with a user’s style and preferences, like for emails, product descriptions, or creative writing.
In short, wherever there’s an LLM in the mix, personalization could improve. My mind immediately jumps to Google Discover recommendations, AI Overview presentations, and organic shopping results of all kinds.
That covers the introduction.
Now we’ll review related work, then get into the nitty-gritty of how USER-LLM works.
2. Related work
The researchers mention four categories of related work, which we’ll summarize below.
2.1. User modeling
Dual encoders and self-supervised learning are common methods in user modeling and recommendation systems.
Dual encoders is a type of model architecture that consists of two separate neural networks, one for encoding user information and another for encoding item information (like movies or products). The encoders represent both in a shared embedding space to make recommendations or predictions based on similarities without needing explicit labels (self-supervised learning).
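To make that concrete, here’s a toy NumPy sketch of a dual encoder (my own illustration: each “tower” is a single random linear layer standing in for a trained neural network). The key idea is that both towers project into one shared space, where a dot product measures user-item affinity:

```python
import numpy as np

rng = np.random.default_rng(3)
d_user, d_item, d_shared = 6, 5, 4
W_user = rng.normal(size=(d_user, d_shared))   # user tower (one linear layer here)
W_item = rng.normal(size=(d_item, d_shared))   # item tower

def score(user_feats, item_feats):
    # Both towers project into the same shared embedding space;
    # similarity there is a simple dot product.
    return (user_feats @ W_user) @ (item_feats @ W_item)

user = rng.normal(size=d_user)                  # features for one user
items = rng.normal(size=(3, d_item))            # features for three candidate items
scores = [score(user, it) for it in items]
best = int(np.argmax(scores))                   # index of the recommended item
```

In a real system the towers are trained (often self-supervised, from interaction logs) so that users end up near the items they actually engaged with.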
Other self-supervised learning approaches include contrastive learning, which involves comparing positive and negative examples, leveraging the inherent structure to learn meaningful representations from noisy datasets.
Another is graph representation learning, where user interaction data is represented as nodes (users and items) and edges (clicks, purchases, ratings, etc.) in a graph. The model then learns to represent nodes as vector embeddings.
Pre-trained BERT models have also been applied to sequential recommendation systems. Examples include BERT4Rec and U-BERT, which have been leveraged for pre-training (learning underlying structures from large datasets) and fine-tuning (on smaller, task-specific datasets).
The authors express how their work builds on these previous approaches.
2.2. Language model-based personalization
The researchers note that while there has been “extensive work on different ways to format user historical interactions as prompts and to leverage LLMs to generate recommendations … directly incorporating a user’s entire history into an LLM’s context window is often impractical or overly expensive.”
A previous study used user embeddings for this purpose, but rather than derive embeddings from the actual text of a user’s activity, their framework uses “user activity tokens,” or compact and abstract representations for efficiency. They also use more advanced methods for combining different types of user interaction data into embeddings (“sophisticated fusion techniques”).
2.3. Long context in LLMs
Long context (processing more text at once) is needed for incorporating “long-term user data and behavior patterns.”
The researchers explain past work on enabling long context for LLMs — such as extending the context window by remapping positional encodings, distilling or compressing information, or modifying attention computation — and then explain how their method uses “a single token to represent each user event.”
This reduces the amount of information the LLM needs to process. These extracted representations also “align with the LLM text embedding space,” making them compatible with existing LLM-based techniques and applications.
2.4. Multimodal LLMs
The researchers explain the development of multimodal LLMs.
Early models, like CLIP (OpenAI) or ImageBind (Meta AI), aligned image and text representations, like connecting images with corresponding text descriptions or learning a joint embedding space.
The next generation of models, like Flamingo (DeepMind) and CoCa (Google Research), used fusion techniques, including cross-attention (dynamically weighing the importance of different parts of the input based on context) and soft-prompting (adding pieces of information to the input to guide the model’s output).
More recent models, like NExT-GPT, OneLLM, and AnyMAL, have explored unified frameworks (a single model) that can handle a wider variety of input modalities.
Then there are natively multimodal models, like Gato (DeepMind) and Gemini, which are trained end-to-end, meaning they learn to perform multiple tasks simultaneously.
USER-LLM focuses on LLM contextualization using multimodal user interaction data, inspired by these prior advances.
Let’s now get into the USER-LLM framework in more detail.
3. The USER-LLM framework
As mentioned, USER-LLM is a two-stage approach.
First, a pre-trained user encoder generates the embeddings, then the embeddings are integrated with the LLM through cross-attention and soft-prompting, giving it additional context and guidance for personalized responses.
The user embeddings are generated from “ID-based feature sequences.” Since user interaction data can span different modalities, each modality is mapped to integer IDs and given its own distinct embedding representation. These are then fused into a single embedding.
It’s the “sequence of fused embeddings” that then “serves as input to the user encoder for generating user embeddings.”
An autoregressive transformer is used as the user encoder.
As you can see in the section of Figure 2 below, modalities might include an item’s name, rating, or category.

The model concatenates them, processes them through the transformer decoder, and then projects the output back to the original feature spaces, while softmax layers calculate probabilities for the next item in the sequence.
A cross-entropy loss function measures the difference between the model’s predictions and the actual next items, allowing the autoregressive design to learn through self-supervised training.
The autoregressive transformer generates an embedding for each input item, capturing the user’s preferences and behaviors.
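The phase-one pipeline can be sketched in toy NumPy form (my own simplified illustration: random embedding tables stand in for learned ones, summation stands in for the paper’s fusion step, and a simple prefix average stands in for the transformer decoder). The flow is the same, though: fuse per-modality embeddings into event embeddings, predict the next item, and score the prediction with cross-entropy:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_items, n_ratings = 8, 10, 5

# Each modality (item ID, rating) maps to integer IDs with its own table.
item_table = rng.normal(size=(n_items, d))
rating_table = rng.normal(size=(n_ratings, d))

def fuse(item_id, rating_id):
    # Fuse the per-modality embeddings into one event embedding
    # (a sum here; concatenation is another common option).
    return item_table[item_id] + rating_table[rating_id]

events = [(1, 4), (3, 2), (7, 4)]               # a user's (item, rating) history
fused = np.stack([fuse(i, r) for i, r in events])

def predict_next(prefix):
    # Stand-in for the transformer decoder: summarize the prefix and
    # score every item as the possible next event (softmax over items).
    state = prefix.mean(axis=0)
    logits = item_table @ state
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

probs = predict_next(fused[:2])                 # predict event 3 from events 1-2
loss = -np.log(probs[events[2][0]])             # cross-entropy vs. the real next item
```

Minimizing that loss over millions of user sequences is what teaches the encoder to produce embeddings that actually capture behavioral patterns, with no human labels required (self-supervised training).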

USER-LLM integrates the user embeddings with the LLM through cross-attention, where the user embeddings are combined with the intermediate text representation within the LLM.
The framework is flexible, however, and can also use soft-prompting as an integration method, where user embeddings are added as a prefix to the text input of the LLM.
Pre-trained weights (parameters), condensed representations (dense embeddings and single tokens for events), and cross-attention contribute to the framework’s efficiency.
Furthermore, the Perceiver layer enhances efficiency by compressing the user embeddings into a more compact format.
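Here’s a toy sketch of those two integration pieces: a Perceiver-style cross-attention step that compresses a long event history into a few latent vectors, and soft-prompting as a simple prefix concatenation. The dimensions and single-head attention are illustrative assumptions, not the paper’s exact architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16  # shared embedding dimension (illustrative)

def perceiver_compress(user_embs, latents):
    """Perceiver-style compression: a small set of learned latent vectors
    cross-attends to the (longer) sequence of user embeddings, producing
    a fixed-size summary regardless of history length."""
    scores = latents @ user_embs.T / np.sqrt(DIM)        # (n_latents, n_events)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over events
    return weights @ user_embs                           # (n_latents, DIM)

def soft_prompt(user_embs, token_embs):
    """Soft-prompting: prepend the user embeddings to the LLM's input
    sequence, so the text tokens can attend to the user context."""
    return np.concatenate([user_embs, token_embs], axis=0)

user_embs = rng.normal(size=(20, DIM))   # 20 interaction events
latents = rng.normal(size=(4, DIM))      # 4 learned latent vectors (parameters)
compressed = perceiver_compress(user_embs, latents)           # (4, DIM)
prompt = soft_prompt(compressed, rng.normal(size=(10, DIM)))  # (14, DIM)
```

Note the efficiency angle: the LLM now sees 4 user-context vectors instead of 20, and that stays constant however long the history grows.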
The performance of USER-LLM can be optimized for different tasks using various fine-tuning strategies:
- Full: Fine-tunes all parts of the model, offering the highest level of personalization, but it also requires the most computational resources and carries a risk of overfitting.
- Enc: Only the encoder and projection layers are fine-tuned, while the LLM is frozen (unchanged).
- LoRA: Low-Rank Adaptation reduces the number of trainable parameters by training small low-rank matrices injected into the LLM’s layers while the original weights stay frozen.
- Proj: Fine-tunes only the projection layers, keeping the LLM and encoder frozen, which is the most lightweight approach.
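A rough sketch of what these strategies mean in terms of which parameters receive gradient updates. The parameter names and exact groupings (for example, whether LoRA training also updates the projection layers) are my assumptions for illustration, not details from the paper:

```python
# Hypothetical parameter groups for the three model components
PARAMS = {
    "encoder": ["enc.embed", "enc.attn", "enc.ffn"],
    "projection": ["proj.weight", "proj.bias"],
    "llm": ["llm.embed", "llm.block0", "llm.lora_A", "llm.lora_B"],
}

def trainable_params(strategy):
    """Return which parameter names are updated under each strategy."""
    if strategy == "full":     # everything trains: most capacity, most cost
        return PARAMS["encoder"] + PARAMS["projection"] + PARAMS["llm"]
    if strategy == "enc":      # encoder + projection train; LLM frozen
        return PARAMS["encoder"] + PARAMS["projection"]
    if strategy == "lora":     # only low-rank adapter matrices in the LLM
        return [p for p in PARAMS["llm"] if "lora" in p]
    if strategy == "proj":     # lightest: only the projection layers
        return PARAMS["projection"]
    raise ValueError(f"unknown strategy: {strategy}")
```

In a framework like PyTorch, “frozen” would simply mean setting `requires_grad = False` on every parameter not in the returned list.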
I asked Gemini to explain Figure 2, as well.

Here was its added analysis.
Left side, showing the pretraining of the user encoder:
- User timeline represents a timeline of events, where each event includes information like an item’s name, rating, or category.
- Autoregressive encoder is trained to predict future interactions based on past ones, taking the sequence of user activities as input.
- Cross-entropy loss is the loss function that measures the difference between the predictions and the actual next items in a user’s history.
Right side, showing the LLM contextualization:
- Autoregressive user encoder, once pre-trained, processes the user’s interaction history and generates user embeddings.
- The embeddings pass through the Perceiver layer, which compresses them.
- LLM decoder uses cross-attention to integrate the embeddings and dynamically adapt its responses.
- The output is a set of personalized responses to various queries, like a user’s favorite food category.
Now we’ll summarize the experiments and conclusion.
4. Experiments
I’m not going to go into too much detail here, but I will explain the experiments at a high level.
The researchers used the three datasets we mentioned earlier (MovieLens 20M, the Google Local Review dataset, and the Amazon Review dataset) to assess model performance on three types of tasks:
- Next item prediction: The model predicts the next movie a user watches based on a historical sequence.
- Favorite genre or category prediction: The model predicts a user’s favorite genre or category based on a sequence of items.
- Multimodal review generation: The model generates reviews from multimodal input features.
Their baselines included Dual Encoder (DualEnc), BERT4Rec, and TextPrompt (TP).
There are a number of tables showing results for different types of tasks.
Here’s Figure 3, showing how USER-LLM compared to TextPrompt for next-item prediction on the MovieLens dataset:

As you can see, USER-LLM achieved higher recall@10 (the proportion of times the correct item appeared within the model’s top 10 predictions) across the range of sequence lengths.
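For reference, recall@k is straightforward to compute. This sketch assumes the model returns a ranked list of item IDs per test example (the sample data is made up):

```python
def recall_at_k(ranked_ids, true_id, k=10):
    """1.0 if the true next item appears in the model's top-k predictions."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(predictions, truths, k=10):
    """Average recall@k over a set of test examples."""
    return sum(recall_at_k(r, t, k) for r, t in zip(predictions, truths)) / len(truths)

# Example: the true item is in the top 10 for the first case only
preds = [[5, 9, 1, 4, 7, 2, 8, 3, 6, 0], [11, 12, 13]]
assert mean_recall_at_k(preds, [4, 99], k=10) == 0.5
```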
For our purposes, it helps to understand that user embeddings outperformed text inputs in various performance metrics, like accuracy, depth of understanding, and efficiency.
5. Conclusion and future work
The researchers conclude starting with the following:
“In this paper, we introduced User-LLM, a framework for contextualizing LLMs through user embeddings. These embeddings, derived from self-supervised pretraining on diverse user interactions, capture hidden user preferences and their evolution. By integrating these embeddings with LLMs through cross-attention and soft-prompt, User-LLM empowers LLMs to adjust dynamically to user contexts.”
After doing this analysis, we can understand what all that means, and that’s awesome!
Given the “competitive performance” of USER-LLM as well as its “computational efficiency and ability to preserve LLM knowledge further,” it’s a “highly suitable approach for real-world user understanding applications.”
In terms of future opportunities, the researchers suggest advanced pre-training techniques to improve the quality of the user embeddings, such as more sophisticated self-supervised tasks or incorporating additional data sources.
They also suggest investigating the alignment between user embeddings and the language model space, meaning exploring how relevant the information captured in the embeddings is to the LLM’s understanding of language and context.
Lastly, they suggest training USER-LLM on a wider range of tasks to improve its generalization abilities, learning more types of user behavior and preferences.
So why should SEOs care about USER-LLM (maybe)?
I’m not sure if SEOs should care about USER-LLM itself as much as what it sets out to accomplish — creating vector embeddings of user interaction data and using those to further personalize LLM-based services or applications.
I personally feel studying vector embeddings is more important than keyword research for understanding the underlying concepts in user journeys.
What I didn’t think about before now was also the potential to use user embeddings to model buyer’s journeys. If LLMs could use this information to personalize responses or recommendations, couldn’t we apply the same techniques to understand how a buyer’s journey might evolve, based on a given persona?
And of course, there’s the personalization of search, which could easily be among the eligible LLM-based services or applications that user embeddings apply to. The more personalized search results become, the more we’ll need to target concepts and personas rather than keywords.
Of course, we’re still talking about Google Research here, not Google Search.
But before we wrap up, let’s ask what Gemini thinks the implications are of USER-LLM for search and SEO, based on the context of our conversation so far.
*Note: this is a theoretical exercise now, not predictions or instructions. 😉
1. Personalized search results
User embeddings would enable search engines to understand individual user preferences and behaviors more deeply, which could lead to search results that are tailored to each user’s unique interests and search history. This is what I was referring to earlier with AI-organized result pages.
2. Improved user intent understanding
User embeddings could add deeper context to how search engines interpret the intent of queries, resulting in more accurate search results. This could include intent shifts (volatility) that happen during ranking system updates.
3. Long-term behavior analysis
If user embeddings can capture long-term user behavior patterns, search engines could better understand how user preferences evolve over time, leading to predictive search experiences.
For our part, SEOs could leverage these insights to create content that caters to expected interests or emerging trends.
4. Multimodal SEO
User interaction data from different modalities can be processed into vector embeddings. That insight can inform multimodal SEO strategies: understanding the relationships between different types of data helps build holistic, semantically relevant experiences throughout user journeys.
5. Local SEO
This is an interesting thought. If user embeddings incorporate location data, this could fuel more hyper-local search results.
For our part, we can leverage those insights to understand local intent and address it through multimodal local SEO strategies.
Till next time …
I hope you’ve enjoyed this week’s Hamsterdam Research article!
Feel free to comment or contact me with feedback.
I’ll likely return to this article in the future for updates and improvements.
Stay tuned for a new article next week, or if you’ve still got energy, check out related posts below.
Until next time, enjoy the vibes (a little dark, but pretty on point, ha).
Thanks for reading. Happy optimizing! 🙂
Related research articles:
PLEDGE (via Google DeepMind), Content Planning for Navigating Trade-Offs of Specificity & Attribution in KGD Systems, & Why SEOs Should Care (Maybe)
We look at Google DeepMind’s PLEDGE framework of content planning for attribution and specificity trade-offs in knowledge-grounded dialogue systems, and why SEOs should care (maybe) in this Hamsterdam History article.
Investigating TeraHAC (via Google Research), a Novel Graph Clustering Algorithm for Massive Datasets, & Why SEOs Should Care (Maybe)
In this rendition of Hamsterdam Research, we look at TeraHAC, a novel graph clustering algorithm from Google Research for large datasets.
Could Anthropic’s Identification of Millions of Features (Concepts) Activated Inside Its LLM (Claude 3 Sonnet) Influence Our Semantic SEO Strategies? (Maybe)
Anthropic identified millions of features (concepts) activated in Claude 3 Sonnet. Hamsterdam Research reviews the implications for semantic SEO strategies.