Investigating TeraHAC (via Google Research), a Novel Graph Clustering Algorithm for Massive Datasets, & Why SEOs Should Care (Maybe)

Last updated:

May 10, 2024

Welcome to another week of Hamsterdam Research, where we look at recent AI research to learn just what the heck it’s talking about and explore its possible implications for the future of SEO and search.

This week, we’ll be looking at “Scaling hierarchical agglomerative clustering to trillion-edge graphs.”

This is a blog post from Google Research published on May 1st, 2024, by Laxman Dhulipala and Jakub Łącki, Research Scientists on the Graph Mining Team at Google Research.

Why is it of interest?

Well, we’re familiar with Google Search’s goal of identifying a user’s search intent and returning the most accurate and relevant results for a query.

What if Google could more accurately cluster related webpages and other data points on the scale of trillions? Not just understanding what content means but also how each page connects thematically.

This is the ambition behind cutting-edge graph clustering algorithms like TeraHAC.

Here’s a video clip from the blog post, which by the end of this article, you’ll understand in full:

Video description: “Sequential HAC (on the left) merges one pair of clusters per step. The parallel algorithm (on the right) can merge multiple pairs of clusters at each step, which results in fewer steps overall.” (Source: Google Research.)

In this article, we’ll learn about clustering algorithms like HAC, the researcher’s novel algorithm called TeraHAC, and related concepts from machine learning.

Then we’ll explore the topic’s hypothetical implications for SEO professionals (at the end).

Quick story on selecting this topic, and why it matters to SEO (maybe)

In previous weeks, I’ve selected AI papers for Hamsterdam Research directly from Google Research’s list of publications.

However, this week was different, as I had three options in mind already.

The first was this tweet from Google AI (h/t: Andrea Volpini for retweeting this):

Graph clustering merges similar items into groups to better understand relationships in data. Today, read about our recent works, including key techniques that enabled us to scale a high-quality algorithm that can cluster trillion-edge graphs. Read more → https://t.co/9pYF1xUEJ7 pic.twitter.com/jSkPuOSXBL
— Google AI (@GoogleAI) May 1, 2024

The second option was from this tweet, which would involve a retrospective look at a key paper from 2014:

Congratulations to Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Their paper “Intriguing properties of neural networks” was the #ICLR2024 Test of Time Award runner-up. Learn more at: https://t.co/ZxkfzKifbh
— Google AI (@GoogleAI) May 7, 2024

The third option came from this article by Analytics India Magazine, “The World Needs Something Better Than the Transformer”:

Which led to the paper (from Austrian researchers), “xLSTM: Extended Long Short-Term Memory”.

I couldn’t decide!

So I consulted with my colleagues — ChatGPT and Gemini — and both said (to paraphrase), “It’s a no-brainer. Choose the one about graph clustering algorithms.”

ChatGPT said: “The paper on ‘Scaling Hierarchical Agglomerative Clustering to Trillion-Edge Graphs’ is particularly relevant for SEO professionals because it addresses the challenge of analyzing massive datasets, a common scenario in SEO where understanding vast web structures and user behaviors is crucial. This research by Google highlights advanced graph clustering techniques that can significantly improve the efficiency of tasks like detecting duplicate content, optimizing website architecture, and segmenting audiences based on behavior. Furthermore, as it comes from Google, the innovations discussed may influence future algorithm updates or tools used in SEO, making it especially pertinent for SEOs aiming to stay ahead in a data-driven marketing landscape. This paper not only reflects cutting-edge technology in handling large-scale data but also aligns closely with the practical needs and interests of SEO professionals.” (My bolding) (*LLM disclaimers apply.)

Gemini said: “Search engines operate on massive datasets, making the ability to process and analyze information at scale crucial for understanding how they work. Hierarchical agglomerative clustering (HAC) is a powerful technique for organizing and grouping data, which can shed light on how search engines might categorize and understand website content. This paper delves into optimizing HAC algorithms for trillion-edge graphs, a scale directly relevant to search engines. Understanding these techniques can provide SEOs with insights into how search engines potentially cluster websites, topics, and keywords, informing strategies for content organization and optimization.” (My bolding) (*LLM disclaimers apply.)

And I said: “The more I delve into machine learning and learn about different types of algorithms, the more clustering algorithms stand out:

To take you on a quick detour, in Hamsterdam Part 45, we talked about ScaNN, an approximate nearest neighbor (ANN) algorithm. ANN can be used within clustering (such as k-means algorithms). In Hamsterdam Part 53, we saw a video on vector (semantic) search from Google Cloud, which called ScaNN the “Foundation for many Google services such as Google Search and YouTube” and showed how it can be applied in combination with other techniques:

What are you looking at above? Let’s break it down:

A combination of techniques is applied to reduce the search space (space pruning) and compress the vector size to increase ranking efficiency.
The first step of space pruning is to divide the vector space into hierarchical partitions (search tree). Each node in the tree represents the centroid (center point of a partition).
Next, the partitions closest to the query (red squares) remain, while the system prunes the rest (irrelevant search space).
The system starts by searching from the root node then moves to the branches (partitions) and then the leaves (subpartitions). This preserves the most relevant subpartitions for the second technique: data quantization.
Data quantization compresses the data points to save space and reduce indexing time.

Also in Hamsterdam Part 45, we quoted a paper from Google Cloud that said “modern semantic search engines use a two-stage retrieval approach to generate results. First, they use an ANN retriever to do a quick first pass and bring up results, and then apply a re-ranking model to fine-tune the results and make sure the most relevant ones are at the top — all in milliseconds.’ Meanwhile, the video also says how Google’s Vertex AI (commercial search product) uses ‘a similar more advanced version of ScaNN.’ (I thought it might refer to SOAR, but Gemini said unlikely.) But why am I mentioning ALL of this? Well, the point is that ANN can be used within clustering. ScaNN also stands for “Scalable” approximate nearest neighbor. So we not only can see the importance of clustering data points for search, but also in doing so at a massive scale, and efficiently. That more generally gets to the purpose of HAC. The more we grasp these concepts related to clustering and semantic search, the more we can understand in general how the content we create for SEO may be perceived as relevant by search engines to our target audiences’ search journeys.“

I wrote all of that beforehand. After creating this article, however, I also think there’s some interesting vocabulary and general machine learning concepts to learn, as well.

Still intrigued?

Let’s start with the paper’s introduction 😉

Since this is a blog post and not a research paper, we don’t have an abstract but rather an introduction:

Here is the text, as well:

“We describe a series of our recent works on building more scalable graph clustering, culminating in our “TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs” paper. We cover some of the key ideas behind this work and explain how to scale a high-quality clustering algorithm that can cluster trillion-edge graphs.”

Let’s explore some of the main concepts in there first.

Note: I’ll be using Gemini Advanced to help break down some of the vocabulary and concepts from the blog post. All quotes will be attributed to the post unless otherwise stated.

First, let’s start with what a graph is. A graph is a mathematical structure used to represent relationships between objects. It consists of nodes (individual data points) and edges (relationships between them). Edges can also have weights (numerical values representing the strength or similarity of connections).

A specialized type of graph SEOs might know is a knowledge graph. In a knowledge graph, the nodes are labeled (entities) while the edges go beyond numerical values to represent specific types of relationships.

However, graphs can also represent other data. A common example is a social media network, where nodes would be people (users) and edges would be their relationships, like whether they follow each other or how often they speak. Graphs can also apply to items, such as products on an e-commerce website (think recommender systems).

Most critically for this discussion, though, graph nodes can represent individual documents (think web pages, videos, images, etc.) and edges can represent the semantic relationships between them (think hyperlinks, co-occurrence of terms, word embeddings (vectors), etc.).

Graphs can enable information retrieval tactics such as query expansion (related documents that may not have the query mentioned directly), ranking signals (similarity), and search intent (the meaning behind words).

In the image we saw earlier from the Google Cloud video (shown again below), we had nodes and edges in a hierarchical tree structure (left). This is an example of a directed graph (one-way relationships), where edges imply a parent-child relationship (similar to a website’s structure). There are also undirected graphs, with mutual relationships. The vector space beside the tree (right) shows how the data points (nodes) would be mapped in a high-dimensional space, where closer data points would be more similar based on their embedding vectors. This vector space could also inform a hierarchical clustering process (parent-child relationships).

Graph clustering is the process of grouping nodes (data points) together based on their similarity. This can lead to discoveries of hidden patterns or less obvious types of groups (think related content on a user journey), simplify data analysis (think exploration or data compressed for speed), or enable recommendation systems (think related searches or even Google Discover).

In Hamsterdam Part 52, we saw a video from Shaw Talebi on text embeddings. In the screenshot below, we see an example of how people would be eligible for job roles (queries) based on their professional profiles (word embeddings). We can think of the same colored dots as classifying similar job roles in a vector space:

Something else to keep in mind is that there’s a difference between clustering and classification.

Classification involves a labeled dataset and is a type of supervised learning. Whereas in clustering, the dataset is unlabeled, which means the model uses unsupervised learning to group the input data based on patterns of similarity. Here’s an example from Analytics Vidha:

Since we’re talking about massive datasets here, I think it’s also helpful to see an example of what hierarchical clustering would look like on a larger-scale graph. Here’s an image shared by Aleksa Gordić:

There are also different types of graph clustering, including partitioning-based (pre-specified number of clusters), spectral clustering (uses a Laplacian matrix to summarize node connections based on strength), community detection (groups within a larger network), and Hierarchical Agglomerative Clustering (HAC) (a bottom-up approach where each node starts as its own cluster and clusters get merged iteratively based on proximity or similarity).

HAC is the focus of the Google Research post we’ll be reviewing. Specifically, the researchers introduce a novel algorithm called TeraHAC, which is meant to address the limitations of HAC for dealing with massive graphs (trillions of edges).

Phew! That was a lot of vocabulary, but hopefully it sets the stage. 🙂

Now for a deep dive into the research!

The post “Scaling hierarchical agglomerative clustering to trillion-edge graphs” is available from Google Research, if you wish to follow along.

There’s also an associated paper, “TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs.” It was published on August 7th, 2023, by Laxman Dhulipala, Jason Lee, Jakub Łącki, and Vahab Mirrokniavailable and is available on arXiv. (Here’s the full PDF.) We’ll reference the paper a little, as well.

The blog post has five sections, including an introduction, “Hierarchical agglomerative clustering (HAC),” “Speeding up HAC,” “The TeraHAC algorithm,” and a conclusion.

We’ll summarize each of these below.

1. The introduction

The researchers begin by setting the context of clustering applications as “a type of unsupervised machine learning that aims to group similar items together.”

They reference clustering within the context of “large-scale prediction and information retrieval tasks.”

The examples of such tasks they provide include:

De-duplication: identifying duplicate (or near-duplicate) data entries for removal or merging.
Anomaly detection: identifying unusual data points that don’t fit the overall pattern of the data (like outliers, errors, or potential standout events).
Privacy: deriving insights without revealing sensitive information about individuals (clustering individuals based on shared attributes).
Entity matching: grouping similar entities together based on attributes (like the same entity present in multiple datasets).
Compression: reducing the size of a large dataset for storage or efficient transmission (lower number of data points).

They also reference related posts for more examples, including:

Balanced Partitioning and Hierarchical Clustering at Scale: introduces a new distributed algorithm for balanced graph partitioning, or dividing a graph into smaller pieces of roughly equal size. (That paper mentions this applies in “the serving backend for web search,” presumably meaning Google Search.)
Best of both worlds: Achieving scalability and quality in text clustering: tackles the dilemma of choosing between quality (embedding models) and scalability (cross-attention models) with a new clustering algorithm called KwikBucks that applies aspects of both.
Practical Differentially Private Clustering: introduces a mathematical framework to limit the identification of individuals in a dataset with clustering algorithms that incorporate differential privacy, or a controlled amount of noise to mask individual contributions.

The researchers next describe a major challenge of “clustering extremely large datasets containing hundreds of billions of datapoints.”

They identify computational costs and the lack of design for certain environments as limitations:

“While multiple high-quality clustering algorithms exist, many of them are not easy to run at scale, often due to their high computational costs or not being designed for parallel multicore or distributed environments. The Graph Mining team within Google Research has been working for many years to develop highly scalable clustering algorithms that deliver high quality.”

To the point of computational costs, clustering hundreds of billions of data points involves a vast number of calculations to determine similarity. Given their complexity, some clustering algorithms may have calculations that increase faster than the size of the input, requiring days or expensive, high-performance machines to run.

The researchers also mentioned parallel multicore and distributed environments. The first refers to computers having multiple cores to work on tasks simultaneously, while the latter refers to multiple computers linked together. While these allow for scalable computing, not all clustering algorithms are adaptable to these resources (since the algorithms require sequential calculations that can’t be easily split up).

The researchers mention affinity clustering, a high-quality and scalable algorithm from ten years ago. Affinity clustering was based on “the simple idea that each datapoint should be in the same cluster as the datapoint it’s most similar to.”

While it worked for MapReduce implementation (a term we’ll explain more later), affinity clustering also had shortcomings:

“… the guiding principle for the algorithm was not always correct as similarity is not a transitive relation, and the algorithm can erroneously merge clusters due to chaining, i.e., merging long paths of pairwise related concepts to place two unrelated concepts in the same cluster, as illustrated below.”

It’s not wise to upset a wookie.

The researchers also noted how a “big improvement to affinity clustering came with the development of the Sub-Cluster Component (SCC) algorithm, which places a point in the same cluster as its most similar neighbor, only if their similarity is at least a given threshold.” (My bolding.)

This gets us back to the idea of edges denoting numerical values, where SCC would require those values (degrees of similarity) to meet a certain threshold, and then “the algorithm repeats this approach, and gradually decreases the threshold at each step.”

“Inspired by this line of work on affinity clustering and SCC,” the researchers “decided to take a fresh look at designing high-quality extremely large-scale clustering algorithms.”

In the blog post, they next describe a series of their “recent works on building more scalable graph clustering,” culminating in the paper, “TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs.”

2. Hierarchical agglomerative clustering (HAC)

The affinity clustering and SCC algorithms that inspired the researchers were themselves inspired “by the popular hierarchical agglomerative clustering (HAC) algorithm, a greedy clustering algorithm that is well known to deliver very good clustering quality.”

“The HAC algorithm greedily merges the two closest clusters into a new cluster, proceeding until the desired level of dissimilarity between all pairs of remaining clusters is achieved.”

What do they mean by “greedy“?

Well, in machine learning, a “greedy” algorithm makes choices that seem immediately optimal at each step, without considering long-term benefits or different paths or revisiting the decision later, even if subsequent information suggests a different choice would be better. The upside is computational efficiency (doesn’t consider all possible combinations at each step), while the downside can be suboptimal results compared to a more extensive search.

More specifically to HAC, “greedy” means the clustering algorithm merges the two closest clusters regardless of how that might impact the overall clustering later. Once merged, clusters aren’t split up again.

This introduces a scalability constraint, as “there is no obvious way to parallelize the algorithm” (multiple cores working simultaneously, for example) because each merge depends heavily on the previous merges, so the work can’t be easily split up.

HAC had other scalability constraints, as well:

“Moreover, the best previous implementations had quadratic complexity, that is the number of operations they performed was proportional to n², where n is the number of input data points. As a result, HAC could only be applied to moderately sized datasets.”

Quadratic growth refers to where as the input (dataset) size doubles, the number of operations the algorithm performs quadruples, becoming computationally expensive for large datasets.

The researchers also noted that “HAC can be configured using a linkage function that computes the similarity between two clusters; for example, the popular average-linkage measure computes the similarity of (A, B) as the average similarity between the two sets of points.”

Linkage functions compute the similarity between clusters. Common linkage functions include single-linkage (closest pair of points, with one point from each cluster), complete-linkage (furthest pair of points, with one from each cluster), and average-linkage (considers all points within a cluster).

“Average-linkage is a good choice of linkage function as it takes into account the sizes of the clusters that are being merged,” the researchers note. Gemini also pointed out it’s robust to outliers.

However, there was still the issue of parallelism, given that “HAC performs greedily.”

3. Speeding up HAC

The researchers, in a line of work that started in 2021, “carefully investigated the complexity and parallel scalability of HAC, from a sparse graph-theoretic perspective.”

“Sparse graph” refers to a graph where the number of edges is much smaller than the maximum possible number of edges between nodes. It implies that most pairs of data points likely have weak or no similarity, and only relatively few pairs are genuinely similar.

In practical terms, researchers can exploit the sparsity and skip calculations of similarity on data points that are unlikely to be related, improving computational efficiency.

“We observed that in practice, only m << n² of the most similar entries in the similarity matrix were actually relevant for computing high-quality clusters. Focusing on this sparse setting, we set out to determine if we could design HAC algorithms that exploit sparsity to achieve two key goals:

a number of computation steps proportional to the number of similarities (m), rather than n², and

high parallelism and scalability.”

In this scenario, (m) refers to the tiny fraction of strong connections (high similarity scores) between data points. These are what truly matter for finding good clusters, not the total number of potential connections (n²).

This relates to the first goal, where computations won’t be wasted on weak connections, and to the second goal, where sparse graphs can be broken into chunks for simultaneous processing on multiple cores (parallelism) or across multiple machines (distributed computing).

In more detail, the researchers pursued their first goal of making HAC more computationally efficient on sparse graphs by “presenting an efficient sub-quadratic work algorithm,” meaning it doesn’t grow exponentially with data size.

To accomplish this, they complemented the exact algorithm “with a simple algorithm for approximate average-linkage HAC, which runs in nearly-linear work … and is a natural relaxation of the greedy algorithm.” In short, this isn’t guaranteed to find the best clusters but can come close, involving a trade-off.

Their “next step was to attempt to design a parallel algorithm.” They “confirmed the common wisdom that exact average-linkage HAC is hard to parallelize due to the sequential dependencies between successive greedy merges.”

Think back to the “greedy” nature HAC to iteratively merge the most similar clusters. This is what’s meant by “sequential dependencies.”

In short, exact HAC, as it’s called, is difficult to parallelize.

This is where they shifted strategy and the idea of approximate HAC, or what they called “ParHAC.”

ParHAC is designed to run on multicore machines (scalable) and groups edges (divides edges into weight classes based on similarity) and processes them in parallel. This showed success on a large dataset (Web Data Commons) representing hyperlink relationships between webpages. (The WBC graphs were designed for research on search algorithms, spam detection, graph analysis algorithms, and web science research.)

As the researchers explain in more detail:

“On the algorithmic side, we developed a parallel approximate HAC algorithm, called ParHAC, that we show is highly scalable and runs in near-linear work and poly-logarithmic depth. ParHAC works by grouping the edges into O(log n) weight classes, and processing each class using a carefully designed low-depth symmetry-breaking algorithm. ParHAC enabled the clustering of the WebDataCommons hyperlink graph, one of the largest publicly available graph datasets with over 100 billion edges, in just a few hours using an inexpensive multicore machine.”

Now we can revisit the video from earlier with greater context.

The “Sequential” side speaks to the greedy algorithm that merges the nearest cluster (one per step), while the “Parallel” side uses weight classes and can merge multiple clusters per step.

Here we see the difference in speed (as well as the distribution of clusters and widths of edges):

Which ultimately produces the same result:

4. The TeraHAC algorithm

The researchers then get to the final frontier, “to obtain scalable and high-quality HAC algorithms when a graph is so large that it no longer fits within the memory of a single machine.”

This refers to graph data that is too large to fit into the memory (RAM) of a single computer, like a graph with trillions of edges.

This is where they introduce their novel algorithm, TeraHAC:

“TeraHAC is based on a new approach to computing approximate HAC that is a natural fit for MapReduce-style algorithms. The algorithm proceeds in rounds. In each round, it partitions the graph into subgraphs and then runs on each subgraph independently to make merges.”

You may recall earlier with affinity clustering how the researchers noted it “worked for MapReduce implementation.”

MapReduce is a programming model for processing vast amounts of data in parallel across multiple computers. It involves a map phase, where input data is divided into chunks that are processed independently (mapper function), and a reduce phase, where results are aggregated and combined (reducer function).

TeraHAC operates in multiple rounds of processing. In each round, the graph is split into smaller subgraphs (partitioning), which can be handled by different machines individually. Each machine clusters its assigned subgraph independently, then the results are combined after each round in a way that maintains an approximate hierarchical structure.

At most, TeraHAC uses “a few dozen rounds on all of the graphs.” This allows it to “compute a high-quality approximate HAC solution on a graph with several trillion edges in under a day using a modest amount of cluster resources.”

But as the researchers point out, “The tricky question is what merges is the algorithm allowed to make?”

“The main idea behind TeraHAC is to identify a way to perform merges based solely on information local to the subgraphs, while guaranteeing that the merges can be reordered into an approximate merge sequence. The paradox rests in the fact that a cluster in some subgraph may make a merge that is far from being approximate. However, the merges satisfy a certain condition, which allows us to show that the final result of the algorithm is still the same as what a proper (1+ε)-approximate HAC algorithm would have computed.”

Local merges refers to independent clustering decisions made on each subgraph using only the information available in that subgraph. This is essential for the distributed computing approach (scalability).

However, since the merges are done locally, there must be guaranteed reordering, where the final results across all subgraphs can be re-ordered into a sequence that approximates the results of a traditional (exact) HAC algorithm.

What’s challenging is a subgraph might not have the full context of the entire graph, so what looks like a good merge locally might not be the best choice overall, or “far from being approximate.”

The solution is that TeraHAC defines a set of specific conditions that local merges must satisfy so that merges stay consistent (don’t disrupt the overall clustering process) and allow for reordering as the final clustering result.

The researchers find that “TeraHAC achieves the best precision-recall tradeoff compared to other scalable clustering algorithms, and is likely the algorithm of choice for very large-scale graph clustering.”

Those results are shown in the figure below:

Here, the Y-axis represents precision, or the proportion of data points correctly classified within a cluster. The X-axis represents recall, or the proportion of data points actually belonging to a cluster retrieved by the algorithm.

We see that TeraHAC (green line) has the highest recall-to-precision ratio of all tested algorithms.

This appears to be Figure 13 from the full paper, and represents a dataset of search queries. This is based on section “6.3 Large Scale Clustering of Web Queries”:

“Lastly, we study the quality of TeraHACon a large-scale dataset of web search queries. We use a graph whose vertices represent queries, and edges connect queries of similar meaning. The weight of each edge is computed using a machine-learned model based on BERT.”
– TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

The similarity of queries in the study was determined by intent:

“For evaluation, we use a sample of 53 659 pairs of queries. Each pair is assigned a human-generated binary label specifying whether the two queries likely carry the same intent and thus should belong to the same cluster.”

This is an interesting point for SEOs, who are likely familiar with BERT. In this context, an ML model based on BERT was used to compute the edges (relationships) between queries in a graph. As we saw, those relationships were based on having “the same intent.”

As Pandu Nayak wrote in a 2022 blog post for Google’s The Keyword called “How AI powers great search results“:

“Launched in 2019, BERT was a huge step change in natural language understanding, helping us understand how combinations of words express different meanings and intents. Rather than simply searching for content that matches individual words, BERT comprehends how a combination of words expresses a complex idea. BERT understands words in a sequence and how they relate to each other, so it ensures we don’t drop important words from your query — no matter how small they are. … our BERT systems excel at two of the most important tasks in delivering relevant results — ranking and retrieving. Based on its complex language understanding, BERT can very quickly rank documents for relevance. We’ve also improved legacy systems with BERT training, making them more helpful in retrieving relevant documents for ranking.”
– Pandu Nayak

The figure up above shows “precision and recall with respect to these labels for TeraHAC, SCC and DBSCAN,” per the paper.

The researchers further explain:

“TeraHAC improves in quality over both SCC-5 and SCC-50, in particular delivering about 20% better recall than SCC-5. At the same time, it runs about 2x faster than SCC-50 and about 2x slower than SCC-5. This difference in performance seems to be explained by reduction in the number of edges and nodes present in the graph for TeraHAC, as compared with SCC.”
– TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

There’s plenty more information in the paper not included in the blog post. I asked Gemini 1.5 Pro in Google AI Studio to list some of the differences:

The paper gives formal definitions of reducible cluster similarity functions, including the single-linkage and complete-linkage functions we mentioned earlier, as well as median-linkage (median of all data points in a cluster).
The paper goes into more detail about ParHAC (the parallel approximate HAC algorithm) and contrasts it with TeraHAC.
The paper has more proofs and mathematical analysis, including for TerHAC, as well as details about parameter settings.
It also mentions good edges, specifically the number available to the algorithm in each round of processing.

5. The conclusion

Despite being a classic method for clustering, “HAC still has many more secrets to reveal,” the researchers conclude.

In terms of future directions, they mention “designing good dynamic algorithms for HAC” and “understanding the complexity of HAC in low-dimensional metric spaces.”

A dynamic algorithm, in the context of clustering, would efficiently update the clustering results as the underlying data changes. Traditional HAC algorithms need to re-tune the entire clustering process if new data points are added or others modified. A dynamic algorithm would be critical for clustering many real-world datasets, which are themselves dynamic.

Low-dimensional metric spaces have a small number of dimensions, so understanding HAC in this context could lead to specialized algorithms that are more efficient than general-purpose HAC.

In the full paper’s conclusion, the researchers also mention:

“As a future work, it would be interesting to see whether we can theoretically bound the number of rounds required by the TeraHAC algorithm (possibly for a carefully chosen graph partitioning method). Another open question is whether TeraHAC could be extended from computing the bottom (high similarity) part of the dendrogram to computing the entire dendrogram. Finally, we believe the notion of (1 + 𝜖)-good merges may be useful for designing efficient HAC algorithms in other models, for example in the dynamic setting.”
– TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

Since TeraHAC operates in rounds (clustering subgraphs independently), a theoretical upper bound could improve efficiency and understanding of the algorithm’s behavior.

The mention of “dendrogram” refers to a tree-like diagram that shows hierarchical relationships between clusters. TeraHAC is optimized for understanding the bottom part (clusters with high similarity) but not necessarily the complete dendrogram.

Lastly, the reference to “good merges” refers to whether the merge operation in TeraHAC could be used to design efficient algorithms for other types of HAC, like dynamic HAC.

In terms of general takeaways, the researchers transformed a classic clustering algorithm (HAC) into a novel scalable, distributed algorithm (TeraHAC) that exploits sparsity (by ignoring edges or relationships between data points that wouldn’t matter) to efficiently cluster massive graphs. This opens up new possibilities for clustering and analyzing large-scale datasets.

So why should SEOs care about TeraHAC (maybe)?

Well, just as the researchers focused on a massive dataset with trillions of edges, I think we can zoom out big picture, as well.

We can imagine how clustering algorithms could be used on unlabeled datasets to find similarities between data points, which could include webpages and other content eligible for ranking in search results, as well as search queries and their intent.

The more precisely and efficiently search engines can determine the relationships among documents at scale, the more impact that could have on what results get returned. From our perspective as SEOs, this can influence how we think about search intent and the relevance of content, and how we understand search engine capabilities on the whole.

Of course, we’re still talking about Google Research here, not Google Search.

But before we wrap up, let’s ask what Gemini thinks, based on the discussion we’ve had about TeraHAC so far, what this research’s implications are for SEO and search.

*Note: this is a theoretical exercise now, not predictions or instructions. 😉

1. Insight into search engine algorithms

Search engines like Google likely use clustering algorithms to group and categorize content. Understanding how these algorithms work (especially at a massive scale) can help SEOs better understand how search engines may view their content or group it based on relationships.

Furthermore, as clustering algorithms, like TeraHAC, improve, so could search engines’ understanding of the intent behind queries in dynamic environments, leading to more accurate, personalized, and relevant results.

2. Improved keyword and topic research

SEOs can also use clustering algorithms themselves to discover related keywords, topics, or even competitors that might not be immediately obvious, leading to new content topics (filling gaps) or optimization opportunities.

Clustering algorithms could also be used for the semantic clustering of keywords or topics to create more comprehensive content clusters relevant to user journeys.

3. Enhancing content optimization strategies

Clustering algorithms could also inform internal linking strategies based on semantically related content.

As search engines improve their understanding of content generally, they may also place more emphasis on websites that demonstrate expertise for specific topics based on deeper or more nuanced semantic relevance.

Till next time …

I hope you’ve enjoyed this week’s Hamsterdam Research article!

Feel free to comment with feedback or contact me.

Stay tuned for a new article next week (or if you’ve still got energy, check out more posts below).

Until next time, enjoy the vibes:

Thanks for reading. Happy optimizing! 🙂

SEO Strategist and Consultant

Ethan Lazuk

Previous research articles:

PLEDGE (via Google DeepMind), Content Planning for Navigating Trade-Offs of Specificity & Attribution in KGD Systems, & Why SEOs Should Care (Maybe)

We look at Google DeepMind’s PLEDGE framework of content planning for attribution and specificity trade-offs in knowledge-grounded dialogue systems, and why SEOs should care (maybe) in this Hamsterdam History article.

April 26, 2024October 5, 2024

Could Google Soon Understand Text in Your Images? Exploring Hierarchical Text Spotter (HTS) (via Google Research) & Why SEOs Should Care (Maybe)

We’ll look at Google Research’s “Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis” and why SEOs should care (maybe).

April 19, 2024October 5, 2024

How Large Scale Self-Supervised Pretraining for Active Speaker Detection Works (via Google Research) & Why SEOs Should Care (Maybe)

In this rendition of Hamsterdam Research, we look at a Google Research paper on large scale self-supervised pretraining for active speaker detection (ASD) and its (possible) implications for SEO work and video-based search results.

April 14, 2024October 5, 2024