How Large Scale Self-Supervised Pretraining for Active Speaker Detection Works (via Google Research) & Why SEOs Should Care (Maybe)
By Ethan Lazuk

Welcome to the latest installment of Hamsterdam Research!
In this article, we’ll take a look at self-supervised pretraining for active speaker detection.
But why should SEOs care about that (maybe)?
Well, could you imagine if Google not only understood the content of a video but also pinpointed who’s speaking and when, and even interpreted visual cues?
A recent paper from Google Research that we’ll be reviewing shows how researchers are working on this kind of technology, which could have implications for AI models understanding video data better, specifically identifying who’s speaking in videos.
How might that change how we approach video content in SEO strategies? What other implications might this knowledge have? We shall explore!
But why this topic, specifically?
Well, there was a lot to choose from this week in the list of Google Research’s latest publications.
But after consulting with my colleagues — Gemini and ChatGPT 😉 — we landed on, “Large Scale Self-Supervised Pretraining for Active Speaker Detection.”
The paper was published by Alice Chuang, Keith Johnson, Olivier Siohan, Otavio Braga, Tony (Tuấn) Nguyễn, Wei Xia, and Yunfan Ye.

Gemini said, “This paper has the most direct relevance to SEO. Since Google focuses on understanding and indexing multimedia, techniques for identifying speakers in audio/video content can enhance search and content organization.”
ChatGPT said, “Self-supervised learning, as discussed in the paper, could provide valuable understanding into how AI models are trained on large datasets without labeled data, which is similar to how search algorithms learn from user interactions to improve search results and user experience. Insights from this study could help SEO professionals grasp newer AI methodologies that might influence future changes in search engine algorithms, particularly in how they analyze and interpret user data and content.”
And I said, even though SEO content generally focuses on text-based content for webpages, the growing emphasis on multimodal search on Google (think MUM, SGE, and Gemini) and other AI-driven search engines means video content can play a big(ger) role in the searcher’s journey, whether it’s from YouTube, TikTok, or elsewhere. The concept of AI models learning from unlabeled user data, and how they can be trained in general, is also interesting to appreciate in the context of ranking systems.
We’ll review these (potential) SEO implications more at the end! (Or you can jump ahead).
That said, from reviewing this article, I learned some new concepts around machine learning that I found generally helpful, as well. Hope you’ll feel the same!
Still intrigued?
Let’s start with the paper’s abstract
“In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. When compared to a baseline trained from scratch on much smaller in-domain labeled datasets we show that with pretraining we not only have a more stable supervised training due to better audio-visual features used for initialization, but also improve the ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.”
If that sounds a little daunting, no worries!
We’ll start with some basic concepts and vocabulary.
In short, the goal of this research is to make AI models better at identifying the speakers in videos.
When we read that an unlabeled dataset of YouTube videos was used, this means there were no labels to identify who was speaking, so the AI model learned by watching and detecting patterns.
Self-supervised pretraining involves giving the AI model a head start to study videos (the over 125k hours of YouTube videos) and then fine-tuning the model with a smaller, labeled dataset of real user interactions from Google Nest Hub Max devices, which are these:

Active speaker detection (ASD) refers to when the AI model (or a computer, in general) detects the person who is speaking at any given moment.
When the researchers mention their findings compared to a baseline trained from scratch on smaller in-domain labeled datasets, this refers to a technique of supervised learning with data of the same type that the model will be used on eventually. (In this case, it would be like using labeled Nest speaker data for pretraining instead of unlabeled YouTube data.)
Finally, when the abstract mentions the improved ASD mean average precision (mAP), this simply refers to the model having better accuracy for identifying the speakers in videos.
As for this research’s possible implications for SEO, if AI models can better detect the speakers in videos, they could also more accurately attribute spoken words within chunks (snippets of information extracted from videos) used for SGE, AI answer engines, or chatbot answers, to say nothing of normal video search results.
Now let’s get into the nitty-gritty of the research
I’ll be referencing the paper on IEEE Xplore, if you wish to follow along.
The paper has nine main sections: an introduction, relation to prior work, datasets, model, supervised fine-tuning, self-supervised pretraining, training, results, and conclusion.

We’ll check these out in sequence.
1. Introduction
The paper’s introduction starts out by mentioning the use cases for ASD, “a core component of many video analysis systems, such as diarization, multi-person Audio-Visual Automatic Speech Recognition (AV-ASR), and speech enhancement, to name a few.”
Of these, according to Gemini, multi-person AV-ASR is most relevant to the SEO field because it identifies speakers, which could help search engines attribute information to speakers in video search results, and it aids speech recognition, such as understanding a video’s context for indexing and improved discoverability.
Alternatively, diarization is mostly used for identifying speakers but not content, so it would apply more to transcription, while speech enhancement refers to audio quality for better speech recognition.
As for the dataset from the Nest Hub Max, it uses data from the Look and Talk feature.
“The ASD task described in this paper was investigated in the context of a larger system called Look and Talk, a feature on Nest Hub Max devices that allows users to interact with the device without having to use a hotword (“Ok Google”). To use Look and Talk, users simply need to look at the device and start speaking. Once the user interaction has been detected, the device then starts listening for commands.”
As for why YouTube videos were chosen for self-supervised learning (SSL) during pretraining, it’s because the “public and private labeled datasets,” which typically would be used for supervised learning, “are relatively small … and scaling is a costly process.”
“The main purpose and contribution of the present work is to show how we can leverage a large scale unsupervised dataset extracted from thousands of hours of YouTube videos to pretrain an ASD model. Self-supervised learning has been a success story in many applications, in natural language processing, computer vision and speech processing, and to the best of our knowledge it hasn’t been explored in the context of ASD.”
In other words, this research is innovative because it applies SSL during pretraining, rather than supervised learning, in the context of ASD.
[As a side note, the mention of SSL being used for natural language processing (NLP) could be notable for SEOs to explore. Per Gemini, SSL for NLP could apply to topic modeling (identifying the main topics and themes of text for categorization), named entity recognition (NER) (think semantic or vector search and knowledge graphs), and intent recognition (for search queries).]
2. Relation to prior work
This research is a continuation of past work from Google Research.
In previous research, the authors “proposed a multi-task loss for Audio-Visual Automatic Speech Recognition (AV-ASR) and ASD.”
The YouTube dataset used “was labeled for ASR,” however, it lacked “ASD labels,” which would be “a prohibitively costly process.”
This research paper thus takes the “next logical step” to “evaluate the impact of fine-tuning the model in the presence of a much smaller in-domain dataset with ASD labels after pretraining a model on this much larger unsupervised YouTube data.”
The previous work referenced includes:
- End-to-end multi-person audio/visual automatic speech recognition (2020)
- A closer look at audio-visual multi-person speech recognition and active speaker selection (2021)
- Best of both worlds: Multi-task audio-visual automatic speech recognition and active speaker detection (2022)
As for how this work is different and its overall purpose, the authors explain:
“Note that our focus is not to cover the vast array of ASD model architectures in the literature. Instead, we settled on one broad model architecture that satisfies our real time constraints, and focus on describing and evaluating the pretraining + fine-tuning strategy, which we believe generalizes to other architectures and domains, and is the main contribution of this paper.”
To dig in a bit more, for earlier research, the researchers worked around not having a labeled ASD dataset by using “an auxiliary cross-entropy loss applied within a minibatch for ASD that proved to be helpful on unsupervised data.”
As Gemini explains, how this worked was the researchers added another loss function specifically focused on ASD to the model — a loss function compares a model’s predictions to the actual answers (labeled data) — and then trained in batches so the model received “ASD-focused feedback frequently during training.”
[Note: Backpropagation is the algorithm used to figure out how to reduce the loss in a model by adjusting parameters. If we accept that a search engine is composed of various subsystems that work together, we can assume some machine learning models use backpropagation to learn specific tasks.]
But rather than taking that step of using an auxiliary loss function, this research takes the “next logical step” by investigating what happens if a model is pretrained on a large unlabeled dataset and then fine-tuned on a smaller labeled dataset.
Speaking of which …
3. The datasets
For the pretraining phase (the unlabeled data), the researchers “use over 125k hours of transcribed short YouTube video segments extracted with the semi-supervised procedure” first proposed for automatic speech recognition (ASR) in 2013 and later for audio-visual ASR in 2019.
In terms of the details of the YouTube dataset, this is how the researchers explain it:
“We extract short segments where the force-aligned user-uploaded transcription matches the transcriptions from a production quality ASR system. From these segments, we then keep the ones in which the face tracks match the audio with high confidence. … For the purposes of this work, the transcripts are ignored and we only use the audio and video tracks.”
To elaborate a bit, with Gemini’s help, on the selection process, the researchers started with a large pool of YouTube videos with user-uploaded transcripts. Next, they used an automated speech recognition system and machine-generated transcripts to compare against the user-provided ones, where close matches implied clearer audio. Finally, face tracking technology was used to check the synchronicity between mouth movements and the audio. This produced a high quality dataset that was more conducive for the model to learn from than, say, a random selection of videos.
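The selection logic described above can be sketched as a simple filter. This is a hypothetical illustration of the idea, not the paper's actual pipeline; the field names and the confidence threshold are made up:

```python
# Hypothetical sketch of the segment-selection idea: keep only segments
# whose user-uploaded transcript agrees with the ASR transcript and whose
# face track matches the audio with high confidence. Field names and the
# 0.9 threshold are illustrative, not from the paper.

def select_segments(segments, min_face_confidence=0.9):
    """Filter candidate video segments for the pretraining dataset."""
    kept = []
    for seg in segments:
        # 1) Force-aligned user transcript must match the ASR output.
        if seg["user_transcript"].strip().lower() != seg["asr_transcript"].strip().lower():
            continue
        # 2) Face track must be synchronized with the audio.
        if seg["face_audio_confidence"] < min_face_confidence:
            continue
        kept.append(seg)
    return kept
```

The key point is that each filter removes noisy examples, so the surviving segments give the model cleaner audio-visual correspondences to learn from.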
For the fine-tuning phase (labeled data) and evaluation, the researchers used “two datasets labeled for ASD on data collected internally from users who opted in to donate their interactions with their Nest Hub Max devices.”
The TFDF dataset, which was the main dataset of interest for the research, consisted of “1 million 20-60s videos” from which they “labeled around 15 thousand, spanning over 200 households.” Of those, “11K were used to train ASD and for offline metrics.”
The IPSOS dataset was used to “evaluate fairness for the core models” and consisted of “56.3K 20s videos of around 3K subjects of various ages, genders and skin tones,” where “A third of the data has gaze and active speaker labels.”
Here’s an example of the labeling tool used for these Google Nest Hub datasets:

It’s a little hard to read, but the colors are labels and say things like “0. Paying attention and speaking to assistant” or “2. Paying attention but not speaking.”
[Aside: I appreciate the reference here to “fairness” of the model. While I never participated in an AI research project, I was part of anthropological research projects as a college undergrad, where I labeled data manually from field research. In general, as much as these AI developments are potentially exciting for SEO and digital marketing, I’m hopeful they’ll also help expand anthropological research that promotes more cultural relativism and helps eliminate biases that contribute to ethnocentrism and political polarization.]
4. The model
This section gets a little into the weeds on some things, but I’ll try to keep it focused on what’s most pertinent.
Here’s how the researcher’s ASD model system architecture looks, where the gray boxes are fixed functions and the yellow boxes are trained:

The paper goes into detail about each component.
I asked Gemini to summarize the process at a high level:
“Audio Preparation:
- Breaks the audio into tiny chunks.
- Analyzes the sound patterns in each chunk.
Video Preparation:
- Matches the video speed to the audio analysis.
- Focuses on the speaker’s face.
Feature Extraction:
- Special computer ‘eyes’ find patterns in the video.
- Special computer ‘ears’ find patterns in the audio.
Combining Clues:
- Puts together the video and audio patterns.
Decision Time:
- For each moment in the video, predicts if someone is speaking.”
– Gemini summary
Since that’s a little elementary, we’ll now explore some of the components in a bit more detail, with Gemini’s help. (All quotes go to the paper, unless otherwise specified.)
Here is the first component:
“Acoustic Features. The 16kHz audio is framed with 25ms windows smoothed with the Hann window function, with strides of 10ms between frames. We compute log mel filter bank features with 80 channels, and fold every 10 consecutive feature vectors together, yielding a 800-dimensional feature vector every 100ms (10Hz). These parameters are selected so that we have acoustic features at the same frame rate as the videos from the supervised ASD datasets.”
Audio typically isn’t analyzed all at once but divided into chunks (frames). In this component, the audio is divided into frames and smoothed with a Hann window function to reduce artificial frequency artifacts at the frame edges. Mel filter banks (related to the Mel scale, which measures how humans hear pitch) divide the audio spectrum so it’s closer to how human ears work, which is important for speech recognition “because important cues about who is speaking are embedded in these frequency patterns,” per Gemini.
Lastly, the mention of selecting parameters refers to ensuring the frame rate matches the video datasets that will be used later in the supervised fine-tuning portion.
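The frame-rate arithmetic above can be sketched in a few lines. This is an illustrative toy, not the paper's feature pipeline: the log-mel features are stand-in zeros, and only the framing, windowing, and stacking math is shown:

```python
import numpy as np

# Illustrative sketch of the frame-rate arithmetic (not the paper's actual
# feature code). 16 kHz audio, 25 ms windows, 10 ms stride, then 80-dim
# feature vectors folded 10 at a time -> 800-dim vectors at 10 Hz.

SAMPLE_RATE = 16_000
WIN = int(0.025 * SAMPLE_RATE)   # 400 samples per 25 ms window
HOP = int(0.010 * SAMPLE_RATE)   # 160 samples per 10 ms stride

def frame_audio(audio):
    """Slice audio into overlapping frames smoothed with a Hann window."""
    n_frames = 1 + (len(audio) - WIN) // HOP
    window = np.hanning(WIN)
    return np.stack([audio[i * HOP : i * HOP + WIN] * window
                     for i in range(n_frames)])

def stack_features(mel, fold=10):
    """Fold every 10 consecutive 80-dim vectors into one 800-dim vector."""
    n = (len(mel) // fold) * fold
    return mel[:n].reshape(-1, fold * mel.shape[1])

audio = np.zeros(SAMPLE_RATE)        # 1 second of (silent) audio
frames = frame_audio(audio)          # frames at 100 Hz (every 10 ms)
mel = np.zeros((len(frames), 80))    # stand-in for 80-channel log-mel features
stacked = stack_features(mel)        # 800-dim vectors at 10 Hz
```

One second of audio yields roughly 100 windowed frames at 100 Hz, and folding every 10 of the 80-dim vectors produces 800-dim vectors at 10 Hz, matching the video frame rate as the quote describes.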
Another component was “Audio and Video Synchronization,” where the YouTube videos for pretraining had “frame rates ranging from around 23 to 30 fps,” so “in order to make the input uniform” they resampled them “with nearest neighbor interpolation” at 10Hz. Meanwhile, “videos in the supervised training sets (TFDF and IPSOS) are already collected at 10fps.”
Gemini explains this isn’t a sophisticated method but is computationally very fast: each new frame is simply a copy of the temporally closest frame in the original video, rather than a blend of neighboring frames.
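Here's a minimal sketch of nearest-neighbor resampling in time, assuming we just want to map a 30 fps clip down to 10 fps by picking, for each target timestep, the closest source frame. This is an illustration of the general idea, not the researchers' code:

```python
import numpy as np

# Minimal sketch of temporal nearest-neighbor resampling: for each output
# timestep, select the source frame whose timestamp is closest.

def resample_frames(frames, src_fps, dst_fps=10):
    """Pick the nearest source frame for each destination timestep."""
    duration = len(frames) / src_fps
    n_out = int(round(duration * dst_fps))
    # Map each output timestep to the nearest source frame index.
    idx = np.clip(np.round(np.arange(n_out) * src_fps / dst_fps).astype(int),
                  0, len(frames) - 1)
    return frames[idx]

video = np.arange(30)                            # 1 s of video at 30 fps (frame ids)
resampled = resample_frames(video, src_fps=30)   # 10 frames at 10 fps
```

For a 30 fps source this just keeps every third frame; for awkward rates like 23 fps the nearest-index rounding is what makes the output uniform.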
[Aside: The idea of nearest neighbor also applies in a machine learning algorithm called K-Nearest Neighbors (KNN), which categorizes (labels) data based on the closest data points. I’ve also recently started geeking out about ScaNN (Scalable Approximate Nearest Neighbor), an algorithm from Google (which I tweeted about here) that finds semantically similar documents to a query (also very fast) based on closest neighbors in a vector space.]
There was also a visual frontend and an acoustic frontend. The visual frontend involves the computation of “visual features with a 10-layer causal (2+1)D ConvNet,” which refers to a convolutional neural network (CNN) suited for time-series data, like maintaining the order of events in a video. The researchers also mention how they “compute acoustic features with a 4-layer causal 1D ConvNet,” a shallower CNN that can capture things like pitch patterns or timbre (sound quality).
Since the acoustic and visual features have been processed separately, a fusion layer fuses them “with simple concatenation,” which Gemini explains is “literally stacking the feature vectors on top of the other” to create a “new combined audio-visual feature vector.”
Next, the features are fed into an audio-visual encoder, explained by Gemini as a “standard neural network” to help “the model learn how to interpret” the features together “for a more accurate prediction.”
Finally, projection layers take the output of the encoder and, per Gemini, reduce it to “a single value per timestep,” also called a “logit.” “These logits will directly inform if the model thinks a speaker is active in each frame.”
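The fusion and projection steps above can be sketched with toy tensors. The dimensions are made up, and the encoder is replaced by an identity for brevity; this only illustrates the data flow, not the actual architecture:

```python
import numpy as np

# Toy sketch of fusion + projection: concatenate per-timestep audio and
# video feature vectors, then project down to one logit per timestep.
# All dimensions and weights here are illustrative.

rng = np.random.default_rng(0)
T, D_AUDIO, D_VIDEO = 20, 64, 128            # timesteps and feature sizes

audio_feats = rng.normal(size=(T, D_AUDIO))
video_feats = rng.normal(size=(T, D_VIDEO))

# Fusion layer: "simple concatenation" along the feature dimension.
fused = np.concatenate([audio_feats, video_feats], axis=-1)   # (T, 192)

# Stand-in for the audio-visual encoder (identity here for brevity),
# followed by a linear projection to a single logit per timestep.
w = rng.normal(size=(fused.shape[-1], 1))
logits = fused @ w                           # (T, 1)

# A sigmoid turns each logit into a per-frame "speaking" probability.
probs = 1.0 / (1.0 + np.exp(-logits))
```

The takeaway is the shape story: two feature streams become one stacked vector per timestep, and the projection collapses that vector to a single speaking/not-speaking score per frame.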
5. Supervised fine-tuning
At this point, the model has been pretrained on the large unlabeled YouTube dataset. So in this stage, the labeled dataset from Google Nest Hub videos, with labels for active and non-active speakers, will be used for supervised fine-tuning for the ASD task.
As the researchers explain:
“We first describe the supervised setup, as the self-supervised pretraining builds on this description. With explicit ASD labels Y ∈ {0,1}^(B×T) in hands for the TFDF and IPSOS supervised training from scratch and fine-tuning scenarios we use a weighted cross entropy loss: L_sup = −(1/BT) Σ_{b,t} [ w · y_bt · log(p_bt) + (1 − y_bt) · log(1 − p_bt) ], where y_bt are the ground truth labels and p_bt are the model sigmoid predictions from the tensors P_{av|a|v} = σ(Z_{av|a|v}). Since there are many more NOT SPEAKING labels than SPEAKING, w is a weight for the positive examples used to counter the class imbalance. To simplify the exposition we omit the tensor masks, but the reader should have in mind that each minibatch includes a mask tensor for each sequence and each timestep, and the losses need to take this mask into account.”
If those equations look a little perplexing, the easiest way to think about it is how Gemini lays it out, where (Y) are labels of active or non-active speakers, (pbt) is the model’s prediction, and (ybt) is the actual labels.
Since there are more “not speaking” moments, (w) is a weight used to counter this imbalance, Gemini explains, which “prevents the model from becoming biased towards always predicting ‘not speaking.’” Then we have the sigmoid, a mathematical function that squishes the model’s outputs into a 0 to 1 range.
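The weighted cross-entropy idea can be made concrete in a few lines, assuming per-frame sigmoid probabilities p and binary labels y. This is a generic sketch of the loss form, not the researchers' implementation; the weight value 5.0 is arbitrary:

```python
import numpy as np

# Sketch of a weighted binary cross-entropy loss: w upweights the rare
# positive ("speaking") frames to counter the class imbalance.

def weighted_bce(y, p, w=5.0, eps=1e-7):
    """Weighted binary cross-entropy averaged over all frames."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(w * y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 0.0, 0.0])   # one SPEAKING frame, three NOT SPEAKING
good = weighted_bce(y, np.array([0.9, 0.1, 0.1, 0.1]))   # confident, correct
bad = weighted_bce(y, np.array([0.1, 0.9, 0.9, 0.9]))    # confident, wrong
# The correct predictions incur a much smaller loss than the wrong ones.
```

Without the weight w, a lazy model could score well by always predicting "not speaking"; upweighting the positive term makes missing a speaking frame costly.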
The researchers also use auxiliary losses (additional loss terms) to force multimodal learning, or focusing the model on both audio and visual cues.
For simplicity, the researchers also omitted masks, but per Gemini, “In real implementation, they would use masks to ignore irrelevant parts of the input data (like if videos have varying lengths).”
[Asides: I’m learning how transformers work, and when I saw sigmoid I first thought of a softmax function, but Gemini explained this isn’t similar. A sigmoid is “suitable for binary classification” while softmax works for “multi-class classification (one of many possible options).” I also asked if the masks used here were similar to training BERT, but it’s a different type of masking. BERT used randomly masked tokens in the input to teach the model, whereas masking in this paper likely refers to “Padding shorter sequences with zeros or a special value to make them the same length as the longest sequence in the batch,” and that would “indicate which parts of the input are real data and which parts are padding,” or can be ignored.]
6. Self-supervised pretraining
Since the goal is to pretrain the model on unlabeled data, the researchers create artificial labels for active speakers by correlating the audio and video.
The researchers explain:
“During training, each example from the unlabeled YouTube dataset consists of a pair of matched audio and video (face) tracks, and we construct artificial labels for this dataset as follows: Starting from matched visual and acoustic features V ∈ R^(B×T×Dv) and A ∈ R^(B×T×Da), respectively, we build an augmented unsupervised minibatch with the concatenation [V+; V−], where V+ = V corresponds to the positive examples and V− to the negative examples. The negative examples are constructed with V−(b,t,v) = V((b+1 mod B), t, v)[.]”
Gemini explained it like this:
The starting point is a large dataset of YouTube videos with synchronized audio (A) and video (V) tracks. In an augmented minibatch, if the audio and video (V+) match, the speaker should be active and it’s a positive example. If there’s a mismatch, created by pairing each example’s original audio with the next example’s video (V−), so the audio and mouth movements don’t align, then it’s a negative example.
As Gemini (funnily says), “this works (sort of),” because these labels aren’t always accurate, “but it gives the model enough examples to start learning general patterns.”
The researchers also take a shortcut by constructing the mismatched pairs from the extracted features (A, V) instead of the raw audio and video, which “saves computational power during training,” per Gemini.
This uses “essentially the same model architecture as for supervised training,” per Gemini, except there are no auxiliary losses because they have “generated roughly the same number of ‘fake’ positives and negatives.”
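The negative-example trick above can be sketched with a batch shift: each audio track keeps its own video for the positives, and gets the next example's video (index b+1 mod B) for the negatives. This is an illustrative sketch under those assumptions, not the paper's code:

```python
import numpy as np

# Sketch of artificial-label construction within a minibatch: positives
# keep each video aligned with its own audio; negatives pair each audio
# with the *next* example's video, i.e. V-(b) = V((b+1) mod B).

def make_pretraining_batch(audio, video):
    """Build positive (matched) and negative (mismatched) audio-visual pairs."""
    v_pos = video                        # V+ : matched video
    v_neg = np.roll(video, -1, axis=0)   # V- : video from example (b+1) mod B
    labels = np.concatenate([np.ones(len(audio)), np.zeros(len(audio))])
    batch_audio = np.concatenate([audio, audio])
    batch_video = np.concatenate([v_pos, v_neg])
    return batch_audio, batch_video, labels

B, T, D = 4, 8, 16
audio = np.arange(B)[:, None, None] * np.ones((B, T, D))  # example b filled with b
video = np.arange(B)[:, None, None] * np.ones((B, T, D))
a, v, y = make_pretraining_batch(audio, video)
# First B pairs are matched (label 1); last B are mismatched (label 0).
```

Note this doubles the batch and yields roughly equal positives and negatives, which is why, as mentioned above, no class-imbalance weighting is needed here.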
7. Training
For the pretraining (self-supervised with unlabeled data), they “use the Adam optimizer with parameters β1 = 0.9 and β2 = 0.98,” which is “conducted on 128 TPU cores and takes approximately 2 days.”
For the fine-tuning (supervised with labeled data), they “train for 40k steps, sampling from the IPSOS/TFDF training sets with a 0.4/0.6 probability, respectively” and “use the Adagrad optimizer.”
Let’s look at what these optimizers mean as well as the numbers we’re given.
In the pretraining stage, they use the Adam optimizer. According to Gemini, this is a “very popular optimization algorithm widely used for training deep neural networks.” Essentially, Adam helps the model find the right weights and biases (parameter settings) to minimize the loss function (difference of the output from the actual data). Adam is common for “problems involving sparse features,” like NLP.
As for the parameters mentioned (the β1 and β2 symbols), they control the decay rates of the exponential moving averages of past gradients and past squared gradients, respectively. This smooths the optimization path and influences how much past information affects each update.
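To make the β1/β2 roles concrete, here is a minimal Adam update loop with the paper's β1 = 0.9 and β2 = 0.98, minimizing the toy loss f(x) = x². This is a textbook sketch of Adam, not the researchers' training code; the learning rate and step count are arbitrary:

```python
import numpy as np

# Minimal Adam optimizer on a scalar toy problem: minimize f(x) = x^2,
# whose gradient is 2x. Uses the paper's beta1 = 0.9, beta2 = 0.98.

def adam_minimize(grad_fn, x, steps=200, lr=0.1, b1=0.9, b2=0.98, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g        # moving average of gradients (beta1)
        v = b2 * v + (1 - b2) * g * g    # moving average of squared grads (beta2)
        m_hat = m / (1 - b1 ** t)        # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

x_final = adam_minimize(lambda x: 2 * x, x=5.0)
# x_final ends up near the minimum at 0.
```

You can see β1 and β2 directly in the two moving-average lines: they decide how quickly old gradient information fades from each update.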
For the fine-tuning stage, they use the Adagrad optimizer, which is also “well-suited for tasks with sparse features.” Adagrad doesn’t use momentum parameters like Adam’s β1 and β2 but instead “focuses on accumulating historical squared gradients for each parameter,” per Gemini. The probability reference describes how training batches were sampled: recall that TFDF was the “main dataset of interest” (0.6), while IPSOS was used for “fairness” (0.4).
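The 0.4/0.6 sampling scheme is simple to sketch: at each training step, draw a minibatch from IPSOS with probability 0.4 or TFDF with probability 0.6. A minimal illustration (not the actual training loop):

```python
import random

# Sketch of probabilistic dataset sampling during fine-tuning: each step
# draws from IPSOS with probability 0.4 or TFDF with probability 0.6.

random.seed(42)
counts = {"IPSOS": 0, "TFDF": 0}
for _ in range(10_000):
    dataset = random.choices(["IPSOS", "TFDF"], weights=[0.4, 0.6])[0]
    counts[dataset] += 1
# Over many steps, roughly 40% of draws come from IPSOS and 60% from TFDF.
```

This keeps the fairness-oriented IPSOS data in the mix every epoch while still letting the main TFDF dataset dominate training.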
8. Results
The researchers summarize their results as follows:
“When compared to the purely supervised baseline, we show a 23% improvement on TFDF, while not degrading the performance on IPSOS, which is the main purpose of this dataset. Moreover, we see that not only we get better mAP with pretraining, but also much more stable/lower variance training probably because of better features learned with the larger unsupervised dataset, which makes it much easier to compare different models on the supervised finetuning stage.”
Here is a table showing that 23% improvement:

And here is a chart showing “mAP variance on the dev set during training on TFDF”:

I asked Gemini to explain this graphic further:
The mAP is a common metric used for object detection model performance. The variance is the performance indicator, with lower variance indicating “the model is consistently performing well on the development set throughout training,” per Gemini. The development set refers to a subset of the TFDF dataset used to monitor model performance and help prevent overfitting during training, and it’s separate from training and test data.
As for the chart, “In an ideal scenario, the line would be relatively flat and close to a high mAP value. This would indicate that the model’s performance on the development set is consistent and good throughout training.”
Based on that note from Gemini, we can see how the blue line (self-supervised pretraining on unlabeled data followed by fine-tuning on labeled data) is much flatter than the green line (only supervised training with a labeled dataset). This indicates better, more stable performance (mAP).
Well, it took us a while to get there, but that’s a pretty cool finale!
Now for the real one …
9. Conclusion
The authors “showed a self-supervised pretraining strategy that yields 23% improvement in ASD mean average precision when compared to training from scratch on a challenging labeled dataset collected in a realistic setting from Google Nest Hub Max devices.”
I think after our journey here, we can understand what that means! And that’s awesome.
As for future implications, they state, “It would be interesting to see in future work whether a similar pretraining strategy applies to public datasets,” or openly available datasets for ASD research. Gemini suggests these datasets could include AVDIAR (group discussions), the AMI Corpus (meeting recordings), or AVA-ActiveSpeaker (derived from movies).
As takeaways in my own words, I’d say generally to remember that this paper shows how self-supervised pretraining of AI models on unlabeled datasets followed by fine-tuning with labeled datasets can produce better performance than supervised learning (labeled datasets) alone. And specifically, I’d say keep in mind that detecting “active speakers” in video content is a priority for Google. Here we saw it for Nest data (consumer side), but YouTube data was used for pretraining, so this could easily translate to other areas like Search.
Speaking of which …
Why SEOs should care (maybe) about self-supervised pretraining for ASD
To start, I can speak for myself and say before writing this post, I had no idea what ASD was, nor did I have the same level of understanding for self-supervised learning vs. supervised learning. So if learning about AI models and machine learning is the goal generally, that’s been helpful.
But what about specifics?
Let’s ask Gemini, based on all of our past conversations on this topic so far.
*Note: this is a theoretical exercise now, not predictions or instructions. 😉
1. Understanding video in search
Gemini believes the paper highlights the increasing importance that Google places on understanding video content, not only for spoken words but also visual cues.
As SEO professionals, we need to think of videos as more than just spoken content to be transcribed, and consider how a video holistically contributes knowledge toward solving a user’s search intent.
2. Google’s investment in multimodal AI
This research paper is further evidence of Google’s investment in understanding multimodal content, including audio and video. We remember the introduction of MUM, and of course, the Gemini unveil (what the quack!).
We also know that Google recently announced that Gemini 1.5 Pro now supports audio stream processing while putting it in public preview on Vertex AI.
SEO is more than text-based content; it’s all content — including social media. 😉
3. Building your own models
Given the success of pretraining on unlabeled data, SEOs could use unlabeled web data to build their own models that determine how to improve search performance. This helps avoid the cost and difficulty of creating labeled datasets.
4. Other search and AI answer engines
Competitors with Google Search are likely just as focused on the role of video content, so understanding these models, in general, can help interpret those companies’ releases of their own models, as well as optimize content for their search results or citations in summaries.
5. Understanding ranking systems
Gemini hypothesizes that active speaker detection (ASD) is likely a priority for Google’s search algorithms, as well. Understanding this technology can help SEOs evaluate the relevance and quality of videos on a deeper level.
And more generally, this research paper has shown the sophistication of AI systems, which can help SEOs contextualize how search engines “think” about content and user data.
It’s never as simple as “Google loves or hates this type of site.” 😉
Till next time …
Thanks for checking out this Hamsterdam Research article! I hope you enjoyed it.
Feel free to share feedback in the comments or contact me about that, SEO consulting, or whatever else your heart desires (except AI spam).
Stay tuned for a new research article next week (or check out past articles below). I’ll likely also revisit this article to clean up the writing, add some key points for clarity, and maybe a few more insights as they’re learned.
Until next time, enjoy the vibes:
Thanks for reading. Happy optimizing! 🙂
Related posts
How AutoBNN Automates the Discovery of Interpretable Time Series Forecasting Models & Why SEOs Should Care (Maybe)
This is the second Hamsterdam Research article, which covers AutoBNN: Probabilistic time series forecasting with compositional Bayesian neural networks.
The Embedding Language Model (ELM) & Why SEOs Should Care (Maybe)
This Hamsterdam Research article looks at ELM from “Demystifying Embedding Spaces using Large Language Models,” a Google Research paper.