How AutoBNN Automates the Discovery of Interpretable Time Series Forecasting Models & Why SEOs Should Care (Maybe)
By Ethan Lazuk

Welcome to another rendition of Hamsterdam Research!
In this week’s article, we’ll take a look at AutoBNN from Google Research to explain what it is, how it works, and why SEOs should care (maybe).
AutoBNN was introduced in a blog post called “AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks,” posted by Urs Köster, Software Engineer at Google Research, on March 28, 2024.

If that title sounds daunting, never fear.
Let’s first look at some basics of the vocabulary:
We can think of AI as computers thinking or acting like humans.
Machine learning (ML) falls within AI as a specific technique where computers learn on their own from data without explicit programming.
In the machine learning world, “NN” is an acronym for neural network.
Neural networks use nodes (artificial neurons) inspired by the structure of the human brain. There’s an input layer, hidden layers, and an output layer.
The connections between the nodes in a NN model are weighted. Those weights adjust in order to fine-tune the model to reach a desired output (or close to it) based on the inputs.
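To make that weight adjustment concrete, here is a minimal pure-Python sketch of my own (not code from any source in this article): one gradient-descent update rule for a single linear neuron, repeated until the weight settles near the value that reproduces the target.

```python
# Minimal sketch (illustration only): gradient-descent updates for a
# single linear neuron y = w * x, with a squared-error loss.
def update_weight(w, x, target, lr=0.1):
    prediction = w * x
    error = prediction - target
    gradient = 2 * error * x      # d/dw of (w*x - target)^2
    return w - lr * gradient      # step against the gradient

w = 0.0
for _ in range(50):               # repeated updates pull w toward 2.0
    w = update_weight(w, x=1.0, target=2.0)
print(round(w, 3))                # → 2.0
```

Real networks do this across millions of weighted connections at once, but the principle is the same: nudge each weight in the direction that reduces the error.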
There are different types of neural networks that handle various sorts of data for specific tasks. This is why you’ll often see another letter in front of NN.
In BNN, the “B” stands for Bayesian, or Bayesian neural network.
This refers to Bayesian inference (statistics). The Bayes’ theorem starts with prior knowledge of data probabilities and then updates the probability of a hypothesis as more data becomes available. Bayesian methods are used in different ML models, including here with neural networks.
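As a hedged illustration of that prior-to-posterior update (my own toy example, not from the AutoBNN post): estimating a coin's heads-probability with a conjugate Beta prior, where Bayes' rule conveniently reduces to counting.

```python
# Bayesian updating with a Beta-Binomial model: start from a prior
# belief, then update it as evidence (coin flips) arrives.
alpha, beta = 1.0, 1.0            # uniform prior: no knowledge yet

def update(alpha, beta, heads, tails):
    # For a Beta prior and Binomial data, Bayes' rule is just counting.
    return alpha + heads, beta + tails

alpha, beta = update(alpha, beta, heads=8, tails=2)
posterior_mean = alpha / (alpha + beta)   # updated belief about the coin
print(posterior_mean)                      # → 0.75
```

More data keeps shifting the posterior; that continuous updating is the core of every Bayesian method discussed below.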
Why are BNNs necessary?
One issue with traditional neural networks, which use fixed weights, is overfitting: when training data is limited, the model memorizes it and then struggles to generalize to new data.
A Bayesian neural network views weights between nodes as probabilities rather than fixed values, so it gives the final answer (output) with a measure of uncertainty.
This is helpful for knowing the confidence level of a prediction, particularly for scenarios where data is limited or safety is critical, like a medical diagnosis.
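Here is a toy sketch of that idea (my construction, not AutoBNN code): a "Bayesian" neuron whose weight is a distribution rather than a fixed number. Sampling many weights yields a spread of predictions, and that spread is the uncertainty estimate.

```python
import random
import statistics

random.seed(0)
w_mean, w_std = 2.0, 0.5          # posterior belief about the weight

def predict(x, n_samples=1000):
    # Draw many plausible weights and collect the resulting predictions.
    samples = [random.gauss(w_mean, w_std) * x for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

mean, spread = predict(x=3.0)
# mean lands near 6.0; spread near 1.5 — the model reports how unsure it is
```

A traditional network would return only the single number; the Bayesian version also tells you how much to trust it.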
Introducing AutoBNN
The blog post’s introduction explains the context behind AutoBNN in two paragraphs.
Let’s start with the first paragraph:
“Time series problems are ubiquitous, from forecasting weather and traffic patterns to understanding economic trends. Bayesian approaches start with an assumption about the data’s patterns (prior probability), collecting evidence (e.g., new time series data), and continuously updating that assumption to form a posterior probability distribution. Traditional Bayesian approaches like Gaussian processes (GPs) and Structural Time Series are extensively used for modeling time series data, e.g., the commonly used Mauna Loa CO2 dataset. However, they often rely on domain experts to painstakingly select appropriate model components and may be computationally expensive. Alternatives such as neural networks lack interpretability, making it difficult to understand how they generate forecasts, and don’t produce reliable confidence intervals.”
– AutoBNN blog post, Google Research
We see that we’re talking about “time series problems.” The examples given include forecasting weather, traffic, or economic trends.
But since we’re SEOs, let’s consider some time series problems relevant to our world:
- Search demand: understanding seasonality for topical domains of queries (keywords) or spotting emerging trends.
- Search trends: spotting long-term shifts in search behavior (user intent), like which topical domains are becoming more popular or declining.
- Traffic anomalies: finding spikes or drops in traffic could signal rapid search volume trends (breaking news), technical issues, or bots (fraud issues).
We’ll explore these more in the final hypothetical section.
Bayesian approaches
The introduction explains how Bayesian approaches start with assumed data patterns (prior knowledge), collect evidence, and then continuously update the assumption “to form a posterior probability distribution.”
That jibes with our vocabulary introduction earlier.
Now we also get some examples of Bayesian approaches, including “Gaussian processes (GPs)” and “Structural Time Series.”
Let’s quickly look at each of these using the references from the paper, with summarization help from Gemini Advanced.
Gaussian processes
Gaussian processes “can model complex relationships and temporal dependencies in time series data. … Rather than giving a single best-fit line, GPs provide a full probability distribution over functions. This captures uncertainty and allows for risk assessment,” explains Gemini.
GPs can also encode prior knowledge about the modeled system, chiefly through the choice of a kernel, among other parameters.
What’s a kernel?
It’ll be an important concept to know for AutoBNN, so let’s dig in more:
“The kernel function is the heart of a GP. It defines the similarity between data points – how smooth or wiggly the potential functions are expected to be. The choice of kernel reflects your beliefs about the underlying process. …
Time series experts can inject their knowledge into the model by carefully choosing kernel functions that represent their understanding of seasonality, cyclical behavior, or other known characteristics of the data.”
– Gemini
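To see what a kernel actually computes, here is a minimal RBF (squared-exponential) kernel written from the standard textbook formula, not from any AutoBNN source: similarity decays smoothly with distance, and the lengthscale parameter controls how "wiggly" the functions are allowed to be.

```python
import math

def rbf(x1, x2, lengthscale=1.0):
    # Similarity between two points: 1.0 when identical, falling
    # toward 0 as the points move apart.
    return math.exp(-((x1 - x2) ** 2) / (2 * lengthscale ** 2))

print(rbf(0.0, 0.0))                   # 1.0 — a point is maximally similar to itself
print(rbf(0.0, 0.5) > rbf(0.0, 3.0))  # True — nearby points are more similar
```

Choosing a periodic kernel instead would make points one cycle apart "similar" — which is exactly how an expert injects knowledge of seasonality.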
The source listed in the blog post’s introduction for Gaussian processes is gaussianprocess.org, which has a full book for download.
In chapter 7, called “Theoretical Perspectives,” the book discusses traditional feed-forward neural networks (described as “ANNs,” or artificial neural networks):
“In the 1980’s there was a large surge in interest in artificial neural networks (ANNs), which are feedforward networks consisting of an input layer, followed by one or more layers of non-linear transformations of weighted combinations of the activity from previous layers, and an output layer. One reason for this surge of interest was the use of the backpropagation algorithm for training ANNs. Initial excitement centered around that fact that training non-linear networks was possible, but later the focus came onto the generalization performance of ANNs, and how to deal with questions such as how many layers of hidden units to use, how many units there should be in each layer, and what type of non-linearities should be used, etc.
For a particular ANN the search for a good set of weights for a given training set is complicated by the fact that there can be local optima in the optimization problem; this can cause significant difficulties in practice. In contrast for Gaussian process regression and classification the posterior for the latent variables is convex.
One approach to the problems raised above was to put ANNs in a Bayesian framework, as developed by MacKay [1992a] and Neal [1996]. This gives rise to posterior distributions over weights for a given architecture, and the use of the marginal likelihood (see section 5.2) for model comparison and selection. In contrast to Gaussian process regression the marginal likelihood for a given ANN model is not analytically tractable, and thus approximation techniques such as the Laplace approximation [MacKay, 1992a] and Markov chain Monte Carlo methods [Neal, 1996] have to be used. Neal’s observation [1996] that certain ANNs with one hidden layer converge to a Gaussian process prior over functions (see section 4.2.3) led us to consider GPs as alternatives to ANNs.”
– Gaussian Processes for Machine Learning (Chapter 7)
Not all of that information is necessary to carry forward with us, but we see the relation of ANNs to GPs.
What makes GPs different from ANNs?
“GPs and ANNs are generally considered distinct classes of machine learning models. GPs are probabilistic, kernel-based methods, while ANNs are focused on learning complex feature representations through interconnected nodes. …
The theoretical connection discovered by Neal suggests that some types of infinitely wide neural networks essentially behave like GPs. This highlights the overlap in what these models can represent.”
– Gemini
So now we have an understanding that “infinitely wide” neural networks can behave like GPs. That will come up again later.
But let’s also explore what “local optima in the optimization problem” refers to.
This is the explanation from Gemini, which seems to help explain why GPs are different from (and when they’re preferable to) ANNs:
“Neural Networks (ANNs):
- Complex Landscape: Imagine a hilly and bumpy terrain. Training an ANN involves finding the lowest point (optimal weights) in this landscape that minimizes the error between the network’s prediction and the actual data.
- Local Optima: Unfortunately, the landscape can have many dips and valleys (local optima) that aren’t the absolute lowest point (global optimum). The training algorithm might get stuck in one of these local minima, leading to subpar performance.
- Optimization Challenges: Finding the best network architecture (number of layers, neurons) and avoiding local optima can be challenging and requires careful tuning of hyperparameters.
Gaussian Processes (GPs):
- Smooth Distribution: GPs define a smooth probability distribution over possible functions. Think of it like a gently rolling landscape. The most likely functions (those with higher probability) will pass close to the observed data points.
- Simpler Optimization: Due to the smoothness, there are no sharp cliffs or valleys to get stuck in. Finding the best function (maximizing the probability) is often mathematically easier compared to the complex optimization in ANNs.
Key Advantage of GPs:
This smoother optimization landscape makes GPs generally less prone to getting stuck in local optima compared to ANNs. This can be particularly beneficial for smaller datasets or complex problems where the risk of local optima is higher.”
– Gemini
For clarity, the issue of local optima is a different problem from the interpretability concern mentioned in the introduction of the blog post. The first is about finding the best possible weights for a network, while the latter is about understanding how a network arrives at its predictions.
That said, one problem with GPs relative to ANNs is they can be computationally expensive for large datasets.
Structural time series (STS)
Along with GPs, STS was the other “traditional Bayesian approach” mentioned in the blog post’s introduction.
The introduction links to the TensorFlow Blog, specifically a post from March 20th, 2019, called, “Structural Time Series modeling in TensorFlow Probability.”
This post explains how “Structural time series (STS) models [3] are a family of probability models for time series that includes and generalizes many standard time-series modeling ideas.”
It goes on:
“An STS model expresses an observed time series as the sum of simpler components …
The individual components are each time series governed by a particular structural assumption. For example, one component might encode a seasonal effect (e.g., day-of-week effects), another a local linear trend, and another a linear dependence on some set of covariate time series.
By allowing modelers to encode assumptions about the processes generating the data, structural time series can often produce reasonable forecasts from relatively little data (e.g., just a single input series with tens of points). The model’s assumptions are interpretable, and we can interpret the predictions by visualizing the decompositions of past data and future forecasts into structural components. Moreover, structural time series models use a probabilistic formulation that can naturally handle missing data and provide a principled quantification of uncertainty.”
– TensorFlow Blog
So now we know STS models are interpretable and can produce reasonable forecasts despite little data.
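The "sum of simpler components" idea from the quote above can be sketched in a few lines (values here are made up for illustration): a synthetic daily series built from a local linear trend plus a day-of-week seasonal effect.

```python
import math

n = 28  # four weeks of daily data
trend = [0.5 * t for t in range(n)]                                  # local linear trend
seasonal = [3.0 * math.sin(2 * math.pi * t / 7) for t in range(n)]   # day-of-week effect
observed = [trend[t] + seasonal[t] for t in range(n)]

# Because the model is additive, every observation decomposes exactly
# back into its named, interpretable parts:
assert all(abs(observed[t] - (trend[t] + seasonal[t])) < 1e-12 for t in range(n))
```

Fitting an STS model runs this in reverse — it infers the hidden trend and seasonal components from the observed series — which is what makes its forecasts explainable.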
How are STS models and GPs different?
- “Need Explainability & Limited Data? STS models might be preferred. They decompose the time series into understandable components, making them helpful for explaining trends and seasonal patterns.
- Prioritize Uncertainty & Complex Patterns? GPs can be a better fit. They don’t make rigid structural assumptions, allowing them to capture intricate patterns and provide reliable confidence intervals for their predictions.”
– Gemini
So now that we have an understanding of GPs and STS models and the limitations of traditional ANNs, let’s explore the second part of the introduction.
What is AutoBNN?
After explaining traditional Bayesian approaches and the limits of NNs, the blog post introduces AutoBNN:
“To that end, we introduce AutoBNN, a new open-source package written in JAX. AutoBNN automates the discovery of interpretable time series forecasting models, provides high-quality uncertainty estimates, and scales effectively for use on large datasets. We describe how AutoBNN combines the interpretability of traditional probabilistic approaches with the scalability and flexibility of neural networks.”
– AutoBNN blog post, Google Research
Based on this section, we can extrapolate some of the benefits of AutoBNN:
- It “automates the discovery of interpretable time series forecasting models,” which speaks to the issue of interpretability in NNs.
- It “provides high-quality uncertainty estimates,” which speaks to the value of Bayesian processes, like GPs or STS models.
- It “scales effectively for use on large datasets,” addressing the computational expense GPs face as datasets grow.
Now, let’s delve into the rest of the blog post for more specifics on how AutoBNN works.
Kernel structures
The blog post explains the origins of AutoBNN as being “based on a line of research that over the past decade has yielded improved predictive accuracy by modeling time series using GPs with learned kernel structures.”
Recall the use of a kernel in GPs from earlier on? Well, here it is explained again from a thesis by David Duvenaud, linked from the blog post:
“the choice of kernel (a.k.a. covariance function) determines almost all the generalization properties of a GP model. You are the expert on your modeling problem – so you’re the person best qualified to choose the kernel!”
– David Duvenaud
The blog post from Google Research also speaks about two types of kernels: a base kernel (single) and a composite kernel “that combines two or more kernel functions.”
The composite kernel also “serves two related purposes,” including that “it is simple enough that a user who is an expert about their data, but not necessarily about GPs, can construct a reasonable prior for their time series.”
Here’s where AutoBNN enters the picture.
AutoBNN replaces “the GP with Bayesian neural networks (BNNs) while retaining the compositional kernel structure.”
Advantages of BNNs over GPs
Recall how earlier on we asked Gemini to explain the advantages of GPs over NNs, and it focused on the local optima problem for small datasets?
Well, now let’s look at the blog post’s reasonings for the advantage of BNNs to GPs:
“BNNs bring the following advantages over GPs: First, training large GPs is computationally expensive, and traditional training algorithms scale as the cube of the number of data points in the time series. In contrast, for a fixed width, training a BNN will often be approximately linear in the number of data points. Second, BNNs lend themselves better to GPU and TPU hardware acceleration than GP training operations. Third, compositional BNNs can be easily combined with traditional deep BNNs, which have the ability to do feature discovery. One could imagine ‘hybrid’ architectures, in which users specify a top-level structure of Add(Linear, Periodic, Deep), and the deep BNN is left to learn the contributions from potentially high-dimensional covariate information.”
– Google Research Blog
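The first claim — cubic versus roughly linear scaling — can be felt with quick arithmetic. The cost functions below are crude stand-ins with made-up constants, just to show how the gap widens with data size:

```python
def gp_cost(n):
    # Classic GP training scales as the cube of the number of points.
    return n ** 3

def bnn_cost(n, width_factor=1000):
    # Fixed-width BNN training is roughly linear in the number of
    # points; width_factor is an arbitrary illustrative constant.
    return width_factor * n

for n in (100, 1_000, 10_000):
    print(n, gp_cost(n) / bnn_cost(n))
# The ratio grows with n^2: 10x more data means ~100x worse for the GP.
```

This is why the post emphasizes scalability — at tens of thousands of points, the cubic term dominates everything else.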
There are a few points to examine from this.
TPUs
First, this mentions TPU hardware. TPU stands for Tensor Processing Unit. (Remember how we learned about STS models from the TensorFlow Blog?)
“Google designed TPUs specifically to optimize the execution of TensorFlow models,” according to Gemini.
TPUs are specifically designed for machine learning and “excel at the large matrix computations that are fundamental to training and running neural networks” by offering “significant speed improvements and power-efficiency compared to CPUs and GPUs for deep learning tasks,” per Gemini again.
Deep BNNs vs. compositional BNNs
With Gemini’s help, let’s compare what a “hybrid” architecture might mean.
Typically for neural networks, “deep” refers to deep neural networks (DNNs), which empower deep learning. (If you think of Google DeepMind, deep learning is at the core of its methodologies.)
Deep BNNs would therefore have multiple hidden layers, like a DNN, used to help find complex and non-linear relationships for time series. This could lead to feature discovery, or discovering patterns in time series data that wouldn’t otherwise be readily apparent.
Compositional BNNs have more pre-defined structures and are used for modeling linear trends or known characteristics of time series data, such as seasonality. Their scalability would make them a good choice for large datasets.
The “hybrid” architectures referenced in the blog post thus could mean meshing the scalability and interpretability of compositional BNNs with the feature learning capabilities of deep BNNs.
Translating GPs to BNNs
The blog post next talks about translating a GP with compositional kernels into a BNN. In short, researchers discovered that “many popular GP kernels (such as Matern, ExponentiatedQuadratic, Polynomial or Periodic) can be obtained as infinite-width BNNs with appropriately chosen activation functions and weight distributions. Furthermore, these BNNs remain close to the corresponding GP even when the width is very much less than infinite.”
Why does this matter? Think back to the reference from that book chapter about “infinitely wide” neural networks behaving like GPs.
Here’s how Gemini summarizes the key points:
“The Bridge: Infinitely Wide BNNs
- The Theory: The passage suggests that a single layer neural network (a very simple BNN architecture) can theoretically approximate a GP if the number of neurons in that layer becomes infinitely large (infinite width).
- The Reverse Mapping: More interestingly, researchers have discovered that many commonly used GP kernels (like Matern, Exponential Quadratic, etc.) can be mapped to BNNs with a finite number of layers (not infinitely wide).
Key Points
- Matching Kernels with BNNs: By carefully choosing activation functions and weight distributions in a BNN, it’s possible to achieve a similar output to a GP with a specific kernel.
- Beyond Infinity: Even with a limited number of layers (much less than infinite), these BNNs can still closely resemble the corresponding GP. This is a significant finding as it allows us to leverage some of the benefits of BNNs while retaining the capabilities of GPs.”
– Gemini
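A rough numerical sketch of Neal's observation (my own construction, not from the blog post): draw many random one-hidden-layer tanh networks with suitably scaled weights and look at the distribution of outputs at one fixed input. The outputs cluster like draws from a zero-mean Gaussian — the fingerprint of a GP prior over functions.

```python
import math
import random
import statistics

random.seed(1)

def random_net_output(x, width):
    # One-hidden-layer tanh network with random Gaussian weights.
    # Output weights are scaled by 1/sqrt(width) so the variance stays
    # finite as width grows — the regime where the GP limit applies.
    total = 0.0
    for _ in range(width):
        w, b, v = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
        total += v * math.tanh(w * x + b)
    return total / math.sqrt(width)

outputs = [random_net_output(0.5, width=200) for _ in range(500)]
mean = statistics.mean(outputs)
# mean sits near 0 with a stable spread — consistent with a zero-mean
# Gaussian process prior over function values at x = 0.5
```

AutoBNN's trick is running this correspondence in reverse: picking activation functions and weight priors so a finite-width BNN mimics a chosen GP kernel.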
The blog post shows examples of GPs and BNNs using four different kernels (RBF, periodic, quadratic, and linear). The top rows are GPs and the bottom rows are BNNs:

How to use AutoBNN
The blog post next mentions how the AutoBNN package is part of the TensorFlow Probability Python library (recall the STS model blog):
“TensorFlow Probability (TFP) is a Python library built on TensorFlow that makes it easy to combine probabilistic models and deep learning on modern hardware (TPU, GPU). It’s for data scientists, statisticians, ML researchers, and practitioners who want to encode domain knowledge to understand data and make predictions.”
– TensorFlow
The package is implemented in JAX, a numerical computation library developed by Google that offers automatic differentiation capabilities (think ML model training), and uses the flax.linen neural network library, which offers pre-built NN modules and was specially designed to work with JAX.
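JAX computes exact derivatives automatically; as a plain-Python stand-in (deliberately not JAX code), here is the quantity it automates — the derivative of a loss with respect to a weight, approximated by a central finite difference:

```python
def loss(w):
    # Squared error of a prediction w * x against a target (x=2, target=3).
    return (w * 2.0 - 3.0) ** 2

def finite_diff(f, w, eps=1e-6):
    # Central finite-difference approximation to df/dw.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

grad_at_1 = finite_diff(loss, 1.0)
# Analytically, d/dw (2w - 3)^2 = 8w - 12, so the gradient at w=1 is -4.
```

With JAX you would get this gradient exactly (not approximately) via its automatic differentiation, for arbitrarily complicated loss functions — which is what makes training BNNs in AutoBNN practical.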
As Gemini explains:
“The combination of JAX and Flax.linen provides a powerful foundation for AutoBNN. This framework allows researchers to explore the intersection of Gaussian Processes and neural networks, potentially leading to more interpretable and efficient models.”
– Gemini
Eligible kernels
Lastly, the blog post mentions how AutoBNN “implements all of the base kernels and operators discussed so far (Linear, Quadratic, Matern, ExponentiatedQuadratic, Periodic, Addition, Multiplication) plus one new kernel” called “OneLayer.”
Here are representations of each:

Using that reference, let’s quickly look at what these different kernels mean, with help from Gemini:
- RBF (Radial Basis Function): “Captures smooth, non-linear relationships between data points. Data points closer together in space are considered more similar and influence each other more significantly in the prediction process. … Widely used as a general-purpose kernel for a variety of regression and classification tasks.”
- Matern: “Similar to the RBF kernel, it models smooth, non-linear relationships. However, it offers more flexibility in controlling the smoothness of the function through a parameter called “nu” (nu). … Lower nu values make the function less smooth and allow for sharper transitions.”
- Linear: “Captures linear relationships between data points. Data points with higher values on the x-axis will have a proportional increase in the predicted y value. … Suitable for situations where you expect a linear relationship between the input features and the target variable. Can also be used as a baseline model for comparison with more complex kernels.”
- Quadratic: “Captures relationships where the output increases or decreases quadratically (like a parabola) with respect to the input features. … Useful when you expect a U-shaped or an inverted U-shaped relationship between the features and the target variable.” (Note: Parabola is a TOOL song on Lateralus, an album heavily based on mathematics. Guess we know which tune we’ll end this with.) 😉
- OneLayer: “Can learn a wider range of potentially complex, non-linear relationships compared to the other basic kernels mentioned above. … Provides a way to leverage the learning capabilities of neural networks within the GP framework, potentially leading to more flexible and expressive models.”
- Periodic: “Captures repeating patterns in the data, specifically useful for time series data where there might be seasonality (e.g., daily, monthly, yearly patterns). … Ideal for modeling cyclical or seasonal trends in time series forecasting or other types of data with repeating patterns.”
We can see why choosing the right kernel (or combining them, compositionally) matters for capturing the underlying data structure in a GP or, in this case, a BNN.
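A hedged sketch of kernel composition in the spirit of AutoBNN's Addition and Multiplication operators — these helper functions are my own illustrations, not the package's API:

```python
import math

def linear(x1, x2):
    # Linear kernel: captures a long-run trend.
    return x1 * x2

def periodic(x1, x2, period=7.0, lengthscale=1.0):
    # Standard periodic kernel: points a whole period apart are similar.
    return math.exp(-2 * math.sin(math.pi * abs(x1 - x2) / period) ** 2
                    / lengthscale ** 2)

def add(k1, k2):
    # Addition operator: the sum of two kernels is itself a valid kernel.
    return lambda x1, x2: k1(x1, x2) + k2(x1, x2)

# "Linear + Periodic": a long-run trend with a weekly cycle on top.
trend_plus_weekly = add(linear, periodic)
print(trend_plus_weekly(1.0, 8.0))  # linear part 8.0 plus periodic part 1.0
```

A structure like Add(Linear, Periodic) stays readable — you can point at each term and say what pattern it models — which is the interpretability AutoBNN carries over from GPs into BNNs.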
The blog post concludes by saying:
“AutoBNN provides a powerful and flexible framework for building sophisticated time series prediction models. By combining the strengths of BNNs and GPs with compositional kernels, AutoBNN opens a world of possibilities for understanding and forecasting complex data.”
– Google Research
It then links to a colab for users to try out AutoBNN.
Why SEOs should care (maybe) about AutoBNN
Let’s now explore some of the reasons SEOs may be interested in AutoBNN or time series data more generally based on interpretations from Gemini.
*Note: this is a theoretical exercise, not a prediction or instructions. 😉
1. Understanding search intent through time series
When search queries exhibit trends or seasonal patterns, SEOs can advise on which content would be most relevant for those users.
In this case, AutoBNN could be used to:
- Forecast query volumes and trends.
- Identify emerging topics based on changes in search patterns.
- Understand the factors that influence search behavior changes over time.
2. More accurate knowledge graphs for semantic search
AutoBNN could help find relationships in search data that inform new connections between entities in knowledge graphs. This could also influence the importance or prominence of entities based on search activity, surfacing trending entities based on demand or seasonal relevance.
From the SEO side, this would pertain more to content creation and optimizing for relevancy.
3. Signaling trust for information retrieval in RAG models
By knowing which data sources and patterns influenced the generated text in a RAG model, and providing a confidence level for that text, AutoBNN could offer more interpretability for the results.
From the SEO side, this could help signal which factors influence trust in sources and hint at optimization opportunities.
More potential uses of AutoBNN for SEO tasks
AutoBNN could be applied directly to SEO-related tasks in the following ways:
- Search volume forecasting of query topics over time. (One caveat: you need sufficient data to power accurate forecasts.)
- Analyzing internal website search data for time-based patterns to spot trends and inform content topics. (For example, if your internal search data shows popularity for certain products or informational topics at particular times of the year or under particular circumstances, you can leverage that insight to maximize those assets or create related content.)
- Forecast popularity scores for existing content to coordinate refreshes before peak periods of interest.
- Identify trending topics or content patterns (images, videos, etc.) based on competitors’ search traffic trends to inform content opportunities.
Implications of using time-series data for SEO
In general, SEOs may care about time-series analysis for the following reasons:
- Proactive content strategy: Reveal the dynamics behind keyword volumes (query topics) for seasonal patterns, long-term trends, or sudden shifts.
- Data-driven justifications: Align resource allocation with time-series forecasts to provide tangible justifications for recommendations.
- Anomaly detection: Spot unexpected dips or increases in traffic to inspect technical problems or issues with bots.
- Predictive mindset: Familiarity with time-series data builds a predictive mindset for forward-thinking recommendations.
“Like a parabola”
I hope you’ve enjoyed this rendition of Hamsterdam Research.
Recall how the quadratic kernel from earlier mentioned it can capture “relationships where the output increases or decreases quadratically (like a parabola) with respect to the input features.”
Well, Parabol and Parabola are also TOOL songs from the album Lateralus, which is heavily based on mathematics.
Stay tuned for another article next week (or check out the previous week’s article below).
Until next time, enjoy the vibes:
Thanks for reading. Happy optimizing! 🙂
Related posts
The Embedding Language Model (ELM) & Why SEOs Should Care (Maybe)
This Hamsterdam Research article looks at ELM from “Demystifying Embedding Spaces using Large Language Models,” a Google Research paper.