Ethan Lazuk

SEO & marketing professional.


Introducing Differential Transformer, and why marketers should care.

🐹 Hamsterdam Research is back with a new and exciting topic: Differential Transformer.

You may be familiar with Transformers, the basis for ML models like GPT and BERT.

A Differential Transformer (Diff Transformer) is an improved version designed to handle attention better by focusing on relevant information and filtering out noise.

In this recent Microsoft Research paper published on October 7th by Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei, the authors give the following explanation in their abstract:

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture to advance large language models.

– Abstract (my bolding)

The paper has four main sections, including an introduction, section on the Diff Transformer, experiments, and a conclusion.

Let’s review some of the key details.

1. Introduction

The authors explain how “the decoder-only Transformer” has emerged as “the de facto standard for large language models (LLMs)” in recent years.

📝 Decoder-only models, like GPT, generate text autoregressively: they predict each next token from the tokens that came before, rather than pairing with a separate encoder (encoder models like BERT instead produce condensed representations of input text). Under the hood, both operate on vector embeddings that capture the semantic meaning of words, images, or other content.

The heart of the Transformer is the attention mechanism, the authors write, “which employs the softmax function to weight the importance of various tokens in a sequence.” One challenge, however, is LLMs’ ability to retrieve key information for context.

“The issue arises from non-negligible attention scores assigned to irrelevant context, which ultimately drowns out the correct answer. We term these extraneous scores as attention noise.”
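To see why softmax produces this noise, here's a tiny NumPy sketch (the logit values are invented for illustration): softmax never assigns exactly zero weight, so even clearly irrelevant tokens soak up some attention.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy attention logits: one highly relevant token among four distractors.
logits = np.array([6.0, 1.0, 0.5, 1.2, 0.8])
weights = softmax(logits)

# The relevant token dominates, but every distractor still gets a
# nonzero score -- the "attention noise" the paper is targeting.
print(weights)
```

Because the exponential function is always positive, no amount of training can drive those extraneous scores fully to zero under standard softmax attention.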

They show an experiment where the “task is to retrieve an answer embedded in the middle of a pile of documents.” The Diff Transformer allocates a normalized attention score of 0.31 to the answer span versus 0.03 for a standard Transformer, and achieves 85% retrieval accuracy compared to 55%.

Figure 1 showing Transformer vs. Diff Transformer.

The authors describe Diff Transformer as “a foundation architecture for large language models,” which uses a “differential attention mechanism … to cancel attention noise with differential denoising.”

They explain more about how differential attention works:

“Specifically, we partition the query and key vectors into two groups and compute two separate softmax attention maps. Then the result of subtracting these two maps is regarded as attention scores. The differential attention mechanism eliminates attention noise, encouraging models to focus on critical information.”

They describe the approach as “analogous to noise-canceling headphones and differential amplifiers in electrical engineering, where the difference between two signals cancels out common-mode noise.”
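The subtraction the authors describe can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's exact parameterization: the shapes, the random data, and the fixed `lam` value (the paper's λ is learnable) are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.8):
    """Differential attention: the difference of two softmax maps.

    Q1/K1 and Q2/K2 are the two query/key groups (shape [n, d]);
    V is the value matrix. `lam` stands in for the paper's learnable
    lambda scalar.
    """
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    A = A1 - lam * A2  # common-mode "attention noise" cancels out
    return A @ V

rng = np.random.default_rng(0)
n, d = 4, 8
Q1, K1, Q2, K2 = (rng.standard_normal((n, d)) for _ in range(4))
V = rng.standard_normal((n, d))
out = diff_attention(Q1, K1, Q2, K2, V)
print(out.shape)  # (4, 8)
```

Because both maps tend to assign similar scores to irrelevant tokens, the subtraction suppresses that shared component, much like the noise-canceling analogy above.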

From their experiments, the authors uncover that “Diff Transformer requires only about 65% of model size or training tokens needed by Transformer to achieve comparable language modeling performance.”

2. Differential Transformer

Let’s get into more of the nuts and bolts of how Diff Transformer works, summarizing the second section of the authors’ paper.

The architecture consists of L stacked layers, each containing two modules: a differential attention module followed by a feed-forward network. The conventional softmax attention of Transformers is replaced by the differential attention mechanism mentioned earlier.

Differential attention splits the query (Q) and key (K) projections into two groups and computes two separate softmax attention maps over the value (V) vectors. Subtracting one map from the other yields the final attention scores, cancelling out noise so the model focuses on relevant content.

Diff Transformers employ a multi-head attention mechanism, like standard Transformers, but each head applies differential attention independently. (Instead of a single softmax attention map, each head computes two softmax maps then subtracts them.)

Once attention is applied, the results are normalized (using GroupNorm) and concatenated. GroupNorm helps ensure the outputs from each attention head have consistent statistical properties, improving training stability and performance. Once each head has been normalized, the concatenation combines the diverse attention results into a single tensor that represents the attended information from multiple perspectives.

The final output of each head is also scaled by a fixed multiplier so that gradient flow matches a standard Transformer. This matters during backpropagation, when gradients (used to update model weights) propagate back through the network; keeping their magnitude consistent is crucial for stable training and allows Diff Transformer to inherit the same hyperparameters and optimization behavior as a standard Transformer.
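Putting the pieces of this section together, here's a hedged NumPy sketch of the multi-head version: per-head differential attention, a per-head normalization (GroupNorm with one group per head reduces to an RMS-style norm, used here as a stand-in), and a fixed output multiplier. The projection shapes, the fixed `lam` and `lam_init` values, and the value-head layout are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_norm(x, eps=1e-5):
    # Stand-in for per-head GroupNorm: RMS-normalize each head output.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def multi_head_diff_attention(X, Wq, Wk, Wv, heads=2, lam=0.8, lam_init=0.8):
    """Sketch of multi-head differential attention.

    X: [n, d_model]. Wq/Wk project to 2*heads query/key groups
    (hence 2*d_model columns); Wv projects to d_model.
    """
    n, d_model = X.shape
    d_head = d_model // heads
    Q = X @ Wq  # [n, 2*d_model] -> two query groups per head
    K = X @ Wk
    V = X @ Wv  # [n, d_model]
    outs = []
    for h in range(heads):
        q1 = Q[:, 2 * h * d_head:(2 * h + 1) * d_head]
        q2 = Q[:, (2 * h + 1) * d_head:(2 * h + 2) * d_head]
        k1 = K[:, 2 * h * d_head:(2 * h + 1) * d_head]
        k2 = K[:, (2 * h + 1) * d_head:(2 * h + 2) * d_head]
        v = V[:, h * d_head:(h + 1) * d_head]
        # Each head subtracts its own pair of softmax maps.
        A = (softmax(q1 @ k1.T / np.sqrt(d_head))
             - lam * softmax(q2 @ k2.T / np.sqrt(d_head)))
        # Normalize per head, then scale by a fixed multiplier to keep
        # gradient magnitudes aligned with standard attention.
        outs.append(head_norm(A @ v) * (1 - lam_init))
    return np.concatenate(outs, axis=-1)  # [n, d_model]

rng = np.random.default_rng(0)
n, d_model, heads = 4, 8, 2
X = rng.standard_normal((n, d_model))
Wq = rng.standard_normal((d_model, 2 * d_model))
Wk = rng.standard_normal((d_model, 2 * d_model))
Wv = rng.standard_normal((d_model, d_model))
out = multi_head_diff_attention(X, Wq, Wk, Wv, heads=heads)
print(out.shape)  # (4, 8)
```

Note how the output shape matches a standard multi-head attention layer, which is what lets the surrounding Transformer stack stay unchanged.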

3. Experiments

The authors performed a variety of experiments to evaluate Diff Transformer for LLMs.

Here’s a summary:

“We evaluate Differential Transformer for large language models from the following perspectives. First, we compare the proposed architecture with Transformers in various downstream tasks (Section 3.1) and study the properties of scaling up model size and training tokens (Section 3.2). Second, we conduct a length extension to 64K and evaluate the long-sequence modeling capability (Section 3.3). Third, we present the results of key information retrieval, contextual hallucination evaluation, and in-context learning (Sections 3.4, 3.6 and 3.5). Forth, we show that Differential Transformer can reduce outliers in the model activations compared to Transformer (Section 3.7). Fifth, we conduct extensive ablation studies for various design choices (Section 3.8).”

Since we’re focusing on Diff Transformer insights for marketers, let’s focus on the experiment that might have the most impact for us in that field: key information retrieval.

As marketers, we often need to extract information from large datasets or mine customer interactions for data, such as trends, feedback, or preferences. The ability of Diff Transformer to filter out irrelevant context could improve the efficiency of these tasks, leading to more helpful insights for tasks like market research or competitive analysis.

The authors performed what they called a “Needle-In-A-Haystack test,” which is “widely used to evaluate the ability to extract critical information embedded in a large context.”

Here’s a description of the test, followed by some data on the findings:

“We follow the multi-needle evaluation protocol of LWM [22] and Gemini 1.5 [32]. The needles are inserted into varying depths within contexts of different lengths. Each needle consists of a concise sentence that assigns a unique magic number to a specific city. The goal is to retrieve the magic numbers corresponding to the query cities. We position the answer needle at five different depths within the context: 0%, 25%, 50%, 75%, and 100%, while placing other distracting needles randomly.”
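The protocol in that quote can be mimicked at toy scale. Here's a small Python sketch of building such a haystack: the filler sentences, city names, and magic numbers are invented for illustration, not the paper's actual evaluation data.

```python
import random

def build_haystack(filler_sentences, distractor_needles, answer_needle, depth):
    """Insert the answer needle at a relative depth (0.0-1.0) in the
    filler context, scattering distractor needles at random positions.
    """
    context = list(filler_sentences)
    # Place distracting needles randomly throughout the context.
    for needle in distractor_needles:
        context.insert(random.randrange(len(context) + 1), needle)
    # Place the answer needle at the requested relative depth.
    pos = round(depth * len(context))
    context.insert(pos, answer_needle)
    return " ".join(context)

filler = [f"Filler sentence number {i}." for i in range(20)]
distractors = [
    "The magic number for Paris is 11.",
    "The magic number for Lima is 42.",
]
answer = "The magic number for Oslo is 7."

# One haystack with the answer at 50% depth; the paper sweeps
# 0%, 25%, 50%, 75%, and 100%.
haystack = build_haystack(filler, distractors, answer, depth=0.5)
print(answer in haystack)  # True
```

The model is then asked for Oslo's magic number, and retrieval accuracy is measured across depths and context lengths.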

Figure 5, showing Diff Transformer vs. Transformer in the Needle-In-A-Haystack test.

As the chart above shows, Diff Transformer performed comparatively better. “We compare the normalized attention scores when key information is inserted at different positions,” the authors write, and “Compared with Transformer, Diff Transformer allocates higher attention scores to the answer span and has lower attention noise.”

4. Conclusion

Let’s review the authors’ conclusion in full, followed by some notes on why marketers should care about this development of the Diff Transformer.

“In this work, we introduce Differential Transformer (a.k.a. Diff Transformer), which amplifies attention to the relevant context while canceling noise. Experimental results on language modeling show that Diff Transformer outperforms Transformer in terms of scaling properties, long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. The results emphasize the importance of reducing attention noise. Moreover, the differential attention mechanism can be easily implemented with FlashAttention [8]. The findings position Diff Transformer as a distinctive and promising foundation architecture for large language models. In the future, we can develop efficient low-bit attention kernels due to the reduced magnitude of activation outliers. As the attention pattern becomes much sparser, we would also like to utilize the property to compress key-value caches.”

So why should marketers care about Diff Transformers?

I’ll start by saying I believe marketers should be paying attention to all potentially significant advancements in LLMs.

Part of the reason I broadened my career beyond SEO to focus on holistic marketing and brand-focused strategies is I saw the writing on the wall when it came to LLM usage and the potential for brands to influence that visibility.

If Diff Transformer can improve an LLM’s ability to retrieve key information, that could affect not only what we’re able to do with data as marketers but also how customers search for information about brands.

Speaking of LLMs, I asked ChatGPT GPT-4o what it thinks the implications of Diff Transformers are for marketers and SEOs. Here’s what it said:

“Diff Transformers offer better precision, reliability, personalization, and efficiency in AI-driven tasks, making them a valuable tool for marketers and SEOs to enhance content generation, data analysis, and customer engagement.”

More specifically, ChatGPT gave the following “impacts,” as it called them:

  • Faster and more effective data mining.
  • Fewer hallucinations in content generation.
  • More effective personalization through in-context learning.
  • Strategies based on wider scopes of data from longer-context modeling.
  • Cost-effective scalability of tools.

In summary, as generative AI progresses, so too will our strategies as marketers. Staying abreast of developments like Diff Transformer helps us remain on top of our game when it comes to the future of brand-focused strategies.

Thanks for checking out this rendition of Hamsterdam Research. 🐹

Until next time, enjoy the vibes!

Thanks for reading. Happy marketing! 🤗


Need a hand with a brand audit or marketing strategy?

I’m an independent brand strategist and marketing consultant. Learn about my services or contact me for more information!
