Could Google Soon Understand Text in Your Images? Exploring Hierarchical Text Spotter (HTS) (via Google Research) & Why SEOs Should Care (Maybe)
By Ethan Lazuk

Welcome to another rendition of Hamsterdam Research, where we look at AI research papers to learn their details and possible implications for the future of search and SEO.
This week, we’ll be looking at Hierarchical Text Spotter (HTS), a model from Google Research for text spotting and layout analysis.
You’ve probably heard that Google Search may not see text embedded in your website images, so you should use crawlable text (HTML) on the page.
John Mueller addressed this topic just 3 months ago (link below).
But what if Google could understand text within your images, including in real-life photos, like local business signs?
How might that technology impact the relevance of our website, Google Business Profile, or even social media content in search results? And how might it change our ideas about image optimization?
Well, Google Research is working on models that could have an impact in this area.
In this article, we’ll learn about HTS, as well as explore its hypothetical implications for SEO (at the end).
Quick story on selecting this topic, and why it matters to SEO (maybe)
Each week in Hamsterdam Research, I take a list of publications from Google Research and consult with my colleagues — ChatGPT and Gemini 😉 — to decide which is the most relevant to search marketing professionals.
This week, it came down to “Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis” and “Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding.”
We settled on the former because it’s more immediately actionable for SEO practices, while the latter is more about the long-term and potential evolution of search engines with AI. Maybe we’ll explore it later!
“Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis” was authored by Shangbang Long, Siyang Qin, Yasuhisa Fujii, Alessandro Bissacco, and Michalis Raptis.

It was submitted on October 25th, 2023, and says it’s scheduled to appear at the Winter Conference on Applications of Computer Vision 2024 (WACV). (This must be why it’s near the top of the list of the Google Research publications.)
As for why it’s relevant to SEOs (maybe), here were our team’s thoughts:
ChatGPT said: “This paper focuses on advancements in text spotting and layout analysis, which are critical for understanding how search engines can better process and interpret visual content on web pages. … Understanding these technologies could provide insights into how search engines might evolve in handling and indexing images and videos with embedded text …. This can help SEO professionals anticipate changes in search algorithms and adjust their strategies accordingly.” (My bolding) (*LLM disclaimers apply.)
Gemini said: “Search engines are evolving to better understand not just written content, but also the visual context of web pages. Text spotting and layout analysis within images and videos are crucial for search engines to make sense of this vast amount of information. … SEO professionals need to optimize images and videos with appropriate keywords, alt text, and surrounding text to help search engines interpret the content. Understanding text spotting technology can inform better visual SEO practices.” (My bolding) (*LLM disclaimers apply.)
And I said: “We’ve seen discussions around the ability of search engines to read text in images (see below), as well as evidence this may be getting better, such as in Google Business Profile images (see below again). Not only could this technology in Google Search allow for more creative visuals — and fewer limitations on having to add HTML text over images — but just imagine all of the historical and real-world visuals on the web that could now be better discovered based on their textual relevance. Users likely don’t care if text is part of an image or in separate HTML code, as long as it’s readable and useful. So if Google wants to reward people-first content, this would be in line with that goal.”
In terms of limitations, here’s an example of a discussion around text in images where Barry Schwartz reported on a comment from John Mueller, who implied “substantial” text should be crawlable:
Meanwhile, in terms of potential and what this text spotting and layout analysis technology could imply, here’s a tweet from Chris Long with commentary from Andy Simpson, showing the (potential) value of keywords in images for GBP optimization:
I advise my clients to use crawlable text with images.
However, in my personal blog posts and service pages, I often have text within images. I always rely on ALT text for accessibility and context, but my supposition is Google will soon read it well enough, if it can’t already.
Am I wrong?
“You’re not wrong, Walter. You’re just an … “
Ok, then.
Let’s start with the paper’s abstract 😉

“We propose Hierarchical Text Spotter (HTS), a novel method for the joint task of word-level text spotting and geometric layout analysis. HTS can recognize text in an image and identify its 4-level hierarchical structure: characters, words, lines, and paragraphs. The proposed HTS is characterized by two novel components: (1) a Unified-Detector-Polygon (UDP) that produces Bezier Curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines; (2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and further merges them back into words. HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks.”
If that language is challenging — it is for me — never fear!
Let’s review the basic concepts and vocabulary to get started.
*Note: I’ll be getting help from Gemini 1.5 Pro in Google AI Studio (Gemini Pro) as well as Gemini Advanced (Gemini), each prompted with the research paper for context. All quotations will be attributed and by default will refer to the Google Research paper unless otherwise specified. (*LLM disclaimers apply.)
“HTS” stands for “Hierarchical Text Spotter.” It’s the name of the model being proposed in the paper.
The abstract says the researchers are offering a “novel method” (original) for a “joint task.”
This task first involves “word-level text spotting,” which refers to “detecting and recognizing individual words within an image,” per Gemini Pro.
The reason the approach is described as “hierarchical” in the paper is because, as Gemini Pro explains, it involves “identifying characters first and then grouping them into words based on spacing and context.”
The second task involves “geometric layout analysis,” which is about “understanding the spatial arrangement” of text in images. This pertains to the “4-level hierarchical structure” described in the abstract, including “characters, words, lines, and paragraphs,” which Gemini Pro calls “text entities of different levels.”
HTS has “two novel components.” The first of these is a “Unified-Detector-Polygon (UDP),” which, as Gemini explains, locates lines of text within an image, including curved or angled lines using Bezier curves (the “Bezier Curve polygons” mentioned in the paper).
The second component is a “Line-to-Character-to-Word (L2C2W) recognizer,” which takes the detected line of text (from the UDP) “as an input and splits it into individual characters” to identify boundaries between them before merging the characters back to words, “likely using language models for context,” per Gemini.
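To make that 4-level hierarchy more concrete, here’s a minimal Python sketch (my own illustration, not code from the paper) of how characters, words, lines, and paragraphs might nest:

```python
from dataclasses import dataclass

# Hypothetical data structures illustrating the 4-level hierarchy
# described in the paper: characters -> words -> lines -> paragraphs.
# A "box" here is a simple axis-aligned (x0, y0, x1, y1) rectangle.

@dataclass
class Character:
    char: str
    box: tuple  # (x0, y0, x1, y1)

@dataclass
class Word:
    characters: list  # list of Character

    @property
    def text(self):
        return "".join(c.char for c in self.characters)

@dataclass
class Line:
    words: list  # list of Word

    @property
    def text(self):
        return " ".join(w.text for w in self.words)

@dataclass
class Paragraph:
    lines: list  # list of Line

def word_from_string(s, x=0, y=0, w=10, h=10):
    # Build a Word with one dummy box per character.
    chars = [Character(ch, (x + i * w, y, x + (i + 1) * w, y + h))
             for i, ch in enumerate(s)]
    return Word(chars)

# Example: the "Golden Sea" sign from Figure 1.
line = Line([word_from_string("Golden"), word_from_string("Sea", x=70)])
paragraph = Paragraph([line])
print(paragraph.lines[0].text)  # "Golden Sea"
```

The boxes and the builder function are made up for illustration; the point is just that every character knows its word, every word its line, and every line its paragraph.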
Based on existing benchmark datasets and tasks, “HTS achieves state-of-the-art results,” per the researchers.
If that sounds a bit confusing, this image (Figure 1) from the paper helps us visualize the process:

We can see how HTS is analyzing a real image of a restaurant sign. This figure also shows us the “4-level hierarchical structure.”
The “Golden Sea” text in the image is a curved line (Bezier curves), while the gray box shows how the word “Golden” is being identified as a separate word from “Sea” and both are a separate paragraph from “Restaurant Seafood & Bar.”
We also see how the “G” and “O” are separated as individual characters in the word. Lastly, we can see how “& Bar” is identified as a separate line from “Restaurant Seafood.”
Stay tuned, because the paper has some even cooler figures with examples.
Now for a deep-dive into the research
I’ll be using the PDF found via arXiv, if you wish to follow along.

The paper has five main sections, including an introduction, related works, methodology, experiments, and a conclusion.
We’ll summarize each section.
1. The introduction
In the introduction, the researchers explain how the “extraction and comprehension of text in images play a critical role in many computer vision applications.”
Computer vision is an interdisciplinary field of AI and computer science focused on enabling computers to understand visual data from the real world.
In terms of why this research is a “novel method,” the researchers explain it’s because of the “joint task” aspect:
“Previously, defining the geometric layout of extracted textual content occurred independent of text spotting and remained focused on document images. In this paper, we aim to further the argument that consolidating these separately treated tasks is complementary and mutually enhancing.”
As for the implications of their research, they explain:
“We postulate a joint approach for text spotting and geometric layout analysis could provide useful signals for downstream tasks such as semantic parsing and reasoning of text in images such as text-based VQA and document understanding.”
The downstream tasks mentioned include “semantic parsing and reasoning of text in images,” which Gemini explains “refers to the ability of a computer to extract meaning from text within images and understand the relationships between different elements.”
Gemini explains “text-based VQA” refers to “visual question answering,” such as answering what color the car in an image is, while “document understanding” means extracting “structured information and insights” from images in documents.
As for why SEOs might care about these tasks, Gemini explains we’ll need to consider the meaning of text in images (in terms of its relevance to users and search engines). Also, images could be potential sources of structured data, such as “product details” or “event information” for search engines. Images could also be seen as directly “answerable” to queries.
[Aside: This made me think about how Google’s chief health officer, Dr. Karen DeSalvo, recently announced in The Keyword on March 19th, 2024, that Google Search had “added images and diagrams from high quality sources on the web that make it easier to understand symptoms, like neck pain, for example.” (My bolding.) The name of that post was, “How we’re using AI to connect people to health information.”]
Another innovation of HTS is that it accounts for the context of words:
“Existing text spotting methods most commonly extract text at the word level, where ‘word’ is defined as a sequence of characters delimited by space without taking into account the text context.”
[Aside: Where else have we heard about “context” from Google? One sentence that comes to mind is “BERT models can therefore consider the full context of a word by looking at the words that come before and after it—particularly useful for understanding the intent behind search queries.” (My bolding.) This is from Pandu Nayak’s 2019 post on The Keyword called “Understanding searches better than ever before.”]
Recall how the two main components of HTS were UDP and L2C2W. The introduction shares more of the details on their functionality.
For UDP, the researchers explain:
“Notably, we find that the conventional way of training Bezier Curve polygon prediction head, i.e. applying L1 losses on control points directly, fails to capture text shapes accurately on highly diverse dataset such as HierText. Hence, we propose a novel Location and Shape Decoupling Module (LSDM) which decouples the representation learning of location and shape. UDP equipped with LSDM can accurately detect text lines of arbitrary shapes, sizes and locations across multiple datasets of different domains.”
LSDM, or “Location and Shape Decoupling Module,” addresses an issue of Bezier curves struggling with “highly diverse datasets,” Gemini explains, by focusing on understanding the position (location) and curvature (shape) of the text separately. As the researchers explain, this makes HTS more accurate “across multiple datasets of different domains.”
Domains refers to the subject matter of data. In the previous Hamsterdam Research article on Active Speaker Detection (ASD), we heard about “domain-specific data,” where one domain had YouTube videos as a dataset and another had Google Nest videos.
As for “L2C2W,” which is a “text line recognizer based on Transformer encoder-decoder that jointly predicts character bounding boxes and character classes,” the researchers explain how its value is that it “only needs a small fraction of training data to have bounding box annotations.”
As Gemini explains, the L2C2W recognizer takes the text line image as an input and identifies the characters in the line and delimits them into meaningful words. It uses a transformer — the “T” in ChatGPT and the focus of the famous 2017 attention paper — which is a powerful neural network architecture great for tasks involving sequences, like text.
The deep learning architecture of the transformer encoder-decoder works by having the encoder capture the relationships between the parts of the image (the input), while the decoder then uses bounding boxes to predict the location and shape of characters in a line and identifies each character (classes) in each box.
The reason only a “fraction of training data” is needed is because the transformer efficiently learns the relationships between characters to predict bounding boxes.
As for the “state-of-the-art text spotting results,” these are obtained from experiments on multiple datasets, including ICDAR 2015 (incidental scene text), Total-Text (diversified orientations), and HierText (Hierarchical Text with Open Images dataset samples).
HTS also surpasses Unified Detector (a method of unifying text detection and layout analysis) on the HierText “geometric layout analysis benchmark.”
We’re not as concerned with these specifics (though you can learn more with the links above), as the primary takeaway is that HTS works better than previous models or methods.
That said, one interesting point the researchers make is that “these results are obtained with a single model, without fine-tuning on target datasets.”
This essentially means the HTS model is more adaptable, flexible, and scalable for real-world deployments across diverse content, possibly even web content, as Gemini points out.
2. Related works
This section speaks to existing research for text spotting, text detection, text recognition, and layout analysis.
For the sake of clarity, I’ll use Gemini to help summarize each section’s key points.
Text spotting
There are two main approaches to text spotting, which involves the detection and recognition of text in images: two-stage text spotters and end-to-end text spotters.
HTS uses a two-stage text spotting approach due to the limitations of end-to-end approaches.
End-to-end approaches are computationally efficient because they reuse the features extracted during the detection stage for recognition. However, since the detection and recognition parts might learn at different rates (known as “asynchronous convergence”), this makes for a complex approach.
In two-stage approaches, meanwhile, the first stage identifies and locates text regions within an image, typically with shapes around the detected text, with a focus on whole words. Then in the second stage, specific sections are taken to decipher the actual text, like turning the image of the word “cat” into the letters “c,” “a,” and “t.”
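The two-stage flow can be sketched as a toy pipeline (my own illustration with stub functions, not the paper’s implementation): stage 1 detects text regions, stage 2 recognizes the text inside each crop.

```python
# A toy sketch of a two-stage text spotter. Both stages are hardcoded
# stubs standing in for trained neural networks.

def detect_text_regions(image):
    # Stage 1 (stub): return bounding boxes of likely text regions.
    # A real detector would predict these; here we hardcode one box.
    return [(10, 10, 40, 20)]

def recognize_text(crop):
    # Stage 2 (stub): decipher the cropped region into characters.
    # A real recognizer would decode pixels; here we return a fixed word.
    return "cat"

def spot_text(image):
    results = []
    for box in detect_text_regions(image):
        x0, y0, x1, y1 = box
        crop = [row[x0:x1] for row in image[y0:y1]]  # crop the region
        results.append((box, recognize_text(crop)))
    return results

# Dummy "image" as a nested list of pixel values.
image = [[0] * 100 for _ in range(100)]
print(spot_text(image))  # [((10, 10, 40, 20), 'cat')]
```

The cropping step is the key structural idea: detection hands recognition a small, focused piece of the image.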
Some of the vocabulary introduced about text spotting included:
- Bounding polygons – shapes used to define the location of detected text.
- Granularity – the level of detail (units of text detection, like words or letters).
- Cropped – taking a specific section of an image.
- Input image pixels – raw data from an image before processing.
- Encoded backbone features – extracted information from an image using a deep learning architecture.
- Decoded text transcription – converting identified text regions into readable text characters.
Other text spotting approaches include implicit feature feeding (a two-stage approach without bounding boxes) and single-stage approaches (newer methods that combine detection and recognition as a sequence-to-sequence problem, but that don’t account for layout analysis (text structure)).
Text detection
For text detection, the problem discussed is with curved text, as neither of the two main ways to do text detection (top-down (whole objects) or bottom-up (smaller steps)) is effective for detecting orientation (a limitation of mask-based methods).
The one exception is fine-tuned models designed for curved text, but according to the researchers, it isn’t clear if these would work on datasets with different text shapes.
An open question raised is whether polygon prediction methods (Bezier curve) would work on various shapes of text or require fine-tuning.
Text recognition
The paper explains two main methods of text recognition: sequence-to-sequence learning and character detection.
Sequence-to-sequence learning is a deep learning approach that converts the text line (an image sequence) into actual characters (another sequence). Character detection, meanwhile, is a more accurate method that identifies individual characters based on their location (bounding box), but it’s also more complex (and expensive) given higher training data requirements.
The researchers use the L2C2W recognizer, which falls under sequence-to-sequence learning.
Layout analysis
Geometric layout analysis is a field within text spotting that focuses on identifying coherent blocks of text (like paragraphs or columns) to analyze their relationships to each other. Beyond detecting text, this pertains to understanding how it’s organized.
This section got a little technical, but some of the concepts discussed included object detection (text blocks as objects in an image), semantic segmentation (classifying pixels as belonging to specific text blocks), and graph convolutional networks (GCN) (specialized neural networks to model text element relationships (OCR (optical character recognition) tokens) as a graph structure).
[Aside: The mention of “graph structure” is worth keeping in mind. A graph structure for text uses nodes and edges to explore spatial, semantic, and hierarchical relationships. Compared to linear sequences of text, graph structures allow computers to better understand context. As Gemini says, “It can ‘jump around’ the graph based on the relationships, improving its understanding of the text’s overall meaning.” This also aids in OCR and layout analysis (relationships between text blocks).]
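Here’s a tiny sketch of that graph idea (my own illustration; the tokens and relationship labels are made up, not from the paper): nodes are OCR tokens, edges encode relationships between them.

```python
# A minimal graph over OCR tokens: nodes are text tokens, edges encode
# spatial relationships. A GCN would learn over a structure like this;
# here we just build and query it.

nodes = ["Golden", "Sea", "Restaurant", "Seafood", "&", "Bar"]

# Edges as (token_a, token_b, relationship). The relationship labels
# are hypothetical examples for illustration.
edges = [
    ("Golden", "Sea", "same_line"),
    ("Restaurant", "Seafood", "same_line"),
    ("Seafood", "&", "same_paragraph"),
    ("&", "Bar", "same_line"),
]

def neighbors(token):
    # Return tokens directly connected to `token`, with the edge label.
    out = []
    for a, b, rel in edges:
        if a == token:
            out.append((b, rel))
        elif b == token:
            out.append((a, rel))
    return out

print(neighbors("Seafood"))  # [('Restaurant', 'same_line'), ('&', 'same_paragraph')]
```

Unlike a flat string, this structure lets a model "jump around" between related tokens, which is the advantage Gemini described.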
Now we’ll get into how the HTS model works in practice.
3. The Methodology
This section has three main subsections, including Hierarchical Text Spotter, Unified Detection of Text Line and Paragraph, and Line-to-Character-to-Word Recognition.
We’ll summarize each of these. But first, let’s revisit Figure 1 from earlier (zoomed in), as the researchers start by calling out how it demonstrates the two stages: the Unified Detection Stage and the Line Recognition Stage:

Focusing on the white boxes this time, we can see how the Unified-Detector-Polygon produces line bounding polygons and groups the lines into paragraphs, one for “Golden Sea” and another for “Restaurant Seafood & Bar.”
The detected lines then get cropped and rectified for the L2C2W Recognizer, which takes the three lines of text (“Golden Sea,” “Restaurant Seafood,” and “& Bar”) and creates individual character boxes.
Finally, the HTS model assigns the character bounding boxes back onto the original image.
Hierarchical Text Spotter
As Gemini summarizes this for us, it begins with the Unified Detection Stage (UDP) where the model detects text lines, including shapes using Bezier curves, and groups them into paragraphs.
Next comes the Line Recognition Stage, which employs a transformer-based model to recognize text within detected lines, predicting character classes (letters, numbers, etc.) and their bounding boxes within the lines, as well as identifying spaces to separate words in lines.
The processed lines are then combined to create a hierarchical structure of characters, words, lines, and paragraphs within the image.
Unified Detection of Text Line and Paragraph
This section mentions Figure 2, where we see the UDP (top), Bezier polygon prediction (middle), and Bezier curves (bottom):

UDP (top part of image) is a new model for text detection that gets around the limitations of using masks for text detection (which aren’t effective with curved text) by using a Bezier polygon prediction head (the middle part of the image).
As Gemini explains, “Bezier curves are a mathematical way to represent smooth and flexible shapes using control points.”
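To see what that means in practice, here’s the standard cubic Bezier formula in Python (my own illustration of the general math, not the paper’s code): four control points define a smooth curve, like the arched “Golden Sea” text line.

```python
# Evaluate a cubic Bezier curve in Bernstein form. Four control points
# define a smooth curve; t sweeps from 0 (start) to 1 (end).

def bezier_point(p0, p1, p2, p3, t):
    # Each p is an (x, y) control point; t runs from 0 to 1.
    u = 1 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)

# Control points for an arch-like curve.
p0, p1, p2, p3 = (0, 0), (1, 2), (3, 2), (4, 0)

# The curve starts at p0 and ends at p3, bulging toward the middle points.
print(bezier_point(p0, p1, p2, p3, 0.0))  # (0.0, 0.0)
print(bezier_point(p0, p1, p2, p3, 0.5))  # (2.0, 1.5)
print(bezier_point(p0, p1, p2, p3, 1.0))  # (4.0, 0.0)
```

Only the control points need to be predicted by a model; the smooth polygon around a curved text line falls out of the math.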
The researchers also introduce the Location and Shape Decoupling Module (LSDM), a novel component designed to get around a challenge in text detection: text can appear in various locations, orientations, and aspect ratios (widths and heights). LSDM uses a two-part process: a location head predicts an Axis-Aligned Bounding Box (AABB), while a shape head predicts the Bezier control points. In other words, it allows the model to better handle variations in text.
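The decoupling idea can be sketched as follows (my own simplified illustration of the concept, not the paper’s implementation): the shape is predicted in coordinates normalized to the box, then composed with the box to get image coordinates.

```python
# Sketch of location/shape decoupling: the "location head" predicts an
# axis-aligned bounding box (AABB), the "shape head" predicts control
# points normalized to that box, and the two are composed at the end.

def compose_control_points(aabb, normalized_points):
    # aabb: (x0, y0, x1, y1); normalized_points: (u, v) pairs in [0, 1].
    x0, y0, x1, y1 = aabb
    w, h = x1 - x0, y1 - y0
    return [(x0 + u * w, y0 + v * h) for u, v in normalized_points]

# Hypothetical predictions: a box in the image, and an arch-shaped
# curve expressed relative to the box.
aabb = (100, 50, 300, 100)
shape = [(0.0, 1.0), (0.25, 0.0), (0.75, 0.0), (1.0, 1.0)]

print(compose_control_points(aabb, shape))
# [(100.0, 100.0), (150.0, 50.0), (250.0, 50.0), (300.0, 100.0)]
```

The benefit of the split is that the same normalized shape works no matter where the text sits in the image or how big it is; only the box changes.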
Line-to-Character-to-Word Recognition
This section mentions Figure 3, which shows L2C2W.
The top section shows the entire process, from text line image preparation to character recognition (with the transformer encoder-decoder), and bounding box prediction.
The middle section shows a sample output with recognized text, character bounding boxes, and how the characters are grouped into words.
And the bottom section shows how the text lines are processed into separate words with bounding boxes for each word.

The L2C2W recognition method includes multiple steps, starting with Text Line Image Preparation, where input images are processed to extract individual lines of text. These are then cropped and straightened with a BezierAlign tool and converted to grayscale images for the recognition model.
Next comes Character Recognition with the Transformer Encoder-Decoder, which recognizes the characters within the text lines: a convolutional neural network (CNN) encodes the pixel information, then a decoder processes this to predict the sequence one character at a time.
In the Character Bounding Box Prediction, the model predicts character classes and bounding boxes for each character and then normalizes them based on the height of the text line image.
For Training the Model, a combination of real-world image data and synthetic text data is used to classify characters accurately in tightly enclosed bounding boxes within each text line.
Lastly comes Text Line Post-Processing using the generated character predictions and bounding boxes. Predicted spaces are used to break text lines into separate words with a bounding box calculated, and then the character and word bounding boxes are projected to their original location within the input image.
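This post-processing step can be sketched simply (again, my own simplified illustration, with made-up character boxes): predicted spaces split the line into words, and each word’s box spans its characters’ boxes.

```python
# Sketch of text line post-processing: split characters into words at
# predicted spaces, then give each word a bounding box that is the
# union of its character boxes.

def split_line(chars):
    # chars: list of (character, (x0, y0, x1, y1)) predictions for one line.
    words, current = [], []
    for ch, box in chars:
        if ch == " ":
            if current:
                words.append(current)
            current = []
        else:
            current.append((ch, box))
    if current:
        words.append(current)
    # Each word's box is the union of its character boxes.
    out = []
    for word in words:
        text = "".join(ch for ch, _ in word)
        boxes = [b for _, b in word]
        box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
               max(b[2] for b in boxes), max(b[3] for b in boxes))
        out.append((text, box))
    return out

# The "& Bar" line from Figure 1, with hypothetical character boxes.
line = [("&", (0, 0, 8, 10)), (" ", (8, 0, 12, 10)),
        ("B", (12, 0, 20, 10)), ("a", (20, 0, 28, 10)), ("r", (28, 0, 36, 10))]
print(split_line(line))
# [('&', (0, 0, 8, 10)), ('Bar', (12, 0, 36, 10))]
```

A final step (not shown) would project these boxes from the rectified line back to their original location in the input image.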
4. Experiments
The researchers test the model in a number of experiments.
We can get the gist of these findings in the conclusion next, but one point of interest is Figure 4, which shows the 4-level hierarchical structure with results for characters, words, lines, and paragraphs:

Even if this technology doesn’t get applied to Google Search or impact our lives as SEOs, we can probably use it to create some cool poster and t-shirt designs. 😉
5. Conclusion
The paper’s conclusion is quite brief, harking back to our learnings from the abstract:
“In this paper, we propose the first Hierarchical Text Spotter (HTS) for the joint task of text spotting and layout analysis. HTS achieves new state-of-the-art performance on multiple word-level text spotting benchmarks as well as a geometric layout analysis task.”
If nothing else, we now know what HTS is, how its “joint task of text spotting and layout analysis” works, and why it’s a novel approach capable of “state-of-the-art performance.”
And that’s pretty cool!
So why should SEOs care about HTS (maybe)?
In light of changes we’ve seen to some SEO paradigms over the last year, especially with how we think about “SEO content” in light of the third helpful content update and March 2024 core update, I think this illustrates, big picture, how our methods can evolve.
For example, ALT text will likely always be a necessity (perhaps). And if nothing else, it’s a reasonable step to ensure accessibility, to say nothing of search engines’ comprehension of images.
As for text embedded in images vs. HTML text, maybe these recommendations will soften in the future, or the type of technology we’ve seen with HTS could become a backup to crawlable text.
The examples we saw throughout the HTS paper also illustrated real-world scenarios for text detection, like images of signs. As Google’s technology evolves to get smarter about understanding images, that could impact the relevance of existing content on the web, particularly for local SEO or ecommerce sites, especially for visual customer UGC or even social content, as well as how we as SEOs think about images in content.
We also saw last time in Hamsterdam Research how Google researchers are looking into ways to identify speakers in videos. HTS seems like just another step in the direction of better multimodal content understanding by Google.
Of course, we’re talking about Google Research here, not Google Search.
But before we wrap up, let’s ask Gemini for its thoughts on this topic, based on our conversations about HTS so far.
*Note: this is a theoretical exercise now, not predictions or instructions. 😉
1. Image content understanding
As search engines like Google are constantly trying to improve their understanding of content, techniques like text detection in HTS could help search engines decipher the actual text in images, allowing for more accurate image search results. This would also benefit websites with informative images that complement their content.
2. Local business or product information
Techniques like HTS could play a role in extracting text from product images to understand their details or specifications, further improving discoverability for shopping results.
Key details in local business images, like those associated with Google Business Profiles or social media accounts, could likewise help improve their relevance and discoverability for local search.
3. UGC content and brand monitoring
Text detection could help businesses monitor the sentiments around their brands as indicated by UGC images appearing in search results or on social media platforms, providing insights for marketing strategies.
4. Larger implications
Beyond HTS, advancements in the areas of layout analysis and text recognition could lead to improved semantic understanding of images by search engines, improving their content discovery capabilities.
Text detection could also be used to create alternative text descriptions for better web content accessibility, such as for visually impaired users, perhaps compensating for when ALT text is absent.
Till next time …
I hope you’ve enjoyed this article from Hamsterdam Research!
Feel free to comment with thoughts or feedback or contact me.
Stay tuned for another article next week (or check out previous posts below).
Until next time, enjoy the vibes:
Thanks for reading. Happy optimizing! 🙂
Related posts
How Large Scale Self-Supervised Pretraining for Active Speaker Detection Works (via Google Research) & Why SEOs Should Care (Maybe)
In this rendition of Hamsterdam Research, we look at a Google Research paper on large scale self-supervised pretraining for active speaker detection (ASD) and its (possible) implications for SEO work and video-based search results.
How AutoBNN Automates the Discovery of Interpretable Time Series Forecasting Models & Why SEOs Should Care (Maybe)
This is the second Hamsterdam Research article, which covers AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks.
The Embedding Language Model (ELM) & Why SEOs Should Care (Maybe)
This Hamsterdam Research article looks at ELM from “Demystifying Embedding Spaces using Large Language Models,” a Google Research paper.