🇰🇭 Addressing the Gap: How the Khmer Semantic Search Engine (KSE) Works & What That Can Teach Us as SEOs 🙌

Last updated:

July 19, 2024

Saumosvakom mokkan sa bta thmei nei karosravochreav Hamsterdam! 🐹 🤖

👉 Translation: Welcome to a new week of Hamsterdam Research!

If you’re new here, this is where we typically explore recent AI research papers to learn about them and discuss their hypothetical implications for the future of search.

But this week, we’ll be switching it up and learning how a semantic search engine works using a real-world example! 🙌 🌏

We’ll review “Khmer Semantic Search Engine (KSE): Digital Information Access and Document Retrieval,” 👈 authored by Nimol Thuon, a member of IEEE.

This paper discusses semantic search in depth, specifically in the context of the Khmer language of Cambodia. 🇰🇭

It was submitted to arXiv on June 13th, 2024.

Its opening also notes:

“This work extends the Khmer semantic search engine project developed in 2016 by author N. Thuon. The project was supported by ARES-CCD (Belgium) and the Institute of Technology of Cambodia (ITC). Special thanks go to Dr. Chhun Sophea for supervision and to the ITC students for their data contributions.”
– Introductory note. (My bolding.)

This will be a bit of a longer read:

27–41 minutes

However, there’s a table of contents not far below, if you wish to jump to a spot of interest. 🎉

Where did I come across this paper, and how is it relevant to SEO? 👀

Well, I was searching on arXiv.org for research papers related to “Google Search” when I saw this one about KSE in the results:

It stood out to me because I wasn’t familiar with the Khmer language and wanted to learn more.

Once I started reading it, I realized it’s not only about an interesting topic, KSE, but it also has a wealth of information about semantic search that’s helpful to grasp for SEOs and web content creators today.

👉 The paper also describes specific challenges that major search engines, like Google, may face when serving content for certain languages — in this case, Khmer, spoken by around 14 million people globally.

Stepping back a moment for context, there’s been talk recently about translation’s role in Google Search. 🔍

In July 2024, Google expanded the languages it supports for translated search results. It’s also testing country labels for results, interestingly enough.

📝 On that point, Glenn Gabe recently wrote a post that’s handy for analyzing GSC data from translated search results.

Beyond Search, Google Translate also recently expanded its supported languages in June, adding 110 using zero-shot ML translation.

Truth be told, when I look in the GA4 mobile app at what page titles on my website people are viewing, I’m often surprised (proudly) to see that they’re reading translated versions.

📝 In another Hamsterdam Research post about epistemic and aleatoric uncertainty, I mentioned how I believe we’re entering a new platinum age of global scholarship, thanks to the universal translation capabilities of multimodal AI models. (P.S., Researchers at Microsoft and Beihang University just published a new paper about universal embeddings with multimodal LLMs — looks interesting!)

Still, despite all the progress researchers at Google, Microsoft, and elsewhere have made to help democratize information accessibility, there still appears to be ground left to cover. 🤔

That’s where Nimol is coming from in this paper about the Khmer Semantic Search Engine. 👈

✍️ Here’s the full abstract:

“The search engine process is crucial for document content retrieval. For Khmer documents, an effective tool is needed to extract essential keywords and facilitate accurate searches. Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents due to the lack of an effective semantic searching tool. Even Google does not deliver high accuracy for Khmer content. Semantic search engines improve search results by employing advanced algorithms to understand various content types. With the rise in Khmer digital content—such as reports, articles, and social media feedback—enhanced search capabilities are essential. This research proposes the first Khmer Semantic Search Engine (KSE), designed to enhance traditional Khmer search methods. Utilizing semantic matching techniques and formally annotated semantic content, our tool extracts meaningful keywords from user queries, performs precise matching, and provides the best matching offline documents and online URLs. We propose three semantic search frameworks: semantic search based on a keyword dictionary, semantic search based on ontology, and semantic search based on ranking. Additionally, we developed tools for data preparation, including document addition and manual keyword extraction. To evaluate performance, we created a ground truth dataset and addressed issues related to searching and semantic search. Our findings demonstrate that understanding search term semantics can lead to significantly more accurate results.“
– Abstract. (My highlights and bolding.)

We’ll review the main vocab now, then delve into the full paper. 🙌

📝 I’ll be using Gemini Advanced and ChatGPT to assist me. All words will be mine (same for bolding or highlights), and all quotes will be from the KSE paper, unless otherwise attributed.

Let’s start by discussing the Khmer language and semantic search engines.

Khmer is the official language of Cambodia. 🇰🇭

The University of Hawai’i Khmer language page explains more details:

“Khmer is the official language of Cambodia and one of the major languages of Southeast Asia. It is spoken by eleven million people in Cambodia and nearly three million more outside Cambodia. It belongs to a family of languages widely distributed in southern Asia and is the idiom of one of the earliest of the great nation-states in the region. As such it is the vehicle of sophisticated forms of architecture, the plastic arts, music, dance, literature and statecraft which are transmitted to Cambodia’s neighbors and which are still admired today. The writing system it employs is a member of the Indic script family, which spread from India into Inner Asia and all of Southeast Asia except Vietnam. Khmer is non-tonal but has vowel distinctions not found in English; its grammar is uncomplicated, with neither verb conjugations nor noun declensions, and is not unlike that of English.”
– The Khmer (Cambodian) Language Program at the University of Hawai‘i

You can also hear Khmer spoken by international students in this video from the Center of Khmer Studies:

The Khmer alphabet is more commonly referred to as a “script,” which originally descends from the Brahmi script of ancient India.

Here’s an example:


And here are a couple of sentences in Google Translate, so you can see the Khmer script written out (and phonetically):

Because we’re all about the arts (as well as the sciences 😁) here in Hamsterdam Research, here’s an example of Khmer rock music from the ’60s-’70s by Pan Ron 🎸:

There’s a contemporary Khmer song at the end of this article, as well. 🎵

Now we’ll review the semantic search-related terminology from the abstract. 🤖

A semantic search engine tries to understand the meaning and user intent of search queries, rather than just matching keywords in documents (lexical search), with the goal of producing more relevant results.

Semantic matching techniques are methods used to determine similarities in the meanings of words, phrases, or documents, again, going beyond keyword matching to consider their relationships in context.

Formally annotated semantic content is text data that has been carefully labeled with information about the meaning of words and phrases, helping search engines to better understand the content.

Semantic search based on a “keyword dictionary” involves using a dictionary of words and their meanings to understand search queries and match them to relevant documents.

Alternatively, semantic search based on “ontology” is an approach that uses a structured knowledge base (ontology) that defines the relationships between concepts, enabling search engines to grasp the deeper meaning of queries and documents.

Lastly, semantic search based on “ranking” involves ranking search results based on their relevance to a query, taking into account both keyword matching and semantic similarity.

Data preparation is the process of cleaning, organizing, and transforming raw data into a format that’s suitable for analysis and use in a search engine.

Document addition is the process of adding new documents to the search engine’s index to be included in search results. (I guess that one is fairly obvious, in hindsight. 😅)

Manual keyword extraction is a process of manually identifying and selecting the most important keywords from a document, which are then often used to create a “ground truth” dataset.

That ground truth dataset is then used as the standard for comparison and evaluation — like when evaluating the performance of automatic keyword extraction tools — because it’s considered to be accurate and reliable.

In the case of this paper, the ground truth dataset is a set of documents with manually extracted keywords that also includes the top five ranking documents for various queries, which is then used for evaluating the search engine’s ranking algorithms.
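📝 To make that a little more concrete, here's a minimal Python sketch of how a ground truth set like this could be used to score a ranker. It's my own illustration, not the paper's evaluation code: the metric (precision at 5), the function name `precision_at_k`, and the document IDs are all assumptions.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked documents that appear in the ground truth."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

# Hypothetical ground truth: the five documents judged best for one query.
ground_truth = {"doc1", "doc2", "doc3", "doc4", "doc5"}
# Hypothetical output from the search engine being evaluated.
engine_ranking = ["doc2", "doc9", "doc1", "doc5", "doc7", "doc3"]

score = precision_at_k(engine_ranking, ground_truth, k=5)
print(score)
```

Here, three of the engine's top five results are in the ground truth, so the score comes out to 3 out of 5.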

That covers the abstract and its key concepts. 🕺

Feel free to fuel up 🍟 before we hit the road and dig in deeper. 🚙


And we’re back at it …

Time now to delve into the full paper for some deep learning about KSE and semantic search. 🤿

If you wish to follow along, you can grab the PDF on arXiv (or a mobile-friendly HTML version).

The full paper has 7️⃣ sections: 1) introduction, 2) related works, 3) challenges in Khmer language processing, 4) Khmer Semantic Search Engine, 5) experimental setups, 6) results, and 7) conclusion.

We’ll summarize each of them below (within reason). 🤗

1. Introduction 👋

“The rapid growth of digital content has significantly increased the need for efficient information retrieval systems,” Nimol writes.

“Search engines play a critical role in this process by helping users find relevant information across various fields such as education, health, and entertainment. However, traditional search engines often rely on syntactic matching, which fails to capture the nuanced meanings of different terms. This limitation is particularly evident in the context of Khmer language content, where users frequently struggle to find the documents they need due to the lack of effective semantic search tools.”

📝 In reading that, I also thought about how Khmer represents a smaller language corpus for search engine systems to train on. We saw this issue described in 2008 in a Google blog post by Paul Haahr and Steve Baker (“Making search better in Catalonia, Estonia, and everywhere else”), where they explained how “it takes more than a year of searches in Catalan to provide a comparable amount of data as a single day of searching in English.” Perhaps innovations in zero-shot learning for ML systems, as we saw with Google Translate earlier, will improve that. 👇

I asked Gemini (prompted with Nimol’s paper) if my observations above were reasonable. It pointed out that “The reference text does mention that ‘limited research and development’ is a challenge in Khmer language processing, which could be interpreted as a lack of training data.” Gemini also mentioned that “Zero-shot learning … could be a promising avenue for improving semantic search in languages like Khmer.” We’ll hear more about these topics in the paper’s “challenges” section.

The motivation for KSE is, quite literally, the needs of Khmer speakers:

“In Cambodia, the generation of Khmer digital content such as reports, articles, and social media feedback is on the rise. Despite this, the absence of a robust Khmer semantic search engine means that users often cannot access the most relevant documents.”

Nimol explains that “Even major search engines like Google do not yet provide high accuracy for Khmer language searches, highlighting a significant gap in the available technology.”

One reason provided is that “Khmer documents, both historical and modern, exhibit complex writing forms and grammar, presenting additional challenges.”

“The objective of this research,” Nimol writes, “is to address this gap by developing the Khmer Semantic Search Engine (KSE),” which utilizes “advanced semantic matching techniques.”

“By formally annotating semantic content and extracting meaningful keywords from user queries, KSE can perform precise matching and deliver the best matching URLs.”

The paper “presents the research design, implementation, and evaluation of the Khmer Semantic Search Engine (KSE),” as shown in Figure 1:

👆 What you’re seeing above are two distinct methods of processing search queries (a global view).

We’ll learn more later from the paper’s contents, but I’ll quickly break them down here:

Offline processing: the model searches a pre-existing database of documents with normal ranking (the frequency of a document’s appearance in search results for a given keyword).
Online webpage processing: the model dynamically searches the constantly changing content on the web and uses weight scoring, a more nuanced ranking that considers the internal structure of the document (like whether keywords appear in the title, which carries a higher weight, or in the body, which carries a lower weight).
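📝 Here's a tiny Python sketch of what weight scoring by document structure might look like. The field weights (3.0 for the title, 1.0 for the body), the `weighted_score` function, and the sample pages are my own assumptions for illustration. The paper doesn't give us these values in this part.

```python
# Illustrative field weights (assumed values, not taken from the paper).
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def weighted_score(page, keyword):
    """Sum keyword occurrences per field, scaled by that field's weight."""
    keyword = keyword.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        score += weight * words.count(keyword)
    return score

# Hypothetical pages.
page_a = {"title": "Angkor temple guide", "body": "Plan your visit early."}
page_b = {"title": "Travel tips", "body": "Every temple has a dress code."}

score_a = weighted_score(page_a, "temple")  # keyword appears in the title
score_b = weighted_score(page_b, "temple")  # keyword appears in the body only
print(score_a, score_b)
```

Both pages mention the keyword once, but the page with the keyword in its title ends up with the higher score.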

I was kind of intrigued by the focus on keywords related to Figure 1 above, given that the theme of the paper’s introduction was semantic search.

So I asked Gemini for clarification:

“The paper emphasizes the importance of semantic search, which aims to understand the meaning and intent behind search queries rather than just matching keywords. However, it also dedicates significant attention to keyword extraction and matching techniques. This might seem contradictory, but it’s important to understand how keywords fit into the broader semantic search approach in this context.

Keyword Extraction as a Foundation: Keywords are still essential in the initial stages of the search process. They serve as the starting point for understanding the user’s query and identifying potentially relevant documents. The paper discusses various keyword extraction methods, including TF-IDF and manual extraction, to ensure that the most relevant terms are identified.

Semantic Matching Beyond Keywords: While keywords are the starting point, the KSE doesn’t stop there. It goes beyond simple keyword matching by incorporating semantic matching techniques. These techniques consider the relationships between words, their meanings in context, and the overall intent of the query. This allows the search engine to retrieve documents that might not contain the exact keywords but are still semantically relevant to the query.

Ontology-Based Semantic Search: The paper also emphasizes the use of ontologies, which are structured knowledge bases that define relationships between concepts. By incorporating ontologies, the KSE can understand the deeper meaning of queries and documents, leading to more accurate and relevant search results.

In summary, while the paper does discuss keyword extraction and matching, it’s important to note that these are just the initial steps in a more comprehensive semantic search process. The KSE ultimately aims to move beyond simple keyword matching and incorporate semantic understanding to deliver more relevant and accurate search results for Khmer language content.”
– Gemini Advanced

As Gemini’s response alluded to, the paper proposes “three semantic search frameworks”:

Based on a keyword dictionary
Based on ontology
Based on ranking

Those are reviewed in the KSE section.

It also describes “the development of tools for data preparation, including document addition and manual keyword extraction.”

Overall, Nimol’s research contributes to the field of semantic search in the following ways:

Development of a Khmer Semantic Search Engine
Advanced semantic matching techniques
Comprehensive data preparation tools
Evaluation with ground truth dataset

2. Related works 📚

“Semantic search has been a focal area of research, particularly for languages with a rich vocabulary and complex syntax,” Nimol writes.

As a result, “Various approaches have been explored to enhance the accuracy and relevance of search results by understanding the semantic context of queries and documents.”

Let’s review those. 🙌

2.1. Ontology-based approaches 🦉 🪜

“Ontologies play a crucial role in representing domain knowledge and facilitating semantic matching,” writes Nimol.

According to the definition by Gruber referenced in the paper, an ontology is “a specification of a representational vocabulary for a shared domain of discourse—definitions of classes, relations, functions, and other objects.”

It consists of three elements, Nimol explains:

Vocabularies
Explicit specifications
Constraints

Vocabularies describe the domain of an ontology, while objects “are generally represented by nouns, and relationships are usually specified by verbs in a sentence.”

👉 Figure 2 in the paper shows an example of a search engine ontology, with “vocabularies such as search, author, write, person, document, thesis, paper, topic, semantic web, and search”:

Additionally, Nimol writes, a “search engine is specified by properties such as query, results, relevance score, and search time.”

“The cardinality defines that a search engine processes multiple queries, each returning multiple results. This can be expressed in ontological terms as ‘a search engine performs Search a query’ and ‘a query returns multiple results.’”

Explaining it further, cardinality refers to the numerical relationships between entities in an ontology. Essentially, it tells us how many instances of one entity can be associated with a single instance of another entity.

In the context of the KSE paper, cardinality refers to the relationship between a search engine (like Google) and the queries it processes and results it returns, including how each of those queries can result in multiple webpages being returned.

It’s a one-to-many relationship.

Here’s an analogy Gemini gave me: picture a search engine as a library 📚, each visitor as a query 🧑‍💻, and the books they find on specific topics as the results 📘. Different visitors may all want different books, but they’re still all associated with the same library. 🤝
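📝 If you think better in code, the same one-to-many idea can be sketched with a couple of Python dataclasses. The class and field names here (`SearchEngine`, `Query`, `perform_search`) are my own hypothetical picks, loosely echoing the wording in the paper's ontology example. This is not how KSE itself is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    text: str
    results: list = field(default_factory=list)  # one query -> many results

@dataclass
class SearchEngine:
    name: str
    queries: list = field(default_factory=list)  # one engine -> many queries

    def perform_search(self, text, results):
        """Record a query and the results it returned."""
        query = Query(text, list(results))
        self.queries.append(query)
        return query

engine = SearchEngine("toy engine")
engine.perform_search("temples in Siem Reap", ["page-1", "page-2"])
engine.perform_search("museums in Phnom Penh", ["page-3"])
print(len(engine.queries), len(engine.queries[0].results))
```

One engine holds many queries, and each query holds many results, which is the cardinality the paper describes.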

2.2. Keyword extraction methods 🪨 ⛏️

“Keywords are essential for indexing and efficiently retrieving documents,” Nimol writes.

📝 One key takeaway I had from Jo Bergum’s talk on RAG basics is how IR methods like BM25 are still important in the age of dense embeddings, highlighting the advantages of a hybrid approach.

Nimol mentions several methods “for automatic keyword extraction” — the process of identifying the most significant words or phrases in a document that represent its context, essential to indexing and retrieval in search engines — including general approaches as well as one specifically tailored for the Khmer language:

Term Frequency-Inverse Document Frequency (TF-IDF): This evaluates a term’s statistical importance based on how frequently it appears within a single document compared to how often it appears across all other documents.
Keyphrase Extraction Algorithm (KEA): This is an automatic and widely used keyword extraction tool that combines statistical and supervised ML methods (requiring training data).
Incorporating linguistic knowledge: This refers to methods that consider the syntactic (sentence structure) and semantic (underlying meaning) features of a text to improve the accuracy of keyword extraction.
Multi-step processes involving a stop word corpus dictionary: This approach is specifically designed to address unique challenges of the Khmer language for search engines, like its complex script (which we saw earlier) and the lack of spaces between words.
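📝 Since TF-IDF comes up again later, here's a bare-bones Python sketch of the textbook version (term frequency times the log of inverse document frequency, with no smoothing). The toy documents are made up, and I'm assuming the text has already been split into words, which is a real challenge of its own for Khmer given the lack of spaces.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (each doc is a token list)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical, pre-tokenized documents.
docs = [
    ["angkor", "temple", "history"],
    ["temple", "visit", "hours"],
    ["market", "food", "temple"],
]
scores = tf_idf(docs)
print(scores[0])
```

Because "temple" shows up in every document, its score drops to zero, while a term like "angkor" that appears in only one document gets a positive score. That's the "statistical importance" idea in a nutshell.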

2.3. Semantic matching 🐰 🎩

Semantic matching techniques, between queries and documents, “aim to understand the meanings and relationships of words to improve the accuracy of search results.”

Nimol cites several examples, including latent semantic analysis (LSA), one of the techniques shown in Figure 3:

Let’s review them all now:

WordNet: This lexical database helps measure semantic similarity between English words by grouping them into synsets (synonym sets) and providing definitions and examples of usage.
Wu and Palmer Algorithm: This calculates semantic similarity based on the hierarchical relationships between WordNet synsets by considering their depth and least common subsumer (LCS).
- 📝 A subsumer is a concept or word that encompasses the meanings of other concepts or words, like a more general term that includes the meanings of more specific terms. The LCS is the most specific subsumer that two concepts share (their deepest “shared ancestor,” as Gemini put it).
Information-theoretic approaches: This method leverages a large text corpus and uses statistics and information content to quantify how related words are based on their usage.
- 📝 Information content (in the context of information theory and NLP) is a measure of how much information a word provides, where less frequent words have higher information content because they’re rarer and thus more surprising and informative, as opposed to common articles like “the” or “and.”
Word embeddings: Embeddings represent words as vectors in a continuous space, capturing semantic relationships based on word co-occurrence patterns (like how semantically related words tend to appear together in similar contexts) or other linguistic features for sophisticated matching techniques.
Latent Semantic Analysis (LSA): This technique is used in NLP to uncover hidden (latent) relationships between words and documents, which is achieved by reducing the dimensionality of a term-document matrix (a fancy way of saying analyzing lots of text data in a table representing word frequencies in documents to identify patterns in how those words are used together).
Ontology-based matching: This method uses ontologies (those structured knowledge bases we mentioned earlier) to define concepts and their relationships within a specific domain, enabling more precise query interpretation and document classification for more accurate and relevant search results.
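📝 To show how the Wu and Palmer idea plays out, here's a small Python sketch on a toy hierarchy. The taxonomy is hypothetical (it's not real WordNet data), and I'm using the commonly cited form of the formula: twice the depth of the LCS, divided by the sum of the two concepts' depths, counting the root as depth 1.

```python
# Toy taxonomy as child -> parent links (hypothetical, not WordNet data).
PARENT = {
    "place": None,
    "building": "place",
    "temple": "building",
    "museum": "building",
    "city": "place",
}

def path_to_root(concept):
    """List the concept and its ancestors, from the concept up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def depth(concept):
    return len(path_to_root(concept))  # the root has depth 1

def least_common_subsumer(a, b):
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest shared ancestor.
    for concept in path_to_root(b):
        if concept in ancestors_of_a:
            return concept
    raise ValueError("concepts do not share a root")

def wu_palmer(a, b):
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("temple", "museum"))
print(wu_palmer("temple", "city"))
```

"Temple" and "museum" share the fairly specific subsumer "building," so they score higher than "temple" and "city," which only meet at the root.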

3. Challenges in Khmer language processing 🤔

“The development of semantic search tools for the Khmer language faces several significant challenges. While substantial progress has been made in semantic search technologies for widely spoken languages like English, these advancements have not yet been fully realized for Khmer.”

“This disparity can be attributed to several factors,” Nimol explains:

Complexity of Khmer syntax: Syntax refers to the rules governing the structure of sentences and phrases in a language. Khmer has complex syntactic structures, including “a wide array of compound words, intricate grammatical rules, and context-dependent meanings.” These make it challenging to develop NLP tools that can “parse and understand” Khmer script accurately.
Limited research and development: Given the “relatively little research focused on Khmer language processing compared to more widely spoken languages” (as mentioned earlier, and likely due, in part, to fewer resources and tools), there have been “fewer advancements and slower progress in developing effective NLP tools for Khmer.” Current NLP tools “often suffer from low accuracy and fail to meet the needs of Khmer-speaking users.”
Diverse dialects and regional variations: “Khmer is spoken in various dialects across different regions of Cambodia,” adding “another layer of complexity to the development of a universal semantic search tool that can understand and process all variations accurately.”

📝 Here’s a language family map of mainland Southeast Asia that I found in a post on Medium by Venti Chiau called “The Misconception of Asia”:


Inadequate support for Khmer in major search engines: “Major search engines like Google have not yet achieved high accuracy for Khmer content,” writes Nimol. “The lack of support from these platforms means that Khmer-speaking users do not have access to efficient search tools, further emphasizing the need for specialized solutions.”

4. Khmer Semantic Search Engine 🇰🇭

“The Khmer Semantic Search Engine (KSE) is designed to address the unique challenges of processing and retrieving Khmer language content.”

KSE is free, open source, and can be accessed from any computer with an internet connection. Though still under development, it’s already being used by several organizations in Cambodia.

Let’s learn about how KSE works, on a more technical level. 🤓

As mentioned with Figure 1 above, the architecture of KSE includes components and processes that work together for both offline and online processing.

4.1. Offline processing 🗃️

Offline processing refers to KSE’s ability to match user queries with documents that were previously uploaded and indexed in a database (as opposed to live web search).

This is shown in Figure 4:

The steps in offline processing include:

Document upload: Users or administrators upload their documents into the system (KSE).
Keyword extraction and indexing: The system extracts significant keywords from the documents and indexes them in its database.
User’s query: When a user searches a query using KSE’s graphical user interface (or GUI), the system matches the query’s keywords with indexed documents’ keywords using ontology-based matching, along with other techniques.
Ranking and results: KSE ranks the results based on relevance scores, considering the semantic similarity between the query and the documents’ keywords. The ranked search results then get displayed to the user, with the most relevant documents appearing first.
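📝 The four steps above can be sketched in a few lines of Python. This is a simplified stand-in, not KSE's code: the `OfflineIndex` class, the English stop words, and the scoring rule (counting shared keywords) are my assumptions, and plain keyword overlap takes the place of the ontology-based matching the paper describes.

```python
from collections import defaultdict

STOP_WORDS = {"the", "in", "of", "a", "and"}  # English stand-ins (assumed)

def extract_keywords(text):
    return {w for w in text.lower().split() if w not in STOP_WORDS}

class OfflineIndex:
    def __init__(self):
        self.keyword_to_docs = defaultdict(set)  # inverted index

    def upload(self, doc_id, text):
        """Steps 1 and 2: accept a document, extract keywords, index them."""
        for keyword in extract_keywords(text):
            self.keyword_to_docs[keyword].add(doc_id)

    def search(self, query):
        """Steps 3 and 4: match query keywords, then rank by overlap."""
        scores = defaultdict(int)
        for keyword in extract_keywords(query):
            for doc_id in self.keyword_to_docs.get(keyword, ()):
                scores[doc_id] += 1
        return sorted(scores, key=lambda d: (-scores[d], d))

index = OfflineIndex()
index.upload("report-1", "History of the temples in Siem Reap")
index.upload("report-2", "Street food in Phnom Penh")
results = index.search("temples in Siem Reap")
print(results)
```

Only the first report shares keywords with the query, so it's the only document returned.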

4.2. Online webpage processing 🕸️

Online webpage processing refers to KSE’s ability to search the live web in real time, retrieving up-to-date information for users’ queries.

It’s shown in Figure 5:

The steps for live web search include:

Web crawling: KSE uses an unspecified web crawler to continuously crawl the web to gather new content from various sources (like news sites, blogs, and social media).
Real-time indexing: Upon discovering new content, KSE extracts keywords from the webpages and indexes them in real-time, making them immediately searchable.
Query matching: When a user searches a query, KSE matches its keywords with both the indexed offline documents (from the database we referenced earlier) and newly indexed online content.
Semantic matching: KSE uses ontology-based matching, semantic similarity calculations, and other advanced semantic matching techniques to ensure accurate matches between queries and content.
Ranking: Considering both offline documents and online webpages, KSE then ranks the results based on their relevance to the query.
Display: The combined ranked results are then displayed to the user in order of most to least relevant.
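📝 The last two steps, ranking offline and online content together and then displaying it, could look something like this in Python. The relevance scores, the IDs, and the rule for duplicates (keep the higher score) are all hypothetical. The paper doesn't spell out how the two result sets get merged.

```python
def merge_ranked(offline_hits, online_hits):
    """Combine two {id: relevance} dicts into one list, best score first."""
    combined = dict(offline_hits)
    for item_id, score in online_hits.items():
        # Assumed rule: if an item shows up in both sources, keep its higher score.
        combined[item_id] = max(score, combined.get(item_id, score))
    return sorted(combined, key=lambda item_id: (-combined[item_id], item_id))

# Hypothetical relevance scores from each processing path.
offline_hits = {"report-1": 0.82, "report-2": 0.40}
online_hits = {"https://example.com/angkor": 0.91, "https://example.com/food": 0.35}

merged = merge_ranked(offline_hits, online_hits)
print(merged)
```

The user would see one list, ordered from most to least relevant, regardless of whether a result came from the database or the live web.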

4.3. Ontology-based matching example 🗺️

Figure 6 gives us an example of ontology-based matching for queries in a tourism knowledge graph for Cambodia:

What you’re seeing above is a hierarchical structure (ontology) of tourism-related concepts in Cambodia, going from broad to specific.

When a user enters a query, KSE analyzes it to identify relevant concepts, like how the query “temples in Siem Reap” would be associated with “Temple” and “Siem Reap” concepts in the ontology.

Based on the relationships defined in the ontology, KSE matches query concepts with relevant documents.
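📝 Here's a rough Python sketch of that flow: spot the concepts in a query, walk down the hierarchy, and return documents tagged with a matching concept. The mini ontology, the document tags, and the naive substring-based concept spotting are all my assumptions. A real system would need proper Khmer word segmentation and a much richer ontology.

```python
# Toy slice of a tourism ontology: concept -> narrower concepts (hypothetical).
NARROWER = {
    "Attraction": ["Temple", "Museum"],
    "Temple": ["Angkor Wat", "Bayon"],
    "Museum": [],
    "Angkor Wat": [],
    "Bayon": [],
    "Siem Reap": [],
}
# Hypothetical documents, each tagged with the concepts it covers.
DOC_CONCEPTS = {
    "doc-a": {"Angkor Wat", "Siem Reap"},
    "doc-b": {"Museum"},
}

def concepts_in_query(query):
    """Naive spotting: a concept matches if its name is a substring of the query."""
    q = query.lower()
    return {c for c in NARROWER if c.lower() in q}

def with_descendants(concept):
    """The concept plus everything below it in the hierarchy."""
    found = {concept}
    for child in NARROWER.get(concept, []):
        found |= with_descendants(child)
    return found

def match_documents(query):
    concepts = concepts_in_query(query)
    if not concepts:
        return []
    # A document must cover every query concept, directly or via a narrower one.
    return [
        doc_id for doc_id, tags in DOC_CONCEPTS.items()
        if all(tags & with_descendants(c) for c in concepts)
    ]

matched = match_documents("temples in Siem Reap")
print(matched)
```

Notice that "doc-a" never uses the word "temple." It matches because Angkor Wat sits under the "Temple" concept in the hierarchy, which is the whole point of ontology-based matching.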

Nimol’s paper goes deeper into the details of the semantic search matching process.

Let’s dig in a bit more!

4.4. Semantic search matching process (details on its three main components) 🗣️

“The semantic search matching process in our Khmer Semantic Search Engine (KSE) involves three main components: semantic search based on keywords, semantic search based on tourism ontology for Cambodia, and semantic search based on ranking. Each component is designed to enhance the accuracy and relevance of the search results through detailed and technically robust processes.”

1️⃣ Semantic search based on keywords involves the following process:

Tokenization: “The query is split into individual words or tokens. For example, the query ‘best cultural sites in Phnom Penh’ is tokenized into [‘best’, ‘cultural’, ‘sites’, ‘in’, ‘Phnom’, ‘Penh’].”
Stop-word removal: “Common Khmer stop words are removed to focus on significant terms,” which involves using “a predefined list of stop words specific to the Khmer language,” such as “(in), (of), (has), and (is).”
Stemming and lemmatization: This reduces words “to their base or root form,” like making “sites” into “site,” which “ensures that variations of a word are considered equivalent.”
TF-IDF calculation: This weighs “the importance of each term within the query relative to its frequency in the document corpus,” as discussed earlier.
Keyword matching: With the keywords now “extracted and weighted,” they “are then used to search the indexed documents.” This pre-processing is applied to “[e]ach document in the database … in a similar manner, with keywords extracted and stored in an index for fast retrieval.” The keywords from the user’s query are then compared “with those in the document index to find potential matches.”
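📝 Using the paper's own example query, the first three steps can be sketched in Python like so. The stop-word list uses English stand-ins for the Khmer words, and `naive_stem` is a crude plural stripper I made up in place of real stemming or lemmatization. Also note that splitting on whitespace only works for this English example. Khmer would need a proper word segmenter.

```python
STOP_WORDS = {"in", "of", "has", "is"}  # English stand-ins for the Khmer list

def tokenize(query):
    return query.split()

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def naive_stem(token):
    """Crude plural stripping, standing in for real stemming or lemmatization."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(query):
    return [naive_stem(t) for t in remove_stop_words(tokenize(query))]

tokens = tokenize("best cultural sites in Phnom Penh")
keywords = preprocess("best cultural sites in Phnom Penh")
print(tokens)
print(keywords)
```

The stop word "in" drops out and "sites" becomes "site," leaving the significant terms ready for TF-IDF weighting and matching against the index.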

2️⃣ Nimol also explains semantic search based on a tourism ontology for Cambodia.

“The second component leverages a domain-specific ontology to enhance the semantic understanding of user queries. The tourism ontology for Cambodia includes detailed information about various tourist destinations, cultural landmarks, historical sites, and other points of interest in Cambodia. The ontology provides a structured representation of domain knowledge, including the relationships between different entities.”

The ontology used was “developed using the Web Ontology Language (OWL)” — a family of knowledge representation languages for authoring ontologies — “and consists of multiple classes and properties that capture the complexities of Cambodian tourism,” including:

Entities: These are familiar to SEOs as representations of things with a distinct and independent existence, such as in a knowledge graph. In the context of KSE, entities “include tourist attractions (e.g., Angkor Wat, Royal Palace), types of attractions (e.g., temples, museums), and locations (e.g., Phnom Penh, Siem Reap),” where each one “is described with properties such as historical significance, visitor reviews, operational hours, and geographical coordinates.”
Relationships: These “define how entities interact,” like how “‘Angkor Wat’ (entity) is a ‘temple’ (type) located in ‘Siem Reap’ (location).” Hierarchical relationships are also included, “such as ‘Siem Reap’ is a part of ‘Cambodia’.”
Ontology-based query expansion: This process occurs when “a user enters a query, the system uses the ontology to expand the query semantically.” As an example, a query like “temples in Phnom Penh” might get expanded “to include synonyms and related terms like ‘wats’ and specific temple names in Phnom Penh.” Leveraging “synonyms and related entities defined in the ontology” helps “ensure comprehensive query coverage.”
Ontology details: “The ontology comprises over 500 entities [and] includes approximately 1000 relationships that map these entities in various contexts, providing a rich semantic network for query expansion.” The query “temples in Phnom Penh,” for example, “might be expanded to include ‘Wat Phnom’ and ‘Wat Botum’, both specific temple names in Phnom Penh.”
Semantic similarity calculation: “The system calculates the semantic similarity between the expanded query and the documents using ontology-based metrics,” using “techniques such as cosine similarity on the vector representations of entities and their relationships, ensuring that documents related to the context of the query are prioritized.”
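📝 Here's a small Python sketch tying query expansion and cosine similarity together, using the paper's "temples in Phnom Penh" example. The `RELATED` lookup is a hypothetical slice of an ontology, and the binary bag-of-terms vectors are my simplification. The paper describes vector representations of entities and relationships without detailing them in this section.

```python
import math

# Hypothetical ontology slice: term -> synonyms and related entities.
RELATED = {
    "temples": ["wats", "Wat Phnom", "Wat Botum"],
}

def expand_query(terms):
    expanded = list(terms)
    for term in terms:
        for extra in RELATED.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def to_vector(terms, vocabulary):
    """Binary bag-of-terms vector over a fixed vocabulary."""
    return [1.0 if term in terms else 0.0 for term in vocabulary]

vocabulary = ["temples", "wats", "Wat Phnom", "Wat Botum", "Phnom Penh", "market"]
original = ["temples", "Phnom Penh"]
expanded = expand_query(original)

# A hypothetical document that names a specific temple but never says "temples."
doc_vec = to_vector(["Wat Phnom", "Phnom Penh"], vocabulary)
before = cosine_similarity(to_vector(original, vocabulary), doc_vec)
after = cosine_similarity(to_vector(expanded, vocabulary), doc_vec)
print(before, after)
```

In this toy setup, the expanded query sits closer to the document than the original one does, because the expansion added "Wat Phnom" to the query side.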

3️⃣ Finally, Nimol explains semantic search based on ranking, a process that “combines both keyword-based and ontology-based relevance and involves several steps,” including:

Relevance scoring: “Documents are scored based on both keyword relevance and semantic relevance,” where keyword relevance is determined by TF-IDF scores and “[s]emantic relevance is assessed by comparing the ontology-based context of the query with the document content,” which “involves measuring the semantic similarity between the entities and relationships identified in the query and those in the documents.”
Weighted scoring algorithm: Each document’s overall score is weighted with a “combination of keyword relevance and semantic relevance,” which “ensures that both the direct relevance of keywords and the deeper semantic connections are considered, providing a balanced and accurate ranking of documents.”
Popularity metrics: In addition to relevance scoring, “documents are also ranked based on popularity metrics such as click-through rates, user ratings, and the number of views,” metrics that “provide insight into how users interact with the documents, indicating their perceived value and relevance.” To incorporate these metrics “into the final ranking score, we normalize the data to ensure comparability,” so, for example, CTRs “might be scaled to a range between 0 and 1, user ratings could be averaged, and view counts normalized based on the highest and lowest values in the dataset.” The inclusion of these popularity metrics helps “ensure that frequently accessed and highly rated documents receive an appropriate boost in their ranking, reflecting their utility and relevance to users.”
Ranking adjustments: “The system applies final adjustments to the ranking based on additional factors such as the recency of the document, user-specific preferences, and contextual relevance,” where “more recent documents or those with higher user engagement might receive a slight boost in their ranking,” which “ensures that the most relevant and timely information is presented to the user.”
Presentation of results: “The top-ranked documents are then presented to the user in a list, with the highest-scoring documents appearing first,” as well as “a brief snippet highlighting the relevant keywords and context” for each result “to help the user quickly identify the most pertinent information.”
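Here's a rough Python sketch of how the relevance and popularity pieces above could be combined into one score. The weights (0.4, 0.4, 0.2), the 5-star rating scale, the field names, and the sample documents are all my assumptions, not values from the paper, and the keyword and semantic scores are assumed to be precomputed on a 0 to 1 scale.

```python
def min_max(values):
    """Scale values to the 0-1 range using the lowest and highest value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rank(docs, w_keyword=0.4, w_semantic=0.4, w_popularity=0.2):
    """Return (title, score) pairs sorted from highest to lowest score."""
    norm_views = min_max([d["views"] for d in docs])
    scored = []
    for d, views in zip(docs, norm_views):
        # Popularity: CTR is already 0-1, ratings are assumed to be out of 5.
        popularity = (d["ctr"] + d["rating"] / 5 + views) / 3
        score = (w_keyword * d["keyword"]
                 + w_semantic * d["semantic"]
                 + w_popularity * popularity)
        scored.append((d["title"], round(score, 3)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up documents for illustration only.
docs = [
    {"title": "Wat Phnom guide", "keyword": 0.9, "semantic": 0.8,
     "ctr": 0.30, "rating": 4.5, "views": 1200},
    {"title": "Phnom Penh hotels", "keyword": 0.4, "semantic": 0.2,
     "ctr": 0.10, "rating": 3.0, "views": 5000},
    {"title": "Wat Botum history", "keyword": 0.3, "semantic": 0.9,
     "ctr": 0.20, "rating": 4.0, "views": 200},
]

for title, score in rank(docs):
    print(title, score)
```

With these made-up numbers, the most-viewed page still lands last because its keyword and semantic relevance are low, which fits the idea of popularity giving "an appropriate boost" rather than deciding the ranking by itself. The recency and user-preference adjustments are left out of the sketch.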

🎁 To tie a ribbon on the entire process we just reviewed, here’s Nimol’s summary:

“By integrating these three components, the KSE provides a comprehensive and accurate search experience tailored to the specific needs of Khmer language users and the domain of Cambodian tourism. This multi-layered approach ensures that the search engine can handle complex queries and deliver relevant, high-quality results effectively.”

5. Experiment setups (but really a look at Khmer Semantic Search Tool) ✌️

This section of the paper contains a lot of great information. However, since we’ve gone over a lot of details already, I’ll let you explore that on your own. 🤗

Instead, let’s get right to it and check out Figure 7, which shows the Khmer Semantic Search Engine in action:

Pretty cool, right?!

6. Results 📊

“In this section, we present the testing scenarios using the ground truth for evaluation and the results obtained from these tests. For international understanding, we will provide examples in English translated from Khmer.”

The paper includes three comparisons, summarized below:

Comparison of keyword extraction methods: Nimol compared KSE-generated keywords with keywords manually extracted from the titles and bodies of the same set of articles. Precision, recall, and F1 score (the harmonic mean of precision and recall) were used to assess the accuracy and relevance of the KSE-generated keywords against the manual ground truth. The finding was that KSE’s effectiveness in extracting relevant keywords had a positive impact on search accuracy.
Comparison of ontology-based semantic search: Nimol found that leveraging structured domain knowledge significantly improved search accuracy by retrieving contextually relevant documents, even when they lacked the exact search terms.
Comparison of search results ranking: The evaluation used 100 randomly selected keywords and two ranking methods: Normal Ranking (frequency of document occurrences in search results) and Weight Scoring (different weights assigned to the title and body of a document). The result was an average F1 score of 0.75, demonstrating KSE’s potential as a significant advancement over traditional search engines for Khmer-language searches.
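If precision, recall, and F1 are new to you, here's a small Python sketch of how they can be computed for one article's keywords. The function name and the keyword sets are made up for illustration and aren't from the paper's dataset.

```python
def precision_recall_f1(predicted, ground_truth):
    """Compare predicted keywords against a manually created ground truth."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical keywords for a single article.
extracted = ["angkor wat", "temple", "siem reap", "ticket"]
manual = ["angkor wat", "temple", "siem reap", "sunrise", "history"]

precision, recall, f1 = precision_recall_f1(extracted, manual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision asks how many of the extracted keywords were correct, recall asks how many of the manual keywords were found, and F1 balances the two.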

7. Conclusion 🏁

With all of that ground truth knowledge we’ve just derived about semantic search engines and KSE, in particular, let’s check out excerpts from the conclusion:

“Based on the results we have compared, we conclude that our system, the Khmer Semantic Search Engine (KSE), achieves high accuracy in search performance. … The experimental results show that our search engine is more effective than traditional search engines because we integrated keyword extraction and string matching algorithms for better understanding. …

We expect that KSE can be applied to websites and Khmer documents used in campuses, organizations, ministries, etc. Due to its effectiveness and better results, we also anticipate that this research serves as the first version for future studies related to keyword extraction and search engine learning. This work paves the way for future advancements in search technologies for non-Latin languages, addressing a significant gap in the current search engine capabilities.

Overall, KSE’s development marks a pivotal advancement in enhancing information retrieval for the Khmer language. By overcoming language-specific challenges and incorporating advanced semantic matching techniques, our system provides a robust foundation for further innovation. Future work will focus on integrating more sophisticated machine learning models and expanding the ontology to cover broader domains, ultimately aiming to provide even more accurate and contextually relevant search results for Khmer-speaking users.”

We can now understand what “language-specific challenges” and “advanced semantic matching techniques” refer to, as well as anticipate that the future direction of semantic search engines, like KSE, may involve “more sophisticated machine learning models” and/or “expanding the ontology to cover broader domains,” and that’s pretty 🆒!

But why should SEOs care about KSE and semantic search? 🧐

I don’t think the answer is to learn how search engines work to try and game them. 😇

Here’s why:

If the goal of a search engine is to return the most helpful results to a user, and those rankings are based on sophisticated algorithms that not only use “advanced semantic matching techniques” but also “popularity metrics such as click-through rates, user ratings, and the number of views,” among other factors, and they deal with large-scale aggregated data, then we’d have to assume they know what users want, directionally.

Of course, search engine systems aren’t perfect, but to misalign with what they intend to rank would probably be to also misalign with users’ interests and goals. In short, it’d lead to short-term visibility that could result in paying a long-term price.

On the other hand:

By knowing better how search engines understand and serve web content, we can more naturally factor those considerations into how we create user-first website or social content.

If we think back to the ontology-based matching example for tourism queries, for instance, the term “temples in Phnom Penh” could be expanded to include “Wat Phnom” and “Wat Botum,” as both are temple names in Phnom Penh.

Photo of Wat Phnom. Source.

Understanding how contextual relationships like that are made could then impact how we perceive “relevance” for users.

I enjoy creating longer articles — I am a TOOL fan, after all (12 min song below) 😄:

Still, there was nothing in our review of KSE or semantic search engines, in general, to suggest that long content is a requirement or the only way to create contextual relevance.

As I saw it, meaning matters, and so do users, of all languages. 🤗 🌎 🌍 🌏

Phew! We made it. 😅

However, since we’ve had a long conversation with Gemini Advanced about KSE, let’s ask what it thinks the paper’s significance is from an SEO perspective (its answer will probably depend on what’s retained in its context window).

*Note: this is me summarizing LLM-based content now as a theoretical exercise. 😊

1. Focus on meaning, not just keywords

Given how search engines are becoming increasingly sophisticated at understanding the meaning and intent behind queries, optimize content for relevance and context, not just keywords. This means creating with an eye toward quality and information that addresses the needs, interests, and questions of your target audience:

Sometimes, creating content is as simple as answering questions. Create content that provides your customers with answers. pic.twitter.com/zM9vfUAsdb
— SearchLab (@SearchLabAgency) July 19, 2024

Word up.

2. Structured data is key

Because semantic search engines can rely on ontologies (structured data) to understand the relationships between concepts, using schema markup (Schema.org) can help search engines better understand your content and improve its visibility in search results.

LLMs can also be handy for schema generation; just be sure to verify the output.
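As a small illustration, here's a Python sketch that builds JSON-LD for a tourist attraction using Schema.org's `TouristAttraction`, `City`, and `Country` types and the `containedInPlace` property. The helper function name is my own, and the property choices are just one reasonable option, so treat it as a starting point and validate whatever you publish.

```python
import json

def tourist_attraction_jsonld(name, city, country):
    """Build a minimal JSON-LD object describing an attraction and its location."""
    return {
        "@context": "https://schema.org",
        "@type": "TouristAttraction",
        "name": name,
        "containedInPlace": {
            "@type": "City",
            "name": city,
            "containedInPlace": {"@type": "Country", "name": country},
        },
    }

markup = tourist_attraction_jsonld("Angkor Wat", "Siem Reap", "Cambodia")
print(json.dumps(markup, indent=2))
```

Notice how the nesting mirrors the ontology relationships from the paper: “Angkor Wat” is located in “Siem Reap,” which is a part of “Cambodia.”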

3. User experience matters

Search engines are increasingly prioritizing user experience factors, so optimizing your website for these could help improve your search rankings and drive more organic traffic.

As we saw, “popularity” counts.

4. Stay informed about AI advancements

The field of AI, particularly NLP, is rapidly evolving. Staying informed about the latest advancements and how they impact search can help you stay ahead of the curve and adapt your SEO strategies accordingly.

Hey, I think I know a place for that! 🐹 🤖

Outro

Khnhom sangkhumtha anak ban rikreay nung atthabt sravochreav Hamsterdam now sa bta nih!

👉 Translation: I hope you’ve enjoyed this week’s Hamsterdam Research article!

I’ll likely come back to this for a second edit to improve the writing, but I’ll make sure all the core information remains the same.

As always, feel free to comment or contact me with your feedback. 🤗

Stay tuned for another research article soon (in a week or so), or check out related posts below.

Until next time, enjoy the vibes:

Thanks for reading. Happy optimizing! 🙂

SEO Strategist and New Khmer Music Fan

Ethan Lazuk

Stumbling Upon Google Engineer Ni Lao’s Work, & Exploring What It Can Teach Us about ML, IR & NLP for SEO Insights (a Hamsterdam Research Post)

In this Hamsterdam Research post, we look at the work of Ni Lao, a Google engineer, for learnings about machine learning, information retrieval, and NLP.

June 20, 2024 | November 13, 2024

Doing the Global Minimum: Thinking About SEO More in the Context of Neural Network Architectures (A Hamsterdam Research Opinion Piece)

In this opinion piece, we broaden the SEO lexicon to include more deep neural network terminology, expanding our worldview with the global minimum.

May 15, 2024 | October 5, 2024

Could Anthropic’s Identification of Millions of Features (Concepts) Activated Inside Its LLM (Claude 3 Sonnet) Influence Our Semantic SEO Strategies? (Maybe)

Anthropic identified millions of features (concepts) activated in Claude 3 Sonnet. Hamsterdam Research reviews the implications for semantic SEO strategies.

May 24, 2024 | October 5, 2024
