Sparse Embeddings Demystified: From One-Hot Encoding to BM25
- Mathis Embit
- Tech
- October 16, 2024
This blog article aims to clarify some Information Retrieval (IR) methods related to sparse embeddings (or sparse vectors).
We will study the four best-known methods, examining them in increasing order of complexity.
Before diving into the methods, we must specify the setting. We suppose we have a corpus $\mathcal{C}$ of documents, and that each document is made up of words. This corpus can be used to precompute some values. Then, we are given a query, which is also made up of words, and the goal is to find the document from the corpus that is most relevant to it. In practice, we assign a score to each document of the corpus, and the most relevant document is the one with the highest score. Because the query is unknown beforehand, its embedding must be computed dynamically at “query-time” and cannot be precomputed.
To perform this, we first construct a vocabulary $V$, which is a set of words. This is a problem in itself, as there are multiple ways of constructing it. For now, we will consider that the vocabulary consists of the set of all words appearing in the corpus.
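To make the setup concrete, here is a minimal Python sketch of this vocabulary construction. The toy corpus and the naive lowercase-and-split tokenizer are illustrative assumptions, not choices made by the methods themselves; the later sketches reuse the same assumptions.

```python
# Minimal sketch: building the vocabulary V from a toy corpus.
# The tokenizer (lowercase + split on whitespace) is a simplifying
# assumption; a real system would handle punctuation, stemming, etc.

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

# The vocabulary is the set of all words appearing in the corpus.
vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})
print(vocabulary)
```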
For each method, we provide some intuition, explain how it is related to sparse embeddings, what can be precomputed when given a corpus of documents, and how we find the document that is most relevant to a query.
One-Hot Encoding
A one-hot embedding is the simplest sparse representation, where each dimension corresponds to a unique word in the vocabulary. If a word is present, its position is marked with a 1; otherwise, it’s 0. The dot product between two one-hot vectors counts the number of common words, giving a basic measure of overlap.
Embedding. The sparse embedding is the sparse vector where each dimension corresponds to a unique word in the vocabulary. The vector has a 1 in each dimension whose word appears in the document and 0s everywhere else.
Precomputed. Creation of the one-hot embedding of each document.
Score computation at query-time. We create the one-hot embedding of the query. Then, the dot product between the query embedding and a document embedding counts the number of common words between the two.
The larger the dot product with a document, the more relevant the document is.
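As a concrete illustration, here is a minimal, self-contained Python sketch of one-hot embeddings and dot-product scoring. The toy corpus, the naive tokenizer, and the function names are illustrative assumptions, not part of the method itself.

```python
# Minimal sketch of one-hot embeddings and dot-product scoring.

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tokenize(text):
    return text.lower().split()

vocabulary = sorted({w for doc in corpus for w in tokenize(doc)})

def one_hot_embedding(text):
    """1 if the vocabulary word appears in the text, 0 otherwise."""
    words = set(tokenize(text))
    return [1 if w in words else 0 for w in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Precomputed: one-hot embedding of each document.
doc_embeddings = [one_hot_embedding(doc) for doc in corpus]

# Query-time: embed the query, then score every document.
query = "the cat"
query_embedding = one_hot_embedding(query)
scores = [dot(query_embedding, d) for d in doc_embeddings]
best = max(range(len(corpus)), key=lambda i: scores[i])
print(scores, "->", corpus[best])
```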
Bag of Words
A Bag of Words (BoW) embedding is a sparse representation where each dimension corresponds to a unique word in the vocabulary. The value of each dimension represents the count of occurrences of that word in the document. If a word appears multiple times, its corresponding value reflects the frequency; if it doesn’t appear, its value is 0. The dot product between two Bag of Words vectors gives a measure of overlap based on the frequency of common words, providing a more nuanced similarity score compared to One-Hot Encoding.
Embedding. The sparse embedding is the sparse vector that counts the frequency of each word in the document. Unlike one-hot, multiple occurrences of a word increase its value.
Precomputed. Creation of the BoW embedding of each document.
Score computation at query-time. We create the BoW embedding of the query. Then, the dot product gives a score based on how many times words overlap, factoring in their frequency (more overlap means a higher score).
The larger the dot product with a document, the more relevant the document is.
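The corresponding sketch for Bag of Words only changes the embedding function: instead of a binary indicator, each dimension holds a word count (same toy corpus and tokenizer assumptions as before).

```python
# Minimal sketch of Bag of Words embeddings and dot-product scoring.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tokenize(text):
    return text.lower().split()

vocabulary = sorted({w for doc in corpus for w in tokenize(doc)})

def bow_embedding(text):
    """Count of each vocabulary word in the text."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Precomputed: BoW embedding of each document.
doc_embeddings = [bow_embedding(doc) for doc in corpus]

# Query-time: embed the query, then score every document.
query = "the cat and the dog"
scores = [dot(bow_embedding(query), d) for d in doc_embeddings]
print(scores)
```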
TF-IDF
TF-IDF improves on this by weighting words based on their importance. Term Frequency (TF) counts how often a word appears in a document, while Inverse Document Frequency (IDF) reduces the weight of common words across the corpus. For instance, “the” might have a high TF but a low IDF, resulting in a low weight overall. The TF-IDF weights are obtained by multiplying TF and IDF. Then the dot product between two TF-IDF vectors gives a weighted overlap, emphasizing rare but relevant terms.
Embedding. The sparse embedding is the weighted vector where each word is assigned a value based on its importance. TF measures word frequency in the document, and IDF reduces the weight of common words across the corpus.
Precomputed.
For each term $t \in V$ and each document $D \in \mathcal{C}$, we can precompute
$$ \text{TF}(t,D) = \frac{f_{t,D}}{\sum_{t' \in D} f_{t',D}} $$
where $f_{t,D}$ is the number of occurrences of the term $t$ in the document $D$.
This leads to a table of $|V| \times |\mathcal{C}|$ values.
For each term $t \in V$ we can precompute
$$ \text{IDF}(t) = \log\left(\frac{|\mathcal{C}|}{1 + n_t}\right) $$
where $|\mathcal{C}|$ is the total number of documents, and $n_t$ is the number of documents containing term $t$.
Hence, for each document $D \in \mathcal{C}$ we have access to $\text{TF-IDF}(t,D)=\text{TF}(t,D) \times \text{IDF}(t)$ for each term $t \in V$. This means we have an embedding of size $|V|$ for each document of the corpus.
Score computation at query-time. Given the query $Q$ we compute $\text{TF-IDF}(t,Q)$ for each term $t \in V$. We then have an embedding of size $|V|$ for the query $Q$. Finally, the dot product (or cosine similarity) between the query and document TF-IDF vectors gives a weighted overlap, emphasizing rare but relevant terms.
The larger the dot product (or cosine similarity) with a document, the more relevant the document is.
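Here is a minimal Python sketch that follows the two formulas above (TF normalized by document length, $\text{IDF}(t) = \log(|\mathcal{C}| / (1 + n_t))$). As before, the toy corpus and naive tokenizer are assumptions made only for illustration.

```python
# Minimal sketch of TF-IDF embeddings and scoring.
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tokenize(text):
    return text.lower().split()

vocabulary = sorted({w for doc in corpus for w in tokenize(doc)})

# Precomputed: n_t (document frequency) and IDF(t) for each term.
n_t = {t: sum(1 for doc in corpus if t in tokenize(doc)) for t in vocabulary}
idf = {t: math.log(len(corpus) / (1 + n_t[t])) for t in vocabulary}

def tf_idf_embedding(text):
    """TF-IDF vector of size |V|: normalized term frequency times IDF."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return [(counts[t] / total) * idf[t] if total else 0.0 for t in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Precomputed: TF-IDF embedding of each document.
doc_embeddings = [tf_idf_embedding(doc) for doc in corpus]

# Query-time: embed the query, then rank documents by dot product.
query = "cats are pets"
scores = [dot(tf_idf_embedding(query), d) for d in doc_embeddings]
print(scores)
```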
BM25
BM25 further refines this by introducing a saturation function that limits the impact of repeated occurrences of a term, and by normalizing for document length, which makes it more robust across documents of varying lengths. Tunable parameters control how strongly term frequency and document length influence the score. BM25 balances term importance, frequency, and document length to compute a relevance score, improving over raw TF-IDF.
Embedding. No explicit vector. BM25 computes a relevance score using term frequency, document length, and word importance (IDF).
Precomputed.
IDF for each term $t \in V$:
$$ \text{IDF}(t) = \log \left( \frac{|\mathcal{C}| - n_t + 0.5}{n_t + 0.5} + 1 \right) $$
where $|\mathcal{C}|$ is the total number of documents in the corpus, and $n_t$ is the number of documents containing term $t$.
The term frequencies $f_{t,D}$, i.e. the number of occurrences of term $t$ in document $D$, for each term $t \in V$ and each document $D \in \mathcal{C}$.
The average document length $\text{avgdl}$ over the corpus.
Score computation at query-time. BM25 score is computed as:
$$ \text{Score}(Q, D) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f_{t,D} \cdot (k_1 + 1)}{f_{t,D} + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} $$
where $f_{t,D}$ is the number of occurrences of the term $t$ in the document $D$, $k_1$ and $b$ are tunable parameters, $|D|$ is the document length, and $\text{avgdl}$ is the average document length in the corpus $\mathcal{C}$.
The larger the score, the more relevant the document is.
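To tie the formulas together, here is a minimal Python sketch of BM25 precomputation and scoring. The values $k_1 = 1.5$ and $b = 0.75$ are common default choices rather than values fixed by the formula, and the toy corpus and naive tokenizer remain illustrative assumptions.

```python
# Minimal sketch of BM25 precomputation and scoring.
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tokenize(text):
    return text.lower().split()

tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({w for doc in tokenized for w in doc})

# Precomputed: IDF(t), term frequencies f_{t,D}, and avgdl.
N = len(corpus)
n_t = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}
idf = {t: math.log((N - n_t[t] + 0.5) / (n_t[t] + 0.5) + 1) for t in vocabulary}
freqs = [Counter(doc) for doc in tokenized]
avgdl = sum(len(doc) for doc in tokenized) / N

def bm25_score(query, doc_index, k1=1.5, b=0.75):
    """BM25 score of one document for the given query."""
    doc_len = len(tokenized[doc_index])
    score = 0.0
    for t in tokenize(query):
        if t not in idf:  # skip query terms outside the vocabulary
            continue
        f = freqs[doc_index][t]
        score += idf[t] * (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return score

# Query-time: score every document and keep the best one.
query = "cat on the mat"
scores = [bm25_score(query, i) for i in range(N)]
best = max(range(N), key=lambda i: scores[i])
print(scores, "->", corpus[best])
```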
Conclusion
In summary, One-Hot Encoding is the simplest, representing word presence. Bag of Words adds frequency information. TF-IDF weights words based on importance, and BM25 further adjusts for frequency saturation and document length, making it highly effective for ranking documents.