SPLADE: Combining Sparse and Semantic Approaches

In Information Retrieval (IR), current research focuses on leveraging neural networks to find documents relevant to a query. Earlier methods, known as Bag-of-Words (BoW), were based on word appearance statistics. With the development of neural language models, other retrieval methods emerged, such as dense retrieval. Language models also enable reranking the relevance of retrieved documents. Currently, research is focused on improving the initial retriever in ranking pipelines, i.e., before the reranking step. This is the problem that SPLADE addresses.

The Challenge with Information Retrieval

Before looking at how SPLADE [Formal et al., 2021] works, let’s try to briefly understand the current challenge in IR.

In the world of search engines, retrieving relevant documents from a massive pool of data is not just about finding exact keyword matches. Modern users expect search engines to understand context, synonyms, and related concepts. For instance, when searching for “electric vehicle,” users also expect results that mention “EVs,” “battery-powered cars,” or even “Tesla”.

Traditional models like Term Frequency-Inverse Document Frequency (TF-IDF) or Best Match 25 (BM25) (see our blog article about sparse embeddings) are Bag-of-Words retrieval functions that rank a set of documents based on word frequencies. This means that they rely purely on lexical matching, while dense neural models like BERT-based retrieval systems focus on semantic matching.

We will call an embedding a numerical representation of text, i.e., a numerical vector. From what we’ve discussed, there are two types of embeddings (a toy comparison follows the list):

  • sparse embeddings: built with methods such as TF-IDF or BM25, they have very few non-zero values. Their dimension is usually the size of the vocabulary.
    • Pros: faster retrieval, no fine-tuning, exact term matching.
    • Cons: vocabulary mismatch problem.
  • dense embeddings: built with neural network models such as transformers, they are lower-dimensional but information-rich.
    • Pros: good for semantic matching and multi-modality.
    • Cons: may need fine-tuning, more compute, no exact term matching.
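
To make the contrast concrete, here is a toy comparison (the vocabulary and numbers are made up for illustration; real vocabularies have tens of thousands of terms):

```python
# Toy illustration of the two representations for the text "electric vehicle".
vocabulary = ["battery", "car", "electric", "ev", "tesla", "vehicle"]  # |V| = 6 here, ~30k in practice

sparse = [0.0, 0.0, 1.2, 0.0, 0.0, 0.9]  # one weight per vocabulary term, mostly zeros
dense = [0.12, -0.48, 0.91, 0.05]        # low-dimensional and learned; dimensions carry no direct term meaning
```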

SPLADE strikes a balance by combining sparse lexical representations with semantic expansion, thus capturing the best of both worlds.

What is SPLADE?

SPLADE, short for SParse Lexical AnD Expansion model, is a model designed for first-stage ranking in information retrieval systems. Unlike dense retrieval models, which generate continuous embeddings, SPLADE maintains sparse, high-dimensional representations, similar to traditional lexical models, but with the added power of semantic expansion. This enables it to capture both explicit and implicit relationships between terms, significantly enhancing the search process. SPLADE uses a pretrained language model such as BERT to identify terms related to the input words, and trains a few additional parameters to construct relevant embeddings. Producing these related terms enriches the sparse vector embedding.

How Does SPLADE Work?

SPLADE leverages a transformer-based architecture (e.g., BERT) to generate sparse representations of queries and documents. Here’s how it works step-by-step:

Embedding construction

SPLADE takes a query or document and transforms it into a sparse high-dimensional vector. We will write this input sequence as $t_1 \cdots t_n$. This input is fed to a transformer model such as BERT, producing $n$ dense embeddings $h_1 \cdots h_n$.

We now create $|V|$ weights for each of the $n$ input tokens, where $|V|$ is the size of the vocabulary. For all $j = 1, \cdots, |V|$,

$$ w_{ij} = transform(h_i)^T E_j + b_j $$

where $E_j$ is the (input) embedding of vocabulary token $j$, $b_j$ is a token-level bias, and $transform$ is a linear layer with GeLU activation and LayerNorm. This computation is thus equivalent to an MLM prediction head, so we can simply reuse a pre-trained MLM model.

To be clear, $w_{ij}$ represents how much vocabulary token $j$ is related to input sequence token $i$.
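
As a rough sketch of this step (assuming PyTorch; the class name `MLMHead` and its constructor arguments are illustrative, not from the paper), the weight computation mirrors a BERT-style MLM prediction head:

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Computes w_ij = transform(h_i)^T E_j + b_j, like a BERT MLM prediction head."""
    def __init__(self, hidden_size: int, vocab_size: int, embedding_matrix: torch.Tensor):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.E = embedding_matrix                      # (|V|, hidden_size), the input embedding table
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n, hidden_size) dense token embeddings -> (n, |V|) matrix of w_ij
        return self.transform(h) @ self.E.T + self.bias
```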

Now, for each input token we have a distribution over the vocabulary indicating which words are the most important for that token. We can obtain a global importance estimate for the whole input by computing

$$ w_j = \sum_{i \in t} \log(1+\text{ReLU}(w_{ij})) $$

This way, each vocabulary token gets a weight indicating how relevant it is to the input sequence. This vector $w$ is the SPLADE embedding of the input sequence; its size is $|V|$.
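
Putting these steps together, here is a minimal sketch using the Hugging Face `transformers` library, with `bert-base-uncased` standing in for the pretrained MLM (in a real SPLADE model this network is fine-tuned, so the raw weights below would not yet give useful retrieval scores):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def splade_embedding(text: str) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt")
    logits = model(**tokens).logits            # (1, n, |V|): the w_ij matrix
    weights = torch.log1p(torch.relu(logits))  # log(1 + ReLU(w_ij))
    # sum over the n input tokens -> one weight per vocabulary term (special tokens kept for simplicity)
    return weights.sum(dim=1).squeeze(0)       # shape (|V|,)
```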

Then, to rank documents, we compute the dot product between the embedding of the query $q$ and the embedding of each document $d$ in the corpus:

$$ s(q,d) = \sum_{j=1}^{|V|} w_j^q \cdot w_j^d $$

The higher the value of $s(q,d)$, the more relevant $d$ is to $q$.
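
Building on the `splade_embedding` sketch above, ranking a small corpus could look like the snippet below (in practice the document embeddings are precomputed and stored in an inverted index):

```python
def rank(query: str, documents: list[str]) -> list[tuple[float, str]]:
    q = splade_embedding(query)
    scores = [(float(torch.dot(q, splade_embedding(d))), d) for d in documents]
    return sorted(scores, reverse=True)  # highest s(q, d) first

ranked = rank("electric vehicle", ["EVs run on batteries", "This recipe needs flour"])
```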

Training

SPLADE learns sparse representations. This means that SPLADE does not just add synonyms to a BoW embedding. Indeed, the $transform$ linear layer and the bias $b$ are learnable parameters, trained to bring relevant query/document pairs closer together and push irrelevant ones apart. Furthermore, as we will see with the loss, the training encourages the representations to be sparse. Note that the transformer itself can also be trained, but most of the time a pre-trained one is used.

Contrastive learning

The model is trained using a contrastive learning objective, which encourages it to maximize similarity between queries and relevant documents while minimizing it for irrelevant ones.

This training strategy helps the model learn to rank documents effectively, focusing on both precise term matching and capturing the underlying meaning.

This is the loss presented in the paper:

$$ \mathcal{L}_{\text{rank-IBN}} = - \log \left( \frac{ e^{s(q_i,d_i^+)} }{ e^{s(q_i,d_i^+)} + e^{s(q_i,d_i^-)} + \sum_j e^{s(q_i,d_{i,j}^-)} } \right) $$

where $q_i$ is a query in the batch, $d_i^+$ a positive document, $d_i^-$ a (hard) negative document (e.g., coming from BM25 sampling), and $\{d_{i,j}^{-}\}_{j}$ a set of in-batch negatives (positive documents from other queries).
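
In code, this objective reduces to a cross-entropy over the candidate scores, with the positive document as the target class. A minimal sketch (how the scores are batched into tensors is assumed):

```python
import torch
import torch.nn.functional as F

def rank_ibn_loss(s_pos, s_hard_neg, s_inbatch_neg):
    # s_pos, s_hard_neg: (B,) scores s(q_i, d_i^+) and s(q_i, d_i^-)
    # s_inbatch_neg: (B, B-1) scores against the positives of the other queries in the batch
    logits = torch.cat([s_pos.unsqueeze(1), s_hard_neg.unsqueeze(1), s_inbatch_neg], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long)  # the positive is always in column 0
    return F.cross_entropy(logits, targets)                  # = mean of -log softmax(positive)
```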

Regularization

SPLADE uses a regularization technique to ensure sparsity by keeping only the most relevant terms with non-zero weights. This results in efficient, interpretable representations.

$$ l_{\text{FLOPS}} = \sum_{j \in V} \bar{a}_j^2 = \sum_{j \in V} \left( \frac{1}{N} \sum_{i=1}^{N} w_j^{(d_i)} \right)^2 $$

where $\bar{a}_j$ is the average weight of vocabulary token $j$ over the $N$ documents of a batch.

The overall loss is

$$ \mathcal{L} = \mathcal{L}_{\text{rank-IBN}} + \lambda_q \mathcal{L}_{\text{reg}}^q + \lambda_d \mathcal{L}_{\text{reg}}^d $$

Where $\mathcal{L}_{\text{reg}}$ is a sparse regularization ($l_1$ or $l_{\text{FLOPS}}$).

They use two distinct regularization weights, $\lambda_q$ and $\lambda_d$, for queries and documents. This way they can push the query representations to be even sparser, which is what enables fast retrieval.

This sparsity control allows SPLADE to generate compact representations while maintaining rich semantic information.
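
A sketch of the FLOPS regularizer and the combined objective follows; the score tensors, the batch embeddings `query_embs`/`doc_embs`, and the weights `lambda_q`/`lambda_d` are assumed to come from the surrounding training loop:

```python
def flops_reg(embs: torch.Tensor) -> torch.Tensor:
    # embs: (N, |V|) batch of SPLADE embeddings; l_FLOPS = sum_j (mean_i w_j)^2
    return (embs.mean(dim=0) ** 2).sum()

loss = (
    rank_ibn_loss(s_pos, s_hard_neg, s_inbatch_neg)
    + lambda_q * flops_reg(query_embs)
    + lambda_d * flops_reg(doc_embs)
)
```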

Pros of SPLADE

  • SPLADE captures both exact keyword matches and semantic relationships between terms, improving the relevance of search results.
  • Sparse representations are inherently more interpretable than dense embeddings. This allows search engineers to understand why certain documents were retrieved, which is crucial in applications where transparency matters.

Cons of SPLADE

  • Slower retrieval compared to other sparse retrieval methods (a higher number of non-zero values than in BM25).

SPLADE Variants and Improvements

Several improvements have been made to SPLADE with SPLADE v2 [Formal et al., 2021]. SPLADE v2 reduces the number of non-zero values through two changes:

  • $w_j = \sum_{i \in t} \log(1+\text{ReLU}(w_{ij}))$ becomes max pooling:

    $$ w_j = \max_{i \in t} \log(1+\text{ReLU}(w_{ij})) $$

  • and the use of this similarity measure:

    $$ s(q,d) = \sum_{j \in q} w_j^d $$

    which means that there is no query expansion and no query term weighting. Hence the ranking score only depends on the document term weights. This enables the ranking score to be almost entirely precomputed, requiring only the summation of the right document weights at query time. Inference costs are reduced, and the results remain competitive (see the sketch after this list).
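
Relative to the earlier `splade_embedding` sketch, the two changes amount to swapping the sum for a max and scoring with document weights only (again a sketch, reusing the same assumed tokenizer and model):

```python
def splade_v2_embedding(text: str) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt")
    logits = model(**tokens).logits
    weights = torch.log1p(torch.relu(logits))
    return weights.max(dim=1).values.squeeze(0)  # max pooling over input tokens instead of summation

def score(query: str, doc_emb: torch.Tensor) -> float:
    # no query expansion or term weighting: sum the document weights of the query's own terms
    q_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    return float(doc_emb[q_ids].sum())
```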

Final Thoughts

SPLADE represents a big step in bridging the gap between traditional sparse methods and modern semantic approaches. By leveraging sparse lexical representations along with semantic expansion, SPLADE ensures efficient, interpretable, and scalable retrieval without sacrificing performance.

Whether you’re building a search engine for your business, an academic institution, or a content platform, SPLADE could be the key to unlocking better search experiences.


References

Formal, T., Piwowarski, B., & Clinchant, S. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In Proceedings of SIGIR 2021.

Formal, T., Lassance, C., Piwowarski, B., & Clinchant, S. (2021). SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv preprint.