## Introduction

A similarity model is a set of abstractions and metrics to define to what extent things are similar. That’s quite a general definition. In this article I will only consider textual similarity. In this context, the uses of similarity models can be divided into two categories: classification of documents, with a finite set of categories where the categories are known; and information retrieval where the problem can be defined as ‘find the the most relevant documents to a given query’. In this article I will look into the latter category.

Elasticsearch provides the following similarity models: default, bm25, drf and ib. I have limited the scope of this article to default and bm25. The divergence from randomness and information based similarities may feature in a future article.

## Default Similarity

The default similarity model in Elasticsearch is an implementation of **tf/idf**. Tf/idf is the most common vector space model. A vector space model is a model where each term of the query is considered a vector dimension. This allows for defining one vector for the query and another for the document considered. The scalar product of the two vectors is then considered the relevance of the document to the query. This implies that the positions of the words within a document are not used - a document is just a bag of words.

A simple vector space model would be to set each coordinate in the document vector to one if the document has the word for the dimension, otherwise set it to zero. The query vector would be all one’s if each word has the same weight. In such a model the scalar product would then be the sum of the words which are common for the document and the query. However, such a model would rank a document referring once to a query term in a side note the same as a document using the term repeatedly throughout the text. Which of the the two documents’ topics is identical to the term? The tf part in tf/idf helps us resolve this: tf stands for term frequency, and the solution is to use the term frequency instead of a simple zero or one in the document vector.

In its simplest form you could calculate the term frequency by counting the number of occurences of the term in the document. This simplistic approach has one inherent problem. Assuming a query with the terms: *fire fox* generates the document vectors D1{15, 0} and D2{5, 5}. With a query vector of Q{1, 1}, document D1 will have a score of 15 and document D2 will have a score of 10. Let’s say D1 is a lengthy document about fire hazards and D2 is a tutorial on how to install the browser Firefox. Which document do you think corresponds best to the query? To get the best of both worlds, by rewarding documents for specifically being about a term and also being about all the terms, tf-idf uses the logarithm of the term frequency for the vector value. Those who remember their calculus know that the logarithm of zero is infinitely negative while zero is the logarithm to one. To get the desired behaviour tf-idf simply adds one to all term frequencies before taking the logarithm. Using natural logarithms the document vectors become D1{2.772588722239781, 0} and D2{1.791759469228055, 1.791759469228055} with ranks 2.772588722239781 and 3.58351893845611 respectively. We now have a model that values having all terms without disregarding that a document can be more about a term than another.

The other part of tf-idf yet to be mentioned is IDF, the Inverse Document Frequency. The definition of idf is the total number of documents in the collection (or index) divided by the number of documents that contain the word. It is used to address the fact that some words are more common in the language than others. In fact some words are that common they are considered noise. If a word is included in every document of a collection, how can it be helpful in choosing documents? Why don’t we use stop words you say? Well, Elasticsearch does use stop words, however using stop words can prove to be inaccurate and of course, stop words are not only language specific, but also domain specific. IDF is a measure of the significance of a word’s occurrence in a document. Similarly to TF, it is common to use the logarithm of IDF. The base number of the logarithm is not important, the crux of the matter is that it requires an exponential increase in the frequency to increase the IDF.

As division by zero is somewhat problematic, it’s recommended to add 1 to the denominator of the fraction. To avoid a negative IDF it is then common to add a +1 to the idf outside of the logarithm as in Lucene’s TFIDFSimilarity.

## BM25

BM25 belongs to the probabilistic models while TF-IDF is a vector space model, but their formulas are not as different as you might expect. Both models define a weight for each term as a product of some idf-function and some tf-function and then summarize that term weight as the score for the whole document towards the given query.

\(\mathrm{Score}_{i} = A(tf_{i}) * B(idf_{i})\)

For those who would like a thorough summary of the theoretical basis of BM25 I recommend The Probabilistic Relevance Framework: BM25 and Beyond. In this article I will rather try to explain the practical difference.

### Saturation

Both tf-idf and bm25 acknowledge that for highly frequent terms a further increase in term frequency has little significance for the relevance. The difference between them is that BM25 takes this a bit further. The saturation function of BM25 asymptotically approaches a limit for high term frequencies, while the logarithms of tf-idf has no boundary. For tf-idf, multiplying the the term frequency by the base of the logarithm, usually \(e \approx 2,72\), will always increase the logarithm by one.

This difference in growth implies that for any fixed set of tuning parameters, the base of the logarithm included, you can create a document that will get a higher relative increase in term frequency score in tf-idf than in bm25.

The following functions describe how term frequency contributes to score in BM25 and tf-idf:

- \(\mathrm{Saturation}_{BM25}(tf_{i}) = \frac{tf_{i}}{k_{1} + tf_{i}}\)
- \(\mathrm{Saturation}_{tf-idf}(tf_{i}) = \mathrm{ln}\left(tf_{i} + 1\right)\)

If we plot these functions for typical values of k1 and term frequency from 0 to 100 we get:

When reading this chart it’s important to note that the absolute height of each graph is irrelevant as it could easily be adjusted with a simple boost factor. What is interesting is the relative growth of each curve. There are two things to note from this graph. Firstly, it supports the fact that the logarithm has no upper boundary and secondly, the differences between \(k_{1}\) in the range from 1,2 to 2,0 are subtle. The latter observation has the practical consequence that if you don’t have any idea what to choose for *\(k_{1}\)*, the default value will most likely not be too far off.

### Average Document Length

The second major difference between tf-idf and BM25 is the use of document length in BM25. BM25 uses document length to compensate for the fact that a longer document in general has more words and thus is more likely to have a higher term frequency without necessarily being more pertinent to the term and thus no more relevant to the query. But it’s not all black and white, some documents actually have a wider scope which makes the longer text justified. BM25 adjusts the term factor by a factor \(B\), which is calculated from a tuning parameter \(b\) \(\left(0 \leq b \leq 1\right)\), the document length \(dl\) and the average document length \(avdl\), like this:

\(B = \left((1-b)+b\frac{dl}{avdl}\right)\)

By defining \(tf_{adjusted} = tf * B\) and substituting \(tf\) for \(tf_{adjusted}\) in previous weight defintion we get:

\(W_{i}^{TF} = \frac{tf_{i}}{tf_{i} + k1\left(\left(1 - b\right) + b\frac{dl}{avgdl}\right)}\)

This is how term frequency is used in BM25.

### The Complete Term Weight

To get the full weight for a term in BM25 we need to multiply \(W_{i}^{TF}\) with \(IDF_{i}\). However BM25 defines idf slightly differently than the default similarity and this difference is reflected in Lucene’s implementation. The definition used is:

\(IDF_{i} = \mathrm{log}\left(1 + \frac{D - d_{i} + 0.5}{d_{i} + 0.5}\right)\)

Where \(D\) is the total number of documents int the collection and \(d_{i}\) is the number documents containing the i’th term of the query.

### How Lucene Does BM25

Lucene uses a variation of BM25 where the numerator of the fraction is multiplied with \(\left(k_{1} + 1\right)\). That’s a common trick to make the formula more similar to other ranking functions and hence also easier to compare the resulting values. As this expression is a constant, it has not consequence for the ranking produced.

\(W_{i}^{\mathrm{Lucene BM25}} = idf_{i} * boost * \frac{(k_{1} + 1)tf_{i}}{tf_{i} + k_{1}\left(\left(1 - b\right) + b\frac{dl}{avgdl}\right)}\)

As Lucene aims for performance, the formula is not calculated entirely at query time. Lucene uses the similarity model chosen both while indexing and querying. This reflects on Elasticsearch as well.

### Tuning BM25

Regarding the tuning parameters, \(k_{1}\) and \(b\) there is a reason they are defined as parameters. The ideal value is dependent on the document collection. Unfortunately, there is no silver bullet and trial and error is the way to go, but it does help to have an idea of which documents are relevant to the queries used and to have a methodological approach.

## Conclusion

There is a reason why TF-IDF is as widespread as it is. It is conceptually easy to understand and implement while also performing pretty well. That said, there are other, strong candidates. Typically, they offer more tuning flexiblity. In this article we have delved into one of them, BM25. In general, it is known to perform just as good or even better than TF-IDF, especially on collections with short documents. Bearing that in mind, don’t forget that similarity rank is not the only ranking contributor in Elasticsearch. To create a good search experience it is key to combine textual similarity rank with metadata suiting the given case, such as the last time document was updated or if there is some proximity between the authors of the query and the document. Read our follow-up article where we compare the precision and recall of the two models using Wikipedia articles: BM25 vs Lucene Default Similarity.