May, 2021

Ordering fuzzy search results by relevancy using trigrams

In this article, I'll share how I've improved a search feature by ranking fuzzy string matches by similarity to the searched term. I'll also share how to improve the search performance with indexes.

Say that I want to search users by name or email. If I type 'ana' I want to get all the users whose name or email contains the character combination 'ana'. I'm saying contains since there might be emails or names that are not an exact match to 'ana ' but can contain it - e.g. 'joana', 'diana', 'user@banana.com' - I don't want to exclude those from my results. Other names or emails similar to 'ana' can be valuable to whoever is searching.

This type of search is called fuzzy string matching and can be implemented with SQL using wildcards. I'll be using PostgreSQL but similar SQLs can be written in other databases:

SELECT id, name, email
FROM users
AND users.name ILIKE '%ana%'
OR users.email ILIKE '%ana%'
LIMIT 20

This query will return all the users whose name or email contains 'ana' but I won't get them in any specific order. I could order them alphabetically by 'name' and 'email' but that does not guarantee relevance. Since I'm limiting results to the first 20 I could even be excluding relevant results by ordering alphabetically.

Luckily I've learned that Postgres has this very interesting extension called pg_trgm. Quoting from the postgresql documentation:

[pg_trgm] provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.

In linguistics, a trigram is a group of three consecutive written units such as letters, syllables, or words. What Postgres will do is to split words into trigrams. For instance, the 'ruby' trigram would be {" r"," ru","by ",rub,uby}. Note that some trigrams have empty spaces. That's because each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. What we really need to take away from this is that postgres will split words into smaller chunks of 3 alphanumeric characters and then check the similarities with the same size chunks of other words.

So when comparing two strings using trigrams, Postgres can score the similarity between them from 0 to 1 (where 1 is a perfect match). Let's play with the pg_trgm extension to understand how it works.

First, I'll add the extension:

CREATE EXTENSION pg_trgm;

Now, I can compare two words:

SELECT SIMILARITY('Ana', 'Addriana'); 
-- returns 0.3

SELECT SIMILARITY('Ana', 'Alana'); 
-- returns 0.42857143

This illustrates the consequences of ordering alphabetically. If I search 'Ana', 'Addriana' will show up in the results before 'Alana' despite having a lower similarity score. It might be ok for you, depending on the requirements of your product, to order alphabetically. Using similarity adds a bit of complexity to your code but in this case, it does add value to the search feature I'm working on.

So now that we know how trigrams and the similarity function work, let's add that to our initial query. To order by similarity I'll first have to SELECT that score, give it an alias, and pass that variable to the ORDER BY clause:

SELECT
id,
name,              
email,
SIMILARITY(name, 'ana') AS name_score,
SIMILARITY(email, 'ana') AS email_score
FROM users
WHERE users.name ILIKE '%ana%'
OR users.email ILIKE '%ana%'
ORDER BY name_score DESC NULLS LAST, email_score DESC NULLS LAST, name
LIMIT 20;

Looking closely at the ORDER BY clause, note that since I'm searching on two columns, I've considered that the most relevant is name_score. Then I'll look at the email_score and I'll finally order alphabetically if necessary.

Fast searches with trigram indexes

Now that we've got our query, I've analyzed its performance and it seems fairly fast (~100-130ms). Still, since I want to return search results as my users are typing, I don't want this search to be just fast - I want it blazing-fast. I need immediate feedback. I'm going to test adding a couple of indexes to see if they can improve the response time.

GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) indexes are very useful in text search features since they can map multiple values (e.g.: list of trigrams) to a single row. On the other hand, Postgres default B-Tree indexes are optimized for when a row has a single key-value.

When choosing between GIN vs GiST, I've decided to go with the GIN index. Although it might be a bit slower to write it is faster to read than the GiST index. To make sure that it will not raise any performance issues, I've tested creating my indexes locally on data amounts similar to the production data. Creating an index locks out the table for writes and in problematic cases, it can cause downtime. To avoid this, I'll tell Postgres to create these indexes concurrently. These can take longer to build but they will not require a lock that blocks writes.

So with the help of pg_trgm, which supports indexing based on trigrams, I'll add a GIN index to each of the columns that I'm searching - name and email - and pass the operator class gin_trgm_ops to instruct Postgres to build these indexes with trigrams.

CREATE INDEX CONCURRENTLY index_users_on_name_trigram
ON users 
USING gin (name gin_trgm_ops);

CREATE INDEX CONCURRENTLY index_users_on_email_trigram
ON users 
USING gin (email gin_trgm_ops);

With these two indexes, I've managed to reduce the query runtime to ~60-80ms 🚀.

Other performance remarks: `ILIKE` vs `LIKE`

There are two reasons why I've decided to use the Postgresql ILIKE function instead of the LIKE. One has to do with the functionality per se and the other with performance.

In terms of functionality, the main difference between the two is that ILIKE is case-insensitive. So if I want to know if 'Ana' and 'ana' are the same, ILIKE will tell you that statement is true, and LIKE will tell you otherwise:

SELECT 'Ana' ILIKE 'ana'; 
-- returns true

SELECT 'Ana' LIKE 'ana'; 
-- returns false

You can make LIKE case-insensitive as well with the help of the LOWER function:

SELECT 'Ana' ILIKE 'ana'; 
--returns true

SELECT LOWER('Ana') LIKE LOWER('ana'); 
-- returns true

But this is a more verbose and prone to errors version that does not add any value to this feature. Besides, I've noted a slight improvement in performance using ILIKE.

If you're interested in understanding the performance differences between these two approaches, I recommend this stackoverflow thread. Here are my 2 cents after reading about it and doing some benchmarks: The latter option (LIKE with LOWER) can be significantly faster than the first (ILIKE) if you're not using indexes. When you use good indexes, those differences are reduced or even inverted. In my case, creating different GIN indexes for name, email ended up showing that the ILIKE option is the most performant (though with indexes this difference is not that relevant).

Conclusion

Using the PostgresSQL pg_trgm extension not only improved the relevancy of the search results but also brought this feature with great performance improvements (vs. a previously sluggish user experience). To come up with the final query and index solutions, I've used pgAdmin to test different hypotheses. Postgres EXPLAIN ANALYZE is an awesome command that I've also used heavily throughout this development. Remember that each problem is different so I'd advise anyone to use the tools that Postgres provides to analyze and measure different potential solutions.