Ishaan - Finding Word Anagrams - The magic of preprocessing

The Problem

I recently came across a problem that involved finding all anagrams for a given word within a provided dictionary. Recall that an anagram of a word is another word that has the exact same letters but arranged in a different order e.g. silent is an angaram of listen, loop and polo are anagrams of pool.

The below widget demonstrates the functionality. The dictionary used is a file words.txt that contains 10,000 English words listed one per line in alphaberical order.

Give it a try with inputs like post, listen or are and you will see their anagrams listed out immediately upon pressing Find Anagrams. And of course not all words will have anagrams. You should see a message indicating this for such words.

The rest of the post is an analysis of the problem and a fast and efficient solution for it, based on which this widget was built. It provides an example of how the right use of algorithms and data structures can often provide efficient solutions to problems that are very inefficient to solve using direct, brute force methods.

Potential Solutions

Manual Version

Suppose you are given a physical dictionary and the assigned task is Find all the words in that dictionary that are anagrams of the word ‘listen’. It should be clear pretty quickly that the task is very tedious. There are a couple of approaches that come to mind :-
1. Go through all the words in the dictionary from start to end and for each word, check if it is an anagram of ‘listen’.
2. Generate all permutations of the word ‘listen’ and for each permutation, check if it exists in the dictionary.

Both approaches involve a lot of work. For short words, #2 could be more efficient as the number of permutations will be small and a dictionary can be looked up quickly since it is sorted in alphabetical order. For long words, the number of permutations will be large and it may be more efficient to just scan the whole dictionary (i.e. #1) and do an anagram check for each word. In this process, we could optimize a bit by skipping words whose length is not equal to that of the provided word.

Computer Version

Suppose the dictionary is provided as a file containing each word on a separate line e.g. words.txt. And suppose the task is to build an app that will take a word as input and will provide the list of its anagrams that are present in the dictionary file. It would be natural to lean towards approach #2 in this case, as processing the whole dictionary on every run seems like a big task. However, it the input is a long word, the number of its permutations could easily exceed the number of words in the dictionary !

The number of permutations for a string of length \(n\), without any repeated letters, is \({}^{n} \mathrm{ P }{}_n = n!\) This factorial function grows at an extremely rapid rate — a phenomenon sometimes referred to as combinatorial explosion. A 10-letter word, without any repeated letters, has \({}^{10} \mathrm{ P }{}_{10} = 10! = 3,628,800\) permutations while our dictionary i.e. words.txt has only 10,000 words. Thus, for longer words, even a scan of the whole dictionary could be less work !

Note

Repeated letters will cut down the number of permutations. A 10-letter word with two letters repeated twice each will have \({}^{10} \mathrm{ P }{}_{10} / (2!)(2!) = 907,200\) permutations. The 6-letter word \(banana\) has 3 repeated \(a\)s and 2 repeated \(n\)s. Thus it has \({}^{6} \mathrm{ P }{}_{6} / (3!)(2!) = 60\) permutations instead of the \(6! = 720\) that are there for the word \(silent\) which also has 6 letters but no repetitions .

Looking for an efficient solution

How to check if 2 words are anagrams ?

The fundamental property shared by all anagrams is that they consist of the exact same set of characters. Thus, anagrams will all produce the same result if their letters are sorted in alphabetical order.

This allows for the creation of a canonical representation, or a standard form, for any group of anagrams. This means that all anagrams in a set will have the same canonical representation which can be generated by sorting the letters of any word in that set. Also we can use this as a unique key for a set of anagrams. Any other set of anagrams will have a different key which is again unique for that set.

Note

Anagrams contain the same letters with the same frequency for each letter. Sorting the letters of a word typically requires a bit more processing than calculating the frequencies of the various letters in a word. But we will stick with the sorting approach since it fits well into the optimized, overall solution that we will come up with next.

Preprocessing gets us there !

A powerful way to speed up searches is to preprocess the data and store it in such a form that specific queries can be faster. In general, this is called indexing. Effectively, the data is preprocessed so that results for certain types of queries are stored in a data structure that allows lookups to be very fast. A hash map is often a suitable and powerful data structure for this purpose.

Note

A lookup is usually a direct query using only a key, and that can return results very fast, e.g. if the keys and their associated values are stored in a hash map data structure. A search is a broader function where the query could have multiple parts and results are produced by combining results from several subqueries (that often involve lookups) and then ranking by relevance. The ranking is important as inexact or fuzzy matches can also be returned in the results.

Based on the fact that a unique key can be easily derived for a set of anagrams, as disussed in the last section, we could process the dictionary file once and populate a hash map where the key is the sorted word or the canonical form, and the value is a list of words which have the same canonical form. Thus each value list is a set of anagrams.

If a word has no anagrams, then it will have only a single entry in the list for its key and that entry will be the word itself. So after processing the dictionary and populating the hash map, all keys mapping to a value list with a single item can be removed from the hash map for efficieny.

The Algorithm

In summary, the algorithm is as follows :-
1. Initialize a Hash Map: A hash map (a dict in Python) is chosen for its average O(1) time complexity for key-value lookups.
2. Process the Dictionary: Iterate through every word in the dictionary.
3. Populate Hash Map: For each word, generate its canonical key by sorting its characters. Use this key to store the word in the hash map. The map will associate each canonical key with a list of all dictionary words that produce it.

The result is a hash map where each key points to a complete anagram group.

Example of the final data structure:

{
  "eilnst": ["listen", "silent"],
  "opst": ["stop", "post", "spot", "tops"],
  "ader": ["read", "dear", "dare"]
}

PreprocessingCode

The preprocessing program written in Python is available here. It takes the dictionary file containing one word per line, e.g. words.txt as input and populates the hash map described above.
As a final step, the populated hash map is serialized and written out as the json file anagram_data.json. This can be loaded by an app and used subsequently for finding anagrams for a given word.

Note

The preprocessing is a one-time activity. As long as the dictionary does not change, there is no need to re-run the preprocessing program to regenerate the json file.

The App Widget

This json file anagram_data.json can be loaded by an app and used to answer user queries very fast. All that the app has to do for an anagram query is to sort the word to generate the canonical key, lookup the loaded map using this key and return the anagrams from the value list. While listing the results, we could choose to omit the word itsef from the list of anagrams.
If the key is not found, then the word does not have any anagrams in the dictionary and a message can be displayed to indicate that to the user.
The Javascript widget provided at the start of this post implements this functionality. It loads the anagram_data.json produced by the preprocessing Python program initially.