A brute force anagram finder, a teaching tool, and a one gram pun. For more information, please consult Lisa Simpson.
A number of people with whom I work use the Internet Anagram Server, and apparently we are not alone. It seems always to be down or overwhelmed by requests. To minimize its CPU usage, it works with a small list of common words. I'm interested in more interesting words.
For example, embrace inclusivity is an anagram of unmiscible veracity, cystic nerve bulimia, and a few other phrases --- but who knew?
As a member of University of Richmond's professional staff, I am in continual need of classroom and workshop examples that show the solutions, complete or approximate, to classic computer science problems. Because it is difficult to find such examples in the wild, I usually write my own.
I take this moment to credit and thank my former intern Alina Enikeeva, now a consulting engineer with RTS Labs, for adding convenience iterators to the underlying tree data structure, and class of 2025 intern Skyler He for her extensive proofreading of the code and development of test cases.
Anagrams are an example of an exact cover problem, as described in Knuth's The Art of Computer Programming, Combinatorial Algorithms, Part 2, pages 66 ff. This book is often referred to as simply "4B." An exact cover (alternatively: perfect cover) is a collection of non-empty, disjoint sets whose union is the target set we are trying to "cover." Knuth calls the subsets options, and the elements of each option items. Thus, no item occurs more than once in an option, and no item appears in more than one option in any single cover.
One of the subtle and confusing complications of anagrams is that an item (letter) may appear more than once in an option (word). For a cover (phrase) to be considered complete, it must use all instances of a repeated item. In the example above, embrace inclusivity has three items whose value is i, two whose value is c, and two whose value is e.
We are trying to cover this set:
['a', 'b', 'c', 'c', 'e', 'e', 'i', 'i', 'i', 'l', 'm', 'n', 'r', 's', 't', 'u', 'v', 'y']
not this one
['a', 'b', 'c', 'e', 'i', 'l', 'm', 'n', 'r', 's', 't', 'u', 'v', 'y']
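Python's collections.Counter makes the multiset requirement easy to see; the snippet below is purely illustrative and not part of anagrammar:

from collections import Counter

def letters(phrase: str) -> Counter:
    """Count letters only, ignoring spaces and case."""
    return Counter(c for c in phrase.lower() if c.isalpha())

target = letters("embrace inclusivity")
print(letters("cystic nerve bulimia") == target)   # True: same letters, same multiplicities
print(letters("unmiscible veracity") == target)    # True
print(sorted(target.elements()))   # the 18-item multiset listed above
print(sorted(target))              # only the 14 distinct letters, which is not enough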
The approach taken in this program is a modified implementation of Knuth's Algorithm X, described in detail in this 2000 paper, and in summary form in the Wikipedia article. Algorithm X is an example of backtracking, an important class of algorithms that is summarized here.
Anagrammar's approach is straightforward exploitation of Algorithm X: each option (word) we examine is removed as we go, and we are left with a smaller exact cover problem. If we are eventually left with a null set, then we have found an exact cover (anagram). The options are construed to be words in some dictionary, and if our residual (Knuth's term for the remainder) cannot be spanned by the remaining options, then this branch was a dead end, and we go back to consider other options.
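In outline, the reduction can be sketched with nothing more than letter counts. The function below is a toy version of that description, with hypothetical names; it is not the code in anagrammar.py:

from collections import Counter

def covers(residual: Counter, words: list[str], partial: tuple = ()) -> None:
    """Print every exact cover (anagram) of residual built from words."""
    if not residual:                        # Counter subtraction drops zero counts; empty means covered
        print(" ".join(partial))
        return
    for i, word in enumerate(words):
        need = Counter(word)
        if need & residual == need:         # the word fits inside the residual
            covers(residual - need, words[i:], partial + (word,))
        # otherwise this branch is a dead end and we backtrack

covers(Counter("georgeflanagin"), ["age", "fearing", "flag", "gag", "long", "region"])
# prints: age fearing long

Passing words[i:] rather than the whole list keeps each cover from being reported once per ordering of its words.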
I borrowed the lowest level representation from Martin Schweitzer (@martinschweitzer) by assigning the letters of the alphabet to the 26 smallest prime numbers. This legerdemain eliminated the string operations, and the entire analysis is reduced to integer arithmetic --- something that computers are known to be good at. If the approach is unfamiliar, no more than five minutes spent reviewing the fundamental theorem of arithmetic will clear it up.
In the anagrammar.py file, the mapping is represented this way:
primes26 = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101 )
primes = dict(zip("eariotnslcudpmhgbfywkvxzjq", primes26))
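With that mapping, a word or phrase becomes the product of its letters' primes, and the fundamental theorem of arithmetic guarantees that a word can be spelled from a phrase's letters exactly when the word's product divides the phrase's product. A small illustration (the value() helper is my own name, not necessarily the one in the source):

import math

primes26 = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
            43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101)
primes = dict(zip("eariotnslcudpmhgbfywkvxzjq", primes26))

def value(text: str) -> int:
    """Map a word or phrase to the product of its letters' primes."""
    return math.prod(primes[c] for c in text.lower() if c.isalpha())

phrase = value("embrace inclusivity")     # 695823563878969513260
word = value("race")                      # 870 = 5 * 3 * 29 * 2
print(phrase % word == 0)                 # True: 'race' can be spelled from the phrase
print(phrase // word)                     # the residual left to cover after using 'race'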
The ordering of the letters in alignment with the frequency of letters in English reduces the magnitude of the composite numbers that represent options, although experience has shown that because the factors of the large composite numbers are comparatively small, the divmod operation goes quickly even with a less optimized mapping. For example, our test phrase (embrace inclusivity) maps to 695823563878969513260, which is greater than 2^69, whereas a word like race, which might be a component of an anagram, maps to only 870, not even 2^10.
The above ordering goes one step further, as it is an empirical analysis of the frequency of letters in the Linux dictionary rather than some corpus of English. We need not concern ourselves with how common a word is, only that it appears exactly once in any single dictionary.
The above mapping is optimized for each list of words when the dictbuilder is used. The resulting dictionary contains not only the words and their values, but also the optimum assignment of each letter to a prime number, so that the resulting values are as small as possible.
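One way to picture that optimization: count the letter frequencies in the word list itself and hand the smallest primes to the most frequent letters. The sketch below approximates the idea and is not dictbuilder's actual code:

from collections import Counter

primes26 = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
            43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101)

def optimized_mapping(words: list[str]) -> dict[str, int]:
    """Assign the smallest primes to the letters that occur most often in this word list."""
    counts = Counter(c for w in words for c in w.lower() if c.isalpha())
    by_frequency = [letter for letter, _ in counts.most_common()]
    leftovers = [c for c in "abcdefghijklmnopqrstuvwxyz" if c not in counts]
    return dict(zip(by_frequency + leftovers, primes26))

print(optimized_mapping(["race", "care", "acre", "embrace", "inclusivity"]))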
Of course, anagrams differ from the basic perfect cover problem in other material ways. The first is that we are not looking for just one cover/anagram --- we want to find them all. The second is that nearly every combination of two and three letters has been used as an acronym for something in English, for example, TLA -- Three Letter Acronym. We must be careful to exclude some of these non-word words for the results to be amusing.
When you read about Algorithm X you will notice that Knuth starts out by saying that one should consider the smallest option first, and from several such options all of equal size, the selection can be made arbitrarily. The implementation choice to represent each option as a composite integer that stands for a dictionary word has the advantage that it is easy (and deterministic) to start with the "smallest" option, and easy to sort the options by size. Unlike Knuth, we never need to choose among options of equal size: because distinct options map to distinct integers, the options are strictly ordered, and there are never ties to break.
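Because every option is an integer, smallest-first is nothing more than a numeric sort, and distinct options can never tie. A toy illustration with made-up table entries:

phrase_value = 695823563878969513260        # value("embrace inclusivity") from the sketch above

# Hypothetical table entries: each key is the product of its words' letters.
word_table = {870: ("race", "care", "acre"),
              78: ("eat", "tea", "ate"),
              30: ("are", "ear", "era"),
              21571: ("dog", "god")}

# Keep only the options that still fit the phrase, smallest first.
candidates = sorted(k for k in word_table if phrase_value % k == 0)
print(candidates)                            # [30, 78, 870]; 21571 drops out, the phrase has no d, o, or g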
Another difference between covers and anagrams is that within any anagram, several words may map onto the same option. These words are self-anagrams --- any anagram that uses one of the words can freely use any of the others. For example, the title of the well-known album by Miles Davis, Live-Evil, contains a pair of such words. In our assignment of prime numbers to letters, both words map to the integer 25438 (evil -> 2 * 79 * 7 * 23), along with the words veil, vile, and levi. What unlikely bedfellows.
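The grouping falls out naturally: words spelled from the same letters multiply out to the same integer, so the dictionary keys them together. A sketch of that bucketing (not dictbuilder's code, but the same shape of data it produces):

import math
from collections import defaultdict

primes = dict(zip("eariotnslcudpmhgbfywkvxzjq",
                  (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                   43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101)))

def group_by_value(words: list[str]) -> dict[int, tuple]:
    """Bucket self-anagrams together: one composite integer, several spellings."""
    table = defaultdict(tuple)
    for w in words:
        table[math.prod(primes[c] for c in w)] += (w,)
    return dict(table)

print(group_by_value(["evil", "live", "veil", "vile", "levi", "race", "care", "acre"]))
# {25438: ('evil', 'live', 'veil', 'vile', 'levi'), 870: ('race', 'care', 'acre')}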
anagrammar.py --- The main program. The core is a recursive function that implements Algorithm X.
dictbuilder.py --- Converts a white-space delimited file of "words" to a pickle of a dict where the keys are integers and the values are tuples of the words that map to the key. The output also contains a pickle that represents the mapping of letters to prime numbers.
urdecorators.py --- Contains the @trap decorator to assist with debugging.
sloppytree.py --- A general purpose data structure derived from Python's dict that creates a flexible n-ary tree.
urlogger.py --- A wrapper around Python's logging module that makes it quite a bit easier to use, or at least harder to make mistakes.
Unfortunately for anagrams, exact cover problems are difficult to attack with parallel programming methods. There is no clear way to partition the options (words) into subspaces (dictionaries) such that we can be assured that we do not omit covers (anagrams) that use options from multiple spaces.
Of course, the space being covered grows rapidly as the phrase we are analyzing grows longer. Accepting Knuth's matrix representation, the number of columns in the matrix is equal to the number of letters in the phrase, and the number of rows is equal to the number of words from the dictionary (reduced by the live-evil-vile-veil-levi mapping noted earlier) that can be formed from the letters in the phrase. The number of rows approaches the number of words in the dictionary as the phrase becomes longer and every letter of the alphabet is present.
Anagrammar does some pruning with its backtracking, identifying dead ends as it goes; thus, no terminal is evaluated more than once. Remembering those dead ends is a form of memoization, although constantly searching an ever growing pile of them adds some overhead. The pruning could be slightly improved, although how much it could be improved is an open question.
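One way to picture the dead-end bookkeeping: every residual is itself an integer, so a residual that has been searched without success can simply be remembered in a set. The sketch below is hypothetical and only decides whether a cover exists, whereas anagrammar enumerates them all:

dead_ends = set()           # residual values already proven unfinishable

def coverable(residual: int, options: list[int]) -> bool:
    """Return True if residual can be exactly covered; remember failures."""
    if residual == 1:                       # nothing left to cover
        return True
    if residual in dead_ends:               # already known to be hopeless
        return False
    for opt in options:
        if residual % opt == 0 and coverable(residual // opt, options):
            return True
    dead_ends.add(residual)
    return False

print(coverable(870 * 78, [30, 78, 870, 21571]))   # True: 'race' (870) plus 'eat' (78)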
NB The following explanation conflates the length of words with the value of the integer to which the words are mapped. Hopefully, the explanation is still clear.
The most important pruning technique is this one. Suppose we are looking for anagrams for george flanagin. Starting with the short words, we find age long fearing when using age as a starting point. When we get to four letter words, we examine long. There is no need to look at any word shorter than long, because all anagrams involving age have already been found when we used age as the starting point. There are a few improvements that can be added at the cost of a good bit of code complexity.
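That rule can be written as a floor carried through the recursion: once a starting word has been fully explored, deeper calls never consider anything smaller. Another hypothetical sketch, with integer values standing in for words:

def anagrams(residual: int, options: list[int], floor: int = 0, partial: tuple = ()):
    """Yield every cover of residual, never revisiting options below floor.
    options must be sorted ascending so each cover is produced exactly once."""
    if residual == 1:
        yield partial
        return
    for opt in options:
        if opt < floor:                     # anagrams using opt were found when opt was the start
            continue
        if residual % opt == 0:
            yield from anagrams(residual // opt, options, opt, partial + (opt,))

for combo in anagrams(870 * 78, sorted([30, 78, 870])):
    print(combo)                            # (78, 870), reported once despite two possible orderings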
A standard distro of Python. This program does not use external libraries like numpy, and the source is cut into a few files simply to give it a straightforward organization.
A dictionary. If you do not like the provided dictionaries (taken from Webster's Second International Dictionary), you can use the dictbuilder.py file to create your own. If you run the program on Linux or Mac OS, the program should find the system dictionaries without your having to do anything extra. I have been told that Windows is a BYODictionary experience.
usage: anagrammar [-h] [-d DICTIONARY] [-m MIN_LEN]
                  [--nice {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}]
                  [-t CPU_TIME] [-v VERBOSE] [-z]
                  phrase

A brute force anagram finder.

positional arguments:
  phrase                The phrase. If it contains spaces, it must be in quotes.

options:
  -h, --help            show this help message and exit
  -d DICTIONARY, --dictionary DICTIONARY
                        Name of the dictionary of words, or a pickle of the dictionary.
  -m MIN_LEN, --min-len MIN_LEN
                        Minimum length of any word in the anagram. The default is 3.
  --nice {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19}
                        Niceness may affect execution time. The default is 7,
                        which is about twice as nice as the average program.
  -t CPU_TIME, --cpu-time CPU_TIME
                        Set a maximum number of CPU seconds for execution.
  -v VERBOSE, --verbose VERBOSE
                        Set the logging level on a scale from 10 to 50. The
                        default is 35, which only logs errors.
  -z, --zap             If set, remove old logfile[s].
How about george flanagin as the phrase? First, source the anagram.bash file, and you will then be able to just type anagram to run the program. The output will appear on the screen, which is not always useful if there are a lot of anagrams of your phrase.
Here is what the output looks like using the (provided) dictionary of the 30000 most common English words when we route the anagrams to gkf.txt.
anagram -d 30000. -m 3 georgeflanagin > gkf.txt
--cpu-time 60 --dictionary 30000. --min-len 3 --nice 7 --phrase georgeflanagin --verbose 35 --zap False
0.78user 0.01system 0:00.87elapsed 92%CPU (0avgtext+0avgdata 39204maxresident)k
17624inputs+1064outputs (90major+8194minor)pagefaults 0swaps
In the gkf.txt file, the most interesting anagram to me is long fearing age.
You will notice in the output that the results are written this way, one anagram per line, with the substitutions ('enon', 'neon', 'none') elaborated.
(('fragile',), ('gag',), ('enon', 'neon', 'none'))
This allows for a more compact representation than something like:
('fragile', 'gag', 'enon')
('fragile', 'gag', 'neon')
('fragile', 'gag', 'none')
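If you want every combination spelled out anyway, the grouped form expands mechanically with itertools.product; a possible post-processing snippet, not part of the program:

from itertools import product

grouped = (('fragile',), ('gag',), ('enon', 'neon', 'none'))

# One word from each group gives one concrete spelling of the same anagram.
for combo in product(*grouped):
    print(" ".join(combo))
# fragile gag enon
# fragile gag neon
# fragile gag none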
The assumption in dictbuilder.py is that the input is a file of words, white space delimited. You type in something like:
python3 dictbuilder.py [-b] -i /path/to/your/file nameyouwanttouse
The result will be a file named nameyouwanttouse.numbers that you can reference in your calculation as nameyouwanttouse. If you use the -b option, the dictionary will ignore the prebuilt lists of 3, 4, and 5 letter words, and use exactly what is in the dictionary that you provide.
This is why God gave you the ability to fork the repo. The dictionaries are built and read entirely within the file dictbuilder.py, so you can change this. There are only about 100 lines of code that do the work.
OK, first I have heard worse and I have committed much bigger sins. Recently. My best professional friend once described my programming as "entirely adequate." A written performance review once included the words "the only thing George knows how to do is write a compiler." Both statements are probably correct.
If it bothers you, you can experiment with AutoPEP8. It is excellent.