Two Word Anagram Finder Algorithm (in Ruby)
Today I have got some sourcecode for you. There is a little programming challenge named The Self-Documenting Code Contest that is quite fun, they try to find the cleanest and easiest to read code for this task:
Write a program that generates all two-word anagrams of the string “documenting”. Here’s a word list you might want to use: wordlist.zip.
When you’re done, send the results to selfdocumenting@hotmail.com.
Good luck!
So this caught my interest and i wrote a little entry in Ruby that is 23 lines long with whitespace and very nice to read. But I won’t show you this code until the contest is over, and this is not the reason for this post. The reason is, that the nice version takes about 2 seconds, and somebody else has coded a Python solution that takes only 1 second (I have no idea what his code looks like). This post is about a fast anagram finding algorithm, and how I developed this algorithm. The final result takes about 0.11 seconds.
Algorithm
The most basic algorithm has two phases:
- Read in the file
- Build all combinations of two words and compare the letter count with the query.
Building the combinations is usually done with two nested loops and takes O(n^2) runtime. This is slow, so I have added another step in between:
Idea #1: Filter out Candidate Words
The second step is really slow, but it would be a lot faster if it has to handle less words. So I wrote a little filtering step that lets only words through which are made out of the same letters as the query word.
For example when the query is documenting, the word men or go and even too are extracted, even if the number of letters might not match. But that’s not important, what is important is that the number of possible words are reduced a lot, and so the next phase is faster.
Idea #2: Use a Commutative Hashing Function
String comparisons are slow. To common way to find out if the strings coming with tuned is an anagram of the word documenting is to sort the letters and make a comparison, like this:
irb(main):003:0> "documenting".unpack("c*").sort.pack("c*")
=> "cdegimnnotu"
irb(main):004:0> ("coming" + "tuned").unpack("c*").sort.pack("c*")
=> "cdegimnnotu"
The strings are equal, so we have a match. But this comparison is terribly slow! What’s worse, the computations have to be redone for each match. It would be much better to just compare hash values, and find a hash function to quickly check if we might have a match, and only do the string comparison when the hash check matches. The hash has to be good enough that we don’t have too much false positives (hashes are equal but the real comparisons not) to get a speed advantage. So why not just sum up all the letters bytes?
irb(main):005:0> "documenting".sum => 1181 irb(main):006:0> "coming".sum + "tuned".sum => 1181
Ruby’s String#sum does exactly this. we can now precalculate the sum for each word, and to find a match we just add the two hashes and compare the result to the query’s hash:
irb(main):007:0> query="documenting"; first="coming"; second="tuned" => "tuned" irb(main):008:0> first.sum + second.sum == query.sum => true
When this very quick check returns true, we have to do the string comparison to be absolutely sure it is a match. This considerably speeds up the whole program, but it is still O(n^2).
Idea #3: Reformulate Problem
Now here comes the trickiest and coolest part. Since Idea #2 the slowest part is matching the numbers, with still quadratic complexity. But the hard task is not anagram finding any more, we have reduced it to finding two hashes that combined have the same hash as the query. We can reformulate this problem into something completely detached from the anagram problem:
Given a list of numbers, find all combination of two numbers that add up to a given number
When we concentrate on just this problem and ignore the rest, we might come up with a better way of doing things.
I came up with a fast solution, described below. Somebody posted a better solution that is both faster and simpler, if you want just this final solution skip ahead to Idea #4 as the following description is outdated.
It clearly looks stupid to just try all combinations to add the numbers.
So lets sort them first. Quicksort is fast, especially with numbers, so no worries here. Now consider a list of numbers like this example:
1 3 7 10 10 12 17 20 22 23 24 24 25 26 30
Find all the combinations of two number that add up to 27. They are
- 1 + 26 = 27
- 3 + 24 = 27
- 7 + 20 = 27
- 10 + 17 = 27
- 10 + 17 = 27 (a second time)
You can detect a pattern here: the first number always increases, the second number always decreases! We can now formulate an algorithm for this:
We can have two pointers to the array, one starting from the left side, the other starting from the right side. When the numbers behind the pointers add up to a bigger result than the query (e.g. 1 + 30 = 31), we decrease the right pointer to find a smaller combination (1 + 26 = 27). When the sums are too small (1 + 25 = 26), we move the left pointer to the right (3 + 25 = 28).
This way we walk through the whole array in O(n) time and the sum of the pointers is always kept as close the the desired result as possible. When the pointers meet each other, we can stop the whole process or otherwise we would just reverse the words.
This algorithm gets a bit more complicated when you consider that we might have lots of numbers in it that are equal, whenever this happens you have to fall back into an O(n^2) matching algorithm for just this section.
Idea #4: Use Hash directly
UPDATE Scrap the implementation in idea #3. A blog post here from a reader of this article posted a way to do this really in O(n), without any sorting which is O(n*log(n)). The idea is to use a hashmap that maps from the hash key of the word to its matches:
M = {}
S = the target sum
for each element e in the list
if M[S-e] exists? (e,S-e) is a pair
add e to the M
Just use a Hashmap that maps from the cummulative hash of a word to a list of words that have the same hash. Whenever a new word is added, get the list of words that is stored under query.sum - current_word.sum. When the hashes are the same we just have to create a list of all the matches under this key, and check each of the matches sequentially for equality. This is just normal hash collision handling through a linked list. That’s very simple and works like a charm.
I have revised the code, it got both simpler and faster. That’s a win-win situation, wohoo!
The Sourcecode
I hope the code is understandable now with the above explanation. If you have any questions or ideas, please share them here!
#!/usr/bin/ruby
# created by Martin Ankerl http://martin.ankerl.com/
class String
# creates an array of characters
def letters
unpack("c*")
end
end
class Array
# converts an array of letters back into a String
def word
pack("c*")
end
end
query = "documenting"
query_letters_sorted = query.letters.sort
txt = File.read('wordlist.txt').downcase
# to quickly check if a letter is part of the query word
used_letters = Array.new(256, nil)
query_letters_sorted.each do |letter|
used_letters[letter] = true
end
# Maps from cummulative hash of a word to a list of words that have this hash code.
hashToWords = Hash.new do |hash, key|
hash[key] = Array.new
end
query_hash = query.sum
prev = 0
txt_size = txt.size
separator = 10
idx = txt.index(separator, prev)
while prev < txt_size
letter_idx = prev
# no need to check end of word because it is \n
# which is not part of the word anyways
while used_letters[txt[letter_idx]]
letter_idx += 1
end
# ignore word if the above quick check fails
if letter_idx == idx
word = txt[prev, idx-prev]
# check all key matches
key = word.sum
hashToWords[query_hash - key].each do |other_word|
if (word.letters + other_word.letters).sort == query_letters_sorted
puts "#{word} #{other_word}"
puts "#{other_word} #{word}"
end
end
# insert word
hashToWords[key] << word
end
prev = idx + 1
# no need to check end of file because we have to end with new line
idx = txt.index(separator, prev)
end
When you rewrite the algorithm in C++ or Java or Python I am sure it will be faster than this one. But this is not the point of this post. The point is, “The Best Optimizer is between Your Ears” (Michael Abrash, Graphics Programming Black Book).
Have fun!
Optimized pow() approximation for Java, C / C++, and C#
I have already written about approximations of e^x, log(x) and pow(a, b) in my post Optimized Exponential Functions for Java. Now I have more
In particular, the pow() function is now even faster, simpler, and more accurate. Without further ado, I proudly give you the brand new approximation:
Approximation of pow() in Java
public static double pow(final double a, final double b) {
final int x = (int) (Double.doubleToLongBits(a) >> 32);
final int y = (int) (b * (x - 1072632447) + 1072632447);
return Double.longBitsToDouble(((long) y) << 32);
}
This is really very compact. The calculation only requires 2 shifts, 1 mul, 2 add, and 2 register operations. That’s it! In my tests it usually within an error margin of 5% to 12%, in extreme cases sometimes up to 25%. A careful analysis is left as an exercise for the reader. This is very usable for in e.g. metaheuristics or neural nets.
I use Linux, Java 1.6.0-b105 with the server VM, and execute the benchmark with this command:
sudo nice -n -20 java -cp . -server PowTest
The approximation is 27 times faster than Math.pow() on my Pentium-M. On a Pentium 4 it is 41 times faster. Unfortunately, microbenchmarks are difficult to do in Java, so your mileage may vary. You can download the benchmark PowTest.java and have a look, I have tried to prevent overoptimization while still having a low overhead.
Approximation of pow() in C and C++
double pow(double a, double b) {
int tmp = (*(1 + (int *)&a));
int tmp2 = (int)(b * (tmp - 1072632447) + 1072632447);
double p = 0.0;
*(1 + (int * )&p) = tmp2;
return p;
}
Compiled on my Pentium-M with gcc 4.1.2:
gcc -O3 -march=pentium-m -fomit-frame-pointer -fno-strict-aliasing
This version is 7.8 times faster than pow() from the standard library.
WARNING! you HAVE to use the -fno-strict-aliasing option, or this does not work!
Approximation of pow() in C#
Jason Jung has posted a port of the this code to C#:
public static double PowerA(double a, double b) {
int tmp = (int)(BitConverter.DoubleToInt64Bits(a) >> 32);
int tmp2 = (int)(b * (tmp - 1072632447) + 1072632447);
return BitConverter.Int64BitsToDouble(((long)tmp2) << 32);
}
How the Approximation was Developed
It is quite impossible to understand what is going on in this function, it just magically works. To shine a bit more light on it, here is a detailed description how I have developed this.
Approximation of e^x
As described here, the paper “A Fast, Compact Approximation of the Exponential Function” develops a C macro that does a good job at exploiting the IEEE 754 floating-point representation to calculate e^x. This macro can be transformed into Java code straightforward, which looks like this:
public static double exp(double val) {
final long tmp = (long) (1512775 * val + (1072693248 - 60801));
return Double.longBitsToDouble(tmp << 32);
}
Use Exponential Functions for a^b
Thanks to the power of math, we know that a^b can be transformed like this:
- Take exponential
a^b = e^(ln(a^b))
- Extract b
a^b = e^(ln(a)*b)
Now we have expressed the pow calculation with e^x and ln(x). We already have the e^x approximation, but no good ln(x). The old approximation is very bad, so we need a better one. So what now?
Approximation of ln(x)
Here comes the big trick: Rember that we have the nice e^x approximation? Well, ln(x) is exactly the inverse function! That means we just need to transform the above approximation so that the output of e^x is transformed back into the original input.
That’s not too difficult. Have a look at the above code, we now take the output and move backwards to undo the calculation. First reverse the shift:
final double tmp = (Double.doubleToLongBits(val) >> 32);
Now solve the equation
tmp = (1512775 * val + (1072693248 - 60801))
for val:
- The original formula
tmp = (1512775 * val + (1072693248 - 60801))
- Perform subtraction
tmp = 1512775 * val + 1072632447
- Bring value to other side
tmp - 1072632447 = 1512775 * val
- Divide by factor
(tmp - 1072632447) / 1512775 = val
- Finally, val on the left side
val = (tmp - 1072632447) / 1512775
Voíla, now we have a nice approximation of ln(x):
public double ln(double val) {
final double x = (Double.doubleToLongBits(val) >> 32);
return (x - 1072632447) / 1512775;
}
Combine Both Approximations
Finally we can combine the two approximations into e^(ln(a) * b):
public static double pow1(final double a, final double b) {
// calculate ln(a)
final double x = (Double.doubleToLongBits(a) >> 32);
final double ln_a = (x - 1072632447) / 1512775;
// ln(a) * b
final double tmp1 = ln_a * b;
// e^(ln(a) * b)
final long tmp2 = (long) (1512775 * tmp1 + (1072693248 - 60801));
return Double.longBitsToDouble(tmp2 << 32);
}
Between the two shifts, we can simply insert the tmp1 calculation into the tmp2 calculation to get
public static double pow2(final double a, final double b) {
final double x = (Double.doubleToLongBits(a) >> 32);
final long tmp2 = (long) (1512775 * (x - 1072632447) / 1512775 * b + (1072693248 - 60801));
return Double.longBitsToDouble(tmp2 << 32);
}
Now simplify tmp2 calculation:
- The original formula
tmp2 = (1512775 * (x - 1072632447) / 1512775 * b + (1072693248 - 60801))
- We can drop the factor 1512775
tmp2 = (x - 1072632447) * b + (1072693248 - 60801)
- And finally, calculate the substraction
tmp2 = b * (x - 1072632447) + 1072632447
The Result
That’s it! Add some casts, and the complete function is the same as above.
public static double pow(final double a, final double b) {
final int tmp = (int) (Double.doubleToLongBits(a) >> 32);
final int tmp2 = (int) (b * (tmp - 1072632447) + 1072632447);
return Double.longBitsToDouble(((long) tmp2) << 32);
}
This concludes my little tutorial on microoptimization of the pow() function. If you have come this far, I congratulate your presistence
UPDATE Recently there several other approximative pow calculation methods have been developed, here are some others that I have found through reddit:
- Fast pow() With Adjustable Accuracy — This looks quite a bit more sophisticated and precise than my approximation. Written in C and for float values. A Java port should not be too difficult.
- Fast SSE2 pow: tables or polynomials? — Uses SSE operation and seems to be a bit faster than the table approach from the link above with the potential to scale better when due to less cache usage.
Please post what you think about this!
Ajax Dojo Comet Tutorial
EDIT: This tutorial is for an old version of dojo / comet, and it will not work in a recent version!
Markus Holzmann, an intern at Profactor of my fellow colleague Philipp Hartl, had the opportunity to experiment with Ajax during his job. He wrote a tutorial about how to push events from the server to the client. For example, display popup messages on all browsers at the same time (see screencast in full resolution here):
Read on how Markus did this:
Erlang Syntax Highlighting
I have written a language definition file for GtkSourceView to get a nice syntax highlighting for Erlang with applications that use this component, e.g. Gnome’s standard editor gedit.
The highlighting looks like this:

Here is how to get this to work:
Read more
TextAnalyzer – Automatically Extract Characteristic Words
TextAnalzyer is a text analyzer tool that finds out words that are characteristic for a given input file. It is independent from any language, and even seems to work well with HTML files.
This program is only a little prototype, that shows that this technique seems to work. It’s public domain, feel free to do whatever you like with it:
Read more