Table of Contents
- Overview
- Construction Benchmarks
- Modifying Benchmarks
- Accessing
- Conclusion
I’ve spent a long time developing my `robin_hood::unordered_map`, and after claiming that it is now the fastest hashmap I understandably got quite a few skeptical comments. Some of the comments were quite right: my benchmarks were not as unbiased as they could be, I did not test as many unordered maps as I should have, my compiler options were not chosen well, and so on.
That’s why I have now spent considerable time creating a highly improved benchmark, in which I have tried to remedy most of the critique that I got. The results are not as flattering for my `robin_hood::unordered_map`, but I am still very pleased with them.
What is actually Benchmarked?
This benchmark evaluates 20 different `unordered_map` implementations, each with 5 different hash implementations, so there are a total of 20 × 5 = 100 hashmap variants. Each of these 100 hashmaps was evaluated in 10 different benchmarks, for 1000 benchmark evaluations in total. I ran each benchmark 9 times and show the median, to filter out any outliers. So in total I ran 9000 benchmarks, which took about 6 days on my Intel i7-8700 at 3200 MHz. To get highly accurate results, I isolated a core for benchmarking only, and disabled all frequency scaling.
Hashmaps
- Google’s Abseil: `absl::flat_hash_map` and `absl::node_hash_map`. They are brand new, and have just recently pushed the boundary of what’s possible to achieve for unordered maps. They use several interesting optimizations, described in the CppCon 2017 talk by Matt Kulukundis, “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”.
- boost multiindex: `boost::multi_index::hashed_unique`. Boost.MultiIndex is a versatile container that is highly configurable; its main feature is not speed but its versatility. It is not a straightforward `std::unordered_map` replacement; the implementation for the wrapper was thankfully provided by joaquintides.
- Boost’s unordered map: `boost::unordered_map` is very similar to `std::unordered_map`, just Boost’s (older) version from before `std::unordered_map` was a thing. I’ve tested with Boost version 1.65.1.
- EASTL: `eastl::hash_map`. The Electronic Arts Standard Template Library, an STL implementation with an emphasis on high performance. It seems to be a bit dated, though.
- Facebook’s folly: `folly::F14ValueMap` and `folly::F14NodeMap`. C++14 conformant and designed with high performance in mind. The maps are described in the F14 Hash Table document.
- greg7mdp’s parallel hashmap: `phmap::flat_hash_map` and `phmap::node_hash_map` are closely based on Abseil’s maps, but simpler to integrate since they are header-only. `phmap::parallel_flat_hash_map` and `phmap::parallel_node_hash_map` use a novel improvement that makes the maps a tad slower but usable in parallel. Also, peak memory requirements are a bit lower. Read more in “The Parallel Hashmap”.
- greg7mdp’s sparsepp: `spp::sparse_hash_map`, tuned to be memory efficient.
- ktprime’s HashMap: A rather unknown implementation, `emilib1::HashMap` by /u/huangyuanbing. It might not be as stable and well tested as the other implementations here, but the numbers look very promising.
- martinus’s robin-hood-hashing: A single-file, header-only implementation that contains `robin_hood::unordered_flat_map` and `robin_hood::unordered_node_map`. I am the author of these maps, so I might not be perfectly unbiased… The numbers won’t lie though, and I try to be as objective as possible.
- Malte Skarupke’s Bytell Map: After first claiming “I Wrote The Fastest Hashtable” and later “A new fast hash table in response to Google’s new fast hash table”, his maps `ska::flat_hash_map` and `ska::bytell_hash_map` are obvious choices for this benchmark.
- std::unordered_map: Of course, the standard implementation of `std::unordered_map` has to be included as well. Since I am using g++ 8.2, this uses the libstdc++ implementation.
- tessil’s maps: Tessil has done lots and lots of work on hashmaps, in all kinds of flavours. Here I am benchmarking `tsl::hopscotch_map`, `tsl::robin_map`, and `tsl::sparse_map`. They are all available on GitHub.
Hashes
Some hashmap implementations come with their own hashing methods, each with different properties. In my benchmarks I have used integral types like `int` and `uint64_t`, as well as `std::string`, as the keys.
- Abseil’s Hash `absl::Hash`: An extremely fast hash that works very well in all situations I have tested.
- FNV1a: A very simple hash that is used by Microsoft in Visual Studio 2017. Interestingly, they even use this byte-wise hash for integral types. My benchmark has its own implementation, but in my experiments it has produced the same assembler code as the original Microsoft variant.
- Folly’s Hash `folly::hasher`: Unfortunately I could not find any documentation. It seems to be well optimized and uses the native CRC instruction if available. Unfortunately the result is only a 32-bit hash, which can work badly for some hashmap variants.
- libstdc++-v3: Simply casts integral types to `size_t` and uses this as the hash. It is obviously the fastest hash, but many hashmap implementations rely on a reasonably good avalanching hash, so this seems to be a rather bad choice.
- martinus’s robin-hood-hashing: `robin_hood::hash` is based on Abseil’s hash for integral types, with minor modifications.
How is it Benchmarked?
Build
- I’ve used g++ 8.2.0 with `-O3 -march=native`: `g++-8 (Ubuntu 8.2.0-1ubuntu2~18.04) 8.2.0`
- The CMake build is done in `Release` mode.
- I’ve set `FOLLY_CXX_FLAGS` to `-march=native`.
- For the `ktprime` map benchmarks I had to add `-fno-strict-aliasing`.
System Configuration
- All benchmarks were run on Linux. `uname -a` output: `Linux dualit 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux`
- Processor: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, locked to 3200 MHz.
- I isolated a core together with its hyperthreading companion by editing `/etc/default/grub` and changing `GRUB_CMDLINE_LINUX_DEFAULT` so it looks like this: `GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=5,11 rcu_nocbs=5,11"`
- Turbo Boost and frequency scaling were disabled with the Python tool perf, using the command `sudo python3 -m perf system tune`. This sets cores 5 and 11 to 3200 MHz, sets the scaling governor to performance, disables Turbo Boost, sets the irqbalance service to inactive, and sets IRQ affinity to all CPUs except 5 and 11.
- Each benchmark is run in a separately started process.
- The isolated cores are used with `taskset -c 5,11`.
- To get rid of any potential outliers and to average out the effects of ASLR, all benchmarks were run 9 times and I show only the median result.
Benchmarks
Enough talk, onwards to the benchmarks!