Using the Right Datastructure for the Job
15 May 2015

Searching for a key
The process of choosing the right data structure is one that few will ever think about, one that most take for granted. In my experience, this lack of curiosity, and more importantly, an aversion to experimenting and benchmarking, can lead to unnecessary overhead. Let’s take an example. One of the most common operations in any software I’ve seen written is finding a key and acting on the result, whether that means checking that it exists, checking that it doesn’t, or retrieving the associated value. We will narrow this down to finding whether the key exists or not, for simplicity’s sake.
Additionally, for the same purpose, let’s limit the key type to int. int is probably the most common and well-understood datatype in the language, and very likely used as an ‘id’ type in many different applications.
Alright, so we have an application that has integer ids, perhaps for some customer identification system, and we’re building a cache for it. Hitting or missing the cache answers the simple question “Should we query the database?” Now, the obvious datastructure that pretty much every computer science major would choose here is the hashmap, which in C++ is now called the unordered_map. Why? Because Big O notation tells us the runtime of querying a hashmap is constant. What other options do we have?
- A vector, which under the hood is an array, aka contiguous memory.
- A tree, which can give us sorted traversal, but has a lot of pointers and possibly page faults.
- A linked list.
There’s more, but I’m going to stop there. Why would we choose anything else when Big O tells us a hashmap is good enough? Sometimes Big O behavior is not indicative of our application; sometimes we care about the constant factors at small n. What do I want you to take away from this? Measure Everything. Get into the habit of measuring tasks, as it’s the only way to get “true” answers about performance on your machine. Some people might say, “Use a vector! We can optimize cache lines that way!” Well, the truth is, I don’t know which is the best one, so let’s hack something together and find out.
Key generation
Our key supplier gives us a predictable range of keys in a uniform distribution; a sketch follows below. We can draw from dis as much as we want, and be confident that as the number of keys grows, the keys will fall in line with a uniform distribution.
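A minimal sketch of such a supplier might look like the following. The Mersenne Twister engine, the random seed, and the keyspace cap of 10000 (the limit used later in the results) are assumptions made for illustration:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Key supplier: draws keys uniformly from a bounded keyspace.
std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution<int> dis(0, 10000);

// Draw n keys from the uniform distribution.
std::vector<int> generate_keys(std::size_t n) {
    std::vector<int> keys;
    keys.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        keys.push_back(dis(gen));
    return keys;
}
```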
Here we populate the same keys into both the map and the vector, and into any other data structures we’d like to measure. This gives us comparable performance numbers, as well as a way to check the correctness of our code: did the same queries give us the same results for both data structures?
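A rough sketch of the population step, assuming an unordered_map and a plain vector as the two containers (using the key itself as a dummy value is an assumption; any value works when we only care about existence):

```cpp
#include <unordered_map>
#include <vector>

// Insert the same generated keys into both containers so that the
// subsequent query runs are directly comparable.
void populate(const std::vector<int>& keys,
              std::unordered_map<int, int>& map_index,
              std::vector<int>& vector_index) {
    for (int key : keys) {
        map_index[key] = key;
        vector_index.push_back(key);
    }
}
```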
Building Queries
This is important. We are generating the queries beforehand and storing them (a sketch follows the list below). This accomplishes several things:
- The cost of generating a query is not baked into the measurement of querying.
- The same queries can be executed on both the map and the vector.
- If we wanted, the queries could be manipulated in several ways (sorted, shuffled, etc.).
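A sketch of the query builder, under the same assumptions as the key supplier above; the query count is left as a parameter rather than a fixed number:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Build every query up front so the cost of drawing a random number is not
// baked into the timed query loop. Queries come from the same uniform
// distribution as the keys.
std::vector<int> build_queries(std::size_t num_queries,
                               std::mt19937& gen,
                               std::uniform_int_distribution<int>& dis) {
    std::vector<int> queries;
    queries.reserve(num_queries);
    for (std::size_t i = 0; i < num_queries; ++i)
        queries.push_back(dis(gen));
    return queries;
}
```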
Querying
The task here is simple: serially iterate through the queries and check whether each one exists in the index. Time how long it takes to run through all the queries and do this check, then output the results. The code for the vector is quite similar so I won’t include it here, but I’ll post the source.
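A sketch of the timed query loop for the unordered_map (the function name and output format are assumptions; the vector version is the same loop with std::find over the contiguous keys instead of count()):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <vector>

// Serially run every query against the map index and time the whole batch.
std::size_t run_queries(const std::unordered_map<int, int>& index,
                        const std::vector<int>& queries) {
    std::size_t hits = 0;
    auto start = std::chrono::steady_clock::now();
    for (int q : queries)
        hits += index.count(q);  // 1 if the key exists, 0 otherwise
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << queries.size() << " queries, " << hits << " hits, "
              << elapsed.count() << "s\n";
    return hits;  // returning the count also keeps the loop from being optimized away
}
```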
Results
It seems our computer science professors taught us well. A hashmap is the way to go. Or is it? The experiment had several parameters, one of which we varied: the number of keys. I kept the keyspace limited to 10000 for the purpose of clarity.
| Number of Keys | Unordered Map Query Time (s) | Vector Query Time (s) |
|---|---|---|
| 5 | 0.092408 | 0.048035 |
| 10 | 0.095659 | 0.067716 |
| 20 | 0.100039 | 0.130688 |
| 50 | 0.087938 | 0.254695 |
| 100 | 0.081867 | 0.479975 |
| 200 | 0.098481 | 0.888507 |
| 400 | 0.096329 | 1.73241 |
| 800 | 0.098282 | 3.29157 |
| 1600 | 0.097052 | 6.12024 |
| 3200 | 0.093558 | 10.463 |
| 6400 | 0.093106 | 16.1336 |
For almost every experiment, the unordered map beats out the vector, sometimes by several orders of magnitude. The vector does win out when you have a very small number of keys (< 10). Imagine a scenario where you have an application that is distributing queries to a cluster of machines, where the number of machines is quite low and the number of queries is very high. A vector might be the right choice there over the unordered map. But for almost everything else, use a hashmap.
Compound Keys
There will come a time in every software engineer’s life when they’ll want to combine two keys to make a single key. An application might need to know, given a domain id, whether a user id exists. In a library application, we might need to key off the author and book title. Whether there are values involved or not is not as important. This circumstance has happened to me several times, and by default, I chose the option that was easiest to think of. More often than not, this isn’t the best option. Better would be to measure and understand the tradeoffs between several options.
Datastructures
The solution I defaulted to was creating a compound key and a makeshift hasher for it; the compound key and hasher would then be used to create the unordered_map.
Another potential solution splits up the two keys: the first becomes the key of the unordered map, and the second lives inside the value, in the form of a vector of (key 2, value) pairs.
There are other solutions: trees, stacks, linear-probing maps, etc. But for the purpose of demonstrating the process rather than the answer, I kept it to just these two. In the following sections, I’ll show code snippets of the hashing, key generation, query generation, and querying all in one go, since the idea is much the same as before. The one thing I’ll explain is how to build a map with a compound key, aka creating a hash function. Everything below is self-contained and can be compiled and run.
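For concreteness, the two layouts can be written as type aliases along the following lines (int keys and int values are assumptions carried over from the first experiment):

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Option 1: a compound key, hashed as a whole. Building an actual map of this
// type needs the std::hash specialization for std::pair shown in the next section.
using CompoundKey = std::pair<int, int>;                  // (key 1, key 2)
using CompoundMap = std::unordered_map<CompoundKey, int>;

// Option 2: hash only the primary key; the secondary key lives in the value
// as a vector of (key 2, value) pairs that we scan linearly.
using SplitMap = std::unordered_map<int, std::vector<std::pair<int, int>>>;
```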
The Code
The hashing function was stolen from Boost, which provides a decent, generalized (templated) way to combine hash functions of as many keys as you want. Here, since we’re only hashing a compound key, we call ::hash_combine on the first and second parts of the pair, and specialize std::hash on that pair, so we can build unordered_maps with pairs as keys for any type that already has a std::hash specialization. Below is a repeat of the hashing functionality.
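The following is a sketch of that functionality rather than the exact original listing: the mixing constants match Boost’s hash_combine, while the surrounding structure is an assumption about how the code was laid out.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Boost-style hash_combine: folds the hash of v into an existing seed.
template <class T>
inline void hash_combine(std::size_t& seed, const T& v) {
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Specialize std::hash for std::pair so unordered_map accepts a pair as a key.
// Works for any pair whose parts already have a std::hash specialization.
namespace std {
template <class A, class B>
struct hash<std::pair<A, B>> {
    std::size_t operator()(const std::pair<A, B>& p) const {
        std::size_t seed = 0;
        ::hash_combine(seed, p.first);
        ::hash_combine(seed, p.second);
        return seed;
    }
};
}

// With the specialization in place, a compound-key map compiles as-is.
std::unordered_map<std::pair<int, int>, int> compound_map;
```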
Results
The 3D scatter plot shows the two independent variables on the x and y axes, and the output variable on the z axis.
- X: the number of primary keys in the data structure
- Y: the number of secondary keys for each primary key
- Z: the real time to query each data structure for 1 million queries
The red points in the scatter plot are the measurements for std::unordered_map<compound key, value>, whereas the blue points represent the timings for the unordered_map<key 1, vector<key 2, value> >. For the majority of cases, the unordered_map<key 1, vector<key 2, value> > beats out the std::unordered_map<compound key, value>, defying the Big O analysis, which would predict constant-time querying for the compound key map.
Analysis
Having a mediocre hash function can crush performance
Using a compound key forces us to come up with an ad-hoc hash function. While the hash functions for primitive types such as integers tend to be uniform and well studied, those of combined forms most likely are not. We stole Boost’s ::hash_combine implementation, yet our performance still degraded. Bad hash functions lead to collisions, and collisions add extra links to the hashmap’s internal linked lists (see how a hashmap handles collisions under the hood). Navigating linked lists leads to a lot of pointer lookups, aka more cache misses.
On the flip side, the unordered_map<key 1, vector<key 2, value> > has predictable lookup performance, since the hashing is only applied to a single integer instead of a combination of two. Once we find the vector, scanning through it is fast because of locality of reference.
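A lookup in the split layout, under the same assumptions as before, might look like this: one integer hash, followed by a cache-friendly linear scan of a contiguous vector.

```cpp
#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>

using SplitMap = std::unordered_map<int, std::vector<std::pair<int, int>>>;

// One hash on the primary key, then a linear scan for the secondary key.
bool contains(const SplitMap& index, int key1, int key2) {
    auto it = index.find(key1);
    if (it == index.end())
        return false;
    const auto& pairs = it->second;
    return std::any_of(pairs.begin(), pairs.end(),
                       [key2](const std::pair<int, int>& p) { return p.first == key2; });
}
```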
Main takeaway
The point of this exercise is not to tell you which data structure to use, but to hopefully lead you to answer these questions on your own, depending on the application. This experiment had many assumptions baked in, like limiting the key space to an arbitrary value or randomizing the queries. These are not necessarily indicative of your own application - it is up to you to decide which parameters to vary and how to build an experiment as close to reality as possible. Perhaps your key space is limited to under 1000. Or maybe there will be many secondary key lookups for the same primary key in sequence.
This sort of exercise is useful for every engineer, and vital for building low-latency applications. Hope this helps! Happy benchmarking.