Optimizing Open Addressing (thenumb.at)
224 points by TheNumbat on April 2, 2023 | 65 comments


Modern CPUs keep making optimal algorithms weirder. Speculative superscalar execution and the colossal gap between CPU and memory speed mean that a brute-force solution that fits in a cache line often wins over solutions that would feel more elegant or efficient.


To be fair, that’s been true for the past 20 years, if not more.


That lacks a "theoretically", then? If an algorithm is optimal but still out-performed by something that is not, then the definition of "optimal" is not so helpful and might need revision.

I do not follow computer science research at all, but I guess people are working on ways of modelling caching as it becomes (as you point out) harder and harder to ignore. Not all memory accesses cost the same.


A specific software implementation for a specific algorithm may be the optimal solution for one specific type of hardware, but not for others. Or: it doesn't make much sense to think about software performance without taking into account the specific hardware it needs to run on.

Not sure if this is at all surprising for the more academic 'computer science types', but for low-level coders it's been very obvious since pretty much forever.


And then you quickly reach the conclusion that the only way to assess whether one of the implementation choices will be faster/less memory consuming/etc. in production is to implement both, deploy to production and observe the performance on the actual workloads. And the findings don't generalize, so you can't actually learn from them and build a predictive theory.

And this conclusion doesn't sit well with quite a number of developers, myself included, because I personally would like to be able to make the "right" choice without the brute force of "try everything and see what works better".


In the theory of cache-oblivious algorithms this is modeled: for optimality, the complexity should also be independent of the block/cache size.


Yeah, and then somebody took care to actually implement and benchmark those cache-oblivious algorithms against the classic algorithms with hard-coded chunk sizes and somehow the cache-oblivious ones decisively lost in every single case [0].

[0] Kamen Yotov, Tom Roeder, Keshav Pingali, John Gunnels, and Fred Gustavson "An experimental comparison of cache-oblivious and cache-conscious programs" (2007). https://doi.org/10.1145/1248377.1248394


Thanks for the great write-up! I have one quibble with your implementation of quadratic probing though: the usual index function used for quadratic probing is start_index + (i + i^2) / 2. This is the sequence you get by adding one to your start index, then adding two the next time, then adding three, etc., so you can avoid performing any actual multiplication by just adding one to the stride on every failed probe. Furthermore, this sequence has the useful property of visiting every index once before returning to the start, if your table size is a power of 2, so you could remove a check from your inner loop.
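
For illustration, that probe loop might look something like this (my own sketch, assuming a power-of-two table that always has at least one empty slot):

    #include <cstddef>

    std::size_t probe(const bool* occupied, std::size_t size, std::size_t hash) {
        std::size_t mask = size - 1;          // size must be a power of two
        std::size_t index = hash & mask;
        std::size_t stride = 1;
        // Visits index, index+1, index+3, index+6, ... i.e. index + (i + i^2)/2,
        // which touches every slot exactly once before repeating.
        while (occupied[index]) {
            index = (index + stride) & mask;  // no multiplication needed
            ++stride;
        }
        return index;                         // first empty slot on the probe path
    }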


I was actually wondering about that - it appears a (i+i^2)/2 sequence makes insertions (and by extension erases with rehashing) 7-10% faster, which is pretty significant. Lookups and probe lengths are about the same, so I think the conclusions stand.


Nobody talks about the possibility of chained hashing, where the chains are vectors and not linked lists. After hashing the key, you chase only one extra pointer. First the pointer to the table, as usual, from which you get a pointer to the "chain", which is another cache-friendly array.

Open Addressing has an ugly problem with deletion. You can never delete anything, because any present node might be a needed stepping stone to another entry. Chained hashing has no problem with deletion: an item is just spliced out of the list. Similarly, chained hashing with vectors could do the same thing. In any given secondary vector, we know that all the entries hashed to the same slot; we don't have to leave any tombstones. We don't have to move entries to close the gap; we can move the last element of the array to the deleted position and decrement the length.

Chained hashing with linked lists has a nice property when you reorganize the table; you don't have to reallocate any memory or do a lot of writes; just reshuffle the linked list pointers to reassign the nodes to the new chains of the resized table. If chains are vectors, this becomes more of a hassle. When growing the table, we allocate new vectors for the odd chains and then move about half the items from old chains to new chains.
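
A rough sketch of the idea for an integer set (mine, not tuned), including the swap-with-last erase:

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Each bucket is a small vector; erase swaps with the last element, so there
    // are no tombstones and no gaps to close.
    struct VecChainSet {
        std::vector<std::vector<uint64_t>> buckets;
        explicit VecChainSet(std::size_t n) : buckets(n) {}

        std::vector<uint64_t>& bucket(uint64_t key) {
            return buckets[key % buckets.size()];   // stand-in for a real hash
        }
        bool contains(uint64_t key) {
            for (uint64_t k : bucket(key)) if (k == key) return true;
            return false;
        }
        void insert(uint64_t key) {
            if (!contains(key)) bucket(key).push_back(key);
        }
        void erase(uint64_t key) {
            auto& b = bucket(key);
            for (std::size_t i = 0; i < b.size(); ++i)
                if (b[i] == key) { b[i] = b.back(); b.pop_back(); return; }
        }
    };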


TFA describes a deletion scheme for open addressing in the first paragraph.


Sorry what is TFA?


The fucking article.


The Fine Article.


calque of 'rtfm'


Have you tried absl::flat_hash_map? It uses SIMD in a different way than described in this article, and Google claims that it saves them a lot of memory because it still works pretty well at 90-95% occupancy.


I've benchmarked swiss tables and found that (for hit-heavy workloads) a minimum of 2 loads per lookup is expensive compared to 1.


I've really wanted to try making a hash table that works like a Swiss table but that stores the metadata interleaved with the rest of the data (16 metadata bytes, 16 keys, 16 values, repeat). Doing so would keep your memory accesses closer together, but it would be a pain to program.
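
Something like this, maybe (just a layout sketch, not absl's actual scheme):

    #include <array>
    #include <cstdint>

    // 16 control bytes sit immediately before the 16 slots they describe, so a
    // group's metadata and entries land on nearby cache lines.
    template <typename K, typename V>
    struct Group {
        std::array<uint8_t, 16> ctrl;   // per-slot metadata, scanned with SIMD
        std::array<K, 16> keys;
        std::array<V, 16> values;
    };
    // The table would then be an array of Group<K, V>, indexed by hash / 16.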


Sometimes a very basic fixed-size chaining hash table is actually the best you can do.

Take, for example, tcc, which is a very fast C compiler.

They just use a basic chaining hash table: `Node *nodes[16384];`.

Since most translation units have far fewer than 16384 tokens, most lookups result in a direct hit.
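
The lookup then amounts to something like this (a sketch of the scheme, not TCC's actual code):

    #include <cstring>

    struct Node {
        const char* name;
        Node* next;      // collision chain, usually a single element
        // ... symbol payload ...
    };

    static Node* nodes[16384];

    Node* lookup(const char* name, unsigned hash) {
        for (Node* n = nodes[hash & 16383]; n; n = n->next)
            if (std::strcmp(n->name, name) == 0)
                return n;   // usually a direct hit on the first node
        return nullptr;
    }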


Is that the best TCC could have done here? It's easy to write this in C, which is one obvious reason for TCC to do it, but it's not at all clear this is the best way.

What TCC is actually storing in TokenSym (which you've named "Node" in this context) is four pointers to other information about symbols, a unique ID, and a string.

If we used a compact inline string (like CompactString), we could squeeze all this into 60 bytes, unless the string is more than 24 bytes in length, such as TCC's own warn_implicit_function_declaration or is_compatible_unqualified_types, which would need a heap allocation for the string itself, like they have today.

If we do that, we can build a linearly probed, open-addressed hash table and avoid paying for the extra indirection any time the symbol name is 24 bytes or shorter. I'd expect this to be markedly better.

But it's a lot of work in C, so it isn't a surprise TCC doesn't do it.


TCC has the advantage of being a batch process, so it can insert things using open addressing without ever considering how it intends to delete them, then just drop the whole thing on the floor at the end. Of course, if you are also a batch process, you absolutely should do that too.


This is an advantage of C style programming. If you understand the problem and tailor a solution you can utilize simplifying assumptions which make the code easier and faster.

General libraries have to handle all kinds of cases you might not care about. The most egregious example is assuming every piece of code might be threaded, so everything needs to be protected by locks. (I would guess > 75% of dictionaries never remove a key, and simply throw away the whole thing at the end.)


Locks carry enough of a performance hit on enough architectures that it’s better to document the dictionary as _not_ being thread-safe and require external synchronization. You’ll frequently need external locking anyway to enforce atomicity with changes to other related data structures.


And yet most stdio implementations on systems capable of multithreading add locking around every operation (with the notable exception of Microsoft’s single-threaded C runtime when it still existed, as well as a handful of distinct *_unlocked functions on Unix that were added specifically to mitigate this problem).

There are probably multiple reasons for this, including historical precedent and the need for misuse resistance in a language’s standard library, but I’d guess that in part this is simply because locks used to be much cheaper, especially before systems with multiple hardware threads became ubiquitous.


I would include malloc in that list of regrets (or at least not having a lock free alternative).

I think historically it wasn't clear what role threads would take. If you look at Java they expected application programmers to be using threads haphazardly. Now we know that isn't a good idea.


This may result in poor CPU cache access patterns. For tasks that are often human-bound, like source files or UI data structures, using a sorted array can be a good option.


It's not discussed in the article, but there is also coalesced hashing [1], which is a hybrid of separate chaining and open addressing. Coalesced hashing makes the best use of space, as you can fill a table to capacity before resizing. Speed-wise, it provides consistent performance for all operations (insert, search, delete) rather than being optimized for only one or some of them.

Another topic the article doesn't touch on is the difference between dense and sparse hashmaps. The former stores keys and values together, while the latter stores values in separate array(s), which is more cache-friendly when probing keys.

[1] https://en.wikipedia.org/wiki/Coalesced_hashing
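
To give a flavor of it, a rough insertion sketch (mine, without the "cellar" optimization described on the wiki page):

    #include <cstdint>
    #include <vector>

    struct Cell { bool used = false; uint64_t key = 0; int next = -1; };

    // Collisions are chained through "next" indices, but the chain nodes live
    // in the table itself, so no separate allocation is ever needed.
    bool insert(std::vector<Cell>& t, uint64_t key, uint64_t hash) {
        int i = static_cast<int>(hash % t.size());
        if (!t[i].used) { t[i].used = true; t[i].key = key; return true; }
        while (t[i].next != -1) i = t[i].next;    // walk to the end of the chain
        for (int j = static_cast<int>(t.size()) - 1; j >= 0; --j) {
            if (!t[j].used) {                     // take the highest-numbered free cell
                t[j].used = true;
                t[j].key = key;
                t[i].next = j;                    // splice it onto the chain
                return true;
            }
        }
        return false;   // table is completely full
    }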


Coalesced hashing is just worse than robin hood tables.


I colorized the benchmark data to make it a bit more readable[0].

[0]: https://i.imgur.com/C7MN44m.png


Reads like a great summary for the hash map ideas that work in practice. I would have loved to see flat_hash_map<> thrown into the benchmark mix.


I just benchmarked absl::flat_hash_map and got results comparable to Robin Hood with a load factor between 75% and 90%, which makes sense. It's also faster for looking up missing keys, so seems like a good option. I didn't benchmark the maximum probe lengths, though, so not sure on that front.


I tried adding the maps to the [1] benchmark, but I wasn't able to, since they aren't type-generic yet. You may want to benchmark against [2] and [3], which are the best-performing ones in the above benchmark.

[1] https://github.com/martinus/map_benchmark/

[2] https://github.com/martinus/unordered_dense/blob/main/includ...

[3] https://github.com/ktprime/emhash

Edit: I had the wrong link for [2] and [3], somehow...


I wish this article had been available a few weeks ago when I was implementing my own hash table for my language. The explanations of the algorithms are great. It's given me enough confidence to try to implement Robin Hood probing. Should be a nice upgrade.
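
The core insertion loop, as I understand it (a rough sketch of my own, not the article's code), goes something like:

    #include <cstddef>
    #include <cstdint>
    #include <utility>

    struct Slot { bool used = false; uint64_t hash = 0; int key = 0; int val = 0; };

    // Assumes a power-of-two table that always has at least one empty slot.
    void rh_insert(Slot* slots, std::size_t mask, Slot entry) {
        std::size_t i = entry.hash & mask;
        std::size_t dist = 0;                  // how far `entry` is from its ideal slot
        entry.used = true;
        for (;;) {
            if (!slots[i].used) { slots[i] = entry; return; }
            // How far is the element already in slot i from its own ideal slot?
            std::size_t existing = (i - (slots[i].hash & mask)) & mask;
            if (existing < dist) {             // the "richer" element gives up its slot...
                std::swap(entry, slots[i]);    // ...and we continue inserting it instead
                dist = existing;
            }
            i = (i + 1) & mask;
            ++dist;
        }
    }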

I'm also wondering how to implement that virtual memory trick...


I'm guessing the virtual memory trick is to set up a page mapping so that the next bit of virtual address space after your hash table is the same physical memory as the hash table again. The power-of-2 restriction is needed to make this mapping always meet alignment rules.

If you do this, you can skip the modulo arithmetic that wraps accesses back to the start of the table. Instead you just read off the end of the table, and the next (virtual) memory address after the end of the table is equivalent to the first address of the table.

This trick can be useful when setting up some form of DMA into a circular buffer - you've got hardware that will write data into a chunk of memory (or read from it) and wants a range of virtual addresses to work with. If your destination is a circular buffer (which is pretty common) the free space might wrap around the end. Your options are to only read enough to get to the end, use two entries in a scatter-gather DMA to do the two parts, or use this trick.
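
On Linux you can build this in user space (a sketch, assuming the mapping size is page-aligned) with memfd_create and two MAP_FIXED mappings of the same file:

    #include <cstddef>
    #include <sys/mman.h>
    #include <unistd.h>

    void* alloc_mirrored(std::size_t size) {
        int fd = memfd_create("table", 0);          // anonymous in-memory file
        if (fd < 0 || ftruncate(fd, size) != 0) return nullptr;

        // Reserve 2*size bytes of contiguous virtual address space...
        void* base = mmap(nullptr, 2 * size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { close(fd); return nullptr; }

        char* p = static_cast<char*>(base);
        // ...then map the same pages into both halves.
        mmap(p,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(p + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);
        return p;   // writes through p[i] also show up at p[i + size]
    }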


> hardware [...] wants a range of virtual addresses to work with

Pretty sure DMA cannot use virtual addresses, since that'd require cooperation from the CPU. I guess there can be some IOMMU device that would present some sort of separate address space specifically for peripherals?


Yeah, the only time I've done it, it was an ARM based chip that didn't actually have a full MMU (i.e. full virtual addressing), but did have an IOMMU.


Yeah, it's a lot like the address mirroring I often see in electronics. I'm just wondering if it can be done with Linux mmap or some other system call.


I learned about Open Addressing a long time ago when I was optimizing Dnsmasq. The scenario involved looking up a very large list of domain names to block or combine with ipset in order to customize routing based on domain names. Open addressing worked very well on these large immutable lists.


If the lists are immutable, could you not just construct a special hash function a priori that will guarantee no collisions?


Perfect Hash Functions are possible, but in my limited experience they rarely out-perform the naive thing where you pre-size the hash table to fit all your data (avoid grow-on-insert overhead).

PHFs have two downsides: Finding a PHF for your data can take a bunch of CPU time so it's frustrating to have in your build, and the resulting hash function can be much more costly than a simpler hash function. That is, the cost of hashing can more than dominate an extra lookup in a tiny fraction of cases.

I'd say the use case for PHF is more when you're absolutely memory-bound, e.g. in an embedded ROM. For anything else, just bump up your hash table size a little, to not have those collisions.


Yes. The term of art is “perfect hash”.


I find that there is too much of an emphasis on open-addressing in blog articles and such.

There are a lot of real and common data structures where chaining is much better suited.

For example, whenever your data is already part of one or more other node-based data structures and you layer an index on top of that.

Chaining also tends to waste a lot less memory.


I'm not convinced there are "a lot" of these real and common structures.

They certainly do arise, and particularly if you're also doing concurrency linked lists are attractive because a compare-exchange type mechanism is ideally suited for such structures and that has been implemented cheaply in hardware. But this article was about the default, and we don't want defaults tailored to niche uses. The default T-shirt size shouldn't be big enough for The Rock. The default ceiling height in new buildings shouldn't be 1.5 metres. The default beverage offered at your restaurant shouldn't be Mountain Dew Code Red. It's OK that these things are options, that somebody who wants them can have them, but they're bad defaults.

I'd want to see some measurements from real world projects to believe chaining should be the default. If I take, say, the Chromium codebase, and the LibreOffice codebase, and I look at places they use any type of hash table, how often are they going to be for stuff that'd be "better suited" to chaining? Am I going to consistently find 50%? 10%? 1%?


Pretty much 95% of all hybrid data structures.

Now, hybrid data structures are only used in systems programming, because they require you to be particularly careful about object lifetimes and such.


In my experience / tests, it's way easier to write a high-performance open-addressing hash table than a high-performance chaining one.

That being said, chaining is easier to write.

> Chaining also tends to waste a lot less memory.

How so? A typical 32-bit integer or 32-bit float uses 4 bytes, but a typical pointer is 8 bytes. And there's also the internal malloc/new chunk header to consider.

That means, to store a 4-byte integer in a chained hash table, you need 8 bytes (pointer) + 8 bytes (glibc malloc chunk header) + 4 bytes (integer/float), or 20 bytes total in practice.

If you have, say, 50 integers inside the hashmap, then you'll use 50 integers * 20 bytes each == 1000 bytes for the pointers/data alone, plus (table-size * 8) for the table itself (ex: a table of size 50 needs 50 * 8-byte pointers even when empty), for a total of 1000 bytes + 400 bytes == 1400 bytes or so.

In contrast, each slot of an open-addressing hash table of 4-byte integers only takes up 4 bytes. So 50 integers in a hashmap with a ~50% load factor (ie: size the table to 128 or so) will be 128 * 4 == 512 bytes.

EDIT: It's because of this huge practical difference in memory used that I'm pretty sure open addressing is so high-performance in comparison. When your data structures are half the size of the competition or smaller, it's easier to stay in L1 cache and reap other such size benefits.


It is not necessary for an allocator to have any hidden overhead inside an allocated block. That would be quite unacceptable for small blocks like a 32 byte block holding four pointer-sized values.

In any case, applications can do their own aggregation for small allocations: allocate the nodes in an array and dole it out from that. It can recycle unused nodes itself.


> In any case, applications can do their own aggregation for small allocations: allocate the nodes in an array and dole it out from that. It can recycle unused nodes itself.

Or you could just write the open-addressing HashMap, which is easier than implementing your own custom small allocation malloc()/free()?

That's the thing. To make chaining competitive against open-addressing HashMaps, you've got to bend over backwards and it's suddenly not easy anymore. Though you're right that different versions of malloc/free could be more efficient; e.g., tcmalloc has small allocations built in already, like you mention.

But even if you erase those 8 bytes of glibc malloc chunk header, you're _still_ 8 bytes over the open-addressing implementation (you can't get rid of the 8-byte "next" pointer). So all that work, and you're still worse off than the other methodology.


I can write an aggregating allocator in C with my eyes closed, in a tiny number of lines. These days I mostly wouldn't. If the target system thinks it's a good idea to add a header to every 32-byte allocation, that's their problem. No worthwhile malloc implementation does that, other than in a debugging configuration, where it adds "red zones".

The calculations in my other post show that it's not as simple as that next pointer being pure overhead compared to open addressing.

https://news.ycombinator.com/item?id=35416520


Nodes don't have to be allocated with malloc (that is actually the worst thing you could possibly do).

An open-addressing hash table only performs well up to 50-75% bucket usage. Chaining doesn't care and performs the same up to 100%.


> Nodes don't have to be allocated with malloc (that is actually the worst thing you could possibly do).

I mean... the most obvious place to place nodes is... inside the table itself. Also known as Open Addressing.

Or what, are you going to implement a 2nd, custom heap algorithm to manage your nodes? I guess there's "buddy allocators", but buddy-allocators aren't exactly free either. Whatever method you're using to keep track of nodes is going to have some level of inefficiency.

Whatever allocation you're doing to get these nodes (above and beyond generic malloc) is time you could have spent writing open addressing instead.

> An open-addressing hash table only performs well up to 50-75% bucket usage. Chaining doesn't care and performs the same up to 100%.

Hardly an apples-to-apples comparison. As I pointed out, 50% load-factor on 50x uint32 open-addressing is only ~400 bytes used.

While 100% load factor on 50x uint32 is 1400 bytes used on chaining (plus additional overhead in practice due to malloc fragmentation).


I already explained. Objects in non-trivial data structures are part of several data structures concurrently.


Perhaps we need to start talking with actual code? Here's just a simple idea that's in my brain right now.

    #include <vector>

    template <class Data>
    struct HashNode {
        // Probably should be shared_ptr<HashNode<Data>>, but that's
        // now adding even more inefficiencies like a ref_count per element
        HashNode<Data>* nextPointerChain;
        Data d; // Maybe Data* d if you're sharing it with other data-structures?
    };

    template <class Data, int size>
    struct ChainHashTable {
        HashNode<Data> theTable[size];
    };

    template <class Data, int size>
    struct LinearProbingHashTable {
        Data theTable[size];
        std::vector<bool> occupied = std::vector<bool>(size); // 0 means empty, 1 means occupied
    };
nextPointerChain takes up 8 bytes, even if it's nullptr. No other data structure needs to have the nextPointerChain. If we use shared_ptr<HashNode<Data>>, as per typical modern C++, there are even more inefficiencies that I forgot about. EDIT: You probably can get away with unique_ptr, now that I think of it.

----------

Pointers aren't free, btw. If you're sharing the pointer with many different parts of your program, that's definitely one of those things you'll want to start getting rid of. Not only does a pointer cost 8 bytes, but it's also _SIGNIFICANTLY_ slower and cache-unfriendly in practice.

L1 cache works by reading data in and around anything you access. If you're only using 8 bytes of a cache line for pointer indirection (8/64), you're wasting 56 bytes of your fetch.

Linear Probing is extremely fast because it tends to blitz through L1 cache thanks to locality of data. Simple Linear Probing (or Robin-hood augmented probing) is probably the fastest, and simplest, approach for modern CPUs.


You still fail to understand what being part of multiple data structures means.

Shared pointers, separate container to check occupancy, suggesting making a container of pointers, all these things suggest you have no idea how to do hybrid data structures.

The nodes contain the data along with multiple pointers in them to chain them to various data structures they are part of (trees, lists, hash tables etc.). The object in question contains state that is several cache lines long.

You cannot just put the nodes inside your buckets as you don't get to control where the nodes are and cannot invalidate the other data structures. Though I suppose it's a possibility if you know ahead of time how many nodes you're going to need, but then in that case you may be able to go for perfect hashing to begin with (unless you're building some kind of fixed-size cache).

A given object for example can be part of several hash maps, based on the various pieces of state it is useful to index by.

In practice you'd use a pool to allocate all your nodes in blocks. There is no per-node overhead for this allocation strategy.


So you've at best removed just 8 bytes for rather inconvenient programming for...

    LinearProbingHashTable<FooBar*, size*2>
    ChainHashTable<FooBar*, size>
Note that LinearProbingHashTable only uses 8 bytes per pointer, while ChainHashTable still needs the next pointer, taking up 16 bytes.

The *2 is for a 50% load factor on the linear table, but we still win on linear if we're at, say, a 75% (simple linear) or 95% load factor (which is doable with Robin Hood).


Agreed - I do mention in the post that open addressing is impractical when using intrusive linked lists, which are common in low level data structures.


> only performs well up to 50-75% bucket usage.

Robin Hood performs well into the high 90's.


One situation where I painfully learned that it doesn't is when you iterate over one hashtable to fill another. To defend against that, one needs to add some per-hashtable randomized state into the hash IV.

Through bad experience I also learned that you need to grow not just due to fill factor, but also due to disproportionate chain length, even with the above defense in place.


> To defend against that one needs to add some per-hashtable randomized state into the hash IV.

You should almost always be doing this anyways, since most hash tables use user-provided data as keys, which opens up a well known avenue for denial-of-service attacks:

https://lwn.net/Articles/474912/


I think that's often fixed using a per-run random seed, rather than a per-table seed. The patch linked in the Python bug linked from that article does seem to do that, for example.


> (that is actually the worst thing you could possibly do)

Nonsense, there are many worse things you could do. For example, you could allocate nodes with mmap, at 4KB per node. (Which is a thing I've actually done in the context of cursed bootstrap code where malloc isn't available and having more than a handful of (in that case doubly-linked-list) nodes means the environment is deranged anyway.)


Now that seems plain silly.


I would tend to agree.

This article conveniently designs its benchmark to be exactly the case where open addressing works well.

In almost all of the map implementations I have used so far, chaining was necessary and intrusive linked lists were used.

An open-addressing scheme (cuckoo) was, however, preferable for a read-mostly map allowing concurrent reads and locked writes.

I would be more interested in an article exploring the compromises involved there, e.g. related to paying the cost of an additional indirection when accessing the value.

Another criticism of this article I would add is that the many tables throughout reference different baselines. A single baseline would be much preferable, to keep a sense of scale for each change.


Choice-of-two + bias (pick L) + fixed-sized bin slices of the backing arrays = very high loading, strictly 2 ops (max) per R/W, trivial to write.

I typically just use a very fast cryptographic hash (Blake3 is awesome) and have plenty of bits for h1, h2 (you can have additional choices, but it is a fast diminishing return).
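
If I'm describing that clearly enough, it's roughly this (a sketch of the idea, with hypothetical names):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t kBinSize = 8;               // fixed-size bin per slot

    struct Bin { std::size_t count = 0; uint64_t keys[kBinSize]; };

    // Two hash choices, biased toward the first ("left") bin on ties; at most
    // two bins are ever touched per read or write.
    bool insert(std::vector<Bin>& bins, uint64_t h1, uint64_t h2, uint64_t key) {
        Bin& a = bins[h1 % bins.size()];
        Bin& b = bins[h2 % bins.size()];
        Bin& target = (a.count <= b.count) ? a : b;   // pick the less-loaded bin, left on ties
        if (target.count == kBinSize) return false;   // bin full: caller must resize/rehash
        target.keys[target.count++] = key;
        return true;
    }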


Chaining still requires a flat table, which can change in size and is proportional to the number of entries in the table (to keep the average chain length ("load factor") bounded). Thus, this can still cause memory fragmentation, even though the hash table chain nodes can all be the same size and so reduce fragmentation.

In terms of absolute bytes, ignoring fragmentation, it depends on the load factors.

In chaining, you need at least a singly linked list. Let's assume that our nodes store the hash value, as well as the key, value and a next pointer: so four pointer-sized words. (You don't have to store the hash code, but you'd be foolish not to, because it can reject mismatches in a single one-word comparison, avoiding the need to do a full key comparison.)

In addition, each chain needs a pointer-sized word which points to it from the table. (We ignore remaining overheads, like the small data structure which keeps the table and other bookkeeping info.)

Let's assume we keep the load factor (maximum average chain length) to 4. Beyond that we will increase the table size. So the chained table is considered 100% full with 4 nodes per chain on average. In this situation, the root table is amortized as 1/4 word overhead per node: so each node effectively costs 4.25 words.

An open-addressed table stores keys and values. To make the comparison fair, it should also store the hash codes; they are needed for the same reason: as we probe over collisions, we can use them to reject non-matching keys.

So, when it's 100% full, it needs 3 words per entry. In this situation, it's beating the 4.25 value of chained hashing. However, we would never want an open-addressed table to get 100% full, because the performance degrades, regardless of the collision resolution strategy. Say we allow up to 80%: 20% of the table is wasted space, so each entry effectively costs 3 / 0.8 = 3.75 words. Still beating chained.

Now let's look at the half full situation. Below half full we might reorganize either table to be smaller, so that's our worst case.

When the chained table is half full, the average chain length is 2, and so each node needs 4.5 words: the table part represents more overhead, but the value doesn't change much from the maximum load case.

The half-full open-addressed table basically wastes 60% of the table: it's as if each item requires 7.5 words rather than 3. In this minimum load case, it is losing to chained hashing. (Remember, full is 80%, so half of that is 40%).

Summary:

   Words per entry*:
    
       \ Method
   Load \          Chained      Open Addressing
         +---------------------------------------
   half  |         4.5          7.5
         |
   full  |         4.25         3.75
         

   * Words are pointer-sized scalar values
   * For chained: full means load factor of 4; half means load factor of 2.
   * For open addressing: full means 80% of the table, half is 40% of the table
   * Assuming entries store hash code, key and value.
There is no clear winner in terms of memory, but it's looking as if the improvement from open addressing may not be that great, if at all realized. Of course, the values are debatable; why should the max load factor for chaining be 4? Or maybe more than 80% can be crammed into an open-addressed table. I'm kind of surprised; according to the parameters I chose, I thought that open addressing would hit a better density.

What points in favor of open addressing is caching behavior: avoiding the dependent loads of pointer chasing. For open addressing, cache-friendly collision strategies can be chosen so that when multiple entries are probed, they tend to be close together in the same cache block.


Hmm.

I think the main difference between your calcs and mine is that you have implemented a HashMap, while I was assuming a HashSet (so I only need to store the value, while in your case, you store (key,value)).

In any case, the larger the "entry", the less the 8-byte pointer matters. The smaller the "entry", the more the 8-byte pointer matters.



