Recently, a colleague of mine found a problem deep down in the rabbit hole. What we could see from the outside was a major performance issue – the fix was simple, yet I found the root cause of the problem interesting enough to warrant this post. Consider it a follow-up on contracts and how deep and hideous bugs can manifest themselves if you violate them, as well as how a root cause analysis (RCA) is connected to this.
It’s labelled a quick tip, since most seasoned developers should already know about this, and as such, it merely serves as a reminder to them. While it may not be a quick read, the gist of this post is just a tiny little thing in your everyday work, hence, the quick tip. I am not entirely sure yet on whether to continue quick tips in this way, or even what a proper definition would be, but I’m interested in getting your thoughts on it (preferably after you read the post). If you read this post’s title and immediately know what it’s about and how to deal with it in practical terms, then congratulations – you have just been promoted to proof-reading this post.
In a certain kind of way, the problem I am presenting here is a continuation of my post on contracts. In particular, since it is about HashSets, we are interested in implementations of the hashCode method.
How HashSets work
If you haven’t yet had the chance to look under the hood of a HashSet, here’s a quick primer. The idea is simple: We want several objects to be placed in a collection such that duplicates are only present in the collection once, aka we want a mathematical set of these objects. The “Hash” part on the other hand, is an implementational detail telling us how the handling of “duplicates” works. Mind you, there are other kinds of set implementations, which work completely different and each have their own pros and cons.
A HashSet in particular is based on two things each object (at least in the JVM world) has: a method to compute a hash code and a more costly method (relative to the hash code computation) to compare it against another object.
These two methods have a contract that you should know already. In essence, two objects, which are to be considered equal to each other must guarantee that they also have the same hash code. On the other hand, two objects which share a hash code may not be equal. This is basically the trade-off inherent to a hash code: you can blazingly fast compute that hash code, and while it helps to distinguish a lot of objects, it doesn’t distinguish all of them.
Application of the contract
Since the hash code of objects serves as a way for grouping them, the HashSet implementation is based on an array for each of those groups. For technical reasons, we do not provide an array the size of an integer of course, but the gist is, that objects with different hash codes are tried to be placed into different array positions.
An interesting little bit of knowledge is, that when you look at the source for java.util.HashSet, or take a closer look at an object of this class in your debugger, you will notice that the implementation is based directly on java.util.HashMap. In other words, everything we say here not only applies to a HashSet, but equally to the keys of a HashMap.
The “obvious” problem
Now if that was all you know about how a HashSet works, then I’m afraid, you don’t know how a HashSet works. There is one obvious problem in grouping the objects via their hash codes: Two objects, A and B, which would return false when you compare them via A.equals(B), are allowed to share their hash code. Once you have inserted A into the set, how do you check whether B is contained or not? You would find an object in the set based on B.hashCode, but it’s not B. Also, what happens when you insert both A and B? There is only enough space for one thing in a single array cell. Will only the latter insertion prevail?
None of this really works out well, unless your data structure implementation ensures it works the right way. Since, A and B, both satisfy the equals/hashCode-contract, the HashSet implementation must be able to deal with these problems. And the solution is simple, yet its implications are profound: instead of simply storing an element in the array, the implementation stores a linked list of elements that share their hash codes.
So if you insert B, after you already inserted A, the array position is read, the implementations checks whether the first element of that list (A) is equal to the element that is to be inserted (A.equals(B)). Since it is not, this means we have the case of two different objects with the same hash code, and the linked list for that hash code is extended to include both A and B. Lateron, we can check whether B is contained in the HashSet. This works by computing the hash code of B, checking the array at that position, and then comparing the given B against the first list element A. Since it’s not equal, we check for further list elements and find that B is stored there as well and is equal to B. If we checked for an element C, which also shares its hash code with A and B, we can find out that it is not contained in the list.
The problem with the solution
This implementation of HashSets we discussed here has served well as a basis for decades. I know many software developers, who are quite at ease when working with hashed sets, yet are not equally at ease when having to describe how they work. As usual, this is no problem, until it is one. The concept of a HashSet is an abstractional layer on top of its implementation. As long as the abstraction serves our needs, we need not know its internals. But knowing about this means we’re better off in actually deciding whether it serves our needs, or finding the problem when the serving fails.
As mentioned in the beginning, our real-world case was simply a case of bad performance. So we began to lift the cover and look underneath things, until eventually the culprit was identified. What my colleague found was a third-party library defining a class, which overrides equals. Sounds harmless? It’s not when you take our contract considerations into account, because said class did not override hashCode. Luckily, we derived our own class from that class, which did override both, equals and hashCode, with the interesting insight, that only equals was based on calling super – which in turn is code generated by the IDE, along with a nice little warning that the superclass (the one from the 3rd party) does not overwrite hashCode. Somehow, that little warning was ignored a while back – kind of makes a case for treating recognizable contract violations as errors, doesn’t it?
What happened next in our software is referred to as degeneration. Since the hashCode computation did not include the superclass attributes, we had a lot of objects which ended up having the same hash code. Yet, they were not equal to each other, as equality included the base class attributes. All of that is fine, and at least it no longer violates the contract that way, but it degenerates the HashSet.
Degeneration of a data structure happens, when your data structure that is optimized for one way of working, switches into behaving like a different data structure. We can witness this usually from observing performance characteristics like space and time needs. In complexity theoretic terms, it simply means that we have a data structure with significantly better average case performance than worst-case performance, and degeneration is just another word for hitting that worst-case jackpot.
In our case, most objects were allocated to the same array position due to their hash code, which caused the corresponding linked list to grow enormously. Hence, instead of a hash-indexed table, we ended up having to look through a huge linked list of data items, whenever working with the resulting set. Instead of computing a hash code and looking up the result in an array cell, we instead spent O(n) time for n already inserted items.
Just in case you are not familiar with O-notations and complexity theory, bear in mind that what we are talking about here was the real-world difference between a runtime measured in several hours versus just mere minutes. All it came down to was a violation of a contract, hence, it is vitally important for us as developers to be aware of such contracts.
The bad news is that knowing about the contract won’t immediately give you the cause of such an issue, nor will it prevent contract violations. We even had a contract-satisfying implementation in our slow running code. The good news though is that awareness of these contracts, apart from all the benefits mentioned in the previous post, also helps you on your journey to the root cause.
Here’s a final hypothesis as food for thought: While performing a root cause analysis, if you find a contract violation, then you found the root.
[The feature image is CC-BY-SA 3.0 byvia wikimedia commons]