## C#/.NET Little Wonders: The Useful But Overlooked Sets

*Once again we consider some of the lesser known classes and keywords of C#. Today we will be looking at two set implementations in the System.Collections.Generic namespace: HashSet<T> and SortedSet<T>. Even though most people think of sets as mathematical constructs, they are actually very useful classes that can be used to help make your application more performant if used appropriately.*

*For more of the "Little Wonders" posts, see the index here.*

### A Background From Math

In mathematical terms, a set is an unordered collection of unique items. In other words, the set {2,3,5} is identical to the set {3,5,2}. In addition, the set {2, 2, 4, 1} would be invalid because it would have a duplicate item (2). In addition, you can perform set arithmetic on sets such as:

**Intersections**:*The intersection of two sets is the collection of elements common to both.**Example: The intersection of {1,2,5} and {2,4,9} is the set {2}.*

**Unions**:*The union of two sets is the collection of unique items present in either or both set.**Example: The union of {1,2,5} and {2,4,9} is {1,2,4,5,9}.*

**Differences**:*The difference of two sets is the removal of all items from the first set that are common between the sets.**Example: The difference of {1,2,5} and {2,4,9} is {1,5}.*

**Supersets**:*One set is a superset of a second set if it contains all elements that are in the second set.**Example: The set {1,2,5} is a superset of {1,5}.*

**Subsets:***One set is a subset of a second set if all the elements of that set are contained in the first set.**Example: The set {1,5} is a subset of {1,2,5}.*

### If We’re Not Doing Math, Why Do We Care?

Now, you may be thinking: why bother with the set classes in C# if you have no need for mathematical set manipulation? The answer is simple: they are ** extremely** efficient ways to determine ownership in a collection.

For example, let’s say you are designing an order system that tracks the price of a particular equity, and once it reaches a certain point will trigger an order. Now, since there’s tens of thousands of equities on the markets, you don’t want to track market data for every ticker as that would be a waste of time and processing power for symbols you don’t have orders for. Thus, we just want to subscribe to the stock symbol for an equity order only if it is a symbol we are not already subscribed to.

Every time a new order comes in, we will check the list of subscriptions to see if the new order’s stock symbol is in that list. If it is, great, we already have that market data feed! If not, then and only then should we subscribe to the feed for that symbol.

So far so good, we have a collection of symbols and we want to see if a symbol is present in that collection and if not, add it. This really is the essence of set processing, but for the sake of comparison, let’s say you do a list instead:

1: // class that handles are order processing service

2: public sealed class OrderProcessor

` 3: {`

4: // contains list of all symbols we are currently subscribed to

5: private readonly List<string> _subscriptions = new List<string>();

` 6: `

` 7: ...`

` 8: }`

Now whenever you are adding a new order, it would look something like:

1: public PlaceOrderResponse PlaceOrder(Order newOrder)

` 2: {`

3: // do some validation, of course...

` 4: `

5: // check to see if already subscribed, if not add a subscription

6: if (!_subscriptions.Contains(newOrder.Symbol))

` 7: {`

8: // add the symbol to the list

` 9: _subscriptions.Add(newOrder.Symbol);`

` 10: `

11: // do whatever magic is needed to start a subscription for the symbol

` 12: }`

` 13: `

14: // place the order logic!

` 15: }`

What’s wrong with this? In short: performance! Finding an item inside a **List<T>** is a ** linear - **O(n) – operation, which is not a very performant way to find if an item exists in a collection.

*(I used to teach algorithms and data structures in my spare time at a local university, and when you began talking about big-O notation you could immediately begin to see eyes glossing over as if it was pure, useless theory that would not apply in the real world, but I did and still do believe it is something worth understanding well to make the best choices in computer science).*

Let’s think about this: a linear operation means that as the number of items increases, the time that it takes to perform the operation tends to increase in a linear fashion. Put crudely, this means if you double the collection size, you might expect the operation to take something like the order of twice as long.

Linear operations tend to be bad for performance because they mean that to perform some operation on a collection, you must potentially “visit” every item in the collection. Consider finding an item in a **List<T>**: if you want to see if the list** **has an item, you must potentially check every item in the list before you find it or determine it’s not found.

Now, we could of course sort our list and then perform a binary search on it, but sorting is typically a linear-logarithmic complexity – O(n * log n) - and could involve temporary storage. So performing a sort after each add would probably add more time.

As an alternative, we could use a **SortedList<TKey, TValue>** which sorts the list on every Add(), but this has a similar level of complexity to move the items and also requires a key and value, and in our case the key is the value.

This is why ** sets** tend to be the best choice for this type of processing: they don’t rely on separate keys and values for ordering – so they save space – and they typically don’t care about ordering – so they tend to be extremely performant.

The .NET BCL (Base Class Library) has had the **HashSet<T>** since .NET 3.5, but at that time it did not implement the **ISet<T>** interface. As of .NET 4.0, **HashSet<T>** implements **ISet<T>** and a new set, the **SortedSet<T>** was added that gives you a set with ordering.

### HashSet<T> – For Unordered Storage of Sets

When used right, **HashSet<T>** is a beautiful collection, you can think of it as a simplified **Dictionary<T,T>****. **That is, a Dictionary where the **TKey** and **TValue** refer to the same object. This is really an oversimplification, but logically it makes sense. I’ve actually seen people code a **Dictionary<T,T>** where they store the same thing in the key and the value, and that’s just inefficient because of the extra storage to hold both the key and the value.

As it’s name implies, the **HashSet<T>** uses a hashing algorithm to find the items in the set, which means it does take up some additional space, but it has lightning fast lookups! Compare the times below between **HashSet<T> **and **List<T>**:

Operation |
HashSet<T> |
List<T> |

Add() | O(1) | O(1) at end O(n) in middle |

Remove() | O(1) | O(n) |

Contains() | O(1) | O(n) |

Now, these times are amortized and represent the typical case. In the very worst case, the operations could be linear if they involve a resizing of the collection – but this is true for both the **List** and **HashSet** so that’s a less of an issue when comparing the two.

The key thing to note is that in the general case, **HashSet** is constant time for adds, removes, and contains! This means that no matter how large the collection is, it takes roughly the exact same amount of time to find an item or determine if it’s not in the collection. Compare this to the **List **where almost any add or remove must rearrange potentially all the elements! And to find an item in the list (if unsorted) you must search every item in the **List**.

So as you can see, if you want to create an unordered collection and have very fast lookup and manipulation, the **HashSet** is a great collection.

And since **HashSet<T>** implements **ICollection<T> **and **IEnumerable<T>,** it supports nearly all the same basic operations as the **List<T> **and can use the System.Linq extension methods as well.

All we have to do to switch from a **List<T>** to a **HashSet<T> **is change our declaration. Since **List **and **HashSet **support many of the same members, chances are we won’t need to change much else.

1: public sealed class OrderProcessor

` 2: {`

3: private readonly HashSet<string> _subscriptions = new HashSet<string>();

` 4: `

5: // ...

` 6: `

7: public PlaceOrderResponse PlaceOrder(Order newOrder)

` 8: {`

9: // do some validation, of course...

` 10: `

11: // check to see if already subscribed, if not add a subscription

12: if (!_subscriptions.Contains(newOrder.Symbol))

` 13: {`

14: // add the symbol to the list

` 15: _subscriptions.Add(newOrder.Symbol);`

` 16: `

17: // do whatever magic is needed to start a subscription for the symbol

` 18: }`

` 19: `

20: // place the order logic!

` 21: }`

` 22: `

23: // ...

` 24: }`

` 25: `

Notice, we didn’t change any code other than the declaration for **_subscriptions** to be a **HashSet<T>**. Thus, we can pick up the performance improvements in this case with minimal code changes.

### SortedSet<T> – Ordered Storage of Sets

Just like **HashSet<T> **is logically similar to **Dictionary<T,T>**, the **SortedSet<T>** is logically similar to the **SortedDictionary<T,T>**.

The **SortedSet** can be used when you want to do set operations on a collection, but you want to maintain that collection in sorted order. Now, this is not necessarily mathematically relevant, but if your collection needs do include order, this is the set to use.

So the **SortedSet** seems to be implemented as a binary tree (possibly a red-black tree) internally. Since binary trees are dynamic structures and non-contiguous (unlike **List** and **SortedList**) this means that inserts and deletes do not involve rearranging elements, or changing the linking of the nodes.

There is some overhead in keeping the nodes in order, but it is much smaller than a contiguous storage collection like a **List<T>**. Let’s compare the three:

Operation |
HashSet<T> |
SortedSet<T> |
List<T> |

Add() | O(1) | O(log n) | O(1) at end O(n) in middle |

Remove() | O(1) | O(log n) | O(n) |

Contains() | O(1) | O(log n) | O(n) |

The MSDN documentation seems to indicate that operations on **SortedSet** are O(1), but this seems to be inconsistent with its implementation and seems to be a documentation error. There’s actually a separate MSDN document (**here**) on **SortedSet** that indicates that it is, in fact, logarithmic in complexity. Let’s put it in layman’s terms: logarithmic means you can double the collection size and typically you only add a single extra “visit” to an item in the collection.

Take that in contrast to **List<T>’**s linear operation where if you double the size of the collection you double the “visits” to items in the collection. This is very good performance! It’s still not as performant as **HashSet<T>** where it always just visits one item (amortized), but for the addition of sorting this is a good thing.

Consider the following table, now this is just illustrative data of the relative complexities, but it’s enough to get the point:

Collection Size | O(1) Visits | O(log n) Visits | O(n) Visits |

1 | 1 | 1 | 1 |

10 | 1 | 4 | 10 |

100 | 1 | 7 | 100 |

1000 | 1 | 10 | 1000 |

Notice that the logarithmic – O(log n) – visit count goes up very slowly compare to the linear – O(n) – visit count. This is because since the list is sorted, it can do one check in the middle of the list, determine which half of the collection the data is in, and discard the other half (binary search).

So, if you need your set to be sorted, you can use the **SortedSet<T>** just like the **HashSet<T> **and gain sorting for a small performance hit, but it’s still faster than a **List<T>.**

### Unique Set Operations

Now, if you do want to perform more set-like operations, both implementations of **ISet<T>** support the following, which play back towards the mathematical set operations described before:

**IntersectWith()**–*Performs the set intersection of two sets. Modifies the current set so that it only contains elements also in the second set.***UnionWith()**– Performs a set union of two sets. Modifies the current set so it contains all elements present both in the current set and the second set.**ExceptWith()**– Performs a set difference of two sets. Modifies the current set so that it removes all elements present in the second set.**IsSupersetOf()**– Checks if the current set is a superset of the second set.**IsSubsetOf()**– Checks if the current set is a subset of the second set.

For more information on the set operations themselves, see the MSDN description of **ISet<T>** (**here**).

### What Sets Don’t Do

Don’t get me wrong, sets are not silver bullets. You don’t really want to use a set when you want separate key to value lookups, that’s what the **IDictionary** implementations are best for.

Also sets don’t store temporal add-order. That is, if you are adding items to the end of a list all the time, your list is ordered in terms of when items were added to it. This is something the sets don’t do naturally (though you could use a **SortedSet** with an **IComparer** with a **DateTime** but that’s overkill) but **List<T> **can.

Also, **List<T>** allows indexing which is a blazingly fast way to iterate through items in the collection. Iterating over all the items in a **List<T>** is generally much, much faster than iterating over a set.

### Summary

Sets are an excellent tool for maintaining a lookup table where the item is both the key and the value. In addition, if you have need for the mathematical set operations, the C# sets support those as well.

The **HashSet<T>** is the set of choice if you want the fastest possible lookups but don’t care about order. In contrast the **SortedSet<T>** will give you a sorted collection at a slight reduction in performance.

Tweet |

- Share This Post:
- Short Url: http://wblo.gs/bP9

Print | posted on Thursday, February 3, 2011 6:23 PM | Filed Under [ My Blog C# Software .NET Little Wonders ]

## Feedback

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

Great post! :)## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

You mentioned “HashSet<T> uses a hashing algorithm to find the items in the set, which means it does take up some additional space, but it has lightning fast lookups”You also said “Also, List<T> allows indexing which is a blazingly fast way to iterate through items in the collection. Iterating over all the items in a List<T> is generally much, much faster than iterating over a set.”

So which one you prefer

a. Iterating over large un-sorted collection ( say items over a million)

b. Need fast lookups and check Contains

Since List<T> is indexing, would that be the better approach?

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

They are different kinds of indexing. The indexing in a List<T> means if you want the 3rd element you can say myList[2] and it's super fast. However, if you want tofindan item in the list, then it's a very slow linear operation O(n).HashSet, on the other hand, is very fast at finding, so that the find operation is O(1).

My point on the List is that if you want to iterate over the entire collection or if you want to go directly to a

position(not an item, but a position in the list), it's fast, but you need to find the position you're interested in first, which is slow in lists.Make sense?

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

So what about hashtables? There not generic but they are faster than a hashset.## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

Actually, I don't think they are, I ran 1,000,000 iterations against a HashSet and a HashTable and the HashSet was faster:1000000 Iterations -- Hashtable: 253 ms, HashSet: 86 ms

In addition, the HashSet has the advantage of not requiring twice as much storage (since only stores key, not key and value) and it does not incur boxing/unboxing for primatives.

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

Plus, the fact that they're generic means you can avoid many run-time errors with stronger type-checking at compile-time.## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

The only thing that gets me is that you say its a lot slower to iterate over a hashset then it would be a list. Generally if you have a set you going to be iterating over it. So yes lookups are a lot faster but when you go to iterate over you set its going to be a lot slower. So then that begs the question of which one you should use. What if your storing objects say a Customer Object. Then HashSet<T> would also be slow. In that case it would be best to store it in a Dictionary<CustomerId,Customer> assuming that most of your look up swould be on CustomerId. I think this might limit the usefullness of HashSet<T> Please correct me if i'm wrong.## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

Arg... - an "Ordered Set" or "Sorted Set" doesn't even make any sense...Maybe OrderedUniqueList<>?

If your dealing with only a few ("few" depends on your app., but we can say less than 100's) then it probably doesn't make much difference in the wild -- assuming these loops aren't nested, of course :)

One thing to keep in mind is that depending on the objects you have, it may be more natural to use Equality vs. getHashCode() -- although these are both often implemented poorly (if at all) - so you're set may not turn out the way you expect.

For this reason, specifying the test in-line w/a lambda is nice.

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

@ncage: No, you aren't necessarily going to be iterating over a set. Don't make the mistake of assuming that one collection always takes the place of another. Otherwise why would we have both?Sets are ideal for performing quick lookups where the key is the value, that is, they are a simplification of Dictionary<TKey, TValue> where the key and value are the same instance.

Sets are like Dictionaries in that, yes, you COULD iterate over them, but we typically don't, we typically use them for fast lookups.

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

When you start using the set operations to replace list operations, you continue with checking whether the new element is already in the list before adding with the Contains method. Isn't one of the benefits of using sets that you can just add the new value and the set implementation will take care of avoiding duplicate entries for you?## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

Yes, true, I could have just checked the return value of the HashSet<T>.Add() which returns true if the item was added or false if already in the list. good catch!Fortunately, the Contains() on a hashset is O(1) so it doesn't add much overhead.

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

For the case of doing a full iteration over a collection, I'm not sure why a HashSet would be very much slower than a List. Is it just because of the improved memory locality of a List? Or are there algorithmic concerns as well?I wouldn't have expected the difference to be more than about 20%.

## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

Interesting post. Very informative## # re: C#/.NET Little Wonders: The Useful But Overlooked Sets

Nice Post.