James Michael Hare

...hare-brained ideas from the realm of software development...
posts - 136 , comments - 1091 , trackbacks - 0

My Links

News

Welcome to my blog! I'm a Sr. Software Development Engineer in Seattle, WA. I've been doing C++/C#/Java development for over 18 years, but have definitely learned that there is always more to learn!

All thoughts and opinions expressed in my blog and my comments are my own and do not represent the thoughts of my employer.

Blogs I Read

MCC Logo MVP Logo

Follow BlkRabbitCoder on Twitter

Tag Cloud

Archives

Post Categories

C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

Once again, in this series of posts I look at the parts of the .NET Framework that may seem trivial, but can help improve your code by making it easier to write and maintain.

Today we are going to examine the LINQ set operations that are part of the IEnumerable<T> extension methods.  Now, most of the time when people think of set operations they think of the math or logic classes they are usually taught in, but really these LINQ methods have a much larger appeal and applicability than just math exercises!

Set Operations

When we think of set operations in the realm of set logic, there are three primary ones that come to mind:

  • Intersection:
    • The intersection of two sets is a set that contains all the common elements of both sets.
    • { A, B, C } { B, C, D } = { B, C }
  • Union:
    • The union of two sets is a set that contains all the unique elements of both sets.
    • { A, B, C } { B, C, D } = { A, B, C, D }
  • Difference:
    • The difference of two sets is the first set minus all common elements with the second set.
    • { A, B, C } - { B, C, D } =  { A }

Notice that in set theory, sets are collections of unique elements with no duplicates.  Thus the set union, as we saw, removes all duplicates between the two sets.  This becomes important as we discuss how these set operations apply to collections in .NET.

I've blogged before on the often overlooked HashSet and SortedSet classes (here) in .NET, which are collections that implement sets.  Though these are useful, sometimes you want to be able to apply these useful operations to other types of collections as well.  This is where the LINQ set operation extension methods come in handy, they are implementations of these set operations that can be applied to any sequence that implements IEnumerable<T> including favorites like arrays, List<T>, iterators, etc.

A quick preface on equality

For the set operations to properly work on your sequences of a type, the type must have a valid notion of equality (both Equals() and GetHashCode() must be meaningful). Now, for nearly all the primitive types (int, double, char, etc) the string class, structs and some BCL reference types, equality is well defined and implemented and they will work fine as is.

However, custom classes you write can present a problem because the default implementation of equality most likely will not meet our needs. Just keep that in mind for now and we’ll come back to that in the end with examples of how this can bite you and how to mitigate it…

Intersect() – gathering the common elements

The Intersect() method gets the common elements from two different enumerable sequences, just like the set logic's intersection operation dictates. Intersections are very useful in determining where two sets overlap (that is, what elements two sets have in common).

The two forms of Intersect(), assuming extension method syntax where the first argument is the target, are:

  • Intersect(IEnumerable<TSource> second)
    • Returns a sequence of elements that are common between the first sequence and the second sequence using the default equality comparer for TSource.
  • Intersect(IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
    • Returns a sequence of elements that are common between the first sequence and the second sequence using the definition of equality specified by the passed-in equality comparer.

Let’s play with string since it already has a good default equality comparer. So say we have two enumerable sequences of string:

   1: // lets say we have a list of healthy stuff to consume
   2: var healthyStuff = new List<string> { "fruits", "vegetables", "proteins", "simple carbs", "fiber" };
   3:  
   4: // and we have a list of what i consume
   5: var myStuff = new List<string> { "soda", "chips", "proteins", "fat", "sugar" };

So now that we have these two lists: one of healthy things we can consume, and one that is things i typically consume.  We can perform an intersection on them and see what things I consume that are healthy by saying:

   1: var results = myStuff.Intersect(healthyStuff);
   2:  
   3: foreach (var item in results)
   4: {
   5:     Console.WriteLine(item);
   6: }

This will output the item “proteins” which is the only item common between what I eat and healthy stuff (I eat better than that, really, but just for illustrative purposes).

It should be noted that the Intersect of two sets follows the commutative property. This means that A ∩ B = B ∩ A. So in our example that would mean that the following two statements are logically identical:

   1: // these two are identical
   2: results = myStuff.Intersect(healthyStuff);
   3: results = healthyStuff.Intersect(myStuff);

This makes sense because asking what healthy foods exist in the list that I eat is the same as asking what foods do I eat that exist in the healthy foods list.

Union() – combining the unique elements

The Union() method combines the unique elements from two different enumerable sequences, just like the set union operation dictates. Unions are very useful for combining two sets without duplicates. Thus if in our example we wanted to get a list of all the healthy foods and all foods I eat, we could union the two sets.

The two forms of Union(), assuming extension method syntax where the first argument is the target, are:

  • Union(IEnumerable<TSource> second)
    • Returns a sequence of unique elements from the first and second sequence combined using the default equality comparer for TSource.
  • Union(IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
    • Returns a sequence of unique elements from the first and second sequence combined using the definition of equality specified by the passed-in equality comparer.

By unique elements, I don’t mean to imply that only items with no duplicates are in the resulting set, but that the resulting set eliminates any duplicates. So that if you had { 1, 1, 3, 5 } ∪ { 1, 3, 7 } the result would be { 1, 3, 5, 7 }.

For example:

   1: var results = myStuff.Union(healthyStuff);
   2:  
   3: // this will output soda, chips, proteins, fat, sugar,
   4: // fruits, vegetables, simple carbs, and fiber 
   5: foreach (var item in results)
   6: {
   7:     Console.WriteLine(item);
   8: }

Notice that the duplicate of protein is gone. Union() is also commutative, so A ∪ B = B ∪ A. However that said the ordering will be different since the elements from the first sequence appear first in the resulting sequence, followed by any elements in the result that came from the second sequence.

The nice thing about Union() is it gives you a nice and easy way to join together two sequences and eliminate duplicates. Note that this is very different from the Concat() extension method in LINQ that just concatenates one sequence to the end of the other, but this makes no attempt to remove duplicates.

   1: // these are logically identical
   2: results = myStuff.Union(healthyStuff);
   3: results = myStuff.Concat(healthyStuff).Distinct();

Except() – subtracting the common elements

The Except() method performs a set difference between two sets. That is, A – B yields the items in A minus any items in B that happen to be in A. Any items that were unique to B alone are ignored. Thus, if we wanted to get a list of the food I eat that is NOT healthy food, I could do the set difference between what I eat and the healthy things to eat.

The two forms of Except(), assuming extension method syntax where the first argument is the target, are:

  • Except(IEnumerable<TSource> second)
    • Returns the unique items in the first set, minus any common items from the second sequence using the default equality comparer for TSource.
  • Except(IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
    • Returns the unique items in the first set, minus any common items from the second sequence using the definition of equality specified by the passed-in equality comparer.

Once again this is a simple set difference operation. { 1, 3, 5, 7 } – { 1, 5, 8 } = { 3, 7 }. The 1 and 5 are removed since they were in both the first and second set, and the 8 is removed since it didn’t exist in the first set. So the resulting sequence are only the unique items from the first set that are not also in the second set.

As you can probably tell, difference is not commutative because if you reverse the order of the sets in the difference you get two different things:

  • { 1, 3, 5, 7 } – { 1, 5, 8 } = { 3, 7 }
  • { 1, 5, 8 } – { 1, 3, 5, 7 } = { 8 }

This means that you have to be careful with Except() that you are subtracting the sets correctly. Once again if we look at the food example:

   1: // this is a list of the things I eat that are not healthy
   2: // soda, chips, fat, sugar
   3: var results = myStuff.Except(healthyStuff);
   4:  
   5: // this is a list of healthy things that I do not eat
   6: // fruits, vegetables, simple carbs, fiber
   7: results = healthyStuff.Except(myStuff);

So as you can see, Except() is a handy way to get a list of elements in a sequence that do not match the items from a second sequence.

A quick note on deferred execution…

Please note, that like many of the System.Linq extension methods, Except(), Union(), and Intersect() are deferred execution which means they will not be invoked until an enumerator is requested from them. Thus if you did something like this:

   1: results = healthyStuff.Except(myStuff);
   2:  
   3: // because results is an iterator (deferred execution) this clears
   4: // myStuff before it is actually used, which alters our results
   5: myStuff.Clear();
   6:  
   7: foreach (var item in results)
   8: {
   9:     Console.WriteLine(item);
  10: }

The resulting set will be all of healthyStuff (instead of the difference) since the results variable holds an iterator to the resulting sequence, but that sequence will not be calculated until we begin enumerating through the results.  Thus by calling Clear() on one of the sets before we actually use the results, we've altered the operands before the operator is actually applied.  In this example, that means we try to subtract an empty set myStuff from the list of healthyStuff, which of course results in the full list.

To avoid this either iterate over the results immediately, or use extension methods like ToArray() or ToList() to immediately process the query and put the results in a collection.

A quick note on covariance in IEnumerable<T>

In IEnumerable<T> the type T is covariant, this means you can use the set operations to manipulate two sets of different types related through inheritance. For example, if you had class Employee and class SalariedEmployee which inherits from Employee, then you can perform set operations between the two sets and the resulting set type is the wider of the two types (that is, the higher up the inheritance chain – Employee in this case).

A quick note on mixed containers

Also notice that the only thing required for these set operations in System.Linq to work is that both sequences must implement IEnumerable<T>, this means they can be an array, a List<T>, a HashSet<T>, or an iterator from another query of type T (and so on). Essentially, this is just to say that you can intersect a HashSet<string> with a List<string> and so on, the only thing that is important is that their element types are the same (or covariant as stated above).

A final note on equality in complex classes

I hinted before that these operations will work exactly as you expect for primitives, strings, structs, and any reference types that correctly implement the concept of equality. And I hinted that custom classes you write may be in danger of not working. But why?

Well, you may think that the first problem is that with class the default concept of Equals() is a reference comparison. While this is true, it is only half the issue. Let’s say we define an Employee class and override Equals() on it:

   1: public class Employee 
   2: {
   3:     public int Id { get; set; }
   4:     public string Name { get; set; }
   5:     public double Salary { get; set; }
   6:  
   7:     public override bool Equals(object obj)
   8:     {
   9:         var other = obj as Employee;
  10:  
  11:         return other == null
  12:                 ? false
  13:                 : Equals(Id, other.Id) &&
  14:                   Equals(Name, other.Name) &&
  15:                   Equals(Salary, other.Salary);
  16:     }
  17: }

So now if we attempt to perform set operations on these two lists, what do we get?

   1: var listOne = new []
   2:     {
   3:         new Employee { Id = 1, Name = "John Smith", Salary = 12342 },
   4:         new Employee { Id = 12, Name = "Lucy Doe", Salary = 99243 }
   5:     };
   6:  
   7: var listTwo = new []
   8:     {
   9:         new Employee { Id = 2, Name = "Jane Doe", Salary = 3241 },
  10:         new Employee { Id = 1, Name = "John Smith", Salary = 12342 }
  11:     };

Now, if we try to do an intersection, we’d expect to see John Smith, right?

   1: var results = listOne.Intersect(listTwo);

But we don’t, we get an empty list! Why? We can get a hint in that the second forms of Union(), Intersect(), and Except() that take an IEqualityComparer<TSource>. Why IEqualityComparer, why not just IComparer?

The answer is that IEqualityComparer requires both an Equals() and a GetHashCode() method to be defined. So this should be a good hint to us that we need to provide not only a meaningful Equals() overload but a GetHashCode() overload in our custom classes (or provide a separate custom IEqualityComparer of course).  Remember that two items that are equal should always return the same hash code, but the opposite is not necessarily true.  It's okay for two non-equal items to have the same hash code, but it's never okay for two equals items to not have equal hash codes.  Typically this means that the GetHashCode() method should be based on the same fields used in the Equals() check (or a subset).

So, assuming we add an override for GetHashCode():

   1: public class Employee 
   2: {
   3:     // ... all the other stuff from before
   4:  
   5:     public override int GetHashCode()
   6:     {
   7:         unchecked
   8:         {
   9:             // using the pretty standard method of primes and combinging field hash codes
  10:             int hash = 11;
  11:  
  12:             hash = hash * 31 + Id.GetHashCode();
  13:             hash = hash * 31 + Name != null ? Name.GetHashCode() : 0;
  14:             hash = hash * 31 + Salary.GetHashCode();
  15:  
  16:             return hash;
  17:         }
  18:     }
  19: }

Now we get the expected result of John Smith! Many of the LINQ extension methods use the hash codes of the items in the sequences to quickly and efficiently work their way through the lists. We don’t have this issue with primitives and classes such as string which already override Equals() and GetHashCode() appropriately, and struct doesn’t have this issue because struct by default already does a member-wise Equals() and GetHashCode() construction.

Thus, another way we could have corrected this would be to make Employee a struct, though this has larger ramifications to consider and shouldn’t be done lightly (for more info on class vs struct and all the differences see here).  So the general recommendation is to either provide a meaningful Equals() and GetHashCode(), or create a separate IEqualityComparer that will define these for your custom class.

Summary

System.Linq includes some great set operation extension methods that can be used to operate on two sequences as if they were sets. These can come in handy when combining sequences with no duplicates (Union()), seeing if two sequences have elements in common (Intersect()), or seeing what elements in a sequence are not part of another sequence (Except()).

While set operations are typically thought of as math operations, these can be applied to many computer science problems and should be considered when needing to check membership between two sequences of items.

Print | posted on Thursday, May 5, 2011 7:13 PM | Filed Under [ My Blog C# Software .NET Little Wonders ]

Feedback

Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

In the deferred execution comment on the Except method, surely the results are the full set of healthyStuff, rather than an empty set as you state, as you have cleared the set that you are removing (myStuff), so, in effect, you are saying:

results = healthyStuff.Except(new List<string>())

Or have I misunderstood?
5/6/2011 7:24 AM | alastair
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

@alastair:

You are right, when i first posted it someone noted that paragraph after the example was vague and I went in and altered the verbiage and accidentally reversed what i was thinking in the process.

I've cleaned it up now, thanks so much for the catch!
5/6/2011 9:03 AM | James Michael Hare
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

In your first version of Eympleyee, you declare it as struct. Because of that, your Equals() override won't compile, because you can't use the as operator on (non-nullable) struct. OTOH if you removed the Equals() completely (or if fixed it without providing GetHashCode()), it actually, works, because structs don't use referential equality.
5/6/2011 2:32 PM | Svick
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

@Svick: yep a typo. I had been altering that program in my VS2010 editor several times with each experiment and forgot to switch it back to class, but as you see in the later example, class was intended. I've fixed it now, thanks for the heads up on the typo.

Yes, on structs, that was one of my points is that value types (like struct) already have correct Equals() and GetHashCode() implementations based on memberwise computations. Only custom classes need these members for correct notions of equality.
5/6/2011 2:44 PM | James Michael Hare
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

Great Article! Very useful tip.
5/9/2011 2:35 AM | Dean Hume
Gravatar

# nice article

article is great. i have enjoyed more, thanks for sharing with us...
5/9/2011 8:01 AM | alexa
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!


Wonderful Post and look forward to reading more similar articles.

5/12/2011 5:17 AM | Stephen Soos
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

Shouldn't line 13 of GetHashCode be

hash = hash * 31 + (Name != null ? Name.GetHashCode() : 0);

otherwise you are doing a concat and it will throw when Name is null?

6/2/2011 4:01 AM | Ken
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

It's worth noting that Intersects & Except both return unique elements (like Union)
ex:
{ 1, 1, 3, 5 } ∩ { 1, 1, 3, 7 } = { 1, 3 } not { 1, 1, 3 }
{ 1, 1, 3, 5 } - { 3, 7 } = { 1, 5 } not { 1, 1, 5 }
7/28/2011 7:35 AM | Wizou
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

@Wizou: Very true, all the set operations return unique results (as defined by the IEqualityComparer provided (or default if not provided).
7/28/2011 9:05 AM | James Michael Hare
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

Excellent article I wrote one similar on Working with Set Data but it’s much more condensed you can find it here
Set Data
9/18/2012 3:40 AM | LINQ Operations
Gravatar

# re: C#/.NET Little Wonders: The LINQ Set Operations -- they're not just for math!

Awesome article! I was looking into whether .Except() was comutative and didn't expect to find anything. You saved me some time and made my afternoon! thanks!
7/1/2014 10:46 PM | Andrew Poland
Post A Comment
Title:
Name:
Email:
Comment:
Verification:
 
 

Powered by: