James Michael Hare

...hare-brained ideas from the realm of software development...
posts - 166 , comments - 1431 , trackbacks - 0

My Links

News

Welcome to my blog! I'm a Sr. Software Development Engineer in the Seattle area, who has been performing C++/C#/Java development for over 20 years, but have definitely learned that there is always more to learn!

All thoughts and opinions expressed in my blog and my comments are my own and do not represent the thoughts of my employer.

Blogs I Read

Follow BlkRabbitCoder on Twitter

Tag Cloud

Archives

Post Categories

Image Galleries

.NET

CSharp

Little Wonders

Little Wonders

vNext

C#/.NET Fundamentals: Returning an Immutable Collection

Last week we discussed returning immutable POCOs from an enclosing class so that you can prevent someone who asks for your class’s data to mutate it out from under you.  Now we’re going to get a little more complex and talk about returning immutable collections from an enclosing class for the same reasons.

I will discuss several different methods for returning collections in a read-only fashion with their pros and cons, including performance implications.

The Problem

Many times you will create a type that may have a property that exposes an internally held collection.  This can often be dangerous, as giving the person using your class direct access to the underlying data can allow it to be modified in ways you do not expect.

Now perhaps this is okay if your class is a simple POCO (Plain-Old-CLR-Object) since there is nothing in the class that depends on that data.  And if you truly do wish to allow the user to modify your internally held collection and you want to react to it, this is possible with the ObservableCollection<T>

But many times, when you return an internally held collection from a type you do not expect it to be modified out from under you.  However, if you return the type directly you are allowing that very possibility.

For example, you might have a class called MessageSender that is used to publish messages on a bus, and perhaps you want to allow them to pass in an array of hosts to publish to in the constructor.  Being nice and helpful, you decide you’d like them to be able to also query what the hosts are later.  You might write this:

   1: public sealed class MessageSender
   2: {
   3:     private readonly List<string> _hosts = new List<string>();
   4:  
   5:     // yes could have used private setter, but wanted to illustrate point w readonly.
   6:     public List<string> Hosts 
   7:     { 
   8:         get { return _hosts; }
   9:     }
  10:  
  11:     // ...
  12:  
  13:     public MessageSender(List<string> hosts)
  14:     {
  15:         // set the list of hosts to attempt to connect to
  16:         _hosts = hosts;
  17:     }
  18:  
  19:     // ...
  20: }

On the surface this seems plausible and reasonable.  But we have to think about ways this could be abused, or could be misleading, because it allows things like this:

   1: var sender = new MessageSender(new List<string> { "somehost001", "somehost002", "somehost003" });
   2:  
   3: // assume we connect and are very happy
   4: // ...
   5:  
   6: // there is nothing at all that prevents this from happening:
   7: sender.Hosts.Add("somehost004");
   8:  
   9: // or this
  10: sender.Hosts.Clear();

The fact that we’ve made _hosts readonly is irrelevant because all that means is that we can’t set _hosts to a new List<string> instance, it doesn’t mean that we can’t mutate the list it currently points to.

This can be problematic at best.  What would it mean to clear the Hosts list?  Would that disconnect all current hosts?  Should it be an error if we’re connected but allowable if already connected?  These are the logical dilemmas that returning a mutable collection from an enclosing class can cause.

 

 

 

Some Solutions

Let’s take our MessageSender as an example and talk about the different ways we could allow the Hosts property to return the current collection of hosts in a read-only fashion. 

Return as IEnumerable<string> “Directly”:

The simplest thing we could do would be just to return Tags as IEnumerable<string> instead of List<string>:

   1: public sealed class MessageSender
   2: {
   3:     private readonly List<string> _hosts = new List<string>();
   4:  
   5:     // returning same thing, but changed return type so only shows as simpler interface
   6:     public IEnumerable<string> Hosts 
   7:     { 
   8:         get { return _hosts; }
   9:     }
  10:  
  11:     // ...
  12: }

 

The main advantage of this approach is simplicity, all we have done is change the return type to a more generic type that won’t support mutations.  This is easy to code and is one of the most performant solutions.

However, that said, the disadvantage is that we can easily cast back to List<string> and mutate the original list, and since it is the original list any changes made to the original will reflect in our copy, which may lead to unexpected behaviors including crashes if we are iterating while a change is made in another thread.

Due to this lack of protection, it should probably be avoided if we are operating in a multi-threaded environment or if we want the data returned to be a read-only snapshot and not change. 

Return IEnumerable<string> Using An “Iterator”:

One other thing we can do would be to pass back an “iterator” (using yield return or other deferred expression) over our original collection.  This removes the possibility of the user being able to cast the IEnumerable<string> back to List<string>, but we still have the problem that the user is iterating over the original source (it doesn’t return a copy, it just keeps its place in the original).

In addition, iterators add a bit of additional overhead due to the compiler magic that saves the state of the iteration, which means it will behave a bit more slowly overall if you are processing the entire list.  If you are only interested in the first few items of a list, though, you could see better performance.

   1: public sealed class MessageSender
   2: {
   3:     private readonly List<string> _hosts = new List<string>();
   4:  
   5:     // Use yield return to create an iterator over the collection that
   6:     // is execution deferred - that is, it only calculates next item when requested.
   7:     public IEnumerable<string> Hosts 
   8:     { 
   9:         get 
  10:         { 
  11:             foreach (var host in _hosts)
  12:             {
  13:                 yield return host;
  14:             }
  15:         }
  16:     }
  17:  
  18:     // ...
  19: }

 

That’s a lot more code than simply returning the original list, but we can simplify a bit and use the LINQ extension method Skip() passing in zero.  This means skip the first zero items then return the rest of the list using deferred execution (yield return).

   1: public sealed class MessageSender
   2: {
   3:     private readonly List<string> _hosts = new List<string>();
   4:  
   5:     // Jon Skeet mentioned Skip(0) as a simpler way to do an iterator
   6:     // on a StackOverflow post...
   7:     public IEnumerable<string> Hosts 
   8:     { 
   9:         get { return _hosts.Skip(0); }
  10:     }
  11:  
  12:     // ...
  13: }

That’s much shorter and may add a wee bit of overhead on smaller lists (for the comparison on the skip), but on larger lists should be about a wash on performance.

While conceptually this is a bit better because we are not returning a collection that can be cast back to List<string> and mutated, we are still iterating over the original list, which can still cause headaches if the list is altered in another thread or while we are in the middle of iterating over it.

Return as ReadOnlyCollection<string>:

So, .NET offers a ReadOnlyCollection<T> in the System.Collections.ObjectModel namespace.  Sounds like a slam-dunk, doesn’t it?  Well, yes and no.

First of all, the ReadOnlyCollection<T> is simply a runtime read-only wrapper of the original List<T>, which means that you still have the possibility someone could mutate the collection out from under you, which can be bad. 

The other thing I’m not overly fond of is that ReadOnlyCollection<T> is a full implementation of IList<T> which means it contains all the mutating operations (Add(), Clear(), etc.), but it throws an exception if you attempt to call them. 

While this is logically read-only, that can be problematic as it is much better to provide compile-time protection when possible.  Consider if you accidentally attempt to mutate the collection in a rarely used piece of code and it seems to test well but that code is never hit in development and testing, then you release to production and that piece of code gets hit and it throws an exception. 

   1: public sealed class MessageSender
   2: {
   3:     private readonly List<string> _hosts = new List<string>();
   4:  
   5:     // wraps list as read-only, but still has mutable operations, difference is
   6:     // if you attempt to call those operations it throws an exception
   7:     public IEnumerable<string> Hosts 
   8:     { 
   9:         get { return new ReadOnlyCollection<string>(_hosts); }
  10:     }
  11:  
  12:     // ...
  13: }

It looks nice and easy to use, and because it’s a simple wrapper it doesn’t have the complex overhead that an iterator does and doesn’t make a full copy of the list either, so it’s somewhat lighter.  That said it still suffers from the main fault of allowing your read-only copy to be mutated out from under you by changes in the original collection.

Return ToArray() As IEnumerable<string>:

So, we’ve seen all the solutions so far return some sort of candy on top of the original collection which makes you vulnerable to odd behavior or crashes if the original collection is modified when you aren’t expecting it.  So the safest thing to do would be to return a copy of the original collection (preferably in a read-only format). 

Probably the easiest way to do this would be to return an array using the ToArray() method.  This creates an array of just the right size to hold the contents of the collection.  The benefits of using an array are fast iteration and the fact arrays as a collection can’t be added or removed from (you can only overwrite indexes).  This gives you a bit more of a “read-only feel.”

   1: public sealed class MessageSender
   2: {
   3:     private readonly List<string> _hosts = new List<string>();
   4:  
   5:     // ToArray() makes an array of the contents of the list.
   6:     public IEnumerable<string> Hosts 
   7:     { 
   8:         get { return _hosts.ToArray(); }
   9:     }
  10:  
  11:     // ...
  12: }

Now, the downside of this is we are essentially shallow-copying our collection.  If the collection is very large, this could be an issue, though typically it’s not as bad as you’d think.  Also if the array contains reference types keep in mind only the references and not the objects themselves are copied. 

That is, copying a List<string> of size 20,000 to a string[20000] only takes the size of 20000 references, which is typically 80,000 bytes (less than 80KB).  The size of the strings are irrelevant since it’s just a shallow-copy of the references and not the strings.

The other potential shortfall is that you are making a copy even if you only care about the first few items.  In those cases, an iterator approach is probably more performant, but since its hard to anticipate usage patterns, this is usually not as relevant a criticism.

Performance Differences

To test the performance of each, I ran 20,000 iterations of the test for each method and used the System.Diagnostics.Stopwatch class to time the results of the total run in elapsed milliseconds.

Return Type

10 Items

100 Items

1000 Items

10,000 Items

Direct

3

26

260

2640

ToArray()

7

37

319

3218

Iterator (Direct)

14

100

981

9879

Iterator (Skip(0))

21

99

912

8916

ReadOnlyCollection

6

49

497

4877

 

It’s interesting to note that even though ToArray() makes a full copy, it is faster than everything but direct for returning and iterating over the returned collection when the list is of any appreciable size, and even on small lists is very close to being second fastest.

Also notice that the iterators are slower than ToArray() and the ReadOnlyCollection.  Though I still argue that the fact that ReadOnlyCollection throws instead of disallowing mutation through interface or compile-time can be extremely problematic for debugging.

So which should you use?  As in all things, it depends.  While we iterated over all the items in the collection here, it’s possible someone just wanted to pull the list just to see the first item, which would be more expensive for ToArray() but not effect the others so much. 

With a list of 10,000 if I ran each of these again and only examined the first item, and I got:

Return Type

10,000 Items

Direct

0

ToArray()

271

Iterator (Direct)

5

Iterator (Skip(0))

15

ReadOnlyCollection

1

 

In this case, ToArray() is more expensive because of the complete copy.  That said, its hard to anticipate how a collection returned will be used, and since copies in general aren’t expensive (compared to say database access, etc), it’s probably a better trade-off to assume the whole collection will be iterated or at least most of it.

So, in summary, as in all things, the best solution depends on how you expect it to be used.  In my mind I prefer ToArray() because it makes it easy to return a snapshot of the list in read-only fashion and it will be immune to changes in the original list.  Plus, with a little locking, it would prevent multi-threading issues as well.

All the other methods suffer from problems with multi-threaded access, and can also throw an exception if the original collection is altered while iterating over the wrapped collection.

 

 Technorati Tags: ,,,,,,,,,

 

Print | posted on Thursday, November 4, 2010 7:04 PM | Filed Under [ My Blog C# Software .NET Fundamentals ]

Powered by: