Lazy LINQ

This isn’t an EPiServer-specific post as such, but it is a relevant to most people developing with EPiServer because the chances are you will use a fair bit of LINQ – especially if you’re using things like the DDS. I love LINQ. It’s great. It enables you to take potentially complex looping and iterative coding and reduce it to easily readable LINQ statements. I use it primarily on collections of .Net objects, and the ability to do ‘deep’ property searching, matching and even arithmetic operations has led me to being a bit lazy and splashing LINQ around liberally. I even use it to avoid using EPiServer filters and FPWC if I already have a page data collection handy. However, nothing comes free and LINQ, like everything, has a cost. In the case of LINQ for collections, it is performance. (note that other LINQ implementations like LINQ to SQL are implemented entirely differently internally and this post does not apply to them – although you can use other techniques to improve performance for those queries too)

LINQ for collections itself is not inherently badly performing. It’s pretty slick. However, all these LINQ statements actually do is get ‘flattened out’ to code at compile time. (if looking at disassembled/profiled code, it’s those funny <something>b__x() methods you see everywhere that look like generics but aren’t) Internally, in the methods the compiler creates it iterates the collection and does whatever your query needs… ‘where’ clauses become equality checks etc.

At a simple level, LINQ may not perform any better when compared like-for-like to manually written code. In fact, it may even perform worse. Even more than that, it is important to think about what is actually happening. When you realise that you’re actually doing enumeration code, albeit in a clever wrapper, it makes you that much more aware of being careful with what you are doing. For example, let’s say you have a collection of ‘Foo’ objects and you want to find all objects that have a ‘bar’ property set to ‘A’, and all with ‘bar’ set to ‘B’. In LINQ, you could say this:

        1: var enumerableFooA = from foo in fooCollection where foo.bar == "A" select foo;

       2: var enumerableFooB = from foo in fooCollection where foo.bar == "B" select foo;

This looks nice and clean – we have our two enumerable collections populated on a single line each. If ‘fooCollection’ is small, then this is unlikely to be a problem. However, what this code is actually doing is enumerating the whole collection twice looking for each setting of ‘bar’. Even if the LINQ code performed as well as a loop for each item, we are doubling things up. It would therefore be much better to do something like:

       1: var enumerableFooA = new List<Foo>();

       2: var enumerableFooB = new List<Foo>();

       3:  

       4: foreach(var foo in fooCollection)

       5: {

       6:     if(foo.bar == "A") enumerableFooA.Add(foo);

       7:     if(foo.bar == "B") enumerableFooB.Add(foo);

       8: }

This is more lines of code, but it’s only a single loop and we are populating both collections. The whole subject of optimising LINQ and loops goes much further, naturally. For example, it might be a good idea to break down your collections into Dictionary or SortedList collections if you have some kind of key that you use often. That way you can avoid doing a LINQ ‘where’ clause on the collection on every call just to find items for a specific key. For example, let’s say that each item in my collection of ‘Foo’ objects has a reference to a ‘bazID’ and I wanted to get I the sum of all ‘Value’ properties of my ‘Foo’ objects by their ‘bazID’. (this could typically happen if you have something like an order header with order lines). Using un-optimised LINQ I could write this:

       1: var totals = new Dictionary<Baz, int>();

       2:  

       3: var bazCollection = GetBazCollection();

       4: var fooCollection = GetFooCollection();

       5:  

       6: foreach(var baz in bazCollection)

       7: {

       8:     totals.Add(baz, (from foo in fooCollection where foo.bazID = baz.ID select foo.Value).Sum())); 

       9: }

What the code above will do is iterate my entire fooCollection for every iteration of the ‘baz’ loop looking for items with the right ‘bazID’, then sum the results. At this simple level, even though the entire collection is being repeatedly iterated that might still be a sensible option just for simplicity, but an alternative option would be to create an ‘indexed’ collection first, for example:

       1: var totals = new Dictionary<Baz, int>();

       2:  

       3: var bazCollection = GetBazCollection();

       4: var fooCollection = GetFooCollection();

       5:  

       6: var fooByBazIdCollection = new Dictionary<int, List<Foo>>();

       7:  

       8: foreach(var foo in fooCollection)

       9: {

      10:     if(!fooByBazIdCollection.ContainsKey(foo.bazID)) fooByBazIdCollection.Add(foo.bazID, new List<Foo>());

      11:  

      12:     fooByBazIdCollection[foo.bazID].Add(foo);

      13: }

      14:  

      15: foreach(var baz in bazCollection)

      16: {

      17:     totals.Add(baz, (from foo in fooByBazIdCollection[baz.ID] select foo.Value).Sum());

      18: }

We now run through the whole fooCollection only once to create our ‘indexed’ collection, and after that only the appropriate parts will be iterated by LINQ. This is a fairly trivial example and of course you could achieve the same above using other simple operators and lambda expressions, but it shows the technique. If doing multiple checks against the fooByBazIdCollection then obviously the value of narrowing down the list that is being iterated increases rapidly.

Although nothing in this post is particularly new or complex, I hope that a reminder of what is happening behind the scenes with LINQ for collections might be of some use to you in writing optimal code.

Nov 02, 2011

Comments

Magnus Rahl Nov 4, 2011 09:55 AM

LINQ is great power which required great responsibility. Here are my two cents about other situations to look out for:

I often use Linq to generate collections fake objects for use in tests. But that can result in very strange behaviour and total or intermitent mismatches between expected and actual objects in the test if the enumeration is lazy executed.

An application may gather information from many different sources that are costly to call, say a number of webservices. Therefore you would like to cache the result once you've compiled it. If you do this compilation using Linq, you may end up caching your not-yet-executed lazy indexer, so you actually end up doing all those costly operations every time you get the item from "cache".

Both the above examples are avoided by executing your IEnumerables using ToList(). But it shouldn't be a general rule of thumb to always ToList everything either, so you really have to know what you're doing.

Please login to comment.