During .NET performance assessments, I have found that one of the most common performance problems is, without any doubt, related to collection choice. You attach the profiler to the process and… there it is! The loop (often with 2 or 3 nested levels) plus wrong-collection anti-pattern shows up. Moreover, this code performs “well” in development environments (with low data volumes), but once it reaches production, performance collapses dramatically. Yet a single line of code can make the difference between failure and success.

Let’s see a practical example!

Business case

Our company wants a new query that shows sales information aggregated by country and state. The sales headers and lines are stored in different data sources, so we need to design a .NET process that performs the calculations in memory. The classes involved in this process are:

public class SalesHeader
{
    public Guid Id { get; set; }

    public string CustomerName { get; set; }

    public string Country { get; set; }

    public string State { get; set; }

    public int Age { get; set; }
}

public class SalesLine 
{ 
    public Guid SalesHeaderId { get; set; } 
    
    public int LineId { get; set; } 
    
    public Guid ProductId { get; set; } 
    
    public string ProductDescription { get; set; } 
    
    public int Quantity { get; set; } 
    
    public decimal UnitPrice { get; set; } 
    
    public decimal SubTotal { get { return Quantity * UnitPrice; } } 
    
}
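
The snippets below also reference a SalesGroup aggregate and a salesDb container that are not shown in the post. A minimal sketch, with property names and types inferred from how the code uses them, could look like this:

public class SalesGroup
{
    public string Country { get; set; }

    public string State { get; set; }

    public decimal Sales { get; set; }

    public double AvgAge { get; set; }
}

// Hypothetical in-memory "database" holding both collections.
public class SalesDb
{
    public List<SalesHeader> SalesHeaders { get; } = new List<SalesHeader>();

    public List<SalesLine> SalesLines { get; } = new List<SalesLine>();
}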

“Traditional” implementation

Here is a “traditional” implementation that could solve the problem:

var groupedSales = new List<SalesGroup>();
foreach (var header in salesDb.SalesHeaders)
{
    foreach (var line in salesDb.SalesLines)
    {
        if (line.SalesHeaderId == header.Id)
        {
            // Linear scan for an existing group with this country/state pair
            SalesGroup saleGroup = null;
            foreach (var group in groupedSales)
            {
                if (group.Country == header.Country && group.State == header.State)
                {
                    saleGroup = group;
                    break;
                }
            }

            if (saleGroup == null)
            {
                saleGroup = new SalesGroup
                {
                    Country = header.Country,
                    State = header.State
                };
                groupedSales.Add(saleGroup);
            }

            saleGroup.Sales += line.SubTotal;
            saleGroup.AvgAge = (saleGroup.AvgAge + header.Age) / 2;
        }
    }
}
var groups = groupedSales.OrderByDescending(x => x.Sales).ToList();

So let’s run a small load test and see how this method behaves:

Test parameters (a data-generation sketch follows this list):

  • From 1,000 to 9,000 sales headers
  • From 20,000 to 180,000 sales lines (20 lines per header)
  • 30 countries (randomly picked)
  • 30 states (randomly picked)
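
The original post does not show how the test data was built, but a generator along these lines would match the parameters above (CreateTestData and every name inside it are illustrative, not taken from the post):

// Illustrative test-data generator (not from the original post).
static SalesDb CreateTestData(int headerCount)
{
    var random = new Random();
    var countries = Enumerable.Range(1, 30).Select(i => "Country " + i).ToArray();
    var states = Enumerable.Range(1, 30).Select(i => "State " + i).ToArray();

    var db = new SalesDb();
    for (int i = 0; i < headerCount; i++)
    {
        var header = new SalesHeader
        {
            Id = Guid.NewGuid(),
            CustomerName = "Customer " + i,
            Country = countries[random.Next(countries.Length)],
            State = states[random.Next(states.Length)],
            Age = random.Next(18, 80)
        };
        db.SalesHeaders.Add(header);

        // 20 lines per header, as in the test parameters above.
        for (int lineId = 1; lineId <= 20; lineId++)
        {
            db.SalesLines.Add(new SalesLine
            {
                SalesHeaderId = header.Id,
                LineId = lineId,
                ProductId = Guid.NewGuid(),
                ProductDescription = "Product " + lineId,
                Quantity = random.Next(1, 10),
                UnitPrice = random.Next(1, 100)
            });
        }
    }

    return db;
}

Each implementation can then be timed against the same data set, for example:

// Build the data set once and time one of the implementations below.
var salesDb = CreateTestData(9000);
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
// ... run one of the implementations ...
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds + " ms");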

Performance with loops

The vertical axis represents execution time (in seconds), and the horizontal axis the number of sales headers processed.

Oops! Performance collapses with “only” 9,000 sales headers! What is going wrong?

**Note:** the times in the chart’s table footer are expressed in milliseconds.

The “LINQ” way

If you are familiar with .NET, you may know that there is a cleaner, faster way to implement this query. Here is an alternative solution using LINQ to Objects:

var query = from header in salesDb.SalesHeaders
            from line in salesDb.SalesLines
            where header.Id == line.SalesHeaderId
            group line by new { header.Country, header.State } into salesGroup
            select new SalesGroup
            {
                Country = salesGroup.Key.Country,
                State = salesGroup.Key.State,
                Sales = salesGroup.Sum(x => x.SubTotal)
            };
var groupedSales = query.OrderByDescending(x => x.Sales).ToList();

So much better! It is more self-explanatory and easier to maintain, to say the least. (I REALLY love LINQ):

LINQ implementation

But wait a moment. Houston, we’ve got a problem! The LINQ to Objects implementation is 36% slower than the loop implementation! Well, it is not a problem with LINQ itself, but with the chosen collection type, which is not the right one for the scenario we are facing. Furthermore, even with relatively small data volumes its performance is not “acceptable” at all.

LINQ is not the problem at all. The problem is the wrong collection type chosen for in-memory lookups.

Lookup to the rescue

From a theoretical point of view (future posts will go deeper into this), what we are seeing is the hidden effect of the asymptotic computational complexity of two queries against in-memory lists: with 9,000 headers and 180,000 lines, the nested scans perform on the order of 9,000 × 180,000 ≈ 1.6 billion header/line comparisons. In the SQL Server world, a DBA would tell you: “guy, create some indexes, your query is performing full table scans!”.

So, starting from the LINQ implementation, let’s create an “index” using an innocent-looking Lookup:

var lookup = salesDb.SalesLines.ToLookup(x => x.SalesHeaderId);

var query = from header in salesDb.SalesHeaders
            from line in lookup[header.Id]
            group line by new { header.Country, header.State } into salesGroup
            select new SalesGroup
            {
                Country = salesGroup.Key.Country,
                State = salesGroup.Key.State,
                Sales = salesGroup.Sum(x => x.SubTotal)
            };
var groupedSales = query.OrderByDescending(x => x.Sales).ToList();
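
Two details about Lookup are worth keeping in mind (a minimal, standalone sketch follows): ToLookup builds the whole index in a single pass over the source, using hash-based buckets per key, and indexing a key that is not present returns an empty sequence rather than null, which is why the query above needs no existence check.

var knownId = Guid.NewGuid();
var lines = new[]
{
    new SalesLine { SalesHeaderId = knownId, LineId = 1, Quantity = 2, UnitPrice = 10m }
};

// Built once: a single pass over the source, hash-based buckets per key.
ILookup<Guid, SalesLine> byHeader = lines.ToLookup(x => x.SalesHeaderId);

Console.WriteLine(byHeader[knownId].Count());        // 1
Console.WriteLine(byHeader[Guid.NewGuid()].Count()); // 0: empty sequence, not null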

So let’s test again:

Performance comparison

So much better! LINQ + Lookup (in gray) shows amazing performance compared with the other implementations. With 9,000 sales headers and 20 lines per header, the “loop” implementation takes 26.4 seconds, LINQ to Objects takes 71.6 seconds, and LINQ + Lookup takes only 0.151 seconds.

Choosing lookups for in-memory operations greatly improves overall application performance, practically for free.

To infinity and beyond!

As a final exercise, let’s see how the LINQ to Objects + Lookup implementation behaves with 1,000,000 sales headers and 20,000,000 sales lines:

LINQ + Lookup performance

The performance degradation is far smaller than with the loops or with plain LINQ. However, the execution time inevitably grows, in a linear way, as the amount of data increases. And this linear growth is the ideal scenario we should always aim for.

LINQ + Lookup performance with 1,000,000 sales headers (11.8 seconds) is far better than the loop implementation (6,000 sales headers, 12.3 seconds) or plain LINQ (4,000 sales headers, 11.7 seconds).

Collection type importance

In conclusion, choosing the right collection type at development time is vital for our application’s performance. Every collection type has its own advantages and disadvantages, and you need to know them and act accordingly.

The key to successful software scalability is to find algorithms that grow linearly. And choosing the right collection type is vital to achieving that.

In upcoming posts we will work on asymptotic computational complexity in depth, from a more theoretical point of view, and we will also review .NET’s main collection types and their relation to complexity. Meanwhile, here are some links as a sneak peek:

Asymptotic Computational Complexity (Wikipedia)

Commonly Used Collection Types (MSDN)