It's better to .Take()

Challenge

Given EFCore entity objects with a one-to-many relationship, lets discuss how to best retrieve the parent record along with a single record of the child property based on a specific filter.

In our example we have a Component entity, and a ComponentStatus entity. ComponentStatus is a history of records providing the status of the component at any point in time.

public class Component
{
    public int Id { get; set; }
    public virtual ComponentStatus LatestComponentStatus { get; set; }
}

public class ComponentStatus {
    public int Id { get; set; }
    public DateChanged { get; set; }
    public StatusEnum Status { get; set; }
    public in ComponentId { get; set; }
}

The final result should be a list of components, with the LatestComponentStatus property containing the most recent ComponentStatus record for each component.

In our situation, these tables may be massive with tens of thousands of components and millions of ComponentStatus records to sift through.

Solution 1 (not good)

My first query looked pretty basic:

var query = from c in Components
	select new  {
	   component=c,
	   componentstatus=(from cs in ComponentStatus 
	   			where cs.ComponentId==c.Id && cs.DateChanged<DateTime.UtcNow
				select cs).OrderBy(cs=>cs.Id).FirstOrDefault()
	};

This seemingly works great, but as with all LINQ statements that aren’t straight SELECT’s, if you aren’t looking under the hood, you aren’t doing yourself any favors. There is no tool better for this investigation than LinqPad. If you don’t have a paid license for this wonderful tool, stop what you are doing and get it now.

The above query generates the seemingly harmless SQL code:

SELECT [s].[Id], [t0].[ID], [t0].[ComponentId],  [t0].[DateChanged], [t0].[Status]
FROM [Component] AS [s]
LEFT JOIN (
    SELECT [t].[ID], [t].[ComponentId], [t].[DateChanged], [t].[Status]
    FROM (
        SELECT [c].[ID], [c].[ComponentId], [c].[DateChanged], [c].[Status], ROW_NUMBER() OVER(PARTITION BY [c].[ComponentId] ORDER BY [c].[ID]) AS [row]
        FROM [ComponentStatus] AS [c]
        WHERE [c].[DateChanged] < GETUTCDATE()
    ) AS [t]
    WHERE [t].[row] <= 1
) AS [t0] ON [s].[Id] = [t0].[ComponentId]

After running the above query directly in SQL Manager and applying the recommended indices, a database with approximately 20,000 components and over 225,000 component status records took 6 seconds. Not great. SQL Server has to do a lot of work to make it happen:

I recommend taking your generated SQL from LINQ statements and running it through Microsoft SQL Manager so you can see these statistics. They tell 99% of the story.

Even with proper indexing, SQL still had to do a lot of heavy lifting to get the proper result.

Solution 2 (much better)

By chance (unfortunate how often this happens in my development life) I decided to slightly modify the query, not expecting much change:

var query = from c in Components
	select new  {
	   component=c,
	   componentstatus=(from cs in ComponentStatus 
	   			where cs.ComponentId==c.Id && cs.DateChanged<DateTime.UtcNow
				select cs).OrderBy(cs=>cs.Id).Take(1).FirstOrDefault()
	};

This query looks essentially the same, the only difference is the addition of .Take(1) Oddly (more analysis at the end), that statement fundamentally changed how EFCore built the resulting SQL:

SELECT [s].[Id], [t0].[ID], [t0].[ComponentId], [t0].[DateChanged], [t0].[Status]
FROM [Component] AS [s]
OUTER APPLY (
    SELECT TOP(1) [t].[ID], [t].[ComponentId],[t].[DateChanged], [t].[Status]
    FROM (
        SELECT TOP(1) [c].[ID], [c].[ComponentId], [c].[DateChanged],  [c].[Status]
        FROM [ComponentStatus] AS [c]
        WHERE ([c].[ComponentId] = [s].[Id]) AND ([c].[DateChanged] < GETUTCDATE())
        ORDER BY [c].[ID]
    ) AS [t]
    ORDER BY [t].[ID]
) AS [t0]

Rather than a simple LEFT JOIN, EFCore now chose to implement an OUTER APPLY.

….. interesting.

….. and lucky.

The above query executes in less than 2 seconds (a 3x improvement), and generates the following workload in SQL Server:

Observations

Why .Take?

I’m sure some of this is obvious to EntityFramework wonks, but to the masses this seems a little strange. When you consider the following two LINQ statements:

var query1 = ComponentStatus
		.Where(cs=>cs.DateChanged<DateTime.UtcNow)
		.Select(cs=>cs.Id)
		.OrderBy(cs=>cs)				
		.FirstOrDefault();
	
var query2 = ComponentStatus
		.Where(cs=>cs.DateChanged<DateTime.UtcNow)
		.Select(cs=>cs.Id)
		.OrderBy(cs=>cs)
		.Take(1)
		.FirstOrDefault();

Generate fundamentally the same SQL:

SELECT TOP(1) [c].[ID]
FROM [ComponentStatus] AS [c]
WHERE [c].[DateChanged] < GETUTCDATE()
ORDER BY [c].[ID]
GO

SELECT TOP(1) [t].[ID]
FROM (
    SELECT TOP(1) [c].[ID]
    FROM [ComponentStatus] AS [c]
    WHERE [c].[DateChanged] < GETUTCDATE()
    ORDER BY [c].[ID]
) AS [t]
ORDER BY [t].[ID]
GO

Why then, when .Take is included in a sub-query does it magically change the higher order SQL?

Projection Just Doesn’t Work

Our original requirement was to return a single list of Component objects with a faux-navigation property containing the current ComponentStatus record. Our current LINQ statement returns two objects that have to be merged:

var query = from c in Components
	select new  {
	   component=c,
	   componentstatus=(from cs in ComponentStatus 
	   			where cs.ComponentId==c.Id && cs.DateChanged<DateTime.UtcNow
				select cs).OrderBy(cs=>cs.Id).Take(1).FirstOrDefault()
	};

var results = await query.ToListAsync();
results.forEach(x=>x.component.CurrentComponentStatus = x.componentstatus);

Wouldn’t it be nice if we did not have to do any post-processing to merge the object? I tried the following code:

var query5 = from c in Components
	select new Component  {
	   Id=c.Id,
	   ComponentStatus=(from cs in ComponentStatus 
	   			where cs.ComponentId==c.Id && cs.DateChanged<DateTime.UtcNow
				select new { cs.Status, cs.Id }).OrderBy(cs=>cs.Id).Take(1).FirstOrDefault()
	};

Unfortunately, I could not get EFCore to run this projection successfully. I kept getting an error:

EFCore Linq “Data is Null. This method or property cannot be called on Null values”

My assumption was that was happening because inevitably, some Component records didn’t have any Component Status records. My attempt to add .DefaultIfEmpty() to the sub-select generated a more horrifying error:

Processing of the LINQ expression …. This may indicate either a bug or a limitation in EF Core.

By this point, I gave up. It wasn’t worth further research when throwing a 1 line post-processing statement dealt with it just as efficiently.

The Lesson

Watch how your LINQ statements are transformed in to SQL. Small, seemingly innocuous alterations can have drastic effects.

Anything better?

I feel like there may be some magic JOIN solution that may present a third option and give me the best of both world: A single result set with the navigation property set immediately in the result.

Comments

Robin Sue commented 5 years ago

Interesting problem but also a bit confusing and there are a few errors in the code samples: public DateChanged { get; set; } does not specify a type (DateTime), public in ComponentId { get; set; } misspelled int, .OrderBy(cs=>cs.Id) should likely be OrderByDescending if you’re looking for the latest status, results.forEach(x=>x.component.CurrentComponentStatus = x.componentstatus); forEach should be PascalCase and CurrentComponentStatus should be LatestComponentStatus. Also is LatestComponentStatus a navigation property? It is declared virtual which makes this likely, why not just db.Components.Include(x => x.LatestComponentStatus) then? That will generate a straightforward join. If it is not a navigation property or part of the db model, why is it declared on a db class? Last but not least, it took me way longer than i’d like to admit to realize, that cs.DateChanged<DateTime.UtcNow in the second code sample is not generic code where the right hand side is truncated from view (the UtcNow touches the border of the codebox but there's no scrollbar) but a comparison without whitespace inbetween 😀

Author

feech commented 5 years ago

Hi Robin — All great catches and mostly a result of my first attempt at trying to transpose code into WordPress. I’ll clean up your catches! Much appreciated. As far as some of your non-syntactical observations: 1) .OrderBy was a replacement for my internal query which wasn’t Id, so yes, OrderByDescending is most accurate for this generic version. Thanks! 2) LastComponentStatus as a native navigation property using a subselect query is completely doable, and works. However, the generated SQL reverts back to the inefficient join that basically blows things up at scale. I had debated for sample purposes pulling it out and just demonstrating via an external ComponentStatus but ultimately left it as a nav property for clarity. Guess I #failed haha. Thanks again for noticing some of the styling and formatting issues. Next post coming next week, I’ll attempt to raise the bar. feech

I mean, when you use LatestComponentStatus as a navigation property, you can do context.Components.Include(x => x.LatestComponentStatus).ToList(); which results in SELECT [c].[Id], [c].[LatestComponentStatusId], [c0].[Id], [c0].[ComponentId], [c0].[DateChanged], [c0].[Status] FROM [Components] AS [c] LEFT JOIN [ComponentStatus] AS [c0] ON [c].[LatestComponentStatusId] = [c0].[Id] which is as efficient as it gets and populates the LatestComponentStatus automatically, the downside is that everytime you insert, you need to update the LatestComponentStatusId on Component. Thinking about it though, the cs.DateChanged < DateTime.UtcNow part probably means you can have a future status that must not be (yet) selected so this idea just went down the drain 🙂

Hi Robin! The goal is to retrieve the single most recent record ComponentStatus and populate LastComponentStatus with that record, not a list. The point of my post was that even if you try a simple join with a sub-select inside the join to just retrieve 1 record, the translated query is highly inefficient. You actually have to retrieve both records individually and use .Take(). For some reason EFCore changes the entire query structure between .FirstOrDefault() and .Take(1).FirstOrDefault() as I was trying to show. take care, jasen

> The goal is to retrieve the single most recent record ComponentStatus and populate LastComponentStatus with that record, not a list. Yes, the code i’ve suggested does exactly that, it caches a one-to-one navigation as LatestComponentStatusId on Component, which allows for fast and easy retrieval with a trivial join but the downside, that you must keep this cached value up-to-date and that it doesn’t work well if there’s a time dependency. > The point of my post… Yeah i’ve got that, quite weird, i agree 🙂

Comments are closed.