img by Coba

How to Find the Truncated Mean in SQL

The truncated mean is an average over a set obtained after you actually ignore the highest and lowest numbers. For example: given a set of the three numbers 100, 200 and 300, the "average" would be 200. We would add 100, 200, and 300 then divide by the number of members in the set: (100+200+300)/3 = 200. A "truncated mean" however would look at the set and eliminate the lowest value and the highest value before calculating the average: (200)/1 = 200 {note that we eliminated 100 and 300 in the set}.

I got into a reporting situation for a Reporting Services report that I was writing this week in which I had some really, really off-the-wall numbers for the highs and lows and the truncated mean was the solution. I had a set of data that was something like 15 rows - 1 row was 0.00, another row was 100000.00 and the other thirteen rows were clustered around 1300-1400. Clearly a regular "average" would've generated an odd number: I had an "average" of something like 7800. Once I eliminated 0.00 and 100000.00 from the set, I was fine. Now sure, there are lots of ways to calculate averages and standard deviations; that's not what this post is about. This post has one goal: to show you how if you wish to to calculate the truncated mean in SQL Server 2005 or SQL Server 2008 (or SQL Server 2011 for that matter!). For more details about what a truncated mean is, check out the wikipedia entry here.

To start out, let's create ourselves a table to play with:

USE tempdb
GO
CREATE TABLE Sales(SalesPerson NVARCHAR(128), TotalSales MONEY);
INSERT Sales 
SELECT 'Scott'  , $1 UNION ALL
SELECT 'Scott'  , $2 UNION ALL
SELECT 'Scott'  , $3

(Yes, I know I could have used row initializers but that wouldn't work for SQL 2005, would it?)

Our set is simple: three rows of $1, $2, and $3. The "average" is ($1+$2+$3)/3 = $2 and the truncated mean is $2/1=$2. The key to calculating the truncated mean is use windowing and partitioning. For example, run this query:

SELECT ROW_NUMBER() OVER(ORDER BY TotalSales) AS RowNumber
    , COUNT(*) OVER() AS NumberOfMembersInSet
    , SalesPerson, TotalSales
FROM Sales

The ROW_NUMBER() function allows us to assign a unique number to each row in a set of rows. The OVER() windowing function allows us to define how we want the ROW_NUMBER() to be calculated. Because we said ROW_NUMBER() OVER(ORDER BY TotalSales), #1 in the ROW_NUMBER() generated will be for the row with the lowest amount of sales ($1 in our example). The COUNT(*) tells us how many total rows are in the set and the COUNT(*) OVER() says to include the total number of rows in the count.

Here's what we get back when we run our work-in-progress query:

image

Take a look at the "RowNumber" column is critical here; it's what we will use to "throw out" the highest and lowest value. What we really want to do is to say, "Eliminate RowNumber 1 and 3 and tell us the average of the rows in between." And that's simple enough in this example since there are only three rows; the truncated mean would be $2.00.

But how do you return just the truncated mean?

Unfortunately, we can't just write a WHERE clause that says WHERE RowNumber BETWEEN 2 AND COUNT(*)-1 (for various reasons - check out our transact-sql training videos for more). Windowing functions' results cannot be directly accessed in a query. There is a workaround though... Run this query:

SELECT * FROM (
    SELECT ROW_NUMBER() OVER(ORDER BY TotalSales) AS RowNumber
        , COUNT(*) OVER() AS NumberOfMembersInSet
        , SalesPerson, TotalSales
    FROM Sales
) AS x
WHERE x.RowNumber > 1
    AND x.RowNumber < NumberOfMembersInSet 
ORDER BY SalesPerson DESC

Couple of things:

  1. We've now placed the windowing functions in a subquery
  2. Now that we've done that, we can put a filter that says we want rows 2 through COUNT(*)-1

The "NumberOfMembersInSet" column is now very important - it effectively tells us how many rows are in the set. We can use that in our WHERE clause to filter out the last row in the set. The final result of the above query is to just return the rows that are RowNumber>1 AND RowNumber<(last row). The only final piece of our puzzle is to calculate the average:

SELECT SalesPerson, AVG(TotalSales) AS AvgSales FROM (
    SELECT ROW_NUMBER() OVER(ORDER BY TotalSales) AS RowNumber
        , COUNT(*) OVER() AS NumberOfMembersInSet
        , SalesPerson, TotalSales
    FROM Sales
) AS x
WHERE x.RowNumber > 1
    AND x.RowNumber < NumberOfMembersInSet 
GROUP BY SalesPerson
ORDER BY SalesPerson DESC

But wait - there's more you need to know...

The above was a simplified version of our query that showed you only how to calculate the truncated mean over one partition. We were only looking at one salesperson's average - what if we want the average for each salesperson? Each salesperson is a different partition and, to account for that, we must use PARTITION BY SalesPersonId in our windowing functions. To illustrate, let's add a few more salespeople and sales to the database:

INSERT Sales 
SELECT 'Chad'   , $1 UNION ALL
SELECT 'Chad'   , $2 UNION ALL
SELECT 'Chad'   , $4 UNION ALL
SELECT 'Chad'   , $5 UNION ALL
SELECT 'Chrissy', $1 UNION ALL
SELECT 'Chrissy', $1 UNION ALL
SELECT 'Chrissy', $2 UNION ALL
SELECT 'Chrissy', $4 UNION ALL
SELECT 'Chrissy', $5

If you just eyeball it, you can see that Chad has four rows - if we eliminate $1 (the low) and $5 (the high), then the truncated mean is ($2+$4)/2=$3. Chrissy, however, shows us what happens when we have two values that are the same on either the hi or low (the "low" in this case): she has two $1 entries. A truncated mean only removes the first one!

The Final Query

SELECT SalesPerson, AVG(TotalSales) AS AvgSales FROM (
    SELECT ROW_NUMBER() OVER(PARTITION BY SalesPerson ORDER BY TotalSales) AS RowNumber
        , COUNT(*) OVER(PARTITION BY SalesPerson) AS NumberOfMembersInSet
        , SalesPerson, TotalSales
    FROM Sales
) AS x
WHERE x.RowNumber > 1
    AND x.RowNumber < NumberOfMembersInSet 
GROUP BY SalesPerson
ORDER BY SalesPerson DESC

The final result is:

image

Scott's truncated mean eliminated the low ($1) and the high ($3) which left only a single value ($2)

Chad's truncated mean threw out a low of $1 and a high of $5 and the final calculation was ($2+$4)/2 = $3

Chrissy's truncated mean had to eliminate the first $1 (the low) and the $5 (the high) for a final calculation of ($1+$2+4)/3 = $2.3333.

I hope this helped you understand the truncated mean a bit more. It isn't a calculation that we, as developers, need all that often but it's nice to have it ready when and if you need it!

authors
scott whigham
grant moyle
chad weaver
recent comments
  • Pilipinas Win na Win: I'm interested in acquiring website link on this internet site. read more
  • Pilipinas Win na Win: blog posts? Is actually doing this over creative commons? read more
  • Pilipinas Win na Win: I am truly right here to aid, you need to read more
  • Pilipinas Win na Win: Hiya! Our staff members are on the lookout for near read more
  • Pilipinas Win na Win: Can you be fascinated to become link partners? read more
  • Pilipinas Win na Win: Hi ! Do you hold methods type of donation box read more
  • The Love Calculator: I just book marked your blog on Digg and StumbleUpon.I read more
  • Replacement Windows: Thanks a million for this, I am greatful for the read more
  • Scuba Diving: I’ve been visiting your blog for a while now and read more
  • Tahitian Noni Juice: Which golf clubs will be the best for beginner ? read more