June 14, 2018
There has been a few times where I had to build a query in such a way that I needed to both aggregate groups of data while showing information of each grouped line. In Manufacturing, one specific report was a production dashboard, which summarized all the operations of a production job (total run time, total lead time, number of operations), but also needed to show the “routing” of the job, which was the name of each operation being grouped together. The output needed to be one line per job. The solution was to concatenate the names of each operation into one row per group.
Another problem I had months later was a Purchase Price variance report for Finance. They wanted one line per Purchase Order, showing the variance that occured against the paid invoice. This was a problem because there were some instances of multiple invoices making partial payments to a single Purchase Order, duplicating the Purchase order line which was displaying the total purchase price. The solution was to group the invoices by Purchase Order, but we didn’t want to lose out on the invoice numbers. The solution was to list the related invoices in a string.
In both of these scenarios, it was useful to aggregate a character field into a comma seperated list for grouping. Below is how to do so in both PostgreSQL and SQL Server.
I will be using a demo dataset that can be downloaded from the PostgreSQL website. It has three tables:
%%sql postgresql:// SELECT * FROM products limit 5;
5 rows affected.
|2||Python Book||29.99||2011-01-01 10:00:00-10:00||None||['Book', 'Programming', 'Python']|
|3||Ruby Book||27.99||2011-01-01 10:00:00-10:00||None||['Book', 'Programming', 'Ruby']|
|4||Baby Book||7.99||2011-01-01 10:00:00-10:00||None||['Book', 'Children', 'Baby']|
|5||Coloring Book||5.99||2011-01-01 10:00:00-10:00||None||['Book', 'Children']|
%%sql postgresql:// SELECT * FROM purchases limit 5;
5 rows affected.
|1||2011-03-16 05:03:00-10:00||Harrison Jonson||6425 43rd St.||FL||50382||7|
|2||2011-09-13 19:00:00-10:00||Cortney Fontanilla||321 MLK Ave.||WA||43895||30|
|3||2011-09-10 19:54:00-10:00||Ruthie Vashon||2307 45th St.||GA||98937||18|
|4||2011-02-27 10:53:00-10:00||Isabel Wynn||7046 10th Ave.||NY||57243||11|
|5||2011-12-20 02:45:00-10:00||Shari Dutra||4046 8th Ave.||FL||61539||34|
%%sql postgresql:// SELECT * FROM purchase_items limit 5;
5 rows affected.
As shown below, there are buyers that have bought multiple items. In this demo, we want to show the buyers name, the total amount he paid and what items he bought, with one buyer per row. This requires that we aggregate the Product field, which is a category type column.
%%sql postgresql:// SELECT purch.Name as "Buyer", prod.title as "Product", items.price * items.quantity as "Cost", RANK() OVER( PARTITION BY items.purchase_id ORDER BY items.product_id) AS "Item" FROM purchase_items items INNER JOIN products prod ON items.product_id = prod.id INNER JOIN purchases purch ON items.purchase_id = purch.id limit 10;
10 rows affected.
|Harrison Jonson||Ruby Book||27.99||1|
|Harrison Jonson||MP3 Player||108.00||2|
|Ruthie Vashon||Classical CD||9.99||1|
|Ruthie Vashon||Holiday CD||9.99||2|
|Isabel Wynn||Baby Book||23.97||1|
|Shari Dutra||Python Book||119.96||1|
|Kristofer Galvez||Coloring Book||5.99||1|
Turns out it is extremely easy to do so in postgreSQL, because there is a built in function called “STRING_AGG”. This does all the heavy lifting and easily concatenates the strings for us.
%%sql postgresql:// SELECT purch.Name as "Buyer", SUM(items.price * items.quantity) as "Total Paid", STRING_AGG(prod.title, ' | ') as "Products Purchased", count(purch.Name) as "Item Count" FROM purchase_items items INNER JOIN products prod ON prod.id = items.product_id INNER JOIN purchases purch ON purch.id = items.purchase_id GROUP BY purch.Name ORDER BY "Item Count" DESC limit 10
10 rows affected.
|Buyer||Total Paid||Products Purchased||Item Count|
|Evelyn Fretz||72.93||Holiday CD | Dictionary | Action | Baby Book||4|
|Russ Petrin||539.95||Pop CD | Electronic CD | Desktop Computer | Classical CD||4|
|Williams Selden||544.95||Dictionary | Drama | Desktop Computer | Classical CD||4|
|Angel Coderre||1050.96||Action | 42" LCD TV | Baby Book||3|
|Andres Schippers||2539.96||42" LCD TV | Country CD | Action||3|
|Allen Harshberger||209.89||Electronic CD | Action | Ruby Book||3|
|Alfonzo Jay||1038.98||Desktop Computer | Pop CD | 42" Plasma TV||3|
|Alfonzo Haubrich||2359.93||Comedy Movie | Laptop Computer | Desktop Computer||3|
|Alfonzo Bodkin||3721.91||Classical CD | Laptop Computer | Ruby Book||3|
|Adell Mayon||1408.98||Dictionary | 42" LCD TV | Laptop Computer||3|
In SQL Server, it’s a bit harder. One suggestion online is to use the “FOR XML PATH” function, which concatenates the field into an XML path by a delimiter. This is packaged in the “STUFF” function, which stuffs one string into another string and removes the first comma for us.
I do not have access to a SQL Server where my jupyter notebook is stored, but the accompanied repository has a DDL for the data which can be pasted into SQL Fiddle along with the code below.
%%sql sqlserver:// WITH Purch AS ( SELECT purchasers.name, items.product_id as "product_id", round(items.price * items.quantity, 2) as "Total", products.title FROM purchase_items items INNER JOIN purchases purchasers ON purchasers.id = items.purchase_id INNER JOIN products ON products.id = items.product_id ) SELECT Purch2.name, STUFF(( SELECT ', ' + Purch1.title FROM Purch AS Purch1 WHERE Purch1.name = Purch2.name FOR XML PATH ('')),1 ,1 , '') AS [Products], sum(Purch2.Total) AS [Total Paid] FROM Purch AS Purch2 GROUP BY Purch2.Name
An alternative is to use the PIVOT function if you know the maximum number of lines in each grouping. This works if you don’t have too many lines in each group. In this case there are ever only four items bought, so we Pivot the rank function (which counts number of rows in a group) into columns, and have a case when function to return the columns concatenated into a string.
WITH Purch AS ( SELECT purchasers.name, items.product_id as "product_id", round(items.price * items.quantity, 2) as "Total", products.title, RANK() OVER( PARTITION BY purchasers.name ORDER BY items.product_id) as "Count" FROM purchase_items items INNER JOIN purchases purchasers ON purchasers.id = items.purchase_id INNER JOIN products ON products.id = items.product_id ) SELECT Purch.name, Purch.product_id, Purch.Total, Purch.title, ( case when Purch.1 is null then '' else Purch.1 end + case when Purch.2 is null then '' else ', ' + Purch.2 + case when Purch.3 is null then '' else ', ' + Purch.3 + case when Purch.4 is null then '' else ', ' + Purch.4 end end end) as "Purchased Items" FROM Purch PIVOT max(Purch.title) for Purch.Count in (,,, )