String Aggregate And Concatenate

June 14, 2018

There has been a few times where I had to build a query in such a way that I needed to both aggregate groups of data while showing information of each grouped line. In Manufacturing, one specific report was a production dashboard, which summarized all the operations of a production job (total run time, total lead time, number of operations), but also needed to show the “routing” of the job, which was the name of each operation being grouped together. The output needed to be one line per job. The solution was to concatenate the names of each operation into one row per group.

Another problem I had months later was a Purchase Price variance report for Finance. They wanted one line per Purchase Order, showing the variance that occured against the paid invoice. This was a problem because there were some instances of multiple invoices making partial payments to a single Purchase Order, duplicating the Purchase order line which was displaying the total purchase price. The solution was to group the invoices by Purchase Order, but we didn’t want to lose out on the invoice numbers. The solution was to list the related invoices in a string.

In both of these scenarios, it was useful to aggregate a character field into a comma seperated list for grouping. Below is how to do so in both PostgreSQL and SQL Server.

Understanding the Dataset

I will be using a demo dataset that can be downloaded from the PostgreSQL website. It has three tables:

  • Products: holding a table of products for sale
  • Purchases: holding a list of users who’ve made purchases
  • Purchase_Items: holding a list of items purchased
%%sql postgresql://
    
    SELECT * FROM products limit 5;
5 rows affected.
id title price created_at deleted_at tags
1 Dictionary 9.99 2011-01-01 10:00:00-10:00 None ['Book']
2 Python Book 29.99 2011-01-01 10:00:00-10:00 None ['Book', 'Programming', 'Python']
3 Ruby Book 27.99 2011-01-01 10:00:00-10:00 None ['Book', 'Programming', 'Ruby']
4 Baby Book 7.99 2011-01-01 10:00:00-10:00 None ['Book', 'Children', 'Baby']
5 Coloring Book 5.99 2011-01-01 10:00:00-10:00 None ['Book', 'Children']
%%sql postgresql://
    
    SELECT * FROM purchases limit 5;
5 rows affected.
id created_at name address state zipcode user_id
1 2011-03-16 05:03:00-10:00 Harrison Jonson 6425 43rd St. FL 50382 7
2 2011-09-13 19:00:00-10:00 Cortney Fontanilla 321 MLK Ave. WA 43895 30
3 2011-09-10 19:54:00-10:00 Ruthie Vashon 2307 45th St. GA 98937 18
4 2011-02-27 10:53:00-10:00 Isabel Wynn 7046 10th Ave. NY 57243 11
5 2011-12-20 02:45:00-10:00 Shari Dutra 4046 8th Ave. FL 61539 34
%%sql postgresql://
    
    SELECT * FROM purchase_items limit 5;
5 rows affected.
id purchase_id product_id price quantity state
2 1 3 27.99 1 Delivered
3 1 8 108.00 1 Delivered
4 2 1 9.99 2 Delivered
5 3 12 9.99 1 Delivered
6 3 17 14.99 4 Delivered

Grouping Possibilities

As shown below, there are buyers that have bought multiple items. In this demo, we want to show the buyers name, the total amount he paid and what items he bought, with one buyer per row. This requires that we aggregate the Product field, which is a category type column.

%%sql postgresql://
            
            SELECT 
                purch.Name as "Buyer", 
                prod.title as "Product", 
                items.price * items.quantity as "Cost",
                RANK() OVER(
                    PARTITION BY items.purchase_id
                    ORDER BY items.product_id) AS "Item"
                
            FROM purchase_items items
            INNER JOIN products prod ON
                items.product_id = prod.id
            INNER JOIN purchases purch ON
                items.purchase_id = purch.id
            
            limit 10;
10 rows affected.
Buyer Product Cost Item
Harrison Jonson Ruby Book 27.99 1
Harrison Jonson MP3 Player 108.00 2
Cortney Fontanilla Dictionary 19.98 1
Ruthie Vashon Classical CD 9.99 1
Ruthie Vashon Holiday CD 9.99 2
Ruthie Vashon Documentary 59.96 3
Isabel Wynn Baby Book 23.97 1
Shari Dutra Python Book 119.96 1
Shari Dutra Romantic 14.99 2
Kristofer Galvez Coloring Book 5.99 1

PostgreSQL Implementation

Turns out it is extremely easy to do so in postgreSQL, because there is a built in function called “STRING_AGG”. This does all the heavy lifting and easily concatenates the strings for us.

%%sql postgresql://

        SELECT 
            purch.Name as "Buyer", 
            SUM(items.price * items.quantity) as "Total Paid",
            STRING_AGG(prod.title, ' | ') as "Products Purchased",
            count(purch.Name) as "Item Count"

        FROM purchase_items items
        INNER JOIN products prod ON
            prod.id = items.product_id
        INNER JOIN purchases purch ON
            purch.id = items.purchase_id

        GROUP BY purch.Name
        ORDER BY "Item Count" DESC
        limit 10
10 rows affected.
Buyer Total Paid Products Purchased Item Count
Evelyn Fretz 72.93 Holiday CD | Dictionary | Action | Baby Book 4
Russ Petrin 539.95 Pop CD | Electronic CD | Desktop Computer | Classical CD 4
Williams Selden 544.95 Dictionary | Drama | Desktop Computer | Classical CD 4
Angel Coderre 1050.96 Action | 42" LCD TV | Baby Book 3
Andres Schippers 2539.96 42" LCD TV | Country CD | Action 3
Allen Harshberger 209.89 Electronic CD | Action | Ruby Book 3
Alfonzo Jay 1038.98 Desktop Computer | Pop CD | 42" Plasma TV 3
Alfonzo Haubrich 2359.93 Comedy Movie | Laptop Computer | Desktop Computer 3
Alfonzo Bodkin 3721.91 Classical CD | Laptop Computer | Ruby Book 3
Adell Mayon 1408.98 Dictionary | 42" LCD TV | Laptop Computer 3

SQL Server Implementation

In SQL Server, it’s a bit harder. One suggestion online is to use the “FOR XML PATH” function, which concatenates the field into an XML path by a delimiter. This is packaged in the “STUFF” function, which stuffs one string into another string and removes the first comma for us.

I do not have access to a SQL Server where my jupyter notebook is stored, but the accompanied repository has a DDL for the data which can be pasted into SQL Fiddle along with the code below.

%%sql sqlserver://
    
        WITH Purch AS (
          SELECT 
            purchasers.name,
            items.product_id as "product_id",
            round(items.price * items.quantity, 2) as "Total",
            products.title

          FROM purchase_items items
            
          INNER JOIN purchases purchasers ON
            purchasers.id = items.purchase_id
            
          INNER JOIN products ON
            products.id = items.product_id
        )

        SELECT 
          Purch2.name,
          STUFF((
            SELECT ', ' + Purch1.title
            FROM Purch AS Purch1
            WHERE Purch1.name = Purch2.name
            FOR XML PATH ('')),1 ,1 , '') AS [Products],
          sum(Purch2.Total) AS [Total Paid]
        FROM Purch AS Purch2
        GROUP BY Purch2.Name

An alternative is to use the PIVOT function if you know the maximum number of lines in each grouping. This works if you don’t have too many lines in each group. In this case there are ever only four items bought, so we Pivot the rank function (which counts number of rows in a group) into columns, and have a case when function to return the columns concatenated into a string.

        WITH Purch AS (
          SELECT 
            purchasers.name,
            items.product_id as "product_id",
            round(items.price * items.quantity, 2) as "Total",
            products.title,
            RANK() OVER(
                PARTITION BY
                    purchasers.name
                ORDER BY
                    items.product_id) as "Count"

          FROM purchase_items items
          INNER JOIN purchases purchasers ON
            purchasers.id = items.purchase_id
          INNER JOIN products ON
            products.id = items.product_id
        )

        SELECT
            Purch.name,
            Purch.product_id,
            Purch.Total,
            Purch.title,
            (
            case when Purch.1 is null then '' else Purch.1 end +
            case when Purch.2 is null then '' else ', ' + Purch.2 +
            case when Purch.3 is null then '' else ', ' + Purch.3 +
            case when Purch.4 is null then '' else ', ' + Purch.4 end end end) as "Purchased Items"
        FROM Purch
        PIVOT
            max(Purch.title) for Purch.Count in ([1],[2],[3], [4])