Skip to content

Instantly share code, notes, and snippets.

@kgjenkins
Last active July 23, 2019 21:41
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kgjenkins/937eb7ba351e34310846f4d9ad7ddb62 to your computer and use it in GitHub Desktop.
Save kgjenkins/937eb7ba351e34310846f4d9ad7ddb62 to your computer and use it in GitHub Desktop.
Aggregate counts of multi-valued columns

Aggregate counts of multi-valued columns

Let's suppose we have a dataset of research studies conducted in various countries, and some studies are associated with multiple countries. Our studies table might have multiple comma-delimited values in the country column, and there might be spaces after some of those commas, looking something like this:

id title country
1 Study one Mauritania
2 Study two Mali
3 Study three Guinea
4 Study four Mali,Guinea
5 Study five Mali, Senegal, Mauritania

If we want to map how many studies took place in each country, we can do this using QGIS, using one of the two methods below.

Virtual tables method

With this method, we will create a derivate table that lists the number of studies associated with each country, and then join it to the countries layer.

From the Layer menu, select "Create Layer > New Virtual Layer..." and enter the following SQL query:

SQL QUERY #1:

with recursive splitvalues(id, val, more) as (
  select id, '', replace(country, ', ', ',') || ','
  from studies
  union all
  select id, substr(more, 0, instr(more, ',')), substr(more, 1+instr(more, ','))
  from splitvalues
  where more <> ''
)
select trim(val, ', ') as country, count(*) as n
from splitvalues
where val <> ''
group by val
order by val

This should give us the following output:

country n
Guinea 2
Mali 3
Mauritania 2
Senegal 1

This table could then be joined manually to a country boundary layer for mapping. Or we could even do the joining within the virtual layer query. Assuming we have a country boundary polygon layer called countries, here's how to output the study counts joined to countries as points, which could then be sized according to the number of studies:

SQL QUERY #2:

with cs as (
with recursive splitvalues(id, val, more) as (
  select id, '', replace(country, ', ', ',') || ','
  from studies
  union all
  select id, substr(more, 0, instr(more, ',')), substr(more, 1+instr(more, ','))
  from splitvalues
  where more <> ''
)
select trim(val, ', ') as country, count(*) as numstudies
from splitvalues
where val <> ''
group by val
order by val
)
select cs.country, numstudies, centroid(geometry) as geometry
from cs left join countries on cs.country=countries.name

Thanks to Samuel Bosch's original recipe which I modified to accept multi-values with or without spaces after the comma, and to remove the trailing commas from the final output.

QGIS field calculator method

Thanks to a suggestion from Kurt Menke I figured out an even easier method using the aggregate() expression.

Use field calculator to add a new column directly to the countries layer:

aggregate(
  'studies',
  'sum',
  if (regexp_match(','||"country"||',', ',\\s?'||attribute(@parent, 'name')||',') > 0, 1, 0)
)

This expression gets evaluated for each row in the country table. For each country, it adds up how many of the country fields in the study table include the current country name. The regexp_match() is written in a way to prevent values like 'West Virgina' from increasing the total for 'Virginia', if you are using state names.

It took my computer a couple seconds to run on a layer of 241 countries and just 5 studies, so if you have much more than that, you may want to save it as an actual new field rather than a virtual field. I think the field calculator method is less efficient than the virtual layers method, at least as far as the computation goes, but it is easier in that it requires less steps for the human, with no need to add a table join, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment