Let's suppose we have a dataset of research studies conducted in various countries, and some studies are associated with multiple countries. Our studies
table might have multiple comma-delimited values in the country column, and there might be spaces after some of those commas, looking something like this:
id | title | country |
---|---|---|
1 | Study one | Mauritania |
2 | Study two | Mali |
3 | Study three | Guinea |
4 | Study four | Mali,Guinea |
5 | Study five | Mali, Senegal, Mauritania |
If we want to map how many studies took place in each country, we can do this using QGIS, using one of the two methods below.
With this method, we will create a derivate table that lists the number of studies associated with each country, and then join it to the countries layer.
From the Layer menu, select "Create Layer > New Virtual Layer..." and enter the following SQL query:
SQL QUERY #1:
with recursive splitvalues(id, val, more) as (
select id, '', replace(country, ', ', ',') || ','
from studies
union all
select id, substr(more, 0, instr(more, ',')), substr(more, 1+instr(more, ','))
from splitvalues
where more <> ''
)
select trim(val, ', ') as country, count(*) as n
from splitvalues
where val <> ''
group by val
order by val
This should give us the following output:
country | n |
---|---|
Guinea | 2 |
Mali | 3 |
Mauritania | 2 |
Senegal | 1 |
This table could then be joined manually to a country boundary layer for mapping. Or we could even do the joining within the virtual layer query. Assuming we have a country boundary polygon layer called countries
, here's how to output the study counts joined to countries as points, which could then be sized according to the number of studies:
SQL QUERY #2:
with cs as (
with recursive splitvalues(id, val, more) as (
select id, '', replace(country, ', ', ',') || ','
from studies
union all
select id, substr(more, 0, instr(more, ',')), substr(more, 1+instr(more, ','))
from splitvalues
where more <> ''
)
select trim(val, ', ') as country, count(*) as numstudies
from splitvalues
where val <> ''
group by val
order by val
)
select cs.country, numstudies, centroid(geometry) as geometry
from cs left join countries on cs.country=countries.name
Thanks to Samuel Bosch's original recipe which I modified to accept multi-values with or without spaces after the comma, and to remove the trailing commas from the final output.
Thanks to a suggestion from Kurt Menke I figured out an even easier method using the aggregate()
expression.
Use field calculator to add a new column directly to the countries
layer:
aggregate(
'studies',
'sum',
if (regexp_match(','||"country"||',', ',\\s?'||attribute(@parent, 'name')||',') > 0, 1, 0)
)
This expression gets evaluated for each row in the country
table. For each country, it adds up how many of the country
fields in the study
table include the current country name. The regexp_match() is written in a way to prevent values like 'West Virgina' from increasing the total for 'Virginia', if you are using state names.
It took my computer a couple seconds to run on a layer of 241 countries and just 5 studies, so if you have much more than that, you may want to save it as an actual new field rather than a virtual field. I think the field calculator method is less efficient than the virtual layers method, at least as far as the computation goes, but it is easier in that it requires less steps for the human, with no need to add a table join, etc.