Last active
August 29, 2015 14:01
Getting top unidentified places from Recogito DB
SELECT
  -- concat(toponym, toponym_corrected), count(*)
  -- Prefer the manual correction; fall back to the automatically detected toponym.
  coalesce(toponym_corrected, toponym) AS toponym_merged,
  count(*) AS occurrences
FROM annotations WHERE
  (status = 'NOT_IDENTIFYABLE' OR status = 'NO_SUITABLE_MATCH' OR status = 'AMBIGUOUS' OR status = 'MULTIPLE')
-- GROUP BY concat(toponym, toponym_corrected)
GROUP BY coalesce(toponym_corrected, toponym)
ORDER BY occurrences DESC;
As Hugh Cayless has pointed out:
In any case, the answer is probably the COALESCE() function instead of CONCAT()
Thanks again!
This query pulls the top unidentified toponyms out of our Recogito database. Toponyms are kept in a table with the following structure:
The problem: it's a bit of a hack. Recogito retains toponyms (which are typically identified by an automatic process) and manual corrections to them in separate columns (so that we can later benchmark the system). For a clean query, we'd need to flatten both columns into a toponym_merged column, according to the rule
"use toponym_corrected if available, otherwise use toponym if available, otherwise leave blank".
I frankly don't know how to implement this in SQL, so for now I'm just concatenating both columns (which works most of the time, since the cases where both columns are populated are rare). Any SQL hints would be greatly appreciated!
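The merge rule quoted above can be tried out on a throwaway table. What follows is only a minimal sketch, not the actual Recogito schema: it assumes an in-memory SQLite database, a cut-down `annotations` table with just the three columns the query touches, made-up sample rows, and a hypothetical `VERIFIED` status to stand in for "anything not in the filter". The `NULLIF(..., '')` wrapper is an extra assumption that covers the case where a missing value is stored as an empty string rather than NULL; if the columns only ever hold NULL, a plain `COALESCE(toponym_corrected, toponym)` should behave the same.

```python
import sqlite3

# Hypothetical stand-in for the annotations table: only the columns the
# query needs, not the real Recogito schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations (toponym TEXT, toponym_corrected TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [
        ("Roma", None, "AMBIGUOUS"),          # no correction: use toponym
        ("Rma", "Roma", "AMBIGUOUS"),         # correction wins over toponym
        ("Roma", "", "NO_SUITABLE_MATCH"),    # empty correction: use toponym
        ("Athenae", None, "MULTIPLE"),
        ("Londinium", None, "VERIFIED"),      # hypothetical status, filtered out
    ],
)

# COALESCE returns its first non-NULL argument; NULLIF(x, '') turns an
# empty string into NULL so that it is skipped as well.
QUERY = """
SELECT
  COALESCE(NULLIF(toponym_corrected, ''), NULLIF(toponym, '')) AS toponym_merged,
  COUNT(*) AS occurrences
FROM annotations
WHERE status IN ('NOT_IDENTIFYABLE', 'NO_SUITABLE_MATCH', 'AMBIGUOUS', 'MULTIPLE')
GROUP BY toponym_merged
ORDER BY occurrences DESC, toponym_merged
"""

rows = conn.execute(QUERY).fetchall()
for toponym_merged, occurrences in rows:
    print(toponym_merged, occurrences)
# prints:
# Roma 3
# Athenae 1
```

Grouping by the `toponym_merged` alias works in SQLite and PostgreSQL; on a database that does not accept output aliases in `GROUP BY`, repeating the full `COALESCE(...)` expression there should give the same result.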