@rsimon
Last active August 29, 2015 14:01
Getting top unidentified places from Recogito DB
SELECT
  -- Earlier version used concat(toponym, toponym_corrected).
  -- Prefer the manual correction; fall back to the automatically detected toponym.
  -- Assumes blank cells are stored as NULL, not as empty strings.
  coalesce(toponym_corrected, toponym) AS toponym_merged,
  count(*) AS occurrences
FROM annotations
WHERE status IN ('NOT_IDENTIFYABLE', 'NO_SUITABLE_MATCH', 'AMBIGUOUS', 'MULTIPLE')
GROUP BY coalesce(toponym_corrected, toponym)
ORDER BY occurrences DESC;

rsimon commented May 13, 2014

This query pulls the top unidentified toponyms out of our Recogito database. Toponyms are kept in a table with the following structure:

toponym     | toponym_corrected  | status
---------------------------------------------------------
Lybia       |                    | AMBIGUOUS
Africa      |                    | AMBIGUOUS
            | Metellus           | NO_SUITABLE_MATCH
Ilicitanian | Ilicitanian Gulf   | NOT_IDENTIFYABLE

The problem: it's a bit of a hack. Recogito retains toponyms (which are typically identified by an automatic process) and manual corrections to them in separate columns (so that we can later benchmark the system). For a clean query, we'd need to flatten both columns into a toponym_merged column, according to the rule

"use toponym_corrected if available, otherwise use toponym if available, otherwise leave blank".

I frankly don't know how to implement this in SQL. So I'm just concatenating both columns right now (which works most of the time, since cases where both columns are populated are rare). But any SQL hints would be greatly appreciated!
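
One possible way to express the rule above, as a minimal sketch rather than a definitive answer: COALESCE returns its first non-NULL argument, so putting toponym_corrected first gives it priority, and NULLIF(x, '') turns empty strings into NULL so that blank cells are skipped either way. The in-memory SQLite database is only there to make the sketch runnable; the sample rows mirror the table above, the column aliases toponym_merged and occurrences are illustrative names, and whether blank cells in the real Recogito table are NULL or empty strings is an assumption (both are covered here).

```python
import sqlite3

# In-memory SQLite database, used only so the sketch can be run as-is.
# COALESCE and NULLIF are standard SQL and also exist in PostgreSQL and MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations (toponym TEXT, toponym_corrected TEXT, status TEXT)"
)

# Sample rows mirroring the table above. Blank cells are stored once as NULL
# and once as '' because the real storage convention is an assumption.
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [
        ("Lybia", None, "AMBIGUOUS"),
        ("Africa", "", "AMBIGUOUS"),
        (None, "Metellus", "NO_SUITABLE_MATCH"),
        ("Ilicitanian", "Ilicitanian Gulf", "NOT_IDENTIFYABLE"),
    ],
)

# Rule: use toponym_corrected if available, otherwise toponym, otherwise NULL.
# The expression is repeated in GROUP BY because not every database accepts
# a SELECT alias there.
QUERY = """
SELECT
  COALESCE(NULLIF(toponym_corrected, ''), NULLIF(toponym, '')) AS toponym_merged,
  count(*) AS occurrences
FROM annotations
WHERE status IN ('NOT_IDENTIFYABLE', 'NO_SUITABLE_MATCH', 'AMBIGUOUS', 'MULTIPLE')
GROUP BY COALESCE(NULLIF(toponym_corrected, ''), NULLIF(toponym, ''))
ORDER BY occurrences DESC, toponym_merged
"""

rows = conn.execute(QUERY).fetchall()
for name, occurrences in rows:
    print(name, occurrences)
```

With this ordering of arguments, the row holding both "Ilicitanian" and "Ilicitanian Gulf" is counted under the corrected form, and rows with only one populated column fall through to whichever value exists. If the blank cells are known to be NULL, the NULLIF wrappers can be dropped.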
