Skip to content

Instantly share code, notes, and snippets.

@joyrexus
Last active January 26, 2016 23:02
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save joyrexus/6022332 to your computer and use it in GitHub Desktop.
Save joyrexus/6022332 to your computer and use it in GitHub Desktop.
Overview and examples of querying the LDP dataset.

LDP Queries

An overview and a few example of how to query the LDP dataset for those with local copies of the sqlite datastore but unfamiliar with SQL as a query language.


To dump a few columns from a specific transcript, see this gist.

If you're an R user, there's a sqlite interface called RSQLite, which makes it pretty easy to build "data frames" with selections straight from the LDP dataset (assuming you have your local copy of the dataset in a sqlite database).


Setup

LDP_DB=<path to your local copy of the dataset>
alias q="sqlite3 $LDP_DB"

To see the tables defined in the database:

q '.schema' | grep "CREATE TABLE"

... or the columns of a particular table:

q '.schema utterances'

The utterances table is the primary table queried by our researchers. The relevant columns are defined and explained in our Transcript Column Spec.

Sample queries

In general, a simple query will look like:

SELECT <COLUMN NAMES> FROM <TABLE> WHERE <CONSTRAINTS>

Find parent utterances in transcript 22.01 (subject 22, session 1) with WH words:

q 'SELECT subject, session, p_utts
   FROM utterances 
   WHERE subject=22 AND session=1 AND p_utts LIKE "wh%" 
   LIMIT 4'

subject session p_utts
22  1 what do you want?
22  1 what is that?
22  1 who's pen is that?
22  1 what?

Show one row of parent morphosyntax annotation where the parent syntax contains an XMODifier (i.e., a non-finite clause that is a nominal modifer (such as a relative clause) or complement:

q 'select p_utts, p_mor, p_syn from utterances where p_syn like "%XMOD%" limit 1'

we still have one more to pick, right?  
pro|we adv|still v|have det:num|one pro:indef|more inf|to v|pick co|right ?
1|3|SUBJ 2|3|JCT 3|0|ROOT 4|5|QUANT 5|3|OBJ 6|7|INF 7|5|XMOD 8|3|COM 9|3|PUNCT 

Finally, try ...

q 'select subject, session, p_utts from utterances where subject=22 and session=1' > 22.01.xls 
open 22.01.xls

You probably get the picture by now.

Listing all columns in a table

I occasionally forget what columns are available or how a particular column that I know exists is named. To see the full list of columns in the utterance table:

q '.schema utterances'

.
.
.

subject INTEGER,
session INTEGER,
row INTEGER,
last_update DATE,
time TEXT,
line TEXT,
key TEXT,
p_utts_orig TEXT,
p_utts TEXT,
p_chat TEXT,
p_enum TEXT,
p_mor TEXT,
p_syn TEXT,

.
.
.

More advanced queries

Q. Is there an efficient way to phrase "or" statements? Often I want to look for the same thing in parent or child speech, like a combination of ...

select p_mor, p_syn from utterances where p_syn like "%XMOD%" and p_mor like "%part%PROG%" limit 10

select c_mor, c_syn, from utterances where c_syn like "%XMOD%" and c_mor like "%part%PROG%" limit 10

Should I just run those two queries separately?

A. I'd do this:

SELECT p_mor, p_syn, c_mor, c_syn FROM utterances
       WHERE (p_syn like "%XMOD%" AND p_mor like "%part%PROG%")
          OR (c_syn like "%XMOD%" AND c_mor like "%part%PROG%")

So ... you can use parentheses to group parts of the WHERE clause and then use AND, OR, and NOT as needed.

With SQL (Structured Query Language) you can pretty much specify any condition on or between columns. You just have to be careful about grouping the conditions in the right way. Notice how these two queries are very different:

SELECT * FROM utterances  WHERE (a AND b) OR (c AND NOT (d OR e))
SELECT * FROM utterances WHERE (a AND (b OR c)) AND NOT (d OR e)

Sample sqlite session

~/Documents/Work/LDP/data$ sqlite3 ldp.db 
-- Loading resources from ~/.sqliterc
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

sqlite> .tables
lines  log    sess   subj   tran 

sqlite> .mode columns

sqlite> .header ON

sqlite> SELECT subject, session, row, p_utts FROM lines WHERE p_utts="hello." LIMIT 5;
subject     session     row         p_utts    
----------  ----------  ----------  ----------
22          2           1507        hello.    
22          2           1583        hello.    
22          3           1194        hello.    
22          4           2079        hello.    
22          5           2325        hello.    

sqlite> select subject, session, row, p_utts from lines where p_utts like "hello%" limit 5;
subject     session     row         p_utts               
----------  ----------  ----------  ---------------------
22          1           900         hello @NAME.
22          2           1169        hello?               
22          2           1507        hello.               
22          2           1572        hello, right.        
22          2           1583        hello.               

sqlite> select subject, session, row, p_utts from lines where p_utts like "%hello%" limit 5;
subject     session     row         p_utts               
----------  ----------  ----------  ---------------------
22          1           900         hello @NAME.
22          2           1169        hello?               
22          2           1507        hello.               
22          2           1572        hello, right.        
22          2           1583        hello.               

sqlite> select subject, session, row, p_utts from lines where p_utts like "%hello%" and session > 5 limit 5;
subject     session     row         p_utts
----------  ----------  ----------  ----------
22          6           587         hello?
22          8           293         "Pooh and
22          8           732         hello.
22          8           1276        hello.
22          8           1539        "hello, Fr

sqlite> .width 10 10 5 40

sqlite> select subject, session, row, p_utts from lines where p_utts like "%hello%" and session > 5 limit 5;
subject     session     row    p_utts
----------  ----------  -----  ----------------------------------------
22          6           587    hello?
22          8           293    "Pooh and Piglet called hello through th
22          8           732    hello.
22          8           1276   hello.
22          8           1539   "hello, Freddy_Frog."

You can of course run such queries from the command-line as well.

$ alias q="sqlite3 $LDP_DB"
$ q 'select subject, session, p_utts from utterances where subject=100 and session=2 limit 10;'
subject session p_utts
100 2
100 2 luck.
100 2 luck.
100 2 Lucky.
100 2
100 2
100 2 is there --
100 2 is Elmo in there?
100 2
100 2 here @NAME.

$ q 'select subject, session, p_utts from utterances where p_utts like "%hello%" limit 10;' 
subject session p_utts
22  1 hello @NAME.
22  2 hello?
22  2 hello. 
22  2 hello, right.
22  2 hello.
22  2 hello? 
22  2 it's a hello.
22  2 it's not a phone, it's a hello.
22  3 hello?
22  3 hello?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment