An overview and a few example of how to query the LDP dataset for those with local copies of the sqlite datastore but unfamiliar with SQL as a query language.
To dump a few columns from a specific transcript, see this gist.
If you're an R user, there's a sqlite interface called RSQLite, which makes it pretty easy to build "data frames" with selections straight from the LDP dataset (assuming you have your local copy of the dataset in a sqlite database).
LDP_DB=<path to your local copy of the dataset>
alias q="sqlite3 $LDP_DB"
To see the tables defined in the database:
q '.schema' | grep "CREATE TABLE"
... or the columns of a particular table:
q '.schema utterances'
The utterances
table is the primary table queried by our researchers. The relevant columns are defined and explained in our Transcript Column Spec.
In general, a simple query will look like:
SELECT <COLUMN NAMES> FROM <TABLE> WHERE <CONSTRAINTS>
Find parent utterances in transcript 22.01
(subject 22, session 1) with WH words:
q 'SELECT subject, session, p_utts
FROM utterances
WHERE subject=22 AND session=1 AND p_utts LIKE "wh%"
LIMIT 4'
subject session p_utts
22 1 what do you want?
22 1 what is that?
22 1 who's pen is that?
22 1 what?
Show one row of parent morphosyntax annotation where the parent syntax contains an
XMOD
ifier (i.e., a non-finite clause that is a nominal modifer (such as a relative
clause) or complement:
q 'select p_utts, p_mor, p_syn from utterances where p_syn like "%XMOD%" limit 1'
we still have one more to pick, right?
pro|we adv|still v|have det:num|one pro:indef|more inf|to v|pick co|right ?
1|3|SUBJ 2|3|JCT 3|0|ROOT 4|5|QUANT 5|3|OBJ 6|7|INF 7|5|XMOD 8|3|COM 9|3|PUNCT
Finally, try ...
q 'select subject, session, p_utts from utterances where subject=22 and session=1' > 22.01.xls
open 22.01.xls
You probably get the picture by now.
I occasionally forget what columns are available or how a particular column that I know exists is named. To see the full list of columns in the utterance table:
q '.schema utterances'
.
.
.
subject INTEGER,
session INTEGER,
row INTEGER,
last_update DATE,
time TEXT,
line TEXT,
key TEXT,
p_utts_orig TEXT,
p_utts TEXT,
p_chat TEXT,
p_enum TEXT,
p_mor TEXT,
p_syn TEXT,
.
.
.
Q. Is there an efficient way to phrase "or" statements? Often I want to look for the same thing in parent or child speech, like a combination of ...
select p_mor, p_syn from utterances where p_syn like "%XMOD%" and p_mor like "%part%PROG%" limit 10
select c_mor, c_syn, from utterances where c_syn like "%XMOD%" and c_mor like "%part%PROG%" limit 10
Should I just run those two queries separately?
A. I'd do this:
SELECT p_mor, p_syn, c_mor, c_syn FROM utterances
WHERE (p_syn like "%XMOD%" AND p_mor like "%part%PROG%")
OR (c_syn like "%XMOD%" AND c_mor like "%part%PROG%")
So ... you can use parentheses to group parts of the WHERE clause and then use AND, OR, and NOT as needed.
With SQL (Structured Query Language) you can pretty much specify any condition on or between columns. You just have to be careful about grouping the conditions in the right way. Notice how these two queries are very different:
SELECT * FROM utterances WHERE (a AND b) OR (c AND NOT (d OR e))
SELECT * FROM utterances WHERE (a AND (b OR c)) AND NOT (d OR e)
~/Documents/Work/LDP/data$ sqlite3 ldp.db
-- Loading resources from ~/.sqliterc
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
lines log sess subj tran
sqlite> .mode columns
sqlite> .header ON
sqlite> SELECT subject, session, row, p_utts FROM lines WHERE p_utts="hello." LIMIT 5;
subject session row p_utts
---------- ---------- ---------- ----------
22 2 1507 hello.
22 2 1583 hello.
22 3 1194 hello.
22 4 2079 hello.
22 5 2325 hello.
sqlite> select subject, session, row, p_utts from lines where p_utts like "hello%" limit 5;
subject session row p_utts
---------- ---------- ---------- ---------------------
22 1 900 hello @NAME.
22 2 1169 hello?
22 2 1507 hello.
22 2 1572 hello, right.
22 2 1583 hello.
sqlite> select subject, session, row, p_utts from lines where p_utts like "%hello%" limit 5;
subject session row p_utts
---------- ---------- ---------- ---------------------
22 1 900 hello @NAME.
22 2 1169 hello?
22 2 1507 hello.
22 2 1572 hello, right.
22 2 1583 hello.
sqlite> select subject, session, row, p_utts from lines where p_utts like "%hello%" and session > 5 limit 5;
subject session row p_utts
---------- ---------- ---------- ----------
22 6 587 hello?
22 8 293 "Pooh and
22 8 732 hello.
22 8 1276 hello.
22 8 1539 "hello, Fr
sqlite> .width 10 10 5 40
sqlite> select subject, session, row, p_utts from lines where p_utts like "%hello%" and session > 5 limit 5;
subject session row p_utts
---------- ---------- ----- ----------------------------------------
22 6 587 hello?
22 8 293 "Pooh and Piglet called hello through th
22 8 732 hello.
22 8 1276 hello.
22 8 1539 "hello, Freddy_Frog."
You can of course run such queries from the command-line as well.
$ alias q="sqlite3 $LDP_DB"
$ q 'select subject, session, p_utts from utterances where subject=100 and session=2 limit 10;'
subject session p_utts
100 2
100 2 luck.
100 2 luck.
100 2 Lucky.
100 2
100 2
100 2 is there --
100 2 is Elmo in there?
100 2
100 2 here @NAME.
$ q 'select subject, session, p_utts from utterances where p_utts like "%hello%" limit 10;'
subject session p_utts
22 1 hello @NAME.
22 2 hello?
22 2 hello.
22 2 hello, right.
22 2 hello.
22 2 hello?
22 2 it's a hello.
22 2 it's not a phone, it's a hello.
22 3 hello?
22 3 hello?