-
-
Save simonw/0df922918cc653e73baa8d003df4d872 to your computer and use it in GitHub Desktop.
import click | |
import sqlite3 | |
import dateparser | |
count = 0 | |
c = sqlite3.connect("tweets.db") | |
with click.progressbar( | |
length=c.execute("select count(*) from tweets").fetchone()[0], | |
label='Updating dates' | |
) as bar: | |
def dateparser_parse(s): | |
bar.update(1) | |
return dateparser.parse(s).isoformat() | |
c.create_function("dateparser_parse", 1, dateparser_parse) | |
with c: | |
c.execute(""" | |
update tweets set | |
publish_date=dateparser_parse(publish_date); | |
""") |
Interestingly... once this hits 100% Python keeps on churning for several minutes longer - and running watch 'ls -l tweets.db-journal
shows that the journal file keeps increasing in size. So even though SQLite has executed the custom function for each row it still takes it a long time to finish committing the update. I don't understand the details of this.
Aha! That was because I was using bar.update()
incorrectly - you don't send it a running total, you just need to send it a 1
to say "move forward by 1 unit out of the total length"
I tried speeding this up by chunking it into 1000 updates per commit, but that appears to run slower - estimated 11 hours compared to 9 hours:
import click
import sqlite3
import dateparser
count = 0
c = sqlite3.connect("tweets.db")
num_rows = c.execute("select count(*) from tweets").fetchone()[0]
with click.progressbar(
length=num_rows * 2,
label='Updating dates'
) as bar:
def dateparser_parse(s):
bar.update(1)
return dateparser.parse(s).isoformat()
c.create_function("dateparser_parse", 1, dateparser_parse)
min_id = 0
max_id = min_id + 1000
while min_id < num_rows:
with c:
c.execute("""
update tweets set
publish_date=dateparser_parse(publish_date),
harvested_date=dateparser_parse(harvested_date)
where rowid > ? and rowid < ?
;
""", (min_id, max_id))
min_id = max_id
max_id = min_id + 1000
I turned this into a tool: https://github.com/simonw/sqlite-transform/
You can paste the above straight into ipython. Output looks like this: