Skip to content

Instantly share code, notes, and snippets.

@simonw
Last active January 18, 2021 04:45
Show Gist options
  • Save simonw/0df922918cc653e73baa8d003df4d872 to your computer and use it in GitHub Desktop.
Save simonw/0df922918cc653e73baa8d003df4d872 to your computer and use it in GitHub Desktop.
Updating SQLite using a custom function and a click-powered progress bar
import click
import sqlite3
import dateparser
count = 0
c = sqlite3.connect("tweets.db")
with click.progressbar(
length=c.execute("select count(*) from tweets").fetchone()[0],
label='Updating dates'
) as bar:
def dateparser_parse(s):
bar.update(1)
return dateparser.parse(s).isoformat()
c.create_function("dateparser_parse", 1, dateparser_parse)
with c:
c.execute("""
update tweets set
publish_date=dateparser_parse(publish_date);
""")
@simonw
Copy link
Author

simonw commented Aug 3, 2018

You can paste the above straight into ipython. Output looks like this:

In [2]: import click
   ...: import sqlite3
   ...: import dateparser
   ...: 
   ...: count = 0
   ...: 
   ...: c = sqlite3.connect("tweets.db")
   ...: 
   ...: with click.progressbar(
   ...:     length=c.execute("select count(*) from tweets").fetchone()[0],
   ...:     label='Updating dates'
   ...: ) as bar:
   ...:     def dateparser_parse(s):
   ...:         bar.update(1)
   ...:         return dateparser.parse(s).isoformat()
   ...:     c.create_function("dateparser_parse", 1, dateparser_parse)
   ...:     with c:
   ...:         c.execute("""
   ...:             update tweets set
   ...:                 publish_date=dateparser_parse(publish_date);
   ...:         """)

Updating dates  [########----------------------------]   23%  0d 00:03:59

@simonw
Copy link
Author

simonw commented Aug 3, 2018

Interestingly... once this hits 100% Python keeps on churning for several minutes longer - and running watch 'ls -l tweets.db-journal shows that the journal file keeps increasing in size. So even though SQLite has executed the custom function for each row it still takes it a long time to finish committing the update. I don't understand the details of this.

@simonw
Copy link
Author

simonw commented Aug 3, 2018

Aha! That was because I was using bar.update() incorrectly - you don't send it a running total, you just need to send it a 1 to say "move forward by 1 unit out of the total length"

@simonw
Copy link
Author

simonw commented Aug 3, 2018

I tried speeding this up by chunking it into 1000 updates per commit, but that appears to run slower - estimated 11 hours compared to 9 hours:

import click
import sqlite3
import dateparser

count = 0

c = sqlite3.connect("tweets.db")

num_rows = c.execute("select count(*) from tweets").fetchone()[0]

with click.progressbar(
    length=num_rows * 2,
    label='Updating dates'
) as bar:
    def dateparser_parse(s):
        bar.update(1)
        return dateparser.parse(s).isoformat()
    c.create_function("dateparser_parse", 1, dateparser_parse)
    min_id = 0
    max_id = min_id + 1000
    while min_id < num_rows:
        with c:
            c.execute("""
                update tweets set
                    publish_date=dateparser_parse(publish_date),
                    harvested_date=dateparser_parse(harvested_date)
                where rowid > ? and rowid < ?
                ;
            """, (min_id, max_id))
        min_id = max_id
        max_id = min_id + 1000

@simonw
Copy link
Author

simonw commented Jan 18, 2021

I turned this into a tool: https://github.com/simonw/sqlite-transform/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment