Skip to content

Instantly share code, notes, and snippets.

@ummjackson
Last active January 4, 2023 01:03
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ummjackson/050044e661c6c5fa3ae61aa8a8baf485 to your computer and use it in GitHub Desktop.
Save ummjackson/050044e661c6c5fa3ae61aa8a8baf485 to your computer and use it in GitHub Desktop.
Run full-text search SQL on a Mastodon federated stream

Mastodon Streaming Full-Text Search

This is a (hacky) implementation of full-text search in Python for Mastodon. Built and tested with Python 3.10

Errors and disconnections are not handled, so you'll need to implement that for something more robust and production ready. If you just want to monitor certain keywords via SQL queries from a real-time stream, however, the simple implementation may suit your needs.

You can change the instance URL on Line 28 (default: mastodon.social) to fetch the federated timeline from somehwere else, if you'd like. Big, well-federated instances will obviously expose the most rich streams.

Running

Install dependencies:

pip3 install Mastodon.py jsonpickle bs4

Begin capturing stream:

python3 stream.py

You can then open the "streaming.db" SQLite database and begin querying it. Note that only statuses created after you begin capturing the stream will be search-able. Here's an example of how to search:

sqlite3

.open streaming.db

SELECT * FROM statuses WHERE statuses MATCH "cats" ORDER BY created_at DESC;

The above example will return all records matching the term "cats", sorted in reverse chronological order. For more advanced querying options, read the SQLite FTS5 documentation.

import mastodon
import logging
import json
import sqlite3
import jsonpickle
from bs4 import BeautifulSoup
# Setup logging
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)
# Setup database connection
con = sqlite3.connect("streaming.db")
con.execute('pragma journal_mode=wal')
cur = con.cursor()
# SQL to create the full-text search table if it doesn't exist
fts_create_sql = """
CREATE VIRTUAL TABLE IF NOT EXISTS statuses
USING FTS5(content_text, created_at UNINDEXED, status_id UNINDEXED, status_json UNINDEXED);
"""
cur.execute(fts_create_sql)
con.commit()
logging.info("SQLite ready to go, beginning streaming")
# Begin streaming
api_noauth = mastodon.Mastodon(api_base_url = 'https://mastodon.social')
class Listener(mastodon.StreamListener):
def on_update(self, status):
logging.info("Update: %s", status['id'])
insert_sql = """
INSERT INTO statuses (content_text, created_at, status_id, status_json)
VALUES (?, ?, ?, ?)
"""
status_json = jsonpickle.encode(status, unpicklable=False, max_depth=4, make_refs=False)
statusRecord = {
'content_text': json.dumps(BeautifulSoup(status['content'], 'html.parser').get_text(), ensure_ascii=False),
'created_at': status['created_at'],
'status_id': status['id'],
'status_json': status_json
}
cur.execute(insert_sql, tuple(statusRecord.values()))
con.commit()
def on_delete(self, status_id):
logging.info("Delete: %s", status_id)
delete_sql = "DELETE FROM statuses WHERE status_id=?"
cur.execute(delete_sql, (status_id,))
con.commit()
api_noauth.stream_public(Listener())
@ummjackson
Copy link
Author

Update: Just realized that the Mastodon implementation of Snowflake IDs means that IDs are not guaranteed to be unique across instances... so the on_delete function, which relies on the ID field may not work correctly sometimes. Hard to tell. Better solution would be to update to delete based on URL of the post or some concatenation of fields to make it a universally unique identifier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment