@atomotic
atomotic / README.md
Last active February 10, 2024 13:43
load xml files into SQLite and transform to json

Install sqlpkg

Install extensions

sqlpkg install sqlite/fileio
sqlpkg install jakethaw/xmltojson

Start
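The preview stops here, so the gist's actual commands are not shown. As a rough stand-in for the idea in the title (load XML into SQLite, get JSON back), here is a stdlib-only Python sketch. It does not use the `sqlite/fileio` or `jakethaw/xmltojson` extensions installed above; the `docs` table, the `element_to_dict` helper and the `@attributes` / `#text` keys are all made up for illustration.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an XML element into a plain dict (attributes, text, children)."""
    node = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    for child in elem:
        # Children are grouped by tag name; repeated tags keep document order.
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


def load_xml(conn, name, xml_text):
    """Store the raw XML next to a JSON rendering of it (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    as_json = json.dumps({root.tag: element_to_dict(root)})
    conn.execute(
        "INSERT INTO docs (name, xml, json) VALUES (?, ?, ?)",
        (name, xml_text, as_json),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, xml TEXT, json TEXT)")
load_xml(conn, "sample.xml", '<book id="1"><title>Example</title></book>')

stored = conn.execute(
    "SELECT json FROM docs WHERE name = ?", ("sample.xml",)
).fetchone()[0]
print(stored)
```

With the extensions loaded in the `sqlite3` shell the same steps would happen in SQL; the sketch only shows the shape of the data flow.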

➜ file 89595bd2-8076-4da0-8880-518c291e7904
89595bd2-8076-4da0-8880-518c291e7904: EPUB document
➜ tika -m -j 89595bd2-8076-4da0-8880-518c291e7904
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException from org.apache.tika.parser.epub.EpubParser@3a320ade
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:310)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:1071)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:493)
sqlite> .schema itemAnnotations
CREATE TABLE IF NOT EXISTS "itemAnnotations" (
itemID INTEGER PRIMARY KEY,
parentItemID INT NOT NULL,
type INTEGER NOT NULL,
authorName TEXT,
text TEXT,
comment TEXT,
color TEXT,
pageLabel TEXT,
atomotic / epub-search.md
Created November 13, 2021 12:11
indexing epub content into solr


solr schema

  • 1 document per chapter, then collapse
  • multivalued fields: chapter_title and chapter_text, keeping order.
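
The two bullets read like two candidate document layouts. A minimal sketch of both, assuming a book arrives as an ordered list of `(title, text)` pairs; only `chapter_title` and `chapter_text` come from the notes, while `book_id`, `book_title` and `chapter_order` are hypothetical field names.

```python
def chapter_docs(book_id, title, chapters):
    """Layout 1: one Solr document per chapter, grouped later via book_id."""
    return [
        {
            "id": f"{book_id}-{n}",
            "book_id": book_id,
            "book_title": title,
            "chapter_order": n,
            "chapter_title": ch_title,
            "chapter_text": ch_text,
        }
        for n, (ch_title, ch_text) in enumerate(chapters, start=1)
    ]


def book_doc(book_id, title, chapters):
    """Layout 2: one document per book with parallel multivalued fields.

    Both lists are built from the same ordered input, so index i of
    chapter_title and chapter_text refer to the same chapter.
    """
    return {
        "id": book_id,
        "book_title": title,
        "chapter_title": [t for t, _ in chapters],
        "chapter_text": [x for _, x in chapters],
    }


chapters = [("Intro", "First text"), ("Methods", "Second text")]
per_chapter = chapter_docs("book-1", "Sample", chapters)
per_book = book_doc("book-1", "Sample", chapters)

# Query parameters for layout 1: Solr's collapse query parser keeps one
# hit per book_id value.
collapse_params = {"q": "chapter_text:text", "fq": "{!collapse field=book_id}"}
```

Layout 1 makes it easy to tell which chapter matched; layout 2 keeps one hit per book without a collapse step, at the cost of relying on list position to pair titles with texts.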

text extraction

how to extract structured text from epub
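
One possible approach, sketched with the Python standard library only: an EPUB is a zip archive, `META-INF/container.xml` points at the package (OPF) file, and the OPF spine lists the content documents in reading order. The sketch assumes well-formed XHTML without named entities such as `&nbsp;` (which `xml.etree` would reject) and treats the first `h1`/`h2` as the chapter title; both are simplifications, and `extract_chapters` is a made-up helper name.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}
XHTML = "{http://www.w3.org/1999/xhtml}"


def spine_paths(zf):
    """Return the archive paths of the content documents in spine order."""
    container = ET.fromstring(zf.read("META-INF/container.xml"))
    opf_path = container.find(".//c:rootfile", NS).get("full-path")
    opf = ET.fromstring(zf.read(opf_path))
    manifest = {
        item.get("id"): item.get("href")
        for item in opf.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF file
    return [
        posixpath.join(base, manifest[ref.get("idref")])
        for ref in opf.findall("opf:spine/opf:itemref", NS)
    ]


def extract_chapters(epub_file):
    """Return one dict per spine item with a title and whitespace-normalized text."""
    chapters = []
    with zipfile.ZipFile(epub_file) as zf:
        for path in spine_paths(zf):
            doc = ET.fromstring(zf.read(path))
            body = doc.find(f"{XHTML}body")
            heading = next(
                (el for el in doc.iter() if el.tag in {XHTML + "h1", XHTML + "h2"}),
                None,
            )
            title = "".join(heading.itertext()).strip() if heading is not None else ""
            # Joining text nodes with a space keeps block elements apart, but it
            # can also split a word that is interrupted by inline markup.
            text = " ".join(" ".join(body.itertext()).split()) if body is not None else ""
            chapters.append(
                {"path": path, "chapter_title": title, "chapter_text": text}
            )
    return chapters


# Tiny in-memory EPUB-like archive (no mimetype entry), just to exercise the code.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "META-INF/container.xml",
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles></container>',
    )
    zf.writestr(
        "OEBPS/content.opf",
        '<package xmlns="http://www.idpf.org/2007/opf">'
        '<manifest><item id="c1" href="ch1.xhtml"/></manifest>'
        '<spine><itemref idref="c1"/></spine></package>',
    )
    zf.writestr(
        "OEBPS/ch1.xhtml",
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        "<h1>One</h1><p>Hello  world</p></body></html>",
    )
buf.seek(0)
chapters = extract_chapters(buf)
print(chapters)
```

The resulting `chapter_title` / `chapter_text` pairs line up with the field names in the schema notes above.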

version: "3"
node-exporter:
image: prom/node-exporter
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- "--path.procfs=/host/proc"
atomotic / readme.md
Created October 3, 2021 10:46
export a static image from Annotorious, with annotation data embedded
atomotic / iiif-annotation-studio.png
Last active March 7, 2019 15:58
iiif-annotation-studio
~ ipfs ls /ipns/ipfs-sec.stackexchange.cloudflare-ipfs.com/crypto/
zdj7WawSwGzackrPMpRyE5gB14rrR3CXSML4Cowsfo8RVA48m 261478141 A
zdj7WmdqpgAKtT6bik5FZiUuEBw3ibBE2Jvbf2yDHoCTeZtUR 2369946 -
zdj7WikogGGVBPciUv1hgnecawo7P4E6Rwj44LC6vdibCywTN 276549715 I
zdj7WfHUdRgg4LTr77ZX3tR3fLj7xDDpo8StCFFU4B2ZR43cm 1034 M
zdj7WfKAh1sMoUDZLq13yJzyb7dpHaqSV1p2Ftdx2VUgxMhFe 154 index.html
zdj7WcC5bvaUjBXYXZnsUa7Ghe2rtiSst9JnwpMwDTBWC4N4m 8343 search.html
zdj7WazUKfQCWpKePKDRBFaomsJrNcEu6U9obNP9UhLg6cArN 53013564 _index
atomotic / mastodon-followers.sh
Created August 31, 2018 08:26
get the list of followers of a mastodon user. output in ntriples
#!/usr/bin/env bash
instance="https://digipres.club"
user="raffaele"
json=$(curl -s -H "Accept: application/activity+json" "$instance/users/$user/followers?page=1")
echo "$json" | jq -r '.orderedItems[]' | xargs -I% echo "<%> <follows> <$instance/users/$user> ."
next=$(echo "$json" | jq -r .next)
while true; do
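
The preview stops at the start of the pagination loop. Here is one way to express the same idea in Python, assuming (as the script's jq filters suggest) that each page carries `orderedItems` and an optional `next` link. `fetch_page`, `iter_followers` and `to_ntriples` are illustrative names, and the demo runs against a fake in-memory collection instead of a live instance.

```python
import json
import urllib.request


def fetch_page(url):
    """GET one collection page as JSON, with the same Accept header as the script."""
    req = urllib.request.Request(
        url, headers={"Accept": "application/activity+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_followers(first_url, fetch=fetch_page):
    """Yield follower URIs page by page until a page has no `next` link."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page.get("orderedItems", [])
        url = page.get("next")


def to_ntriples(followers, actor_uri):
    """Format one triple per follower, mirroring the script's echo template."""
    return [f"<{f}> <follows> <{actor_uri}> ." for f in followers]


# Offline demo: two fake pages keyed by URL stand in for the HTTP responses.
pages = {
    "https://example.org/users/alice/followers?page=1": {
        "orderedItems": ["https://example.org/users/bob"],
        "next": "https://example.org/users/alice/followers?page=2",
    },
    "https://example.org/users/alice/followers?page=2": {
        "orderedItems": ["https://example.org/users/carol"],
    },
}
lines = to_ntriples(
    iter_followers(
        "https://example.org/users/alice/followers?page=1",
        fetch=pages.__getitem__,
    ),
    "https://example.org/users/alice",
)
print("\n".join(lines))
```

Passing `fetch` as a parameter keeps the paging logic testable without network access; the default `fetch_page` does the real request.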
# apt install rustc cargo
# git clone https://github.com/tari/warcdedupe
# cd warcdedupe
# cargo install
# ...
# ./target/debug/warcdedupe -h
WARC deduplicator.
Usage:
warcdedupe [options] [<infile>] [<outfile>]