indexing epub content into solr
solr schema
- 1 document per chapter, then collapse
- multivalued fields: chapter_title and chapter_text, keeping order.
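a minimal sketch of the first option (one document per chapter), in python. the field names `book_id`, `book_title` and `chapter_no` are assumptions made up for illustration, not part of any existing schema; `book_id` is the field one would collapse on at query time (for example with solr's collapsing query parser, `fq={!collapse field=book_id}`).

```python
import json


def chapter_docs(book_id, book_title, chapters):
    """Build one Solr document per chapter, preserving chapter order.

    `chapters` is an ordered list of (chapter_title, chapter_text) tuples.
    All field names other than chapter_title/chapter_text are hypothetical.
    """
    docs = []
    for position, (title, text) in enumerate(chapters, start=1):
        docs.append({
            "id": f"{book_id}-ch{position:03d}",  # unique key per chapter
            "book_id": book_id,                   # field to collapse on
            "book_title": book_title,
            "chapter_no": position,               # keeps the reading order
            "chapter_title": title,
            "chapter_text": text,
        })
    return docs


docs = chapter_docs(
    "book-1",
    "Example Book",
    [("Intro", "Hello."), ("Method", "Details.")],
)
print(json.dumps(docs, indent=2))
```

the resulting list can be sent as json to the core's update handler; with the multivalued option the same data would become a single document with two parallel lists instead.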
text extraction
how to extract structured text from epub
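one possible approach, sketched with the python standard library only: an epub is a zip archive, `META-INF/container.xml` points to the package (opf) file, and the opf spine lists the content documents in reading order. the helper names `TextExtractor` and `read_chapters` are made up for this sketch, and taking the first `h1`/`h2` as the chapter title is an assumption that will not hold for every book.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


class TextExtractor(HTMLParser):
    """Collect visible text and the first h1/h2 of an XHTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = None
        self._in_heading = False
        self._skip = 0  # depth inside head/script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("head", "script", "style"):
            self._skip += 1
        elif tag in ("h1", "h2") and self.title is None:
            self._in_heading = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag in ("head", "script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_heading:
            self.title += data
        self.parts.append(data)


def read_chapters(epub_path):
    """Return [(title, text), ...] in spine (reading) order."""
    with zipfile.ZipFile(epub_path) as z:
        container = ET.fromstring(z.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf = ET.fromstring(z.read(opf_path))
        # manifest maps item ids to hrefs relative to the opf file
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.findall(".//opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)
        chapters = []
        for itemref in opf.findall(".//opf:spine/opf:itemref", NS):
            href = unquote(manifest[itemref.get("idref")])
            parser = TextExtractor()
            parser.feed(z.read(posixpath.join(base, href)).decode("utf-8"))
            parser.close()
            # join with spaces, then normalise all runs of whitespace
            text = " ".join(" ".join(parser.parts).split())
            chapters.append(((parser.title or "").strip(), text))
        return chapters
```

`read_chapters("book.epub")` would return the ordered (title, text) pairs for a hypothetical `book.epub`. spine items are not always chapters (covers, title pages, split files), so a real pipeline would probably also need the navigation document or the ncx to group them.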
sqlite> .schema itemAnnotations
CREATE TABLE IF NOT EXISTS "itemAnnotations" (
  itemID INTEGER PRIMARY KEY,
  parentItemID INT NOT NULL,
  type INTEGER NOT NULL,
  authorName TEXT,
  text TEXT,
  comment TEXT,
  color TEXT,
  pageLabel TEXT,
version: "3"
node-exporter:
  image: prom/node-exporter
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - "--path.procfs=/host/proc"
joining the brainstorm :) i was recently impressed by this excalidraw feature: embedding the raw data into an exported image
https://twitter.com/excalidraw/status/1316001446043750400
https://twitter.com/dluzar/status/1316005742512607232
the idea: export an image (png) from annotorious with the annotation boxes drawn on it. the export also embeds the json web annotation data.
reimporting the image into an annotorious setup would open the image (static or iiif) with editable annotations.
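a rough sketch of the embedding part, in python with the standard library only. it stores the json in a png `iTXt` text chunk placed just before `IEND` and reads it back. this is one way to do it, not necessarily how excalidraw or annotorious handle it; the keyword `annotations`, the function names and the sample annotation are all made up for illustration.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(ctype, data):
    """Serialise one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for every chunk of a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 crc


def embed_annotations(png, annotations, keyword="annotations"):
    """Return a copy of `png` with the JSON stored in an iTXt chunk."""
    # iTXt layout: keyword, NUL, compression flag, compression method,
    # language tag, NUL, translated keyword, NUL, UTF-8 text
    payload = (keyword.encode("latin-1") + b"\x00" + b"\x00\x00"
               + b"\x00" + b"\x00" + json.dumps(annotations).encode("utf-8"))
    out = [PNG_SIGNATURE]
    for ctype, data in iter_chunks(png):
        if ctype == b"IEND":
            out.append(_chunk(b"iTXt", payload))
        out.append(_chunk(ctype, data))
    return b"".join(out)


def extract_annotations(png, keyword="annotations"):
    """Return the embedded JSON, or None if no matching chunk exists."""
    prefix = keyword.encode("latin-1") + b"\x00" * 5
    for ctype, data in iter_chunks(png):
        if ctype == b"iTXt" and data.startswith(prefix):
            return json.loads(data[len(prefix):].decode("utf-8"))
    return None


def tiny_png():
    """Build a 1x1 grayscale PNG to try the round trip on."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", idat) + _chunk(b"IEND", b""))


# hypothetical sample annotation, only to exercise the round trip
sample = [{"type": "Annotation", "body": {"value": "a box"}}]
exported = embed_annotations(tiny_png(), sample)
print(extract_annotations(exported) == sample)
```

image viewers ignore unknown text chunks, so the exported file still displays as a normal png. note that many image pipelines (resizers, social media uploads) strip such chunks, so the round trip only works when the file is passed along untouched.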
~ ipfs ls /ipns/ipfs-sec.stackexchange.cloudflare-ipfs.com/crypto/
zdj7WawSwGzackrPMpRyE5gB14rrR3CXSML4Cowsfo8RVA48m 261478141 A
zdj7WmdqpgAKtT6bik5FZiUuEBw3ibBE2Jvbf2yDHoCTeZtUR 2369946 -
zdj7WikogGGVBPciUv1hgnecawo7P4E6Rwj44LC6vdibCywTN 276549715 I
zdj7WfHUdRgg4LTr77ZX3tR3fLj7xDDpo8StCFFU4B2ZR43cm 1034 M
zdj7WfKAh1sMoUDZLq13yJzyb7dpHaqSV1p2Ftdx2VUgxMhFe 154 index.html
zdj7WcC5bvaUjBXYXZnsUa7Ghe2rtiSst9JnwpMwDTBWC4N4m 8343 search.html
zdj7WazUKfQCWpKePKDRBFaomsJrNcEu6U9obNP9UhLg6cArN 53013564 _index
#!/usr/bin/env bash
instance="https://digipres.club"
user="raffaele"
# fetch the first page of followers as activitypub json
json=$(curl -s -H "Accept: application/activity+json" "$instance/users/$user/followers?page=1")
# print one "<follower> <follows> <user>" triple per follower
echo "$json" | jq -r '.orderedItems[]' | xargs -I% echo "<%> <follows> <$instance/users/$user> ."
next=$(echo "$json" | jq -r .next)
while true; do
# apt install rustc cargo
# git clone https://github.com/tari/warcdedupe
# cd warcdedupe
# cargo install
# ...
# ./target/debug/warcdedupe -h
WARC deduplicator.
Usage:
  warcdedupe [options] [<infile>] [<outfile>]
install solr and create a core (books)
brew install solr
solr start
solr create -c books -d /usr/local/Cellar/solr/7.2.1/example/files/conf
index a pdf
post -c books /tmp/gabriella-giannachi-archive-everything-mapping-the-everyday.pdf
spotify:album:4nSWX5A4xVomzrOEGDKLQ6 - Slowdive, Slowdive
spotify:album:4JQ2igmQEWUihSRzWgTiCF - Gas, Narkopop
spotify:album:0D8xltlqklXZ1DV7lFyE22 - Drew McDowall, Unnatural Channel
spotify:album:7Hcbzsu4lqRzPakrCnpgb9 - Emptyset, Borders
spotify:album:4y372QHtXp8aJCV7M4YkBv - Lawrence English, Cruel Optimism
spotify:album:5EXqFb0ch5dqP2ncl63XVY - Gnod, Just Say No To The Psycho Right-Wing Capitalist Fascist Industrial Death Machine
spotify:album:4yLRI4kaOy4LhSPZ2sCVbE - Godflesh, Post Self
spotify:album:6B1OkPs0AlG9QsHIxKwrgp - William Basinski, A Shadow In Time
spotify:album:6LDgPsDJlyJ948ARpncN9c - Alessandro Cortini, Avanti
spotify:album:02RHfsgbl7H9lnYXEsTLsA - Wire, Silver / Lead