@thbar
Forked from bitsgalore/scratchpad.md
Created December 18, 2020 12:31

Experimental attempt at getting organized ...

12/12/2020

Dead Simple Python

https://dev.to/codemouse92/introducing-dead-simple-python-563o

28/11/2020

Microkorg editor troubleshooting

If loading sounds from .prg files gives unexpected results: check that the MIDI channel is set to 1 before launching the editor! Reportedly, MIDI clock needs to be set to external as well.

14/11/2020

A Visual Guide to Regular Expression

In this post, I will illustrate the various concepts underlying regex. The goal is to help you build a good mental model of how a regex pattern works.

https://amitness.com/regex/

11/11/2020

List of applications - ArchWiki

[A] general list of applications sorted by category, as a reference for those looking for packages. Many sections are split between console and graphical applications.

https://wiki.archlinux.org/index.php/List_of_applications

03/11/2020

How to move /var/www/html folder to external hdd?

https://superuser.com/questions/1101851/how-to-move-var-www-html-folder-to-external-hdd/1101856

Also:

https://askubuntu.com/questions/1220778/how-can-web-server-access-external-hdd

29/10/2020

Thorium Reader

Thorium Reader is an easy to use EPUB reading application for Windows 10/10S, MacOS and Linux.

https://github.com/edrlab/thorium-reader/releases

21/10/2020

Apache: redirect all folder root references to home.htm file in folder

This seems to work:

RedirectMatch ^(.*)/$ $1/home.htm
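A quick way to see which request paths this pattern rewrites is to try the same regular expression in Python. This is only a sketch: Python's re module stands in for Apache's regex engine, and redirect_target is a hypothetical helper name, not anything Apache provides.

```python
import re

# Same pattern as in the RedirectMatch directive above.
FOLDER_ROOT = re.compile(r"^(.*)/$")


def redirect_target(path):
    """Return the redirect target for a URL path, or None if it is left alone."""
    match = FOLDER_ROOT.match(path)
    if match is None:
        return None
    # "$1/home.htm" in Apache syntax: first capture group plus the file name.
    return match.group(1) + "/home.htm"
```

Only paths that end in a slash are redirected; a request for a file inside the folder is left untouched.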

16/10/2020

MIDI not working under Jack / Reaper

In View menu, open routing matrix and click on system:midi midi playback2 (needs to be enabled first from Preferences). Routing is set for each track.

15/10/2020

Virtualbox fails after kernel update

https://askubuntu.com/questions/819939/virtualbox-fails-after-kernel-update

06/10/2020

The Quartz guide to bad data

An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.

https://github.com/Quartz/bad-data-guide

14/09/2020

What's so hard about PDF text extraction?

https://filingdb.com/b/pdf-text-extraction

11/09/2020

More than 100 scientific journals have disappeared from the Internet

https://www.nature.com/articles/d41586-020-02610-z

07/09/2020

ftfy: fixes text for you

ftfy fixes Unicode that's broken in various ways.

https://github.com/LuminosoInsight/python-ftfy

03/09/2020

QGIS Flatpak instructions

https://www.qgis.org/en/site/forusers/alldownloads.html#flatpak

01/09/2020

Keep Remote SSH Sessions and Processes Running After Disconnection

https://www.tecmint.com/keep-remote-ssh-sessions-running-after-disconnection/

Steps:

screen

Then issue commands. Then press Ctrl-a followed by d to detach. Log out.

31/08/2020

Linux - display details on startup/boot

systemd-analyze time

Result (in this case there's some odd firmware delay):

Startup finished in 1min 55.160s (firmware) + 10.965s (loader) + 3.955s (kernel) + 10.002s (userspace) = 2min 20.085s
graphical.target reached after 9.996s in userspace

Detailed breakdown:

systemd-analyze blame

Result:

          7.416s NetworkManager-wait-online.service
          1.966s vboxdrv.service
           827ms apt-daily-upgrade.service
           558ms systemd-fsck@dev-disk-by\x2duuid-9224\x2d4AC1.service
           500ms dev-sdb1.device
           477ms systemd-journal-flush.service
           ::   ::
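To sort or sum these timings outside systemd, the duration strings can be converted to seconds. A minimal sketch, assuming only the units seen in the output above (min, s, ms); to_seconds and parse_blame are made-up helper names.

```python
import re

# Seconds per unit, for the units that appear in the output above.
UNIT_SECONDS = {"min": 60.0, "s": 1.0, "ms": 0.001}
TOKEN = re.compile(r"([0-9.]+)(min|ms|s)")


def to_seconds(duration):
    """Convert a string such as '1min 55.160s' or '827ms' to seconds."""
    return sum(float(value) * UNIT_SECONDS[unit]
               for value, unit in TOKEN.findall(duration))


def parse_blame(text):
    """Return (seconds, unit_name) pairs from 'systemd-analyze blame' output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # The last field is the unit name; everything before it is the duration.
        if len(parts) >= 2 and TOKEN.fullmatch(parts[0]):
            rows.append((to_seconds(" ".join(parts[:-1])), parts[-1]))
    return rows
```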
           

Split large text file into smaller files

Here, split into 500,000-line files:

split -l 500000 -d 2019-05-21_all_domains_NL.txt domains-nl
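Where split is not available, roughly the same result can be had from Python. A sketch under assumptions: split_lines is a hypothetical helper, and it only mimics the two-digit numeric suffixes (00, 01, ...) for the first hundred chunks.

```python
from itertools import islice


def split_lines(path, prefix, lines_per_file=500_000):
    """Write consecutive chunks of `path` to prefix00, prefix01, ... ."""
    written = []
    with open(path, "rb") as source:
        index = 0
        while True:
            # Read at most lines_per_file lines without loading the whole file.
            chunk = list(islice(source, lines_per_file))
            if not chunk:
                break
            name = f"{prefix}{index:02d}"
            with open(name, "wb") as target:
                target.writelines(chunk)
            written.append(name)
            index += 1
    return written
```

For the file above, the call would be split_lines("2019-05-21_all_domains_NL.txt", "domains-nl").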

30/08/2020

How to delete all your files

https://www.reddit.com/r/linux/comments/if1krd/how_to_delete_all_your_files/

26/08/2020

PDF info/validation/testing commands

qpdf --check --verbose whatever.pdf
pdfinfo whatever.pdf

Or (forces reading of all text):

pdftotext whatever.pdf
jhove -m PDF-hul -i whatever.pdf
gs -dNOPAUSE -dBATCH -sDEVICE=nullpage whatever.pdf

Using PDFDebugger (activates GUI-type browser):

java -jar ~/pdfbox/pdfbox-app-2.0.21.jar PDFDebugger whatever.pdf
mutool info whatever.pdf 
verapdf whatever.pdf

(Or use GUI).

pdfcpu validate whatever.pdf

Note to self: installed this by copying the Linux binary to ~/.local/bin/ (doesn't require GoLang).
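Several of these checks can be run in one go from a small script. The sketch below uses only command lines that appear in the notes above and skips tools that are not installed; build_commands and run_checks are hypothetical names, and exit status alone may not tell the whole story for every tool.

```python
import shutil
import subprocess

# Commands taken from the notes above; "{pdf}" is replaced by the file name.
CHECKS = [
    ["qpdf", "--check", "--verbose", "{pdf}"],
    ["pdfinfo", "{pdf}"],
    ["mutool", "info", "{pdf}"],
    ["verapdf", "{pdf}"],
    ["pdfcpu", "validate", "{pdf}"],
]


def build_commands(pdf, which=shutil.which):
    """Return the check commands whose executable is found on PATH."""
    return [[arg.replace("{pdf}", pdf) for arg in cmd]
            for cmd in CHECKS if which(cmd[0])]


def run_checks(pdf):
    """Run every available check and return {tool: exit status}."""
    results = {}
    for cmd in build_commands(pdf):
        done = subprocess.run(cmd, capture_output=True, text=True)
        results[cmd[0]] = done.returncode
    return results
```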

Compare two PDFs

Compare text (verbose output):

comparepdf ct -v=2 whatever.pdf wherever.pdf

Compare appearance (verbose output):

comparepdf ca -v=2 whatever.pdf wherever.pdf

12/08/2020

Reaper reports JACK: error creating client error on startup

First run jackd:

jackd -dalsa -dhw:USB -r48000 -p128 -n3 -Xseq

See also here

02/08/2020

Convert stereo audio file to mono, changing bit depth and sampling frequency

To 8-bit, 15 kHz:

sox versatility.wav -b 8 -r 15k versatility_8.wav remix -

BUT sox output is really noisy; better results with ffmpeg:

ffmpeg -i boc-arpeggio.wav -ar 15000 -acodec pcm_u8 boc-arpeggio-8ff.wav

27/07/2020

How to hide a list in HTML without javascript

https://stackoverflow.com/a/13127738/1209004

14/07/2020

Create shared folder on local network (Linux Mint, Caja file manager)

From instructions here:

Install samba and caja-share

sudo apt install samba
sudo apt install caja-share

Set up usershares folder and make sambashare group owner

sudo mkdir /var/lib/samba/usershares
sudo chgrp sambashare /var/lib/samba/usershares
sudo chmod 1770 /var/lib/samba/usershares

Set samba password

sudo smbpasswd -a your_username

Then reboot machine, and right-click folder in Caja and select sharing options. After this, folder is accessible from other machines on the local network.

06/07/2020

Python - CGI Programming

https://www.tutorialspoint.com/python/python_cgi_programming.htm

30/06/2020

U.S. National Archives and Records Administration Digital Preservation Framework

150 formats added in latest release:

https://github.com/usnationalarchives/digital-preservation

Convert Kazam output to HTML5-compatible MP4

ffmpeg -i mirror.mp4 -vcodec libx264 -pix_fmt yuv420p -profile:v baseline -level 3 -strict -2 mirror-264.mp4

(Source)

24/06/2020

HTML video elements in local Jekyll site not working in Chrome

https://stackoverflow.com/questions/48876911/embedded-local-mp4-not-playing-in-chrome-when-running-jekyll-serve-econnreset

Apparently works when deployed live:

https://exoji2e.github.io/2019/02/18/video-tag-in-chrome.html

18/06/2020

Enable CGI Scripts on Apache

https://www.ionos.com/community/server-cloud-infrastructure/apache/enable-cgi-scripts-on-apache/

But this assumes 1 fixed dir for cgi scripts.

Apache Tutorial: Dynamic Content with CGI

https://httpd.apache.org/docs/2.4/howto/cgi.html

This explains how to set custom script locations.

17/06/2020

File naming conventions based on Semantic tagging

https://karl-voit.at/managing-digital-photographs/

Tools here:

https://github.com/novoid

19/05/2020

Prospect Mail

The Outlook desktop client for the new Outlook Interface from MS Office 365.

https://github.com/julian-alarcon/prospect-mail

18/05/2020

Test images Developer's Image Library

https://sourceforge.net/p/openil/svn/1554/tree/trunk/Test%20Images/

JPEG 2000 (Bandcamp)

https://jpeg2000.bandcamp.com

Bitrot tool

Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay.

https://github.com/ambv/bitrot

16/05/2020

Python for AV

https://lis655.github.io/av-python-carpentry/

13/05/2020

JPEG White Paper: JPEG XL image coding system

http://ds.jpeg.org/whitepapers/jpeg-xl-whitepaper.pdf

Setting up Python-based web server

Just run:

python3 -m http.server

Then site can be accessed from:

http://127.0.0.1:8000/

Useful for testing with local files, not suitable for production. More info:

https://developer.mozilla.org/en-US/docs/Learn/Common_questions/set_up_a_local_testing_server
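The same server can also be started from a script, which is handy when a particular folder has to be served. A minimal sketch using the standard library's http.server; start_server is a hypothetical helper, and, as above, this is for local testing only.

```python
import functools
import http.server
import threading


def start_server(directory=".", port=8000):
    """Serve `directory` on 127.0.0.1 in a background thread; return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    httpd = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd
```

Call httpd.shutdown() on the returned object to stop serving.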

08/05/2020

Two Bit Bash Script Library

https://twobitpreservation.com/script-library

07/05/2020

Download videos from YouTube (and more sites)

https://ytdl-org.github.io/youtube-dl/index.html

03/05/2020

We read the privacy policies of Skype, Meet, and Webex: 10 ways videoconferencing systems can better protect privacy for customers

https://medium.com/cr-digital-lab/skype-meet-webex-videoconference-privacy-845bc8360fd3

02/05/2020

Digital Repair Cafe (Project CEST)

In terms of goals and scope, this looks a lot like the NDE project on physical carriers:

https://automatic-ingest-digital-archives.github.io/Digital-Repair-Cafe/

See also, for example, "Handleiding Verouderde Dragers Herkennen" (a guide to recognising obsolete carriers):

https://www.projectcest.be/wiki/Publicatie:Handleiding_Verouderde_Dragers_Herkennen

How to Read a Floppy Disk on a Modern PC or Mac

https://www.howtogeek.com/669331/how-to-read-a-floppy-disk-on-a-modern-pc-or-mac/

30/04/2020

Reduce PDF file size

Using Ghostscript:

https://askubuntu.com/a/256449/1052776

23/04/2020

Choosing the right video conferencing tool for the job

https://freedom.press/training/blog/videoconferencing-tools/

COVID-19 and Cybersecurity

https://medium.com/@gdbelvin/covid-19-and-cybersecurity-e9ee5cba6de7

SPARQL queries YUL digital preservation

https://www.wikidata.org/wiki/User:YULdigitalpreservation/SPARQL2#Disk_image_file_formats

17/04/2020

Preservica adds headers/footers to exported HTML files

wellcomecollection/platform#4425

16/04/2020

The Robustness of Apache Tika

https://cwiki.apache.org/confluence/display/TIKA/The+Robustness+of+Apache+Tika

09/04/2020

How to use Jitsi Meet, an open source Zoom alternative

https://mashable.com/article/how-to-use-jitsi-meet-zoom-alternative/

05/04/2020

Malware Analysis Fundamentals - Files | Tools

https://winitor.com/pdf/Malware-Analysis-Fundamentals-Files-Tools.pdf

02/04/2020

The best alternatives to Zoom for videoconferencing

https://www.theverge.com/2020/4/1/21202945/zoom-alternative-conference-video-free-app-skype-slack-hangouts-jitsi

01/04/2020

Github Wikis

https://help.github.com/en/github/building-a-strong-community/about-wikis

And:

https://help.github.com/en/github/building-a-strong-community/adding-or-editing-wiki-pages

Simple-Jekyll-Search

A JavaScript library to add search functionality to any Jekyll blog:

https://github.com/christian-fei/Simple-Jekyll-Search

27/03/2020

Jitsi installation instructions

https://jitsi.org/downloads/ubuntu-debian-installations-instructions/

Jitsi servers NL

https://vc4all.nl/

26/03/2020

Books.Files: Preservation of Digital Assets in the Contemporary Publishing Industry

https://drum.lib.umd.edu/handle/1903/25605

22/03/2020

Digital preservation policies and strategies (Caylin Smith)

https://docs.google.com/spreadsheets/d/1nAPh6M5c2VlvuFtdMIDEfxwdLvQ-47-i0ZicUUGkzjM/edit#gid=0

21/03/2020

Disable / enable webcam from terminal

Disable until reboot:

sudo modprobe -r uvcvideo

Enable again:

sudo modprobe uvcvideo

Source

18/03/2020

Create large test file with only null bytes

For a 1 MB file:

dd if=/dev/zero of=file.dat count=1024 bs=1024

Same, 1 GB file:

dd if=/dev/zero of=file.dat count=1024 bs=1048576
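On systems without dd, a file of null bytes can be written from Python. A sketch; write_null_file is a hypothetical helper that writes in blocks so that large files do not have to fit in memory.

```python
def write_null_file(path, size_bytes, block_size=1024 * 1024):
    """Create `path` containing exactly `size_bytes` null bytes."""
    block = bytes(block_size)  # bytes(n) yields n zero bytes
    remaining = size_bytes
    with open(path, "wb") as out:
        while remaining > 0:
            step = min(block_size, remaining)
            out.write(block[:step])
            remaining -= step
```

The 1 MB case above corresponds to write_null_file("file.dat", 1024 * 1024), the 1 GB case to write_null_file("file.dat", 1024 * 1048576).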

Source

17/03/2020

Washing machine reports detergent overdose

https://www.wasmachines.nl/forum/457-miele-w2203-lampje-overdosering/

https://community.consumentenbond.nl/woning-huishouden-8/miele-wasmachine-trommelkruis-designed-to-fail-16834

But:

https://www.klusidee.nl/Forum/miele-w-3821-wasmachine-meldt-contr-dosering-t46008.html

So: run a wash at 95 degrees; otherwise use a special cleaning agent.

16/03/2020

WordToEPUB

https://daisy.org/activities/software/wordtoepub/

Announcement:

https://daisy.org/news-events/articles/new-epub-creation-tool/

12/03/2020

OneDrive – Some files weren’t downloaded

https://web.archive.org/web/20190704152920/http://yannickborghmans.com/2018/05/19/onedrive-some-files-werent-downloaded/

11/03/2020

Download files and folders from OneDrive or SharePoint

https://support.office.com/en-us/article/download-files-and-folders-from-onedrive-or-sharepoint-5c7397b7-19c7-4893-84fe-d02e8fa5df05

Downloads are subject to the following limits: individual file size limit: 10GB; total zip file size limit: 20GB; total number of files limit: 10,000.

10/03/2020

Unzipping 6 GB OneDrive ZIP file under Linux fails

Reworked this into a blog:

https://www.bitsgalore.org/2020/03/11/does-microsoft-onedrive-export-large-ZIP-files-that-are-corrupt

04/03/2020

Map Windows Folder to a Drive Letter for Quick and Easy Access

https://www.raymond.cc/blog/map-folder-or-directory-to-drive-letter-for-quick-and-easy-access/

03/03/2020

What's so hard about PDF text extraction?

https://www.filingdb.com/pdf-text-extraction

02/03/2020

Graphviz

Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks.

https://www.graphviz.org/

01/03/2020

Bot Sentinel

Bot Sentinel is a free platform developed to detect and track trollbots and untrustworthy Twitter accounts.

https://botsentinel.com/

The Importance of Digital Persistence

https://philarcher.org/diary/2020/importanceOfPersistence/

25/02/2020

How to Sync Microsoft OneDrive with Linux

https://www.maketecheasier.com/sync-onedrive-linux/

21/02/2020

COinS for Your Jekyll Blog

https://matthewlincoln.net/2014/03/15/coins-for-your-jekyll-blog.html

17/02/2020

Persistent identifiers for heritage objects

https://journal.code4lib.org/articles/14978

15/02/2020

Google Webfonts Helper

https://google-webfonts-helper.herokuapp.com/fonts

14/02/2020

Notes on the Troubleshooting and Repair of Compact Disc Players and CDROM Drives

https://www.repairfaq.org/sam/cdfaq.htm

Check items under "Intermittent or erratic operation" and "Operation is poor or erratic when cold".

NAD CD player repair video

https://www.youtube.com/watch?v=jAehSoTmLGY

12/02/2020

Jekyll without plugins

https://jekyllcodex.org/without-plugins/

DLF Levels of Born-Digital Access

https://osf.io/af4eq/

04/02/2020

Firefox web archives add-on

https://github.com/dessant/web-archives

28/01/2020

Accessing Digital Archives Guide, UNC Library

https://guides.lib.unc.edu/accessdigitalarchives

Geolocate URL

Command-line:

https://www.maketecheasier.com/ip-address-geolocation-lookups-linux/

Python:

https://pypi.org/project/geoip2/

Uses MaxMind databases.

BUT getting the IP address from a URL is difficult in Python, so perhaps better to use bash:

https://linuxhandbook.com/find-website-ip-address-linux/
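For a single host, Python's standard library may be enough: split the host name off the URL and hand it to the system resolver. A sketch with hypothetical helper names; it returns one IPv4 address only, which will not cover sites behind multiple addresses or IPv6-only hosts.

```python
import socket
from urllib.parse import urlsplit


def host_from_url(url):
    """Extract the host name from a URL (the scheme must be present)."""
    return urlsplit(url).hostname


def ip_from_url(url):
    """Resolve the URL's host to an IPv4 address string via the system resolver."""
    host = host_from_url(url)
    if host is None:
        raise ValueError(f"no host name in {url!r}")
    return socket.gethostbyname(host)
```

The resulting address could then be passed to a geoip2 database lookup.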

Windows registry code for Pandoc context menu item

Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\*\shell\mkd2doc]
[HKEY_CLASSES_ROOT\*\shell\mkd2doc\command]
@="\"F:\\Pandoc\\pandoc.exe\" -s -S --ascii -N --toc-depth=2 \"%1\" -o \"%1.docx\""

Then save as pandoc.reg.

22/01/2020

Changed behaviour of Python collections in Python 3.8

This may be relevant to Iromlab or OmSipCreator:

https://docs.python.org/3/whatsnew/3.8.html#collections

Example:

https://github.com/kieranjol/IFIscripts/commit/c6eedd9ec0821b7108f7a93f81bf043a6cb53d20

(Via Twitter)

18/01/2020

PinePhone

https://en.wikipedia.org/wiki/PinePhone

16/01/2020

Everything I know about SSDs

http://kcall.co.uk/ssd/index.html

Task failed successfully pin

https://www.hellovoid.online/product/task-failed-successfully-enamel-pin-pre-order

09/01/2020

Low disk space on boot partition

https://forums.linuxmint.com/viewtopic.php?t=265077

Solved by running following codeblock (as described here):

OLDCONF=$(dpkg -l|grep "^rc"|awk '{print $2}')
CURKERNEL=$(uname -r|sed 's/-*[a-z]//g'|sed 's/-386//g')
LINUXPKG="linux-(image|headers|ubuntu-modules|restricted-modules)"
METALINUXPKG="linux-(image|headers|restricted-modules)-(generic|i386|server|common|rt|xen)"
OLDKERNELS=$(dpkg -l|awk '{print $2}'|grep -E $LINUXPKG |grep -vE $METALINUXPKG|grep -v $CURKERNEL)
YELLOW="\033[1;33m"
RED="\033[0;31m"
ENDCOLOR="\033[0m"
sudo apt-get purge $OLDKERNELS

24/12/2019

The 2010s were supposed to bring the ebook revolution. It never quite came.

https://www.vox.com/culture/2019/12/23/20991659/ebook-amazon-kindle-ereader-department-of-justice-publishing-lawsuit-apple-ipad

15/12/2019

Microsoft Access: The Database Software That Won’t Die

https://medium.com/young-coder/microsoft-access-the-zombie-database-software-that-wont-die-5b09e389c166

12/12/2019

On Implementation of Open Standards in Software: To What Extent Can ISO Standards be Implemented in Open Source Software?

Some interesting observations on JPEG 2000:

http://www.diva-portal.org/smash/get/diva2:925474/FULLTEXT01.pdf

12/11/2019

Search Github gists by user

curl user:bitsgalore

04/11/2019

Two New Tools that Tame the Treachery of Files

https://blog.trailofbits.com/2019/11/01/two-new-tools-that-tame-the-treachery-of-files/

02/11/2019

EML attachments in O365 - a recipe for phishing

https://isc.sans.edu/forums/diary/EML+attachments+in+O365+a+recipe+for+phishing/25474/

01/11/2019

xkcd Earth Temperature Timeline

https://xkcd.com/1732/

31/10/2019

Manage Docker as a non-root user

https://docs.docker.com/install/linux/linux-postinstall/

30/10/2019

Linked multisession discs (CD-ROM)

http://www.gburner.com/online-help/what-is-multisession-disc.htm

"When you add more files in a subsequent session, a complete new file system is written for the new session, but it can include references to files recorded in the previous session; this is known as linked multisession."

History:

https://web.archive.org/web/20050211005128/http://www.roxio.com/en/support/cdr/multisessionhistory.html

28/10/2019

KPN Secure File Transfer

https://filetransfer.kpn.com/

23/10/2019

Location for AppImage files

The official recommendation is to use a folder in the home directory (see https://askubuntu.com/questions/1092742/where-should-i-put-appimages-files), but since the home directory on my home PC is on a slow HD whereas the OS and all other software are on a fast SSD, I created a directory under root:

/Applications/

Then move AppImage files there.

16/10/2019

List of web archives

https://erichennekam.blogspot.com/2014/07/lijst-webarchieven-in-de-wereld-want.html

14/10/2019

Levels of Born-Digital Access

https://docs.google.com/document/d/1N1fG4AgyBEJISc3tk5rWAc_3ZYdDbdVK4_Dbi_TusYQ/edit

13/10/2019

Computer Files Are Going Extinct

https://onezero.medium.com/the-death-of-the-computer-file-doc-43cb028c0506

08/10/2019

Why most academic journals are following outdated publishing practices

https://blog.scholasticahq.com/post/why-academic-journals-are-following-outdated-publishing-practices/

04/10/2019

Running Iromlab wrapped commands manually

For testing only:

C:\Users\jkn010\AppData\Roaming\Python\Python36\site-packages\iromlab\tools\libcdio\win64\cd-info.exe -C H: --no-header --no-device-info --no-disc-mode --no-cddb --dvd > cd-info.log

"C:\Program Files\dBpoweramp\BatchRipper\Loaders\Nimbie\Pre-Batch\Pre-Batch.exe" --drive="H"  --logfile="prebatch.log" --passerrorsback="prebatcherrors.log"

"C:\Program Files\dBpoweramp\BatchRipper\Loaders\Nimbie\Load\Load.exe" --drive="H" --rejectifnodisc  --logfile="load.log" --passerrorsback="loaderrors.log"

"C:\Program Files (x86)\Smart Projects\IsoBuster\IsoBuster.exe" /d:H: /ei:test-h.iso /et:u /ep:oea /ep:npc /c /m /nosplash /s:1 /l:ib-h.log

01/10/2019

Software setup for Device Side FC5025 floppy controller, Linux

  1. Compile and install the software according to official documentation

  2. In file /etc/udev/rules.d/025_fc5025.rules, replace the two occurrences of SYSFS with ATTRS

  3. Run:

    sudo usermod -a -G floppy $USER

  4. Reboot the machine

Tested with Linux Mint 18.3 (Sylvia), equivalent to Ubuntu Xenial.

Sources: https://groups.google.com/forum/#!topic/bitcurator-users/K1BPIbdKoOY/discussion + email correspondence with Device Side Data (the creator of the FC5025).

28/09/2019

OfficeToPDF

OfficeToPDF is a command line utility that converts Microsoft Office 2003, 2007, 2010, 2013 and 2016 documents from their native format into PDF using Office's in-built PDF export features.

https://github.com/cognidox/OfficeToPDF

27/09/2019

QEMU QED

"ffmprovisr for QEMU":

https://eaasi.gitlab.io/qemu-qed/

25/09/2019

OpenShot video editor

https://www.openshot.org/

(Used this for iPRES video)

Kdenlive video editor

https://kdenlive.org/en/

(Used this for earlier video, I think).

Copy of Apache-related files on Linux machine

Directories /etc/apache2 and /var/www and file /etc/hosts copied to folder backup-webserver on backup disk BAKWA. Copied using:

  • sudo rsync -avhl /var/www/ ./var/www

  • sudo rsync -avhl /etc/apache2/ ./etc/apache2

  • sudo rsync -avhl /etc/hosts ./etc/

To be restored after reinstall.

ATA Secure Erase (erase SSD disk)

https://ata.wiki.kernel.org/index.php/ATA_Secure_Erase

23/09/2019

MIT Digital Media Transfer Kits

https://libguides.mit.edu/digmediatransfer

How to Sync Microsoft OneDrive with Linux

https://www.maketecheasier.com/sync-onedrive-linux/

20/09/2019

U.S. National Archives Digital Preservation Framework

https://github.com/usnationalarchives/digital-preservation

15/09/2019

Learning Machine Learning

https://cloud.google.com/products/ai/ml-comic-1/

11/09/2019

DiscImageCreator

https://github.com/saramibreak/DiscImageCreator

(via Twitter)

06/09/2019

Appendix A: Tables of File Formats | National Archives

https://www.archives.gov/records-mgmt/policy/transfer-guidance-tables.html

27/08/2019

Microservices in Audiovisual Archives

This document describes and examines strategies for designing lightweight microservice environments for the processing of digital, file-based, audiovisual data within an archive.

http://journal.iasa-web.org/pubs/article/view/70

22/08/2019

Fix Bless "not enough free space on the device to save file" errors

  1. Close Bless, and open preferences file (/home/johan/.config/bless/preferences.xml) in a text editor.
  2. Set temp dir by editing pref element with ByteBuffer.TempDir name attribute
  3. Add closing </preferences> tag and save the file. File should look like below:
    <preferences>
        <pref name="ByteBuffer.TempDir">/tmp/Bless</pref>
        <pref name="Default.NumberBase">Hexadecimal</pref>
        <pref name="Undo.Actions">100</pref>
        <pref name="View.Toolbar.Show">True</pref>
        <pref name="Undo.Limited">False</pref>
        <pref name="View.Statusbar.Show">True</pref>
        <pref name="Session.RememberWindowGeometry">True</pref>
        <pref name="Default.Layout.UseCurrent">False</pref>
        <pref name="Session.RememberCursorPosition">True</pref>
        <pref name="Session.AskBeforeLoading">False</pref>
        <pref name="View.Statusbar.Selection">True</pref>
        <pref name="Tools.Statistics.Show">False</pref>
        <pref name="View.Statusbar.Offset">True</pref>
        <pref name="Tools.ConversionTable.LEDecoding">False</pref>
        <pref name="Default.EditMode">Insert</pref>
        <pref name="Tools.ConversionTable.Show">True</pref>
        <pref name="Highlight.PatternMatch">True</pref>
        <pref name="Undo.KeepAfterSave">Memory</pref>
        <pref name="Session.LoadPrevious">True</pref>
        <pref name="View.Statusbar.Overwrite">True</pref>
        <pref name="Default.Layout.File">
    </preferences>
  4. Make the file read-only:
    chmod 0444 /home/johan/.config/bless/preferences.xml
    

Done!

Source here

Update: this didn't quite work, but a workaround is to enter the location of the temp dir (/tmp/Bless) directly in Bless' user interface as a text string (so don't use the file navigation widgets!).

16/08/2019

Philology and the digital writing process

https://filologiaunlp.files.wordpress.com/2018/06/ries_philology-and-the-digital-writing-process_2017.pdf

14/08/2019

JP2 images in Tika regression corpus

http://162.242.228.174/share/jp2.tgz

13/08/2019

Going Commando - Put Down The Mouse

https://blog.codinghorror.com/going-commando-put-down-the-mouse/

Mouseless Computing

https://weblogs.asp.net/jongalloway/Mouseless-Computing

Hack Attack: Mouse-less Firefox

https://lifehacker.com/hack-attack-mouse-less-firefox-139495

09/08/2019

Python reverse geocode

Reverse Geocode takes a latitude / longitude coordinate and returns the country and city.

https://pypi.org/project/reverse-geocode/

03/08/2019

"Verloren jouw gegevens" ("Lost your data")

Source: https://twitter.com/Eijsbouts/status/1157591377624150016

31/07/2019

1995: a quarter of large companies on the Internet

https://twitter.com/rutger_/status/1156629656533110787 (archived)

Delpher link: https://resolver.kb.nl/resolve?urn=ABCDDD:010870971:mpeg21:a0117

Use as context for the xxLINK presentation!

29/07/2019

Install Android on VirtualBox

https://www.howtogeek.com/164570/how-to-install-android-in-virtualbox/

Then in VirtualBox change display option "Graphics Controller" to VBoxVGA and enable 3D acceleration, as per here.

Home Assistant

https://www.home-assistant.io/

27/07/2019

Renoise audio configuration

Added following lines to /etc/security/limits.conf, as per here:

johan - rtprio 99
johan - nice -10

11/07/2019

deja-dup / duplicity keeps asking for encryption password

See:

https://askubuntu.com/questions/462085/deja-dup-repeatedly-asks-encryption-password

Tried:

  • Re-install of duplicity
  • Changed ownership of a few dirs in home that were owned by root.

Start backup from terminal:

export DEJA_DUP_DEBUG=1
deja-dup --backup

Result: backup appears to be created, but after verification stage deja-dup asks for password again. Tail end of debug output:

DUPLICITY: .     self.gpg_failed()
DUPLICITY: .   File "/usr/lib/python2.7/dist-packages/duplicity/gpg.py", line 272, in gpg_failed
DUPLICITY: .     raise GPGError(msg)
DUPLICITY: .  GPGError: GPG Failed, see log below:
DUPLICITY: . ===== Begin GnuPG log =====
DUPLICITY: . gpg: WARNING: "--no-use-agent" is an obsolete option - it has no effect
DUPLICITY: . gpg: AES256 encrypted data
DUPLICITY: . gpg: encrypted with 1 passphrase
DUPLICITY: . gpg: decryption failed: Bad session key
DUPLICITY: . ===== End GnuPG log =====
DUPLICITY: . 
DUPLICITY: . 

DUPLICITY: ERROR 31 GPGError
DUPLICITY: . GPGError: GPG Failed, see log below:
DUPLICITY: . ===== Begin GnuPG log =====
DUPLICITY: . gpg: WARNING: "--no-use-agent" is an obsolete option - it has no effect
DUPLICITY: . gpg: AES256 encrypted data
DUPLICITY: . gpg: encrypted with 1 passphrase
DUPLICITY: . gpg: decryption failed: Bad session key
DUPLICITY: . ===== End GnuPG log =====
DUPLICITY: . 

10/07/2019

nwipe - securely erase disks (dban fork)

https://linux.die.net/man/1/nwipe

08/07/2019

Archaeology of the Amsterdam digital city; why digital data are dynamic and should be treated accordingly

https://www.tandfonline.com/doi/full/10.1080/24701475.2017.1309852

02/07/2019

Toward Environmentally Sustainable Digital Preservation

https://dash.harvard.edu/handle/1/40741399

25/06/2019

Deja-dup filling up home dir

After attaching a large external HD + including it in the backup scheme, deja-dup eats up all space of main HD. Cause: deja-dup writes some metadata and manifest files to home dir at:

~/.cache/deja-dup/

These files become very large (here: > 18 GB), which results in running out of disk space. This apparently causes problems for lots of deja-dup users, e.g. here, here. This post suggests solving this by moving the cache to another disk with sufficient space and replacing ~/.cache/deja-dup/ with a symlink to it:

mkdir /media/johan/BAKWA/.deja-dup-cache
mv ~/.cache/deja-dup/* /media/johan/BAKWA/.deja-dup-cache/
rmdir ~/.cache/deja-dup
ln -sf /media/johan/BAKWA/.deja-dup-cache ~/.cache/deja-dup

UPDATE: doesn't work, files are still written to home dir!! Interim solution: exclude external drive from deja-dup backup scheme, and back it up manually with rsync (no incremental backup though!).

20/06/2019

Format USB drive as ext4

List partitions:

df -h

Result:

Filesystem      Size  Used Avail Use% Mounted on
udev            3,9G     0  3,9G   0% /dev
tmpfs           789M  9,5M  780M   2% /run
/dev/sda1       227G  202G   14G  94% /
tmpfs           3,9G   34M  3,9G   1% /dev/shm
tmpfs           5,0M  4,0K  5,0M   1% /run/lock
tmpfs           3,9G     0  3,9G   0% /sys/fs/cgroup
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs           789M   32K  789M   1% /run/user/1000
/dev/sdb1       1,9T  144M  1,9T   1% /media/johan/Elements4

So in this case we need to format /dev/sdb1. Unmount the disk:

sudo umount /dev/sdb1

Format as ext4:

sudo mkfs.ext4 /dev/sdb1

Change generic label to WEBARCH:

sudo e2label /dev/sdb1 WEBARCH

Done!

Copy directory tree with rsync

 #!/bin/bash
 # Script must be run as root!

 sourceDir=/media/johan/Elements4/webarcheologie
 destDir=/media/johan/WEBARCH/
 rsync -avhl --dry-run $sourceDir $destDir

Copy homedir:

#!/bin/bash
# Script must be run as root!

sourceDir=~
destDir=/media/johan/BAKWA/homedir-25022020/
rsync -avhl $sourceDir $destDir

17/06/2019

Filesystem Hierarchy Standard

https://www.linuxjournal.com/content/filesystem-hierarchy-standard

11/06/2019

Researcher, Don’t Make Your Readers Scream!

https://www.cl.cam.ac.uk/~lp15/Pages/Scream.html

07/06/2019

Quick MAME/MESS Philips CD-I Tutorial (Mame 0.172)

https://forums.launchbox-app.com/topic/29631-quick-mamemess-philips-cd-i-tutorial-mame-0-172/

30/05/2019

Reader Privacy: The New Shape of the Threat (Clifford Lynch)

https://publications.arl.org/16ivjbv/ (PDF link)

27/05/2019

LaTeX setup notes

First install the following packages:

sudo apt install texlive-latex-extra
sudo apt-get install texlive-bibtex-extra biber
sudo apt-get install texlive-fonts-recommended

Then download the OpenSans package here. Install using following steps:

  1. Copy doc/, fonts/, source/, and tex/ directories to /etc/texmf directory
  2. Run mktexlsr to refresh the file name database and make TeX aware of the new files.
  3. Run sudo updmap -sys --enable Map=opensans.map to make Dvips, dvipdf and pdfTeX aware of the new fonts.

26/05/2019

Digital Physical Carrier Illustrations

https://blog.matthewburgess.net/2019/05/digital-physical-carrier-illustrations.html

22/05/2019

Corrupt a file - The file corrupter you were looking for!

https://corrupt-a-file.net/

18/05/2019

Manuals HP ProDesk

https://support.hp.com/us-en/product/hp-prodesk-400-g3-microtower-pc/7638325/manuals

16/05/2019

Regex to convert smart quotes with regular ones (and vice-versa)

https://gist.github.com/zerolab/1633661

Convert dumb quotes to smart quotes in Python

https://gist.github.com/davidtheclark/5521432

Even easier, use SmartyPants:

https://pypi.org/project/smartypants/
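For the opposite direction (smart quotes back to plain ones), a translation table from the standard library is probably all that is needed. A sketch; straighten_quotes is a made-up name, and only the four most common curly quote characters are handled.

```python
# Map typographic quotes to their plain ASCII counterparts.
STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text):
    """Replace curly quotes with straight ones."""
    return text.translate(STRAIGHT)
```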

09/05/2019

Library of Congress Web Archive Data Sets

https://labs.loc.gov/experiments/webarchive-datasets/

01/05/2019

Unraveling the JPEG

https://parametric.press/issue-01/unraveling-the-jpeg/

20/04/2019

Floppy disks are like Jesus

15/04/2019

ArchiveBox

ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).

https://archivebox.io/

02/04/2019

Text in PDF has no Unicode mapping

Short of AI, your best bet is to run OCR (tesseract) on these files.

https://lists.apache.org/thread.html/d25f20eda1c2094f0902e7b7092d829a64085b3b87aad2b8b346a453@%3Cuser.tika.apache.org%3E

23/03/2019

Identification of audio CD on Linux

Use cd-discid:

cd-discid /dev/sr1

Result:

b608ed0f 15 150 8656 19406 37656 48025 58358 71683 77998 90546 103443 117153 120751 132154 144223 157688 2287

Lookup in freedb using:

http://freedb.freedb.org/~cddb/cddb.cgi?cmd=cddb+query+b608ed0f+15+150+8656+19406+37656+48025+58358+71683+77998+90546+103443+117153+120751+132154+144223+157688+2287&hello=user+hostname+program+version&proto=3
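The query URL is just the cd-discid fields joined with plus signs, so it can be generated from the command's output. The sketch below also recomputes the disc ID from the track offsets as a sanity check, using the usual CDDB checksum. The function names are hypothetical, the host name and hello string are copied from the note above, and the server may no longer respond.

```python
def cddb_disc_id(offsets, total_seconds):
    """Compute the 8-digit hex CDDB disc ID from frame offsets and disc length."""
    # Sum the decimal digits of each track's start time in seconds (75 frames/s).
    checksum = sum(int(d) for off in offsets for d in str(off // 75))
    playtime = total_seconds - offsets[0] // 75
    value = ((checksum % 0xFF) << 24) | (playtime << 8) | len(offsets)
    return f"{value:08x}"


def freedb_query_url(discid_output,
                     base="http://freedb.freedb.org/~cddb/cddb.cgi",
                     hello="user+hostname+program+version"):
    """Build the 'cddb query' URL from one line of cd-discid output."""
    fields = discid_output.split()
    disc_id, ntracks = fields[0], int(fields[1])
    offsets = [int(f) for f in fields[2:2 + ntracks]]
    total_seconds = int(fields[2 + ntracks])
    if cddb_disc_id(offsets, total_seconds) != disc_id:
        raise ValueError("disc ID does not match the track offsets")
    return f"{base}?cmd=cddb+query+{'+'.join(fields)}&hello={hello}&proto=3"
```

For the disc above, the checksum byte is b6, the playing time 2285 seconds (hex 08ed) and the track count 15 (hex 0f), which together give b608ed0f.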

Result:

200 rock b608ed0f Der Plan / Unkapitulierbar

Full record:

http://www.freedb.org/freedb/rock/b608ed0f

# xmcd
#
# Track frame offsets: 
#        150
#        8656
#        19406
#        37656
#        48025
#        58358
#        71683
#        77998
#        90546
#        103443
#        117153
#        120751
#        132154
#        144223
#        157688
#
# Disc length: 2287 seconds
#
# Revision: 0
# Processed by: cddbd v1.5.2PL0 Copyright (c) Steve Scherf et al.
# Submitted via: ExactAudioCopy v0.99pb5
#
DISCID=b608ed0f
DTITLE=Der Plan / Unkapitulierbar
DYEAR=2017
DGENRE=Electronic
TTITLE0=Wie der Wind weht
TTITLE1=Lass die Katze stehn!
TTITLE2=Man leidet herrlich
TTITLE3=Grundrecht
TTITLE4=Es heißt: die Sonne
TTITLE5=Gesicht ohne Buch
TTITLE6=Stille hören
TTITLE7=Flohmarkt der Gefühle
TTITLE8=Der Herbst
TTITLE9=Körperlos im Cyberspace
TTITLE10=Zu Besuch bei N. Senada
TTITLE11=Wie schwarz ist ein Rabe?
TTITLE12=Come Fly With Me
TTITLE13=Was kostet der Austritt?
TTITLE14=Die Hände des Astronauten
EXTD=
EXTT0=
EXTT1=
EXTT2=
EXTT3=
EXTT4=
EXTT5=
EXTT6=
EXTT7=
EXTT8=
EXTT9=
EXTT10=
EXTT11=
EXTT12=
EXTT13=
EXTT14=
PLAYORDER=
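
A record like the one above is easy to read into a dictionary. The sketch below is a simplification: `parse_xmcd` is a hypothetical helper, it assumes the TTITLEn lines appear in track order, and it does not handle long values that are continued on repeated keys.

```python
def parse_xmcd(text):
    """Parse an xmcd record; TTITLEn values are collected in record['tracks']."""
    record = {"tracks": []}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.startswith("TTITLE"):
            record["tracks"].append(value)
        else:
            record[key] = value
    return record


sample = """# xmcd
DISCID=b608ed0f
DTITLE=Der Plan / Unkapitulierbar
TTITLE0=Wie der Wind weht
TTITLE1=Lass die Katze stehn!
"""
record = parse_xmcd(sample)
# DTITLE holds "artist / album"
artist, _, album = record["DTITLE"].partition(" / ")
print(artist, album, record["tracks"])
```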

Python: cddb-py; Python 3 port here.

See also CDDB.

08/03/2019

Update forked Git repository

From here:

git fetch upstream
git checkout master
git rebase upstream/master
git push -f origin master

06/03/2019

ExifTool: report custom image properties to CSV file

Suppose we want to extract the Jpeg2000:NumberOfComponents field for each JP2 image:

exiftool -csv -Jpeg2000:NumberOfComponents /media/johan/Elements4/test/*.jp2 > exif.csv

Result:

SourceFile,NumberOfComponents
/media/johan/Elements4/test/HS-19640508-001.jp2,3
/media/johan/Elements4/test/HS-19640508-002.jp2,3
::
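
Once the CSV exists, a quick tally shows whether any image deviates from the expected number of components. A small Python sketch (the helper name `count_components` is an assumption, and the sample rows are the two shown above):

```python
import csv
from collections import Counter


def count_components(csv_lines):
    """Count how often each NumberOfComponents value occurs in ExifTool CSV output."""
    reader = csv.DictReader(csv_lines)
    return Counter(row["NumberOfComponents"] for row in reader)


sample = [
    "SourceFile,NumberOfComponents",
    "/media/johan/Elements4/test/HS-19640508-001.jp2,3",
    "/media/johan/Elements4/test/HS-19640508-002.jp2,3",
]
print(count_components(sample))
```

For the real file, pass an open file object instead of the list, e.g. `open("exif.csv", newline="")`.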

05/03/2019

ImageMagick: resize all images in directory to fixed width

mogrify -resize 1014 *.jpg

(Note: this changes the images in-place, so make a copy of the original images before doing this).

12/02/2019

ImageMagick: fix 'convert: not authorized' on PDF

https://alexvanderbist.com/posts/2018/fixing-imagick-error-unauthorized

10/02/2019

Emulation resources list (Ethan Gates)

https://github.com/EG-tech/emulation-resources

29/01/2019

Big List of Naughty Strings

The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.

https://github.com/minimaxir/big-list-of-naughty-strings

03/01/2019

Twitter search advanced guide

https://espirian.co.uk/twitter-search-advanced-guide/

22/12/2018

Mounting Fritz.NAS under Linux Mint

The instructions below are for a fresh install. Based on:

https://dominicpratt.de/fritz-nas-unter-debianubuntu-einbinden/

  1. Open fstab in text editor as sudo:

    sudo xed /etc/fstab
    
  2. Add the following line to the bottom (the last line of the file must be empty):

    //192.168.178.1/FRITZ.NAS /media/fritzbox cifs credentials=/etc/samba/auth,vers=1.0,uid=1000,gid=1000 0 
    
  3. Create the mount directory:

    sudo mkdir -p /media/fritzbox 
    
  4. Create file /etc/samba/auth:

    sudo touch /etc/samba/auth
    
  5. Edit as sudo:

    sudo xed  /etc/samba/auth
    
  6. Add username and password entries (must be FritzNAS uname + pwd, not the FritzBox ones!):

    username=johan
    password=dfh3476fh8((77&&
    
  7. It might be necessary to install the cifs-utils and samba packages (it seems cifs-utils is already part of the default Linux Mint install):

    sudo apt install cifs-utils
    sudo apt install samba 
    
  8. Finally mount:

    sudo mount -a
    

Done!

21/12/2018

Linux Mint: new install results in GRUB prompt when booting

https://forums.linuxmint.com/viewtopic.php?t=217509

Deark

A utility for file format and metadata analysis, data extraction, and image format decoding

https://github.com/jsummers/deark

24/192 Music Downloads ... and why they make no sense

https://people.xiph.org/~xiphmont/demo/neil-young.html

30/11/2018

mh virtual tape & library system.

https://github.com/markh794/mhvtl

Install script for Ubuntu 16.04:

https://gist.github.com/hrchu/3eb1c0aa9994df0328037fff04cd889d

Then run using:

sudo /etc/init.d/mhvtl start

24/11/2018

Tkinter bitmaps in Ubuntu (Python)

https://stackoverflow.com/a/25223352/1209004

E.g.:

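import os
import tkinter as tk

def main():
    """Main function"""

    # get_main_dir() and tapeimgrGUI are defined elsewhere in the application
    appDir = get_main_dir()
    root = tk.Tk()
    # Set the window icon from a PNG file
    root.iconphoto(True, tk.PhotoImage(file=os.path.join(appDir, 'icon.png')))
    myGUI = tapeimgrGUI(root)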

24/10/2018

Bash: output to array, which is then parsed

# Get tape status, output to array (one element per line)
mapfile -t tapeStatus < <(mt -f "$TAPEnr" status)

# Parse file number and block number from status output
for item in "${tapeStatus[@]}"
do
    if [[ $item == *"file number"* ]]; then
        # Split at equal sign, 2nd item is value
        tmp=$(echo $item | cut -f2 -d=)
        # Strip whitespace
        fileNumber="$(echo -e "${tmp}" | tr -d '[:space:]')"
        #echo $fileNumber
    fi

    if [[ $item == *"block number"* ]]; then
        # Split at equal sign, 2nd item is value
        tmp=$(echo $item | cut -f2 -d=)
        # Strip whitespace
        blockNumber="$(echo -e "${tmp}" | tr -d '[:space:]')"
        #echo $blockNumber
    fi

done
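
For comparison, the same parsing is compact in Python. This sketch assumes the "key = value" line layout that the script above relies on; the sample text is made up for illustration and was not captured from a real drive, and `parse_mt_status` is a hypothetical helper name.

```python
def parse_mt_status(status_text):
    """Return the 'key = value' lines of mt status output as a dict."""
    status = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            status[key.strip()] = value.strip()
    return status


# Assumed sample output (illustrative only)
sample = """drive type = Generic SCSI-2 tape
file number = 2
block number = 0
"""
status = parse_mt_status(sample)
file_number = int(status["file number"])
block_number = int(status["block number"])
print(file_number, block_number)
```

In practice the text could come from `subprocess.run(["mt", "-f", device, "status"], capture_output=True, text=True).stdout`, where `device` is the tape device path.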

20/10/2018

Oxford Common File Layout

This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.

https://ocfl.io/

18/10/2018

Hex Editing for Archivists

http://www.av-rd.com/knowhow/

12/10/2018

Camelot: PDF Table Extraction for Humans (Python)

https://github.com/socialcopsdev/camelot/

01/10/2018

Update nodejs

https://askubuntu.com/questions/711834/unable-to-update-node-js-keeps-returning-to-old-version-ubuntu-15-04

try dat

https://try-dat.com/

25/09/2018

ReMarkable MarkDown editor

https://remarkableapp.github.io/index.html

Preservation Planning for Emerging Formats at the British Library

https://osf.io/65p7m/

14/09/2018

Docker files consume excessive amounts of disk space

See also moby/moby#21925.

E.g.:

sudo du -hx --max-depth=1 /var/lib

Result contains this entry:

25G	/var/lib/docker

There are probably more elegant/subtle ways to handle this, see e.g. https://lebkowski.name/docker-volumes/

Solution/workaround

Uninstall docker:

sudo apt-get remove docker docker-engine docker.io

Delete files:

sudo rm -rf /var/lib/docker

10/09/2018

BL Emerging Formats project

The Library’s ‘Emerging Formats’ project is focused on UK publications created for the mobile web, as interactive narratives or in database format.

https://britishlibrary.recruitment.northgatearinso.com/birl/pages/vacancy.jsf?latest=01001612

Caylin Smith and Ian Cooke report on the Emerging Formats project, which is investigating the collection management needs of published works that are created with digital formats that have significant software and hardware dependencies. They discuss the collection management challenges of these format types within the framework of UK NPLD.

http://journals.sagepub.com/doi/full/10.1177/0955749018785836

24/08/2018

Empty Trash on Linux machine from terminal

This works if Trash contains items that were put there as superuser:

sudo rm -rf ~/.local/share/Trash/*

16/08/2018

How To Install WordPress with LAMP on Ubuntu 16.04

https://www.digitalocean.com/community/tutorials/how-to-install-wordpress-with-lamp-on-ubuntu-16-04

Use this to import kbresearch blog; then export to static site using:

https://wordpress.org/plugins/static-html-output-plugin/

06/08/2018

Digital transformation at Wellcome Collection

https://stacks.wellcomecollection.org/digital-transformation-at-wellcome-collection-639fb177aad6

27/07/2018

Search by file extension on Github

filename:ext extension:ext where ext is the extension you're interested in. You need both the filename and extension keywords to filter it down to only potential files of interest.

https://twitter.com/NKrabben/status/1022575556209074220

Example:

https://github.com/search?q=filename%3Awq1+extension%3Awq1

26/07/2018

Smallest possible […] file

This repository aims to collect the smallest possible syntactically valid files in different programming/scripting/markup languages.

https://github.com/mathiasbynens/small

25/07/2018

VisiData

VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python, into a lightweight utility which can handle millions of rows with ease.

http://visidata.org/

09/07/2018

Disk wiping and data forensics: Separating myth from science

https://www.techrepublic.com/article/disk-wiping-and-data-forensics-separating-myth-from-science/

30/06/2018

Excel Unusual

the home of the most unique Microsoft Excel animated spreadsheets

http://www.excelunusual.com/

29/06/2018

Hackmd.io

https://hackmd.io/

28/06/2018

It's Not Easy Being Green(e): Digital Preservation in the Age of Climate Change

https://scholarsphere.psu.edu/concern/generic_works/bvq27zn11p

23/06/2018

PREMIS/METS for scalability

https://wiki.archivematica.org/PREMIS/METS_for_scalability

17/06/2018

Markdown and Visual Studio Code

https://code.visualstudio.com/Docs/languages/markdown

Build an Amazing Markdown Editor Using Visual Studio Code and Pandoc

http://thisdavej.com/build-an-amazing-markdown-editor-using-visual-studio-code-and-pandoc/

15/06/2018

How to Measure Static Electricity

https://www.wikihow.com/Measure-Static-Electricity

11/06/2018

gedit on Windows

Install in MINGW:

pacman -S mingw-w64-x86_64-gedit

Add external plugin:

https://stackoverflow.com/questions/39360149/adding-external-plug-ins-to-gedit-in-windows

Get plugins here:

https://wiki.gnome.org/Apps/Gedit/ThirdPartyPlugins-v3.0

28/05/2018

Swisscows search engine

https://swisscows.com/

22/05/2018

Installation of Ace

If ELIFECYCLE / puppeteer error happens, try this (source):

sudo npm install @daisy/ace -g --unsafe-perm=true --allow-root

BUT ace now fails on this (installing Chrome doesn't help).

20/05/2018

Proselint

Our goal is to aggregate knowledge about best practices in writing and to make that knowledge immediately accessible to all authors in the form of a linter for prose.

https://github.com/amperser/proselint/

18/05/2018

Memento Tracer

http://tracer.mementoweb.org/

17/05/2018

The Importance of EPUB and the Need for EPUB 4

https://w3c.github.io/publ-bg/docs/EPUB4_business_case.html

11/05/2018

What’s in a Name? On ‘Meaningfulness’ and Best Practices in Filenaming within the LAM Community

http://journal.code4lib.org/articles/13438

10/05/2018

Microsoft Office Supported File formats

Possibly more here.

07/05/2018

Ace Accessibility Checker for EPUB

https://daisy.github.io/ace/

Web service based on Ace:

http://bacc.dzb.de/

03/05/2018

List of open workflows and resources for A/V archiving

https://github.com/amiaopensource/open-workflows

26/04/2018

Integration of nonharvested web data into an existing web archive

http://netarkivet.dk/wp-content/uploads/IntegrationOfNonHarvestedData.pdf

Read Tape Contents (Linux)

https://www.linuxquestions.org/questions/linux-newbie-8/read-tape-contents-944371/

17/04/2018

Ten simple rules for structuring papers

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005619

12/04/2018

A successful Git branching model

http://nvie.com/posts/a-successful-git-branching-model/

Wikidata portal project

https://github.com/WikiDP/wikidp-portal

03/04/2018

Apache Web Server on Ubuntu 16.04

https://www.digitalocean.com/community/tutorials/how-to-install-the-apache-web-server-on-ubuntu-16-04

Restrict Apache Access to Localhost Only

In config file ports.conf, change this line:

Listen 80

into this:

Listen 127.0.0.1:80

See:

https://serverfault.com/questions/276963/make-apache-only-accessible-via-127-0-0-1-is-this-possible/276968#276968

Setting up multiple sites:

https://www.liberiangeek.net/2015/07/how-to-enable-and-run-multiple-websites-using-apache2-on-ubuntu-15-04/

28/03/2018

Script Ahoy

Community resource intended to provide helpful one-liners and script code specifically drawn from real-life examples in archives and libraries

https://dd388.github.io/crals/

Create static archived version of Wordpress blog

wget --recursive --no-clobber --span-hosts --page-requisites \
     --convert-links --no-parent -w 5 --random-wait \
     http://blog.kbresearch.nl >>wget.log 2>&1

This doesn't quite work the way it should:

  • If we leave out --span-hosts, external stylesheets etc. are ignored, even if --page-requisites is used (don't want that!)
  • If we include --span-hosts, externally referenced pages/sites are scraped as well (don't want that either!)

See also https://gist.github.com/dannguyen/03a10e850656577cfb57

Better approach:

  1. Scrape one single page:

    wget --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait http://blog.kbresearch.nl/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa/ >>$logFile 2>&1

This gives us the domains used for individual page resources, which we can subsequently feed into --domains. After some fiddling (we don't want to harvest +60 gravatar subdomains) this looks reasonable:

#!/bin/bash

url=http://blog.kbresearch.nl
domains=blog.kbresearch.nl,wp.com,researchkb.files.wordpress.com,googleapis.com,gstatic.com

logFile=wget.log
wget --mirror --page-requisites --span-hosts --convert-links --adjust-extension -w 5 --random-wait --domains=$domains $url >>$logFile 2>&1
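
The step of deriving the domain list from the single-page log can be sketched in Python. This assumes wget's default verbose log format, in which every request starts with a line like `--2018-03-28 10:15:01--  http://...`; the helper name `hosts_in_log` is made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed request line format: --YYYY-MM-DD HH:MM:SS--  URL
REQUEST_LINE = re.compile(r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)")


def hosts_in_log(log_lines):
    """Count requests per host name in a wget log."""
    hosts = Counter()
    for line in log_lines:
        match = REQUEST_LINE.match(line)
        if match:
            host = urlsplit(match.group(1)).hostname
            if host:
                hosts[host] += 1
    return hosts


# Usage sketch: list hosts by number of requests
# with open("wget.log", encoding="utf-8", errors="replace") as log:
#     for host, count in hosts_in_log(log).most_common():
#         print(count, host)
```

The resulting host names still need manual pruning (e.g. collapsing the many gravatar subdomains) before they go into --domains.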

24/03/2018

Difficulties of Timestamping Archived Web Pages

https://arxiv.org/abs/1712.03140

22/03/2018

swMATH

swMATH is a freely accessible, innovative information service for mathematical software. swMATH not only provides access to an extensive database of information on mathematical software, but also includes a systematic linking of software packages with relevant mathematical publications.

http://www.swmath.org/

17/03/2018

Windows previous versions documentation

https://docs.microsoft.com/en-us/previous-versions/windows/

12/03/2018

Wikidata for digital preservation portal

http://wikidp.org/

09/03/2018

Search files in UK web archive by magic pattern

See this thread on digipres.club for some context:

https://digipres.club/@joe/99650486509645352

Search URL:

https://www.webarchive.org.uk/shine/search?page=1&invert=&facet.fields=crawl_year&invert=&invert=&facet.fields=public_suffix&invert=&invert=&invert=&invert=&action=search&query=content_ffb:%220baddeed%22&totalCount=totalCount&sort=crawl_date&order=asc

Is Open Science ready for software containers?

One of our goals is to publish researcher's data, code, and executable Linux container all as files in a version controlled Dat repository. For this to be useful, a person should be able to execute these Linux environments (aka containers) anywhere

https://blog.datproject.org/2018/01/26/challenges-of-decentralized-hpc-containerization/

07/03/2018

Install OwnCloud desktop on Linux Mint 18.3

Instructions here, Ubuntu 16.04.

If updating results in warnings about package authentication, follow steps below:

owncloud/client#5287 (comment)

06/03/2018

Remove all XMP tags from a TIFF, except xmp-tiff ones

exiftool -xmp:all= "-all:all<xmp-tiff:all" MMKB19_000004012_00002_master.tiff

27/02/2018

Set non-standard maximum line length in pep8

Use --max-line-length option, e.g.:

pep8 --max-line-length=120 ~/omSipCreator/omSipCreator > pep8.txt

16/02/2018

Longevity of Optical Disc Media: Accelerated Ageing Predictions and Natural Ageing Data

https://www.degruyter.com/view/j/rest.2017.38.issue-3/res-2016-0032/res-2016-0032.xml?format=INT

COMPACT DISC SERVICE LIFE: AN INVESTIGATION OF THE ESTIMATED SERVICE LIFE OF PRERECORDED COMPACT DISCS (CD-ROM)

https://www.loc.gov/preservation/resources/rt/CDservicelife_rev.pdf

CD-ROM Longevity Research at LoC

https://www.loc.gov/preservation/scientists/projects/cd_longevity.html

CD-R and DVD-R RW Longevity Research at LoC

https://www.loc.gov/preservation/scientists/projects/cd-r_dvd-r_rw_longevity.html

15/02/2018

Python Macros in OpenOffice / LibreOffice

http://christopher5106.github.io/office/2015/12/06/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html

14/02/2018

Write Markdown with 8 Exceptional Open Source Editors

https://www.ossblog.org/markdown-editors/

06/02/2018

Discard unstaged changes to Git repo

git checkout -- .

(see also stackoverflow)

05/02/2018

PREMIS in METS Toolbox

validate METS file against best practices:

http://pim.fcla.edu/validate

Schematron rules:

http://pim.fcla.edu/resources

31/01/2018

Siegfried format counts

sf -csv t/images | cut -d ',' -f 6 | sort | uniq -c | sort -r

Result:

  8 x-fmt/390
  7 fmt/645
  5 fmt/41
  5 fmt/101
  4 fmt/43
  3 x-fmt/62
  3 x-fmt/263
  3 x-fmt/111
  3 fmt/44
  2 fmt/661
  2 fmt/5
  2 fmt/17
 28 UNKNOWN
  1 x-fmt/92
  ::
  etc

(Source: Nick Krabbenhöft)
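
A rough Python equivalent of the pipeline above, as a sketch: `format_counts` is a hypothetical helper, and column index 5 (zero-based) simply mirrors `cut -f 6`. Unlike `cut`, the csv module copes with commas inside quoted file names, and the header row is skipped here.

```python
import csv
from collections import Counter


def format_counts(csv_lines, column=5):
    """Count the values in one column of Siegfried CSV output."""
    reader = csv.reader(csv_lines)
    next(reader, None)  # skip the header row
    return Counter(row[column] for row in reader if len(row) > column)


# Usage sketch, assuming the output of 'sf -csv' was saved to sf.csv:
# with open("sf.csv", newline="", encoding="utf-8") as handle:
#     for puid, count in format_counts(handle).most_common():
#         print(count, puid)
```

`most_common()` orders by count numerically, so an entry such as `28 UNKNOWN` ends up at the top of the list.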

How to update a GitHub forked repository

https://stackoverflow.com/a/7244456

Create Windows context menu item

https://gist.github.com/bitsgalore/7c5da72277557b608c94

ExifTool sample files

https://sourceforge.net/p/exiftool/code/ci/master/tree/t/images/

Wine installation on Linux Mint 18.3

Not working, problem seems to correspond to issue here:

https://forums.linuxmint.com/viewtopic.php?f=47&t=260925

24/01/2018

Finding and installing packages in MSYS2

Create/update package database:

pacman -Fy

Result:

:: Synchronizing package databases...
 mingw32                    2.4 MiB  2.97M/s 00:01 [#####################] 100%
 mingw32.sig               96.0   B  0.00B/s 00:00 [#####################] 100%
 mingw64                    2.4 MiB  1695K/s 00:01 [#####################] 100%
 mingw64.sig               96.0   B  0.00B/s 00:00 [#####################] 100%
 msys                     855.8 KiB  4.24M/s 00:00 [#####################] 100%
 msys.sig                  96.0   B  0.00B/s 00:00 [#####################] 100%

Find package name from (sub) string:

pacman -Fsx iso-info

Result:

mingw32/mingw-w64-i686-libcdio 2.0.0-1
    mingw32/bin/iso-info.exe
    mingw32/share/man/man1/iso-info.1.gz
mingw64/mingw-w64-x86_64-libcdio 2.0.0-1
    mingw64/bin/iso-info.exe
    mingw64/share/man/man1/iso-info.1.gz

Install package:

pacman -S mingw-w64-x86_64-libcdio

Uninstall package:

pacman -R mingw-w64-x86_64-libcdio

Source: https://github.com/msys2/msys2/wiki/Using-packages

23/01/2018

SRU: select Mac-only CD-ROMs

Query:

extent any "cdrom* cd-rom*" and annotation any "Mac*" not annotation any "Win* PC*"

Result:

http://www.kbresearch.nl/tpxslt/?xml=http://jsru.kb.nl/sru/sru?query=extent%20any%20%22cdrom*%20cd-rom*%22%20and%20annotation%20any%20%22Mac*%22%20not%20annotation%20any%20%22Win*%20PC*%22&x-collection=GGC&maximumRecords=10&xsl=http://www.kbresearch.nl/xportal/brief.xsl

SRU: select Blu-Ray discs

Query:

extent any "blu*"

Result (only 5 hits, 23/1/2018):

http://www.kbresearch.nl/tpxslt/?xml=http://jsru.kb.nl/sru/sru?query=extent%20any%20%22blu*%22&x-collection=GGC&maximumRecords=10&xsl=http://www.kbresearch.nl/xportal/brief.xsl
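
Both result URLs are just the percent-encoded query wrapped in fixed parameters. A Python sketch that rebuilds them is given below; the function name `sru_url` is an assumption, and the base URLs and parameter values are copied from the links above.

```python
from urllib.parse import quote

# All three URLs are copied from the result links above
XSLT_PROXY = "http://www.kbresearch.nl/tpxslt/"
SRU_BASE = "http://jsru.kb.nl/sru/sru"
STYLESHEET = "http://www.kbresearch.nl/xportal/brief.xsl"


def sru_url(query, collection="GGC", maximum_records=10):
    """Wrap an SRU/CQL query string in the result URL used above."""
    # Keep '*' unescaped so the wildcards look the same as in the URLs above
    encoded = quote(query, safe="*")
    return (f"{XSLT_PROXY}?xml={SRU_BASE}?query={encoded}"
            f"&x-collection={collection}&maximumRecords={maximum_records}"
            f"&xsl={STYLESHEET}")


print(sru_url('extent any "blu*"'))
```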

18/12/2017

List contents of ISO image with 7-zip

Command:

7z l -slt iso9660.iso

Result:

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Listing archive: iso9660.iso

--
Path = iso9660.iso
Type = Iso
Created = 2017-06-30 18:31:33
Modified = 2017-06-30 18:31:33

----------
Path = nimbie.jpg
Folder = -
Size = 69424
Packed Size = 69424
Modified = 2017-06-30 13:23:38

Path = readme.txt
Folder = -
Size = 37
Packed Size = 37
Modified = 2017-06-30 13:25:20

UDF Bridge:

7z l -slt iso9660_udf.iso

Result:

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

Listing archive: iso9660_udf.iso

--
Path = iso9660_udf.iso
Type = Udf
Comment = UDF Bridge demo
Cluster Size = 2048
Created = 2017-06-30 18:31:33

----------
Path = nimbie.jpg
Folder = -
Size = 69424
Packed Size = 69632
Modified = 2017-06-30 13:23:38
Accessed = 2017-06-30 18:31:33

Path = readme.txt
Folder = -
Size = 37
Packed Size = 2048
Modified = 2017-06-30 13:25:20
Accessed = 2017-06-30 18:31:33
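
The -slt listings above are simple "Key = Value" blocks, so they can be turned into dictionaries for further processing, for example to compare Size and Packed Size between the two images. A minimal sketch, assuming the layout shown above (`parse_7z_slt` is a hypothetical helper):

```python
def parse_7z_slt(output):
    """Return one dict per entry listed after the '----------' separator."""
    entries, entry, in_listing = [], {}, False
    for line in output.splitlines():
        if line.startswith("----------"):
            in_listing = True
            continue
        if not in_listing:
            continue  # still in the banner / archive properties
        if not line.strip():
            # A blank line closes the current entry
            if entry:
                entries.append(entry)
                entry = {}
            continue
        key, sep, value = line.partition("=")
        if sep:
            entry[key.strip()] = value.strip()
    if entry:
        entries.append(entry)
    return entries


sample = """Listing archive: iso9660_udf.iso

--
Path = iso9660_udf.iso
Type = Udf

----------
Path = nimbie.jpg
Size = 69424
Packed Size = 69632

Path = readme.txt
Size = 37
Packed Size = 2048
"""
for item in parse_7z_slt(sample):
    print(item["Path"], item["Size"], item["Packed Size"])
```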

13/12/2017

Apache Tika vs DROID

https://twitter.com/anjacks0n/status/941020183812100096

Esp.:

Without Tika, relying on DROID, there would have been 25,887,108 unidentified resources - mostly plain text, JS, CSS etc. Without DROID, only 464 would go unidentified, but we'd have no format-version-level information. Combining tools is crucial for web archives.

Find which file(s) are located in damaged area of ISO image

Using iso-info:

iso-info -l -i dvd-erik.iso

Result:

  d [LSN     22]      4096 Jan 01 1970 01:00:00  .
  d [LSN     22]      2048 Jan 01 1970 01:00:00  ..
  - [LSN     26] 158549392 Jul 30 2008 09:33:59  086_10B21_078v_079r.TIF
  - [LSN  77443] 158633884 Jul 30 2008 09:34:08  087_10B21_079v_080r.TIF
  - [LSN 154901] 157658880 Jul 30 2008 09:34:19  088_10B21_080v_081r.TIF
  - [LSN 231883] 157877788 Jul 30 2008 09:34:29  089_10B21_081v_082r.TIF
    ::
    ::
  - [LSN 2092850] 158203324 Jul 30 2008 09:38:31  113_10B21_105v_106r.TIF
  - [LSN 2170098] 156139844 Jul 30 2008 09:38:41  114_10B21_106v_107r.TIF

Here LSN * 2048 = offset of start of file.
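
With that relation, mapping a damaged byte offset back to a file is a matter of range checks. A small Python sketch using two of the entries listed above; `files_at_offset` is a made-up helper name and the damaged offset is hypothetical:

```python
SECTOR_SIZE = 2048  # bytes per logical sector, as in the note above


def files_at_offset(entries, bad_offset):
    """Return the names of files whose byte range contains bad_offset.

    entries: iterable of (lsn, size_in_bytes, name) taken from iso-info -l.
    """
    hits = []
    for lsn, size, name in entries:
        start = lsn * SECTOR_SIZE
        if start <= bad_offset < start + size:
            hits.append(name)
    return hits


entries = [
    (26, 158549392, "086_10B21_078v_079r.TIF"),
    (77443, 158633884, "087_10B21_079v_080r.TIF"),
]
# Hypothetical damaged position: first byte of sector 80000
print(files_at_offset(entries, 80000 * SECTOR_SIZE))
```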

11/12/2017

DDrescue --try-again switch

From the manual:

--try-again Mark all non-trimmed and non-scraped blocks inside the rescue domain as non-tried before beginning the rescue. Try this if the drive stops responding and ddrescue immediately starts scraping failed blocks when restarted. If '--retrim' is also specified, mark all failed blocks inside the rescue domain as non-tried.

Useful if ddrescue remains stuck endlessly in "scraping failed blocks".

06/12/2017

Run .msi installer as admin

msiexec /a putty-64bit-0.70-installer.msi

23/11/2017

Useful VeraPDF command-lines

Disable PDF/A validation, only extract features:

verapdf --off --extract whatever.pdf > whatever.xml

Recursively process directory tree:

verapdf --recurse --off --extract myDir > whatever.xml

21/11/2017

Zenodo categories KBNL community

17/11/2017

Archivematica 1.6 Default Format Policy Registry

https://docs.google.com/spreadsheets/d/1g2vbAFBHWhsPRkNljbQBsKasMI-GCFTsQLol0cFT6js/edit#gid=0

Understanding Computer Technology

https://web.archive.org/web/20020201195007/http://www.geocities.com:80/SiliconValley/4031/

16/11/2017

Obtaining a list of all hyperlinks in an MS-Word document

https://superuser.com/questions/670324/obtaining-a-list-of-all-hyperlinks

05/11/2017

Environmental impact of academic conferences

https://www.researchgate.net/publication/318970823_Academic_conferences_urgently_need_environmental_policies

(Note: lots of DOIs in references don't resolve at all, or resolve to wrong location!)

http://onlinelibrary.wiley.com/doi/10.1111/1746-692X.12106/full

http://www.nature.com/news/a-clean-green-science-machine-1.17125?WT.mc_id=TWT_NatureNews

http://tyndall.ac.uk/sites/default/files/twp161.pdf

https://www.chemistryworld.com/opinion/cutting-the-science-travel-footprint/9567.article

01/11/2017

Use of objectCharacteristicsExtension element in PREMIS

Archivematica examples in:

https://www.loc.gov/standards/premis/examples.html

26/10/2017

Customise Pylint error reporting for a project

https://stackoverflow.com/questions/43280486/pylint-error-message-e1101-module-lxml-etree-has-no-strip-tags-member

25/10/2017

File identification: Tika vs DROID

Paper by Andy Jackson (2012):

http://arxiv.org/pdf/1210.1714.pdf

20/10/2017

Extract URLs from PDF

https://twitter.com/andrewjbtw/status/920791293122396160

11/10/2017

Convert compressed TIFF to uncompressed TIFF

03/10/2017

For one file:

convert whatever_compressed.tif +compress whatever_uncompressed.tif

Multiple files:

#!/bin/bash

# Input and output directories
dirIn=~/tiffsDDD
dirOut=~/tiffsDDUncompressed

# Read NUL-separated file names, so names with spaces are handled correctly
while IFS= read -d $'\0' -r file ; do
    # File basename (without extension)
    bName=$(basename -s .TIF "$file")

    # Output name
    outName=$bName.TIF

    # Full output path
    fOut="$dirOut/$outName"

    # Convert to uncompressed TIFF
    convert "$file" +compress "$fOut"

done < <(find "$dirIn" -type f -name "*.TIF" -print0)

Linux Mint 18.2 issues

28/09/2017

warcio

This library provides a fast, standalone way to read and write WARC Format commonly used in web archives.

https://github.com/webrecorder/warcio

25/09/2017

JWAT TOOLS

Includes ARC/WARC validation:

https://sbforge.org/display/JWAT/Running+JWAT-Tools

23/09/2017

Format Technology Lifecycle Analysis

https://tspace.library.utoronto.ca/bitstream/1807/75891/1/JASIST-format-technology-lifecycle-analysis.pdf

12/09/2017

Mimetypes of MS Office formats

https://technet.microsoft.com/en-us/library/ee309278(office.12).aspx

08/09/2017

Tika mimetype definitions

https://github.com/apache/tika/tree/master/tika-core/src/main/resources/org/apache/tika/mime

06/09/2017

Kaitai Struct

Kaitai Struct is a declarative language used to describe various binary data structures, laid out in files or in memory (...).

The main idea is that a particular format is described in Kaitai Struct language (.ksy file) and then can be compiled with ksc into source files in one of the supported programming languages. These modules will include a generated code for a parser that can read described data structure from a file / stream and give access to it in a nice, easy-to-comprehend API.

http://kaitai.io/

29/08/2017

Suppress 'invalid-name' messages in Pylint output

Use -d option with invalid-name:

python3 -m pylint -d invalid-name boxvalidator.py > pylintjpylyzer.txt

24/08/2017

Zenodo: list all publications with "Digital Preservation" keyword in kbnl community

https://zenodo.org/communities/kbnl/search?page=1&size=20&q=keywords%253A%2522digital%2Bpreservation%2522

16/08/2017

JPEG 2000 drafts and freely available standards

https://github.com/Dzonatas/solution/tree/master/Documentation

15/08/2017

Remember Git login username/password

The following command keeps login credentials in the cache for 1 hour:

git config --global credential.helper "cache --timeout=3600"

14/08/2017

Add path to LD_LIBRARY_PATH

For some reason I always forget this (below for OpenJPEG):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Very large JP2 images

10/08/2017

How GIT commit to an existing tag

https://gist.github.com/danielestevez/2044589

15/06/2017

How to use HTML and CSS for printing

http://css4.pub/

Prince tool:

http://www.princexml.com/

WeasyPrint (open-source alternative):

http://weasyprint.org/

29/05/2017

E-READ

The goal of this Action is to improve scientific understanding of the implications of digitization, hence helping individuals, disciplines, societies and sectors across Europe to cope optimally with the effects.

http://ereadcost.eu/

12/05/2017

Huge List Of Example Files – Creative Commons

http://blog.online-convert.com/huge-list-of-example-files-creative-commons/

10/05/2017

Copy directory tree with Robocopy

robocopy sourceDir destDir /COPYALL /E /R:0 /DCOPY:T

E.g.:

robocopy H:\iromlabTestKBDepotNew M:\DigitalPreservation\optischeDragers\iromlabTestKBDepot /COPYALL /E /R:0 /DCOPY:T >robocopy.stdout 2>robocopy.stderr

19/04/2017

Reading ISO image of the data session of multisession (e.g. enhanced audio) CDs

Some useful links:

Good description of the problem:

https://lists.debian.org/debian-user/2005/01/msg02339.html

the sector numbers in the file system refer to sectors of the original CD rather than sectors of session2.iso. I don't know of a utility for rewriting them so that the file can be loop-mounted or written to an ordinary CD, but you can at least get a directory listing by using isoinfo with an offset:

isoinfo -i session2.iso -N 204345 -l

https://lists.gnu.org/archive/html/libcdio-devel/2010-02/msg00048.html

Esp.:

Remember, the path table and directory structure of the iso reflect the fact that the ISO filesystem starts on sector 222145 (49:23:70) of the CD. If it is burned to another CD at a different position, it won't work. Likewise, any program that reads the iso will need to be able to compensate for the offset. Try, for example: isoinfo -N 222145 -d -i '8mm-songs_to_love_and_die_by.iso'
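
The sector number and the MSF address in the quote (49:23:70) are two notations for the same position: 75 frames per second, minus the 150-frame (2 second) offset at the start of the disc. A short Python check of that arithmetic (the function name is chosen for illustration):

```python
FRAMES_PER_SECOND = 75  # CD frames (sectors) per second
PREGAP_FRAMES = 150     # MSF address 00:02:00 corresponds to sector 0


def msf_to_sector(minutes, seconds, frames):
    """Convert a minutes:seconds:frames address to a logical sector number."""
    return (minutes * 60 + seconds) * FRAMES_PER_SECOND + frames - PREGAP_FRAMES


print(msf_to_sector(49, 23, 70))  # 222145, the value passed to isoinfo -N
```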

Also (from same thread):

https://lists.gnu.org/archive/html/libcdio-devel/2010-02/msg00053.html

06/04/2017

Ensure correct encoding when writing a text file in Python

Default encoding for read/write depends on locale settings, which can result in unexpected behaviour. See e.g.:

http://stackoverflow.com/questions/43256079/decoding-of-bytes-object-results-in-unexpected-invalid-utf-8-how-can-i-avoid

Solution: always set the encoding explicitly when opening a file for read/write in text mode. Example:

# Byte sequence corresponds to multiplication sign in UTF-8
myBytes = b'\xc3\x97'
# Decode to string 
myString = myBytes.decode('utf-8')

# Write myString to file
with open("myString.txt", "w", encoding="utf-8") as ms_file:
    ms_file.write(myString)

03/04/2017

Create symbolic link on Windows

In this case, create a link to F:\Pandoc\pandoc.exe in directory C:\bin (run the command from within C:\bin):

mklink pandoc.exe F:\Pandoc\pandoc.exe

30/03/2017

How to Create a List of Your Installed Programs on Windows

https://www.howtogeek.com/165293/how-to-get-a-list-of-software-installed-on-your-pc-with-a-single-command/

Powershell method:

Get-ItemProperty HKLM:\Software\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall\* | Select-Object DisplayName, DisplayVersion, Publisher, InstallDate | Format-Table -AutoSize > installedPrograms.txt

28/03/2017

Guidelines for Using PREMIS with METS for exchange

https://www.loc.gov/standards/premis/guidelines2017-premismets.pdf

24/03/2017

Extract text from Epub

Apache Tika

java -jar tika-app-1.14.jar -t whatever.epub > whatever.txt

BUT doesn't return chapters in reading order!!
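
A possible workaround is to read the reading order from the EPUB itself: the package document's spine lists the content documents in order. The sketch below only returns the ordered document paths (text extraction is left out). It assumes a well-formed, unencrypted EPUB whose manifest hrefs are not percent-encoded, and `spine_documents` is a hypothetical helper.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub):
    """Return the content document paths of an EPUB in spine (reading) order."""
    with zipfile.ZipFile(epub) as zf:
        # container.xml points to the package (OPF) document
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to hrefs, relative to the OPF location
        manifest = {item.get("id"): item.get("href")
                    for item in package.findall("opf:manifest/opf:item", NS)}
        base = posixpath.dirname(opf_path)
        # The spine lists item ids in reading order
        return [posixpath.join(base, manifest[ref.get("idref")])
                for ref in package.findall("opf:spine/opf:itemref", NS)]


# Usage sketch:
# for path in spine_documents("whatever.epub"):
#     print(path)
```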

Textract (Python)

https://github.com/deanmalmgren/textract

Installs with errors under Windows; seems to work OK on Linux.

23/03/2017

Build process for Windows binaries of file/libmagic under Linux

https://github.com/nscaife/file-windows

28/02/2017

Change bit depth of WAV file

Saves output file as 24 bits / channel:

ffmpeg -i frogs-01.wav -codec pcm_s24le frogs-01-24-bit.wav

For list of all codec values:

ffmpeg -codecs

07/02/2017

Python relative imports for the billionth time

http://stackoverflow.com/questions/14132789/relative-imports-for-the-billionth-time

27/01/2017

FFmpeg - Extract Blu-Ray Audio

https://wiki.gentoo.org/wiki/FFmpeg_-_Extract_Blu-Ray_Audio

19/01/2017

Accessing raw devices under Windows, command line

From:

https://support.microsoft.com/nl-nl/help/100027/info-direct-drive-access-under-win32

To open a physical hard drive for direct disk access (raw I/O) in a Win32-based application, use a device name of the form

\\.\PhysicalDriveN

where N is 0, 1, 2, and so forth, representing each of the physical drives in the system.

To open a logical drive, direct access is of the form

\\.\X:

where X: is a hard-drive partition letter, floppy disk drive, or CD-ROM drive.

E.g. compute checksum on CD in d: drive:

 md5sum \\.\D:

Accessing raw devices in Python (under Windows)

Access to logical drives:

http://stackoverflow.com/q/6522644/1209004

Write access:

http://stackoverflow.com/q/7135398/1209004

Reading raw disks with Python:

http://blog.lifeeth.in/2011/03/reading-raw-disks-with-python.html
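
Combining the device names described above with Python's ordinary file API, the md5sum example could be sketched as follows. This has not been tested on real hardware: raw access may require elevated privileges, and the chunk size is kept a multiple of the 2048-byte sector size as a precaution. `md5_of_device` is a name chosen for illustration.

```python
import hashlib


def md5_of_device(path, chunk_size=2048 * 512):
    """Compute the MD5 checksum of a file or raw device, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as device:
        # Read until an empty bytes object signals the end of the data
        for chunk in iter(lambda: device.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage sketch on Windows (assumption: CD in the D: drive):
# print(md5_of_device(r"\\.\D:"))
```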

Isoparser

https://github.com/barneygale/isoparser

15/01/2017

How to host your static site with HTTPS on GitHub Pages and CloudFlare

https://developer.ubuntu.com/en/blog/2016/02/17/how-host-your-static-site-https-github-pages-and-cloudflare/

BUT this will make accessing the site CAPTCHA hell for Tor users: https://support.cloudflare.com/hc/en-us/articles/203306930-Does-CloudFlare-block-Tor-

Alternatives:

  • Certbot / Let's Encrypt: requires server access
  • GitHub Pages has built-in HTTPS support, but only for github.io domains.

11/01/2017

How to Host your Python Package on PyPI with GitHub

https://www.codementor.io/arpitbhayani/host-your-python-package-using-github-on-pypi-du107t7ku

Once everything is set up, for each new release the basic steps are:

  1. Update version number in main code
  2. Update link to download_url (in my case this is automated)
  3. Commit changes & push
  4. Add tag: git tag -a x.y.z -m "whatever"
  5. git push --tags
  6. python setup.py register -r pypi
  7. python setup.py sdist upload -r pypi

09/01/2017

CD/DVD Carrier checksums vs ISO image checksums

The md5sum of a "burnt" CD can be different than the md5sum of the associated iso file and not indicate an error

http://twiki.org/cgi-bin/view/Wikilearn/CdromMd5sumsAfterBurning

See also:

http://superuser.com/questions/220082/how-to-validate-a-dvd-against-an-iso
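One way to compare a disc with its ISO, assuming the mismatch is caused by extra data read from the disc after the end of the image: hash only as many bytes from the disc as the ISO contains. A sketch (function names are my own, not taken from the linked pages):

```python
import hashlib
import os


def md5_of_prefix(path, length, chunk_size=1024 * 1024):
    """MD5 of the first `length` bytes of a file or device."""
    digest = hashlib.md5()
    remaining = length
    with open(path, "rb") as stream:
        while remaining > 0:
            chunk = stream.read(min(chunk_size, remaining))
            if not chunk:
                break  # source is shorter than requested
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def disc_matches_iso(device_path, iso_path):
    """Compare the ISO with the same number of bytes read from the disc."""
    iso_size = os.path.getsize(iso_path)
    return md5_of_prefix(device_path, iso_size) == md5_of_prefix(iso_path, iso_size)
```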

06/01/2017

Books and Literature Status Review 2016

https://warekennis.nl/wp-content/uploads/2013/03/BOOKS-AND-LITERATURE-STATUS-REVIEW-2017-.pdf

02/01/2017

Use ffmpeg / ffprobe to get tech properties from audio file

ffprobe track01.cdda.wav -show_format -show_streams > properties.txt

Result (file properties.txt):

[STREAM]
index=0
codec_name=pcm_s16le
codec_long_name=PCM signed 16-bit little-endian
profile=unknown
codec_type=audio
codec_time_base=1/44100
codec_tag_string=[1][0][0][0]
codec_tag=0x0001
sample_fmt=s16
sample_rate=44100
channels=2
channel_layout=unknown
bits_per_sample=16
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/44100
start_pts=N/A
start_time=N/A
duration_ts=8233176
duration=186.693333
bit_rate=1411200
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
[/STREAM]
[FORMAT]
filename=track01.cdda.wav
nb_streams=1
nb_programs=0
format_name=wav
format_long_name=WAV / WAVE (Waveform Audio)
start_time=N/A
duration=186.693333
size=32932748
bit_rate=1411201
probe_score=99
[/FORMAT]

XML output:

ffprobe track01.cdda.wav -show_format -show_streams -print_format xml > properties.xml
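The default key=value output in properties.txt is also easy to read from Python. A small sketch of a parser (my own helper, not part of ffmpeg) that returns each [STREAM] / [FORMAT] section as a dictionary:

```python
def parse_ffprobe_sections(text):
    """Parse ffprobe's default output into a list of (section, dict) pairs."""
    sections = []
    name, fields = None, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[/"):
            # Closing tag such as [/STREAM]: store the finished section
            sections.append((name, fields))
            name, fields = None, None
        elif line.startswith("[") and line.endswith("]"):
            # Opening tag such as [STREAM]
            name, fields = line[1:-1], {}
        elif fields is not None and "=" in line:
            # Split on the first '=' only; values may contain '=' or brackets
            key, value = line.split("=", 1)
            fields[key] = value
    return sections
```

All values are kept as strings, so sample_rate would come back as "44100", not as an integer.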

01/01/2017

Update the Fritz!Box Mediaserver file index from a script

https://blog.heckel.xyz/2012/12/07/script-refresh-the-fritzmediaserver-dlna-index-of-the-fritzbox-6360-cable/

Script:

https://blog.heckel.xyz/wp-content/uploads/2012/12/fritzbox-dlna-refresh

19/12/2016

AMIA open workflows and resources for A/V archiving

https://github.com/amiaopensource/open-workflows

16/12/2016

NYPL Specifications for Audio and Moving Image Digitization

https://confluence.nypl.org/display/DIG/Specifications+for+Audio+and+Moving+Image+Digitization

07/12/2016

Mediags

Mediags is a console program that scans directories for media files and verifies the integrity of those files. Detailed content reports may optionally be produced.

https://mediags.codeplex.com/

(Windows binaries only)

01/12/2016

Browsers, not apps, are the future of mobile:

https://medium.com/swlh/browsers-not-apps-are-the-future-of-mobile-c552752ff75#.ilc1zlj1a

27/11/2016

Appear.in

Video conversations with up to 8 people for free. No login required — no installs

https://appear.in/

31/10/2016

A guide to Wikidata, SPARQL, and WDQS

https://www.wikidata.org/wiki/User:TweetsFactsAndQueries/A_Guide_To_WDQS

28/10/2016

PDFx

Extract references and metadata from PDF documents, and download all referenced PDFs:

https://www.metachris.com/pdfx/

24/10/2016

Explanation of need for Multi Threading GUI programming

http://stackoverflow.com/questions/13343096/explanation-of-need-for-multi-threading-gui-programming

22/10/2016

Digital Open Access Identifier

http://doai.io/

19/10/2016

Wikidata:WikiProject Informatics/File formats

https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/File_formats

14/10/2016

Python debugging tips

http://stackoverflow.com/questions/1623039/python-debugging-tips

30/09/2016

A Slow-Motion Revolution (history of the CD-ROM)

http://www.filfre.net/2016/09/a-slow-motion-revolution/

29/09/2016

An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter

http://journal.code4lib.org/articles/11358

25/09/2016

Check the Accessibility of a PDF Document (online)

http://checkers.eiii.eu/en/pdfcheck/

23/09/2016

Python event scheduler and queue modules

https://docs.python.org/3.6/library/queue.html

https://docs.python.org/3.6/library/sched.html

And perhaps:

https://docs.python.org/3.6/library/threading.html#module-threading

Possibly usable in CD imaging workflow (esp. interaction with operator input).
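Rough idea of how queue and threading could fit together here: operator input goes into a queue, and a worker thread picks up the jobs. This is only a sketch; the "imaging" step is a hypothetical placeholder and all names are my own.

```python
import queue
import threading


def imaging_worker(jobs, results):
    """Take disc identifiers from `jobs` until a None sentinel arrives."""
    while True:
        disc_id = jobs.get()
        if disc_id is None:
            break
        # Placeholder for the real imaging step (hypothetical)
        results.put((disc_id, "imaged"))


def run_batch(disc_ids):
    """Queue all discs, process them in a worker thread, return the results."""
    jobs, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=imaging_worker, args=(jobs, results))
    worker.start()
    for disc_id in disc_ids:  # in practice these would come from operator input
        jobs.put(disc_id)
    jobs.put(None)  # sentinel: no more work
    worker.join()
    return [results.get() for _ in range(results.qsize())]
```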

13/09/2016

media-autobuild_suite

This Windows Batchscript setups a MinGW/GCC compiler environment for building ffmpeg and other media tools under Windows. After building the environment it retrieves and compiles all tools. All tools get static compiled, no external .dlls needed (with some optional exceptions)

https://github.com/jb-alvarado/media-autobuild_suite

By default this doesn't build the optional ffmpeg libraries (incl. libcdio). To build them, select option 4 (All available external libs) when the batch file prompts you with "Choose ffmpeg and mpv optional libraries?". Alternatively (if you accidentally ran the build with the default option), open the file media-autobuild_suite.ini and set the value of ffmpegChoice to 4:

ffmpegChoice=4

Libcdio windows binaries

http://lrn.no-ip.info/packages/i686-w64-mingw/libcdio/0.93-1/

12/09/2016

Cdrdao Windows binaries

http://www.student.tugraz.at/thomas.plank/

08/09/2016

Discid tool

http://discid.sourceforge.net/

Tried flactag fork, which gives following output for CD-ROM:

Query failed: no actual audio tracks on disc: CDROM or DVD?

So this might be useful for distinguishing between audio CDs and CD-ROMs (tarball contains Windows binary).

disktype tool

http://disktype.sourceforge.net/

Output audio CD:

Block device, size 690.4 MiB (723972096 bytes)
CD-ROM, 14 tracks, CDDB disk ID D912690E
Track 1: Audio track, 37.35 MiB (39163152 bytes),   3 min 42 sec
Track 2: Audio track, 87.89 MiB (92163120 bytes),   8 min 42 sec 
::
Track 13: Audio track, 37.22 MiB (39029088 bytes),   3 min 41 sec
Track 14: Audio track, 78.14 MiB (81931920 bytes),   7 min 44 sec

CD-ROM:

Block device, size 223.2 MiB (233990144 bytes)
CD-ROM, 1 track, CDDB disk ID 0205F301
Track 1: Data track, 223.2 MiB (233994240 bytes)
  ISO9660 file system
    Volume name "0305132335"
    Preparer    "CEQUADRAT 32BIT ISO-9660 FORMATTER COPYRIGHT (C) 1995-1998 BY CEQUDRAT GMBH"
    Data size 222.9 MiB (233682944 bytes, 114103 blocks of 2 KiB)
    Joliet extension, volume name "0305132335"

Enhanced audio CD:

Block device, size 223.2 MiB (233990144 bytes)
CD-ROM, 22 tracks, CDDB disk ID 4B113416
Track 1: Audio track, 9.627 MiB (10094784 bytes),   0 min 57 sec
Track 2: Audio track, 30.01 MiB (31462704 bytes),   2 min 58 sec
::
Track 20: Audio track, 41.33 MiB (43340304 bytes),   4 min 05 sec
Track 21: Audio track, 47.73 MiB (50048208 bytes),   4 min 43 sec
Track 22: Data track, 90.84 MiB (95252480 bytes)

DVD:

Block device, size 223.2 MiB (233990144 bytes)
CD-ROM, 1 track, CDDB disk ID 023BFD01
Track 1: Data track, 2.197 GiB (2358986752 bytes)
  Apple partition map, 2 entries
  Partition 1: 31.50 KiB (32256 bytes, 63 sectors from 1)
    Type "Apple_partition_map"
  Partition 2: 2.737 GiB (2938324992 bytes, 5738916 sectors from 1108)
    Type "Apple_HFS"
    HFS Plus file system
      Volume size 2.737 GiB (2938324992 bytes, 1434729 blocks of 2 KiB)
      Volume name "BelPop Marc Moulin"
  UDF file system
    Sector size 2048 bytes
    Volume name "BelPop Marc Moulin"
    UDF version 1.50
  ISO9660 file system
    Volume name "BELPOPMARCMOULIN"
    Data size 2.737 GiB (2938894336 bytes, 1435007 blocks of 2 KiB)
    Joliet extension, volume name "BelPop Marc Moul"

(Note: the DVD is identified as a CD-ROM; this doesn't really matter, as extraction from a DVD is identical to extraction from a data CD-ROM.)
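The track lines in the disktype output could be used to tell the carrier types apart automatically. A sketch, assuming the output always uses the "Track N: Audio track" / "Track N: Data track" wording shown above (the category labels are my own):

```python
import re


def classify_carrier(disktype_output):
    """Classify a carrier from disktype output by counting track types."""
    audio = len(re.findall(r"^Track \d+: Audio track", disktype_output, re.M))
    data = len(re.findall(r"^Track \d+: Data track", disktype_output, re.M))
    if audio and data:
        return "mixed (audio + data)"
    if audio:
        return "audio"
    if data:
        return "data"
    return "unknown"
```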

Compiles without problems under Windows (using Cygwin), but doesn't seem to be able to access cd-devices. E.g.:

disktype /dev/sr0

Result:

--- /dev/sr0
Block device, size 332.6 MiB (348790784 bytes)
disktype: Data read failed at position 0: Invalid request code

Or:

disktype D:\

Result:

--- D:\
disktype: D:\: Is a directory

Or:

disktype D

Result:

--- D
disktype: Can't stat D: No such file or directory

Perhaps try cdrdao scanbus?

07/09/2016

WMI queries from the command line (Windows)

http://www.robvanderwoude.com/wmic.php

Example - get information about optical drives:

wmic cdrom  where mediatype!='unknown' get > test.txt

06/09/2016

Libcdio & pycdio

The GNU Compact Disc Input and Control library (libcdio) contains a library for CD-ROM and CD image access. Applications wishing to be oblivious of the OS- and device-dependent properties of a CD-ROM or of the specific details of various CD-image formats may benefit from using this library.

http://www.gnu.org/software/libcdio/

Python interface:

https://pypi.python.org/pypi/pycdio/

01/09/2016

Python data entry form example

http://codereview.stackexchange.com/questions/52397/a-general-purpose-gui-data-input-with-validation-but-unclear-about-best-object

31/08/2016

Imaging and image format for mixed mode CDs

Brown, "Developing Virtual CD-ROM Collections" (2012):

http://www.ijdc.net/index.php/ijdc/article/view/216/285

Page 13:

  • Create BIN/TOC file with cdrdao using:

    cdrdao read-cd --read-raw --device 1,0,0 --datafile allmy.bin allmy.toc

  • Author developed SheepShaver extension that allows these images to be read by emulator

Caveats:

  • The given cdrdao command only extracts one session (I guess the Voyager CD-ROMs only contain one session with both the data and audio tracks, although the paper isn't entirely clear about this).
  • In case of a CD with multiple sessions one would have to repeat the command for each of those (result: one separate image for each session)
  • Hybrid CD-ROMs are not supported by any of the most widely-used emulators (also stressed by author)

Jackson (BL):

http://anjackson.net/keeping-codes/practice/developing-a-robust-migration-workflow-for-preserving-and-curating-handheld-media.html

On multisession carriers:

While CD-ROM, DVD and HFS+ format disks are reasonably well covered by this approach, there are some important limitations. For example, the optical media formats all support the notion of ‘sessions’ – consecutive additions of tracks to a disk. This means that a given carrier may contain a ‘history’ of different versions of the data. By choosing to extract a single disk image, we only expose the final version of the data track, and any earlier versions, sessions or tracks are ignored. For our purposes, these sessions are not significant, but this may not be true elsewhere.

BUT sessions (at least on commercially manufactured carriers) typically don't contain different versions of the same data, but data that are completely different! Example: many 'enhanced' audio CDs that contain one session with all audio tracks, and another session with a data track. So sessions are significant!

BL workflow for Red Book (audio) and Yellow Book (mixed mode) carriers:

  • Image to MDS/MDF format
  • Then post-process MDS/MDF file with IsoBuster

But it's not entirely clear whether MDS/MDF can handle multisession carriers.

I found this in the Knowledge Base of the developer of the format:

http://support.alcohol-soft.com/en/knowledgebase.php?postid=15034&title=Restrictions+for+creating+image+files

Image making wizard will always allow the user to create mds/mdf ccd/img/sub.

But ISO format, only for those disc's that contain 1 data track(mode1 or mode2form1).

For cue/bin only for one session disc. if the original disc is a multi-session one, then the cue/bin would not be available and If the user chooses read sub-channel, the cue/bin and iso would be unavailable as well . because iso and cue/bin could not save sub channel data.

So apparently MDS/MDF does support multisession after all!

Good overview of disc image formats here:

http://www.theisozone.com/blogs/homebrew/burning-image-file-type-explained/

23/08/2016

Sheepshaver (Macintosh emulator)

Includes links to ROM and startup images:

http://www.redundantrobot.com/#/sheepshaver

Preserving and Emulating Digital Art Objects

Report by Cornell University:

https://ecommons.cornell.edu/handle/1813/41368

CD-ROM FAQ

Some useful info on Mac / PC images and hybrids:

http://www.macdisk.com/faqcden.php

22/08/2016

CDRWIN manual

Contains lots of info on optical carrier and disc image formats (e.g. BIN/CUE):

http://web.archive.org/web/20070221154246/http://www.goldenhawk.com/download/cdrwin.pdf

18/08/2016

Python requests fetch a file from a local url

http://stackoverflow.com/questions/10123929/python-requests-fetch-a-file-from-a-local-url

17/08/2016

Computer Display Calibration 101

https://blog.codinghorror.com/computer-display-calibration-101/

Bias Lighting

https://blog.codinghorror.com/bias-lighting/

16/08/2016

Recursively find/count files with specific extension

Find all files with .pdf extension:

find . -type f -name '*.pdf'

Count all files with .pdf extension:

find . -type f -name '*.pdf'| wc -l
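Roughly the same in Python with pathlib, as a sketch (the function name is my own):

```python
from pathlib import Path


def find_files(root, extension):
    """Return all regular files below `root` whose name ends with `extension`."""
    return sorted(p for p in Path(root).rglob("*" + extension) if p.is_file())
```

len(find_files(".", ".pdf")) then gives the count.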

PyRomInfo

Esp. 'useful links' section:

https://github.com/garbear/pyrominfo

21/07/2016

One pixel is worth three thousand words

Representation of 1 pixel in many different formats:

http://cloudinary.com/blog/one_pixel_is_worth_three_thousand_words

20/07/2016

The Programming Historian

Online tutorials on APIs, Data Management, Data Manipulation, Distant Reading, Linked Open Data, Mapping and GIS, Network Analysis, Omeka Exhibit Building, Web Scraping and Programming with Python:

http://programminghistorian.org/lessons/

14/07/2016

Writerperfect library

Supports lots of (old) Office-related formats + includes many conversion tools:

https://launchpad.net/ubuntu/+source/writerperfect/0.9.5-1

06/07/2016

Horrifying PDF experiments

https://github.com/osnr/horrifying-pdf-experiments

05/07/2016

Python classes simple examples

https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial/Classes

26/06/2016

How To Install Linux Mint to SSD and HHD /home

https://forums.linuxmint.com/viewtopic.php?t=177915

23/06/2016

Python metadata libraries

(Source: Nick Krabbenhöft on Twitter)

22/06/2016

Library of Congress Audio Compact Disc METS Profile

http://www.loc.gov/standards/mets/profiles/00000007.html

Creating Virtual CD-ROM Collections

http://dx.doi.org/10.2218/ijdc.v4i2.107

From Imaging to Access - Effective Preservation of Legacy Re-movable Media

http://www.digpres.com/publications/woodsbrownarch09.pdf

Example METS file (note that apparently they combine multiple ISOs in one AIP):

http://webapp1.dlib.indiana.edu/virtual_disk_library/index.cgi/4252478/mets

BL METS profile - Sound Recordings 2

http://www.bl.uk/profiles/sound/METS_profile.pdf

19/06/2016

Linux File System Hierarchy

https://www.blackmoreops.com/2015/06/18/linux-file-system-hierarchy-v2-0/

Digital Dark Age Klaxon

https://youtu.be/a_6CZ2JaEuc

17/06/2016

SIP creator tools

14/06/2016

Characterisation of CD-ROMs

31/05/2016

Validate XML against user-defined XSD schema

xmllint --noout -schema schema.xsd whatever.xml

27/05/2016

Recursively compute md5 checksums for all files in directory tree

find -type f -exec md5sum "{}" + > checksums.md5

Source: http://askubuntu.com/a/318534. Works also under Cygwin.

Issue: the output also includes the MD5 sum of the output file itself (which becomes invalid once anything is written to the file).
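A Python sketch that avoids the issue with the output file by skipping it explicitly. The function name and default file name are my own; the output lines follow the "hash, two spaces, path" layout that md5sum writes.

```python
import hashlib
import os


def write_checksums(root, output_name="checksums.md5"):
    """Write md5sum-style lines for all files under `root`, skipping the output file."""
    output_path = os.path.abspath(os.path.join(root, output_name))
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.abspath(path) == output_path:
                continue  # avoid hashing the checksum file itself
            digest = hashlib.md5()
            with open(path, "rb") as stream:
                for chunk in iter(lambda: stream.read(1024 * 1024), b""):
                    digest.update(chunk)
            relative = os.path.relpath(path, root)
            lines.append("{}  ./{}".format(digest.hexdigest(), relative))
    with open(output_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + ("\n" if lines else ""))
    return lines
```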

23/05/2016

Generate new access JP2 from master

  1. Convert master JP2 to TIFF using Kakadu (this preserves any embedded ICC profiles):
    kdu_expand -i master.jp2 -o master.tiff

  2. Convert TIFF to lossy JP2 with Aware via jpwrappa:
    jpwrappa -m -p C:\jpwrappa\profiles\optionsKBAccessLossy_2014.xml master.tiff access.jp2

(The -m switch can be omitted, in which case there is no need for Exiftool.)

19/05/2016

Disc robots

18/05/2016

Digital newspapers

10/05/2016

Use docx document as template in Pandoc

Use the --reference-docx switch:

pandoc -S --reference-docx=template.docx test.md -o test.docx 

26/04/2016

Rollback git repo to previous state + push changes to remote

Rollback to previous state:

git reset --hard <tag/branch/commit id>

Push changes to remote (forced):

git push ... -f

Example:

git reset --hard 2dbe067c1674dcf9a23104c4b64b772e1550ba29
git push origin master -f

Mimetype Comparison DROID, Tika, File, April 2016

http://162.242.228.174/mimes/mime_comparisons.html

Common Crawl

An open repository of web crawl data that can be accessed and analyzed by anyone

https://commoncrawl.org/

25/04/2016

Tika-python

A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.

https://github.com/chrismattmann/tika-python

Manipulating PDFs with Python

https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

22/04/2016

Introduction to the Bash Command Line (The Programming Historian)

http://programminghistorian.org/lessons/intro-to-bash

20/04/2016

Python wrapper for EpubCheck

https://github.com/titusz/epubcheck

Tegelspreukmaker

http://www.tegelspreukmaker.nl/

14/04/2016

Sozi presentation software

Looks a bit similar to Prezi, but open source (presentation as SVG):

http://sozi.baierouge.fr/

06/04/2016

Android screen rotate in VirtualBox

Press F9, F10, F11 or F12 twice. "Auto-rotate screen" option in Android Settings must be enabled.

04/04/2016

HTML codeblock hell in Wordpress

Following codeblock is not rendered correctly in Wordpress:

<pre><code>&lt;div&gt;test&lt;/div&gt;</code></pre>

Workaround is to replace forward slash in closing tag by entity reference:

<pre><code>&lt;div&gt;test&lt;&#47;div&gt;</code></pre>
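The workaround is easy to automate. A sketch using the standard html module (the function name is my own):

```python
import html


def wordpress_safe_code(snippet):
    """Escape an HTML snippet and encode '/' in closing tags as &#47;."""
    escaped = html.escape(snippet, quote=False)
    # After escaping, every closing tag starts with '&lt;/'
    return escaped.replace("&lt;/", "&lt;&#47;")
```

The result still has to be wrapped in the pre and code tags by hand.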

29/03/2016

Caradoc - a PDF parser and validator

https://github.com/ANSSI-FR/caradoc

Note: current Debian package of Opam not recent enough, so used the instructions under "Binary distribution" at https://opam.ocaml.org/doc/Install.html. Installs binary in /usr/local/bin.

Make file initially didn't work because ocamlfind could not be found. Fixed by typing:

eval $(opam config env)

After this it compiles without any errors.

24/03/2016

Seeing the Double Rainbow: The Trials and Tribulations Working with Optical Media

Includes MiniDisc:

http://ndsr.nycdigital.org/seeing-the-double-rainbow-the-trials-and-tribulations-working-with-optical-media/

15/03/2016

Ebooklib

Python library that reads/writes EPUB, including EPUB 3:

https://github.com/aerkalov/ebooklib

Example, create EPUB from HTML:

https://gist.github.com/bitsgalore/4c830a301f33f584c041

CB infographics e-books in Nederland

http://www.cb.nl/nieuws/alle-relevante-data-over-e-books-in-nederland/

http://www.cb.nl/nieuws/e-bookbarometeblijft-groeien/

14/03/2016

Encyclopedia of Graphics File Formats

http://fileformats.archiveteam.org/wiki/Encyclopedia_of_Graphics_File_Formats

HTML5 is the New Flash

http://homepages.cwi.nl/~steven/Talks/2015/11-06-xml-amsterdam/

05/03/2016

Excel to XML: How to Transfer Your Spreadsheet Data Onto an XML File

This works (but what's referred to as a "schema" isn't really a schema at all):

https://blog.udemy.com/excel-to-xml/

How To Export an Excel 2010 Worksheet to XML

Similar to above, but uses XSD Schema directly, might be better:

https://bitwizards.com/blog/november-2010/how-to-export-an-excel-2010-worksheet-to-xml

23/02/2016

Reference rot in scholarly articles

Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot

Playback WARC

Web archive player:

https://github.com/ikreymer/webarchiveplayer

20/02/2016

Search and replace string for all files in directory tree

E.g. replace every occurrence of /tmp/"$fileIn" with /tmp/"$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 16)":

find /home/johan/cajascripts -type f -print0 | xargs -0 sed -i 's/\/tmp\/"$fileIn"/\/tmp\/"$(cat \/dev\/urandom | tr -cd '\''a-f0-9'\'' | head -c 16)"/g'
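A plain-string version in Python sidesteps the sed escaping. This is a sketch that assumes all files under the directory are UTF-8 text; the function name is my own.

```python
import os


def replace_in_tree(root, old, new):
    """Replace every occurrence of `old` with `new` in all files under `root`."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # newline="" keeps the original line endings untouched
            with open(path, "r", encoding="utf-8", newline="") as stream:
                text = stream.read()
            if old in text:
                with open(path, "w", encoding="utf-8", newline="") as stream:
                    stream.write(text.replace(old, new))
                changed.append(path)
    return changed
```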

18/02/2016

Save blog with archiveBot

  • Don't save offsite links
  • Use 'blogs' ignore pattern

Command (I think?):

!archive http://www.flipvandyke.nl/ --no-offsite-links --ignore-sets=blogs

28/01/2016

Recovering data from broken disk under Ubuntu

https://help.ubuntu.com/community/DataRecovery

14/01/2016

Links to freely available EPUB files with DRM

07/01/2016

Determine actual compression ratio of each quality layer in JP2

If N = number of layers, then first extract layers i and below to a separate JP2 with Aware j2kdriver tool:

j2kdriver -i foo.jp2 -ql (N-i+1) -t JP2 -o foo_i.jp2

Then use jpylyzer to compute the compression ratio of resulting image.

Example - input image with 11 quality layers

Create derived image for each quality layer:

j2kdriver -i MMAD01_000001001_00011_master.jp2 -ql 11 -t JP2 -o layer1.jp2
j2kdriver -i MMAD01_000001001_00011_master.jp2 -ql 10 -t JP2 -o layer2.jp2
::
::
j2kdriver -i MMAD01_000001001_00011_master.jp2 -ql 1 -t JP2 -o layer11.jp2
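The list of commands can be generated instead of typed out. A sketch that only builds the command strings with the flags used above (it does not run j2kdriver; the function name is my own):

```python
def layer_commands(master, n_layers):
    """Build one j2kdriver command per quality layer (layer i uses -ql N-i+1)."""
    commands = []
    for i in range(1, n_layers + 1):
        commands.append(
            "j2kdriver -i {} -ql {} -t JP2 -o layer{}.jp2".format(
                master, n_layers - i + 1, i
            )
        )
    return commands
```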

09/12/2015

Change last modified date of file

touch -d "1 January 1768" myfile.txt

30/11/2015

Stop laptop from re-booting after shutdown

This happened to my HP ProBook 640 G1. Workaround: in BIOS, disable "wake on LAN". Source: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1470723/comments/13

24/11/2015

Comparison of CD rippers

http://wiki.hydrogenaud.io/index.php?title=Comparison_of_CD_rippers

10/11/2015

Convert Word document to PDF from command line

http://superuser.com/questions/789968/windows-7-batch-command-line-to-save-as-pdf-file-for-word-2013-docx-file

06/11/2015

Beeld & Geluid Preservation Metadata Dictionary

http://publications.beeldengeluid.nl/pub/84

05/11/2015

Yale Library Digital Preservation System Requirements

http://web.library.yale.edu/sites/default/files/files/YULDPSHighLevelRequirementsUseCasesDiagrams.pdf

19/10/2015

Best Way To Merge A (GitHub) Pull Request

http://blog.differential.com/best-way-to-merge-a-github-pull-request/

Third option (Catch Feature Up with Master by Rebasing, then fast-forward Merge).

16/10/2015

Handboek informaticavaardigheden UvA (UvA handbook of computer skills)

http://liv.science.uva.nl/index.html

Parts of this may be (re)usable for internal courses and the like.

10/10/2015

Add right-click context menu items in Ubuntu / Linux Mint

Ubuntu with Nautilus file manager - Nautilus Actions:

http://www.pcsteps.com/4434-add-right-click-commands-linux-mint-ubuntu/

Linux Mint Cinnamon with Nemo file manager:

http://www.pcsteps.com/4434-add-right-click-commands-linux-mint-ubuntu/

Linux Mint Mate with Caja file manager:

http://www.ethanjoachimeldridge.info/tech-blog/caja-exifstrip-context-action

10/09/2015

Create floppy image from arbitrary files

From http://stackoverflow.com/a/11202773:

Suppose I want to create a floppy image containing file oakcdrom.sys:

dd bs=512 count=2880 if=/dev/zero of=oakcd.img
mkfs.msdos oakcd.img
mcopy -i oakcd.img oakcdrom.sys ::/

Inspect contents:

mdir -i oakcd.img
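For reference, the dd step just writes 2880 zero-filled sectors of 512 bytes (1474560 bytes, i.e. a 1.44 MB floppy). A Python sketch of that step only; mkfs.msdos and mcopy are still needed to create the file system and copy the file. The function name is my own.

```python
def create_blank_image(path, sector_size=512, sectors=2880):
    """Write a zero-filled image file; returns its size in bytes."""
    size = sector_size * sectors
    with open(path, "wb") as image:
        image.write(b"\x00" * size)
    return size
```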

27/08/2015

Create image of 3.5" DOS / Windows floppy

General command:

ddrescue -d -n -b 512 /dev/fd0 myfloppy.img myfloppy.log 

To get name of device:

lsblk

Result:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 465,8G  0 disk 
├─sda1   8:1    0 457,9G  0 part /
├─sda2   8:2    0     1K  0 part 
└─sda5   8:5    0   7,9G  0 part [SWAP]
sdb      8:16   0  29,8G  0 disk 
sdc      8:32   1   1,4M  1 disk

So in this case it is /dev/sdc. Create the image with:

sudo ddrescue -d -n -b 512 /dev/sdc myfloppy.img myfloppy.log

Optionally use the dosfsck tool to check the integrity of the file system (assuming it is a DOS file system). Use the following command:

echo "n" |dosfsck -t -r myfloppy.img

The -t option checks for bad clusters, but this only works in combination with -a (automatically repair) or -r (interactively repair). So to do the check without automatic repair or input from user we use -r and then use a pipe to prevent any changes being made. Result:

fsck.fat 3.0.26 (2014-03-07)
Cluster 2845 is unreadable.
Cluster 2846 is unreadable.
Cluster 2847 is unreadable.
Cluster 2848 is unreadable.
Perform changes ? (y/n) myfloppy.img: 33 files, 2304/2847 clusters
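The unreadable clusters can be pulled out of this output with a regular expression. A sketch, assuming the message wording is always as in the output above (the function name is my own):

```python
import re


def unreadable_clusters(fsck_output):
    """Return the cluster numbers that fsck.fat reported as unreadable."""
    pattern = r"^Cluster (\d+) is unreadable\."
    return [int(n) for n in re.findall(pattern, fsck_output, re.M)]
```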

Git as synchronisation tool links

Check integrity of git repo:

http://stackoverflow.com/questions/5585388/which-git-commands-perform-integrity-checks

(Bottom line: use git fsck.)

How to shrink the git folder:

http://stackoverflow.com/questions/5613345/how-to-shrink-the-git-folder

25/05/2015

Exiting and re-entering GUI in Linux Mint

Exit GUI:

 Ctrl-Alt-F1

Re-enter:

Ctrl-Alt-F8

18/05/2015

Make Markdown preview in ReText work

From https://bugs.launchpad.net/ubuntu/+source/retext/+bug/1451125:

sudo apt-get install python3-docutils python3-markdown

17/05/2015

Entering BIOS of HP EliteBook 840

From the manual:

  1. Turn on or restart the computer, and then press esc while the “Press the ESC key for Startup Menu” message is displayed at the bottom of the screen
  2. Press f10 to enter Computer Setup.

Check hard disk for bad sectors/blocks

sudo badblocks -sv /dev/sda1

See also: http://askubuntu.com/questions/59064/how-to-run-a-checkdisk

14/04/2015

Location of Virtual Box Guest additions on Linux host machine

/usr/share/virtualbox

23/03/2015

How to get rid of clock skew errors while building packages on VM

Run this on host machine:

sudo ntpdate ntp.xs4all.nl

Then re-start VM; host and guest are now in sync and no more clock skew errors.

17/03/2015

Markdown to HTML (with smart quotes) in Pandoc

pandoc -S whatever.md -o whatever.html

12/03/2015

Validating code lists with Schematron

http://broadcast.oreilly.com/2008/11/validating-code-lists-with-sch.html

02/03/2015

Character sets

Useful background information on Unicode and UTF-8:

http://codesnippets.wpakb.kb.nl/index.php?title=Character_sets

17/02/2015

EPUB creation tool

Sigil:

https://github.com/user-none/Sigil

Simple, user-friendly.

04/02/2015

ISO Image creation

ddrescue:

http://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

Command line (Cygwin):

ddrescue -b 2048 -v /dev/scd0 test.iso test.log

Info on image

disktype tool:

http://disktype.sourceforge.net/

E.g. reveals file system type (ISO/UDF) and other tech info.

22/01/2015

Installing Windows 98 in VirtualBox

General instructions here:

http://www.msfn.org/board/topic/170785-virtualbox-windows-98se-step-by-step/

But results in error:

HID failed to attach mouse driver (VERR_PDM_NO_ATTACHED_DRIVER

Tried this:

https://forums.virtualbox.org/viewtopic.php?f=2&t=58657#p272752

VBoxInternal/USB/HidMouse/1/Config/CoordShift 0

Still doesn't work; neither does:

VBoxInternal/USB/HidMouse/1/Config/CoordShift 1

But see:

https://www.virtualbox.org/manual/ch12.html#idp60139152

Installing Windows 2000 in VirtualBox

Windows 2000 installation failures:

https://www.virtualbox.org/manual/ch12.html#idp60119680

Works!

Then go install guest additions:

https://docs.oracle.com/cd/E36500_01/E36502/html/qs-guest-additions.html

09/12/2014

AsciiMath

"AsciiMath is an easy-to-write markup language for mathematics":

http://asciimath.org/

03/12/2014

Git cheat sheet

Add all files in directory tree to the index (and remove deleted ones)

git add -A

Commit

git commit -m "Changed everything"

Push to master

git push origin master

Push to some other repo (provided I have the rights for this)

git push git@github.com:openplanets/jpylyzer-test-files.git master

Versioning / tagging

Versioning: x.y.z

  • x: API breakage
  • y: new feature
  • z: bugfix
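A sketch of the x.y.z scheme as a helper function. Resetting the lower parts to 0 after a bump is my assumption (it is the usual convention, but not stated above), and the function name is my own.

```python
def bump_version(version, part):
    """Increment one part of an x.y.z version and reset the parts after it."""
    x, y, z = (int(n) for n in version.split("."))
    if part == "x":  # API breakage
        return "{}.0.0".format(x + 1)
    if part == "y":  # new feature
        return "{}.{}.0".format(x, y + 1)
    if part == "z":  # bugfix
        return "{}.{}.{}".format(x, y, z + 1)
    raise ValueError("part must be 'x', 'y' or 'z'")
```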

Add tag

git tag -a 1.1.0 -m "tagging version 1.1.0 with refactored code"

Push tags

git push --tags

02/12/2014

Create test dataset according to new KB digitisation specs from old JP2 batch

1. Convert all master JP2s to TIFF with ImageMagick, using the command:

mogrify -format tiff *.jp2

2. Conversion loses resolution info (see below), so add new values using:

exiftool *.tiff -xresolution=300 -yresolution=300 -resolutionunit=inches

3. Convert TIFFs to master JP2s:

f:\johan\pythoncode\jpwrappa\jpwrappa\jpwrappa.py M:\Trans\johan\testJP2ContrApp2014\B5\tiff\*.tiff M:\Trans\johan\testJP2ContrApp2014\B5\jp2k\master\ -p F:\johan\pythonCode\jpwrappa\jpwrappa\profiles\optionsKBMasterLossless_2014.xml -m

4. Same for access JP2s:

f:\johan\pythoncode\jpwrappa\jpwrappa\jpwrappa.py M:\Trans\johan\testJP2ContrApp2014\B5\tiff\*.tiff M:\Trans\johan\testJP2ContrApp2014\B5\jp2k\access\ -p F:\johan\pythonCode\jpwrappa\jpwrappa\profiles\optionsKBAccessLossy_2014.xml -m

But ... looking at image header box:

<imageHeaderBox>
  <height>2818</height>
  <width>1913</width>
  <nC>1</nC>
  <bPCSign>unsigned</bPCSign>
  <bPCDepth>8</bPCDepth>
  <c>jpeg2000</c>
  <unkC>yes</unkC>
  <iPR>no</iPR>
</imageHeaderBox>

So "unknown colourspace" is set to "yes", which should be no (and it is "No" in the source JP2). So what is causing this? Bug in Aware software? Does this only happen with Grayscale images?

Aware codec produces JP2s that are not valid if TIFF doesn't contain resolution info

To reproduce the problem:

  • Convert any JP2 to TIFF with ImageMagick (will strip away any resolution info)
  • Convert TIFF to JP2 with Aware.

Run jpylyzer on resulting JP2:

<isValidJP2>False</isValidJP2>
<tests>
  <jp2HeaderBox>
    <resolutionBox>
      <captureResolutionBox>
        <hRcNIsValid>False</hRcNIsValid>
      </captureResolutionBox>
    </resolutionBox>
  </jp2HeaderBox>
</tests>

Looking at properties of resolution box:

<resolutionBox>
  <captureResolutionBox>
    <vRcN>29491</vRcN>
    <vRcD>7491</vRcD>
    <hRcN>0</hRcN>
    <hRcD>1</hRcD>
    <vRcE>1</vRcE>
    <hRcE>4</hRcE>
    <vRescInPixelsPerMeter>39.37</vRescInPixelsPerMeter>
    <hRescInPixelsPerMeter>0.0</hRescInPixelsPerMeter>
    <vRescInPixelsPerInch>1.0</vRescInPixelsPerInch>
    <hRescInPixelsPerInch>0.0</hRescInPixelsPerInch>
  </captureResolutionBox>
</resolutionBox>
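The derived values follow from the numerator, denominator and exponent fields: resolution = (N / D) * 10^E in pixels per meter. A sketch that reproduces the numbers above (function names are my own, Python 3 division assumed); the horizontal numerator of 0 gives a resolution of 0, which is presumably why hRcNIsValid fails.

```python
def resolution_pixels_per_meter(numerator, denominator, exponent):
    """Resolution = (numerator / denominator) * 10**exponent, in pixels per meter."""
    return (numerator / denominator) * 10 ** exponent


def pixels_per_inch(pixels_per_meter):
    """Convert pixels per meter to pixels per inch (1 inch = 0.0254 m)."""
    return pixels_per_meter * 0.0254
```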

25/11/2014

Encodings and writing to file (Unicode)

Here for UTF-8:

http://stackoverflow.com/a/9822937
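In Python 3 this comes down to passing the encoding explicitly when opening the file. A minimal sketch (the function name is my own, and this is not necessarily the approach in the linked answer):

```python
def write_utf8(path, text):
    """Write text to a file, encoding it explicitly as UTF-8."""
    with open(path, "w", encoding="utf-8") as stream:
        stream.write(text)
```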

20/11/2014

Jpylyzer Ubuntu / Debian links

Clone specific branch of Github repo

git clone https://github.com/openpreserve/jpylyzer.git --branch gh-pages --single-branch ./jpylyzerHomepage

07/11/2014

Refs to external macros in Excel workbook

File:

E:\laPeyneCDROM\xlsfiles\series98.xls

Refs to MACROS.XLS'!ENash, which is missing.

Solution: before opening, disable automatic workbook calculation in the Excel options.

Loading the spreadsheet then shows the most recent values that are stored in the workbook.

27/10/2014

Google search by file extension

thermo filetype:tdb

Only gives results with extension tdb.

16/10/2014

CD imaging

09/10/2014

Publisher data formats

https://spotdocs.scholarsportal.info/display/EJournals/Publisher+Data+Formats

06/10/2014

EPUBCHECK validation errors/warnings

Both errors and warnings are reported in the same message element in the XML. E.g. compare:

  <status>Not well-formed</status>
  <messages>
   <message>ERROR: /OEBPS/cover.html(5): non-standard stylesheet resource 'OEBPS/page-template.xpgt' of type 'application/vnd.adobe-page-template+xml'. A fallback must be specified.</message>
   <message>ERROR: /OEBPS/copyright.html(5): non-standard stylesheet resource 'OEBPS/page-template.xpgt' of type 'application/vnd.adobe-page-template+xml'. A fallback must be specified.</message>
   </messages>

with this:

  <status>Well-formed</status>
  <messages>
   <message>WARN: /OEBPS/toc.ncx: meta@dtb:uid content 'null' should conform to unique-identifier in content.opf: '821'</message>
  </messages>

So output needs some parsing. Tested w. epubcheck 3.0.1.
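A sketch of that parsing with ElementTree: split each message on the first colon and group by the ERROR / WARN prefix. This assumes the output is a well-formed document without XML namespaces (the snippets above are fragments, so the enclosing root element is unknown); the function name is my own.

```python
import xml.etree.ElementTree as ET


def split_messages(xml_text):
    """Group <message> texts by their 'ERROR' / 'WARN' prefix."""
    grouped = {}
    for message in ET.fromstring(xml_text).iter("message"):
        # Everything before the first colon is the severity level
        level, _, text = (message.text or "").partition(":")
        grouped.setdefault(level.strip(), []).append(text.strip())
    return grouped
```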

02/10/2014

External drives Windows PC

  • E drive: Hitachi (large drive)
  • H drive: Buffalo (small drive)

H is used as backup disk for E.

26/09/2014

Jpylyzer poster DPC / 4C

17/18 November: poster cancelled, but a 90 s talk + 1 slide instead.

10/09/2014

Jpylyzer users & links

BnF:

http://www.bnf.fr/documents/ref_num_fichier_image.pdf

04/09/2014

Ebook vs paper

Readers absorb less on Kindles than on paper, study finds:

http://www.theguardian.com/books/2014/aug/19/readers-absorb-less-kindles-paper-study-plot-ereader-digitisation

Reading and learning from screens versus print: a study in changing habits: Part 1 – reading long information rich texts:

http://www.emeraldinsight.com/doi/full/10.1108/NLW-01-2013-0012

http://www.scientificamerican.com/article/reading-paper-screens/

21/08/2014

Syncing a fork in Github

https://help.github.com/articles/syncing-a-fork

Requires:

https://help.github.com/articles/configuring-a-remote-for-a-fork

27/02/2014

Useful Python shit

10/02/2014

Create PDF from multiple TIFFS

GraphicsMagick command line:

gm convert -compress jpeg -quality 50 *.TIF test.pdf

Result: PDF with all images as JPEG, quality 50. According to Acrobat / Apache Preflight the PDF has some format conformance issues. One possible remedy is to re-process the PDF using Ghostscript. E.g. the command below produces a PDF that conforms to PDF/A-1b:

gswin64 -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=test_a.pdf test.pdf

Source: http://stackoverflow.com/questions/1659147/how-to-use-ghostscript-to-convert-pdf-to-pdf-a-or-pdf-x
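
For scripting the two steps above, a sketch in Python that builds the same command lines as argument lists. The function names are made up for this note. Because `subprocess` does not expand wildcards when no shell is used, the `*.TIF` pattern is expanded with `glob` first. The executable name `gswin64` is copied from the command above and will be different on other platforms.

```python
import glob
import subprocess


def gm_convert_command(tiff_pattern, pdf_path, quality=50):
    """Build the GraphicsMagick command that packs TIFFs into one JPEG-compressed PDF."""
    # Sorted, so the page order is predictable
    tiffs = sorted(glob.glob(tiff_pattern))
    return (["gm", "convert", "-compress", "jpeg", "-quality", str(quality)]
            + tiffs + [pdf_path])


def ghostscript_pdfa_command(pdf_in, pdf_out, gs_executable="gswin64"):
    """Build the Ghostscript command from the notes (same options, same order)."""
    return [gs_executable, "-dPDFA", "-dBATCH", "-dNOPAUSE", "-dUseCIEColor",
            "-sProcessColorModel=DeviceCMYK", "-sDEVICE=pdfwrite",
            "-sPDFACompatibilityPolicy=1", "-sOutputFile=" + pdf_out, pdf_in]


if __name__ == "__main__":
    # Only works if gm and Ghostscript are installed and TIFFs are present
    if glob.glob("*.TIF"):
        subprocess.check_call(gm_convert_command("*.TIF", "test.pdf"))
        subprocess.check_call(ghostscript_pdfa_command("test.pdf", "test_a.pdf"))
```

On case-sensitive file systems the pattern `*.TIF` does not match files ending in `.tif`.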

05/02/2014

2013: 74% of Dutch e-books distributed without DRM

http://www.cb-logistics.nl/wp-content/uploads/2013/01/74-percent-of-Dutch-e-books-distributed-without-DRM.pdf

04/02/2014

Unix Commands and Batch Processing for the Reluctant Librarian or Archivist

Link: http://journal.code4lib.org/articles/9158

03/02/2014

How to estimate JPEG Quality

Tutorial:

http://fotoforensics.com/tutorial-estq.php

But ... this is also possible with ImageMagick / GraphicsMagick (using the Approximate Quantization Table method that is mentioned in the tutorial):

http://superuser.com/questions/62730/how-to-find-the-jpg-quality
