@ato
ato / crawler-beans.groovy
Created November 30, 2024 04:14
Heritrix default profile as Groovy Bean Definition DSL
/**
* HERITRIX 3 CRAWL JOB CONFIGURATION FILE
*
* This is a relatively minimal configuration suitable for many crawls.
*
* Commented-out beans and properties are provided as an example; values
* shown in comments reflect the actual defaults which are in effect
* if not otherwise specified. (To change from the default
* behavior, uncomment AND alter the shown values.)
*/
@ato
ato / warc-conversion-software-fields.md
Last active March 5, 2019 13:06
WARC conversion software fields draft

WARC conversion software fields (draft)

When converting content in an archive it is useful for diagnostic purposes to record the versions of major software components used and important conversion options. Another common use case is to identify records that later need to be reconverted with newer software in order to improve conversion quality or fix records misconverted due to a bug or incorrect option.

WARC-Conversion-Software

The WARC-Conversion-Software field indicates the version of software components used in the

@ato
ato / ReplayProxy.java
Last active November 28, 2018 23:12
OutbackProxy?
package outbackproxy;
import io.undertow.Undertow;
import io.undertow.connector.ByteBufferPool;
import io.undertow.server.DefaultByteBufferPool;
import io.undertow.server.HttpHandler;
import io.undertow.server.HttpServerExchange;
import io.undertow.server.handlers.BlockingHandler;
import io.undertow.util.HeaderMap;
import io.undertow.util.HttpString;
https://nlaaus-my.sharepoint.com/:p:/g/personal/aosborne_nla_gov_au/ETBiJ45EopxHurgUNogz2NwBSuY5zG8uvQub-bccPj-GYw?e=FryT8V
@ato
ato / ssurt.py
Last active November 5, 2017 05:38
#!/usr/bin/python3
# coding=utf-8
import re
SSURT_RE = r"""
\A
(?P<scheme> [a-zA-Z] [a-zA-Z0-9+.-]* : )?
(?P<authority>
(?P<slashes> /* )
@ato
ato / ItemNodesController.java
Last active October 19, 2016 03:12
On the fly unzipping for dl-repo
import de.schlichtherle.truezip.rof.AbstractReadOnlyFile;
import de.schlichtherle.truezip.zip.ZipEntry;
import de.schlichtherle.truezip.zip.ZipFile;
// ...
@RequestMapping(value = "/Repository/unzip/copy/{copyId:nla\\.obj-[^/]+}/{path:.+}", method = {RequestMethod.GET})
@ResponseBody
public void unzipCopy(@PathVariable String copyId,
@PathVariable String path,
@ato
ato / README.md
Last active September 29, 2016 20:24
tinycdxserver example

I just tried my example from the tinycdxserver README and realised that curl is messing up the line endings due to a conversion it does by default. I haven't yet checked exactly what curl is doing, but tinycdxserver interprets the request as if all the lines in the file had been concatenated together (you can see that by running tinycdxserver in verbose mode with the -v option).

Using curl's --data-binary option instead of --data fixes that, and I've updated the README accordingly.
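
For background, curl's manual says that when --data reads its payload from a file, carriage returns and newlines are stripped, whereas --data-binary posts the file unchanged. The Python sketch below only simulates that difference to show why a line-oriented server would see a single record. The function names and the two sample CDX lines are made up for illustration and are not taken from tinycdxserver.

```python
def simulate_curl_data(payload: bytes) -> bytes:
    """Mimic `curl --data @file`: CR and LF bytes are removed before sending."""
    return payload.replace(b"\r", b"").replace(b"\n", b"")


def simulate_curl_data_binary(payload: bytes) -> bytes:
    """Mimic `curl --data-binary @file`: the payload is sent unchanged."""
    return payload


def count_records(body: bytes) -> int:
    """Count non-empty lines, the way a line-oriented server would see them."""
    return sum(1 for line in body.splitlines() if line.strip())


# Two hypothetical CDX lines, for illustration only.
cdx = (
    b"- 20050614070159 http://example.org/ text/html 200 - - 42 example.warc.gz\n"
    b"- 20050614070200 http://example.org/a text/html 200 - - 1337 example.warc.gz\n"
)

print(count_records(simulate_curl_data(cdx)))         # lines merged into one
print(count_records(simulate_curl_data_binary(cdx)))  # both lines preserved
```

With the newlines stripped, both records arrive as one long line, which matches the concatenation described above.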

That could be what's tripping you up. Here's a more complete example that I just tested. You should get an "Added N records" response back if it worked properly, where N is the line count of the cdx.

@ato
ato / buggy.cpp
Last active August 29, 2015 14:03
RCSwitch receive bug
// "unsigned long" is 64-bit on x86_64 so changed to uint32_t to test this like it would be on a 32-bit platform
// commented out the delay logic so that it thinks it's just received all 1s
//
// output:
// $ g++ buggy.cpp -o buggy && ./buggy
// 7fffffff
#include <stdio.h>
#include <stdint.h>
;; Not useful as a generic HTTrack conversion tool as we don't bother trying to undo the URL-rewriting. PANDORA
;; crawls are often manually edited and sometimes collected with tools other than HTTrack.
;; Instead we just generate records with the URLs as they are delivered in the PANDORA archive:
;; http://pandora.nla.gov.au/pan/...
;;
;; Example output:
;;
;; WARC/1.0
;; WARC-Type: resource
;; WARC-Target-URI: http://pandora.nla.gov.au/pan/85187/20080605-1425/www.tams.act.gov.au/__data/assets/pdf_file/0010/102250/Alcohol_and_Drugs_discussion_paper.pdf
@ato
ato / csvish-regex.clj
Last active August 29, 2015 14:02
CSV-ish regex parser
(map #(map second (re-seq #"((?:\"[^\"]*\"|[^ ]+)+)(?: |$)" (second %))) (re-seq #"((?:(?:\"[^\"]*\")+|[^\"\n]+)+)(?:\n|$)" input-string))
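
As a rough cross-check of the one-liner above, here is a Python sketch of the same two-stage parse: one pattern splits the input into rows (newlines inside double quotes do not end a row), and a second splits each row into space-separated fields. The patterns are copied from the Clojure expression; the name `parse_csvish` and the sample input are hypothetical, and the sketch assumes Python's `re` treats these patterns the same way Java's regex engine does.

```python
import re

# Row pattern: runs of quoted strings or non-quote, non-newline text,
# terminated by a newline or the end of input.
ROW_RE = re.compile(r'((?:(?:"[^"]*")+|[^"\n]+)+)(?:\n|$)')

# Field pattern: quoted strings or runs of non-space characters,
# terminated by a single space or the end of the row.
FIELD_RE = re.compile(r'((?:"[^"]*"|[^ ]+)+)(?: |$)')


def parse_csvish(text):
    """Return a list of rows, each a list of fields (quotes are kept)."""
    return [FIELD_RE.findall(row) for row in ROW_RE.findall(text)]


sample = 'a b "c d"\ne "f\ng" h'
print(parse_csvish(sample))
# [['a', 'b', '"c d"'], ['e', '"f\ng"', 'h']]
```

Note that the surrounding double quotes remain part of each field, since neither pattern strips them.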
Working from the inside out:
(re-seq #"((?:(?:\"[^\"]*\")+|[^\"\n]+)+)(?:\n|$)" input-string)
Breaks down into:
(
(?: