@ato
Last active September 29, 2016 20:24
tinycdxserver example

I just tried my example from the tinycdxserver README and realised that curl mangles the line endings because of a conversion it does by default. I haven't checked exactly what curl is doing, but tinycdxserver interprets the result as if all the lines in the file had been concatenated together (you can see this by running tinycdxserver in verbose mode with the -v option).

Using curl's --data-binary option instead of --data fixes that, and I've updated the README accordingly.
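
For what it's worth, the curl manual says that --data, when reading from a file with @, strips carriage returns and newlines, which would match the concatenation seen in verbose mode. The sketch below only imitates that stripping locally with tr; it makes no request, and simulate_data_body and the two-line sample file are made up for illustration.

```shell
# Imitate the body `curl --data @file` would send: the file's bytes
# with every CR and LF removed. tr stands in for curl; nothing is sent.
simulate_data_body() {
  tr -d '\r\n' < "$1"
}

# Hypothetical two-line sample file, just for illustration.
sample=$(mktemp)
printf 'line one\nline two\n' > "$sample"

simulate_data_body "$sample"; echo   # both lines come out run together
wc -l < "$sample"                    # the file itself still has 2 lines
rm -f "$sample"
```

With --data-binary the file is sent byte for byte, so the newlines that separate the records survive.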

That could be what's tripping you up. Here's a more complete example that I just tested. You should get an "Added N records" response back if it worked properly, where N is the line count of the cdx.

About the example CDX records below

records.cdx below has a blank ("-") first column because tinycdxserver ignores that column and does its own canonicalisation, so our usual indexing process doesn't even bother filling it in. You can use standard CDX files as well: I've included a second file, records2.cdx, with SURT-style URLs generated using IA tools, just to demonstrate that.
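
To see which column is which, a small awk helper can label the fields of each record. This is only a sketch: cdx_fields is a hypothetical name, and the column positions assume the 11-column, space-separated layout that the sample records in this gist appear to use.

```shell
# cdx_fields FILE
# Print a labelled subset of the columns of each CDX line.
# Assumed positions: 1 = URL key ("-" when left blank), 2 = timestamp,
# 3 = original URL, 5 = HTTP status, 10 = offset, 11 = (W)ARC filename.
cdx_fields() {
  awk '{ printf "urlkey=%s timestamp=%s url=%s status=%s offset=%s file=%s\n", $1, $2, $3, $5, $10, $11 }' "$1"
}

# Usage:
#   cdx_fields records.cdx
```

For records.cdx every line should come out with urlkey=-, while records2.cdx should show the SURT-style keys.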

Usage walkthrough

Compile tinycdxserver:

$ git clone git@github.com:nla/tinycdxserver.git
$ cd tinycdxserver
$ mvn package

Start tinycdxserver:

$ mkdir /tmp/data
$ java -jar target/tinycdxserver-0.1-SNAPSHOT.jar -d /tmp/data

Grab an example CDX:

$ curl -LO https://gist.github.com/ato/b2ad8e65b35afe690921/raw/4e663c44c74c585ac0d5226780465d2281177958/records.cdx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1203  100  1203    0     0   1297      0 --:--:-- --:--:-- --:--:--  1297

Load it:

$ curl -XPOST --data-binary @records.cdx http://localhost:8080/myindex
Added 6 records
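
If you want to check the reply automatically, you can compare it with the file's line count. A minimal sketch: check_added is a hypothetical helper, it assumes the reply has exactly the "Added N records" form shown above, and it assumes the file ends with a newline so that wc -l counts every record.

```shell
# check_added REPLY CDX_FILE
# Succeeds when REPLY is exactly "Added N records", where N is the
# number of lines in CDX_FILE. (check_added is a made-up helper name.)
check_added() {
  expected=$(wc -l < "$2" | tr -d ' ')
  [ "$1" = "Added $expected records" ]
}

# Usage, assuming tinycdxserver is listening on localhost:8080:
#   reply=$(curl -s -XPOST --data-binary @records.cdx http://localhost:8080/myindex)
#   check_added "$reply" records.cdx && echo "load OK"
```

A reply of "Added 1 records" for a multi-line file is a hint that the newlines were lost on the way.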

Get a record back:

$ curl -s 'http://localhost:8080/myindex?url=http://minister.infrastructure.gov.au/'
au,gov,infrastructure,minister)/ 20150914222035 http://www.minister.infrastructure.gov.au/ text/html 301 ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN 389

Query using wayback's xml protocol:

$ curl -s 'http://localhost:8080/myindex?q=type:urlquery+url:http://minister.infrastructure.gov.au/' | xml_pp
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <enddate>20151015072406</enddate>
    <type>urlquery</type>
    <firstreturned>0</firstreturned>
    <url>au,gov,infrastructure,minister)/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>152443</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz</file>
      <redirecturl>http://minister.infrastructure.gov.au/</redirecturl>
      <urlkey>au,gov,infrastructure,minister)/</urlkey>
      <digest>ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN</digest>
      <httpresponsecode>301</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.minister.infrastructure.gov.au/</url>
      <capturedate>20150914222035</capturedate>
    </result>
  </results>
</wayback>

records.cdx:

- 20150914222034 http://www.financeminister.gov.au/ text/html 200 ZMSA5TNJUKKRYAIM5PRUJLL24DV7QYOO - - 83848 117273 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222035 http://strongersuper.treasury.gov.au/ text/html 302 TDYO3KQ3O2PR5EJJDNQ7NBNHWU44WR3D http://strongersuper.treasury.gov.au/content/Content.aspx?doc=home.htm - 442 138671 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222035 http://www.mhs.gov.au/ text/html 200 LLSUKKXWSWIPCKTKRKFQY4VRTORHRKZT - - 9777 140712 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222034 http://jbh.ministers.treasury.gov.au/ text/html 200 NS2AUHSI3HD2Y5VHYIQEYOX3Y3BSFQLG - - 19119 145121 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222035 http://www.minister.infrastructure.gov.au/ text/html 301 ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN http://minister.infrastructure.gov.au/ - 389 152443 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222034 http://bfb.ministers.treasury.gov.au/ text/html 200 WXEF6JLTZCZITLEP3VDFQ4MCB3ZS5EYS - - 19112 153934 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz

records2.cdx:

au,gov,australia)/about 20070831172339 http://australia.gov.au/about text/html 200 ZUEQ3STH3JAEABZG22LQI626TTY7DN2A - - - 14369759 NLA-AU-CRAWL-002-20070831172246-04117-crawling015.us.archive.org.arc.gz
au,gov,australia)/about 20080719174427 http://www.australia.gov.au/About text/html 200 CGSTTFZGMVAHEOMHQTGTUZUG46MLBFL6 - - - 62867360 NLA-AU-CRAWL-003-20080719174211-01545-crawling104.us.archive.org.arc.gz
au,gov,australia)/about 20090916104859 http://www.australia.gov.au/about text/html 200 7VXWF4Y6TXFWR7JZPORIEHUD5ORMHBMY - - - 59828846 NLA-AU-CRAWL-004-20090916104520-09084-crawling106.us.archive.org.arc.gz
au,gov,australia)/about 20091112023446 http://australia.gov.au/about text/html 200 7VXWF4Y6TXFWR7JZPORIEHUD5ORMHBMY - - - 70365777 NLA-AU-CRAWL-004-PATCH-20091112023201-00275-crawling108.us.archive.org.arc.gz
au,gov,australia)/about 20110216141839 http://www.australia.gov.au/about - 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 765762042 NLA-AU-CRAWL-005-20110216141406-00155-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110217132700 http://australia.gov.au/about text/html 200 3JQ6HKH4HXEI4G335KENZBQNCFHF7PP4 - - - 477571492 NLA-AU-CRAWL-005-20110217132352-00349-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110226123639 http://australia.gov.au/about text/html 200 DRP6CY44HXCJP4TNTMNOKE6AF3ZANGVU - - - 343881716 NLA-AU-CRAWL-005-20110226123146-00037-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110226133347 https://australia.gov.au/about text/html 200 WBAG4MI6N5QCQ2LFLKSA3OQ6RZUMPTMO - - - 593342593 NLA-AU-CRAWL-005-20110226132237-00040-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110328204616 http://australia.gov.au/about text/html 200 Z34GAL7DQINDJUXUS4CGPEL4YK4FRIOH - - - 656072390 NLA-AU-CRAWL-005-20110328201652-00001-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110422083017 http://australia.gov.au/about text/html 200 BPPC5KI3E44TVMKFA66ZFUUT46KP7SAV - - - 513717895 NLA-AU-CRAWL-005-20110422082730-00024-crawling213.us.archive.org.warc.gz
au,gov,australia)/about 20120321062048 http://australia.gov.au/about text/html 200 RWIUXTTE64RHNEWQCCL7UEDIGZJNPLVJ - - - 198474221 NLA-AU-CRAWL-006-20120321061732570-00098-3266~web-crawl001.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130409003017 http://australia.gov.au/about text/html 200 IB3AMRZJMPFIATC6WQHPH4LVUUACXAW7 - - - 718271905 NLA-AU-CRAWL-04-03-2013-20130409002124240-00009-27793~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130409234435 https://www.australia.gov.au/about application/http 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 855531863 NLA-AU-CRAWL-04-03-2013-20130409232851898-00397-27793~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130410112908 https://australia.gov.au/about text/html 200 PNFEFWUCNGAVTARXTS5LLSOMATLRFRG3 - - - 11352123 NLA-AU-CRAWL-04-03-2013-20130410112854462-00572-27793~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130421094357 https://www.australia.gov.au/about warc/revisit - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 218425876 NLA-AU-CRAWL-04-03-2013-20130421093701746-01421-29417~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130421122342 https://australia.gov.au/about text/html 200 PICUVAYGMZY5IOXPWHKLE6BVYFACC7LG - - - 353874381 NLA-AU-CRAWL-04-03-2013-20130421121108172-01443-29417~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130427095452 https://www.australia.gov.au/about warc/revisit - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 903094466 NLA-AU-CRAWL-04-03-2013-20130427085524926-01687-433~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130427132450 https://australia.gov.au/about text/html 200 PNX76BM2Z5WK4H66M4LGLE25AXLC5SZ5 - - - 286205731 NLA-AU-CRAWL-04-03-2013-20130427131330785-01699-433~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130502072522 http://australia.gov.au/about text/html 200 NUUDONRPIPF3FBGZL2UZRVUKIJPL6G2F - - - 659173849 NLA-AU-CRAWL-04-03-2013-20130502071652498-00011-26913~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130503074233 https://www.australia.gov.au/about application/http 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 68418494 NLA-AU-CRAWL-04-03-2013-20130503074145197-00410-26913~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130503082356 https://australia.gov.au/about text/html 200 DLJ7MHJX7XSBHL3DSK4AJTP47YJQ4XD6 - - - 169734931 NLA-AU-CRAWL-04-03-2013-20130503082227277-00426-26913~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130509171208 https://www.australia.gov.au/about application/http 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 469007252 NLA-AU-CRAWL-04-03-2013-20130509170233360-01422-2357~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130509202803 https://australia.gov.au/about text/html 200 NW2L4S35DERPWVH63TGZRN67PILW5BSK - - - 76130469 NLA-AU-CRAWL-04-03-2013-20130509202651491-01459-2357~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20140114001125 http://australia.gov.au/about text/html 200 N75NS3Y3B44BJE22SDICYCBKE5YQKHUI - - 8295 476816575 NLA-AU-TEST-01-10-2014-20140114000228973-00012-24613~wbgrp-crawl003.us.archive.org~8443.warc.gz
au,gov,australia)/about 20140126200707 http://australia.gov.au/about text/html 200 V3PBNQEH6EPI5ARS6HOTI4GA7MA37ZG6 - - 8316 40437948 NLA-AU-CRAWL-01-21-2014-20140126200633086-00572-25807~wbgrp-crawl004.us.archive.org~8443.warc.gz
au,gov,australia)/about 20140407081604 http://www.australia.gov.au/about text/html 200 AKSSUZJLOW3BJF546AAVCYPSJH3PSCRF - - 8265 42047962 NLA-AU-CRAWL-01-21-2014-20140407081424273-03408-7081~wbgrp-crawl004.us.archive.org~8443.warc.gz
au,gov,australia)/about-australia 20050622180623 http://australia.gov.au/about-australia text/html 200 1a1eb13d0f84d6f7980546cf1254e019 - - - 15680184 NLA-AU-CRAWL-000-20050622180402-06036-crawling016.archive.org
au,gov,australia)/about-australia 20060819094822 http://australia.gov.au/about-australia text/html 200 751c368557765d512bf9ec76ba513ff5 - - - 40191902 NLA-AU-CRAWL-001-20060819094631-00724-crawling01.us.archive.org
au,gov,australia)/about-australia 20060820210554 http://australia.gov.au/about-australia text/html 200 751c368557765d512bf9ec76ba513ff5 - - - 62545839 NLA-AU-CRAWL-001-20060820210248-02827-crawling01.us.archive.org
au,gov,australia)/about-australia 20070830152508 http://australia.gov.au/about-australia text/html 404 6ZY2SKU552PWOLOTNZF5OTDE4WUJISSN - - - 8534439 NLA-AU-CRAWL-002-20070830152458-02848-crawling015.us.archive.org.arc.gz
@anjackson

I thought the Wayback XML Query API used separate query parameters, i.e.

http://ia360911.us.archive.org:9090/wayback/xmlquery?type=urlquery&url={URL}&startdate={DATE}&enddate={DATE}

Does Wayback need any special configuration to use your CDX server as part of a remote collection?

Oh, hey, that works as well! http://www.webarchive.org.uk/wayback/archive/xmlquery.jsp?q=url:http://www.bl.uk/

So, the RemoteResourceIndex uses the q=url: form? Ah, so it looks like this is the OpenSearch API and that that is the required form. Excellent, thanks again.
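
For reference, the q= form can be put together like this. It is only a sketch: opensearch_query_url is a hypothetical helper, it copies the query shape used in the gist's curl example (type:urlquery plus url:), and it does no URL-encoding, so it suits only simple URLs like the ones in this thread.

```shell
# opensearch_query_url BASE TARGET_URL
# Build a q=-style query URL of the shape used in the gist's example.
# No escaping is applied to TARGET_URL.
opensearch_query_url() {
  printf '%s?q=type:urlquery+url:%s\n' "$1" "$2"
}

# Usage, assuming an index called myindex on localhost:8080:
#   curl -s "$(opensearch_query_url http://localhost:8080/myindex http://www.bl.uk/)"
```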

@ato
Author

ato commented Oct 15, 2015

Hahah, I didn't actually know that you could query Wayback with individual parameters like that. I'd only seen the opensearch form, as indeed that's what RemoteResourceIndex uses.

The Wayback configuration is just like the sample in RemoteCollection.xml. If you're using the IIPC OpenWayback fork, just make sure you're not using a version affected by this bug: iipc/openwayback#239

We use SimpleResourceStore and fetch WARCs over HTTP like in the sample xml config, but any of the other resourceStore types should also be fine as long as the filenames in your CDX records match up.

  <bean id="remotecollection" class="org.archive.wayback.webapp.WaybackCollection">

    <property name="resourceStore">
      <bean class="org.archive.wayback.resourcestore.SimpleResourceStore">
        <property name="prefix" value="http://localhost/warcs/" />
      </bean>
    </property>

    <property name="resourceIndex">
      <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
        <property name="searchUrlBase" value="http://localhost:8080/myindex" />
      </bean>
    </property>
  </bean>
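
Since the filenames in the CDX have to match what the resource store serves, it can help to list the URLs Wayback would end up fetching. A rough sketch: warc_urls is a hypothetical helper, it assumes the filename is the 11th space-separated column (as in the sample records), and it assumes the store URL is simply the prefix followed by the filename.

```shell
# warc_urls PREFIX CDX_FILE
# Print one URL per distinct filename in CDX_FILE, formed by
# prepending PREFIX to column 11.
warc_urls() {
  awk -v prefix="$1" '{ print prefix $11 }' "$2" | sort -u
}

# Usage, with the prefix from the config above:
#   for u in $(warc_urls http://localhost/warcs/ records.cdx); do
#     curl -sfI "$u" > /dev/null || echo "missing: $u"
#   done
```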

@ato
Author

ato commented Oct 15, 2015

Annoyingly, RocksDB seems to silently skip compression if it isn't built with Snappy, even if you explicitly set the compression algorithm option. I'm not sure there's a proper way to check. I first noticed because the file sizes were larger than I expected, and then I confirmed what it was doing by reading the raw .sst database files.

I don't have any uncompressed examples handy, but if compression is working and you run hexdump or strings on an .sst file, you'll see full URLs only at the start of each compression block (~8KB, but it varies). The records that follow will have only small fragments, because the algorithm reuses previous strings. An uncompressed index will spell out the full URLs in each record and be a lot more human-readable.

$ hexdump -C 048503.sst | head -n50
00000000  fb fd 03 98 00 28 8d 01  31 30 31 2e 30 2e 36 37  |.....(..101.0.67|
00000010  2e 32 33 31 2f 66 61 76  69 63 6f 6e 2e 69 63 6f  |.231/favicon.ico|
00000020  00 00 12 51 7d f8 88 33  01 00 00 05 02 20 01 1f  |...Q}..3..... ..|
00000030  68 74 74 70 3a 2f 2f 5e  31 00 f0 7b 94 03 09 74  |http://^1..{...t|
00000040  65 78 74 2f 68 74 6d 6c  dc 05 14 b8 70 7b 98 a6  |ext/html....p{..|
00000050  cb 47 a2 42 00 1a ff 8e  ce f2 f0 8d df fd 6e 41  |.G.B..........nA|
00000060  57 45 42 2d 32 30 31 34  31 32 31 35 30 39 30 35  |WEB-201412150905|
00000070  32 35 33 36 32 2d 30 30  31 33 32 2d 31 35 31 39  |25362-00132-1519|
00000080  7e 68 65 72 69 74 72 69  78 2e 6e 6c 61 2e 67 6f  |~heritrix.nla.go|
00000090  76 2e 61 75 7e 38 34 34  33 2e 77 61 72 63 2e 67  |v.au~8443.warc.g|
000000a0  7a f1 e6 a9 9f 01 01 2d  0d 1a 8c 01 72 6f 62 6f  |z......-....robo|
000000b0  74 73 2e 74 78 74 00 00  01 ab 08 87 7e 01 05 a9  |ts.txt......~...|
000000c0  0c 00 00 01 1e 4e ab 00  19 30 08 c8 01 0a 05 aa  |.....N...0......|
000000d0  6c 70 6c 61 69 6e 86 0f  14 73 58 31 a2 2e 8c 81  |lplain...sX1....|
000000e0  20 a5 89 3b cd 57 b3 b6  03 49 d6 41 ae ee ab 00  | ..;.W...I.A....|
000000f0  09 ab 80 cf ff e0 46 01  2d 0d 56 c9 01 73 69 74  |......F.-.V..sit|
00000100  65 73 2f 64 65 66 61 75  6c 74 2f 66 69 6c 65 73  |es/default/files|
00000110  2f 73 74 79 01 07 a8 6c  61 72 67 65 2f 70 75 62  |/sty...large/pub|
00000120  6c 69 63 2f 61 68 6c 2d  61 72 5f 31 31 2d 31 32  |lic/ahl-ar_11-12|
00000130  2e 6a 70 67 3f 69 74 6f  6b 3d 77 64 73 63 79 71  |.jpg?itok=wdscyq|
00000140  66 38 2d 91 00 28 15 e6  00 5a 4e e6 00 ee 6c 00  |f8-..(...ZN...l.|
00000150  05 6c a0 43 79 51 46 38  c8 01 0a 69 6d 61 67 65  |.l.CyQF8...image|
00000160  2f 6a 70 65 67 bf 52 14  5d 42 a1 c3 8e 8a 5e a4  |/jpeg.R.]B....^.|
00000170  39 ac 21 c6 de a9 7c 6f  f9 82 32 4f ee 22 01 29  |9.!...|o..2O.".)|
00000180  22 38 9a d4 ce 85 01 01  2d 3d 26 c8 01 32 2d 31  |"8......-=&..2-1|
00000190  33 1d f3 18 69 63 35 63  34 34 75 0d f3 00 1c fe  |3...ic5c44u.....|
000001a0  f3 00 36 f3 00 36 6c 00  1c 57 69 43 35 43 34 34  |..6..6l..WiC5C44|
000001b0  75 32 f3 00 58 8f 3e 14  ac a1 84 ee 08 c8 b4 1f  |u2..X.>.........|
000001c0  be 9f f1 ba bb 56 b3 a1  7a 67 78 b9 ee f3 00 09  |.....V..zgx.....|
000001d0  f3 5c fa ef ab 76 01 2d  39 3a d8 01 73 74 72 61  |.\...v.-9:..stra|
000001e0  74 65 67 69 63 2d 70 6c  61 6e 65 16 01 05 00 37  |tegic-plane....7|
000001f0  39 06 1c 79 6e 64 6a 62  6d 61 31 29 06 04 87 c2  |9..yndjbma1)....|
00000200  35 06 00 6a ee f9 01 10  41 48 4c 2d 53 15 7c 00  |5..j....AHL-S.|.|
00000210  50 5a 7c 00 1c 59 4e 44  4a 62 6d 41 31 32 16 01  |PZ|..YNDJbmA12..|
00000220  58 de 4a 14 bd d3 ac 51  2b 14 94 59 8e d3 5f bb  |X.J....Q+..Y.._.|
00000230  46 3f 15 c0 a4 79 f8 2e  ee 16 01 29 16 50 84 93  |F?...y.....).P..|
00000240  d6 52 01 2d 48 2d da 01  61 74 2d 61 2d 67 6c 61  |.R.-H-..at-a-gla|
00000250  6e 63 65 39 09 1c 36 6a  74 6b 7a 36 76 61 2d 09  |nce9..6jtkz6va-.|
00000260  00 dc 35 09 00 6c fe 09  01 3a 09 01 14 41 74 2d  |..5..l...:...At-|
00000270  41 2d 47 46 7e 00 00 4b  01 7e 32 0b 01 58 dd 72  |A-GF~..K.~2..X.r|
00000280  14 5d 08 05 36 41 53 c9  d6 83 8f c7 8f ba 13 9b  |.]..6AS.........|
00000290  ea 46 6c 18 e3 ee 0b 01  29 0b 7c a2 bd ac 61 01  |.Fl.....).|...a.|
000002a0  2d 36 38 d3 01 6e 6e 75  61 6c 2d 72 65 70 6f 72  |-68..nnual-repor|
000002b0  74 2d 74 68 75 6d 62 6e  61 69 6c 39 16 1c 6f 61  |t-thumbnail9..oa|
000002c0  6d 71 76 6f 71 62 29 16  04 88 10 35 16 00 65 f2  |mqvoqb)....5..e.|
000002d0  16 01 09 77 00 52 09 77  00 54 46 77 00 0c 4f 61  |...w.R.w.TFw..Oa|
000002e0  4d 51 01 77 32 0f 01 58  f1 3f 14 fb 68 0a 43 be  |MQ.w2..X.?..h.C.|
000002f0  02 22 18 32 98 e4 ed cb  2b 75 0c de a0 be 7a ee  |.".2....+u....z.|
00000300  0f 01 29 0f 4c a6 99 df  6b 01 2d 02 26 8d 01 33  |..).L...k.-.&..3|
00000310  2e 37 2e 31 36 35 2e 39  38 46 a6 06 04 bc 7f 15  |.7.165.98F......|
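
A crude way to turn that eyeballing into a number is to count how many printable chunks of an .sst file still contain "http://". This is a heuristic sketch, not a proper check: count_url_chunks is a hypothetical helper, tr is used as a rough stand-in for strings, and what counts as "too many" depends on how many records the file holds.

```shell
# count_url_chunks FILE
# Split FILE on non-printable bytes and count the chunks that
# contain "http://". Prints 0 when there are none.
count_url_chunks() {
  LC_ALL=C tr -c '[:print:]' '\n' < "$1" | grep -c 'http://' || true
}

# Usage (048503.sst is the file from the hexdump above):
#   count_url_chunks 048503.sst
```

If the count is close to the number of records in the file, the data is probably stored uncompressed; a compressed file should give a much smaller count.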

@anjackson

Okay, that helps a lot, thanks. I installed RocksDB with compression enabled, and used the Maven dependency that does not have a platform binary in it, and it seems to work. I only have 77 records in it, so the compressed size was actually larger than the uncompressed, but it's a lot less readable!

FWIW, here's the docker build:

https://github.com/anjackson/wauldock/tree/master/tinycdxserver

and the minor changes I made to the tinycdxserver itself are in this repo:

https://github.com/anjackson/tinycdxserver
