@ato
Last active September 29, 2016 20:24
tinycdxserver example

I just tried my example from the tinycdxserver README and realised that curl mangles the line endings because of a conversion it does by default. I haven't checked exactly what curl is doing, but tinycdxserver interprets the request as if all the lines in the file had been concatenated together. You can see this by running tinycdxserver in verbose mode with the -v option.

Using curl's --data-binary option instead of --data fixes that, and I've updated the README accordingly.
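
For reference, curl's documentation for --data notes that carriage returns and newlines are stripped when the data is read from a file with @, which would produce the concatenation described above. Here is a minimal sketch that mimics that stripping with tr, so no curl or server is involved. The three-line sample file and its contents are made up for illustration.

```shell
# Build a tiny three-line sample file (contents are illustrative only).
sample=$(mktemp)
printf 'line one\nline two\nline three\n' > "$sample"

# Roughly what `curl --data @file` would send: CR and LF bytes removed.
stripped=$(tr -d '\r\n' < "$sample")

# `curl --data-binary @file` sends the bytes untouched, so the line count survives.
original_lines=$(wc -l < "$sample" | tr -d ' ')
stripped_lines=$(printf '%s' "$stripped" | wc -l | tr -d ' ')

echo "original: $original_lines lines"
echo "stripped: $stripped_lines newlines left -> '$stripped'"
rm -f "$sample"
```

With the newlines gone, a server that splits its input on line endings has only one long "record" to parse.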

That could be what's tripping you up. Here's a more complete example that I just tested. If it worked properly, you should get an "Added N records" response back, where N is the number of lines in the CDX file.

About the example CDX records below

The records.cdx file below has a blank ("-") first column. tinycdxserver ignores that column and does its own canonicalisation, so our usual indexing process doesn't even bother filling it in. You can use standard CDX files as well: I've included a second file, records2.cdx, with SURT-style URLs that was generated using IA tools, just to demonstrate that.
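
To make the column layout concrete, here is a small sketch that pulls individual columns out of one record from records.cdx with awk. The labels in the echo lines are my own, chosen to line up with the elements in the wayback XML response shown in the walkthrough. They are an assumption, not names defined by tinycdxserver.

```shell
# One record copied from records.cdx; column 1 is the blank ("-") URL key.
record='- 20150914222035 http://www.minister.infrastructure.gov.au/ text/html 301 ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN http://minister.infrastructure.gov.au/ - 389 152443 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz'

# Print a single space-separated column by its 1-based position.
cdx_field() {
  printf '%s\n' "$record" | awk -v n="$1" '{ print $n }'
}

echo "urlkey:       $(cdx_field 1)"
echo "capture date: $(cdx_field 2)"
echo "original url: $(cdx_field 3)"
echo "mime type:    $(cdx_field 4)"
echo "status:       $(cdx_field 5)"
echo "digest:       $(cdx_field 6)"
echo "redirect:     $(cdx_field 7)"
echo "offset:       $(cdx_field 10)"
echo "file:         $(cdx_field 11)"
```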

Usage walkthrough

Compile tinycdxserver:

$ git clone git@github.com:nla/tinycdxserver.git
$ cd tinycdxserver
$ mvn package

Start tinycdxserver:

$ mkdir /tmp/data
$ java -jar target/tinycdxserver-0.1-SNAPSHOT.jar -d /tmp/data

Grab an example CDX:

$ curl -LO https://gist.github.com/ato/b2ad8e65b35afe690921/raw/4e663c44c74c585ac0d5226780465d2281177958/records.cdx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1203  100  1203    0     0   1297      0 --:--:-- --:--:-- --:--:--  1297

Load it:

$ curl -XPOST --data-binary @records.cdx http://localhost:8080/myindex
Added 6 records
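
If you want to check the count automatically, something like the following sketch might do. The helper name check_added is hypothetical, and it assumes the response has exactly the "Added N records" shape shown above and that the CDX file ends with a trailing newline (wc -l counts newline characters).

```shell
# Compare the server's "Added N records" reply with the file's line count.
# $1 = response text, $2 = path to the CDX file that was posted.
check_added() {
  expected=$(wc -l < "$2" | tr -d ' ')
  added=$(printf '%s\n' "$1" | sed -n 's/^Added \([0-9][0-9]*\) records.*/\1/p')
  # Succeed only if a count was found and it matches the line count.
  [ -n "$added" ] && [ "$added" -eq "$expected" ]
}

# Example use against a running server (same URL as the load step):
#   response=$(curl -s -XPOST --data-binary @records.cdx http://localhost:8080/myindex)
#   check_added "$response" records.cdx && echo "all records loaded"
```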

Get a record back:

$ curl -s 'http://localhost:8080/myindex?url=http://minister.infrastructure.gov.au/'
au,gov,infrastructure,minister)/ 20150914222035 http://www.minister.infrastructure.gov.au/ text/html 301 ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN 389
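
Note that the query was for minister.infrastructure.gov.au but the record that came back was captured from www.minister.infrastructure.gov.au: both canonicalise to the same key. The sketch below is a much simplified illustration of that idea (lower-case the URL, drop the scheme and a leading "www.", reverse the host labels). The surt_key function is hypothetical and is not tinycdxserver's actual canonicaliser, which handles many more cases.

```shell
# Simplified SURT-style key, for illustration only.
surt_key() {
  url=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  rest=${url#*://}              # drop the scheme
  host=${rest%%/*}              # everything before the first slash
  case "$rest" in
    */*) path=/${rest#*/} ;;    # keep the path if there is one
    *)   path=/ ;;
  esac
  host=${host#www.}             # treat www.example and example alike
  # Reverse the dot-separated host labels and join them with commas.
  reversed=$(printf '%s\n' "$host" |
    awk -F. '{ out = $NF; for (i = NF - 1; i >= 1; i--) out = out "," $i; print out }')
  printf '%s)%s\n' "$reversed" "$path"
}

surt_key 'http://minister.infrastructure.gov.au/'
surt_key 'http://www.minister.infrastructure.gov.au/'
```

Both calls print the same key, which is why either form of the URL finds the record.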

Query using wayback's XML protocol:

$ curl -s 'http://localhost:8080/myindex?q=type:urlquery+url:http://minister.infrastructure.gov.au/' | xml_pp
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <enddate>20151015072406</enddate>
    <type>urlquery</type>
    <firstreturned>0</firstreturned>
    <url>au,gov,infrastructure,minister)/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>152443</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz</file>
      <redirecturl>http://minister.infrastructure.gov.au/</redirecturl>
      <urlkey>au,gov,infrastructure,minister)/</urlkey>
      <digest>ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN</digest>
      <httpresponsecode>301</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.minister.infrastructure.gov.au/</url>
      <capturedate>20150914222035</capturedate>
    </result>
  </results>
</wayback>
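
If you only need one or two values out of that response, a rough text-based extraction can be enough. This is a sketch, not a real XML parser: the xml_field helper is hypothetical, it only handles simple elements with no attributes or nested tags, and it relies on grep's -o option (available in GNU, BSD and busybox grep). The sample string is a fragment of the response above squeezed onto one line.

```shell
# Print the text content of every <name>...</name> element read from stdin.
xml_field() {
  grep -o "<$1>[^<]*</$1>" | sed 's/<[^>]*>//g'
}

sample='<result><urlkey>au,gov,infrastructure,minister)/</urlkey><httpresponsecode>301</httpresponsecode><capturedate>20150914222035</capturedate></result>'

printf '%s\n' "$sample" | xml_field capturedate
printf '%s\n' "$sample" | xml_field httpresponsecode
```

Against a live server you would pipe the curl output into xml_field instead of the sample string.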

records.cdx

- 20150914222034 http://www.financeminister.gov.au/ text/html 200 ZMSA5TNJUKKRYAIM5PRUJLL24DV7QYOO - - 83848 117273 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222035 http://strongersuper.treasury.gov.au/ text/html 302 TDYO3KQ3O2PR5EJJDNQ7NBNHWU44WR3D http://strongersuper.treasury.gov.au/content/Content.aspx?doc=home.htm - 442 138671 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222035 http://www.mhs.gov.au/ text/html 200 LLSUKKXWSWIPCKTKRKFQY4VRTORHRKZT - - 9777 140712 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222034 http://jbh.ministers.treasury.gov.au/ text/html 200 NS2AUHSI3HD2Y5VHYIQEYOX3Y3BSFQLG - - 19119 145121 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222035 http://www.minister.infrastructure.gov.au/ text/html 301 ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN http://minister.infrastructure.gov.au/ - 389 152443 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz
- 20150914222034 http://bfb.ministers.treasury.gov.au/ text/html 200 WXEF6JLTZCZITLEP3VDFQ4MCB3ZS5EYS - - 19112 153934 WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz

records2.cdx

au,gov,australia)/about 20070831172339 http://australia.gov.au/about text/html 200 ZUEQ3STH3JAEABZG22LQI626TTY7DN2A - - - 14369759 NLA-AU-CRAWL-002-20070831172246-04117-crawling015.us.archive.org.arc.gz
au,gov,australia)/about 20080719174427 http://www.australia.gov.au/About text/html 200 CGSTTFZGMVAHEOMHQTGTUZUG46MLBFL6 - - - 62867360 NLA-AU-CRAWL-003-20080719174211-01545-crawling104.us.archive.org.arc.gz
au,gov,australia)/about 20090916104859 http://www.australia.gov.au/about text/html 200 7VXWF4Y6TXFWR7JZPORIEHUD5ORMHBMY - - - 59828846 NLA-AU-CRAWL-004-20090916104520-09084-crawling106.us.archive.org.arc.gz
au,gov,australia)/about 20091112023446 http://australia.gov.au/about text/html 200 7VXWF4Y6TXFWR7JZPORIEHUD5ORMHBMY - - - 70365777 NLA-AU-CRAWL-004-PATCH-20091112023201-00275-crawling108.us.archive.org.arc.gz
au,gov,australia)/about 20110216141839 http://www.australia.gov.au/about - 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 765762042 NLA-AU-CRAWL-005-20110216141406-00155-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110217132700 http://australia.gov.au/about text/html 200 3JQ6HKH4HXEI4G335KENZBQNCFHF7PP4 - - - 477571492 NLA-AU-CRAWL-005-20110217132352-00349-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110226123639 http://australia.gov.au/about text/html 200 DRP6CY44HXCJP4TNTMNOKE6AF3ZANGVU - - - 343881716 NLA-AU-CRAWL-005-20110226123146-00037-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110226133347 https://australia.gov.au/about text/html 200 WBAG4MI6N5QCQ2LFLKSA3OQ6RZUMPTMO - - - 593342593 NLA-AU-CRAWL-005-20110226132237-00040-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110328204616 http://australia.gov.au/about text/html 200 Z34GAL7DQINDJUXUS4CGPEL4YK4FRIOH - - - 656072390 NLA-AU-CRAWL-005-20110328201652-00001-crawling218.us.archive.org.warc.gz
au,gov,australia)/about 20110422083017 http://australia.gov.au/about text/html 200 BPPC5KI3E44TVMKFA66ZFUUT46KP7SAV - - - 513717895 NLA-AU-CRAWL-005-20110422082730-00024-crawling213.us.archive.org.warc.gz
au,gov,australia)/about 20120321062048 http://australia.gov.au/about text/html 200 RWIUXTTE64RHNEWQCCL7UEDIGZJNPLVJ - - - 198474221 NLA-AU-CRAWL-006-20120321061732570-00098-3266~web-crawl001.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130409003017 http://australia.gov.au/about text/html 200 IB3AMRZJMPFIATC6WQHPH4LVUUACXAW7 - - - 718271905 NLA-AU-CRAWL-04-03-2013-20130409002124240-00009-27793~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130409234435 https://www.australia.gov.au/about application/http 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 855531863 NLA-AU-CRAWL-04-03-2013-20130409232851898-00397-27793~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130410112908 https://australia.gov.au/about text/html 200 PNFEFWUCNGAVTARXTS5LLSOMATLRFRG3 - - - 11352123 NLA-AU-CRAWL-04-03-2013-20130410112854462-00572-27793~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130421094357 https://www.australia.gov.au/about warc/revisit - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 218425876 NLA-AU-CRAWL-04-03-2013-20130421093701746-01421-29417~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130421122342 https://australia.gov.au/about text/html 200 PICUVAYGMZY5IOXPWHKLE6BVYFACC7LG - - - 353874381 NLA-AU-CRAWL-04-03-2013-20130421121108172-01443-29417~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130427095452 https://www.australia.gov.au/about warc/revisit - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 903094466 NLA-AU-CRAWL-04-03-2013-20130427085524926-01687-433~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130427132450 https://australia.gov.au/about text/html 200 PNX76BM2Z5WK4H66M4LGLE25AXLC5SZ5 - - - 286205731 NLA-AU-CRAWL-04-03-2013-20130427131330785-01699-433~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130502072522 http://australia.gov.au/about text/html 200 NUUDONRPIPF3FBGZL2UZRVUKIJPL6G2F - - - 659173849 NLA-AU-CRAWL-04-03-2013-20130502071652498-00011-26913~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130503074233 https://www.australia.gov.au/about application/http 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 68418494 NLA-AU-CRAWL-04-03-2013-20130503074145197-00410-26913~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130503082356 https://australia.gov.au/about text/html 200 DLJ7MHJX7XSBHL3DSK4AJTP47YJQ4XD6 - - - 169734931 NLA-AU-CRAWL-04-03-2013-20130503082227277-00426-26913~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130509171208 https://www.australia.gov.au/about application/http 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - - 469007252 NLA-AU-CRAWL-04-03-2013-20130509170233360-01422-2357~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20130509202803 https://australia.gov.au/about text/html 200 NW2L4S35DERPWVH63TGZRN67PILW5BSK - - - 76130469 NLA-AU-CRAWL-04-03-2013-20130509202651491-01459-2357~wbgrp-crawl008.us.archive.org~8443.warc.gz
au,gov,australia)/about 20140114001125 http://australia.gov.au/about text/html 200 N75NS3Y3B44BJE22SDICYCBKE5YQKHUI - - 8295 476816575 NLA-AU-TEST-01-10-2014-20140114000228973-00012-24613~wbgrp-crawl003.us.archive.org~8443.warc.gz
au,gov,australia)/about 20140126200707 http://australia.gov.au/about text/html 200 V3PBNQEH6EPI5ARS6HOTI4GA7MA37ZG6 - - 8316 40437948 NLA-AU-CRAWL-01-21-2014-20140126200633086-00572-25807~wbgrp-crawl004.us.archive.org~8443.warc.gz
au,gov,australia)/about 20140407081604 http://www.australia.gov.au/about text/html 200 AKSSUZJLOW3BJF546AAVCYPSJH3PSCRF - - 8265 42047962 NLA-AU-CRAWL-01-21-2014-20140407081424273-03408-7081~wbgrp-crawl004.us.archive.org~8443.warc.gz
au,gov,australia)/about-australia 20050622180623 http://australia.gov.au/about-australia text/html 200 1a1eb13d0f84d6f7980546cf1254e019 - - - 15680184 NLA-AU-CRAWL-000-20050622180402-06036-crawling016.archive.org
au,gov,australia)/about-australia 20060819094822 http://australia.gov.au/about-australia text/html 200 751c368557765d512bf9ec76ba513ff5 - - - 40191902 NLA-AU-CRAWL-001-20060819094631-00724-crawling01.us.archive.org
au,gov,australia)/about-australia 20060820210554 http://australia.gov.au/about-australia text/html 200 751c368557765d512bf9ec76ba513ff5 - - - 62545839 NLA-AU-CRAWL-001-20060820210248-02827-crawling01.us.archive.org
au,gov,australia)/about-australia 20070830152508 http://australia.gov.au/about-australia text/html 404 6ZY2SKU552PWOLOTNZF5OTDE4WUJISSN - - - 8534439 NLA-AU-CRAWL-002-20070830152458-02848-crawling015.us.archive.org.arc.gz
@anjackson

Okay, that helps a lot, thanks. I installed RocksDB with compression enabled, and used the Maven dependency that does not have a platform binary in it, and it seems to work. I only have 77 records in it, so the compressed size was actually larger than the uncompressed size, but it's a lot less readable!

FWIW, here's the docker build:

https://github.com/anjackson/wauldock/tree/master/tinycdxserver

and the minor changes I made to the tinycdxserver itself are in this repo:

https://github.com/anjackson/tinycdxserver
