I just tried my example from the tinycdxserver README and realised that curl was mangling the line endings: with --data, curl strips carriage returns and newlines from an @file argument before posting it, so tinycdxserver sees all the lines in the file concatenated together (you can see that by running tinycdxserver in verbose mode with the -v option).
Using curl's --data-binary option instead of --data fixes that, and I've updated the README accordingly.
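For the curious, the difference is easy to simulate: with --data and an @file argument, curl strips carriage returns and newlines before posting, while --data-binary sends the file bytes untouched. A small Python sketch of the effect (the CDX content here is made up):

```python
# Two fake CDX records, one per line (illustrative content only).
cdx = (b"- 20150914222035 http://example.org/a text/html 200 DIGEST 100\n"
       b"- 20150914222036 http://example.org/b text/html 200 DIGEST 200\n")

# What curl posts with --data @file: CR and LF are stripped.
as_data = cdx.replace(b"\r", b"").replace(b"\n", b"")
# What curl posts with --data-binary @file: the raw bytes.
as_data_binary = cdx

print(as_data.count(b"\n"))         # 0 - the records run together
print(as_data_binary.count(b"\n"))  # 2 - one record per line preserved
```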
That could be what's tripping you up. Here's a more complete example that I just tested. If it worked properly you should get an "Added N records" response, where N is the number of lines in the CDX file.
records.cdx below has a blank ("-") first column because tinycdxserver ignores that field and does its own canonicalisation, so our usual indexing process doesn't bother filling it in. You can use standard CDX files as well; to demonstrate that, I've included a second file, records2.cdx, with SURT-style URLs generated using IA tools.
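For anyone unfamiliar with the SURT form those files use, here's a rough Python sketch of the idea. The real canonicaliser in tinycdxserver handles far more cases (ports, query strings, etc.); this only illustrates the host reversal and www-stripping:

```python
from urllib.parse import urlsplit

def surt(url):
    """Rough sketch of SURT-style canonicalisation: lowercase, strip a
    leading www, and reverse the host into comma-separated form."""
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")

print(surt("http://www.minister.infrastructure.gov.au/"))
# au,gov,infrastructure,minister)/
```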
Compile tinycdxserver:
$ git clone git@github.com:nla/tinycdxserver.git
$ cd tinycdxserver
$ mvn package
Start tinycdxserver:
$ mkdir /tmp/data
$ java -jar target/tinycdxserver-0.1-SNAPSHOT.jar -d /tmp/data
Grab an example CDX:
$ curl -LO https://gist.github.com/ato/b2ad8e65b35afe690921/raw/4e663c44c74c585ac0d5226780465d2281177958/records.cdx
Load it:
$ curl -XPOST --data-binary @records.cdx http://localhost:8080/myindex
Added 6 records
Get a record back:
$ curl -s http://localhost:8080/myindex?url=http://minister.infrastructure.gov.au/
au,gov,infrastructure,minister)/ 20150914222035 http://www.minister.infrastructure.gov.au/ text/html 301 ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN 389
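Each line of that response is a plain space-separated CDX record, so it's easy to pull apart in a script. A quick sketch (the field order matches the output above; the trailing number and any further fields vary by index, so I'm deliberately not labelling them):

```python
line = ("au,gov,infrastructure,minister)/ 20150914222035 "
        "http://www.minister.infrastructure.gov.au/ text/html 301 "
        "ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN 389")

# Split off the six leading fields; whatever trails is index-dependent.
urlkey, timestamp, original, mimetype, status, digest, rest = line.split(" ", 6)

print(original)   # http://www.minister.infrastructure.gov.au/
print(timestamp)  # 20150914222035
```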
Query using wayback's xml protocol:
$ curl -s http://localhost:8080/myindex?q=type:urlquery+url:http://minister.infrastructure.gov.au/ | xml_pp
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <enddate>20151015072406</enddate>
    <type>urlquery</type>
    <firstreturned>0</firstreturned>
    <url>au,gov,infrastructure,minister)/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>152443</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz</file>
      <redirecturl>http://minister.infrastructure.gov.au/</redirecturl>
      <urlkey>au,gov,infrastructure,minister)/</urlkey>
      <digest>ZH3ZBTFT5T6VC4BHO3MC6MLFECBEKDYN</digest>
      <httpresponsecode>301</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.minister.infrastructure.gov.au/</url>
      <capturedate>20150914222035</capturedate>
    </result>
  </results>
</wayback>
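Wayback consumes that response directly, but if you want to script against it yourself it's plain XML, e.g. with Python's standard ElementTree (the document here is an abridged copy of the response above):

```python
import xml.etree.ElementTree as ET

# Abridged copy of the urlquery response shown above.
response = """<wayback>
  <request><type>urlquery</type></request>
  <results>
    <result>
      <urlkey>au,gov,infrastructure,minister)/</urlkey>
      <capturedate>20150914222035</capturedate>
      <file>WEB-20150914222031256-00000-43190~heritrix.nla.gov.au~8443.warc.gz</file>
      <compressedoffset>152443</compressedoffset>
      <httpresponsecode>301</httpresponsecode>
    </result>
  </results>
</wayback>"""

root = ET.fromstring(response)
for result in root.iter("result"):
    # The WARC file and offset are what a replay tool needs to fetch the record.
    print(result.findtext("file"), result.findtext("compressedoffset"))
```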
Annoyingly, RocksDB seems to silently skip compression if it wasn't built with snappy, even if you explicitly set the compression algorithm option. I'm not sure if there's a proper way to check. The way I noticed the first time was that the file sizes were larger than I was expecting, and I confirmed what it was doing by reading the raw .sst database files.
I don't have any uncompressed examples handy, but if compression is working, running hexdump or strings on an .sst file will only show full URLs at the start of each compression block (~8KB, though it varies); the records that follow contain only small fragments, because the algorithm reuses earlier strings. An uncompressed index spells out the full URL in every record and is a lot more human-readable.
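That strings-style check can be scripted too. This is just a heuristic sketch on made-up bytes (not real RocksDB .sst data): extract runs of printable ASCII, as strings(1) does, and see whether full URLs keep appearing:

```python
import re

def printable_runs(data, minlen=4):
    """Rough equivalent of strings(1): return runs of printable ASCII
    at least minlen bytes long."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % minlen, data)

def looks_uncompressed(data):
    """Heuristic: if full http:// URLs appear more than once in the raw
    bytes, the data probably wasn't compressed."""
    urls = [r for r in printable_runs(data) if r.startswith(b"http://")]
    return len(urls) > 1

# Synthetic stand-ins for .sst contents (not the real file format):
uncompressed = b"\x00http://example.org/a\x01\x00http://example.org/b\x01"
compressed_ish = b"\x00http://example.org/a\x01\x05\x02b\x01"  # shared prefix reused

print(looks_uncompressed(uncompressed))    # True
print(looks_uncompressed(compressed_ish))  # False
```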