Andy Jackson anjackson

## diskdefs
diskdef ibm-3740
  seclen 128
  tracks 77
  sectrk 26
  blocksize 1024
  maxdir 64
  skew 6
  boottrk 2
  os p2dos
end
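
As a rough sanity check on this definition, the geometry can be turned into a capacity figure. This is only a sketch: it assumes the usual cpmtools reading of each field (`boottrk` tracks are reserved, the directory lives in the first allocation blocks) and the standard 32-byte CP/M directory entry. The function name `disk_capacity` is made up for illustration.

```python
# Geometry copied from the ibm-3740 diskdef above.
SECLEN = 128       # bytes per sector
TRACKS = 77        # tracks on the disk
SECTRK = 26        # sectors per track
BLOCKSIZE = 1024   # bytes per allocation block
MAXDIR = 64        # number of directory entries
BOOTTRK = 2        # reserved boot tracks

DIR_ENTRY_BYTES = 32  # assumption: standard CP/M directory entry size


def disk_capacity():
    """Return raw and usable sizes implied by the diskdef (hypothetical helper)."""
    raw_bytes = TRACKS * SECTRK * SECLEN
    usable_bytes = (TRACKS - BOOTTRK) * SECTRK * SECLEN
    total_blocks = usable_bytes // BLOCKSIZE
    # Round the directory size up to whole allocation blocks.
    dir_blocks = (MAXDIR * DIR_ENTRY_BYTES + BLOCKSIZE - 1) // BLOCKSIZE
    return {
        "raw_bytes": raw_bytes,
        "usable_bytes": usable_bytes,
        "total_blocks": total_blocks,
        "dir_blocks": dir_blocks,
        "data_blocks": total_blocks - dir_blocks,
    }


if __name__ == "__main__":
    for name, value in disk_capacity().items():
        print(name, value)
```

Under those assumptions the disk holds 256,256 raw bytes, of which 241 blocks of 1024 bytes are left for file data.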

## gist:6206247

      
Asparagirl / gist:6206247
Last active February 14, 2024 19:56

Have a WARC that you would like to upload to the Internet Archive so that it can eventually be included in their Wayback Machine? Here's how to upload it from the command line.

Do you have a WARC file of a website all downloaded and ready to be added to the Internet Archive?  Great!  You can do that with the Internet Archive's web-based uploader, but it's not ideal and it can't handle really big uploads.  Here's how to upload your WARC files to the IA from the command line, without worrying about a size restriction.
First, get your Access Key and Secret Key for the Internet Archive's S3-like API.  You can find them for your IA account here: http://archive.org/account/s3.php  Don't share them with other people!
Here's their documentation on how to use the API, if you need some extra help: http://archive.org/help/abouts3.txt
Next, copy the following lines into a text file and edit them as needed:
export IA_S3_ACCESS_KEY="YOUR-ACCESS-KEY-FROM-THE-IA-GOES-HERE"
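
For readers who would rather script the upload, here is a minimal Python sketch of how the request for the S3-like API might be put together. It is not taken from the gist. The endpoint `s3.us.archive.org`, the `LOW access:secret` authorization scheme, and the `x-amz-auto-make-bucket` and `x-archive-meta-mediatype` headers are what I understand the IA documentation to describe, so check them against abouts3.txt before relying on them. The function name and the item identifier are hypothetical.

```python
import os
from urllib.parse import quote

# Assumption: the IA S3-like endpoint as described in the IA documentation.
IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(identifier, warc_path, access_key, secret_key):
    """Return (url, headers) for a PUT that uploads one WARC into an item.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    filename = os.path.basename(warc_path)
    url = "{}/{}/{}".format(IA_S3_ENDPOINT, quote(identifier), quote(filename))
    headers = {
        # The keys from your IA account page; never commit these anywhere.
        "authorization": "LOW {}:{}".format(access_key, secret_key),
        # Ask the API to create the item if it does not exist yet.
        "x-amz-auto-make-bucket": "1",
        # Assumption: WARC items are usually given the "web" mediatype.
        "x-archive-meta-mediatype": "web",
    }
    return url, headers
```

The returned URL and headers can then be handed to whatever HTTP client you prefer, sending the WARC file's bytes as the body of a `PUT` request.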

  
## gist:7069028
~ virtualenv env
~ source env/bin/activate
~ pip install git+https://github.com/nlevitt/warctools@tweaks
~ pip install pyOpenSSL
~ git clone https://github.com/nlevitt/warcprox
~ cd warcprox
~ python warcprox.py --rollover-idle-time=7200
2013-10-20 14:36:07,923 66818 MainThread INFO server_activate(warcprox.py:346) listening on 127.0.0.1:8080
2013-10-20 14:36:07,924 66818 MainThread INFO _read_ca(warcprox.py:75) read CA key+cert from ./warcprox-ca.pem
2013-10-20 14:36:07,928 66818 WarcWriterThread INFO run(warcprox.py:510) WarcWriterThread starting, directory=/private/tmp/warcprox/warcs gzip=False rollover_size=1000000000 rollover_idle_time=7200 prefix=WARCPROX port=8080
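
Once warcprox is listening, anything sent through 127.0.0.1:8080 should end up in the WARC directory shown in the log. A minimal sketch of pointing a Python client at the proxy, using only the standard library, might look like this. The helper name `build_warcprox_opener` is made up for illustration, and HTTPS capture would additionally require the client to trust the CA in `./warcprox-ca.pem`, which this sketch does not set up.

```python
import urllib.request


def build_warcprox_opener(proxy="127.0.0.1:8080"):
    """Return an opener that routes HTTP and HTTPS requests via the proxy.

    Hypothetical helper; the default address matches the log output above.
    """
    proxy_url = "http://" + proxy
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    opener = build_warcprox_opener()
    # Requires warcprox to be running locally; the response is recorded
    # by the proxy on its way back to this client.
    with opener.open("http://example.com/") as response:
        print(response.status, len(response.read()))
```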

## README.md

      
ato / README.md
Last active September 29, 2016 20:24

tinycdxserver example

I just tried my example from the tinycdxserver README and realised that curl is messing up the line endings because of a conversion it does by default. I haven't checked yet exactly what curl is doing, but tinycdxserver interprets the upload as if all the lines in the file had been concatenated together (you can see that by running tinycdxserver in verbose mode with the -v option).
Using curl's --data-binary option instead of --data fixes that, and I've updated the README accordingly.
That could be what's tripping you up. Here's a more complete example that I just tested. If it worked properly, you should get an "Added N records" response back, where N is the number of lines in the CDX file.
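
To make the line-ending problem concrete, here is a small Python sketch. As far as I know, curl's manual says that `--data` strips carriage returns and newlines when it reads from a file, which would explain the concatenated lines; `simulate_data_option` mimics that under this assumption. `build_cdx_post` shows the alternative of sending the bytes untouched. Both function names, the sample records, and the index URL are hypothetical.

```python
import urllib.request


def simulate_data_option(cdx_bytes):
    """Mimic the assumed curl --data behaviour: drop all CR and LF bytes."""
    return cdx_bytes.replace(b"\r", b"").replace(b"\n", b"")


def build_cdx_post(index_url, cdx_bytes):
    """Build a POST request that carries the CDX bytes unchanged."""
    return urllib.request.Request(index_url, data=cdx_bytes, method="POST")


if __name__ == "__main__":
    # Two made-up CDX lines, one record per line.
    sample = b"com,example)/ 20160101000000 ...\ncom,example)/a 20160102000000 ...\n"

    # With newlines stripped, the server sees a single run-on line.
    print(simulate_data_option(sample))

    # Hypothetical index URL; adjust to wherever your server is listening.
    request = build_cdx_post("http://localhost:8080/myindex", sample)
    print(request.get_method(), request.full_url, sample.count(b"\n"), "lines")
    # urllib.request.urlopen(request) would send it to a running server.
```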

  
## md5_multipart_upload.py
#!/usr/bin/python
import argparse
import hashlib
import sys

def md5(f, count):
    hash_md5 = hashlib.md5()
    eof = False
    for i in range(count * 16):
        chunk = f.read(65536)
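
The preview above is cut off, so the rest of the script is not shown here. Going only by the file name, one plausible use is reproducing the ETag that S3-style services report for multipart uploads: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The sketch below is an assumption about that technique, not the author's code. It keeps the same reading pattern (64 KiB chunks, `count * 16` of them per part, so `count` is the part size in MiB), and the names `md5_of_part` and `multipart_etag` are made up.

```python
import hashlib
import io

CHUNK_BYTES = 65536  # same chunk size as in the snippet above


def md5_of_part(f, part_mib):
    """Return the raw MD5 digest of the next part, or None at end of file."""
    part_md5 = hashlib.md5()
    read_any = False
    for _ in range(part_mib * 16):  # 16 chunks of 64 KiB make 1 MiB
        chunk = f.read(CHUNK_BYTES)
        if not chunk:
            break
        part_md5.update(chunk)
        read_any = True
    return part_md5.digest() if read_any else None


def multipart_etag(f, part_mib):
    """Compute an S3-style multipart ETag for a binary file object."""
    digests = []
    while True:
        digest = md5_of_part(f, part_mib)
        if digest is None:
            break
        digests.append(digest)
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return "{}-{}".format(combined, len(digests))


if __name__ == "__main__":
    # In-memory stand-in for a file: one full 1 MiB part plus a short tail.
    data = io.BytesIO(b"x" * (1024 * 1024 + 5))
    print(multipart_etag(data, 1))
```

The part size has to match the one used for the actual upload, otherwise the computed value will not agree with what the service reports.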