It is useful to learn how to work with large 7z archives, with the network, and with large 7z archives available over the network. Let's use an 8 GB dataset from Kaggle as the example…
http://vk.cc/50eTT0
Let's see what is behind that link.
$ host vk.cc
vk.cc has address 95.213.4.232
vk.cc has address 95.213.4.233
vk.cc has address 95.213.4.230
$ telnet 95.213.4.230 80
Trying 95.213.4.230...
Connected to 95.213.4.230.
Escape character is '^]'.
GET /50eTT0 HTTP/1.0
Host: vk.cc
HTTP/1.1 302 Found
Server: Apache
Date: Thu, 07 Apr 2016 15:06:44 GMT
Content-Type: text/html; charset=windows-1251
Content-Length: 0
Connection: close
X-Powered-By: PHP/3.22852
Pragma: no-cache
Cache-control: no-store
Location: https://www.kaggle.com/reddit/reddit-comments-may-2015/downloads/reddit-comments-may-2015.7z
Connection closed by foreign host.
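The same check can be scripted. Below is a minimal Python 3 sketch, not part of the original session: `parse_response_head` is a hypothetical helper that splits a raw HTTP response into a status code and headers, so the `Location` target can be read without anything following the redirect. The sample bytes are a shortened copy of the response above.

```python
def parse_response_head(raw: bytes):
    """Split a raw HTTP response into (status_code, headers dict).

    Header names are lower-cased; a repeated header keeps its last value.
    """
    # Headers end at the first blank line; accept both CRLF and bare LF.
    text = raw.decode("iso-8859-1").replace("\r\n", "\n")
    head = text.split("\n\n", 1)[0]
    lines = head.split("\n")
    # The status line looks like: HTTP/1.1 302 Found
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status_code, headers


# Response head copied from the telnet session above (shortened).
SAMPLE = (
    b"HTTP/1.1 302 Found\r\n"
    b"Server: Apache\r\n"
    b"Content-Length: 0\r\n"
    b"Location: https://www.kaggle.com/reddit/reddit-comments-may-2015"
    b"/downloads/reddit-comments-may-2015.7z\r\n"
    b"\r\n"
)

if __name__ == "__main__":
    status, headers = parse_response_head(SAMPLE)
    print(status, headers["location"])
```

The helper works on bytes that were already received, so it can be fed the output of `nc` or `ncat` saved to a file.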
Now let's learn how to reach Kaggle.
$ host kaggle.com
kaggle.com has address 168.62.224.13
kaggle.com mail is handled by 10 aspmx.l.google.com.
kaggle.com mail is handled by 20 alt1.aspmx.l.google.com.
kaggle.com mail is handled by 30 alt2.aspmx.l.google.com.
kaggle.com mail is handled by 40 aspmx2.googlemail.com.
kaggle.com mail is handled by 50 aspmx3.googlemail.com.
$ nc 168.62.224.13 80
GET / HTTP/1.0
Host: kaggle.com
HTTP/1.1 301 Moved Permanently
Content-Length: 146
Content-Type: text/html; charset=UTF-8
Location: https://www.kaggle.com/
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=ec8c6570ebf1aeb294a1d1705194c0e86b485396a7a974e599f249a69392380a;Path=/;Domain=kaggle.com
Date: Thu, 07 Apr 2016 15:15:03 GMT
Connection: close
<head><title>Document Moved</title></head>
<body><h1>Object Moved</h1>This document may be found <a HREF="https://www.kaggle.com/">here</a></body>
We can see that 168.62.224.13 only answers with a redirect (mail for the domain is handled by Google's servers, per the MX records), and the main content lives at www.kaggle.com.
$ host www.kaggle.com
www.kaggle.com has address 168.62.224.124
$ nc 168.62.224.124 80 | head
GET / HTTP/1.0
Host: www.kaggle.com
HTTP/1.1 301 Moved Permanently
Content-Length: 146
Content-Type: text/html; charset=UTF-8
Location: https://www.kaggle.com/
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=5e6e1186c4f5100992840941bc3d52d8fb7eb0ebf5703f0e1f03aadd68005cbd;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 15:23:53 GMT
Connection: close
<head><title>Document Moved</title></head>
It wants a secure connection: the main content is served from 168.62.224.124 on port 443. Let's go and take a look at the certificates.
$ openssl s_client -connect 168.62.224.124:443
CONNECTED(00000003)
depth=2 C = US, O = GeoTrust Inc., CN = GeoTrust Global CA
verify return:1
depth=1 C = US, O = GeoTrust Inc., CN = RapidSSL SHA256 CA - G3
verify return:1
depth=0 OU = GT01042386, OU = See www.rapidssl.com/resources/cps (c)15, OU = Domain Control Validated - RapidSSL(R), CN = *.kaggle.com
verify return:1
---
Certificate chain
0 s:/OU=GT01042386/OU=See www.rapidssl.com/resources/cps (c)15/OU=Domain Control Validated - RapidSSL(R)/CN=*.kaggle.com
i:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
1 s:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIEqzCCA5OgAwIBAgIDAeG4MA0GCSqGSIb3DQEBCwUAMEcxCzAJBgNVBAYTAlVT
MRYwFAYDVQQKEw1HZW9UcnVzdCBJbmMuMSAwHgYDVQQDExdSYXBpZFNTTCBTSEEy
NTYgQ0EgLSBHMzAeFw0xNTAxMjQxOTIxMDRaFw0xODAxMjcwNDM1NDJaMIGQMRMw
EQYDVQQLEwpHVDAxMDQyMzg2MTEwLwYDVQQLEyhTZWUgd3d3LnJhcGlkc3NsLmNv
bS9yZXNvdXJjZXMvY3BzIChjKTE1MS8wLQYDVQQLEyZEb21haW4gQ29udHJvbCBW
YWxpZGF0ZWQgLSBSYXBpZFNTTChSKTEVMBMGA1UEAwwMKi5rYWdnbGUuY29tMIIB
IjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAtoFw6f5naQEZoehDmKeYPopY
ahFO9GDD/OfvFxLj3y8swTewTIBVlt8EBRLPwV1/cIPm5oBrVdTC5V9EDPQlfPCO
SwkVjQenCtGroI+GHLVe3B4z9rxSURagU3iYkes+9jMkkMEnh9zjRIsxC8fLVjlS
YzS9Sdvss3dahlt2VTxU8z+lax+80cExwSgKHbBBrKF0EySZRT4Pu7sXUM1W/u8n
dXfkHiYwdMaep5QO+/XNYMVPUa9fUgaVDJWPuxPCUqaKdL1EktsnbV4Y/EUtOLEH
nfKopRVnIAwTaexCm24Xwo1eTx6ArfK9VKswWI29HqMktsBex8iGASstURCbZQID
AQABo4IBVDCCAVAwHwYDVR0jBBgwFoAUw5zz/NNGCDS7zkZ/oHxb8+IIy1kwVwYI
KwYBBQUHAQEESzBJMB8GCCsGAQUFBzABhhNodHRwOi8vZ3Yuc3ltY2QuY29tMCYG
CCsGAQUFBzAChhpodHRwOi8vZ3Yuc3ltY2IuY29tL2d2LmNydDAOBgNVHQ8BAf8E
BAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMCMGA1UdEQQcMBqC
DCoua2FnZ2xlLmNvbYIKa2FnZ2xlLmNvbTArBgNVHR8EJDAiMCCgHqAchhpodHRw
Oi8vZ3Yuc3ltY2IuY29tL2d2LmNybDAMBgNVHRMBAf8EAjAAMEUGA1UdIAQ+MDww
OgYKYIZIAYb4RQEHNjAsMCoGCCsGAQUFBwIBFh5odHRwczovL3d3dy5yYXBpZHNz
bC5jb20vbGVnYWwwDQYJKoZIhvcNAQELBQADggEBAHBWQKqgh8W1mocnmC2C40j4
XB3y4NeUKsaTkWDrMN6NMi1cld57+qEVnER3Zhu8S8gMl9w0CTRCPskLFRN+D/H8
zCzH181m5mYo1VM1QbYZmEdRBUf0vCugMWm54KWQHJ+6jX5roJ7qvDZRGpMcVJEQ
UpXZMAi/NVVxR37CvZVut0NiW8/2Oh9oAujvX6qupNZDmgp89jq2+c0afrAFYC3G
djwDSO9xwoozdV69dWpYDmpjM7wfOAQ1pXziHkFJKMFgwBnzBpXAYocolVQR1IXE
UByl4MgGsww7KHiYI0IcBFbDhvonMuMuDMTFa32xhNgCDcuzrrMJxgjzaJeCGfM=
-----END CERTIFICATE-----
subject=/OU=GT01042386/OU=See www.rapidssl.com/resources/cps (c)15/OU=Domain Control Validated - RapidSSL(R)/CN=*.kaggle.com
issuer=/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
---
No client certificate CA names sent
Peer signing digest: SHA1
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 2807 bytes and written 487 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES256-SHA384
Session-ID: 450700009AC638B10CF535E14C51CD60003896E5E587412ACB71A45A8D26AE2E
Session-ID-ctx:
Master-Key: D79CDF6335E7FD82598A2C71768E89F6BD569F1F3AB6924341756EF3BF263D8E29600B2E24CD7489979491C8207541B3
Key-Arg : None
PSK identity: None
PSK identity hint: None
SRP username: None
Start Time: 1460046128
Timeout : 300 (sec)
Verify return code: 0 (ok)
---
GET /reddit/reddit-comments-may-2015/downloads/reddit-comments-may-2015.7z HTTP/1.0
Host: www.kaggle.com
HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 220
Content-Type: text/html; charset=utf-8
Location: /account/login?ReturnUrl=%2freddit%2freddit-comments-may-2015%2fdownloads%2freddit-comments-may-2015.7z
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=ec8c6570ebf1aeb294a1d1705194c0e86b485396a7a974e599f249a69392380a;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 16:22:45 GMT
Connection: close
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="/account/login?ReturnUrl=%2freddit%2freddit-comments-may-2015%2fdownloads%2freddit-comments-may-2015.7z">here</a>.</h2>
</body></html>
read:errno=0
Great: when we requested our archive, Kaggle redirected us to /account/login. Let's assume Kaggle expects us to log in. We have an account, so let's find out in what format it has to be submitted…
$ echo -e "GET / HTTP/1.0\nHost: www.kaggle.com\n\n" | \
> ncat --ssl 168.62.224.124 443 | \
> python -c "import sys; p = sys.stdin.read();\
> end = '</form>'; print p[ p.index('<form') : p.index(end) + len(end) ]"
<form action="/account/login" id="signin" method="post"><input id="returnUrl" name="returnUrl" type="hidden" value="https://www.kaggle.com/" /><input data-val="true" data-val-length="The field User name must be a string with a minimum length of 2 and a maximum length of 255." data-val-length-max="255" data-val-length-min="2" data-val-required="The User name field is required." id="UserName" name="UserName" placeholder="Email / username" type="text" value="" /><span class="field-validation-valid" data-valmsg-for="UserName" data-valmsg-replace="true"></span><input data-val="true" data-val-length="The field Password must be a string with a minimum length of 1 and a maximum length of 255." data-val-length-max="255" data-val-length-min="1" data-val-required="The Password field is required." id="Password" name="Password" placeholder="Password" type="password" /><span class="field-validation-valid" data-valmsg-for="Password" data-valmsg-replace="true"></span> <div id="remember-me">
<input data-val="true" data-val-required="The Remember me? field is required." id="RememberMe" name="RememberMe" type="checkbox" value="true" /><input name="RememberMe" type="hidden" value="false" />
<label for="RememberMe">Remember me?</label>
</div>
<input type="submit" value="Login" />
<input name="__RequestVerificationToken" type="hidden" value="ZOwaRteEzRHjapWIVwMYnQ9aIcanHE88nvN74bdySaF1aUX2IqStmHVOz9mT-g0iGMeoL3EZs1Xswp8bj1gtm7LVWP01" /> <input id="signinjs" type="hidden" name="JavaScriptEnabled" value="false" />
</form>
The front page / carries this login form; a simple combination of commands shows which fields the login requires.
$ echo -e "GET / HTTP/1.0\nHost: www.kaggle.com\n\n" | ncat --ssl 168.62.224.124 443 | grep name=
These are UserName, Password and __RequestVerificationToken. Let's write a script that generates the POST request.
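The field names listed above can also be pulled out programmatically. A minimal Python 3 sketch using the standard `html.parser` module follows; `FormFieldParser`, `form_fields` and the shortened `SAMPLE` markup (with a placeholder token value) are illustrative assumptions, not part of the original session.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect name/value pairs of <input> elements found inside <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            # Keep the first value seen for a repeated name (e.g. RememberMe).
            self.fields.setdefault(attrs["name"], attrs.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def form_fields(html: str) -> dict:
    """Return {field name: default value} for every form input in `html`."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


# Shortened version of the form shown above; the token is a placeholder.
SAMPLE = """
<form action="/account/login" id="signin" method="post">
<input id="returnUrl" name="returnUrl" type="hidden" value="https://www.kaggle.com/" />
<input id="UserName" name="UserName" type="text" value="" />
<input id="Password" name="Password" type="password" />
<input name="__RequestVerificationToken" type="hidden" value="token-goes-here" />
</form>
"""

if __name__ == "__main__":
    for name, value in form_fields(SAMPLE).items():
        print(name, "=", repr(value))
```

Unlike `grep name=`, this also returns the default values, which is how the verification token can be picked up without copying it by hand.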
$ cat kaggle
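#!/usr/bin/python
# coding: utf-8
# Python 2 script: prints a raw HTTP POST request for the Kaggle login form.
from urllib import urlencode

name = ''  # account user name or e-mail
pswd = ''  # account password
# Token copied from the hidden field of the login form shown above.
token = 'ZOwaRteEzRHjapWIVwMYnQ9aIcanHE88nvN74bdySaF1aUX2IqStmHVOz9mT-g0iGMeoL3EZs1Xswp8bj1gtm7LVWP01'
print """\
POST /account/login HTTP/1.0
Host: www.kaggle.com"""
# NOTE: the form names this field __RequestVerificationToken (two underscores);
# the key below has a single one, as in the recorded run that follows.
auth = dict(UserName=name, Password=pswd, _RequestVerificationToken=token)
auth = urlencode(auth)
# The empty line before the body separates the headers from the form data.
print """\
Content-Length: {}
Content-Type: application/x-www-form-urlencoded

{}\n\n""".format(len(auth), auth)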
$ chmod u+x kaggle
$ ./kaggle
POST /account/login HTTP/1.0
Host: www.kaggle.com
Content-Length: 138
Content-Type: application/x-www-form-urlencoded
UserName=&Password=&_RequestVerificationToken=ZOwaRteEzRHjapWIVwMYnQ9aIcanHE88nvN74bdySaF1aUX2IqStmHVOz9mT-g0iGMeoL3EZs1Xswp8bj1gtm7LVWP01
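For readers on Python 3, here is a hedged sketch of the same request generator. It is not the author's script: `build_login_request` is a hypothetical name, it uses the two-underscore field name from the form, and it uses CRLF line endings as the HTTP specification asks for. Whether the server actually validates the token is not known from this session.

```python
from urllib.parse import urlencode


def build_login_request(user: str, password: str, token: str) -> bytes:
    """Return a raw HTTP/1.0 POST request for /account/login."""
    body = urlencode({
        "UserName": user,
        "Password": password,
        "__RequestVerificationToken": token,
    }).encode("ascii")
    head = "\r\n".join([
        "POST /account/login HTTP/1.0",
        "Host: www.kaggle.com",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",
        "", "",  # the empty line that separates headers from the body
    ]).encode("ascii")
    return head + body


if __name__ == "__main__":
    import sys
    # Fill in real credentials and a fresh token before piping this to ncat.
    sys.stdout.buffer.write(build_login_request("", "", "token-goes-here"))
```

Computing `Content-Length` from the encoded body rather than from the text keeps the header correct even if a value contains characters that `urlencode` escapes.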
Now we have a tool for logging in. Let's try it out, remembering to put an account into the script first (I am not sharing mine :)
$ ./kaggle | ncat --ssl 168.62.224.124 443
HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 133
Content-Type: text/html; charset=utf-8
Location: /account/welcome
Set-Cookie: .ASPXAUTH=CF1048B54555F4AB05449AFA6C1322752B86C0926B62BD9A2BAE30558B237B192E7167B2F0CCE7FDE1F7C9FA278116201D3A338A004BAD052B847B22EDB6E06578F33E6E704CC1910645227D06CD7E2E24A345A3; domain=.kaggle.com; path=/; secure
Set-Cookie: TempData=_hhCR3bCrBBZ5tsjEvI/YgV5BXEBAgCOTaQGVuWVhtBi6asMHP94a+m0EjAsIJQRe8tmquZEWhd8K5M1PqkBC6AsbjZM=; path=/; secure; HttpOnly
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=5e6e1186c4f5100992840941bc3d52d8fb7eb0ebf5703f0e1f03aadd68005cbd;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 19:13:54 GMT
Connection: close
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="/account/welcome">here</a>.</h2>
</body></html>
We were redirected to /account/welcome with code 302 Found; I think this can be read as a successful login. Note the new cookies, which appear to be the authorization cookies. Following the MVP approach, let's first try just the first cookie.
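Turning `Set-Cookie` response headers into a `Cookie` request header is mechanical, so it can be sketched in a few lines of Python 3. `cookie_header` is a hypothetical helper and the cookie values in `SAMPLE` are shortened placeholders, not the real ones.

```python
def cookie_header(raw_head: str, wanted=None) -> str:
    """Build a Cookie request header from Set-Cookie response headers.

    `wanted` optionally restricts the result to the given cookie names.
    """
    pairs = []
    for line in raw_head.splitlines():
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "set-cookie":
            continue
        # Only the leading "name=value" part is sent back; the rest
        # (domain, path, secure, HttpOnly) are attributes for the client.
        pair = value.split(";", 1)[0].strip()
        cookie_name = pair.split("=", 1)[0]
        if wanted is None or cookie_name in wanted:
            pairs.append(pair)
    return "Cookie: " + "; ".join(pairs)


# Shape of the login response above, with placeholder cookie values.
SAMPLE = (
    "HTTP/1.1 302 Found\n"
    "Location: /account/welcome\n"
    "Set-Cookie: .ASPXAUTH=ABC123; domain=.kaggle.com; path=/; secure\n"
    "Set-Cookie: TempData=xyz=; path=/; secure; HttpOnly\n"
    "Set-Cookie: ARRAffinity=5e6e;Path=/;Domain=www.kaggle.com\n"
)

if __name__ == "__main__":
    print(cookie_header(SAMPLE, wanted={".ASPXAUTH"}))
```

Passing `wanted={".ASPXAUTH"}` mirrors the idea of trying the first cookie only; omitting it sends all of them.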
$ cat dataset
GET /reddit/reddit-comments-may-2015/downloads/reddit-comments-may-2015.7z HTTP/1.0
Host: www.kaggle.com
Cookie: .ASPXAUTH=CF1048B54555F4AB05449AFA6C1322752B86C0926B62BD9A2BAE30558B237B192E7167B2F0CCE7FDE1F7C9FA278116201D3A338A004BAD052B847B22EDB6E06578F33E6E704CC1910645227D06CD7E2E24A345A3
$ cat dataset | ncat --ssl 168.62.224.124 443
HTTP/1.1 302 Found
Cache-Control: private, s-maxage=0
Content-Length: 316
Content-Type: text/html; charset=utf-8
Location: https://kaggle2.blob.core.windows.net/datasets/7/7/reddit-comments-may-2015.7z?sv=2012-02-12&se=2016-04-10T20%3A29%3A24Z&sr=b&sp=r&sig=hqn0amTYoKZfKHuhVyO%2FHQrCtCZHIWLKpCsSInBh9lg%3D
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=5e6e1186c4f5100992840941bc3d52d8fb7eb0ebf5703f0e1f03aadd68005cbd;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 20:29:23 GMT
Connection: close
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="https://kaggle2.blob.core.windows.net/datasets/7/7/reddit-comments-may-2015.7z?sv=2012-02-12&se=2016-04-10T20%3A29%3A24Z&sr=b&sp=r&sig=hqn0amTYoKZfKHuhVyO%2FHQrCtCZHIWLKpCsSInBh9lg%3D">here</a>.</h2>
</body></html>
The Location header gave us a dynamically generated link to the requested file. Excellent.
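The query string of that link can be taken apart to see what it grants. The Python 3 sketch below assumes the parameters follow the Azure shared access signature convention (`se` is the expiry time, `sp` the permissions, `sr` the resource type, `sv` the service version, `sig` the signature); `describe_signed_url` is a hypothetical helper.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# The redirect target from the response above.
URL = ("https://kaggle2.blob.core.windows.net/datasets/7/7/"
       "reddit-comments-may-2015.7z?sv=2012-02-12"
       "&se=2016-04-10T20%3A29%3A24Z&sr=b&sp=r"
       "&sig=hqn0amTYoKZfKHuhVyO%2FHQrCtCZHIWLKpCsSInBh9lg%3D")


def describe_signed_url(url: str) -> dict:
    """Split a signed blob URL into host, path and decoded query parameters."""
    parts = urlsplit(url)
    # parse_qs percent-decodes values and returns a list per key.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    expires = datetime.strptime(query["se"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "host": parts.netloc,
        "path": parts.path,
        "expires": expires.replace(tzinfo=timezone.utc),
        "permissions": query.get("sp"),
        "params": query,
    }


if __name__ == "__main__":
    info = describe_signed_url(URL)
    print(info["host"], info["path"])
    print("read-only" if info["permissions"] == "r" else info["permissions"],
          "until", info["expires"].isoformat())
```

Read this way, the link is a read-only grant on one blob that expires on 2016-04-10, three days after the request was made.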
$ host kaggle2.blob.core.windows.net
kaggle2.blob.core.windows.net is an alias for blob.ch3prdstr06a.store.core.windows.net.
blob.ch3prdstr06a.store.core.windows.net has address 23.98.55.152
$ cat reddit
HEAD /datasets/7/7/reddit-comments-may-2015.7z HTTP/1.0
Host: kaggle2.blob.core.windows.net
$ cat reddit | ncat 23.98.55.152 80
HTTP/1.1 200 OK
Keep-Alive: true
Content-Length: 8483353425
Content-Type: application/x-7z-compressed
Last-Modified: Sat, 19 Dec 2015 00:31:24 GMT
ETag: 0x8D3080BB99CEF48
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 6899a978-0001-0106-7f17-917d40000000
x-ms-version: 2009-09-19
x-ms-meta-ExpectedContentLength: 8483353425
x-ms-meta-UserReportedLastModifiedDate: 1450476347000
x-ms-write-protection: false
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Thu, 07 Apr 2016 21:49:14 GMT
Connection: close
The request was accepted, even without any additional parameters. Now we can download the archive, or parts of it…
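Knowing the total size from `Content-Length`, the archive can be split into pieces to fetch one at a time. The Python 3 sketch below only computes the `Range` header values; `byte_ranges` is a hypothetical helper and the 1 GiB chunk size is an arbitrary choice for illustration.

```python
def byte_ranges(total: int, chunk: int):
    """Yield Range header values covering `total` bytes in `chunk`-sized pieces."""
    if total < 0 or chunk <= 0:
        raise ValueError("total must be >= 0 and chunk must be > 0")
    for start in range(0, total, chunk):
        end = min(start + chunk, total) - 1  # Range ends are inclusive
        yield f"bytes={start}-{end}"


if __name__ == "__main__":
    TOTAL = 8483353425  # Content-Length reported by the HEAD request above
    for value in byte_ranges(TOTAL, 2 ** 30):
        print("Range:", value)
```

Each value can be placed in a request file like the ones used in this post, so an interrupted download only has to repeat one piece.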
$ cat head_range
GET /datasets/7/7/reddit-comments-may-2015.7z HTTP/1.0
Host: kaggle2.blob.core.windows.net
Range: bytes=0-31
$ cat head_range | ncat 23.98.55.152 80
HTTP/1.1 206 Partial Content
Keep-Alive: true
Content-Length: 32
Content-Type: application/x-7z-compressed
Content-Range: bytes 0-31/8483353425
Last-Modified: Sat, 19 Dec 2015 00:31:24 GMT
ETag: 0x8D3080BB99CEF48
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 6e1e5f7c-0001-0041-6f18-91e47e000000
x-ms-version: 2009-09-19
x-ms-meta-ExpectedContentLength: 8483353425
x-ms-meta-UserReportedLastModifiedDate: 1450476347000
x-ms-write-protection: false
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Thu, 07 Apr 2016 21:54:03 GMT
Connection: close
7z��'�����g˲���fW���
So, with the Range header we can request and read the archive's service (metadata) sections, which are scattered throughout the body, at whatever offset is needed.
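As an illustration of where to look next, the 32 bytes fetched above can be decoded to find the following metadata block. The Python 3 sketch below follows the 7z signature header layout as I understand it from the format description: a 6-byte signature, two version bytes, a CRC of the next 20 bytes, then the offset, size and CRC of the "next header", with the offset counted from the end of these 32 bytes. `parse_signature_header` is a hypothetical helper, so treat it as a sketch rather than a complete reader.

```python
import struct
import zlib

SIGNATURE = b"7z\xbc\xaf\x27\x1c"
# signature, major, minor, StartHeaderCRC, NextHeaderOffset,
# NextHeaderSize, NextHeaderCRC (little-endian, 32 bytes in total)
HEADER = struct.Struct("<6sBBIQQI")


def parse_signature_header(data: bytes) -> dict:
    """Decode the first 32 bytes of a 7z archive."""
    if len(data) < HEADER.size:
        raise ValueError("need at least 32 bytes")
    sig, major, minor, start_crc, offset, size, next_crc = HEADER.unpack(
        data[:HEADER.size])
    if sig != SIGNATURE:
        raise ValueError("not a 7z archive")
    # StartHeaderCRC covers the 20 bytes that follow it.
    if zlib.crc32(data[12:32]) != start_crc:
        raise ValueError("start header CRC mismatch")
    start = HEADER.size + offset
    return {
        "version": (major, minor),
        "next_header_size": size,
        "next_header_crc": next_crc,
        # Range value for fetching the next header, if there is one.
        "range": f"bytes={start}-{start + size - 1}" if size else None,
    }


if __name__ == "__main__":
    # Synthetic header for demonstration; real input is the 32-byte reply body.
    tail = struct.pack("<QQI", 1000, 50, 0)
    demo = SIGNATURE + bytes([0, 4]) + struct.pack("<I", zlib.crc32(tail)) + tail
    print(parse_signature_header(demo))
```

Feeding it the body of the `Range: bytes=0-31` response would give the `Range` value for the next request, which is how the metadata can be read without downloading all 8 GB.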