It is useful to learn to work with large 7z archives, with the network, and with large 7z archives served over the network. As an example, let's take a Kaggle dataset of about 8 GB…

http://vk.cc/50eTT0

Let's see what is behind this link.

$ host vk.cc
vk.cc has address 95.213.4.232
vk.cc has address 95.213.4.233
vk.cc has address 95.213.4.230
$ telnet 95.213.4.230 80
Trying 95.213.4.230...
Connected to 95.213.4.230.
Escape character is '^]'.
GET /50eTT0 HTTP/1.0
Host: vk.cc

HTTP/1.1 302 Found
Server: Apache
Date: Thu, 07 Apr 2016 15:06:44 GMT
Content-Type: text/html; charset=windows-1251
Content-Length: 0
Connection: close
X-Powered-By: PHP/3.22852
Pragma: no-cache
Cache-control: no-store
Location: https://www.kaggle.com/reddit/reddit-comments-may-2015/downloads/reddit-comments-may-2015.7z

Connection closed by foreign host.
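
Reading redirects by eye works for one request, but gets tedious over several hops. Below is a minimal Python 3 sketch that splits a raw response like the one above into a status code and headers. The function name parse_response_head is hypothetical, and the sample is a shortened copy of the vk.cc reply.

```python
def parse_response_head(raw: str):
    """Split a raw HTTP response into (status_code, reason, headers)."""
    # Servers send CRLF; normalise so the sketch also accepts bare LF.
    head = raw.replace("\r\n", "\n").split("\n\n", 1)[0]
    status_line, *header_lines = head.split("\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        # A repeated header (e.g. Set-Cookie) overwrites the previous one here.
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


raw = (
    "HTTP/1.1 302 Found\r\n"
    "Server: Apache\r\n"
    "Content-Length: 0\r\n"
    "Location: https://www.kaggle.com/reddit/reddit-comments-may-2015/"
    "downloads/reddit-comments-may-2015.7z\r\n"
    "\r\n"
)
code, reason, headers = parse_response_head(raw)
print(code, reason)
print(headers["location"])
```

A plain dict is enough while we only care about Location; it is not a general HTTP parser.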

Now let's learn to talk to Kaggle.

$ host kaggle.com
kaggle.com has address 168.62.224.13
kaggle.com mail is handled by 10 aspmx.l.google.com.
kaggle.com mail is handled by 20 alt1.aspmx.l.google.com.
kaggle.com mail is handled by 30 alt2.aspmx.l.google.com.
kaggle.com mail is handled by 40 aspmx2.googlemail.com.
kaggle.com mail is handled by 50 aspmx3.googlemail.com.
$ nc 168.62.224.13 80
GET / HTTP/1.0
Host: kaggle.com

HTTP/1.1 301 Moved Permanently
Content-Length: 146
Content-Type: text/html; charset=UTF-8
Location: https://www.kaggle.com/
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=ec8c6570ebf1aeb294a1d1705194c0e86b485396a7a974e599f249a69392380a;Path=/;Domain=kaggle.com
Date: Thu, 07 Apr 2016 15:15:03 GMT
Connection: close

<head><title>Document Moved</title></head>
<body><h1>Object Moved</h1>This document may be found <a HREF="https://www.kaggle.com/">here</a></body>

We can see that mail for kaggle.com is handled by Google's servers (the MX records), that 168.62.224.13 only answers with a redirect, and that the main content lives on www.kaggle.com.

$ host www.kaggle.com
www.kaggle.com has address 168.62.224.124
$ nc 168.62.224.124 80 | head
GET / HTTP/1.0
Host: www.kaggle.com

HTTP/1.1 301 Moved Permanently
Content-Length: 146
Content-Type: text/html; charset=UTF-8
Location: https://www.kaggle.com/
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=5e6e1186c4f5100992840941bc3d52d8fb7eb0ebf5703f0e1f03aadd68005cbd;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 15:23:53 GMT
Connection: close

<head><title>Document Moved</title></head>

It wants a secure connection. The main content is served from 168.62.224.124 on port 443. Let's connect and take a look at the certificates.

$ openssl s_client -connect 168.62.224.124:443
CONNECTED(00000003)
depth=2 C = US, O = GeoTrust Inc., CN = GeoTrust Global CA
verify return:1
depth=1 C = US, O = GeoTrust Inc., CN = RapidSSL SHA256 CA - G3
verify return:1
depth=0 OU = GT01042386, OU = See www.rapidssl.com/resources/cps (c)15, OU = Domain Control Validated - RapidSSL(R), CN = *.kaggle.com
verify return:1
---
Certificate chain
 0 s:/OU=GT01042386/OU=See www.rapidssl.com/resources/cps (c)15/OU=Domain Control Validated - RapidSSL(R)/CN=*.kaggle.com
   i:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
 1 s:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
   i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIEqzCCA5OgAwIBAgIDAeG4MA0GCSqGSIb3DQEBCwUAMEcxCzAJBgNVBAYTAlVT
MRYwFAYDVQQKEw1HZW9UcnVzdCBJbmMuMSAwHgYDVQQDExdSYXBpZFNTTCBTSEEy
NTYgQ0EgLSBHMzAeFw0xNTAxMjQxOTIxMDRaFw0xODAxMjcwNDM1NDJaMIGQMRMw
EQYDVQQLEwpHVDAxMDQyMzg2MTEwLwYDVQQLEyhTZWUgd3d3LnJhcGlkc3NsLmNv
bS9yZXNvdXJjZXMvY3BzIChjKTE1MS8wLQYDVQQLEyZEb21haW4gQ29udHJvbCBW
YWxpZGF0ZWQgLSBSYXBpZFNTTChSKTEVMBMGA1UEAwwMKi5rYWdnbGUuY29tMIIB
IjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAtoFw6f5naQEZoehDmKeYPopY
ahFO9GDD/OfvFxLj3y8swTewTIBVlt8EBRLPwV1/cIPm5oBrVdTC5V9EDPQlfPCO
SwkVjQenCtGroI+GHLVe3B4z9rxSURagU3iYkes+9jMkkMEnh9zjRIsxC8fLVjlS
YzS9Sdvss3dahlt2VTxU8z+lax+80cExwSgKHbBBrKF0EySZRT4Pu7sXUM1W/u8n
dXfkHiYwdMaep5QO+/XNYMVPUa9fUgaVDJWPuxPCUqaKdL1EktsnbV4Y/EUtOLEH
nfKopRVnIAwTaexCm24Xwo1eTx6ArfK9VKswWI29HqMktsBex8iGASstURCbZQID
AQABo4IBVDCCAVAwHwYDVR0jBBgwFoAUw5zz/NNGCDS7zkZ/oHxb8+IIy1kwVwYI
KwYBBQUHAQEESzBJMB8GCCsGAQUFBzABhhNodHRwOi8vZ3Yuc3ltY2QuY29tMCYG
CCsGAQUFBzAChhpodHRwOi8vZ3Yuc3ltY2IuY29tL2d2LmNydDAOBgNVHQ8BAf8E
BAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMCMGA1UdEQQcMBqC
DCoua2FnZ2xlLmNvbYIKa2FnZ2xlLmNvbTArBgNVHR8EJDAiMCCgHqAchhpodHRw
Oi8vZ3Yuc3ltY2IuY29tL2d2LmNybDAMBgNVHRMBAf8EAjAAMEUGA1UdIAQ+MDww
OgYKYIZIAYb4RQEHNjAsMCoGCCsGAQUFBwIBFh5odHRwczovL3d3dy5yYXBpZHNz
bC5jb20vbGVnYWwwDQYJKoZIhvcNAQELBQADggEBAHBWQKqgh8W1mocnmC2C40j4
XB3y4NeUKsaTkWDrMN6NMi1cld57+qEVnER3Zhu8S8gMl9w0CTRCPskLFRN+D/H8
zCzH181m5mYo1VM1QbYZmEdRBUf0vCugMWm54KWQHJ+6jX5roJ7qvDZRGpMcVJEQ
UpXZMAi/NVVxR37CvZVut0NiW8/2Oh9oAujvX6qupNZDmgp89jq2+c0afrAFYC3G
djwDSO9xwoozdV69dWpYDmpjM7wfOAQ1pXziHkFJKMFgwBnzBpXAYocolVQR1IXE
UByl4MgGsww7KHiYI0IcBFbDhvonMuMuDMTFa32xhNgCDcuzrrMJxgjzaJeCGfM=
-----END CERTIFICATE-----
subject=/OU=GT01042386/OU=See www.rapidssl.com/resources/cps (c)15/OU=Domain Control Validated - RapidSSL(R)/CN=*.kaggle.com
issuer=/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
---
No client certificate CA names sent
Peer signing digest: SHA1
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 2807 bytes and written 487 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-SHA384
    Session-ID: 450700009AC638B10CF535E14C51CD60003896E5E587412ACB71A45A8D26AE2E
    Session-ID-ctx: 
    Master-Key: D79CDF6335E7FD82598A2C71768E89F6BD569F1F3AB6924341756EF3BF263D8E29600B2E24CD7489979491C8207541B3
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1460046128
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---
GET /reddit/reddit-comments-may-2015/downloads/reddit-comments-may-2015.7z HTTP/1.0
Host: www.kaggle.com

HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 220
Content-Type: text/html; charset=utf-8
Location: /account/login?ReturnUrl=%2freddit%2freddit-comments-may-2015%2fdownloads%2freddit-comments-may-2015.7z
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=ec8c6570ebf1aeb294a1d1705194c0e86b485396a7a974e599f249a69392380a;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 16:22:45 GMT
Connection: close

<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="/account/login?ReturnUrl=%2freddit%2freddit-comments-may-2015%2fdownloads%2freddit-comments-may-2015.7z">here</a>.</h2>
</body></html>
read:errno=0
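
The Location header above carries our original path percent-encoded in the ReturnUrl parameter. A small Python 3 sketch, using only the standard library, decodes it to confirm that the login page intends to send us back to the archive.

```python
from urllib.parse import urlsplit, parse_qs

# Location header value copied from the response above.
location = (
    "/account/login?ReturnUrl=%2freddit%2freddit-comments-may-2015"
    "%2fdownloads%2freddit-comments-may-2015.7z"
)

parts = urlsplit(location)
# parse_qs undoes the percent-encoding (%2f -> "/") for us.
return_url = parse_qs(parts.query)["ReturnUrl"][0]

print(parts.path)
print(return_url)
```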

Great: when we requested our archive, Kaggle redirected us to /account/login. Let's assume Kaggle expects us to log in. We have an account, so let's find out in what format it wants the credentials…

$ echo -e "GET / HTTP/1.0\nHost: www.kaggle.com\n\n" | \
> ncat --ssl 168.62.224.124 443 | \
> python -c "import sys; p = sys.stdin.read();\
> end = '</form>'; print p[ p.index('<form') : p.index(end) + len(end) ]"
<form action="/account/login" id="signin" method="post"><input id="returnUrl" name="returnUrl" type="hidden" value="https://www.kaggle.com/" /><input data-val="true" data-val-length="The field User name must be a string with a minimum length of 2 and a maximum length of 255." data-val-length-max="255" data-val-length-min="2" data-val-required="The User name field is required." id="UserName" name="UserName" placeholder="Email / username" type="text" value="" /><span class="field-validation-valid" data-valmsg-for="UserName" data-valmsg-replace="true"></span><input data-val="true" data-val-length="The field Password must be a string with a minimum length of 1 and a maximum length of 255." data-val-length-max="255" data-val-length-min="1" data-val-required="The Password field is required." id="Password" name="Password" placeholder="Password" type="password" /><span class="field-validation-valid" data-valmsg-for="Password" data-valmsg-replace="true"></span>    <div id="remember-me">
        <input data-val="true" data-val-required="The Remember me? field is required." id="RememberMe" name="RememberMe" type="checkbox" value="true" /><input name="RememberMe" type="hidden" value="false" />            
        <label for="RememberMe">Remember me?</label>
    </div>   
    <input type="submit" value="Login" />
<input name="__RequestVerificationToken" type="hidden" value="ZOwaRteEzRHjapWIVwMYnQ9aIcanHE88nvN74bdySaF1aUX2IqStmHVOz9mT-g0iGMeoL3EZs1Xswp8bj1gtm7LVWP01" />    <input id="signinjs" type="hidden" name="JavaScriptEnabled" value="false" />    
</form>

This login form sits on the front page /, and a simple combination of commands tells us which fields are required to log in.

$ echo -e "GET / HTTP/1.0\nHost: www.kaggle.com\n\n" | ncat --ssl 168.62.224.124 443 | grep name=
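
grep only shows the lines that contain name=. For a more structured view, here is a Python 3 sketch built on the standard html.parser module. The class name FormFields is hypothetical, and the sample is a shortened form with the token replaced by the placeholder TOKEN.

```python
from html.parser import HTMLParser


class FormFields(HTMLParser):
    """Collect name -> value for every <input> inside the first <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and "name" in attrs:
            # Keep the first value for a repeated name (e.g. RememberMe).
            self.fields.setdefault(attrs["name"], attrs.get("value"))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Shortened stand-in for the real page; TOKEN is a placeholder value.
sample = (
    '<form action="/account/login" id="signin" method="post">'
    '<input id="UserName" name="UserName" type="text" value="" />'
    '<input id="Password" name="Password" type="password" />'
    '<input name="__RequestVerificationToken" type="hidden" value="TOKEN" />'
    "</form>"
)
parser = FormFields()
parser.feed(sample)
print(parser.action, sorted(parser.fields))
```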

The required fields are UserName, Password and __RequestVerificationToken. Let's write a script that generates the POST request.

$ cat kaggle
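#!/usr/bin/python
# coding: utf-8
# Python 2 script: prints a raw HTTP POST request for the Kaggle login form.
from urllib import urlencode

name = ''   # account user name or e-mail
pswd = ''   # account password
# value of the hidden token field, copied from the form above
token = 'ZOwaRteEzRHjapWIVwMYnQ9aIcanHE88nvN74bdySaF1aUX2IqStmHVOz9mT-g0iGMeoL3EZs1Xswp8bj1gtm7LVWP01'

print """\
POST /account/login HTTP/1.0
Host: www.kaggle.com"""

# NOTE: the form field is named __RequestVerificationToken (two underscores);
# the key below has only one, which is what the recorded run actually sent.
auth = dict(UserName=name, Password=pswd, _RequestVerificationToken=token)
auth = urlencode(auth)

print """\
Content-Length: {}
Content-Type: application/x-www-form-urlencoded

{}\n\n""".format(len(auth), auth)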

$ chmod u+x kaggle
$ ./kaggle 
POST /account/login HTTP/1.0
Host: www.kaggle.com
Content-Length: 138
Content-Type: application/x-www-form-urlencoded

UserName=&Password=&_RequestVerificationToken=ZOwaRteEzRHjapWIVwMYnQ9aIcanHE88nvN74bdySaF1aUX2IqStmHVOz9mT-g0iGMeoL3EZs1Xswp8bj1gtm7LVWP01
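
The script above targets Python 2. As a hedged alternative, here is a Python 3 sketch of the same request. It spells the token field with two underscores, exactly as it appears in the form, and uses CRLF line endings as HTTP specifies. The function name build_login_request and the sample credentials are made up for illustration; whether Kaggle treats the two spellings differently is not something this write-up verifies.

```python
from urllib.parse import urlencode


def build_login_request(user: str, password: str, token: str) -> bytes:
    """Return a raw HTTP/1.0 POST for /account/login as bytes."""
    body = urlencode({
        "UserName": user,
        "Password": password,
        # Field name as it appears in the form (two underscores).
        "__RequestVerificationToken": token,
    }).encode("ascii")
    head = (
        "POST /account/login HTTP/1.0\r\n"
        "Host: www.kaggle.com\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        # Content-Length counts the bytes of the body only.
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


# Placeholder credentials and token, not real ones.
request = build_login_request("user@example.com", "s3cret", "TOKEN")
print(request.decode("ascii"))
```

The bytes can be piped into ncat --ssl in the same way as the output of ./kaggle.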

Now we have a tool for logging in. Let's try it out, remembering to put the account credentials into the script first (I'm not sharing mine :)

$ ./kaggle | ncat --ssl 168.62.224.124 443
HTTP/1.1 302 Found
Cache-Control: private
Content-Length: 133
Content-Type: text/html; charset=utf-8
Location: /account/welcome
Set-Cookie: .ASPXAUTH=CF1048B54555F4AB05449AFA6C1322752B86C0926B62BD9A2BAE30558B237B192E7167B2F0CCE7FDE1F7C9FA278116201D3A338A004BAD052B847B22EDB6E06578F33E6E704CC1910645227D06CD7E2E24A345A3; domain=.kaggle.com; path=/; secure
Set-Cookie: TempData=_hhCR3bCrBBZ5tsjEvI/YgV5BXEBAgCOTaQGVuWVhtBi6asMHP94a+m0EjAsIJQRe8tmquZEWhd8K5M1PqkBC6AsbjZM=; path=/; secure; HttpOnly
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=5e6e1186c4f5100992840941bc3d52d8fb7eb0ebf5703f0e1f03aadd68005cbd;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 19:13:54 GMT
Connection: close

<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="/account/welcome">here</a>.</h2>
</body></html>

We were redirected to /account/welcome with a 302 Found; I think this can be read as a successful login. Note the new cookies, which appear to be the authentication cookies. In the spirit of an MVP (try the smallest thing that could work), let's start with just the first cookie.
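
Copying a long cookie by hand is error-prone, so here is a Python 3 sketch that pulls the name=value pairs out of Set-Cookie lines and builds a Cookie header from the ones we choose. The helper names cookies_from_headers and cookie_header are hypothetical, and the cookie values in the sample are shortened placeholders.

```python
def cookies_from_headers(header_lines):
    """Return {name: value} for every Set-Cookie line, dropping attributes."""
    jar = {}
    for line in header_lines:
        name, sep, rest = line.partition(":")
        if not sep or name.strip().lower() != "set-cookie":
            continue
        # The cookie pair is everything before the first ';'. The rest
        # (domain, path, secure, HttpOnly) are attributes for the client.
        pair = rest.split(";", 1)[0].strip()
        key, _, value = pair.partition("=")  # the value may itself contain '='
        jar[key] = value
    return jar


def cookie_header(jar, names):
    """Build a Cookie request header from the selected cookie names."""
    return "Cookie: " + "; ".join(f"{n}={jar[n]}" for n in names)


# Shortened stand-in for the login response; values are placeholders.
response_head = [
    "HTTP/1.1 302 Found",
    "Location: /account/welcome",
    "Set-Cookie: .ASPXAUTH=ABC123; domain=.kaggle.com; path=/; secure",
    "Set-Cookie: TempData=xyz/+w=; path=/; secure; HttpOnly",
    "Set-Cookie: ARRAffinity=5e6e;Path=/;Domain=www.kaggle.com",
]
jar = cookies_from_headers(response_head)
print(cookie_header(jar, [".ASPXAUTH"]))
```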

$ cat dataset
GET /reddit/reddit-comments-may-2015/downloads/reddit-comments-may-2015.7z HTTP/1.0
Host: www.kaggle.com
Cookie: .ASPXAUTH=CF1048B54555F4AB05449AFA6C1322752B86C0926B62BD9A2BAE30558B237B192E7167B2F0CCE7FDE1F7C9FA278116201D3A338A004BAD052B847B22EDB6E06578F33E6E704CC1910645227D06CD7E2E24A345A3

$ cat dataset | ncat --ssl 168.62.224.124 443
HTTP/1.1 302 Found
Cache-Control: private, s-maxage=0
Content-Length: 316
Content-Type: text/html; charset=utf-8
Location: https://kaggle2.blob.core.windows.net/datasets/7/7/reddit-comments-may-2015.7z?sv=2012-02-12&se=2016-04-10T20%3A29%3A24Z&sr=b&sp=r&sig=hqn0amTYoKZfKHuhVyO%2FHQrCtCZHIWLKpCsSInBh9lg%3D
X-Frame-Options: SAMEORIGIN
Set-Cookie: ARRAffinity=5e6e1186c4f5100992840941bc3d52d8fb7eb0ebf5703f0e1f03aadd68005cbd;Path=/;Domain=www.kaggle.com
Date: Thu, 07 Apr 2016 20:29:23 GMT
Connection: close

<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="https://kaggle2.blob.core.windows.net/datasets/7/7/reddit-comments-may-2015.7z?sv=2012-02-12&amp;se=2016-04-10T20%3A29%3A24Z&amp;sr=b&amp;sp=r&amp;sig=hqn0amTYoKZfKHuhVyO%2FHQrCtCZHIWLKpCsSInBh9lg%3D">here</a>.</h2>
</body></html>

The Location header gave us a dynamically generated link to the requested file. Wonderful.
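
The query string of that link is worth a closer look. The Python 3 sketch below takes it apart. Reading the parameters as an Azure shared access signature (sv = version, se = expiry time, sr = resource type, sp = permissions, sig = signature) is my assumption, not something the response itself states.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs

# Location header value copied from the response above.
location = (
    "https://kaggle2.blob.core.windows.net/datasets/7/7/reddit-comments-may-2015.7z"
    "?sv=2012-02-12&se=2016-04-10T20%3A29%3A24Z&sr=b&sp=r"
    "&sig=hqn0amTYoKZfKHuhVyO%2FHQrCtCZHIWLKpCsSInBh9lg%3D"
)

parts = urlsplit(location)
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

# "se" looks like an expiry timestamp in UTC (assumption, see above).
expires = datetime.strptime(params["se"], "%Y-%m-%dT%H:%M:%SZ").replace(
    tzinfo=timezone.utc
)
# Date header of the same response: Thu, 07 Apr 2016 20:29:23 GMT.
issued = datetime(2016, 4, 7, 20, 29, 23, tzinfo=timezone.utc)

print(parts.netloc, parts.path)
print(params)
print("link lifetime:", expires - issued)
```

If that reading is right, the link was issued for about three days from the moment of the request.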

$ host kaggle2.blob.core.windows.net
kaggle2.blob.core.windows.net is an alias for blob.ch3prdstr06a.store.core.windows.net.
blob.ch3prdstr06a.store.core.windows.net has address 23.98.55.152
$ cat reddit 
HEAD /datasets/7/7/reddit-comments-may-2015.7z HTTP/1.0
Host: kaggle2.blob.core.windows.net

$ cat reddit | ncat 23.98.55.152 80
HTTP/1.1 200 OK
Keep-Alive: true
Content-Length: 8483353425
Content-Type: application/x-7z-compressed
Last-Modified: Sat, 19 Dec 2015 00:31:24 GMT
ETag: 0x8D3080BB99CEF48
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 6899a978-0001-0106-7f17-917d40000000
x-ms-version: 2009-09-19
x-ms-meta-ExpectedContentLength: 8483353425
x-ms-meta-UserReportedLastModifiedDate: 1450476347000
x-ms-write-protection: false
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Thu, 07 Apr 2016 21:49:14 GMT
Connection: close

Accepted, and even without any of the extra query parameters. Now we can download the archive, or parts of it…
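
Downloading in parts means splitting the Content-Length reported above into byte ranges. Here is a Python 3 sketch of that arithmetic; the helper names and the 1 GiB chunk size are arbitrary choices for illustration.

```python
def byte_ranges(total_size: int, chunk_size: int):
    """Yield inclusive (start, end) byte offsets covering total_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_size, chunk_size):
        # HTTP ranges are inclusive on both ends, hence the -1.
        end = min(start + chunk_size, total_size) - 1
        yield start, end


def range_header(start: int, end: int) -> str:
    """Format one range as an HTTP Range request header."""
    return f"Range: bytes={start}-{end}"


total = 8483353425   # Content-Length from the HEAD response above
chunk = 1 << 30      # hypothetical chunk size: 1 GiB

ranges = list(byte_ranges(total, chunk))
print(len(ranges), "chunks")
print(range_header(*ranges[0]))
print(range_header(*ranges[-1]))
```

Each header can replace the Range line in a hand-written request like the ones in this walkthrough.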

$ cat head_range 
GET /datasets/7/7/reddit-comments-may-2015.7z HTTP/1.0
Host: kaggle2.blob.core.windows.net
Range: bytes=0-31

$ cat head_range | ncat 23.98.55.152 80
HTTP/1.1 206 Partial Content
Keep-Alive: true
Content-Length: 32
Content-Type: application/x-7z-compressed
Content-Range: bytes 0-31/8483353425
Last-Modified: Sat, 19 Dec 2015 00:31:24 GMT
ETag: 0x8D3080BB99CEF48
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 6e1e5f7c-0001-0041-6f18-91e47e000000
x-ms-version: 2009-09-19
x-ms-meta-ExpectedContentLength: 8483353425
x-ms-meta-UserReportedLastModifiedDate: 1450476347000
x-ms-write-protection: false
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Thu, 07 Apr 2016 21:54:03 GMT
Connection: close

7z��'�����g˲���fW���

So, with the Range header we can request and read the archive's service structures at whatever offset we need; they are scattered throughout the body of the file.
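
The 32 bytes fetched above are the 7z signature header. Here is a Python 3 sketch that decodes it, assuming the layout I know from the 7z format description: a 6-byte signature, a 2-byte version, the start header CRC, then the offset, size and CRC of the next header, all little-endian. The function names are hypothetical, the sample header is synthetic (not the real archive's bytes), and the CRC fields are not verified here.

```python
import struct

SIGNATURE = b"7z\xbc\xaf\x27\x1c"
HEADER_SIZE = 32


def parse_signature_header(data: bytes) -> dict:
    """Decode the fixed 32-byte header at the start of a 7z archive."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 32 bytes")
    if data[:6] != SIGNATURE:
        raise ValueError("not a 7z archive")
    # '<' = little-endian, no padding: 1 + 1 + 4 + 8 + 8 + 4 = 26 bytes.
    major, minor, start_crc, next_offset, next_size, next_crc = struct.unpack(
        "<BBIQQI", data[6:HEADER_SIZE]
    )
    return {
        "version": (major, minor),
        "start_header_crc": start_crc,
        "next_header_offset": next_offset,  # counted from byte 32
        "next_header_size": next_size,
        "next_header_crc": next_crc,
    }


def next_header_range(info: dict) -> str:
    """Range header that would fetch the next header of the archive."""
    start = HEADER_SIZE + info["next_header_offset"]
    end = start + info["next_header_size"] - 1
    return f"Range: bytes={start}-{end}"


# Synthetic example: next header of 50 bytes, 1000 bytes past the start header.
sample = SIGNATURE + struct.pack("<BBIQQI", 0, 4, 0, 1000, 50, 0)
info = parse_signature_header(sample)
print(info)
print(next_header_range(info))
```

Feeding the real 32 bytes into parse_signature_header should yield the Range for the second request, which fetches the archive's header near the end of the file.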
