
Sawood Alam ibnesayeed

@ibnesayeed
ibnesayeed / twitter-barackobama-capture-languages.tsv
Created March 22, 2018 16:22
Twitter BarackObama Capture Language Distribution
DateTime Archive Status Language URIM
20070312000128 IA 200 en http://web.archive.org/web/20070312000128/http://twitter.com:80/BarackObama
20070312000213 IA 200 en http://web.archive.org/web/20070312000213/http://twitter.com:80/barackobama
20070320110428 IA 200 en http://web.archive.org/web/20070320110428/http://twitter.com:80/BarackObama
20070429014820 IA 200 en http://web.archive.org/web/20070429014820/http://twitter.com:80/barackobama
20070505120209 IA 200 en http://web.archive.org/web/20070505120209/http://twitter.com:80/BarackObama
20070513015443 IA 200 en http://web.archive.org/web/20070513015443/http://twitter.com:80/BarackObama
20070513141310 IA 200 en http://web.archive.org/web/20070513141310/http://twitter.com:80/BarackObama
20070514045148 IA 200 en http://web.archive.org/web/20070514045148/http://twitter.com:80/BarackObama
20070524090236 IA 200 en http://web.archive.org/web/20070524090236/http://twitter.com:80/BarackObama
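The gist title refers to a language distribution. A minimal Python sketch of how such a tally might be computed from rows like the ones above follows. It assumes the columns are whitespace-separated in the order given by the header row; the helper name `language_distribution` and the inline `SAMPLE` (three rows copied from the preview) are illustrative and not part of the gist.

```python
from collections import Counter

# Three rows copied from the preview above; the full file is much larger.
SAMPLE = """\
DateTime Archive Status Language URIM
20070312000128 IA 200 en http://web.archive.org/web/20070312000128/http://twitter.com:80/BarackObama
20070312000213 IA 200 en http://web.archive.org/web/20070312000213/http://twitter.com:80/barackobama
20070320110428 IA 200 en http://web.archive.org/web/20070320110428/http://twitter.com:80/BarackObama
"""

def language_distribution(text):
    """Count captures per language code (hypothetical helper)."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    lang_idx = header.index("Language")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows with a missing or extra column
        counts[fields[lang_idx]] += 1
    return counts

print(language_distribution(SAMPLE))
```

Rows with an empty column are skipped here rather than guessed at; splitting on a literal tab would be the safer choice when reading the original `.tsv` file directly.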
@ibnesayeed
ibnesayeed / twitter-lang-sample.warc
Created March 21, 2018 13:56
Specific WARC segments extracted from our test crawl for illustration
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2018-03-16T21:58:38Z
WARC-Filename: WEB-20180316215838145-00000-35~2a25c0c89897~8443.warc.gz
WARC-Record-ID: <urn:uuid:c0516302-5c3e-4f80-bdc3-2a8d2f0e5484>
Content-Type: application/warc-fields
Content-Length: 377
software: Heritrix/3.2.0 http://crawler.archive.org
ip: 172.17.0.2
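For illustration, the record header above can be read as a version line followed by `Name: value` fields. The sketch below parses a copy of that header in plain Python; `parse_warc_header` is a hypothetical helper, and it ignores details of real WARC files (CRLF line endings, the record payload, compression), for which a dedicated WARC library would be the better tool.

```python
# Header lines copied from the warcinfo record above.
RECORD_HEADER = """\
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2018-03-16T21:58:38Z
WARC-Filename: WEB-20180316215838145-00000-35~2a25c0c89897~8443.warc.gz
WARC-Record-ID: <urn:uuid:c0516302-5c3e-4f80-bdc3-2a8d2f0e5484>
Content-Type: application/warc-fields
Content-Length: 377
"""

def parse_warc_header(block):
    """Return (version_line, {field_name: value}) for one record header."""
    lines = block.strip().splitlines()
    version = lines[0]
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates and
        # URNs contain further colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields

version, fields = parse_warc_header(RECORD_HEADER)
print(version, fields["WARC-Type"], fields["Content-Length"])
```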
@ibnesayeed
ibnesayeed / twitter-server-side-localization.curl
Created March 20, 2018 20:19
Twitter pages translated on the server side
$ curl --silent "https://twitter.com/?lang=ar" | grep "<meta name=\"description\""
<meta name="description" content="من الأخبار العاجله حتى الترفيه إلى الرياضة والسياسة، احصل على القصه كامله مع التعليق المباشر.">
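The `grep` above only shows the matching line. As a rough sketch, the same check can be made structurally by reading the `lang` attribute of `<html>` and the description `<meta>` tag with Python's standard `html.parser`. The `PAGE` fragment is hand-assembled for this example (an assumption, not a real Twitter response), and `LangProbe` is a hypothetical class name.

```python
from html.parser import HTMLParser

class LangProbe(HTMLParser):
    """Record the page language and meta description, if present."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.lang = attrs.get("lang")
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

# Hand-assembled fragment; the description text is copied from the output above.
PAGE = (
    '<html lang="ar"><head>'
    '<meta name="description" content="من الأخبار العاجله حتى الترفيه إلى '
    'الرياضة والسياسة، احصل على القصه كامله مع التعليق المباشر.">'
    '</head></html>'
)

probe = LangProbe()
probe.feed(PAGE)
print(probe.lang, probe.description)
```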
@ibnesayeed
ibnesayeed / twitter-cache-headers.curl
Created March 20, 2018 17:44
Cache-related headers in Twitter
$ curl -I https://twitter.com/
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
date: Sun, 18 Mar 2018 17:43:25 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Sun, 18 Mar 2018 17:43:25 GMT
pragma: no-cache
...
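To make the headers above easier to read, here is a small sketch that splits `cache-control` into directives and checks whether `expires` is earlier than `date`. The `HEADERS` dictionary is copied from the output above; `cache_directives` and `forbids_caching` are hypothetical helper names, and the check covers only these two signals, not the full HTTP caching rules.

```python
from email.utils import parsedate_to_datetime

# Values copied from the response headers above.
HEADERS = {
    "cache-control": "no-cache, no-store, must-revalidate, pre-check=0, post-check=0",
    "date": "Sun, 18 Mar 2018 17:43:25 GMT",
    "expires": "Tue, 31 Mar 1981 05:00:00 GMT",
    "pragma": "no-cache",
}

def cache_directives(value):
    """Split a Cache-Control value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg or None
    return directives

def forbids_caching(headers):
    """Return (has_no_store, expires_not_after_date) for a header dict."""
    directives = cache_directives(headers.get("cache-control", ""))
    already_expired = False
    if "expires" in headers and "date" in headers:
        expires = parsedate_to_datetime(headers["expires"])
        served = parsedate_to_datetime(headers["date"])
        already_expired = expires <= served
    return "no-store" in directives, already_expired

print(cache_directives(HEADERS["cache-control"]))
print(forbids_caching(HEADERS))
```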
@ibnesayeed
ibnesayeed / twitter-cache-headers-curl.bash
Created March 20, 2018 17:40
Twitter tried hard to prevent caching
$ curl -i --silent -H "Accept-Language: ur" https://twitter.com/ | more
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 130455
content-type: text/html;charset=utf-8
date: Tue, 20 Mar 2018 17:35:31 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Tue, 20 Mar 2018 17:35:31 GMT
pragma: no-cache
server: tsa_b
$ curl -I https://twitter.com/
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 124947
content-type: text/html;charset=utf-8
date: Tue, 20 Mar 2018 17:29:58 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Tue, 20 Mar 2018 17:29:58 GMT
pragma: no-cache
server: tsa_b
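A quick way to compare the two responses above is to parse both header blocks and list the fields whose values differ. The sketch below does that on copies of the two outputs; `parse_headers` and `differing_fields` are hypothetical helpers. A differing `content-length` is consistent with different page bodies, but the requests were also made several minutes apart, so it is not proof of language negotiation on its own.

```python
# Header blocks copied from the two responses above.
WITH_UR = """\
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 130455
content-type: text/html;charset=utf-8
date: Tue, 20 Mar 2018 17:35:31 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Tue, 20 Mar 2018 17:35:31 GMT
pragma: no-cache
server: tsa_b
"""

DEFAULT = """\
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 124947
content-type: text/html;charset=utf-8
date: Tue, 20 Mar 2018 17:29:58 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Tue, 20 Mar 2018 17:29:58 GMT
pragma: no-cache
server: tsa_b
"""

def parse_headers(block):
    """Return (status_line, {lowercased_name: value}) for a header block."""
    lines = block.strip().splitlines()
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return lines[0], fields

def differing_fields(a, b):
    """Names of headers whose values differ or that appear on one side only."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

ur_fields = parse_headers(WITH_UR)[1]
default_fields = parse_headers(DEFAULT)[1]
print(differing_fields(ur_fields, default_fields))
```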
@ibnesayeed
ibnesayeed / Twitter-BarackObama-Capture-Languages.tsv
Created March 20, 2018 00:20
Twitter BarackObama Capture Language Distribution
@ibnesayeed
ibnesayeed / twitter-lang-curl.bash
Created March 19, 2018 03:15
Twitter Language Session Behavior Explored in cURL
$ curl --silent -c /tmp/twitter.cookie "https://twitter.com/?lang=ar" | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">
$ grep lang /tmp/twitter.cookie
twitter.com FALSE / FALSE 0 lang ar
$ curl --silent https://twitter.com/ | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">
$ curl --silent -H "Accept-Language: ur" https://twitter.com/ | grep "<html"
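The cookie jar line in the transcript above follows the Netscape cookie file layout that curl writes: domain, subdomain flag, path, secure flag, expiry, name, value. A minimal sketch of reading that one line in Python follows; `parse_cookie_line` and the field names in `FIELDS` are illustrative choices, and the line is split on whitespace because that is how it appears in this preview (the file itself is tab-separated, so splitting on tabs would be more robust for empty values).

```python
# The cookie line as it appears in the transcript above.
COOKIE_LINE = "twitter.com FALSE / FALSE 0 lang ar"

# Field order of the Netscape cookie file format; the names are our own labels.
FIELDS = ("domain", "include_subdomains", "path", "secure",
          "expires", "name", "value")

def parse_cookie_line(line):
    """Parse one cookie jar line into a dict (hypothetical helper)."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    cookie = dict(zip(FIELDS, parts))
    cookie["include_subdomains"] = cookie["include_subdomains"] == "TRUE"
    cookie["secure"] = cookie["secure"] == "TRUE"
    cookie["expires"] = int(cookie["expires"])  # 0 is how curl records a session cookie
    return cookie

cookie = parse_cookie_line(COOKIE_LINE)
print(cookie["name"], cookie["value"], cookie["expires"])
```

The expiry of 0 fits the session behavior the gist title describes: the `lang` value is kept only for the lifetime of the client session unless the jar is passed back on later requests.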
@ibnesayeed
ibnesayeed / twitter-lang.warc
Created March 19, 2018 02:27
WARC File of Twitter Language Analysis
@ibnesayeed
ibnesayeed / twitter-lang-curl.cookie
Created March 18, 2018 21:34
Twitter Language Cookie in cURL