The top 100 hosts by total WARC record size (bytes) in Common Crawl CC-MAIN-2021-04.
url_host_name length
d2y1pz2y630308.cloudfront.net 35553494127
photos.google.com 22829413806
www.download.p4c.philips.com 19523224128
quod.lib.umich.edu 18400799789
s3.amazonaws.com 17043193709
support.google.com 16945185389
www.wmagazine.com 15723224197
api.whatsapp.com 15241728762
www.thecut.com 15017634948
www.cnn.com 14364985593
img1.wsimg.com 13996892292
images.unsplash.com 12482012571
pdds-cdn.ucweb.com 11234995131
www.csmedia1.com 11119980176
www.wsj.com 11063037618
www.radio.com 10995371850
nymag.com 10372284961
zalbarath666.wordpress.com 10304612805
www.vulture.com 10231405348
www.booking.com 9920142409
www.youtube.com 9722092068
www.grubstreet.com 9718432294
storage.googleapis.com 9556181842
www.copernicus.eu 9496837035
www.truemuslims.net 9423417362
centauro996.wordpress.com 9005045472
abagond.wordpress.com 8836294316
edepot.wur.nl 8816239350
s3-eu-west-1.amazonaws.com 8774566472
openparachute.wordpress.com 8685632992
junction10.wordpress.com 8606957763
skribh.wordpress.com 8435128098
www.govinfo.gov 8425251692
bravesandstuff.wordpress.com 8378810064
growingupnyc.cityofnewyork.us 8365169140
stacks.cdc.gov 8041631750
play.google.com 8019079157
tel.archives-ouvertes.fr 7751767734
nebula.wsimg.com 7491400203
www.tripadvisor.com 7470863282
www.walgreens.com 7388065639
sportsfanshop.jcpenney.com 7314605392
cn.tripadvisor.com 7249194336
www.popoffquotidiano.it 7164965306
www.reuters.com 7150591817
www.pastemagazine.com 7106773385
www.zdnet.com 7055571821
secureservercdn.net 7053245579
docs.wixstatic.com 6992199829
ar.tripadvisor.com 6862174994
www.flickr.com 6798533455
videos.files.wordpress.com 6732709720
icons8.com 6727251001
blogdowashingtondourado.wordpress.com 6707790195
yukarikayalar.wordpress.com 6684868960
resources.finalsite.net 6570666660
jadedphotography.wordpress.com 6557908107
www.gazetaprawna.pl 6552808975
hrs.isr.umich.edu 6518189820
www.airbnb.com 6387907013
satyamshot.wordpress.com 6363814933
static.wikia.nocookie.net 6337795595
downloads.zdnet.com 6333677699
www.upworthy.com 6299689026
amomama.fr 6278052616
supportaustralia.wordpress.com 6214053012
tw.news.yahoo.com 6193207901
sports.yahoo.com 6064245848
www.radiocorriere.teche.rai.it 6010233095
www.education.gov.za 5977157349
www.bbc.com 5969255385
th.tripadvisor.com 5967019969
dailycaller.com 5956894544
www.behance.net 5934179090
www.se.com 5927899394
www.nordstrom.com 5910066718
www.vol.at 5800695665
patentimages.storage.googleapis.com 5767272903
steemit.com 5761458616
no.tripadvisor.com 5732054747
files.danskkulturarv.dk 5721324889
pl.tripadvisor.com 5691678630
music.apple.com 5686284734
raulsseixas.wordpress.com 5537243433
chairish-prod.freetls.fastly.net 5476048792
www.coursera.org 5445598880
escholarship.org 5440644588
www.cnbc.com 5435897734
theintercept.com 5422550531
www.24h.com.vn 5365141844
downloads.hindawi.com 5347832777
podcasts.google.com 5340687113
www.mediasetplay.mediaset.it 5332844530
www.fastcompany.com 5321486511
north69.sillapa.net 5307457658
30demayo.wordpress.com 5270041113
finance.yahoo.com 5252952384
www.afternic.com 5235001701
www.vrbo.com 5122862378
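
The byte counts above are hard to compare by eye. A minimal sketch of reading the listing and printing sizes in GiB follows, assuming the two-column, whitespace-separated layout shown here; the function names `parse_host_sizes` and `to_gib` are hypothetical helpers for illustration and are not part of the original gist.

```python
from typing import Iterable


def parse_host_sizes(lines: Iterable[str]) -> list[tuple[str, int]]:
    """Parse 'host bytes' rows, skipping the header and any malformed lines."""
    rows = []
    for line in lines:
        parts = line.split()
        # Keep only rows of the form "<host> <integer byte count>".
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        rows.append((parts[0], int(parts[1])))
    return rows


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


# First rows of the listing, copied verbatim as sample input.
sample = """url_host_name length
d2y1pz2y630308.cloudfront.net 35553494127
photos.google.com 22829413806
www.download.p4c.philips.com 19523224128
"""

rows = parse_host_sizes(sample.splitlines())
for host, size in rows:
    print(f"{host:40s} {to_gib(size):8.2f} GiB")
```

To process the full listing, pass the lines of a saved copy of the file to `parse_host_sizes` in place of `sample.splitlines()`.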