@edsu
Last active February 7, 2021 19:08
The top 100 host counts in Common Crawl CC-MAIN-2021-04
url_host_name total
getpocket.com 1640422
auth.webnode.com 1056353
telegram.me 543797
plus.google.com 472899
www.ncbi.nlm.nih.gov 433041
api.whatsapp.com 394818
web.skype.com 338835
www.amazon.com 296540
dx.doi.org 290895
polskapress.pl 265670
www.afternic.com 262961
pubmed.ncbi.nlm.nih.gov 229444
www.youtube.com 212668
imageshack.com 197240
social-plugins.line.me 196877
www.flickr.com 188827
idp.springer.com 179517
www.reuters.com 175401
daemon.indapass.hu 168306
wordpress.com 165339
docs.google.com 160734
www.nbcnews.com 157246
multimarket.com.es 156627
www.tumblr.com 148818
www.bloomberg.com 142896
itunes.apple.com 133008
onlinelibrary.wiley.com 130399
www.mylo.id 130140
tsutaya.tsite.jp 129497
photos.google.com 128609
www.booking.com 126402
www.pinterest.com 123730
open.spotify.com 120739
johnnyreads.com 119356
www.tistory.com 116729
apps.apple.com 116224
m.vk.com 115316
delicious.com 114465
support.microsoft.com 112506
sso.accounts.dowjones.com 112309
link.springer.com 111808
www.theguardian.com 111755
music.apple.com 111737
support.google.com 111114
app.photobucket.com 110227
medicinecalendars.stanford.edu 107485
sites.google.com 107193
archive.org 106333
www.vrbo.com 106138
www.nature.com 105433
www.proz.com 105084
www.yahoo.com 104990
www.geocaching.com 104616
idp.nature.com 104376
vimeo.com 103691
www.xing.com 102934
balkans.aljazeera.net 101482
market.yandex.ru 101470
www.honigland.rlp.de 100476
www.imdb.com 100166
www.amazon.co.jp 100022
thanhnien.vn 99716
play.google.com 99699
www.radio.com 99696
www.worldcat.org 98262
www.gu.se 98255
pc.video.dmkt-sp.jp 97923
m.pokupki.market.yandex.ru 97883
cna.public.lu 97451
www.bing.com 97362
www.europarl.europa.eu 97311
www.booklistonline.com 96576
www.biblegateway.com 95887
m.market.yandex.ru 95764
afisha.yandex.ru 95370
www.wsj.com 95048
pokupki.market.yandex.ru 94723
store.acer.com 93552
avia.yandex.ru 93407
github.com 93343
www.wetter.rlp.de 93151
www.huffingtonpost.com 93015
ead.lib.virginia.edu 92640
wp.me 92607
www.nordstrom.com 91964
social.msdn.microsoft.com 91730
compose.mail.yahoo.com 91686
realty.yandex.ru 90914
www.wowhead.com 90470
social.technet.microsoft.com 90438
techcrunch.com 90212
www.myplate.gov 90089
signon.thomsonreuters.com 89805
rabota.yandex.ru 88868
arstechnica.com 88783
www.huffpost.com 88568
www.bbc.co.uk 88495
validate.perfdrive.com 87714
gesetze.berlin.de 87366