@sdstrowes
Last active December 6, 2018 18:58
Alexa/Umbrella comparison notes

Top 20

Top-20 domains in each top-million set:

Alexa

Format: rank,domain

$ head -n 20 top1m-2017-01-19-alexa.csv 
1,google.com
2,youtube.com
3,facebook.com
4,baidu.com
5,yahoo.com
6,wikipedia.org
7,google.co.in
8,amazon.com
9,qq.com
10,google.co.jp
11,live.com
12,tmall.com
13,taobao.com
14,vk.com
15,sohu.com
16,twitter.com
17,instagram.com
18,linkedin.com
19,reddit.com
20,sina.com.cn

Umbrella

Format: rank,domain

$ head -n 20 top1m-2017-01-19-umbrella.csv 
1,com
2,net
3,google.com
4,org
5,microsoft.com
6,googleapis.com
7,www.google.com
8,facebook.com
9,doubleclick.net
10,g.doubleclick.net
11,clients4.google.com
12,googleads.g.doubleclick.net
13,google-analytics.com
14,www.facebook.com
15,youtube.com
16,fbcdn.net
17,apple.com
18,www.googleapis.com
19,www.google-analytics.com
20,googlesyndication.com

Overlap

168,069 domains ranked in the Alexa dataset also appear in the Umbrella dataset. This is an exact string match on the full name, with no attempt to do anything smart with subdomains.
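The command behind the overlap figure is not shown in these notes. One way to compute an exact-match overlap is sketched below; the function name `count_overlap` is hypothetical, and the sketch assumes both files use the `rank,domain` format shown above.

```shell
# count_overlap FILE_A FILE_B
# Print how many domains (column 2) of FILE_A also appear in FILE_B.
# Exact string comparison only; subdomains are not folded together.
count_overlap() {
    # First pass (NR == FNR) loads FILE_B's names; second pass checks FILE_A.
    awk -F, 'NR == FNR { seen[$2] = 1; next }
             ($2 in seen) { n++ }
             END { print n + 0 }' "$2" "$1"
}
```

It could be called as `count_overlap top1m-2017-01-19-alexa.csv top1m-2017-01-19-umbrella.csv`.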

Frequency of common names

DNS resolvers see a different set of names than the ones users type or click in their browsers. Counting the list entries that contain a given label makes the difference clear for some large content networks:

$ grep -c '[,\.]google\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:      213
top1m-2017-01-19-umbrella.csv:  3296

$ grep -c '[,\.]googlevideo\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:        1
top1m-2017-01-19-umbrella.csv: 40982

$ grep -c '[,\.]googleusercontent\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:        1
top1m-2017-01-19-umbrella.csv:  2797

$ grep -c '[,\.]blogspot\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:     9828
top1m-2017-01-19-umbrella.csv:  2113

$ grep -c '[,\.]akamai\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:        2
top1m-2017-01-19-umbrella.csv:  4687

$ grep -c '[,\.]edgecastcdn\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:        1
top1m-2017-01-19-umbrella.csv:   688

$ grep -c '[,\.]yahoo\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:       14
top1m-2017-01-19-umbrella.csv:  2605

$ grep -c '[,\.]tumblr\.' top1m-2017-01-19-*
top1m-2017-01-19-alexa.csv:     6691
top1m-2017-01-19-umbrella.csv:  2224
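The repeated `grep -c` calls above can be wrapped in a small helper to print one row per label. This is a minimal sketch: the name `label_counts` is hypothetical, and the pattern matches the label anywhere in a name as long as it is preceded by `,` or `.` and followed by `.`, as in the commands above.

```shell
# label_counts LABEL ALEXA_CSV UMBRELLA_CSV
# Print "LABEL alexa_count umbrella_count" for entries containing LABEL
# as a complete label that is followed by at least one more label.
label_counts() {
    label=$1
    # grep -c exits non-zero when the count is 0, so ignore its status.
    alexa=$(grep -c "[,.]${label}\." "$2" || true)
    umbrella=$(grep -c "[,.]${label}\." "$3" || true)
    printf '%s %s %s\n' "$label" "$alexa" "$umbrella"
}
```

For example: `for l in google googlevideo akamai; do label_counts "$l" top1m-2017-01-19-alexa.csv top1m-2017-01-19-umbrella.csv; done`. The label is interpolated into a regular expression, so it should not contain regex metacharacters.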

Domain levels

The two datasets treat subdomains differently. The Cisco Umbrella dataset features an unfiltered set of domains queried, from TLDs through various levels of increasingly specific subdomains. The Alexa dataset doesn't feature deep levels of subdomains in the same way: no name in this snapshot has more than four labels.

Alexa

$ awk -F, '{count=split($2,a,"."); print count}' data/top1m-2017-01-19-alexa.csv | sort | uniq -c | awk '{print $2,$1}' | sort -k1,1n
2 866603
3 129820
4 3577

Umbrella

$ awk -F, '{count=split($2,a,"."); print count}' data/top1m-2017-01-19-umbrella.csv | sort | uniq -c | awk '{print $2,$1}' | sort -k1,1n
1 1645
2 263277
3 492408
4 167774
5 57580
6 13521
7 2695
8 739
9 229
10 102
11 29
12 1
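The same histogram can be built in a single awk pass, without the intermediate `sort | uniq -c`. A minimal sketch, with the hypothetical name `label_histogram`, assuming the `rank,domain` format:

```shell
# label_histogram FILE
# Print "label_count number_of_names" for column 2 of FILE,
# sorted numerically by label count.
label_histogram() {
    # split() returns the number of dot-separated labels in the name.
    awk -F, '{ depth[split($2, parts, ".")]++ }
             END { for (d in depth) print d, depth[d] }' "$1" | sort -k1,1n
}
```

Calling `label_histogram data/top1m-2017-01-19-umbrella.csv` should produce the same two-column layout as the pipeline above.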

Top-level domains

The Cisco Umbrella dataset includes invalid TLDs by design. If I strip out queries sent to invalid TLDs and queries for the TLDs themselves, the Umbrella list shrinks slightly, to ~993,000 entries:

$ wc -l data/top1m-2017-01-19-umbrella*
  993317 data/top1m-2017-01-19-umbrella-trimmed.csv
 1000000 data/top1m-2017-01-19-umbrella.csv
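The command that produced the trimmed file is not shown here. The sketch below is one way such a filter could look, not necessarily what was used: it assumes a TLD list in the IANA `tlds-alpha-by-domain.txt` format (one TLD per line, `#` comment lines), and the function name `trim_invalid_tlds` is hypothetical.

```shell
# trim_invalid_tlds TLD_LIST CSV
# Keep only rows of CSV (rank,domain) whose last label is a TLD in
# TLD_LIST and that have at least two labels (bare TLDs are dropped).
trim_invalid_tlds() {
    awk -F, 'NR == FNR {
                 # First file: load valid TLDs, skipping comments and blanks.
                 if ($0 != "" && $0 !~ /^#/) tld[tolower($0)] = 1
                 next
             }
             {
                 n = split($2, parts, ".")
                 if (n > 1 && (tolower(parts[n]) in tld)) print
             }' "$1" "$2"
}
```

Usage might look like `trim_invalid_tlds data/tlds-alpha-by-domain.txt data/top1m-2017-01-19-umbrella.csv > data/top1m-2017-01-19-umbrella-trimmed.csv`.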

Some of the entries for invalid TLDs are interesting in their own right. They represent queries that hosts are actually making, perhaps in error (for example, software that regex-matches anything of the form 'string.string' and tries to resolve it). The list below is not complete: it shows only the first 200 single-label names that are not valid TLDs.

Format: rank,domain

$ awk -F, 'BEGIN {file="data/tlds-alpha-by-domain.txt"; while ((getline line < file) > 0) {if (line ~ /#/) continue; tld[tolower(line)] = 1}} {n=split($2,a,"."); if (n == 1 && !(a[1] in tld)) print $0}' data/top1m-2017-01-19-umbrella.csv | head -n200
1804,local
2086,home
5523,lan
10264,tcs
12856,url
12986,uop
14140,localdomain
14952,belkin
15944,js
19832,internal
21710,html
22237,localhost
22421,olx
22833,comhttps
24153,corp
26198,comhttp
33926,254
38492,comm
40296,invalid
42833,asp
46798,evernote
46855,evernotepre
46968,evernoteci
49802,php
51068,gateway
51531,example
51750,xml
51909,loc
52453,workgroup
53307,koko
53492,undefined
57052,fcname
57482,baseurl
57749,intern
58418,private
59758,coms
60106,api
62189,mynet
63893,ip
64453,adsl
67065,domain
67860,localdomain4
67983,dlinkrouter
68485,dlink
71040,oiwtech
72126,totolink
72736,intra
73161,router
73202,error
75869,ide
76284,dll
78000,dom
78074,pk5001z
78991,connectify
81716,actdsltmp
86527,wirelessap
90156,gre
90546,c
91642,ico
91782,homestation
92521,mshome
93586,maxprint
95977,htm
96949,asus
98923,con
100663,msh
103328,microsoftedge
103942,multilaserap
105044,ukbbc
108770,pvt
108912,priv
110562,lcl
111409,wifi
111749,bit
111991,telus
116369,intranet
116897,pixel
118241,tld
120657,proxy
121924,1
123509,c3t
129563,request
129753,station
130278,comus
131159,http
134312,go
135797,comnull
137860,guest
138728,0
139801,vacourts
140653,localnet
141608,ocm
142143,pri
145832,onion
146054,chsfas
148199,comundefined
148458,blinkap
148788,provider
148789,prv
149518,realtek
150203,gothan
153423,openvpn
153674,dhcp
154920,cpe
159894,aspx
160604,mymax
161912,comn
162129,jpg
163053,ssg5-serial
164740,com0
165025,168
166173,test
166257,gif
167121,labox
167133,nethttp
168548,comimages
170794,cpm
171066,storeportal
172434,jw
172761,ds
173097,default
175444,mail
177620,caixa
179102,wag320n
180034,b
183105,net11n
183512,fpt
184870,null
185251,krossprecision
187255,bad
188814,configured
190271,dc
195794,webvpn
197415,exe
198666,localdom
198953,hotspot300
200070,2
200955,the
202556,tda
206179,altitude
206651,xom
206825,root
207343,png
209107,jetpack
209337,comstories
211648,modem
214192,encore
214455,comtemplates
215950,engeniusrouter
216880,locale
218303,facebook
218857,copm
219551,localdomain6
223705,wireless
223848,inc
224257,orghttp
224428,255
225041,twitter
226634,css
228393,coom
230946,alienvault
231879,or
232155,gbl
233120,wireless-n
238853,vom
241962,comie8webslice
246231,rcse
246643,sym
247121,a
255032,https
255157,reliant
256121,c3technology
256209,i2p
261101,e
261884,aquario
264731,mte
265060,cmd
265344,txt
265969,dmz
267092,server
268718,i
271617,lanapvdipsf01
272110,p-661hnu-f1
273569,setup
273872,enhwi-n3
273929,homegateway
275315,cmo
278366,come
278553,netgear
279596,jsp
279660,nlundefined
282246,fds
283363,g
284893,yu
285150,m
285311,ssg20-wlan
289630,conm
289706,grp
290753,adidas
293926,ecom