Getting the Alexa top 1 million sites directly from the server, unzipping it, parsing the csv and getting each line as an array.
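var request = require('request');
var unzip = require('unzip');
var csv2 = require('csv2');

// Stream the zip from S3, unzip it on the fly, and parse each CSV row into an array.
request.get('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip')
  .pipe(unzip.Parse())
  .on('entry', function (entry) {
    // Each row is logged as [rank, domain], e.g. [ '1', 'google.com' ].
    entry.pipe(csv2()).on('data', console.log);
  });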
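For readers wondering what the CSV step does conceptually, here is a minimal sketch of turning streamed text chunks into row arrays. It is not csv2's implementation: makeRowParser is a hypothetical helper, and it assumes plain rank,domain rows with no quoted fields.

```javascript
// Hypothetical helper: accepts arbitrary text chunks and emits one array per CSV row.
// Assumes unquoted "rank,domain" rows, as in top-1m.csv; not a general CSV parser.
function makeRowParser(onRow) {
  var buffer = '';
  function emit(line) {
    line = line.replace(/\r$/, ''); // tolerate Windows line endings
    if (line !== '') onRow(line.split(','));
  }
  return {
    write: function (chunk) {
      buffer += chunk;
      var lines = buffer.split('\n');
      buffer = lines.pop(); // keep the trailing partial line for the next chunk
      lines.forEach(function (line) { emit(line); });
    },
    end: function () {
      emit(buffer); // flush a final line that has no trailing newline
      buffer = '';
    }
  };
}

// Usage: a chunk boundary may fall in the middle of a row.
var rows = [];
var parser = makeRowParser(function (row) { rows.push(row); });
parser.write('1,google.com\n2,you');
parser.write('tube.com\n3,facebook.com');
parser.end();
console.log(rows);
```

The point of the buffer is that network chunks do not respect line boundaries, so the last partial line has to be held back until the next chunk arrives.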
@fiestabonita

commented Jan 31, 2015

Plop this into a terminal to get the aforementioned nodejs script working on Ubuntu:

sudo apt-get install nodejs node-request npm
npm install unzip
npm install csv2

chilts: i'm diggin' it. thank you for sharing this.

@steakknife

commented Feb 2, 2016

+🍺

@vionemc

commented Sep 30, 2016

May I ask?
Will it always give the latest data, or is it the data from 2013, when you made this script?

@ao

commented Oct 29, 2016

@vionemc The hosted zip file is automatically updated daily by Alexa.

@BastienL

commented Nov 22, 2016

Unfortunately the file no longer exists.

@ao

commented Nov 22, 2016

Alexa has stopped offering this file (top-1m.csv.zip). You can now get a free alternative from Statvoo:
https://siteinfo.statvoo.com/dl/top-1million-sites.csv.zip (Ref: https://siteinfo.statvoo.com/top/sites)

@thomas-dorka

commented Nov 23, 2016

Interestingly, the download works again today. I contacted Alexa support and they said that the service is discontinued. I added a follow-up question on why the service is available again and for how long. Let's see what happens there. The page that originally referenced the file (https://support.alexa.com/hc/en-us/articles/200461990-Can-I-get-a-list-of-top-sites-from-an-API-) does not link to the file anymore.

@thomas-dorka

commented Nov 24, 2016

Reply from Alexa:

"The file is temporarily available again, yes. We'll post updates concerning the file to our FAQ and Twitter. We do not have any additional updates to share at this time."

So we need to monitor, for example, this: https://twitter.com/Alexa_Support

@fizerkhan

commented Nov 27, 2016

The link http://s3.amazonaws.com/alexa-static/top-1m.csv.zip is working now, but I don't know whether the file is up to date. Any ideas?

@ao Is the file at https://statvoo.com/dl/top-1million-sites.csv.zip from Statvoo up to date, or is it just a copy of the Alexa CSV?

@karan1149

commented Dec 1, 2016

Is there a smaller CSV anywhere? Maybe top 50,000 or top 100,000 sites?

@Ayesh

commented Dec 5, 2016

The CSV file is working again! Nice!
The data is not exactly up to date, though. I would say it is about 2 months old: I have a site that currently ranks around 67,000, and in the list it is in the 78,000s.

@djcornell

commented Dec 15, 2016

OpenDNS has published a new top 1 million list here: http://s3-us-west-1.amazonaws.com/umbrella-static/index.html While the list is not composed in the same way we hope that it will be useful to some. Read up on the details here: https://blog.opendns.com/2016/12/14/cisco-umbrella-1-million/

@xdanx

commented Mar 6, 2017

Can confirm http://s3.amazonaws.com/alexa-static/top-1m.csv.zip still available

@titus-shoats

commented Mar 25, 2017

Thanks for this!

@elhardoum

commented Mar 29, 2017

@karan1149 Running head -10000 top-1million-sites.csv will display the top 10,000. It will be faster than iterating through the whole list.
To find a specific domain, run: grep github.com top-1million-sites.csv
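The same two operations can be sketched in Node once the CSV text is in memory. topN and findRank are hypothetical helper names, and the sketch assumes rank,domain rows sorted by rank. One difference from the grep above: grep github.com also matches rows such as gist.github.com, whereas this lookup is an exact match.

```javascript
// Hypothetical helpers operating on the CSV text once it is in memory.
// Assumes "rank,domain" rows sorted by rank.

// Like `head -n`: return the first n non-empty rows as [rank, domain] arrays.
function topN(csvText, n) {
  return csvText.split('\n')
    .filter(function (line) { return line.trim() !== ''; })
    .slice(0, n)
    .map(function (line) { return line.trim().split(','); });
}

// Exact-match lookup: returns the rank as a number, or null if the domain is absent.
function findRank(csvText, domain) {
  var lines = csvText.split('\n');
  for (var i = 0; i < lines.length; i++) {
    var parts = lines[i].trim().split(',');
    if (parts[1] === domain) return Number(parts[0]);
  }
  return null;
}

// Made-up sample data for illustration only.
var sample = '1,google.com\n2,youtube.com\n3,github.com\n4,gist.github.com\n';
console.log(topN(sample, 2));
console.log(findRank(sample, 'github.com'));
```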

@linussjo

commented Apr 12, 2017

@xdanx is that file still updated daily or is it from a specific date?

@alexlehm

commented May 7, 2017

The zip file has a timestamp of 5/7/2017 (today), so it is at least regenerated each day.

@doctorhy

commented May 21, 2017

I need to generate a long list of URL-like names (>10million). Any idea?

@liesislukas

commented Jun 16, 2017

@doctorhy This is definitely not the place to ask that; go to stackoverflow.com and ask there. I'm also not sure why you wouldn't first google how to generate a random string in a language of your choice.

@laravelish

commented Aug 20, 2017

Why do people need those domains?

@hossamhossny

commented Sep 19, 2017

It is definitely outdated. I have a site in the top million and it is not in the list. I would agree that the file is a few months old, 2-3 perhaps.

@AdityaAnand1

commented Nov 4, 2017

Anybody know of a working solution? I'd prefer a script that I can run on my own to update my database.

@alaa-abdelsamie

commented Jan 2, 2018

hello
Can you explain to me how I can get this rule by mysql file
thank you.

@saoirs3

commented Feb 3, 2018

You might want to try this alternative from Cisco/OpenDNS

http://s3-us-west-1.amazonaws.com/umbrella-static/index.html

@chilts

Owner Author

commented Apr 4, 2019

It seems that this link http://s3.amazonaws.com/alexa-static/top-1m.csv.zip is still there (2019-04-04), though I agree with many observations above that the list looks stale. I know Alexa does have a delay in the public list, but this looks fairly old to me and much older than 3 months or so.

Why? Because https://www.alexa.com/siteinfo/cssminifier.com (a site I run) tells me it's ranked 95,425 in the world, but in the spreadsheet today it's more like 50,000. I remember that the site was up there at some point but that was a few years ago! So, all in all, I dunno.

And as an update for the code above, I created a dir in /tmp/top100 and did the following and voila, it still works with no changes and no errors:

$ node --version
v8.14.0
$ npm --version
6.9.0
$ npm install csv2 request unzip
+ csv2@0.1.1
+ request@2.88.0
+ unzip@0.1.11
updated 3 packages and audited 366 packages in 1.52s

$ node alexa.js
[ '1', 'google.com' ]
[ '2', 'youtube.com' ]
[ '3', 'facebook.com' ]
...etc...
[ '999998', 'gaest.com' ]
[ '999999', 'gazetehaberler.com' ]
[ '1000000', 'gehring-group.com' ]

Thanks to everyone else for also suggesting other alternatives that are kept up to date, even if they are slightly different.

@hisivasankar

commented May 27, 2019

Does anybody know where I can get similar data with metadata such as website category?

www.instagram.com => social
www.google.com => search
etc.?

@Rasoul-Jafari

commented Jun 1, 2019

can anybody help me with feature extraction process??? I don't have the knowledge to use python code for feature extraction...If you do, I'd be happy to get some help...thanks in advance

@chilts

Owner Author

commented Jun 5, 2019

> can anybody help me with feature extraction process??? I don't have the knowledge to use python code for feature extraction...If you do, I'd be happy to get some help...thanks in advance

This is JavaScript in nodejs, not Python. Perhaps you can google for an example in Python instead.

@unmem

commented Jun 22, 2019

Hi, there

➜ curl -I http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
HTTP/1.1 200 OK
x-amz-id-2: 0QFTzV4zLKRksLmC4JWG/iE/qVKQlSsr7m+lZbnRlrxocqsYbqgHMnjxlBuMTfWQhwrt7/NsULA=
x-amz-request-id: DA7628383A36A79D
Date: Sat, 22 Jun 2019 10:54:43 GMT
Last-Modified: Sat, 22 Jun 2019 10:38:33 GMT   # Does this mean it's up to date?
ETag: "5a4fdd26b49d1e579335dde414012297"
x-amz-meta-alexa-last-modified: 20190622103832
Accept-Ranges: bytes
Content-Type: application/zip
Content-Length: 96
@musabgultekin

commented Jun 28, 2019

> Last-Modified: Sat, 22 Jun 2019 10:38:33 GMT   # Does this mean it's up to date?

No, it could just be a cached response header.
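If you want to track the x-amz-meta-alexa-last-modified value over time rather than eyeball it, a small parser helps. This is a sketch under an assumption: the value 20190622103832 looks like YYYYMMDDhhmmss in UTC (it sits one second before the Last-Modified header), but that format is not documented in this thread. parseAlexaTimestamp and ageInDays are hypothetical names. As noted above, this only tells you when the object was written, not how fresh the ranking data inside it is.

```javascript
// Hypothetical helper: parse a value like "20190622103832" into a Date.
// Assumption: the format is YYYYMMDDhhmmss in UTC; this is inferred, not documented.
function parseAlexaTimestamp(value) {
  var m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(value);
  if (!m) return null;
  return new Date(Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6]));
}

// Age of the reported timestamp in whole days, relative to `now` (a Date).
function ageInDays(value, now) {
  var then = parseAlexaTimestamp(value);
  if (!then) return null;
  return Math.floor((now.getTime() - then.getTime()) / 86400000);
}

var stamp = parseAlexaTimestamp('20190622103832');
console.log(stamp.toISOString());
console.log(ageInDays('20190622103832', new Date(Date.UTC(2019, 5, 29, 12, 0, 0))));
```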

@garrett-leyenaar

commented Oct 19, 2019

http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
...does appear to be up-to-date. I think it's updated every 90 days. I checked for a domain that was registered April 2019 and it's in the list

@chilts

Owner Author

commented Oct 21, 2019

Ah thanks @garrett-leyenaar, that's good to know it's still being updated.

@chilts

Owner Author

commented Oct 21, 2019

Interesting that today (2019-10-22) I re-ran my steps from https://gist.github.com/chilts/7229605#gistcomment-2880207 and noticed that I only get entries 1 to 647605 printed out. So I downloaded the .zip file itself, and sure enough it doesn't have any entry after that. Whether it's a one-off problem today, I dunno. :)

$ unzip top-1m.csv.zip 
Archive:  top-1m.csv.zip
  inflating: top-1m.csv              

$ tail -n 10 top-1m.csv
647596,nic.xn--bck1b9a5dre4c
647597,not3.io
647598,omyfashiona.com
647599,otcbtc.com
647600,quranapk.com
647601,stylewithlife.com
647602,thecityvacation.com
647603,transferdmc.com
647604,uspersonality.com
647605,villasbeachfront.com.mx
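A truncated download like this can be caught automatically. Below is a rough sketch that checks whether the ranks run consecutively from 1 and whether the row count matches what you expect. checkRanks is a hypothetical helper, and the sample data is made up for illustration; for the real file you would pass the full CSV text and 1000000.

```javascript
// Hypothetical helper: checks that ranks run 1, 2, 3, ... with no gaps
// and reports how many rows were seen. Assumes "rank,domain" rows.
function checkRanks(csvText, expectedCount) {
  var lines = csvText.split('\n').filter(function (l) { return l.trim() !== ''; });
  var firstBadLine = null;
  for (var i = 0; i < lines.length; i++) {
    var rank = Number(lines[i].split(',')[0]);
    if (rank !== i + 1) { firstBadLine = i + 1; break; }
  }
  return {
    count: lines.length,                 // number of non-empty rows
    consecutive: firstBadLine === null,  // true if no gap or misordering was found
    firstBadLine: firstBadLine,          // 1-based line number of the first mismatch
    complete: firstBadLine === null && lines.length === expectedCount
  };
}

// Made-up sample: three rows where five were expected.
var truncated = '1,google.com\n2,youtube.com\n3,facebook.com\n';
console.log(checkRanks(truncated, 5));
```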