Getting the Alexa top 1 million sites directly from the server, unzipping the file, parsing the CSV, and getting each line as an array.
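// Dependencies: request (HTTP client), unzip (streaming zip parser), csv2 (streaming CSV parser)
var request = require('request');
var unzip = require('unzip');
var csv2 = require('csv2');

// Stream the zip from the server, unzip it on the fly, and print each CSV row as an array
request.get('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip')
  .pipe(unzip.Parse())
  .on('entry', function (entry) {
    entry.pipe(csv2()).on('data', console.log);
  });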

Plop this into a terminal to get the aforementioned nodejs script working on Ubuntu:

sudo apt-get install nodejs node-request npm
npm install unzip
npm install csv2

chilts: i'm diggin' it. thank you for sharing this.

+🍺

vionemc commented Sep 30, 2016

May I ask?
Will it always give the latest data, or is it the data from 2013, when you made this script?

ao commented Oct 29, 2016

@vionemc The hosted zip file is automatically updated daily by alexa.

Unfortunately the file no longer exists.

ao commented Nov 22, 2016

Alexa has stopped offering this file (top-1m.csv.zip). You can now get a free alternative from Statvoo:
https://statvoo.com/dl/top-1million-sites.csv.zip (Ref: https://statvoo.com/top/sites)

Interestingly, the download works again today. I contacted the Alexa support and they said that the service is discontinued. I added a follow up question on why the service is available again and for how long. Let's see what happens there. The page that originally referenced the file (https://support.alexa.com/hc/en-us/articles/200461990-Can-I-get-a-list-of-top-sites-from-an-API-) does not link to the file anymore.

Reply from Alexa:

"The file is temporarily available again, yes. We'll post updates concerning the file to our FAQ and Twitter. We do not have any additional updates to share at this time."

So we need to monitor, e.g., this: https://twitter.com/Alexa_Support

fizerkhan commented Nov 27, 2016 edited

The link http://s3.amazonaws.com/alexa-static/top-1m.csv.zip is working now. But I don't know whether the file is up to date. Any ideas?

@ao Is the file at https://statvoo.com/dl/top-1million-sites.csv.zip from Statvoo up to date? Or is it just a copy of the Alexa CSV?

Is there a smaller CSV anywhere? Maybe top 50,000 or top 100,000 sites?

Ayesh commented Dec 5, 2016

CSV file is working again! Nice!
The data is not exactly up to date. I would say it is about 2 months old. I have a site that currently ranks around 67,000, but the list has it in the 78,000s.

OpenDNS has published a new top 1 million list here: http://s3-us-west-1.amazonaws.com/umbrella-static/index.html While the list is not composed in the same way, we hope that it will be useful to some. Read up on the details here: https://blog.opendns.com/2016/12/14/cisco-umbrella-1-million/

xdanx commented Mar 6, 2017

Can confirm http://s3.amazonaws.com/alexa-static/top-1m.csv.zip still available

Thanks for this!

@karan1149 head -10000 top-1million-sites.csv will display the top 10,000. It will be faster than iterating through the whole list.
To find a specific domain: cat top-1million-sites.csv | grep github.com
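
For anyone who wants the same two operations inside the Node.js script rather than the shell, here is a minimal sketch. It assumes each line of the CSV looks like "rank,domain" and that the file is already sorted by rank; the function names (parseRankings, topN, findRank) and the sample data are made up for illustration and are not part of the gist.

```javascript
// Assumption: each line is "rank,domain" (for example "1,example.com").
// Lines without a comma are skipped.
function parseRankings(csvText) {
  return csvText
    .split(/\r?\n/)
    .filter(function (line) { return line.indexOf(',') !== -1; })
    .map(function (line) {
      var comma = line.indexOf(',');
      return {
        rank: Number(line.slice(0, comma)),
        domain: line.slice(comma + 1).trim()
      };
    });
}

// Like `head -N`: keeps the first n rows, so it relies on the file order.
function topN(rows, n) {
  return rows.slice(0, n);
}

// Exact match on the domain column; returns null when the domain is absent.
function findRank(rows, domain) {
  var hit = rows.find(function (row) { return row.domain === domain; });
  return hit ? hit.rank : null;
}

// Hypothetical sample data standing in for the real file contents.
var sample = '1,example.com\n2,example.org\n3,example.net\n';
var rows = parseRankings(sample);
console.log(topN(rows, 2));
console.log(findRank(rows, 'example.net'));
```

Note that grep github.com matches any line containing that text (subdomains and lookalikes included), while findRank here only matches the exact domain.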

@xdanx is that file still updated daily or is it from a specific date?

alexlehm commented May 7, 2017

The zip file has a timestamp of 5/7/2017 (today), so it is at least generated each day.
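
One way to check a timestamp without downloading the whole zip is a HEAD request. This is only a sketch: it assumes the server sends a Last-Modified header, which nothing in this thread confirms, and ageInDays and checkFreshness are hypothetical helper names.

```javascript
var http = require('http');

// Whole days between a Last-Modified header value and `now`.
// Returns null when the header is missing or cannot be parsed.
function ageInDays(lastModified, now) {
  var modified = new Date(lastModified);
  if (isNaN(modified.getTime())) {
    return null;
  }
  return Math.floor((now.getTime() - modified.getTime()) / 86400000);
}

// Sends a HEAD request so only the response headers are transferred.
// Assumption: the server includes a Last-Modified header in its response.
function checkFreshness(url, callback) {
  var req = http.request(url, { method: 'HEAD' }, function (res) {
    res.resume(); // discard any body so the socket is released
    callback(null, ageInDays(res.headers['last-modified'], new Date()));
  });
  req.on('error', callback);
  req.end();
}
```

Calling checkFreshness('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip', function (err, days) { console.log(err || days); }) would print the age in days, or null if no usable header came back. A recent timestamp only shows when the file was generated; it says nothing about how fresh the rankings inside it are.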

I need to generate a long list of URL-like names (>10million). Any idea?

@doctorhy this is definitely not the place to ask that; go to stackoverflow.com and ask there. Not sure why you didn't first google how to generate a random string in a language of your choice.

Why do people need those domains?

hossamhossny commented Sep 19, 2017 edited

It is definitely outdated. I have a site in the top million and it is not in the list. I would agree that the file is a few months old, 2-3 perhaps.
