Getting the Alexa top 1 million sites directly from the server, unzipping it, parsing the csv and getting each line as an array.
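// Stream the Alexa top 1 million zip, unzip it on the fly,
// and log each CSV row as an array.
var request = require('request');
var unzip = require('unzip');
var csv2 = require('csv2');

request.get('http://s3.amazonaws.com/alexa-static/top-1m.csv.zip')
  .pipe(unzip.Parse())
  .on('entry', function (entry) {
    // Each zip entry is a readable stream: parse it as CSV and print every row.
    entry.pipe(csv2()).on('data', console.log);
  });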
@chilts
Author

chilts commented Jun 5, 2019

> Can anybody help me with the feature extraction process? I don't know how to use Python code for feature extraction. If you do, I'd be happy to get some help. Thanks in advance.

This is JavaScript in nodejs, not Python. Perhaps you can google for an example in Python instead.

@snowman

snowman commented Jun 22, 2019

Hi, there

➜ curl -I http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
HTTP/1.1 200 OK
x-amz-id-2: 0QFTzV4zLKRksLmC4JWG/iE/qVKQlSsr7m+lZbnRlrxocqsYbqgHMnjxlBuMTfWQhwrt7/NsULA=
x-amz-request-id: DA7628383A36A79D
Date: Sat, 22 Jun 2019 10:54:43 GMT
Last-Modified: Sat, 22 Jun 2019 10:38:33 GMT    # Does this mean it's up to date?
ETag: "5a4fdd26b49d1e579335dde414012297"
x-amz-meta-alexa-last-modified: 20190622103832
Accept-Ranges: bytes
Content-Type: application/zip
Content-Length: 96

@musabgultekin

musabgultekin commented Jun 28, 2019

No, that could just be a cached response header.
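
To check freshness programmatically rather than by eye, the `x-amz-meta-alexa-last-modified` value from the `curl -I` output can be turned into a date and compared with today. This is only a rough sketch and not part of the original gist: `parseAlexaTimestamp` and `ageInDays` are hypothetical helper names, and it assumes the header is a UTC timestamp in `YYYYMMDDHHMMSS` form, which the sample output suggests but nothing here confirms. If the response itself is cached, this header may be stale too.

```javascript
// Assumption: the header value is a UTC timestamp formatted as YYYYMMDDHHMMSS.
function parseAlexaTimestamp(value) {
  const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(value);
  if (!m) return null; // not in the assumed format
  const [, y, mo, d, h, mi, s] = m.map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi, s));
}

// Hypothetical helper: whole days between the header timestamp and `now`.
function ageInDays(headers, now) {
  const stamp = parseAlexaTimestamp(headers['x-amz-meta-alexa-last-modified'] || '');
  if (!stamp) return null;
  return Math.floor((now - stamp) / 86400000); // 86,400,000 ms per day
}

// Header values copied from the curl output earlier in the thread.
const sampleHeaders = {
  'last-modified': 'Sat, 22 Jun 2019 10:38:33 GMT',
  'x-amz-meta-alexa-last-modified': '20190622103832',
};

const stamp = parseAlexaTimestamp(sampleHeaders['x-amz-meta-alexa-last-modified']);
console.log(stamp.toISOString()); // 2019-06-22T10:38:32.000Z
```

Passing `new Date()` as `now` to `ageInDays` gives the age of the list at the time of the check.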

@garrett-leyenaar

garrett-leyenaar commented Oct 19, 2019

http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
...does appear to be up to date. I think it's updated every 90 days. I checked for a domain that was registered in April 2019 and it's in the list.

@chilts
Author

chilts commented Oct 21, 2019

Ah thanks @garrett-leyenaar, that's good to know it's still being updated.

@chilts
Author

chilts commented Oct 21, 2019

Interesting that today (2019-10-22) I re-ran my steps from https://gist.github.com/chilts/7229605#gistcomment-2880207 and noticed that I only get entries 1 to 647605 printed out. So I downloaded the .zip file itself, and sure enough it doesn't have any entries after that. Whether it's a one-off problem today, I dunno. :)

$ unzip top-1m.csv.zip 
Archive:  top-1m.csv.zip
  inflating: top-1m.csv              

$ tail -n 10 top-1m.csv
647596,nic.xn--bck1b9a5dre4c
647597,not3.io
647598,omyfashiona.com
647599,otcbtc.com
647600,quranapk.com
647601,stylewithlife.com
647602,thecityvacation.com
647603,transferdmc.com
647604,uspersonality.com
647605,villasbeachfront.com.mx
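
The same check can be scripted so a short file is noticed automatically. The following is a minimal sketch, assuming the unzipped file is plain `rank,domain` lines with no header row; `summarizeList` and `isComplete` are hypothetical names, and the sample rows are the last three lines of the `tail` output above. It reads the whole text into memory, which is simple but not the most economical approach for a large file.

```javascript
// Hypothetical helper: count the rows of a "rank,domain" CSV and report the last one.
function summarizeList(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) {
    return { count: 0, lastRank: null, lastDomain: null };
  }
  const last = lines[lines.length - 1];
  const comma = last.indexOf(',');
  return {
    count: lines.length,
    lastRank: Number(last.slice(0, comma)),
    lastDomain: last.slice(comma + 1),
  };
}

// Hypothetical helper: does the list reach the expected size?
function isComplete(summary, expected = 1000000) {
  return summary.count >= expected && summary.lastRank >= expected;
}

// Last three rows from the tail output above.
const sample =
  '647603,transferdmc.com\n647604,uspersonality.com\n647605,villasbeachfront.com.mx\n';

const summary = summarizeList(sample);
console.log(summary.count, summary.lastRank, summary.lastDomain);
// 3 647605 villasbeachfront.com.mx
console.log(isComplete(summary)); // false
```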

@yosunga

yosunga commented Oct 30, 2019

www.ktservis.com.tr used to mirror this file, but I think they removed it because of copyright issues. Any other mirrors?

@rustyspoonz

rustyspoonz commented Dec 2, 2019

No need for a mirror, the file is still available using the URL from the script: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

@yosunga

yosunga commented Dec 3, 2019

@hamlatzis

hamlatzis commented Feb 13, 2020

The alexa zip file contains only 839000 entries

1m != 839000

@mikej165

mikej165 commented Jun 23, 2020

As of today, the Alexa "one million" contains 547855 entries. Very strange.

@meeeller

meeeller commented Jun 26, 2020

Today it is 763k. Last summer it started falling short of "one million". I am here again trying to figure out why.

We used Alexa in the past, and I still can't find anything on why it is so short of 1 million. Good paper on T1M rankings pdf

@vladimarius

vladimarius commented Sep 16, 2020

Alexa no longer provides that list for free.
You can download the list using their API.

The price is:
Alexa Top Sites API Requests (1 unit = 10 URLs returned) | $0.025 / unit

That works out to $0.0025 per URL, so for 1 million domains you'd pay 0.0025 * 1000000 = $2500 😃😃😃
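
The arithmetic can be written out as a small sketch, assuming the quoted price of $0.025 per unit and 10 URLs per unit still holds and that partial units are billed as whole units (the latter is a guess, not something stated above). `costForUrls` is a hypothetical name. Working in thousandths of a dollar keeps the sums in integers.

```javascript
// Figures quoted in the comment above; treat them as assumptions.
const PRICE_PER_UNIT_MILLIS = 25; // $0.025 expressed in thousandths of a dollar
const URLS_PER_UNIT = 10;

// Hypothetical helper: cost in dollars for `n` URLs, rounding up to whole units.
function costForUrls(n) {
  const units = Math.ceil(n / URLS_PER_UNIT);
  return (units * PRICE_PER_UNIT_MILLIS) / 1000;
}

console.log(costForUrls(1000000)); // 2500
```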

@chilts
Author

chilts commented Oct 1, 2020

💵 💲 Thanks for that info @vladimarius ... good to know it is still available. I imagine people will find other sources though with that price! Thanks again.

@seupedro

seupedro commented Jan 23, 2021

Alexa still works in 2021.

@cameck

cameck commented Apr 8, 2021

Just spoke with Amazon about this. There's no guarantee that the free list contains all 1 million, but it is still updated daily.

@chilts
Author

chilts commented May 5, 2021

Thanks @seupedro and @cameck, always good to know that it's still working and the CSV is available. I wonder if the script still works. I'll try it again sometime soon and paste back here what worked or didn't and an update if needed.

@leilii

leilii commented May 12, 2021

Hi,
How can I get just the top 10, for example, into a text file instead of the whole list?
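
One possible approach, sketched under the assumption that the unzipped CSV is `rank,domain` lines already sorted by rank: take the first `n` rows and write only the domain column to a text file. `topDomains` is a hypothetical helper, the sample rows are placeholders rather than real rankings, and the output path in the temp directory is an arbitrary choice for the demo.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: domains of the first `n` rows of a "rank,domain" CSV.
function topDomains(csvText, n) {
  return csvText
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '')
    .slice(0, n)
    .map((line) => line.slice(line.indexOf(',') + 1));
}

// Placeholder rows for illustration only; these are not real rankings.
const sampleCsv = '1,example.com\n2,example.org\n3,example.net\n';

// Write the top 2 domains, one per line, to a file in the temp directory.
const outFile = path.join(os.tmpdir(), 'top-domains.txt');
fs.writeFileSync(outFile, topDomains(sampleCsv, 2).join('\n') + '\n');
console.log(fs.readFileSync(outFile, 'utf8'));
```

With the real list, the input would be the contents of the unzipped `top-1m.csv` (for example via `fs.readFileSync('top-1m.csv', 'utf8')`) and `n` would be `10`.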

@d668

d668 commented Jun 21, 2021

the file now ends at 427k

@Waseemghafoor474

Waseemghafoor474 commented Jul 9, 2021

CSV file is working again! Nice!
The data is not exactly up to date; I would say it lags by about 2 months. I have a site that is currently around position 67,000, and in the list it is in the 78,000s.
Also, how can I get just the top 10, for example, into a text file instead of the whole list?

https://stainely.com/

@xysecurity

xysecurity commented Oct 11, 2021

425k for 2021.10.11

@tomwojcik

tomwojcik commented Dec 8, 2021

@snowman

snowman commented Dec 9, 2021

We will be retiring Alexa.com on May 1, 2022

https://support.alexa.com/hc/en-us/articles/4410503838999

Note: this is your last chance to back things up.

@ao

ao commented Dec 14, 2021

With the Alexa top 1 million CSV/ZIP going away shortly, you can use https://statvoo.com/dl/top-1million-sites.csv.zip instead. It is linked from https://statvoo.com/top/ranked and provides a list of the top 1 million websites (updated daily).

@chilts
Author

chilts commented Dec 15, 2021

Thanks @ao, that's good to know! :)

@huadaonan

huadaonan commented Jan 29, 2022

great

@jorgeluislazo

jorgeluislazo commented May 11, 2022

Can confirm http://s3.amazonaws.com/alexa-static/top-1m.csv.zip still works for me, 1M sites (as of May 11th 2022). I think the actual resources will be gone by December of 2022 though

@ciscospirit

ciscospirit commented May 17, 2022

Hello,
Does anyone know how to get the top 1000 for a specific country too?
I'm looking for the Austrian and German top 1000 lists. Can anybody help me out with a download link?

@chilts
Author

chilts commented May 17, 2022

@ciscospirit I don't know any off the top of my head, but perhaps do a search and see what you can find.

@chilts
Author

chilts commented May 17, 2022

Hi everyone, I just noticed this site on a fork of this gist, and it also seems to be kept up to date:

I don't know if it's useful to anyone, but there we go. :)
