Download all UCI machine learning datasets: http://archive.ics.uci.edu
## Author: @thewhitetulip
## Purpose: download all of the machine learning data sets available in UCI's machine learning repository
## Wrote this because I didn't want to go and download all the data sets manually, and because I wanted to do
## a fun project in Python after a long time.
import os

# links.txt lists a folder path (starting with "/") followed by the names of
# the files inside that folder, one entry per line.
file = open("links.txt", "r")
lines = file.readlines()
lines = [line.strip() for line in lines]
file.close()

folderName = ""
fileName = ""
changedDIR = 0

for line in lines:
    if line.startswith("/"):
        # A new folder entry: step back out of the previous folder first.
        if changedDIR != 0:
            os.chdir("../")
        folderName = line
        try:
            os.mkdir(folderName.split('/')[-1])
        except OSError:
            print("dir exists")
        changedDIR = 1
        os.chdir(folderName.split('/')[-1])
    else:
        # A file entry: download it into the current folder.
        fileName = line
        if fileName:
            os.popen('wget -c http://archive.ics.uci.edu' + folderName + "/" + fileName)
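One caveat with the script above: os.popen launches each wget in the background and does not wait for it, so a long links.txt can spawn a large number of simultaneous downloads. A minimal alternative for that last line, assuming you would rather download one file at a time, is to call wget through subprocess and block until it finishes:

import subprocess

# Hypothetical drop-in replacement for the os.popen(...) call above:
# runs wget and waits for it to complete before moving on to the next file.
url = 'http://archive.ics.uci.edu' + folderName + "/" + fileName
subprocess.run(["wget", "-c", url], check=False)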
# folders.txt lists all the folders taken from the UCI website; download each folder's HTML index page into this
# directory first, then check how many folders there are.
wc -l folders.txt
files=`ls *.html.*`
for file in ${files}; do cat ${file} | tr '>' '\n' | grep -e 'href' -e 'Index'; done > links.txt
cut -d'=' -f2- links.txt | tr '"' ' ' > links2.txt
cat links2.txt | uniq > links.txt
### this will create the links.txt file; edit it so that it contains the listing in the following format:
#/ml/machine-learning-databases/balance-scale
#balance-scale.data
#balance-scale.names
# where the first line is the folder name, followed by the names of the files in that folder. I didn't get around to
# doing this step automatically because I was running out of time, and I do not have the entire list; if you figure
# out a way to do it automatically (a rough sketch of one possible approach follows these steps), send me a PR or email me!
mkdir uci && cd uci
##make sure uci_data_sets.py and links.txt are present in this folder
# then run the script: python uci_data_sets.py
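Since building links.txt is currently a manual edit, here is a rough, untested sketch of how that step might be automated in Python rather than with the shell pipeline above. It assumes folders.txt holds one bare folder name per line (as in the listing below), that every folder lives under /ml/machine-learning-databases/, and that each folder's index page is a plain directory listing; the file name build_links.py is made up for illustration.

# build_links.py -- hypothetical helper: turn folders.txt into links.txt
import re
import urllib.request

BASE = "http://archive.ics.uci.edu"

with open("folders.txt") as folders, open("links.txt", "w") as out:
    for entry in folders:
        folder = "/ml/machine-learning-databases/" + entry.strip().rstrip("/")
        out.write(folder + "\n")
        try:
            html = urllib.request.urlopen(BASE + folder + "/").read().decode("utf-8", "ignore")
        except Exception as err:
            print("could not fetch", folder, err)
            continue
        # Pull file names out of the directory listing's href attributes,
        # skipping sort links (?...), the parent directory (/...), and sub-folders (.../).
        for name in re.findall(r'href="([^"?/][^"]*)"', html):
            if not name.endswith("/"):
                out.write(name + "\n")

The full folders.txt listing, one UCI folder per line, follows.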
20newsgroups-mld/
CorelFeatures-mld/
JapaneseVowels-mld/
SyskillWebert-mld/
UNIX_user_data-mld/
abalone/
abscisic-acid/
access-lists/
acute/
adult/
annealing/
anonymous/
arcene/
arrhythmia/
artificial-characters/
audiology/
auslan-mld/
auslan2-mld/
auto-mpg/
autos/
badges/
bag-of-words/
balance-scale/
balloons/
blood-transfusion/
breast-cancer-wisconsin/
breast-cancer/
bridges/
car/
census-income-mld/
census-income/
census1990-mld/
character-trajectories/
chess/
chorales/
cmc/
coil-mld/
communities/
concrete/
connect-4/
contacts/
covertype-mld/
covtype/
cpu-performance/
credit-screening/
cylinder-bands/
demospongiae/
dermatology/
dexter/
dgp-2/
diabetes/
document-understanding/
dorothea/
ebl/
echocardiogram/
ecoli-mld/
ecoli/
eeg-mld/
el_nino-mld/
entree-mld/
event-detection/
faces-mld/
flags/
forest-fires/
function-finding/
gisette/
glass/
haberman/
hayes-roth/
heart-disease/
hepatitis/
hill-valley/
horse-colic/
housing/
icu/
image/
internet_ads/
internet_usage-mld/
ionosphere/
ipums-mld/
iris/
isolet/
kddcup98-mld/
kddcup99-mld/
kinship/
labor-negotiations/
led-display-creator/
lenses/
letter-recognition/
libras/
liver-disorders/
logic-theorist/
lung-cancer/
lymphography/
madelon/
magic/
mammographic-masses/
mechanical-analysis/
meta-data/
mfeat/
mnist-mld/
mobile-robots/
molecular-biology/
monks-problems/
moral-reasoner/
movies-mld/
msnbc-mld/
msweb-mld/
mushroom/
musk/
nsfabs-mld/
nursery/
opinion/
optdigits/
othello/
ozone/
p53/
page-blocks/
parkinsons/
pendigits/
photo-mld/
pima-indians-diabetes/
pioneer-mld/
plants/
poker/
postoperative-patient-data/
primary-tumor/
prodigy/
qsar/
quadrapeds/
restricted/
reuters21578-mld/
reuters_transcribed-mld/
robotfailure-mld/
secom/
semeion/
servo/
shuttle-landing-control/
solar-flare/
soybean/
space-shuttle/
spambase/
spect/
spectrometer/
sponge/
statlog/
student-loan/
synthetic-mld/
synthetic_control-mld/
tae/
tb-mld/
thyroid-disease/
tic-mld/
tic-tac-toe/
trains/
uji-penchars/
undocumented/
university/
url/
utilities/
volcanoes-mld/
voting-records/
water-treatment/
waveform/
wine-quality/
wine/
yeast-mld/
yeast/
zoo/
@thewhitetulip
Author

Getting started: if you just want to download the datasets, put download_uci_data_sets.py
and the links.txt file in one folder and run python download_uci_data_sets.py.

However, if you want to play with things, check out:
filelinks.py: prints the links of all the folders
folders.txt: the list of all the folders copied from the website
filescrubbing.sh: details of how to create the links.txt that you'll use with the download Python script to download everything.

Some files are gigabytes in size, so check your internet connection first. I didn't download all the datasets, only the first fifty or so small ones.

@MeetThePatel

You should make it a CLI, so that, for example, if I just want to download the arrhythmia data set, I could run mldataset download arrhythmia and it would go out and download just that one.
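A minimal sketch of what such a CLI could look like, assuming links.txt from this gist is already present and that the (made-up) script is invoked as python mldataset.py download arrhythmia; it simply filters links.txt down to the requested folder and reuses the same wget download idea:

# mldataset.py -- hypothetical single-dataset downloader along the lines suggested above
import argparse
import subprocess

parser = argparse.ArgumentParser(description="Download a single UCI dataset")
sub = parser.add_subparsers(dest="command", required=True)
dl = sub.add_parser("download", help="download one dataset by folder name")
dl.add_argument("name", help="dataset folder name, e.g. arrhythmia")
args = parser.parse_args()

with open("links.txt") as f:
    lines = [line.strip() for line in f]

folder = None
for line in lines:
    if line.startswith("/"):
        # Remember the folder only if it is the one that was asked for.
        folder = line if line.split("/")[-1] == args.name else None
    elif folder and line:
        url = "http://archive.ics.uci.edu" + folder + "/" + line
        subprocess.run(["wget", "-c", url], check=False)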

@nealmcb

nealmcb commented Jul 1, 2016

The intro comment (or README) should point to the human-readable info on this archive, and note that there are currently 350 data sets: https://archive.ics.uci.edu/ml/datasets.html
