Download all UCI machine learning datasets: http://archive.ics.uci.edu
## Author: @thewhitetulip
## Purpose: download every machine learning data set available in UCI's machine learning repository.
## I wrote this because I didn't want to download all the files by hand and wanted to do a fun
## project in Python after a long time.
import os
import subprocess

# links.txt lists a folder path (starting with "/") followed by the names of the files in that folder
with open("links.txt", "r") as linksFile:
    lines = [line.strip() for line in linksFile]

folderName = ""
changedDIR = False
for line in lines:
    if line.startswith("/"):
        # a new folder entry: step back out of the previous folder first
        if changedDIR:
            os.chdir("../")
        folderName = line
        dirName = folderName.split('/')[-1]
        try:
            os.mkdir(dirName)
        except OSError:
            print("dir exists")
        changedDIR = True
        os.chdir(dirName)
    elif line:
        # a file entry: fetch it into the current folder; -c resumes a partial download
        subprocess.call(['wget', '-c', 'http://archive.ics.uci.edu' + folderName + "/" + line])
## prints the URL of every folder in folders.txt; the .html index pages at these URLs hold the actual links of the datasets
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/'
with open('folders.txt') as folderfile:
    lines = [line.strip() for line in folderfile]
for line in lines:
    print(url + line)
# download all the html pages in the folders.txt file, which are all folders taken from the UCI website
wc -l folders.txt
files=`ls *.html.*`
for file in ${files}; do cat ${file} | tr '>' \\n | grep -e 'href' -e 'Index'; done > links.txt
# write to links2.txt first: redirecting straight back into links.txt would truncate it before cut reads it
cut -d'=' -f2- links.txt | tr '"' ' ' > links2.txt
cat links2.txt | uniq > links.txt
### this creates the links.txt file; edit the file so that it contains the listing in the following format
#/ml/machine-learning-databases/balance-scale
#balance-scale.data
#balance-scale.names
# where the first line is the folder name, followed by the file names. I didn't get time to do that automatically,
# and I do not have the entire list; if you figure out a way to do this automatically, send me a PR
# or email me!
mkdir uci && cd uci
## make sure uci_data_sets.py and links.txt are present in this folder,
# then run the script: python uci_data_sets.py
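The manual editing step described in the comments above could probably be automated. The following is only a sketch, assuming the saved index pages are plain directory listings in which every file appears as an `<a href="...">` entry, sort links start with `?`, and the parent-directory link starts with `/`. The names `HrefCollector`, `listing_entries`, and `links_block` are made up for illustration, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def listing_entries(html_text):
    """Return the file and sub-folder names found in one index page.

    Assumption: sort links start with "?" and the parent-directory link
    starts with "/", so both are skipped.
    """
    parser = HrefCollector()
    parser.feed(html_text)
    return [h for h in parser.hrefs if not h.startswith(("?", "/"))]


def links_block(folder, html_text):
    """Format one folder the way links.txt expects: folder path, then file names."""
    prefix = "/ml/machine-learning-databases/"
    return "\n".join([prefix + folder.strip("/")] + listing_entries(html_text))


# Hypothetical index page for the balance-scale folder
sample = """
<a href="?C=N;O=D">Name</a>
<a href="/ml/machine-learning-databases/">Parent Directory</a>
<a href="balance-scale.data">balance-scale.data</a>
<a href="balance-scale.names">balance-scale.names</a>
"""
print(links_block("balance-scale/", sample))
```

Running `links_block` over every folder in folders.txt and joining the results with newlines would give a links.txt in the format shown above, provided the real pages match these assumptions.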
20newsgroups-mld/
CorelFeatures-mld/
JapaneseVowels-mld/
SyskillWebert-mld/
UNIX_user_data-mld/
abalone/
abscisic-acid/
access-lists/
acute/
adult/
annealing/
anonymous/
arcene/
arrhythmia/
artificial-characters/
audiology/
auslan-mld/
auslan2-mld/
auto-mpg/
autos/
badges/
bag-of-words/
balance-scale/
balloons/
blood-transfusion/
breast-cancer-wisconsin/
breast-cancer/
bridges/
car/
census-income-mld/
census-income/
census1990-mld/
character-trajectories/
chess/
chorales/
cmc/
coil-mld/
communities/
concrete/
connect-4/
contacts/
covertype-mld/
covtype/
cpu-performance/
credit-screening/
cylinder-bands/
demospongiae/
dermatology/
dexter/
dgp-2/
diabetes/
document-understanding/
dorothea/
ebl/
echocardiogram/
ecoli-mld/
ecoli/
eeg-mld/
el_nino-mld/
entree-mld/
event-detection/
faces-mld/
flags/
forest-fires/
function-finding/
gisette/
glass/
haberman/
hayes-roth/
heart-disease/
hepatitis/
hill-valley/
horse-colic/
housing/
icu/
image/
internet_ads/
internet_usage-mld/
ionosphere/
ipums-mld/
iris/
isolet/
kddcup98-mld/
kddcup99-mld/
kinship/
labor-negotiations/
led-display-creator/
lenses/
letter-recognition/
libras/
liver-disorders/
logic-theorist/
lung-cancer/
lymphography/
madelon/
magic/
mammographic-masses/
mechanical-analysis/
meta-data/
mfeat/
mnist-mld/
mobile-robots/
molecular-biology/
monks-problems/
moral-reasoner/
movies-mld/
msnbc-mld/
msweb-mld/
mushroom/
musk/
nsfabs-mld/
nursery/
opinion/
optdigits/
othello/
ozone/
p53/
page-blocks/
parkinsons/
pendigits/
photo-mld/
pima-indians-diabetes/
pioneer-mld/
plants/
poker/
postoperative-patient-data/
primary-tumor/
prodigy/
qsar/
quadrapeds/
restricted/
reuters21578-mld/
reuters_transcribed-mld/
robotfailure-mld/
secom/
semeion/
servo/
shuttle-landing-control/
solar-flare/
soybean/
space-shuttle/
spambase/
spect/
spectrometer/
sponge/
statlog/
student-loan/
synthetic-mld/
synthetic_control-mld/
tae/
tb-mld/
thyroid-disease/
tic-mld/
tic-tac-toe/
trains/
uji-penchars/
undocumented/
university/
url/
utilities/
volcanoes-mld/
voting-records/
water-treatment/
waveform/
wine-quality/
wine/
yeast-mld/
yeast/
zoo/
/ml/machine-learning-databases/CorelFeatures-mld
48000.jpg
151085.jpg
231076.jpg
294084.jpg
354090.jpg
534099.jpg
ColorHistogram.asc.gz
ColorMoments.asc.gz
CoocTexture.asc.gz
CorelFeatures.data.html
CorelFeatures.html
LayoutHistogram.asc.gz
size
/ml/machine-learning-databases/annealing
anneal.data
anneal.names
anneal.test
/ml/machine-learning-databases/mnist-mld
t10k-images.idx3-ubyte.gz
t10k-labels.idx1-ubyte.gz
train-images.idx3-ubyte.gz
train-labels.idx1-ubyte.gz
/ml/machine-learning-databases/mobile-robots
mobile-robots.names
mobile-robots.tar.gz
/ml/machine-learning-databases/molecular-biology
promoter-gene-sequences/
protein-secondary-structure/
splice-junction-gene-sequences/
/ml/machine-learning-databases/monks-problems
monks-1.test
monks-1.train
monks-2.test
monks-2.train
monks-3.test
monks-3.train
monks.names
thrun.comparison.dat
thrun.comparison.ps.Z
update
/ml/machine-learning-databases/moral-reasoner
moral.data
moral.info
moral.theory
/ml/machine-learning-databases/movies-mld
data/
doc.html
movies.data.html
movies.html
questions
uciform
unused/
/ml/machine-learning-databases/msnbc-mld
description.txt
msnbc.data.html
msnbc.html
msnbc990928.seq.gz
/ml/machine-learning-databases/msweb-mld
anonymous-msweb.data.gz
anonymous-msweb.test.gz
msweb.data.html
msweb.html
msweb.task.html
/ml/machine-learning-databases/mushroom
agaricus-lepiota.data
agaricus-lepiota.names
expanded.Z
/ml/machine-learning-databases/musk
clean1.data.Z
clean1.info
clean1.names
clean2.data.Z
clean2.info
clean2.names
/ml/machine-learning-databases/anonymous
anonymous-msweb.data
anonymous-msweb.info
anonymous-msweb.test
/ml/machine-learning-databases/nsfabs-mld
Part1.zip
Part2.zip
Part3.zip
nsfabs_part1_out.zip
nsfabs_part2_out.zip
nsfabs_part3_out.zip
nsfawards.data.html
nsfawards.html
words.zip
/ml/machine-learning-databases/nursery
nursery.c45-names
nursery.data
nursery.names
/ml/machine-learning-databases/opinion
OpinosisDataset1.0.zip
opinion.names
/ml/machine-learning-databases/optdigits
optdigits-orig.cv.Z
optdigits-orig.names
optdigits-orig.tra.Z
optdigits-orig.wdep.Z
optdigits-orig.windep.Z
optdigits.names
optdigits.tes
optdigits.tra
readme.txt
/ml/machine-learning-databases/othello
new-othello.names
new-othello.theory
older-version/
/ml/machine-learning-databases/ozone
eighthr.data
eighthr.names
onehr.data
onehr.names
/ml/machine-learning-databases/p53
p53.names
p53_new_2012.zip
p53_old_2010.zip
/ml/machine-learning-databases/page-blocks
page-blocks.data.Z
page-blocks.names
/ml/machine-learning-databases/parkinsons
parkinsons.data
parkinsons.names
telemonitoring/
/ml/machine-learning-databases/pendigits
pendigits-orig.names
pendigits-orig.tes.Z
pendigits-orig.tra.Z
pendigits.names
pendigits.tes
pendigits.tra
/ml/machine-learning-databases/arcene
ARCENE/
Dataset.pdf
arcene_valid.labels
/ml/machine-learning-databases/photo-mld
photo.data.html
photo.html
photo.tar.gz
photo.task.html
/ml/machine-learning-databases/pima-indians-diabetes
costs/
pima-indians-diabetes.data
pima-indians-diabetes.names
/ml/machine-learning-databases/pioneer-mld
gripper.data.gz
move.data.gz
pioneer.data.html
pioneer.html
pioneer.names
turn.data.gz
/ml/machine-learning-databases/plants
plants.data
plants.names
stateabbr.txt
/ml/machine-learning-databases/poker
poker-hand-testing.data
poker-hand-training-true.data
poker-hand.names
/ml/machine-learning-databases/postoperative-patient-data
post-operative.data
post-operative.names
/ml/machine-learning-databases/primary-tumor
primary-tumor-data
primary-tumor.names
/ml/machine-learning-databases/prodigy
domains/
/ml/machine-learning-databases/qsar
drug_data
/ml/machine-learning-databases/quadrapeds
animals.c
animals.names
/ml/machine-learning-databases/arrhythmia
arrhythmia.data
arrhythmia.names
/ml/machine-learning-databases/restricted
breast-cancer/
lymphography/
primary-tumor/
/ml/machine-learning-databases/reuters21578-mld
reuters21578.html
reuters21578.tar.gz
/ml/machine-learning-databases/reuters_transcribed-mld
ReutersTranscribedSubset.zip
ReutersTranscribedSubsetOld.zip
reuters_transcribed.html
/ml/machine-learning-databases/robotfailure-mld
a.out
format.c
lp1.data
lp2.data
lp3.data
lp4.data
lp5.data
robot
robotfailure.data.html
robotfailure.html
/ml/machine-learning-databases/secom
secom.data
secom.names
secom_labels.data
/ml/machine-learning-databases/semeion
semeion.data
semeion.names
/ml/machine-learning-databases/servo
servo.data
servo.names
/ml/machine-learning-databases/shuttle-landing-control
shuttle-landing-control.data
shuttle-landing-control.names
/ml/machine-learning-databases/solar-flare
flare.data1
flare.data2
flare.names
past-usage
/ml/machine-learning-databases/soybean
backup-large.data
backup-large.test
fisher-order
soybean-explanation
soybean-large.data
soybean-large.names
soybean-large.test
soybean-small.data
soybean-small.names
stepp-order
why-various-soybean-databases
/ml/machine-learning-databases/artificial-characters
character.names
character.tar.Z
convert.cc
domain_theory
/ml/machine-learning-databases/space-shuttle
o-ring-erosion-only.data
o-ring-erosion-or-blowby.data
o-ring-erosion.names
/ml/machine-learning-databases/spambase
spambase.DOCUMENTATION
spambase.data
spambase.names
spambase.zip
/ml/machine-learning-databases/spect
DonorNote.txt
SPECT.names
SPECT.test
SPECT.train
SPECTF.names
SPECTF.test
SPECTF.train
SPECTFincorrect.test
/ml/machine-learning-databases/spectrometer
gennari-message
lrs.data
lrs.names
original.data.Z
original.names
/ml/machine-learning-databases/sponge
sponge.data
sponge.info
/ml/machine-learning-databases/statlog
Statlog-README
australian/
german/
heart/
satimage/
segment/
shuttle/
statlog.names
vehicle/
/ml/machine-learning-databases/student-loan
disabled.pl
domain-theory.pl
enlist.pl
enrolled.pl
filed_for_bankrupcy.pl
longest_absense_from_school.pl
male.pl
misc.pl
no_payment_due.pl
student-loan.names
unemployed.pl
/ml/machine-learning-databases/synthetic-mld
equation.gif
synthetic.data.gz
synthetic.data.html
synthetic.html
ts1-5.gif
ts6-10.gif
/ml/machine-learning-databases/synthetic_control-mld
cluster.jpeg
data.jpeg
synthetic_control.clustering.html
synthetic_control.data
synthetic_control.data.html
synthetic_control.html
/ml/machine-learning-databases/tae
tae.data
tae.names
/ml/machine-learning-databases/audiology
audiology.data
audiology.names
audiology.standardized.data
audiology.standardized.names
audiology.standardized.test
audiology.test
/ml/machine-learning-databases/tb-mld
tb.data.html
tb.html
tb_data.pl.bz2
tb_data.pl.gz
tb_functions.pl.bz2
tb_functions.pl.gz
/ml/machine-learning-databases/thyroid-disease
HELLO
allbp.data
allbp.names
allbp.test
allhyper.data
allhyper.names
allhyper.test
allhypo.data
allhypo.names
allhypo.test
allrep.data
allrep.names
allrep.test
ann-Readme
ann-test.data
ann-thyroid.names
ann-train.data
costs/
dis.data
dis.names
dis.test
hypothyroid.data
hypothyroid.names
new-thyroid.data
new-thyroid.names
sick-euthyroid.data
sick-euthyroid.names
sick.data
sick.names
sick.test
thyroid.theory
thyroid0387.data
thyroid0387.names
/ml/machine-learning-databases/tic-mld
TicDataDescr.txt
dictionary.txt
tic.data.html
tic.html
tic.tar.gz
tic.task.html
ticdata2000.txt
ticeval2000.txt
tictgts2000.txt
/ml/machine-learning-databases/tic-tac-toe
tic-tac-toe.data
tic-tac-toe.names
/ml/machine-learning-databases/trains
east-west.info
trains-original.data
trains-transformed.data
trains.names
trains.supplement
trains.tar.Z
/ml/machine-learning-databases/uji-penchars
version1/
version2/
/ml/machine-learning-databases/undocumented
connectionist-bench/
pazzani/
sigillito/
taylor/
/ml/machine-learning-databases/university
university.data
university.names
/ml/machine-learning-databases/url
url.names
url_svmlight.tar.gz
/ml/machine-learning-databases/utilities
converter.lisp
utilities.doc
/ml/machine-learning-databases/auslan-mld
allsigns.tar.gz
auslan.data.html
auslan.html
/ml/machine-learning-databases/volcanoes-mld
uci_form.txt
volcanoes.data.html
volcanoes.html
volcanoes.tar.gz
/ml/machine-learning-databases/voting-records
house-votes-84.data
house-votes-84.names
/ml/machine-learning-databases/water-treatment
water-treatment.data
water-treatment.names
/ml/machine-learning-databases/waveform
waveform-+noise.c
waveform-+noise.data.Z
waveform-+noise.names
waveform.c
waveform.data.Z
waveform.names
/ml/machine-learning-databases/wine-quality
winequality-red.csv
winequality-white.csv
winequality.names
/ml/machine-learning-databases/wine
wine.data
wine.names
/ml/machine-learning-databases/yeast-mld
yeast.data.html
yeast.html
/ml/machine-learning-databases/yeast
yeast.data
yeast.names
/ml/machine-learning-databases/zoo
zoo.data
zoo.names
/ml/machine-learning-databases/auslan2-mld
auslan.data.html
auslan.html
tctodd.tar.bz2
tctodd.tar.gz
/ml/machine-learning-databases/auto-mpg
auto-mpg.data
auto-mpg.data-original
auto-mpg.names
/ml/machine-learning-databases/autos
imports-85.data
imports-85.names
misc
/ml/machine-learning-databases/JapaneseVowels-mld
JapaneseVowels.data.html
JapaneseVowels.html
JapaneseVowels.task.html
ae.test
ae.train
size_ae.test
size_ae.train
/ml/machine-learning-databases/badges
badges.data
badges.info
/ml/machine-learning-databases/bag-of-words
docword.enron.txt.gz
docword.kos.txt.gz
docword.nips.txt.gz
docword.nytimes.txt.gz
docword.pubmed.txt.gz
readme.txt
vocab.enron.txt
vocab.kos.txt
vocab.nips.txt
vocab.nytimes.txt
vocab.pubmed.txt
/ml/machine-learning-databases/balance-scale
balance-scale.data
balance-scale.names
/ml/machine-learning-databases/balloons
adult+stretch.data
adult-stretch.data
balloons.names
yellow-small+adult-stretch.data
yellow-small.data
/ml/machine-learning-databases/blood-transfusion
transfusion.data
transfusion.names
/ml/machine-learning-databases/breast-cancer-wisconsin
breast-cancer-wisconsin.data
breast-cancer-wisconsin.names
unformatted-data
wdbc.data
wdbc.names
wpbc.data
wpbc.names
/ml/machine-learning-databases/breast-cancer
breast-cancer-data
breast-cancer.names
/ml/machine-learning-databases/bridges
bridges.data.version1
bridges.data.version2
bridges.names
/ml/machine-learning-databases/car
car.c45-names
car.data
car.names
/ml/machine-learning-databases/census-income-mld
census-income.data.gz
census-income.data.html
census-income.html
census-income.names
census-income.test.gz
census.tar.gz
/ml/machine-learning-databases/SyskillWebert-mld
SyskillWebert.data.html
SyskillWebert.html
SyskillWebert.tar.gz
SyskillWebert.task.html
/ml/machine-learning-databases/census-income
census-income.data
census-income.names
census-income.test
/ml/machine-learning-databases/census1990-mld
ReadMe.txt
USCensus1990-desc.html
USCensus1990-task.html
USCensus1990.attributes.txt
USCensus1990.data.txt
USCensus1990.html
USCensus1990.mapping.sql
USCensus1990.readme.txt
USCensus1990.task.txt
USCensus1990raw-desc.html
USCensus1990raw.attributes.txt
USCensus1990raw.coding.htm
USCensus1990raw.data.txt
USCensus1990raw.html
USCensus1990raw.readme.txt
/ml/machine-learning-databases/character-trajectories
mixoutALL_shifted.mat
trajectories.names
/ml/machine-learning-databases/chess
domain-theories/
king-rook-vs-king-knight/
king-rook-vs-king-pawn/
king-rook-vs-king/
/ml/machine-learning-databases/chorales
chorales.doc
chorales.lisp.Z
/ml/machine-learning-databases/cmc
cmc.data
cmc.names
/ml/machine-learning-databases/coil-mld
analysis.data
coil.data.html
coil.html
eval.data
instructions.txt
r2
results.data
results.htm
results.txt
/ml/machine-learning-databases/communities
communities.data
communities.names
/ml/machine-learning-databases/concrete
compressive/
slump/
/ml/machine-learning-databases/connect-4
connect-4.data.Z
connect-4.names
/ml/machine-learning-databases/UNIX_user_data-mld
UNIX_user_data.html
UNIX_user_data.tar.gz
/ml/machine-learning-databases/contacts
s.gu
/ml/machine-learning-databases/covertype-mld
covertype.data.html
covertype.html
covertype.task.html
covtype.data.gz
/ml/machine-learning-databases/covtype
covtype.data.gz
covtype.info
old_covtype.info
/ml/machine-learning-databases/cpu-performance
machine.data
machine.names
/ml/machine-learning-databases/credit-screening
credit.lisp
credit.names
crx.data
crx.names
/ml/machine-learning-databases/cylinder-bands
bands.data
bands.names
/ml/machine-learning-databases/demospongiae
demospongiae-503.pl
demospongiae-cases-503.noos
demospongiae-dm.noos
demospongiae-ontology.noos
demospongiae.names
sponge-220.pdf
/ml/machine-learning-databases/dermatology
dermatology.data
dermatology.names
/ml/machine-learning-databases/dexter
DEXTER/
Dataset.pdf
dexter_valid.labels
/ml/machine-learning-databases/dgp-2
DGP-2.c
DGP-2.names
/ml/machine-learning-databases/abalone
abalone.data
abalone.names
/ml/machine-learning-databases/diabetes
diabetes-data.tar.Z
/ml/machine-learning-databases/document-understanding
FOIL.data
document-understanding.info
test1.data
test2.data
test3.data
test4.data
test5.data
test6.data
train1.data
train2.data
train3.data
train4.data
train5.data
train6.data
/ml/machine-learning-databases/dorothea
DOROTHEA/
Dataset.pdf
dorothea_valid.labels
/ml/machine-learning-databases/ebl
all
cup
deductive.assumable
emotion
ice
pople
safe-to-stack
suicide
/ml/machine-learning-databases/echocardiogram
echocardiogram.data
echocardiogram.names
/ml/machine-learning-databases/ecoli-mld
ecoli.data.html
ecoli.html
ecoli_data.pl.bz2
ecoli_data.pl.gz
ecoli_functions.pl.bz2
ecoli_functions.pl.gz
/ml/machine-learning-databases/ecoli
ecoli.data
ecoli.names
/ml/machine-learning-databases/eeg-mld
SMNI_CMI_TEST.tar.gz
SMNI_CMI_TRAIN.tar.gz
alcoholic.gif
control.gif
eeg.data.html
eeg.full.html
eeg.html
eeg_full.tar
eeg_full/
smni_eeg_data.tar.gz
/ml/machine-learning-databases/el_nino-mld
el_nino.data.html
el_nino.html
elnino.col
elnino.gz
tao-all2.col
tao-all2.dat.gz
tao-all2.missing.gz
/ml/machine-learning-databases/entree-mld
entree.data.html
entree.html
entree_data.tar.gz
/ml/machine-learning-databases/abscisic-acid
plantCellSignaling.data
plantCellSignaling.names
/ml/machine-learning-databases/event-detection
CalIt2.data
CalIt2.events
CalIt2.names
Dodgers.data
Dodgers.events
Dodgers.names
/ml/machine-learning-databases/faces-mld
an2i_straight_neutral_open.jpg
at33_left_happy_sunglasses.jpg
boland_right_sad_open.jpg
ch4f_up_angry_sunglasses.jpg
faces.data.html
faces.html
faces.tar.Z
faces.tar.gz
faces_4.tar.Z
faces_4.tar.gz
hw97.ps
hw97.tex
/ml/machine-learning-databases/flags
flag.data
flag.names
/ml/machine-learning-databases/forest-fires
forestfires.csv
forestfires.names
/ml/machine-learning-databases/function-finding
function-finding.data
function-finding.names
/ml/machine-learning-databases/gisette
Dataset.pdf
GISETTE/
gisette_valid.labels
/ml/machine-learning-databases/glass
glass.data
glass.names
glass.tag
/ml/machine-learning-databases/haberman
haberman.data
haberman.names
/ml/machine-learning-databases/hayes-roth
hayes-roth.data
hayes-roth.names
hayes-roth.test
/ml/machine-learning-databases/heart-disease
WARNING
ask-detrano
bak
cleve.mod
cleveland.data
costs/
heart-disease.names
hungarian.data
long-beach-va.data
new.data
processed.cleveland.data
processed.hungarian.data
processed.switzerland.data
processed.va.data
reprocessed.hungarian.data
switzerland.data
/ml/machine-learning-databases/access-lists
other-repository-info/
repository-list
software-list
/ml/machine-learning-databases/hepatitis
costs/
hepatitis.data
hepatitis.names
/ml/machine-learning-databases/hill-valley
Hill-Valley.names
Hill_Valley_sample_arff.text
Hill_Valley_visual_examples.jpg
Hill_Valley_with_noise_Testing.data
Hill_Valley_with_noise_Training.data
Hill_Valley_without_noise_Testing.data
Hill_Valley_without_noise_Training.data
/ml/machine-learning-databases/horse-colic
horse-colic.data
horse-colic.names
horse-colic.names.original
horse-colic.test
/ml/machine-learning-databases/housing
housing.data
housing.names
/ml/machine-learning-databases/icu
icu-data.tar.Z
/ml/machine-learning-databases/image
segmentation.data
segmentation.names
segmentation.test
/ml/machine-learning-databases/internet_ads
ad-dataset.zip
ad.DOCUMENTATION
ad.data
ad.names
/ml/machine-learning-databases/internet_usage-mld
changes
final_general.col
final_general.dat.gz
internet_usage.data.html
internet_usage.html
/ml/machine-learning-databases/ionosphere
ionosphere.data
ionosphere.names
/ml/machine-learning-databases/ipums-mld
codebook
ipums.data.html
ipums.html
ipums.la.97.gz
ipums.la.97.gz.old
ipums.la.98.gz
ipums.la.99.gz
ipums.la.names
s.ipums.la.97.gz
s.ipums.la.98.gz
s.ipums.la.99.gz
/ml/machine-learning-databases/acute
diagnosis.data
diagnosis.names
/ml/machine-learning-databases/iris
bezdekIris.data
iris.data
iris.names
/ml/machine-learning-databases/isolet
isolet.info
isolet.names
isolet1+2+3+4.data.Z
isolet5.data.Z
/ml/machine-learning-databases/kddcup98-mld
epsilon_mirror/
kddcup98.html
readme
restrictions
/ml/machine-learning-databases/kddcup99-mld
corrected.gz
kddcup.data.gz
kddcup.data_10_percent.gz
kddcup.names
kddcup.newtestdata_10_percent_unlabeled.gz
kddcup.testdata.unlabeled.gz
kddcup.testdata.unlabeled_10_percent.gz
kddcup99.html
task.html
training_attack_types
typo-correction.txt
/ml/machine-learning-databases/kinship
kinship.data
kinship.names
/ml/machine-learning-databases/labor-negotiations
C4.5/
labor-negotiations.data
labor-negotiations.names
labor-negotiations.test
/ml/machine-learning-databases/led-display-creator
led-creator-+17.c
led-creator-+17.names
led-creator.c
led-creator.names
/ml/machine-learning-databases/lenses
lenses.data
lenses.names
/ml/machine-learning-databases/letter-recognition
letter-recognition.data
letter-recognition.data.Z
letter-recognition.names
/ml/machine-learning-databases/libras
movement_libras.data
movement_libras.names
movement_libras_1.data
movement_libras_5.data
movement_libras_8.data
movement_libras_9.data
movement_libras_10.data
/ml/machine-learning-databases/adult
adult.data
adult.names
adult.test
old.adult.names
/ml/machine-learning-databases/liver-disorders
bupa.data
bupa.names
costs/
noteDuplicates.txt
/ml/machine-learning-databases/logic-theorist
CODE/
ReadMe
ltnotes
/ml/machine-learning-databases/lung-cancer
lung-cancer.data
lung-cancer.names
/ml/machine-learning-databases/lymphography
lymphography-data
lymphography.names
/ml/machine-learning-databases/madelon
Dataset.pdf
MADELON/
madelon_valid.labels
/ml/machine-learning-databases/magic
magic04.data
magic04.names
/ml/machine-learning-databases/mammographic-masses
mammographic_masses.data
mammographic_masses.names
/ml/machine-learning-databases/mechanical-analysis
PUMPS-DATA-SET/
older-version/
/ml/machine-learning-databases/meta-data
meta.data
meta.names
/ml/machine-learning-databases/mfeat
mfeat-fac
mfeat-fou
mfeat-kar
mfeat-mor
mfeat-pix
mfeat-zer
mfeat.info
mfeat.tar
You should make it a CLI so that, for example, if I just want to download the arrhythmia data set, I could run mldataset download arrhythmia
and it would go out and download just that one.
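One possible shape for such a CLI is sketched below with `argparse`. It is not part of the gist: the `--links` and `--dry-run` options and the function names `parse_links`, `dataset_urls`, and `main` are assumptions made for illustration, and it reuses the links.txt layout (a folder line starting with `/`, then file names) and the `wget -c` call from the download script.

```python
import argparse
import subprocess

BASE_URL = "http://archive.ics.uci.edu"


def parse_links(lines):
    """Map each dataset name to (folder path, list of file names).

    Follows the links.txt layout: a line starting with "/" names a folder,
    and the lines after it name the files in that folder.
    """
    index = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            current = line.split("/")[-1]
            index[current] = (line, [])
        elif current is not None:
            index[current][1].append(line)
    return index


def dataset_urls(index, name):
    """Return the download URLs for one dataset; raises KeyError if unknown."""
    folder, files = index[name]
    return [BASE_URL + folder + "/" + f for f in files]


def main(argv=None):
    parser = argparse.ArgumentParser(prog="mldataset")
    sub = parser.add_subparsers(dest="command", required=True)
    dl = sub.add_parser("download", help="download one dataset")
    dl.add_argument("name")
    dl.add_argument("--links", default="links.txt")
    dl.add_argument("--dry-run", action="store_true",
                    help="print the URLs instead of fetching them")
    args = parser.parse_args(argv)

    with open(args.links) as fh:
        index = parse_links(fh)
    if args.name not in index:
        parser.error("unknown dataset: " + args.name)
    for url in dataset_urls(index, args.name):
        if args.dry_run:
            print(url)
        else:
            # same wget call the download script relies on
            subprocess.call(["wget", "-c", url])
```

Saved as a hypothetical mldataset.py with a final `main()` call under an `if __name__ == "__main__":` guard, `python mldataset.py download arrhythmia --dry-run` would list the URLs for that one folder without fetching anything.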
The intro comment (or README) should point to the human-readable info on this archive, and note that there are currently 350 data sets: https://archive.ics.uci.edu/ml/datasets.html
Getting started: if you just want to download the datasets, download download_uci_data_sets.py
and the links.txt file into one folder and run
python download_uci_data_sets.py
However, if you want to play with things, check out:
filelinks.py: prints the links of all folders
folders.txt: the list of all folders copied from the website
filescrubbing.sh: details about how to create the links.txt which you'll use with the download Python script to download everything.
Some files are GBs in size, so check your internet connection first. I didn't download all the datasets, only the first fifty or so small ones.
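Because some files are that large, it can help to ask the server for a file's size before fetching it. This is a minimal sketch, assuming the server answers HEAD requests with a `Content-Length` header, which is not guaranteed. The helper names and the 50 MiB default limit are hypothetical choices, not something the scripts above do.

```python
import urllib.request


def remote_size(url, timeout=10):
    """Return the size in bytes reported by the server, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


def human_size(num_bytes):
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return "%.1f %s" % (size, unit)
        size /= 1024


def small_enough(size, limit=50 * 1024 * 1024):
    """Decide whether to fetch a file; unknown sizes count as too big.

    The 50 MiB default is an arbitrary, hypothetical threshold.
    """
    return size is not None and size <= limit
```

A download loop could call `remote_size` on each URL, print `human_size` of the result, and skip anything for which `small_enough` returns False.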