@aligusnet
Last active December 11, 2021 19:56
Download a weather dataset from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/) and prepare it for the examples in the book "Hadoop: The Definitive Guide" by Tom White (http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520).
Usage: ./ncdc.sh 1901 1930 # download weather datasets for the period from 1901 to 1930
#!/usr/bin/env bash
# global parameters
g_tmp_folder="ncdc_tmp";
g_output_folder="ncdc_data";
g_remote_host="ftp.ncdc.noaa.gov";
g_remote_path="pub/data/noaa";
# $1: folder_path
function create_folder {
    if [ -d "$1" ]; then
        rm -rf "$1"
    fi
    mkdir "$1"
}
# $1: year to download
function download_data {
    local source_url="ftp://$g_remote_host/$g_remote_path/$1"
    wget -r -c -q --no-parent -P "$g_tmp_folder" "$source_url"
}
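# $1: year to process
function process_data {
    local year="$1"
    local local_path="$g_tmp_folder/$g_remote_host/$g_remote_path/$year"
    local tmp_output_file="$g_tmp_folder/$year"
    local zipped_file="$g_output_folder/$year.gz"
    local file
    # Skip years for which nothing was downloaded
    if [ ! -d "$local_path" ]; then
        echo "no data downloaded for $year, skipping" >&2
        return 1
    fi
    # Concatenate all downloaded files for the year into one file
    for file in "$local_path"/*; do
        gunzip -c "$file" >> "$tmp_output_file"
    done
    # Recompress the combined file as <year>.gz
    gzip -c "$tmp_output_file" > "$zipped_file"
    echo "created file: $zipped_file"
    rm -rf "$local_path"
    rm "$tmp_output_file"
}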
# $1 - start year
# $2 - finish year
function main {
    local start_year=1901
    local finish_year=1920
    local year
    if [ -n "$1" ]; then
        start_year="$1"
    fi
    if [ -n "$2" ]; then
        finish_year="$2"
    fi
    create_folder "$g_tmp_folder"
    create_folder "$g_output_folder"
    for year in $(seq "$start_year" "$finish_year"); do
        download_data "$year"
        process_data "$year"
    done
    rm -rf "$g_tmp_folder"
}
main "$1" "$2"
@smitakl

smitakl commented Apr 3, 2014

Thanks for the script. BTW, pub/data/noaa path doesn't seem to be valid.

@crush-157

FTP location has changed.

It is now ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

So line 7 should be:

g_remote_host="ftp.ncdc.noaa.gov";
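Since the server location has moved more than once, one option is to let the host and path be overridden from the environment instead of editing the script each time. A minimal sketch, assuming the hypothetical variable names `NCDC_REMOTE_HOST` and `NCDC_REMOTE_PATH` and a hypothetical helper `build_source_url`, none of which are part of the original script:

```shell
#!/usr/bin/env bash
# Hypothetical overrides: NCDC_REMOTE_HOST and NCDC_REMOTE_PATH are not part of
# the original script; the defaults match the values used in the gist.
g_remote_host="${NCDC_REMOTE_HOST:-ftp.ncdc.noaa.gov}"
g_remote_path="${NCDC_REMOTE_PATH:-pub/data/noaa}"

# $1: year
# Prints the FTP URL that download_data would pass to wget.
build_source_url() {
    echo "ftp://$g_remote_host/$g_remote_path/$1"
}
```

With this in place, something like `NCDC_REMOTE_HOST=ftp.example.org ./ncdc.sh 1901 1930` would point the script at a different server (the host name here is only a placeholder).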

@tomasdelvechio

Thanks! The script with the change from crush-157 works fine!

@tirru

tirru commented Aug 26, 2014

perfectly worked.

sudo bash ncdc.sh

user@ubuntuvm:~/climateData/ncdc_data$ ls
1901.gz 1904.gz 1907.gz 1910.gz 1913.gz 1916.gz 1919.gz
1902.gz 1905.gz 1908.gz 1911.gz 1914.gz 1917.gz 1920.gz
1903.gz 1906.gz 1909.gz 1912.gz 1915.gz 1918.gz

@rehevkor5

Download location has changed again. Also, I have introduced changes so the script does not try to run process_data on files that have not been downloaded, and prints information about failed downloads to stderr. https://gist.github.com/rehevkor5/2e407950ca687b36fc54
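The idea described above can be sketched roughly as follows. This is not the code from the linked fork; `run_year` is a hypothetical wrapper, and it assumes the download function returns a non-zero status on failure (as `wget` does when it encounters an error):

```shell
#!/usr/bin/env bash
# $1: year, $2: name of the download function, $3: name of the process function
# Hypothetical wrapper (not from the gist or the fork): runs the processing
# step only if the download step exits successfully, and reports failures
# on stderr.
run_year() {
    local year="$1"
    local download_fn="$2"
    local process_fn="$3"
    if "$download_fn" "$year"; then
        "$process_fn" "$year"
    else
        echo "download failed for $year, skipping" >&2
        return 1
    fi
}
```

In the main loop of the script, the two calls per year could then be replaced by `run_year "$year" download_data process_data`.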

@jithinodattu

Thank you

@sasikirankarri

Thanks for the valuable script and valuable edit by crush-157 😄

@BhavaniCP

thank you so much

@forisg

forisg commented Sep 3, 2015

Changed again to:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

@tushar-chandra-030389

Thanks

@lohithn4

lohithn4 commented Apr 7, 2016

cool great for this work

@zhounanshu

Thank you very much! :)

@ichigeki

ichigeki commented Jun 2, 2016

Thanks Alexander. This is great. 👍

@huidi7

huidi7 commented Aug 16, 2016

Thanks. Cool.

@holphi

holphi commented Jan 9, 2017

That's really helpful! Thank you guys!

@aligusnet
Author

thanks for your comments and special thanks to @crush-157 for the fix.

@AnayBhowmik

Thanks a lot
worked perfectly

@danieldai

Thanks, It still works

@binshi

binshi commented Nov 22, 2017

I am running on mac and I get
gzip: ncdc_tmp/ftp.ncdc.noaa.gov/pub/data/noaa/1921/*.gz: No such file or directory
created file: ncdc_data/1921.gz

The above was due to a lack of internet connectivity. My bad.
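The error message shows the literal pattern `*.gz` because an unmatched glob is passed through unchanged when no files exist. A minimal sketch of a loop that tolerates this, assuming the downloaded files end in `.gz` as the message suggests; `concat_gz_dir` is a hypothetical helper, not part of the script:

```shell
#!/usr/bin/env bash
# $1: directory holding downloaded .gz files, $2: output file
# Hypothetical helper: appends the decompressed contents of every .gz file in
# the directory to the output file, and fails without creating any output
# when no such files exist.
concat_gz_dir() {
    local dir="$1"
    local out="$2"
    local found=0
    local file
    for file in "$dir"/*.gz; do
        # An unmatched glob stays as the literal pattern; skip it.
        [ -e "$file" ] || continue
        gunzip -c "$file" >> "$out" || return 1
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no .gz files found in $dir" >&2
        return 1
    fi
}
```

Checking the return status before compressing the result would avoid the misleading "created file" line for a year that has no data.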

@danieldai

Thanks, it works

@BinitaBharati

Great script, works beautifully. The FTP server location has also been updated in the script, so nothing needs to be edited; the script works as it is.

@yogirain

yogirain commented Apr 7, 2018

Works excellently without a change. Thanks.

@engkimbs

Awesome! Thanks
