Convert Icelandic weather data (all stations) from an HTML table to CSV format - http://brunnur.vedur.is/athuganir/athtafla
# Convert Icelandic weather data (all stations) from an HTML table to CSV format
$ curl "http://brunnur.vedur.is/athuganir/athtafla/2015081210.html" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | tr -d '\n\r' | sed 's/<\/TR[^>]*>/\n/Ig' | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig' | sed 's/<[^>]\+>//Ig' | sed 's/^[\ \t]*//g' | sed '/^\s*$/d' | sed 's/^/2015081210,/'
Output:
2015081210,33751,Siglufjarðarvegur_Herkonugil,-99,6.9,6.9,7.9,80,6.7,7.1,10.2,92,-99
2015081210,33643,Stafá,40,9.3,8.9,9.5,38,4.9,4.9,7.1,79,-99
2015081210,32474,Steingrímsfjarðarheiði,440,4.4,3.9,4.5,65,11.5,11.6,14.2,99,-99
2015081210,31950,Stórholt,70,9.9,9.3,9.9,81,6.7,6.7,8.5,82,-99
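Because every row is plain comma-separated text, it can be filtered further with standard tools. The sketch below is an illustration only: `rows_for_station` is a hypothetical helper name, and treating the second field as a station number is an assumption based on the sample output above, since the column meanings are not documented here.

```shell
# Hypothetical helper: read CSV rows on stdin and keep only those whose
# second field equals the id passed as the first argument.
# Assumption: field 2 is the station number, as the sample output suggests.
rows_for_station() {
  awk -F',' -v id="$1" '$2 == id'
}
```

For example, piping the output above through `rows_for_station 33643` would keep only the Stafá row.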
############################################################
# How it works:
# Get the contents of the URL
# curl "http://brunnur.vedur.is/athuganir/athtafla/2015081210.html" 2>/dev/null
# Extract HTML Table elements
# | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'
# Remove newlines
# | tr -d '\n\r'
# Replace </TR> with newline
# | sed 's/<\/TR[^>]*>/\n/Ig'
# Remove TABLE and TR tags
# | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'
# Remove a leading <TD>/<TH> and a trailing </TD>/</TH> on each line
# | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'
# Replace </TD><TD> with comma
# | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'
# Remove any remaining HTML tags
# | sed 's/<[^>]\+>//Ig'
# Remove any whitespace at the beginning of the line
# | sed 's/^[\ \t]*//g'
# Remove empty lines
# | sed '/^\s*$/d'
# Add timestamp (YYYYMMDDHH) to the beginning of each line
# | sed 's/^/2015081210,/'
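
The steps above hard-code the timestamp in two places (the URL and the final prefix). A minimal sketch of the same pipeline with the timestamp as a parameter follows. The function names `athtafla_url` and `html_table_to_csv` are hypothetical, the URL pattern is inferred from the single example URL above, and the `I` flag and `\n` in the replacement assume GNU sed.

```shell
#!/usr/bin/env bash
# Sketch only: requires GNU grep and GNU sed (for \|, \?, the I flag and \n).

# Hypothetical helper: build the page URL for a YYYYMMDDHH timestamp.
# Assumption: every hour follows the same pattern as the example URL.
athtafla_url() {
  printf 'http://brunnur.vedur.is/athuganir/athtafla/%s.html\n' "$1"
}

# Hypothetical helper: read HTML on stdin, write CSV rows on stdout,
# each row prefixed with the timestamp given as the first argument.
html_table_to_csv() {
  local stamp="$1"   # expected to be digits only (YYYYMMDDHH)
  grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' \
    | tr -d '\n\r' \
    | sed 's/<\/TR[^>]*>/\n/Ig' \
    | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' \
    | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' \
    | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig' \
    | sed 's/<[^>]\+>//Ig' \
    | sed 's/^[ \t]*//' \
    | sed '/^\s*$/d' \
    | sed "s/^/${stamp},/"
}
```

Usage would look like `ts=2015081210; curl -s "$(athtafla_url "$ts")" | html_table_to_csv "$ts"`. Keeping the download separate from the conversion means the conversion can also be run on a saved copy of the page. Note that cell text containing commas is not quoted, so this is only suitable for tables like the one shown.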