Skip to content

Instantly share code, notes, and snippets.

@vapniks
Created April 11, 2018 03:38
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save vapniks/7d49fcd5f8a32f81f1ae500055ba1aa9 to your computer and use it in GitHub Desktop.
Save vapniks/7d49fcd5f8a32f81f1ae500055ba1aa9 to your computer and use it in GitHub Desktop.
An awk script to reshape UK crime data .csv files. This is at least 10 times faster than R or python (I ran out of memory when I tried doing it with R).
#!/usr/bin/awk -f
# Create variables containing counts of the number of different incident types within each area,
# where an area is defined as a unique longitude/latitude pair to the nearest 2 decimal places
BEGIN {
# Define csv fields
FPAT="\"[^\"]*\"|[^\",]*";
PROCINFO["sorted_in"] = "@ind_str_asc";
}
NR==1 {
# get the date from the current filename
patsplit(FILENAME,parts,"201[0-9]-(1[012]|0[0-9])");
date = parts[1];
}
NR>1 {
# round longitude and latitude to nearest 2 decimal places
long = sprintf("%.2f",$5);
lat = sprintf("%.2f",$6);
# increment the count for crimes of this type in this area
type = $10;
counts[long,lat][type] += 1;
# add crime type to all array so we have a record of all the different crime types
types[type] = 1;
}
END {
# print the column headers of the output
printf("long-lat,Month,");
for(type in types) printf("%s,",type);
printf("total\n");
# loop over all stored longitude:latitude pairs
for(indx in counts) {
indx2 = indx;
# replace long,lat index separator with ":" and print it
sub(SUBSEP,":",indx2);
printf("%s,",indx2);
# print the date
printf("%s,",date);
# keep a track of the total number of crimes in this area
total = 0;
# print the counts of different crime types in this area
for(type in types) {
num = counts[indx][type];
printf("%d,",num);
total += num;
}
# print the total number of crimes in this area
printf("%d\n",total);
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment