@ssp
Created July 30, 2012 08:09
grep searchterms from query logs
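# limit to:
# * searches (SRCHA requests with a TRM parameter)
# * not the SX20 presentation (used for the Neuerwerbungen, i.e. new acquisitions, RSS feed)
# * not XML=1 requests
# extract:
# * the query term from the TRM parameter ("+" converted to spaces)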
cat access_log | grep SRCHA | grep TRM | grep -v SX20 | grep -v "XML=1" | sed -e "s/.*TRM=//" -e "s/[& ].*//" -e "s/+/ /g" > searchterms
# sort the list and count identical terms, most frequent first:
# sort options:
# * -b: ignore leading blanks
# * -f: case insensitive
# * -n: numeric sort
# * -r: reverse sort order
# uniq options:
# * -c: add count to output
# * -i: case insensitive
cat searchterms | sort -b -f | uniq -c -i | sort -b -n -r > searchterms-list
# convert the list into tab-separated values for a spreadsheet
# * column 1: number of 'identical' queries
# * column 2: how often that count occurred
cat searchterms-list | sed -e 's/^\(\s*[0-9]*\) .*/\1/g' | uniq -c | sed -e 's/\s*\([0-9]*\)\s*\([0-9]*\)/\2\t\1/'
# try to estimate how many queries users made
# the XX in the SET=XX part of the URL gives a hint about that
cat access_log | grep SRCH | grep "/SET" | sed -e "s/.*SET=\([^&/]*\).*/\1/" | sort -n | uniq -c | sed -e 's/\s*\([0-9]*\)\s*\(.*\)/\2\t\1/'
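To check the extraction and counting steps without a real access_log, they can be run against a few made-up request lines. The following is a minimal sketch: the sample URLs, the `PRS=SX20` path segment, and the variable names `sample_log`, `terms` and `ranked` are assumptions for illustration only, and real log lines will carry more fields.

```shell
#!/bin/sh
# Hypothetical sample requests (assumed format, not taken from a real log).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
GET /DB=1/CMD?ACT=SRCHA&IKT=1016&SRT=YOP&TRM=open+access HTTP/1.1
GET /DB=1/CMD?ACT=SRCHA&IKT=1016&TRM=Open+Access&SRT=YOP HTTP/1.1
GET /DB=1/CMD?ACT=SRCHA&IKT=1016&TRM=goethe HTTP/1.1
GET /DB=1/PRS=SX20/CMD?ACT=SRCHA&TRM=neu HTTP/1.1
GET /DB=1/XML=1/CMD?ACT=SRCHA&TRM=api+call HTTP/1.1
GET /DB=1/SET=2/TTL=1/SHW?FRST=1 HTTP/1.1
EOF

# Same filters and sed expressions as the first pipeline above:
# keep searches, drop SX20 and XML=1 requests, cut out the TRM value.
terms=$(grep SRCHA "$sample_log" | grep TRM | grep -v SX20 | grep -v "XML=1" \
  | sed -e "s/.*TRM=//" -e "s/[& ].*//" -e "s/+/ /g")

# Same sort/uniq chain as the second pipeline: count case-insensitively,
# then order by count, highest first.
ranked=$(printf '%s\n' "$terms" | sort -b -f | uniq -c -i | sort -b -n -r)

printf '%s\n' "$terms"
printf '%s\n' "$ranked"
rm -f "$sample_log"
```

With this input, three terms survive the filters, and the two spellings of "open access" collapse into a single entry with a count of 2. Which spelling is shown for that entry depends on how sort orders the tied lines in the current locale.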