Skip to content

Instantly share code, notes, and snippets.

@RavisMsk
Last active December 15, 2015 07:09
Show Gist options
  • Save RavisMsk/5221290 to your computer and use it in GitHub Desktop.
Save RavisMsk/5221290 to your computer and use it in GitHub Desktop.
Boyer-Moore Algorithm implementation for analyzing apache logs.
#Copyright (c) 2013 Anisimov Nikita
#Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
#documentation files (the "Software"), to deal in the Software without restriction, including without limitation
#the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
#to permit persons to whom the Software is furnished to do so, subject to the following conditions:
#The above copyright notice and this permission notice shall be included in all copies or substantial portions of
#the Software.
#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
#THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
#AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
#TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
##################
# App for analyzing big logs from apache server and finding one IP address, from which half or more of all requests came.
# Uses Boyer-Moore algorithm, which means its O(n) and uses no additional memory, except one integer-counter.
# It can be modified for finding IP addresses, which generated, for example, 0.25 and more of all requests.
# In this app log files should be named like: (month)-(day).log. Example: 02-15.log
def majority(group):
confidence=0
candidate=False
for i in range(0, len(group)):
if confidence==0:
candidate=group[i]
confidence=confidence+1
elif candidate==group[i]:
confidence=confidence+1
else:
confidence=confidence-1
if confidence>0:
return candidate
else:
return False
def parseIPs(*dates):
pool=[]
for date in dates:
print '#####====='
print 'Working for date 02-%s' % date
fname='02-'+date+'.log'
content=''
with open(fname) as f:
content=f.readlines()
print 'Got %s records' % len(content)
for line in content:
buff=line.split(' - - ')
# print 'Inserting %s to pool' % buff[0]
pool.append(buff[0])
return pool
def main():
print '#####=====\n#####=====\nMajority of all requests came from %s' % majority(parseIPs('14','15','16','17','18'))
if __name__=='__main__':
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment