Last active Oct 29, 2017
A generic text-processing script, originally written for specific use cases around wiki text.
# -*- coding: utf-8 -*-
__author__ = 'Arun KR (kra3) <>'
__license__ = 'Simplified BSD'

import sys

"""
Usage: <script> rules.txt data.txt > result.txt

rules.txt can be any text file with one rule per line.
A rule has the format CODE :=> STRING
CODE is one of DL or RM.
DL deletes a matching line. PS: matching starts from the beginning of the line, after removing whitespace.
RM removes matching strings from a line.
You have to redirect the output to a file or to another Unix command for further processing.

Utility developed for Malayalam Wikibooks maintainers.
"""


def wiki_helper(data, rules):
    # lists of rules
    delete_line_matches = []
    remove_string_matches = []

    # Open both files (assumed to be UTF-8); they are closed on exit.
    with open(data, encoding='utf-8') as data_fh, \
            open(rules, encoding='utf-8') as rules_fh:
        ## extract the user-defined rules
        for line in rules_fh:
            rule = line.split(':=>')
            if len(rule) != 2:  # safeguard against malformed rules
                continue
            code, expr = map(str.strip, rule)  # strip surrounding whitespace
            # file each rule under its code
            if code == 'DL':
                delete_line_matches.append(expr)
            elif code == 'RM':
                remove_string_matches.append(expr)

        ## process the data with the rules
        for line in data_fh:
            matched = False  # sentinel
            # loop until a delete-line rule matches,
            # then set the sentinel and stop looking.
            for match in delete_line_matches:
                if line.strip().startswith(match):
                    matched = True
                    break
            # the line survives: remove every RM string from it.
            if not matched:
                for expr in remove_string_matches:
                    line = line.replace(expr, '')
                # the line keeps its own newline, so print adds none.
                print(line, end='')


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Incorrect format. Try:")
        print("\t%s rules data" % sys.argv[0])
        sys.exit(1)
    wiki_helper(sys.argv[2], sys.argv[1])

kra3 commented Apr 23, 2014


DL :=> {{കേരളത്തിലെ പക്ഷികളുടെ പട്ടിക - തുടക്കം|നിര=
RM :=> {{കേരളത്തിലെ പക്ഷികളുടെ പട്ടിക - ഉള്ളടക്കം|
RM :=> }}
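
To illustrate how the three sample rules above behave, here is a minimal in-memory sketch. The function name `apply_rules` and the data lines are hypothetical and not part of the script; the sketch assumes the same semantics as `wiki_helper` (DL drops a line whose stripped text starts with the rule string, RM deletes every occurrence of the rule string).

```python
def apply_rules(rule_lines, data_lines):
    """Hypothetical in-memory counterpart of wiki_helper: returns surviving lines."""
    delete_prefixes, remove_strings = [], []
    for raw in rule_lines:
        parts = raw.split(':=>')
        if len(parts) != 2:  # skip malformed rules
            continue
        code, expr = (part.strip() for part in parts)
        if code == 'DL':
            delete_prefixes.append(expr)
        elif code == 'RM':
            remove_strings.append(expr)

    kept = []
    for line in data_lines:
        # DL: drop the line when its stripped form starts with any DL string.
        if any(line.strip().startswith(prefix) for prefix in delete_prefixes):
            continue
        # RM: delete every occurrence of each RM string.
        for expr in remove_strings:
            line = line.replace(expr, '')
        kept.append(line)
    return kept


# The two template strings used by the sample rules above.
HEADER = "{{കേരളത്തിലെ പക്ഷികളുടെ പട്ടിക - തുടക്കം|നിര="
ENTRY = "{{കേരളത്തിലെ പക്ഷികളുടെ പട്ടിക - ഉള്ളടക്കം|"
rules = ["DL :=> " + HEADER, "RM :=> " + ENTRY, "RM :=> }}"]

# Made-up data lines (assumptions for illustration, not real wiki content).
data = [
    "  " + HEADER + "1}}\n",       # indented header line: deleted by DL
    ENTRY + "Indian Peafowl}}\n",  # wrapper removed by the two RM rules
    "plain text\n",                # no rule applies
]

result = apply_rules(rules, data)
print(result)  # → ['Indian Peafowl\n', 'plain text\n']
```

The indented first line is still deleted because matching is done on the stripped line, while the surviving lines keep their original newlines.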
