@kurtraschke
Created September 1, 2010 03:04
Regular expression and script for parsing and sorting Library of Congress Classification call numbers
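import re
from functools import cmp_to_key

# Read the call numbers, one per line, skipping blank lines.
with open('list', 'r') as myfile:
    callnos = [line for line in myfile if line.strip()]

# Verbose-mode pattern: class letters, class number, optional decimal,
# optional date, first Cutter, then optional trailing elements.
p = re.compile("""^(?P<aclass>[A-Z]{1,3})
(?P<nclass>\\d{1,4})(\\ ?)
(\\.(?P<dclass>\\d{1,3}))?
(?P<date>\\ [A-Za-z0-9]{1,4}\\ )?
([\\ \\.](?P<c1>[A-Z][0-9]{1,4}))
(\\ (?P<c1d>[A-Za-z0-9]{0,4}))?
(\\.?(?P<c2>[A-Z][0-9]{1,4}))?
(\\ (?P<e8>\\w*)\\ ?)?
(\\ (?P<e9>\\w*)\\ ?)?
(\\ (?P<e10>\\w*)\\ ?)?""",
               re.VERBOSE)


def scmp(x, y):
    # Compare two optional values; a missing part (None) sorts first,
    # which is what Python 2's built-in cmp() did.
    if x is None or y is None:
        return (x is not None) - (y is not None)
    return (x > y) - (x < y)


def ncmp(x, y):
    # Compare two optional digit strings by integer value.
    if x is None or y is None:
        return (x is not None) - (y is not None)
    return scmp(int(x), int(y))


# Fields in comparison order. A list keeps this order fixed; the original
# dict did not guarantee it on Python 2.
PARTS = [('aclass', scmp), ('nclass', ncmp),
         ('dclass', scmp),  # decimal digits: "25" < "5" as strings, like .25 < .5
         ('date', scmp), ('c1', scmp), ('c1d', scmp), ('c2', scmp),
         ('e8', scmp), ('e9', scmp), ('e10', scmp)]


def sortfunc(x, y):
    # cmp-style comparison: negative, zero, or positive.
    xp = p.search(x)
    yp = p.search(y)
    for part, compare in PARTS:
        cr = compare(xp.group(part), yp.group(part))
        if cr != 0:
            return cr
    return 0  # every part compared equal


def normalize(callno):
    # Rebuild a call number in a consistent layout from its parsed parts.
    cp = p.search(callno)
    out = cp.group('aclass') + cp.group('nclass')
    if cp.group('dclass') is not None:
        out += "." + cp.group('dclass')
    if cp.group('date') is not None:
        # The date group captures its surrounding spaces, so strip them first.
        out += " " + cp.group('date').strip() + " "
    out += "." + cp.group('c1')
    if cp.group('c1d') is not None:
        out += " " + cp.group('c1d') + " "
    if cp.group('c2') is not None:
        out += " " + cp.group('c2')
    for part in ('e8', 'e9', 'e10'):
        if cp.group(part) is not None:
            out += " " + cp.group(part)
    return out


callnos.sort(key=cmp_to_key(sortfunc))
for callno in callnos:
    print("%25s %25s" % (callno.strip(), normalize(callno).strip()))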
@jesstucker

Hello,

I admire this bit of code very much. Since I am a Python newbie, would you mind demonstrating how to use it, especially sortfunc()?

Thank you.
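
For reference, sortfunc() is a cmp-style comparison function: it takes two call numbers and returns a negative number, zero, or a positive number. Python 2 accepts such a function directly as the first argument to list.sort(); Python 3 needs it wrapped with functools.cmp_to_key. The following is a minimal sketch of that pattern only. `simple_cmp` and `HEAD` are hypothetical stand-ins that compare just the class letters and class number, and the call numbers are made-up sample values.

```python
import re
from functools import cmp_to_key

# Hypothetical stand-in for the script's sortfunc: it looks only at the
# class letters and the class number, not the full call number.
HEAD = re.compile(r"^([A-Z]{1,3})(\d{1,4})")


def simple_cmp(x, y):
    xa, xn = HEAD.match(x).groups()
    ya, yn = HEAD.match(y).groups()
    xkey, ykey = (xa, int(xn)), (ya, int(yn))
    # Return -1, 0, or 1, as a cmp-style function must.
    return (xkey > ykey) - (xkey < ykey)


# Made-up sample call numbers.
callnos = ["QA76 .K5", "PS3545 .A1", "QA9 .B3", "HA37 .C22"]

# Python 3 form; on Python 2 the equivalent call is callnos.sort(simple_cmp).
callnos.sort(key=cmp_to_key(simple_cmp))
print(callnos)
```

The full script is used the same way: put one call number per line in a file named `list` next to the script, and it prints each original call number beside its normalized form.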

@sajattack

Hi, thanks for all your work on this script. I've been experimenting with it and noticed that this sequence:

HA37 .C22 C87 2002
H61 .M5 2000 
H61 .K24
H61 .G593 2006
H35 .L54
H53 .U5 S5 2006

is sorted in this order:

                 H35 .L54                   H35.L54
           H61 .G593 2006             H61.G593 2006
                 H61 .K24                   H61.K24
             H61 .M5 2000               H61.M5 2000
          H53 .U5 S5 2006           H53.U5 S5  2006
       HA37 .C22 C87 2002        HA37.C22 C87  2002

Shouldn't the H53 appear before the H61?
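
The output above is consistent with the fields being compared in an unintended order, for instance with a trailing element such as the year checked before the class number. That can happen when the per-field comparison functions are stored in a plain dict, since Python 2 does not guarantee dict iteration order. Below is a minimal sketch of an order-preserving alternative that sorts on a tuple key. The pattern `SIMPLE` and the helper `sort_key` are hypothetical names, not part of the gist, and they parse only the class letters and class number, comparing the rest of the call number as plain text.

```python
import re

# Hypothetical simplified pattern: class letters, class number, then the rest.
SIMPLE = re.compile(r"^(?P<aclass>[A-Z]{1,3})(?P<nclass>\d{1,4})\s*(?P<rest>.*)$")


def sort_key(callno):
    # The tuple's element order fixes the comparison order:
    # class letters first, then class number (as an integer), then the rest.
    m = SIMPLE.match(callno.strip())
    return (m.group("aclass"), int(m.group("nclass")), m.group("rest"))


# The six call numbers from the comment above.
callnos = [
    "HA37 .C22 C87 2002",
    "H61 .M5 2000 ",
    "H61 .K24",
    "H61 .G593 2006",
    "H35 .L54",
    "H53 .U5 S5 2006",
]

ordered = sorted(callnos, key=sort_key)
for callno in ordered:
    print(callno.strip())
```

With this key, H53 sorts between H35 and H61 because the class number is always compared immediately after the class letters.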