@caiobegotti
Created April 5, 2012 01:12
shell2python
Shell:
cat *.xml | sed 's/\([A-Z][[:alpha:]]\{0,\}\. [A-Z]\)/| \1/g' | tr '|' '\n' | sed 's/^\( .\{25\}\).*$/\1/g'
Cn. Octauii praecidi capu
P. Crassi
Sp. Albinus, homines cons
M. Antonii, omnium eloque
C. Caesaris, in quo mihi
C. Marius tum, cum Cimbri
Tib. Graccho legum auctor
C. Drusi domum compleri a
Cn. Aufidius praetorius e
M. Crassus, sed aliud mol
Python:
>>> import re
>>> regex = re.compile(r"[A-Z]'?\w{0,4}\. [A-Z]{0,}\w{0,}")
>>> regex.findall(text)
['Cn. Octauii', 'P. Crassi', 'Sp. Albinus', 'M. Antonii', 'C. Caesaris', 'C. Marius', 'Tib. Graccho', 'C. Drusi', 'Cn. Aufidius', 'M. Crassus']
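The session above depends on a `text` variable that is not shown. A minimal self-contained sketch of the same `findall` call, assuming Python 3; the names `find_names` and `SAMPLE` are hypothetical, and the sample sentence is stitched together from names in the shell output rather than taken from the real XML files:

```python
import re

# Same pattern as in the session above, written as a raw string.
PATTERN = re.compile(r"[A-Z]'?\w{0,4}\. [A-Z]{0,}\w{0,}")

# Hypothetical sample text, assembled from names in the shell output above.
SAMPLE = "tum Cn. Octauii praecidi caput, et P. Crassi, et Tib. Graccho legum auctor"


def find_names(text):
    """Return every abbreviation-plus-name match found in text, in order."""
    return PATTERN.findall(text)


print(find_names(SAMPLE))
```

Note that the pattern only looks at capitalisation and a period, so a short capitalised word ending a sentence (for example `Roma. Deinde`) would match as well; the results probably still need a manual pass against a list of praenomina.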
PythonRegex.Com:
>>> regex = re.compile(r"([A-Z]'?\w{0,4}\. \b[A-Z]{0,}\b\w{0,}(\. )?(\b[A-Z]{0,}\b\w{0,})?)", re.UNICODE)
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xf5896972b7d36f08>
>>> regex.match(string)
None
# List the groups found
>>> r.groups()
(u'M. Tullius', None, u'')
# List the named dictionary objects found
>>> r.groupdict()
{}
# Run findall
>>> regex.findall(string)
[(u'M. Tullius', u'', u''), (u'Mer. Caio', u'', u''), (u'F. P. totalia', u'. ', u'totalia'), (u'Ga. Cesar', u'', u''), (u"M'. C. Memento", u'. ', u'Memento'), (u'M. Metello', u'', u''), (u'Q. Verrem', u'', u''), (u'M. Metellum', u'', u''), (u'M. Metellum', u'', u''), (u'Q. Metellum', u'', u''), (u'L. Metellus', u'', u''), (u"M'. Glabrionem", u'', u''), (u'M. Caesonius', u'', u''), (u'Q. Manlium', u'', u''), (u'Q. Cornificium', u'', u''), (u'P. Sulpicius', u'', u''), (u'M. Crepereius', u'', u''), (u'L. Cassius', u'', u''), (u'Cn. Tremellius', u'', u''), (u'M. Metelli', u'', u''), (u'Cn. Pompeius', u'', u''), (u'M. Metellum', u'', u'')]
# Run timeit test
>>> setup = ur"import re; regex =re.compile("([A-Z]'?\w{0,4}\. \b[A-Z]{0,}\b\w{0,}(\. )?(\ ...
>>> t = timeit.Timer('regex.search(string)',setup)
>>> t.timeit(10000)
6.82871484756
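The setup string in the timeit run above is cut off, so it cannot be rerun as shown. A rough sketch of an equivalent measurement, assuming Python 3 and passing a callable instead of a setup string; the value of `string` here is a hypothetical stand-in, since the original input is not part of the gist:

```python
import re
import timeit

# Pattern copied from the re.compile call in the session above, as a raw
# string so that \b means "word boundary" rather than a backspace character.
regex = re.compile(
    r"([A-Z]'?\w{0,4}\. \b[A-Z]{0,}\b\w{0,}(\. )?(\b[A-Z]{0,}\b\w{0,})?)"
)

# Hypothetical stand-in for the original `string`, which is not shown.
string = "quod M. Tullius et Q. Verrem"

# Time 10000 searches; timings depend on the machine and on the input text.
elapsed = timeit.timeit(lambda: regex.search(string), number=10000)

print(regex.search(string).groups())
print(elapsed)
```

Timings from this sketch are not comparable with the figure above, because the input text and the Python version differ.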
@caiobegotti

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# caio begotti <caio1982@gmail.com>
# this is under public domain

# reference: https://gist.github.com/2307114
# double-check: http://en.wiktionary.org/wiki/Appendix:Roman_praenomina

import codecs
import glob
import re

# compile once, outside the loop; the raw string keeps the backslashes literal
regex = re.compile(r"[A-Z]'?\w{0,4}\. [A-Z]{0,}\w{0,}")

praenomina = []
for path in glob.glob('./*.xml'):
    # the with block closes each file after it has been read
    with codecs.open(path, "r", "utf8") as content:
        text = content.read()
    praenomina.extend(regex.findall(text))

praenomina = sorted(set(praenomina))
for entry in praenomina:
    print(entry.lower())

print(len(praenomina))
