@isseu
Last active August 29, 2015 14:08
Crawler that scrapes all courses from the Universidad Católica catalogue
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Credits isseu
import urllib2
import re
from titlecase import titlecase
import unicodedata
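# Note: this is a Python 2 script (urllib2, print statements).
cursos = []
# Download the catalogue page and flatten it to one line so the regexes can span rows.
data = urllib2.urlopen("http://catalogo.uc.cl/index.php?Itemid=55").read().replace("\n", "")
tabla = re.findall('<tr>.+?</tr>', data, flags=re.MULTILINE)
for item in tabla:
    # Each row is expected to hold three cells; the second one is "<sigla>-<section>".
    res = re.search("<td>(.+?)</td><td>(.+?)</td><td>(.+?)</td>", item, flags=re.MULTILINE)
    if res is not None:
        sigla_seccion = res.group(2).split("-")
        # Keep only the first section seen for each course code.
        if sigla_seccion[0] not in [i[1] for i in cursos]:
            cursos.append([res.group(1), sigla_seccion[0], res.group(3)])
# Emit the result as Ruby Curso.create! statements, one per course.
print "# Found " + str(len(cursos)) + " courses: "
for item in cursos:
    print "Curso.create!(nombre: \"%s\", sigla: \"%s\")" % (titlecase(item[2].decode('utf-8').lower()), item[1])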

isseu commented Feb 1, 2015

To install titlecase: (sudo) pip install titlecase
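
For anyone on Python 3, the parsing and de-duplication step of the script can be sketched without the network call, roughly as below. This is only a sketch under assumptions: the function names `extract_courses` and `seed_line`, the `\s*` tolerance between cells, and the sample rows (made-up course codes) are not part of the original gist, and `str.title()` stands in for `titlecase` so the snippet has no third-party dependency.

```python
import re

# One table row, and the three cells the original script expects in each row.
ROW_RE = re.compile(r"<tr>(.+?)</tr>", re.DOTALL)
CELLS_RE = re.compile(
    r"<td>(.+?)</td>\s*<td>(.+?)</td>\s*<td>(.+?)</td>", re.DOTALL
)


def extract_courses(html):
    """Return one (sigla, name) pair per course code, in first-seen order."""
    seen = set()  # set lookup instead of rescanning the result list
    courses = []
    for row in ROW_RE.findall(html):
        cells = CELLS_RE.search(row)
        if cells is None:
            continue  # header rows or rows with fewer than three cells
        # The second cell looks like "<sigla>-<section>"; keep only the sigla.
        sigla = cells.group(2).split("-")[0].strip()
        if sigla in seen:
            continue
        seen.add(sigla)
        courses.append((sigla, cells.group(3).strip()))
    return courses


def seed_line(sigla, name):
    """Format one course as a Curso.create! statement (str.title as stand-in)."""
    return 'Curso.create!(nombre: "%s", sigla: "%s")' % (name.title(), sigla)


# Hypothetical sample rows, only for illustration.
SAMPLE = (
    "<table>"
    "<tr><td>1</td><td>MAT1610-1</td><td>CALCULO I</td></tr>"
    "<tr><td>2</td><td>MAT1610-2</td><td>CALCULO I</td></tr>"
    "<tr><td>3</td><td>FIS1503-1</td><td>FISICA GENERAL</td></tr>"
    "</table>"
)

if __name__ == "__main__":
    for sigla, name in extract_courses(SAMPLE):
        print(seed_line(sigla, name))
```

The page itself could be fetched with `urllib.request.urlopen(...).read().decode(...)` and passed to `extract_courses`; the right encoding for the catalogue page is not stated in the gist, so that part is left out here.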
