@isseu

isseu/crawler-ramos-uc.py

Last active Aug 29, 2015
Crawler to fetch all courses (ramos) from the Universidad Católica
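#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Credits isseu
# Python 2 script (uses urllib2 and the print statement).
import urllib2
import re
from titlecase import titlecase

cursos = []
# Download the catalog page and strip newlines so each table row sits on one line.
data = urllib2.urlopen("http://catalogo.uc.cl/index.php?Itemid=55").read().replace("\n", "")
tabla = re.findall('<tr>.+?</tr>', data, flags=re.MULTILINE)
for item in tabla:
    # Capture the first three cells of the row; the second one is split on "-" below.
    res = re.search("<td>(.+?)</td><td>(.+?)</td><td>(.+?)</td>", item, flags=re.MULTILINE)
    if res is not None:
        sigla_seccion = res.group(2).split("-")
        # Keep only the first row seen for each course code (sigla).
        if sigla_seccion[0] not in [i[1] for i in cursos]:
            cursos.append([res.group(1), sigla_seccion[0], res.group(3)])
# Prints "# Found N courses:" (in Spanish) as a header comment.
print "# Encontrados " + str(len(cursos)) + " cursos: "
# Emit one Curso.create! line per course, with the name in title case.
for item in cursos:
    print "Curso.create!(nombre: \"%s\", sigla: \"%s\")" % (titlecase(item[2].decode('utf-8').lower()), item[1])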
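
The script above targets Python 2 (urllib2 and the print statement). Below is a rough Python 3 sketch of the same steps, split into functions so the parsing can be tried on a string without hitting the network. The function names (fetch_catalog, extract_courses, seed_line), the tolerance for whitespace between cells, the UTF-8 decoding of the response, and the default of str.title in place of titlecase are all assumptions of this sketch, not part of the original gist.

```python
import re
from urllib.request import urlopen

# URL taken from the original script.
CATALOG_URL = "http://catalogo.uc.cl/index.php?Itemid=55"

# DOTALL lets "." cross newlines, so the page does not need its newlines removed.
ROW_RE = re.compile(r"<tr>(.+?)</tr>", re.DOTALL)
# Assumption: cells may be separated by whitespace (the original requires none).
CELLS_RE = re.compile(r"<td>(.+?)</td>\s*<td>(.+?)</td>\s*<td>(.+?)</td>", re.DOTALL)


def fetch_catalog(url=CATALOG_URL):
    """Download the page; assumes it is UTF-8 encoded, as the original decode suggests."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_courses(html):
    """Return (first_cell, sigla, nombre) tuples, one per distinct sigla."""
    seen = set()  # siglas already collected; a set avoids rescanning the list
    courses = []
    for row in ROW_RE.findall(html):
        match = CELLS_RE.search(row)
        if match is None:
            continue  # header rows or rows with fewer than three <td> cells
        first_cell, sigla_seccion, nombre = (c.strip() for c in match.groups())
        sigla = sigla_seccion.split("-")[0]
        if sigla not in seen:
            seen.add(sigla)
            courses.append((first_cell, sigla, nombre))
    return courses


def seed_line(nombre, sigla, title_fn=str.title):
    """Format one Curso.create! line; pass titlecase.titlecase as title_fn if installed."""
    return 'Curso.create!(nombre: "%s", sigla: "%s")' % (title_fn(nombre.lower()), sigla)


if __name__ == "__main__":
    cursos = extract_courses(fetch_catalog())
    print("# Encontrados %d cursos:" % len(cursos))
    for _, sigla, nombre in cursos:
        print(seed_line(nombre, sigla))
```

Note that str.title capitalizes every word, so its output will differ from titlecase for names containing short words; passing titlecase.titlecase as title_fn should reproduce the original behavior more closely.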
@isseu isseu commented Feb 1, 2015

To install titlecase: (sudo) pip install titlecase
