@Maes95
Created September 21, 2020 13:58
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mining Repositories with GitHub"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the libraries used in this example ship with Python 3.7+ as part of the standard library. The only one you need to install is the library that wraps the GitHub API:\n",
"```bash\n",
"$ pip install PyGithub\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import datetime\n",
"import random\n",
"import json\n",
"import pickle\n",
"from github import Github"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We generate a token to query the GitHub API through the PyGithub library\n",
"- We need a GitHub account\n",
"- We follow [this simple tutorial](https://docs.github.com/es/github/authenticating-to-github/creating-a-personal-access-token) to generate it\n",
"- The token is NOT required to run the queries, but with it the request quota is significantly higher, which saves a lot of waiting time between requests"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"token = \"<token>\"\n",
"\n",
"g = Github(\"maes95\", token)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can run simple queries using a set of parameters defined by the library (they are included in the request to the API). The details are documented in the [library's repository](https://github.com/PyGithub/PyGithub) and in its [official documentation](https://pygithub.readthedocs.io/en/latest/examples/MainClass.html#search-repositories-by-language)\n",
"\n",
"Running a query returns a generator object\n",
"- It does not perform any query or search yet\n",
"- The search starts when we iterate over the generator"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"query=\"\"\"\n",
" language:java \n",
" stars:>=500 \n",
" forks:>=300 \n",
" created:<2015-01-01 \n",
" pushed:>2020-01-01\n",
" archived:false\n",
" is:public\n",
"\"\"\"\n",
"\n",
"generator = g.search_repositories(query=query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can easily turn the generator into a list of repositories in Python. In this case the query IS executed"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# list() iterates over the generator internally and builds a list from the search results\n",
"repositories = list(generator)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We save the retrieved repository information to a Python binary file\n",
"- We use the pickle library\n",
"- GitHub API search results can change over time, so the same search may return more or fewer repositories\n",
"- We save it with a timestamp so that each snapshot is unambiguously identified"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total repos: 721\n"
]
}
],
"source": [
"date = str(datetime.datetime.now())\n",
"\n",
"with open('repos_%s.pickle' % date, 'wb') as f:\n",
"    pickle.dump(repositories, f)\n",
"\n",
"print(\"Total repos: %d\" % len(repositories))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We read the binary file back. If the notebook has been closed and reopened, the timestamp will have changed and the file name has to be entered by hand."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"repos = 'repos_%s.pickle' % date\n",
"#repos = 'repos_2020-09-21 12:03:13.421536.pickle'\n",
"with open(repos, 'rb') as f:\n",
"    repositories = pickle.load(f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can apply filters to the repositories we found. To begin with, we filter them by number of commits, keeping only those within a given range"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total projects 342\n"
]
}
],
"source": [
"MAX_COMMITS = 10000\n",
"MIN_COMMITS = 1000\n",
"filtered_repos = []\n",
"for repo in repositories:\n",
"    commits = repo.get_commits().totalCount\n",
"    if MIN_COMMITS <= commits <= MAX_COMMITS:\n",
"        filtered_repos.append(repo)\n",
"print(\"Total projects %d\" % len(filtered_repos))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We apply a second filter, keeping only the projects that use at least one of the most common Java build systems.\n",
"\n",
"- Maven (pom.xml)\n",
"- Gradle (build.gradle)\n",
"- Ant (build.xml)\n",
"\n",
"This query is more expensive than the previous one, because it has to check whether any file in the repository root matches one of the configuration files listed above. The order of the filters matters: running the cheaper ones first keeps the number of requests as low as possible."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total repos: 314\n"
]
}
],
"source": [
"filtered_repos_2 = []\n",
"\n",
"for repo in filtered_repos:\n",
"    contents = repo.get_contents(\"\")\n",
"    for content_file in contents:\n",
"        if content_file.path in [\"build.gradle\", \"pom.xml\", \"build.xml\"]:\n",
"            filtered_repos_2.append(repo)\n",
"            break\n",
"\n",
"print(\"Total repos: %d\" % len(filtered_repos_2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides filtering projects by the files they contain, we can also inspect the content of specific files (or even groups of files, for example those ending in .java)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total projects 267\n"
]
}
],
"source": [
"# Discard Android projects: their root build.gradle references the Android Gradle plugin\n",
"filtered_repos_3 = []\n",
"for repo in filtered_repos_2:\n",
"    contents = repo.get_contents(\"\")\n",
"    isAndroid = False\n",
"    for file in contents:\n",
"        if file.path == \"build.gradle\":\n",
"            isAndroid = 'com.android.tools.build' in str(repo.get_contents(\"build.gradle\").decoded_content)\n",
"            break\n",
"    if not isAndroid:\n",
"        filtered_repos_3.append(repo)\n",
"print(\"Total projects %d\" % len(filtered_repos_3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even after these filters we may still have too many repositories for the experiment we want to run. We can therefore limit the sample by picking a meaningful number of them at random"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"code4craft/webmagic\n",
"googlemaps/google-maps-services-java\n",
"vavr-io/vavr\n",
"spring-projects/spring-data-jpa\n",
"todoroo/astrid\n",
"mockito/mockito\n",
"undertow-io/undertow\n",
"jchambers/pushy\n",
"MyCATApache/Mycat-Server\n",
"geometer/FBReaderJ\n",
"mcMMO-Dev/mcMMO\n",
"MinecraftForge/MinecraftForge\n",
"azkaban/azkaban\n",
"JodaOrg/joda-time\n",
"springfox/springfox\n",
"apache/tika\n",
"OryxProject/oryx\n",
"junit-team/junit4\n",
"apache/commons-lang\n",
"brianfrankcooper/YCSB\n",
"deeplearning4j/nd4j\n",
"ethereum/ethereumj\n",
"marytts/marytts\n",
"apache/nifi\n",
"tuguangquan/mybatis\n",
"javaparser/javaparser\n",
"thymeleaf/thymeleaf\n",
"apache/shiro\n",
"geometer/FBReaderJ\n",
"ltsopensource/light-task-scheduler\n"
]
}
],
"source": [
"# Select 30 projects at random (random.choices draws with replacement, so a project can appear more than once)\n",
"sampling = random.choices(filtered_repos_3, k=30)\n",
"for project in sampling:\n",
"    print(project.full_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The repositories are now selected. However, GitHub does not guarantee that they will always be available; they could:\n",
"- Be deleted\n",
"- Become private repositories\n",
"- Lose commits or branches from their history\n",
"\n",
"For that reason, it is a good idea to clone them as soon as they are selected and keep a local copy. To do so, we create a folder to store them.\n",
"\n",
"Creating a folder is trivial in any programming language, especially with current libraries, but I would like to illustrate a simple yet effective principle: never overwrite resources that already exist, because the new data can destroy the previous data. For a directory, the call simply refuses to create it again, but when creating files, an existing file is replaced without asking."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Folder repositories already exist\n"
]
}
],
"source": [
"folder_name = 'repositories'\n",
"if not os.path.exists(folder_name):\n",
"    os.mkdir(folder_name)\n",
"    print(\"Folder %s created!\" % folder_name)\n",
"else:\n",
"    print(\"Folder %s already exist\" % folder_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We clone the repositories one by one into the new folder. If the following cell is run again, projects that already exist locally are not cloned again"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cloning into 'repositories/webmagic'...\n",
"remote: Enumerating objects: 11, done.\u001b[K\n",
"remote: Counting objects: 100% (11/11), done.\u001b[K\n",
"remote: Compressing objects: 100% (8/8), done.\u001b[K\n",
"remote: Total 15335 (delta 0), reused 6 (delta 0), pack-reused 15324\u001b[K\n",
"Receiving objects: 100% (15335/15335), 16.63 MiB | 11.97 MiB/s, done.\n",
"Resolving deltas: 100% (6080/6080), done.\n",
" -> Project webmagic cloned!\n",
"Cloning into 'repositories/google-maps-services-java'...\n",
"remote: Enumerating objects: 77, done.\u001b[K\n",
"remote: Counting objects: 100% (77/77), done.\u001b[K\n",
"remote: Compressing objects: 100% (49/49), done.\u001b[K\n",
"remote: Total 13408 (delta 22), reused 39 (delta 9), pack-reused 13331\u001b[K\n",
"Receiving objects: 100% (13408/13408), 6.81 MiB | 8.64 MiB/s, done.\n",
"Resolving deltas: 100% (8566/8566), done.\n",
" -> Project google-maps-services-java cloned!\n",
"Cloning into 'repositories/vavr'...\n",
"remote: Enumerating objects: 9, done.\u001b[K\n",
"remote: Counting objects: 100% (9/9), done.\u001b[K\n",
"remote: Compressing objects: 100% (6/6), done.\u001b[K\n",
"remote: Total 67549 (delta 0), reused 4 (delta 0), pack-reused 67540\u001b[K\n",
"Receiving objects: 100% (67549/67549), 19.32 MiB | 11.72 MiB/s, done.\n",
"Resolving deltas: 100% (43661/43661), done.\n",
" -> Project vavr cloned!\n",
"Cloning into 'repositories/spring-data-jpa'...\n",
"remote: Enumerating objects: 38104, done.\u001b[K\n",
"remote: Total 38104 (delta 0), reused 0 (delta 0), pack-reused 38104\u001b[K\n",
"Receiving objects: 100% (38104/38104), 7.82 MiB | 8.28 MiB/s, done.\n",
"Resolving deltas: 100% (19331/19331), done.\n",
" -> Project spring-data-jpa cloned!\n",
"Cloning into 'repositories/astrid'...\n",
"remote: Enumerating objects: 83161, done.\u001b[K\n",
"remote: Total 83161 (delta 0), reused 0 (delta 0), pack-reused 83161\u001b[K\n",
"Receiving objects: 100% (83161/83161), 59.62 MiB | 14.24 MiB/s, done.\n",
"Resolving deltas: 100% (50047/50047), done.\n",
" -> Project astrid cloned!\n",
"Cloning into 'repositories/mockito'...\n",
"remote: Enumerating objects: 27, done.\u001b[K\n",
"remote: Counting objects: 100% (27/27), done.\u001b[K\n",
"remote: Compressing objects: 100% (20/20), done.\u001b[K\n",
"remote: Total 78276 (delta 4), reused 14 (delta 3), pack-reused 78249\u001b[K\n",
"Receiving objects: 100% (78276/78276), 45.65 MiB | 15.51 MiB/s, done.\n",
"Resolving deltas: 100% (37976/37976), done.\n",
" -> Project mockito cloned!\n",
"Cloning into 'repositories/undertow'...\n",
"remote: Enumerating objects: 45, done.\u001b[K\n",
"remote: Counting objects: 100% (45/45), done.\u001b[K\n",
"remote: Compressing objects: 100% (41/41), done.\u001b[K\n",
"remote: Total 108822 (delta 21), reused 23 (delta 2), pack-reused 108777\u001b[K\n",
"Receiving objects: 100% (108822/108822), 23.09 MiB | 4.94 MiB/s, done.\n",
"Resolving deltas: 100% (46056/46056), done.\n",
" -> Project undertow cloned!\n",
"Cloning into 'repositories/pushy'...\n",
"remote: Enumerating objects: 318, done.\u001b[K\n",
"remote: Counting objects: 100% (318/318), done.\u001b[K\n",
"remote: Compressing objects: 100% (188/188), done.\u001b[K\n",
"remote: Total 21375 (delta 143), reused 231 (delta 80), pack-reused 21057\u001b[K\n",
"Receiving objects: 100% (21375/21375), 7.71 MiB | 1.66 MiB/s, done.\n",
"Resolving deltas: 100% (9179/9179), done.\n",
" -> Project pushy cloned!\n",
"Cloning into 'repositories/Mycat-Server'...\n",
"remote: Enumerating objects: 12, done.\u001b[K\n",
"remote: Counting objects: 100% (12/12), done.\u001b[K\n",
"remote: Compressing objects: 100% (10/10), done.\u001b[K\n",
"remote: Total 38289 (delta 0), reused 5 (delta 0), pack-reused 38277\u001b[K\n",
"Receiving objects: 100% (38289/38289), 18.69 MiB | 12.02 MiB/s, done.\n",
"Resolving deltas: 100% (20945/20945), done.\n",
" -> Project Mycat-Server cloned!\n",
"Cloning into 'repositories/FBReaderJ'...\n",
"remote: Enumerating objects: 1, done.\u001b[K\n",
"remote: Counting objects: 100% (1/1), done.\u001b[K\n",
"remote: Total 229380 (delta 0), reused 0 (delta 0), pack-reused 229379\u001b[K\n",
"Receiving objects: 100% (229380/229380), 63.44 MiB | 14.19 MiB/s, done.\n",
"Resolving deltas: 100% (125040/125040), done.\n",
" -> Project FBReaderJ cloned!\n",
"Cloning into 'repositories/mcMMO'...\n",
"remote: Enumerating objects: 199, done.\u001b[K\n",
"remote: Counting objects: 100% (199/199), done.\u001b[K\n",
"remote: Compressing objects: 100% (131/131), done.\u001b[K\n",
"remote: Total 98257 (delta 66), reused 140 (delta 46), pack-reused 98058\u001b[K\n",
"Receiving objects: 100% (98257/98257), 23.37 MiB | 12.40 MiB/s, done.\n",
"Resolving deltas: 100% (54411/54411), done.\n",
" -> Project mcMMO cloned!\n",
"Cloning into 'repositories/MinecraftForge'...\n",
"remote: Enumerating objects: 13, done.\u001b[K\n",
"remote: Counting objects: 100% (13/13), done.\u001b[K\n",
"remote: Compressing objects: 100% (12/12), done.\u001b[K\n",
"remote: Total 125905 (delta 2), reused 4 (delta 1), pack-reused 125892\u001b[K\n",
"Receiving objects: 100% (125905/125905), 99.04 MiB | 17.38 MiB/s, done.\n",
"Resolving deltas: 100% (65711/65711), done.\n",
" -> Project MinecraftForge cloned!\n",
"Cloning into 'repositories/azkaban'...\n",
"remote: Enumerating objects: 38, done.\u001b[K\n",
"remote: Counting objects: 100% (38/38), done.\u001b[K\n",
"remote: Compressing objects: 100% (37/37), done.\u001b[K\n",
"remote: Total 40402 (delta 6), reused 0 (delta 0), pack-reused 40364\u001b[K\n",
"Receiving objects: 100% (40402/40402), 50.60 MiB | 17.48 MiB/s, done.\n",
"Resolving deltas: 100% (22306/22306), done.\n",
" -> Project azkaban cloned!\n",
"Cloning into 'repositories/joda-time'...\n",
"remote: Enumerating objects: 26496, done.\u001b[K\n",
"remote: Total 26496 (delta 0), reused 0 (delta 0), pack-reused 26496\u001b[K\n",
"Receiving objects: 100% (26496/26496), 10.62 MiB | 8.92 MiB/s, done.\n",
"Resolving deltas: 100% (13326/13326), done.\n",
" -> Project joda-time cloned!\n",
"Cloning into 'repositories/springfox'...\n",
"remote: Enumerating objects: 19, done.\u001b[K\n",
"remote: Counting objects: 100% (19/19), done.\u001b[K\n",
"remote: Compressing objects: 100% (19/19), done.\u001b[K\n",
"remote: Total 469073 (delta 8), reused 0 (delta 0), pack-reused 469054\u001b[K\n",
"Receiving objects: 100% (469073/469073), 181.29 MiB | 16.63 MiB/s, done.\n",
"Resolving deltas: 100% (381612/381612), done.\n",
" -> Project springfox cloned!\n",
"Cloning into 'repositories/tika'...\n",
"remote: Enumerating objects: 207, done.\u001b[K\n",
"remote: Counting objects: 100% (207/207), done.\u001b[K\n",
"remote: Compressing objects: 100% (131/131), done.\u001b[K\n",
"remote: Total 100529 (delta 29), reused 138 (delta 11), pack-reused 100322\u001b[K\n",
"Receiving objects: 100% (100529/100529), 171.94 MiB | 18.32 MiB/s, done.\n",
"Resolving deltas: 100% (38059/38059), done.\n",
"Updating files: 100% (2721/2721), done.\n",
" -> Project tika cloned!\n",
"Cloning into 'repositories/oryx'...\n",
"remote: Enumerating objects: 127, done.\u001b[K\n",
"remote: Counting objects: 100% (127/127), done.\u001b[K\n",
"remote: Compressing objects: 100% (72/72), done.\u001b[K\n",
"remote: Total 37703 (delta 13), reused 100 (delta 9), pack-reused 37576\u001b[K\n",
"Receiving objects: 100% (37703/37703), 7.21 MiB | 8.08 MiB/s, done.\n",
"Resolving deltas: 100% (16198/16198), done.\n",
" -> Project oryx cloned!\n",
"Cloning into 'repositories/junit4'...\n",
"remote: Enumerating objects: 1, done.\u001b[K\n",
"remote: Counting objects: 100% (1/1), done.\u001b[K\n",
"remote: Total 59020 (delta 0), reused 0 (delta 0), pack-reused 59019\u001b[K\n",
"Receiving objects: 100% (59020/59020), 23.36 MiB | 13.45 MiB/s, done.\n",
"Resolving deltas: 100% (42517/42517), done.\n",
" -> Project junit4 cloned!\n",
"Cloning into 'repositories/commons-lang'...\n",
"remote: Enumerating objects: 88, done.\u001b[K\n",
"remote: Counting objects: 100% (88/88), done.\u001b[K\n",
"remote: Compressing objects: 100% (37/37), done.\u001b[K\n",
"remote: Total 73678 (delta 29), reused 66 (delta 16), pack-reused 73590\u001b[K\n",
"Receiving objects: 100% (73678/73678), 23.93 MiB | 10.20 MiB/s, done.\n",
"Resolving deltas: 100% (31555/31555), done.\n",
" -> Project commons-lang cloned!\n",
"Cloning into 'repositories/YCSB'...\n",
"remote: Enumerating objects: 10, done.\u001b[K\n",
"remote: Counting objects: 100% (10/10), done.\u001b[K\n",
"remote: Compressing objects: 100% (5/5), done.\u001b[K\n",
"remote: Total 20375 (delta 0), reused 6 (delta 0), pack-reused 20365\u001b[K\n",
"Receiving objects: 100% (20375/20375), 31.52 MiB | 9.32 MiB/s, done.\n",
"Resolving deltas: 100% (7950/7950), done.\n",
" -> Project YCSB cloned!\n",
"Cloning into 'repositories/nd4j'...\n",
"remote: Enumerating objects: 10, done.\u001b[K\n",
"remote: Counting objects: 100% (10/10), done.\u001b[K\n",
"remote: Compressing objects: 100% (10/10), done.\u001b[K\n",
"remote: Total 183056 (delta 2), reused 0 (delta 0), pack-reused 183046\n",
"Receiving objects: 100% (183056/183056), 298.22 MiB | 19.02 MiB/s, done.\n",
"Resolving deltas: 100% (86666/86666), done.\n",
" -> Project nd4j cloned!\n",
"Cloning into 'repositories/ethereumj'...\n",
"remote: Enumerating objects: 25, done.\u001b[K\n",
"remote: Counting objects: 100% (25/25), done.\u001b[K\n",
"remote: Compressing objects: 100% (24/24), done.\u001b[K\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"remote: Total 75777 (delta 4), reused 10 (delta 0), pack-reused 75752\u001b[K\n",
"Receiving objects: 100% (75777/75777), 46.86 MiB | 12.79 MiB/s, done.\n",
"Resolving deltas: 100% (41231/41231), done.\n",
" -> Project ethereumj cloned!\n",
"Cloning into 'repositories/marytts'...\n",
"remote: Enumerating objects: 238, done.\u001b[K\n",
"remote: Counting objects: 100% (238/238), done.\u001b[K\n",
"remote: Compressing objects: 100% (142/142), done.\u001b[K\n",
"remote: Total 74830 (delta 75), reused 164 (delta 29), pack-reused 74592\u001b[K\n",
"Receiving objects: 100% (74830/74830), 146.01 MiB | 16.67 MiB/s, done.\n",
"Resolving deltas: 100% (43803/43803), done.\n",
" -> Project marytts cloned!\n",
"Cloning into 'repositories/nifi'...\n",
"remote: Enumerating objects: 7, done.\u001b[K\n",
"remote: Counting objects: 100% (7/7), done.\u001b[K\n",
"remote: Compressing objects: 100% (6/6), done.\u001b[K\n",
"remote: Total 260647 (delta 0), reused 3 (delta 0), pack-reused 260640\u001b[K\n",
"Receiving objects: 100% (260647/260647), 165.10 MiB | 16.35 MiB/s, done.\n",
"Resolving deltas: 100% (107259/107259), done.\n",
"Updating files: 100% (8912/8912), done.\n",
" -> Project nifi cloned!\n",
"Cloning into 'repositories/mybatis'...\n",
"remote: Enumerating objects: 3, done.\u001b[K\n",
"remote: Counting objects: 100% (3/3), done.\u001b[K\n",
"remote: Compressing objects: 100% (3/3), done.\u001b[K\n",
"remote: Total 118128 (delta 0), reused 0 (delta 0), pack-reused 118125\u001b[K\n",
"Receiving objects: 100% (118128/118128), 49.62 MiB | 15.87 MiB/s, done.\n",
"Resolving deltas: 100% (96498/96498), done.\n",
" -> Project mybatis cloned!\n",
"Cloning into 'repositories/javaparser'...\n",
"remote: Enumerating objects: 3, done.\u001b[K\n",
"remote: Counting objects: 100% (3/3), done.\u001b[K\n",
"remote: Compressing objects: 100% (3/3), done.\u001b[K\n",
"remote: Total 106679 (delta 0), reused 0 (delta 0), pack-reused 106676\u001b[K\n",
"Receiving objects: 100% (106679/106679), 24.06 MiB | 8.09 MiB/s, done.\n",
"Resolving deltas: 100% (56933/56933), done.\n",
" -> Project javaparser cloned!\n",
"Cloning into 'repositories/thymeleaf'...\n",
"remote: Enumerating objects: 7, done.\u001b[K\n",
"remote: Counting objects: 100% (7/7), done.\u001b[K\n",
"remote: Compressing objects: 100% (5/5), done.\u001b[K\n",
"remote: Total 26939 (delta 2), reused 6 (delta 2), pack-reused 26932\u001b[K\n",
"Receiving objects: 100% (26939/26939), 8.24 MiB | 5.44 MiB/s, done.\n",
"Resolving deltas: 100% (15879/15879), done.\n",
" -> Project thymeleaf cloned!\n",
"Cloning into 'repositories/shiro'...\n",
"remote: Enumerating objects: 70, done.\u001b[K\n",
"remote: Counting objects: 100% (70/70), done.\u001b[K\n",
"remote: Compressing objects: 100% (42/42), done.\u001b[K\n",
"remote: Total 45639 (delta 5), reused 52 (delta 4), pack-reused 45569\u001b[K\n",
"Receiving objects: 100% (45639/45639), 24.76 MiB | 15.11 MiB/s, done.\n",
"Resolving deltas: 100% (23408/23408), done.\n",
" -> Project shiro cloned!\n",
" -> Project geometer/FBReaderJ already exist in local folder!\n",
"Cloning into 'repositories/light-task-scheduler'...\n",
"remote: Enumerating objects: 6, done.\u001b[K\n",
"remote: Counting objects: 100% (6/6), done.\u001b[K\n",
"remote: Compressing objects: 100% (6/6), done.\u001b[K\n",
"remote: Total 25946 (delta 2), reused 0 (delta 0), pack-reused 25940\u001b[K\n",
"Receiving objects: 100% (25946/25946), 97.60 MiB | 17.23 MiB/s, done.\n",
"Resolving deltas: 100% (11069/11069), done.\n",
" -> Project light-task-scheduler cloned!\n"
]
}
],
"source": [
"for project in sampling:\n",
"    project_name = project.full_name.split(\"/\")[1]\n",
"\n",
"    project_folder = \"%s/%s\" % (folder_name, project_name)\n",
"\n",
"    # Skip projects that were already cloned\n",
"    if os.path.exists(project_folder):\n",
"        print(\" -> Project %s already exist in local folder!\" % project.full_name)\n",
"    else:\n",
"        !git clone $project.clone_url $project_folder\n",
"        print(\" -> Project %s cloned!\" % project_name)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
#!/usr/bin/env python
# coding: utf-8
# # Mining Repositories with GitHub
# Most of the libraries used in this example ship with Python 3.7+ as part of the standard library. The only one you need to install is the library that wraps the GitHub API:
# ```bash
# $ pip install PyGithub
# ```
import os
import datetime
import random
import json
import pickle
from github import Github
# We generate a token to query the GitHub API through the PyGithub library
# - We need a GitHub account
# - We follow [this simple tutorial](https://docs.github.com/es/github/authenticating-to-github/creating-a-personal-access-token) to generate it
# - The token is NOT required to run the queries, but with it the request quota is significantly higher, which saves a lot of waiting time between requests
token = "<token>"
g = Github("maes95", token)
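# A minimal sketch, not part of the original notebook: rather than pasting the
# token into the source, it can be read from an environment variable. The
# variable name GITHUB_TOKEN and the helper read_token are assumptions made
# for this illustration.
import os


def read_token(env_var="GITHUB_TOKEN"):
    """Return the token stored in env_var, or None if it is unset or blank."""
    value = os.environ.get(env_var, "").strip()
    return value or None


# Hypothetical usage: token = read_token()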
# We can run simple queries using a set of parameters defined by the library (they are included in the request to the API). The details are documented in the [library's repository](https://github.com/PyGithub/PyGithub) and in its [official documentation](https://pygithub.readthedocs.io/en/latest/examples/MainClass.html#search-repositories-by-language)
#
# Running a query returns a generator object
# - It does not perform any query or search yet
# - The search starts when we iterate over the generator
query="""
language:java
stars:>=500
forks:>=300
created:<2015-01-01
pushed:>2020-01-01
archived:false
is:public
"""
generator = g.search_repositories(query=query)
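# A minimal sketch, assuming the search qualifiers are kept in a dict: the
# query string can be assembled programmatically, which makes the thresholds
# easy to vary between experiments. build_query is a hypothetical helper, not
# part of PyGithub; the qualifiers are joined with single spaces.
def build_query(qualifiers):
    """Join {'language': 'java', 'stars': '>=500'} into 'language:java stars:>=500'."""
    return " ".join("%s:%s" % (key, value) for key, value in qualifiers.items())


example_query = build_query({
    "language": "java",
    "stars": ">=500",
    "forks": ">=300",
    "created": "<2015-01-01",
    "pushed": ">2020-01-01",
    "archived": "false",
    "is": "public",
})
# example_query can then be passed as g.search_repositories(query=example_query)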
# We can easily turn the generator into a list of repositories in Python. In this case the query IS executed
# list() iterates over the generator internally and builds a list from the search results
repositories = list(generator)
# We save the retrieved repository information to a Python binary file
# - We use the pickle library
# - GitHub API search results can change over time, so the same search may return more or fewer repositories
# - We save it with a timestamp so that each snapshot is unambiguously identified
date = str(datetime.datetime.now())
with open('repos_%s.pickle' % date, 'wb') as f:
    pickle.dump(repositories, f)
print("Total repos: %d" % len(repositories))
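# A minimal sketch, assuming a file name without spaces or colons is wanted:
# str(datetime.datetime.now()) contains both, and colons are not allowed in
# Windows file names. strftime produces a compact stamp instead.
# snapshot_name is a hypothetical helper.
import datetime


def snapshot_name(now=None, prefix="repos"):
    """Build a name such as 'repos_20200921-120313.pickle'."""
    now = now or datetime.datetime.now()
    return "%s_%s.pickle" % (prefix, now.strftime("%Y%m%d-%H%M%S"))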
# We read the binary file back. If the notebook has been closed and reopened, the timestamp will have changed and the file name has to be entered by hand.
repos = 'repos_%s.pickle' % date
#repos = 'repos_2020-09-21 12:03:13.421536.pickle'
with open(repos, 'rb') as f:
    repositories = pickle.load(f)
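# A minimal sketch, assuming the snapshots live in one folder and follow the
# repos_<timestamp>.pickle naming pattern: instead of typing the file name by
# hand, the most recently modified snapshot can be located with glob.
# latest_snapshot is a hypothetical helper.
import glob
import os


def latest_snapshot(folder=".", pattern="repos_*.pickle"):
    """Return the path of the newest matching file, or None if there is none."""
    candidates = glob.glob(os.path.join(folder, pattern))
    return max(candidates, key=os.path.getmtime) if candidates else None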
# We can apply filters to the repositories we found. To begin with, we filter them by number of commits, keeping only those within a given range
MAX_COMMITS = 10000
MIN_COMMITS = 1000
filtered_repos = []
for repo in repositories:
    commits = repo.get_commits().totalCount
    if MIN_COMMITS <= commits <= MAX_COMMITS:
        filtered_repos.append(repo)
print("Total projects %d" % len(filtered_repos))
# We apply a second filter, keeping only the projects that use at least one of the most common Java build systems.
#
# - Maven (pom.xml)
# - Gradle (build.gradle)
# - Ant (build.xml)
#
# This query is more expensive than the previous one, because it has to check whether any file in the repository root matches one of the configuration files listed above. The order of the filters matters: running the cheaper ones first keeps the number of requests as low as possible.
filtered_repos_2 = []
for repo in filtered_repos:
    contents = repo.get_contents("")
    for content_file in contents:
        if content_file.path in ["build.gradle", "pom.xml", "build.xml"]:
            filtered_repos_2.append(repo)
            break
print("Total repos: %d" % len(filtered_repos_2))
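# A minimal sketch of the same check expressed on plain path strings, which
# makes it testable without calling the API. It assumes the root-level paths
# were already collected, for example with
# [c.path for c in repo.get_contents("")]. BUILD_FILES and has_build_file are
# names introduced for this illustration.
BUILD_FILES = {"pom.xml", "build.gradle", "build.xml"}


def has_build_file(root_paths):
    """True if any root-level path is a Maven, Gradle or Ant build file."""
    return not BUILD_FILES.isdisjoint(root_paths)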
# Besides filtering projects by the files they contain, we can also inspect the content of specific files (or even groups of files, for example those ending in .java)
# Discard Android projects: their root build.gradle references the Android Gradle plugin
filtered_repos_3 = []
for repo in filtered_repos_2:
    contents = repo.get_contents("")
    isAndroid = False
    for file in contents:
        if file.path == "build.gradle":
            isAndroid = 'com.android.tools.build' in str(repo.get_contents("build.gradle").decoded_content)
            break
    if not isAndroid:
        filtered_repos_3.append(repo)
print("Total projects %d" % len(filtered_repos_3))
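# A minimal sketch, assuming the file content arrives as bytes (as
# decoded_content does): decoding explicitly avoids searching the "b'...'"
# representation that str() produces for bytes. ANDROID_PLUGIN and
# is_android_build are names introduced for this illustration.
ANDROID_PLUGIN = "com.android.tools.build"


def is_android_build(raw_bytes):
    """True if a build.gradle file references the Android Gradle plugin."""
    text = raw_bytes.decode("utf-8", errors="replace")
    return ANDROID_PLUGIN in text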
# Even after these filters we may still have too many repositories for the experiment we want to run. We can therefore limit the sample by picking a meaningful number of them at random
# Select 30 projects at random (random.choices draws with replacement, so a project can appear more than once)
sampling = random.choices(filtered_repos_3, k=30)
for project in sampling:
    print(project.full_name)
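# A minimal sketch, assuming each project should appear at most once:
# random.choices draws with replacement, so the same repository can be picked
# twice, whereas random.sample draws without replacement. Passing a seed makes
# the selection repeatable. sample_unique is a hypothetical helper.
import random


def sample_unique(items, k, seed=None):
    """Pick up to k distinct items; a given seed always yields the same pick."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


# Hypothetical usage: sampling = sample_unique(filtered_repos_3, 30, seed=42)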
# The repositories are now selected. However, GitHub does not guarantee that they will always be available; they could:
# - Be deleted
# - Become private repositories
# - Lose commits or branches from their history
#
# For that reason, it is a good idea to clone them as soon as they are selected and keep a local copy. To do so, we create a folder to store them.
#
# Creating a folder is trivial in any programming language, especially with current libraries, but I would like to illustrate a simple yet effective principle: never overwrite resources that already exist, because the new data can destroy the previous data. For a directory, the call simply refuses to create it again, but when creating files, an existing file is replaced without asking.
folder_name = 'repositories'
if not os.path.exists(folder_name):
    os.mkdir(folder_name)
    print("Folder %s created!" % folder_name)
else:
    print("Folder %s already exist" % folder_name)
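# A minimal sketch of the "never overwrite" idea applied to files, assuming
# plain text output: opening with mode "x" raises FileExistsError instead of
# replacing an existing file. write_new_file is a hypothetical helper.
def write_new_file(path, text):
    """Write text to path only if it does not exist yet; report success."""
    try:
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(text)
        return True
    except FileExistsError:
        return False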
# We clone the repositories one by one into the new folder. If the following cell is run again, projects that already exist locally are not cloned again
for project in sampling:
    project_name = project.full_name.split("/")[1]
    project_folder = "%s/%s" % (folder_name, project_name)
    # Skip projects that were already cloned
    if os.path.exists(project_folder):
        print(" -> Project %s already exist in local folder!" % project.full_name)
    else:
        get_ipython().system('git clone $project.clone_url $project_folder')
        print(" -> Project %s cloned!" % project_name)
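# A minimal sketch for running the same step outside IPython, where the
# "!git clone" shell escape (get_ipython().system) is unavailable. It assumes
# git is on the PATH. clone_if_missing and its run parameter are hypothetical;
# run defaults to subprocess.run and can be swapped for a stub in tests.
import os
import subprocess


def clone_if_missing(clone_url, destination, run=subprocess.run):
    """Clone clone_url into destination unless that path already exists."""
    if os.path.exists(destination):
        return False
    run(["git", "clone", clone_url, destination], check=True)
    return True


# Hypothetical usage: clone_if_missing(project.clone_url, project_folder)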