@feststelltaste
Last active January 31, 2020 18:22
Git Analysis Rename Problem (Draft)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Some problems when analyzing Git logs\n",
"\n",
"In the past, I did a lot of Git log analysis on my blog. The main reason is that developers know what Git is and what kind of data it provides. This makes Git data a good starting point for connecting with developers before moving on to more advanced analyses.\n",
"\n",
"But some problems come up when you want to do file-based analysis in a long-running repository: deletions, merges, splits and renames.\n",
"\n",
"In this notebook, I want to show you the kinds of problems that renames cause:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Git Example repository\n",
"\n",
"For this analysis, we want to use a small but long-lived repository: the Spring PetClinic project (anti-refactored by me to show some interesting things).\n",
"\n",
"We first clone this repository locally."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Cloning into 'spring-petclinic'...\n",
"Checking out files: 100% (549/549), done.\n"
]
}
],
"source": [
"%%bash\n",
"\n",
"git clone https://github.com/JavaOnAutobahn/spring-petclinic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we export the Git history by using a special command (background explained [here](https://www.feststelltaste.de/reading-a-git-repos-commit-history-with-pandas-efficiently/))."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"# path to git repository\n",
"cd spring-petclinic\n",
"git log --numstat --pretty=format:\"%x09%x09%x09%ai\" -- '*.java' > git_log.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a little helper function, we import the exported data (see link above for details on that as well)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>additions</th>\n",
" <th>deletions</th>\n",
" <th>churn</th>\n",
" <th>filename</th>\n",
" </tr>\n",
" <tr>\n",
" <th>timestamp</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>2019-03-05 22:32:20+01:00</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>src/main/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2019-03-05 22:32:20+01:00</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>src/main/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2019-03-05 22:32:20+01:00</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>src/main/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2019-03-05 22:32:20+01:00</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>src/main/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2019-03-05 22:32:20+01:00</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>src/main/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" additions deletions churn \\\n",
"timestamp \n",
"2019-03-05 22:32:20+01:00 1.0 1.0 0.0 \n",
"2019-03-05 22:32:20+01:00 2.0 0.0 2.0 \n",
"2019-03-05 22:32:20+01:00 2.0 1.0 1.0 \n",
"2019-03-05 22:32:20+01:00 2.0 0.0 2.0 \n",
"2019-03-05 22:32:20+01:00 3.0 0.0 3.0 \n",
"\n",
" filename \n",
"timestamp \n",
"2019-03-05 22:32:20+01:00 src/main/java/org/springframework/samples/petc... \n",
"2019-03-05 22:32:20+01:00 src/main/java/org/springframework/samples/petc... \n",
"2019-03-05 22:32:20+01:00 src/main/java/org/springframework/samples/petc... \n",
"2019-03-05 22:32:20+01:00 src/main/java/org/springframework/samples/petc... \n",
"2019-03-05 22:32:20+01:00 src/main/java/org/springframework/samples/petc... "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"def parse_git_log(path):\n",
" # reading\n",
" git_log = pd.read_csv(\n",
" path,\n",
" sep=\"\\t\", \n",
" header=None,\n",
" names=[\n",
" 'additions', \n",
" 'deletions', \n",
" 'filename', \n",
" 'timestamp'])\n",
"\n",
" # converting in \"one line\"\n",
" git_log = git_log[['additions', 'deletions', 'filename']]\\\n",
" .join(git_log[['timestamp']]\\\n",
" .ffill())\\\n",
" .dropna().reset_index(drop=True)\n",
"\n",
" # data type conversions\n",
" git_log['additions'] = pd.to_numeric(git_log['additions'], errors='coerce')\n",
" git_log['deletions'] = pd.to_numeric(git_log['deletions'], errors='coerce')\n",
" churn = git_log['additions'] - git_log['deletions']\n",
" git_log.insert(2, \"churn\", churn)\n",
" git_log['timestamp'] = pd.to_datetime(git_log['timestamp'])\n",
" return git_log.set_index('timestamp')\n",
"\n",
"timed_log = parse_git_log(\"spring-petclinic/git_log.csv\")\n",
"timed_log.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we get is a nicely parsed pandas DataFrame that we can use for further analysis.\n",
"\n",
"## Analysis\n",
"Let's dive into the actual problem analysis. Say we want to do some file-based analysis of the software project with data based on Git. So we group the entries by filename and sum up the numeric columns.\n",
"\n",
"(Note that we keep the most recent timestamp for each file, which is the first entry because `git log` lists the newest commits first. We use it later for an analysis based on the most recent data.)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>additions</th>\n",
" <th>deletions</th>\n",
" <th>churn</th>\n",
" <th>timestamp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>filename</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/HomeController.java</td>\n",
" <td>17.0</td>\n",
" <td>17.0</td>\n",
" <td>0.0</td>\n",
" <td>2013-01-09 17:24:48+08:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/Appointment.java</td>\n",
" <td>37.0</td>\n",
" <td>37.0</td>\n",
" <td>0.0</td>\n",
" <td>2013-01-09 17:24:48+08:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/AppointmentBook.java</td>\n",
" <td>13.0</td>\n",
" <td>13.0</td>\n",
" <td>0.0</td>\n",
" <td>2013-01-09 17:24:48+08:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/AppointmentForm.java</td>\n",
" <td>67.0</td>\n",
" <td>67.0</td>\n",
" <td>0.0</td>\n",
" <td>2013-01-09 17:24:48+08:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>org.springframework.samples.petclinic/src/main/java/org/springframework/samples/petclinic/appointments/Appointments.java</td>\n",
" <td>15.0</td>\n",
" <td>15.0</td>\n",
" <td>0.0</td>\n",
" <td>2013-01-09 17:24:48+08:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" additions deletions \\\n",
"filename \n",
"org.springframework.samples.petclinic/src/main/... 17.0 17.0 \n",
"org.springframework.samples.petclinic/src/main/... 37.0 37.0 \n",
"org.springframework.samples.petclinic/src/main/... 13.0 13.0 \n",
"org.springframework.samples.petclinic/src/main/... 67.0 67.0 \n",
"org.springframework.samples.petclinic/src/main/... 15.0 15.0 \n",
"\n",
" churn \\\n",
"filename \n",
"org.springframework.samples.petclinic/src/main/... 0.0 \n",
"org.springframework.samples.petclinic/src/main/... 0.0 \n",
"org.springframework.samples.petclinic/src/main/... 0.0 \n",
"org.springframework.samples.petclinic/src/main/... 0.0 \n",
"org.springframework.samples.petclinic/src/main/... 0.0 \n",
"\n",
" timestamp \n",
"filename \n",
"org.springframework.samples.petclinic/src/main/... 2013-01-09 17:24:48+08:00 \n",
"org.springframework.samples.petclinic/src/main/... 2013-01-09 17:24:48+08:00 \n",
"org.springframework.samples.petclinic/src/main/... 2013-01-09 17:24:48+08:00 \n",
"org.springframework.samples.petclinic/src/main/... 2013-01-09 17:24:48+08:00 \n",
"org.springframework.samples.petclinic/src/main/... 2013-01-09 17:24:48+08:00 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_churns = timed_log.reset_index().groupby('filename').agg({\n",
" \"additions\" : \"sum\",\n",
" \"deletions\" : \"sum\",\n",
" \"churn\" : \"sum\",\n",
" \"timestamp\" : \"first\"\n",
"})\n",
"file_churns.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, at this point, something weird happens: **There are files that have a negative number of lines!**\n",
"\n",
"How can this happen?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>additions</th>\n",
" <th>deletions</th>\n",
" <th>churn</th>\n",
" <th>timestamp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>filename</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java</td>\n",
" <td>11.0</td>\n",
" <td>13.0</td>\n",
" <td>-2.0</td>\n",
" <td>2018-11-15 18:39:27+01:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>src/main/java/org/springframework/samples/petclinic/repository/jpa/package-info.java</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>-3.0</td>\n",
" <td>2015-10-16 09:33:06+02:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>src/main/java/org/springframework/samples/petclinic/repository/jdbc/package-info.java</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>-3.0</td>\n",
" <td>2015-10-16 09:33:06+02:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>src/main/java/org/springframework/samples/petclinic/web/VetsAtomView.java</td>\n",
" <td>56.0</td>\n",
" <td>130.0</td>\n",
" <td>-74.0</td>\n",
" <td>2015-05-12 19:07:35+08:00</td>\n",
" </tr>\n",
" <tr>\n",
" <td>src/test/java/org/springframework/samples/petclinic/web/VisitsViewTests.java</td>\n",
" <td>7.0</td>\n",
" <td>77.0</td>\n",
" <td>-70.0</td>\n",
" <td>2015-05-10 06:45:39+08:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" additions deletions \\\n",
"filename \n",
"src/test/java/org/springframework/samples/petcl... 11.0 13.0 \n",
"src/main/java/org/springframework/samples/petcl... 0.0 3.0 \n",
"src/main/java/org/springframework/samples/petcl... 0.0 3.0 \n",
"src/main/java/org/springframework/samples/petcl... 56.0 130.0 \n",
"src/test/java/org/springframework/samples/petcl... 7.0 77.0 \n",
"\n",
" churn \\\n",
"filename \n",
"src/test/java/org/springframework/samples/petcl... -2.0 \n",
"src/main/java/org/springframework/samples/petcl... -3.0 \n",
"src/main/java/org/springframework/samples/petcl... -3.0 \n",
"src/main/java/org/springframework/samples/petcl... -74.0 \n",
"src/test/java/org/springframework/samples/petcl... -70.0 \n",
"\n",
" timestamp \n",
"filename \n",
"src/test/java/org/springframework/samples/petcl... 2018-11-15 18:39:27+01:00 \n",
"src/main/java/org/springframework/samples/petcl... 2015-10-16 09:33:06+02:00 \n",
"src/main/java/org/springframework/samples/petcl... 2015-10-16 09:33:06+02:00 \n",
"src/main/java/org/springframework/samples/petcl... 2015-05-12 19:07:35+08:00 \n",
"src/test/java/org/springframework/samples/petcl... 2015-05-10 06:45:39+08:00 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weird_churns = file_churns[file_churns['churn'] < 0].sort_values(by=\"timestamp\", ascending=False)\n",
"weird_churns.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at a more recent file with such a negative number of lines (\"recent\" because then it is more likely that it still exists in the repository)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weird_churn_filename = weird_churns.iloc[0].name\n",
"weird_churn_filename"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this file, we want to follow the development. Using the `--follow` option of Git, we can trace the evolution of this single file. As in the first Git data export, we store this data in a file."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"cd spring-petclinic\n",
"git log --numstat --pretty=format:\"%x09%x09%x09%ai\" --follow src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java > ../weird_churn_filename_log.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's read in the data with our little helper function from above."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>additions</th>\n",
" <th>deletions</th>\n",
" <th>churn</th>\n",
" <th>filename</th>\n",
" </tr>\n",
" <tr>\n",
" <th>timestamp</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>2018-11-15 18:39:27+01:00</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>src/test/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2015-10-16 09:33:06+02:00</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>-1.0</td>\n",
" <td>src/test/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2013-12-16 20:58:15+09:00</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>src/test/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2013-06-28 12:00:29+08:00</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>-1.0</td>\n",
" <td>src/test/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2013-03-04 12:15:20+08:00</td>\n",
" <td>2.0</td>\n",
" <td>4.0</td>\n",
" <td>-2.0</td>\n",
" <td>src/test/java/org/springframework/samples/petc...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" additions deletions churn \\\n",
"timestamp \n",
"2018-11-15 18:39:27+01:00 3.0 2.0 1.0 \n",
"2015-10-16 09:33:06+02:00 3.0 4.0 -1.0 \n",
"2013-12-16 20:58:15+09:00 2.0 1.0 1.0 \n",
"2013-06-28 12:00:29+08:00 1.0 2.0 -1.0 \n",
"2013-03-04 12:15:20+08:00 2.0 4.0 -2.0 \n",
"\n",
" filename \n",
"timestamp \n",
"2018-11-15 18:39:27+01:00 src/test/java/org/springframework/samples/petc... \n",
"2015-10-16 09:33:06+02:00 src/test/java/org/springframework/samples/petc... \n",
"2013-12-16 20:58:15+09:00 src/test/java/org/springframework/samples/petc... \n",
"2013-06-28 12:00:29+08:00 src/test/java/org/springframework/samples/petc... \n",
"2013-03-04 12:15:20+08:00 src/test/java/org/springframework/samples/petc... "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weird_file_churn = parse_git_log(\"weird_churn_filename_log.csv\")\n",
"weird_file_churn.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Insights\n",
"\n",
"OK, what is the problem with the negative number of lines?\n",
"\n",
"Let's look at the history of this one specific file: It was **renamed** several times!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java 5\n",
"src/test/java/org/springframework/samples/petclinic/jpa/AbstractJpaClinicTests.java 4\n",
"src/test/java/org/springframework/samples/petclinic/jpa/JpaClinicTests.java 3\n",
"src/test/java/org/springframework/samples/petclinic/jpa/JpaOwnerRepositoryImplTests.java 2\n",
"src/test/java/org/springframework/samples/petclinic/repository/jpa/JpaOwnerRepositoryImplTests.java 2\n",
"src/test/java/org/springframework/samples/petclinic/jpa/{JpaClinicTests.java => JpaClinicImplTests.java} 1\n",
"src/test/java/org/springframework/samples/petclinic/jpa/{AbstractJpaClinicTests.java => JpaClinicTests.java} 1\n",
"src/test/java/org/springframework/samples/petclinic/jpa/{JpaClinicImplTests.java => JpaOwnerRepositoryImplTests.java} 1\n",
"src/test/java/org/springframework/samples/petclinic/{ => repository}/jpa/JpaOwnerRepositoryImplTests.java 1\n",
"src/test/java/org/springframework/samples/petclinic/{repository/jpa/JpaOwnerRepositoryImplTests.java => service/ClinicServiceJpaTests.java} 1\n",
"Name: filename, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weird_file_churn['filename'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although Git provides rename tracking, not all of these renames fit Git's rename detection (the entries with `=>` are the ones that Git can track), which makes it difficult to track such renames with standard means.\n",
"\n",
"If we now sum up the `churn` values across all of these filenames, we get the actual number of lines of the file based purely on Git repository data."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"23.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weird_file_churn['churn'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare this value with the actual number of lines in the real file using the word count command `wc`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"23 spring-petclinic/src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java\n"
]
}
],
"source": [
"%%bash\n",
"wc -l spring-petclinic/src/test/java/org/springframework/samples/petclinic/service/ClinicServiceJpaTests.java"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cool, this one matches! This might not always be the case, for example if you rename files in unusual ways, or if files were merged or split. And if we sum up only the entries recorded under the current filename, we get the negative number from the beginning again:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-2.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"weird_file_churn[weird_file_churn['filename'] == weird_churn_filename]['churn'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualization\n",
"\n",
"Let's look at the number of lines for this specific file to get a feeling if the data is right at all.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x360 with 1 Axes>"
]
},
"metadata": {
"image/png": {
"height": 300,
"width": 1165
},
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"weird_file_churn[['additions', 'deletions', 'churn']].cumsum().plot(figsize=[20,5]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that somehow we got a negative number of lines of code at the beginning, which could be an indication that there was something wrong with the previous rename detection. But later on, we get a positive number of lines."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"So there are limitations to Git repository analysis when you don't want to dive deep into a more sophisticated model of the evolution of a project.\n",
"\n",
"Here are some ideas to mitigate this problem around renames:\n",
"\n",
"1. Use more advanced Git repository mining tools: There are open-source tools like [PyDriller](https://pydriller.readthedocs.io) and commercial tools like [CodeScene](https://codescene.io/) or [Teamscale](https://www.cqse.eu/en/products/teamscale/landing/) (from the latter I know that they've invested significant brain power to solve file renaming and merging problems).\n",
"\n",
"2. Leverage Git rename detection: Git provides rename detection by default. You might be able to tweak some parameters (for example the similarity threshold of the `-M`/`--find-renames` option) to get the results you need. I once used this, but I can't remember any further details :-(\n",
"\n",
"3. Avoid file-based Git analysis: There are plenty of other interesting analyses waiting for you out there which could be more valuable in your specific context.\n",
"\n",
"4. Use the actual lines of code: You might use tools like `cloc` to get the real number of lines of your currently existing files in the repository.\n",
"\n",
"\n",
"As of today, I've chosen the latter two options (with a tendency to 3. ;-)).\n",
"\n",
"Using Git repository data together with the actual number of lines of code (option 4.) is good enough for me to get a first glimpse at the evolution of a software project.\n",
"\n",
"Your context could be a different one where you have to choose more sophisticated techniques to handle all the problems around Git analysis. It would be very interesting to get to know your specific context!\n",
"\n"
]
}
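,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of option 4 without an external tool: we can count the lines of the Java files that currently exist in the working copy and put them next to the Git-based numbers. `count_lines` is a hypothetical helper written for this notebook, and the sketch assumes that `file_churns` from above is still available and that the repository was cloned to `spring-petclinic`. Unlike `cloc`, it counts every line, including blank lines and comments."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"\n",
"def count_lines(root, pattern='*.java'):\n",
"    # hypothetical helper: maps each matching file below root to its number of lines\n",
"    counts = {}\n",
"    for path in Path(root).rglob(pattern):\n",
"        with open(path, encoding='utf-8', errors='replace') as source_file:\n",
"            counts[path.relative_to(root).as_posix()] = sum(1 for _ in source_file)\n",
"    return counts\n",
"\n",
"# the inner join keeps only the files that still exist under their current name\n",
"actual_lines = pd.Series(count_lines('spring-petclinic'), name='lines')\n",
"file_churns.join(actual_lines, how='inner').head()"
]
}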
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
pandas
matplotlib