@albertovilla
Created February 17, 2018 10:18
{"nbformat":4,"nbformat_minor":2,"metadata":{"language_info":{"file_extension":".py","pygments_lexer":"ipython3","codemirror_mode":{"name":"ipython","version":3},"nbconvert_exporter":"python","name":"python","version":"3.5.2","mimetype":"text/x-python"},"kernelspec":{"language":"python","name":"python3","display_name":"Python 3"}},"cells":[{"source":"## 1. Introduction\n<p>Version control repositories like CVS, Subversion or Git can be a real gold mine for software developers. They contain every change to the source code including the date (the \"when\"), the responsible developer (the \"who\"), as well as a short message that describes the intention (the \"what\") of a change.</p>\n<p><a href=\"https://commons.wikimedia.org/wiki/File:Tux.svg\">\n<img style=\"float: right;margin:5px 20px 5px 1px\" width=\"150px\" src=\"https://s3.amazonaws.com/assets.datacamp.com/production/project_111/img/tux.png\" alt=\"Tux - the Linux mascot\">\n</a></p>\n<p>In this notebook, we will analyze the evolution of a very famous open-source project &ndash; the Linux kernel. The Linux kernel is the heart of Linux distributions like Debian, Ubuntu or CentOS. </p>\n<p>We get some first insights into the development effort by </p>\n<ul>\n<li>identifying the TOP 10 contributors and</li>\n<li>visualizing the commits over the years.</li>\n</ul>\n<p>Linus Torvalds, the (spoiler alert!) main contributor to the Linux kernel (and also the creator of Git), created a <a href=\"https://github.com/torvalds/linux/\">mirror of the Linux repository on GitHub</a>. It contains the complete history of kernel development for the last 13 years.</p>\n<p>For our analysis, we will use a Git log file with the following content:</p>","metadata":{"dc":{"key":"4"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# Printing the content of git_log_excerpt.csv\n# ... 
YOUR CODE FOR TASK 1 ...\n# Using a context manager so the file is closed after reading\nwith open('datasets/git_log_excerpt.csv', 'r') as log:\n    print(log.read())","metadata":{"dc":{"key":"4"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":176,"outputs":[{"output_type":"stream","name":"stdout","text":"1502382966#Linus Torvalds\n1501368308#Max Gurtovoy\n1501625560#James Smart\n1501625559#James Smart\n1500568442#Martin Wilck\n1502273719#Xin Long\n1502278684#Nikolay Borisov\n1502238384#Girish Moodalbail\n1502228709#Florian Fainelli\n1502223836#Jon Paul Maloy\n"}]},{"source":"## 2. Reading in the dataset\n<p>The dataset was created by using the command <code>git log --encoding=latin-1 --pretty=\"%at#%aN\"</code>. The <code>latin-1</code> encoded text output was saved in a header-less csv file. In this file, each row is a commit entry with the following information:</p>\n<ul>\n<li><code>timestamp</code>: the time of the commit as a UNIX timestamp in seconds since 1970-01-01 00:00:00 UTC (Git log placeholder \"<code>%at</code>\")</li>\n<li><code>author</code>: the name of the author who performed the commit (Git log placeholder \"<code>%aN</code>\")</li>\n</ul>\n<p>The columns are separated by the number sign <code>#</code>. The complete dataset is in the <code>datasets/</code> directory. It is a <code>gz</code>-compressed csv file named <code>git_log.gz</code>.</p>","metadata":{"dc":{"key":"11"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# Loading in the pandas module\n# ... YOUR CODE FOR TASK 2 ...\nimport pandas as pd\n\n# Reading in the log file\ngit_log = pd.read_csv('datasets/git_log.gz', sep='#', encoding='latin-1', header=None, names=['timestamp', 'author'])\n\n# Printing out the first 5 rows\n# ... 
YOUR CODE FOR TASK 2 ...\nprint(git_log.head(5))","metadata":{"dc":{"key":"11"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":178,"outputs":[{"output_type":"stream","name":"stdout","text":" timestamp author\n0 1502826583 Linus Torvalds\n1 1501749089 Adrian Hunter\n2 1501749088 Adrian Hunter\n3 1501882480 Kees Cook\n4 1497271395 Rob Clark\n"}]},{"source":"## 3. Getting an overview\n<p>The dataset contains the information about every single code contribution (a \"commit\") to the Linux kernel over the last 13 years. We'll first take a look at the number of authors and their commits to the repository.</p>","metadata":{"dc":{"key":"18"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# calculating number of commits\nnumber_of_commits = git_log.shape[0]\n\n# calculating number of authors\nnumber_of_authors = len(git_log.dropna().author.unique())\n\n# printing out the results\nprint(\"%s authors committed %s code changes.\" % (number_of_authors, number_of_commits))","metadata":{"dc":{"key":"18"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":180,"outputs":[{"output_type":"stream","name":"stdout","text":"17385 authors committed 699071 code changes.\n"}]},{"source":"## 4. Finding the TOP 10 contributors\n<p>There are some very important people that changed the Linux kernel very often. 
To see if there are any bottlenecks, we take a look at the TOP 10 authors with the most commits.</p>","metadata":{"dc":{"key":"25"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# Identifying the top 10 authors\ntop_10_authors = git_log.groupby(['author']).agg('count').sort_values(by=['timestamp'], ascending=False)[0:10]\n\n# Listing contents of 'top_10_authors'\ntop_10_authors","metadata":{"dc":{"key":"25"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":182,"outputs":[{"output_type":"execute_result","execution_count":182,"metadata":{},"data":{"text/plain":" timestamp\nauthor \nLinus Torvalds 23361\nDavid S. Miller 9106\nMark Brown 6802\nTakashi Iwai 6209\nAl Viro 6006\nH Hartley Sweeten 5938\nIngo Molnar 5344\nMauro Carvalho Chehab 5204\nArnd Bergmann 4890\nGreg Kroah-Hartman 4580","text/html":"<div>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>timestamp</th>\n </tr>\n <tr>\n <th>author</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>Linus Torvalds</th>\n <td>23361</td>\n </tr>\n <tr>\n <th>David S. Miller</th>\n <td>9106</td>\n </tr>\n <tr>\n <th>Mark Brown</th>\n <td>6802</td>\n </tr>\n <tr>\n <th>Takashi Iwai</th>\n <td>6209</td>\n </tr>\n <tr>\n <th>Al Viro</th>\n <td>6006</td>\n </tr>\n <tr>\n <th>H Hartley Sweeten</th>\n <td>5938</td>\n </tr>\n <tr>\n <th>Ingo Molnar</th>\n <td>5344</td>\n </tr>\n <tr>\n <th>Mauro Carvalho Chehab</th>\n <td>5204</td>\n </tr>\n <tr>\n <th>Arnd Bergmann</th>\n <td>4890</td>\n </tr>\n <tr>\n <th>Greg Kroah-Hartman</th>\n <td>4580</td>\n </tr>\n </tbody>\n</table>\n</div>"}}]},{"source":"## 5. Wrangling the data\n<p>For our analysis, we want to visualize the contributions over time. 
For this, we use the information in the <code>timestamp</code> column to create a time series-based column.</p>","metadata":{"dc":{"key":"32"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# converting the timestamp column\n# ... YOUR CODE FOR TASK 5 ...\ngit_log.timestamp = pd.to_datetime(git_log['timestamp'], unit='s')\n\n# summarizing the converted timestamp column\n# ... YOUR CODE FOR TASK 5 ...\ngit_log.timestamp.describe()","metadata":{"dc":{"key":"32"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":184,"outputs":[{"output_type":"execute_result","execution_count":184,"metadata":{},"data":{"text/plain":"count 699071\nunique 668448\ntop 2008-09-04 05:30:19\nfreq 99\nfirst 1970-01-01 00:00:01\nlast 2037-04-25 08:08:26\nName: timestamp, dtype: object"}}]},{"source":"## 6. Treating wrong timestamps\n<p>As we can see from the results above, some contributors had their operating system's time incorrectly set when they committed to the repository. 
We'll clean up the <code>timestamp</code> column by dropping the rows with the incorrect timestamps.</p>","metadata":{"dc":{"key":"39"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# determining the first real commit timestamp\nfrom_2005_to_2018 = (git_log.timestamp > '2005-04-16 22:20:00') & (git_log.timestamp <= '2018-02-17 00:00:00')\nvalid_timestamps = git_log[from_2005_to_2018].sort_values(by=['timestamp']).timestamp\n\nvalid_timestamps = valid_timestamps.reset_index()\n\nfirst_commit_timestamp = valid_timestamps.timestamp[0]\n\n# determining the last sensible commit timestamp\nlast_commit_timestamp = valid_timestamps.timestamp[len(valid_timestamps)-1]\n\n# filtering out wrong timestamps\ncorrected_log = git_log[(git_log.timestamp >= first_commit_timestamp) & (git_log.timestamp <= last_commit_timestamp)]\n\n# summarizing the corrected timestamp column\n# ... YOUR CODE FOR TASK 6 ...\ncorrected_log.describe()","metadata":{"dc":{"key":"39"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":186,"outputs":[{"output_type":"execute_result","execution_count":186,"metadata":{},"data":{"text/plain":" timestamp author\ncount 698569 698568\nunique 667977 17375\ntop 2008-09-04 05:30:19 Linus Torvalds\nfreq 99 23361\nfirst 2005-04-16 22:20:36 NaN\nlast 2017-10-03 12:57:00 NaN","text/html":"<div>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>timestamp</th>\n <th>author</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>698569</td>\n <td>698568</td>\n </tr>\n <tr>\n <th>unique</th>\n <td>667977</td>\n <td>17375</td>\n </tr>\n <tr>\n <th>top</th>\n <td>2008-09-04 05:30:19</td>\n <td>Linus Torvalds</td>\n </tr>\n <tr>\n <th>freq</th>\n <td>99</td>\n <td>23361</td>\n </tr>\n <tr>\n <th>first</th>\n <td>2005-04-16 22:20:36</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>last</th>\n <td>2017-10-03 12:57:00</td>\n 
<td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>"}}]},{"source":"## 7. Grouping commits per year\n<p>To find out how the development activity has increased over time, we'll group the commits by year and count them up.</p>","metadata":{"dc":{"key":"46"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# Counting the number of commits per year\n# commits_per_year = corrected_log.groupby(corrected_log.timestamp.dt.year).agg('count')\ncommits_per_year = corrected_log.groupby(pd.Grouper(key='timestamp', freq='AS')).agg('count')\n\n# Listing the first rows\n# ... YOUR CODE FOR TASK 7 ...\nprint(commits_per_year.head())","metadata":{"dc":{"key":"46"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":188,"outputs":[{"output_type":"stream","name":"stdout","text":" author\ntimestamp \n2005-01-01 16229\n2006-01-01 29255\n2007-01-01 33759\n2008-01-01 48847\n2009-01-01 52572\n"}]},{"source":"## 8. Visualizing the history of Linux\n<p>Finally, we'll make a plot out of these counts to better see how the development effort on Linux has increased over the last few years. </p>","metadata":{"dc":{"key":"53"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# Setting up plotting in Jupyter notebooks\n%matplotlib inline\n\n# plot the data\n# ... YOUR CODE FOR TASK 8 ...\ncommits_per_year.plot(title='Linux commits evolution', kind='bar', legend=False)","metadata":{"dc":{"key":"53"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":190,"outputs":[{"output_type":"execute_result","execution_count":190,"metadata":{},"data":{"text/plain":"<matplotlib.axes._subplots.AxesSubplot at 0x7f7b157f0b38>"}},{"output_type":"display_data","metadata":{},"data":{"text/plain":"<matplotlib.figure.Figure at 0x7f7b12aa84e0>","image/png":"\n"}}]},{"source":"## 9. 
Conclusion\n<p>Thanks to the solid foundation and caretaking of Linus Torvalds, many other developers are now able to contribute to the Linux kernel as well. There is no decrease in development activity in sight!</p>","metadata":{"dc":{"key":"60"},"run_control":{"frozen":true},"tags":["context"],"deletable":false,"editable":false},"cell_type":"markdown"},{"source":"# calculating the year with the most commits to Linux\n# (idxmax on the 'author' count column returns a Timestamp; .year extracts the year)\nyear_with_most_commits = commits_per_year['author'].idxmax().year","metadata":{"dc":{"key":"60"},"trusted":true,"tags":["sample_code"]},"cell_type":"code","execution_count":192,"outputs":[]}]}
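
The notebook's pipeline (parse the `timestamp#author` rows, drop commits with implausible clocks, count per year and per author) can also be sketched without pandas. The following is only an illustrative, standard-library sketch and not the notebook's own method: the first four sample rows are copied from the excerpt printed in section 1, the last two rows are made up to exercise the filter, the bounds mirror the ones used in section 6, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone

# First four rows come from git_log_excerpt.csv; the last two are
# hypothetical rows with broken clocks, added only for illustration.
SAMPLE = """1502382966#Linus Torvalds
1501368308#Max Gurtovoy
1501625560#James Smart
1501625559#James Smart
1#Broken Clock
2124346106#Future Clock
"""

# Assumed plausibility window, mirroring the bounds used in section 6.
FIRST_VALID = datetime(2005, 4, 16, 22, 20, tzinfo=timezone.utc)
LAST_VALID = datetime(2018, 2, 17, tzinfo=timezone.utc)


def parse_log(text):
    """Turn 'timestamp#author' lines into (datetime, author) tuples."""
    rows = []
    for ts, author in csv.reader(io.StringIO(text), delimiter="#"):
        when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
        rows.append((when, author))
    return rows


def drop_wrong_timestamps(rows):
    """Keep only commits whose timestamp falls inside the window."""
    return [(when, who) for when, who in rows
            if FIRST_VALID < when <= LAST_VALID]


rows = drop_wrong_timestamps(parse_log(SAMPLE))
per_year = Counter(when.year for when, _ in rows)
per_author = Counter(who for _, who in rows)

print(per_year.most_common())
print(per_author.most_common(10))
```

On the full dataset, pandas remains the more convenient choice; the sketch only makes the three steps explicit in one place.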