Skip to content

Instantly share code, notes, and snippets.

@jnavila
Created May 6, 2012 17:57
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save jnavila/2623537 to your computer and use it in GitHub Desktop.
Save jnavila/2623537 to your computer and use it in GitHub Desktop.
Digging in githubarchive for forks and pull requests

Rationale

Github makes it easy to start participating to a project. The recommended way of doing this is just fork, push your changes and send a pull request to the origin project. So the question is now, how are projects using this workflow, and which projects are relying on it.

Note: all along, I use a kernel of query to extract pull requests related to forks. The request is limited to one month to keep it under the size limit. As an added bonus, the query provides the average latency between fork and pull-request.

Projects using the fork to pull paradigm

The query named Fork2PullRequestByProject.sql helps sort out the greatest projects using of this feature over the month of april:

Fork2Pull ordered by projects

Number one: Homebrew ! It seems that this workflow makes it really easy for a new comer on a project based on recipes with ties to a large number of sources to push a change related to source version update.

Second: bootstrap from Twitter. A javascript lib seems a pretty good playfield for starting sharing changes.

Third: rails

Sorting by language

This time, let's sort the combined fork+pullrequest by language. Fork2Pull ordered by language

The query named Fork2PullRequestByLanguage.sql gives the results.

This time, the result is quite expected : javascript, ruby, python. One remark though: javascript and ruby numbers exceed the double of python, as if they were really better fitted for this kind of use.

In average, all language keep the latency between fork and pull request under a day, java being the longest.

Study of latency

The query name Fork2PullByLatency.sql sorts the events by a exponential scale of the delay between fork and pull request.

Fork2Pull ordered by latency

One thing to notice is the 2 maximums in the series, one peaking at 8 to 16 minutes and a smaller one peaking at around one day.

My interpretation of this fact is that we have two kinds of conditions that lead to a fork: the quick fix one which is done in the same movement as the fork and the code pull request one which needs some more thorough review before submitting the pull request.

This assumption could be validated by a deeper analysis of the size of the pull requests in each case.

SELECT
ForkTable.repository_language,
COUNT(DISTINCT ForkTable.url) AS f2p_number,
AVG(PARSE_UTC_USEC(PullTable.created_at)-PARSE_UTC_USEC(ForkTable.created_at))/3600000000 AS f2p_interval_hour
FROM
(SELECT
url,
repository_url,
repository_language,
MIN(created_at) AS created_at
FROM
[githubarchive:github.timeline]
WHERE type='ForkEvent'
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2012-05-01 00:00:00')
GROUP BY
repository_language,
repository_url,
url)
AS ForkTable
INNER JOIN
(SELECT
repository_url,
payload_pull_request_head_repo_html_url,
MIN(created_at) AS created_at
FROM
[githubarchive:github.timeline]
WHERE type='PullRequestEvent'
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2012-05-01 00:00:00')
GROUP BY
repository_url,
payload_pull_request_head_repo_html_url)
AS PullTable
ON
ForkTable.repository_url=PullTable.repository_url AND
ForkTable.url=PullTable.payload_pull_request_head_repo_html_url
GROUP BY
ForkTable.repository_language
ORDER BY
f2p_number DESC
SELECT
COUNT(DISTINCT ForkTable.url) AS f2p_number,
FLOOR(LOG2((PARSE_UTC_USEC(PullTable.created_at)-PARSE_UTC_USEC(ForkTable.created_at))/30000000)) AS f2p_interval_log_2_minute
FROM
(SELECT
url,
repository_url,
repository_language,
MIN(created_at) AS created_at
FROM
[githubarchive:github.timeline]
WHERE type='ForkEvent'
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2012-05-01 00:00:00')
GROUP BY
repository_language,
repository_url,
url)
AS ForkTable
INNER JOIN
(SELECT
repository_url,
payload_pull_request_head_repo_html_url,
MIN(created_at) AS created_at
FROM
[githubarchive:github.timeline]
WHERE type='PullRequestEvent'
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2012-05-01 00:00:00')
GROUP BY
repository_url,
payload_pull_request_head_repo_html_url)
AS PullTable
ON
ForkTable.repository_url=PullTable.repository_url AND
ForkTable.url=PullTable.payload_pull_request_head_repo_html_url
WHERE PARSE_UTC_USEC(PullTable.created_at)>PARSE_UTC_USEC(ForkTable.created_at)
GROUP BY
f2p_interval_log_2_minute
ORDER BY
f2p_interval_log_2_minute ASC
SELECT
ForkTable.repository_url,
COUNT(DISTINCT ForkTable.url) AS f2p_number,
AVG(PARSE_UTC_USEC(PullTable.created_at)-PARSE_UTC_USEC(ForkTable.created_at))/3600000000 AS f2p_interval_hour
FROM
(SELECT
url,
repository_url,
MIN(created_at) AS created_at
FROM
[githubarchive:github.timeline]
WHERE type='ForkEvent'
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2012-05-01 00:00:00')
GROUP BY
repository_url,
url)
AS ForkTable
INNER JOIN
(SELECT
repository_url,
payload_pull_request_head_repo_html_url,
MIN(created_at) AS created_at
FROM
[githubarchive:github.timeline]
WHERE type='PullRequestEvent'
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
AND PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2012-05-01 00:00:00')
GROUP BY
repository_url,
payload_pull_request_head_repo_html_url)
AS PullTable
ON
ForkTable.repository_url=PullTable.repository_url AND
ForkTable.url=PullTable.payload_pull_request_head_repo_html_url
GROUP BY
ForkTable.repository_url
ORDER BY
f2p_number DESC
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment