@warenlg
Created August 26, 2019 14:37

Identity matching in the literature

Most relevant papers

  • Developer identification methods for integrated data from various sources, 2005, by Jesus M. Gonzalez-Barahona and Gregorio Robles.

    • Early work on identity matching.
    • Heuristic-based approach for identifying the multiple identities of a developer.
    • Accumulates identity information from several data sources: source code, versioning repositories, bug trackers, mailing lists, etc.
    • Evaluation on the GNOME project.
    • Also addresses privacy issues.
  • A comparison of identity merge algorithms for software repositories, 2011

    • Approach very similar to ours.
    • Metrics clearly defined.
    • Description of different algorithms, some of them more complex: Bird’s algorithm, Robles’s approach plus improvements.
    • Comparison of the algorithms evaluated on large open source projects.
  • Who’s who in GNOME: using LSA to merge software repository identities, 2012

    • Opens by noting that existing identity merging algorithms are sensitive to large discrepancies between the aliases used by the same individual: the noisier the data, the worse their performance. The discussion is therefore more pragmatic about noise in the data and about scale.
    • Studies all GNOME Git repositories and discusses the robustness of existing algorithms.
    • Proposes a new identity merging algorithm based on Latent Semantic Analysis (LSA).
  • Who is who in the mailing list? Comparing six disambiguation heuristics to identify multiple addresses of a participant, 2015

    • Benchmark of 6 heuristics (among them, the ones covered in the first benchmark from 2011) to perform identity matching, tested on the Apache projects.
    • Interesting research questions, among them one about time window influence on the matching performance:
      • What is the performance of the disambiguation heuristics?
      • How does the time window influence the performance of the heuristics?
      • How does the community size influence the performance of the heuristics?
  • Maispion: A Tool for Analysing and Visualising Open Source Software Developer Communities, 2009

    • Tool for analysing software developer communities.
    • Solves the identity merging issue using the Levenshtein distance.
    • Also includes an extensive study of temporal commit activity, i.e. commit time series.
  • Identity matching and geographical movement of open-source software mailing list participants, 2014

    • Extensive PhD thesis that treats the identity matching issue in depth in section 1.
    • Proposes an identity matching algorithm that handles data sets of different orders of magnitude and is robust to noisy data.
    • Building blocks of the algorithm: term-document matrix, edit distance augmentation, tf–idf, singular value decomposition and rank reduction, cosine similarity.
    • Includes discussions on optimisation and scalability.
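
Two of the similarity measures named in the list above, the Levenshtein distance (Maispion) and tf–idf weighted cosine similarity (the 2014 thesis), can be sketched in a few lines. This is only a minimal illustration under assumptions of ours, not the implementation from either work: aliases are split into character trigrams, the idf is the plain log(N / df), and the SVD rank reduction and edit distance augmentation steps are left out. All function names are hypothetical.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def trigrams(alias: str) -> Counter:
    """Character trigram counts of a lowercased, space-padded alias (our assumption)."""
    padded = f"  {alias.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def tfidf_vectors(aliases):
    """One sparse tf-idf vector per alias, i.e. one column of a term-document matrix."""
    counts = [trigrams(alias) for alias in aliases]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(aliases)
    return [{term: tf * math.log(n / doc_freq[term]) for term, tf in c.items()}
            for c in counts]


def cosine(u, v) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    aliases = ["John Smith", "jsmith", "Jane Doe", "john.smith"]  # made-up data
    vectors = tfidf_vectors(aliases)
    print(levenshtein("kitten", "sitting"))  # 3
    print(cosine(vectors[0], vectors[3]) > cosine(vectors[0], vectors[2]))
```

In this toy data the two "john smith" aliases share several trigrams, while "John Smith" and "Jane Doe" share only a trigram that occurs in every alias and therefore gets zero idf weight.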

Papers related to Social Network Analysis (SNA) in open source projects

  • Mining Email Social Networks, 2006, by C. Bird and P. Devanbu.

    • Constructs social networks of email correspondents.
    • Tackles interesting questions that come right after identity matching: (1) the social status of different types of OSS participants, (2) the relationship between email activity and commit activity, (3) the relationship between social status and commit activity.
    • Similar paper: Mining Email Social Networks in Postgres
    • Follow-up paper: Validity of Network Analyses in Open Source Projects, 2008, which studies the stability of network metrics, such as node centrality, in the presence of inadequate and missing data.
  • Latent Social Structure in Open Source Projects, 2008, by C. Bird and P. Devanbu.

    • Very well written and verbose paper about the dynamic, self-organizing, latent, and usually not explicitly stated structure beneath the “bazaar-like” nature of Open Source Software (OSS) projects.
    • Observes that subcommunities form spontaneously within the developer teams.
    • Gives lessons for how commercial software teams might be organized.
    • Details techniques for detecting community structure in complex networks, then extracts and studies latent subcommunities from the email social networks of several projects.
    • Also observes that subcommunities manifest most strongly in technical discussions and are significantly connected with collaboration behaviour.
  • Applying Social Network Analysis to the Information in CVS Repositories, 2004, by Gregorio Robles and Jesus M. Gonzalez-Barahona.

    • Details the basic concepts of social network analysis.
    • Defines the networks of developers and projects, and the measurements of interest that can be computed on them.
    • Analysis of the GNOME and Apache networks.
  • Using Social Network Analysis Techniques to Study Collaboration between a FLOSS Community and a Company, 2008, by Gregorio Robles and Jesus M. Gonzalez-Barahona.

    • Extracts information about the development process of FLOSS projects.
    • Constructs and studies the developer network.
    • Defines and studies network parameters in the context of VCS analysis, such as distance centrality, betweenness centrality, coordination degree, eigenvector centrality, etc.
    • Detects the most important events in a development history.
    • Highlights aspects such as efficiency in the development process, release management and leadership turnover.
    • slides
    • Similar paper: Studying the evolution of libre software projects using publicly available data, 2003.
  • Evolution of the core team of developers in libre software projects, 2006, by Gregorio Robles and Jesus M. Gonzalez-Barahona.

    • Studies the stability and permanence of the core team in open source projects.
    • Measures core team activity over time, looking for core team evolution patterns.
    • Evaluated on the GIMP project, a case of "code gods".
    • Several visuals in the paper.
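
The developer networks and centrality measures mentioned in the papers above can be illustrated with a small sketch. It assumes, as a simplification of ours, that two developers are linked whenever they committed to the same file, and it uses textbook definitions of degree and closeness (distance-based) centrality, which may differ from the exact definitions in the papers. Function names and data are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_network(commits):
    """Link two developers whenever they committed to the same file (our assumption)."""
    authors_by_file = defaultdict(set)
    for developer, path in commits:
        authors_by_file[path].add(developer)
    graph = {}
    for authors in authors_by_file.values():
        for developer in authors:
            graph.setdefault(developer, set())  # keep isolated developers as nodes
        for a, b in combinations(sorted(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other developers each developer is directly linked to."""
    n = len(graph)
    return {dev: (len(nbrs) / (n - 1) if n > 1 else 0.0)
            for dev, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable developers divided by the sum of shortest-path lengths (BFS)."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


if __name__ == "__main__":
    # Made-up (developer, file) pairs, as they might be extracted from a VCS log.
    commits = [("alice", "a.c"), ("bob", "a.c"), ("bob", "b.c"),
               ("carol", "b.c"), ("dave", "c.c")]
    network = build_network(commits)
    print(degree_centrality(network))
    print(closeness_centrality(network))
```

Here bob bridges alice and carol and therefore scores highest on both measures, while dave, who shares no file with anyone, scores zero.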

Others
