@ria18405
Last active October 19, 2020 17:09
Social Currency Metric System
google-summer-of-code

Implementing Social Currency Metric System in GrimoireLab

DESCRIPTION

The SCMS measures the social currency of your OSS community’s members.
Open Source Software (OSS) projects are based on collaborative efforts, where contributors improve upon the source code and share the changes within the community. In such a context, understanding community opinions is paramount to ensure the sustainability and health of the project. Nevertheless, the lack of tooling forces decision-makers into manual and time-consuming work to track the community interactions and properly gauge them.

The Social Currency Metric System (SCMS) is a qualitative data collection, processing and measurement system. It was developed from social scientific theories that measure the sentiments in a community as social currencies. The SCMS represents the reputation of a community as measured in community transparency, utility, consistency, merit, and trust. It aims to make sense of the data in passive daily interactions in order to determine community health holistically. The SCMS was built to help the community rely more on community sentiment than on easily trackable, but less informative, quantitative data.

The SCMS broadly considers how human interactions build relationships and trust in a community. To do this, the system simplifies a complicated process of analysing qualitative data created by users in a community, and it makes understanding your target audience's statements straightforward and quantifiable.

The SCMS works by collecting and processing daily community interaction data like emails, GitHub comments, tweets, and conversations on public forums. With these sources as the basis of the system, the SCMS will empower community leaders to make key quality decisions regarding the transparency and actionability of open source project health, based directly on the thoughts and feelings of their users.

The SCMS is built on top of GrimoireLab, an open-source toolkit used for Software Development Analytics. Implementing the SCMS will ultimately help community leaders, power users, and other stakeholders leverage qualitative data for social listening, so that they can rely less on the behaviours that quantitative data tracks and more on community sentiment.

TERMINOLOGY

  • GrimoireLab: GrimoireLab is an open-source toolkit for Software Development Analytics. It provides tools for:

    1. Data Gathering: Collecting data from around 30-40 data sources, like Git, GitHub, Slack, Twitter, mailing lists, etc.
    2. Data Enrichment: Merging duplicate identities, adding information about contributors' affiliations, calculating delays, adding geographical data, etc. GrimoireLab includes special enrichers for performing this task (GrimoireLab ELK).
    3. Data Consumption and Visualisation: Providing a dashboard which allows filtering by time range, project, repository, contributor, etc.
  • Elasticsearch: Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents.

  • Elasticsearch index: An Elasticsearch index stores all fields and their corresponding data. An index can be thought of as an optimized collection of documents, and each document is a collection of fields, which are the key-value pairs that contain your data.
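To make the relationship between index, document and field concrete, here is a minimal sketch that models an index as a plain Python list of JSON-serialisable documents. It does not talk to Elasticsearch, and the field names (`source`, `body`, `created_at`) are hypothetical, chosen only for illustration.

```python
import json

# A document is a collection of fields (key-value pairs).
# The field names are hypothetical, not the real SCMS schema.
document = {
    "source": "github",
    "body": "Thanks for the quick review!",
    "created_at": "2020-06-01T10:00:00",
}

# An index can be pictured as a collection of such documents.
index = [document]

# Documents are stored serialised as JSON; a round trip keeps every field.
serialised = json.dumps(document, sort_keys=True)
restored = json.loads(serialised)


def field_names(docs):
    """Return the sorted set of field names used across all documents."""
    return sorted({key for doc in docs for key in doc})


print(field_names(index))
```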

Figure 1: Overview of the approach

WORK

ACTIVITY REPORTS

Community Bonding - May 5th to May 31st, 2020

In the Community Bonding Period, I made initial contact with my mentors, agreed on the format of our weekly sync and on the communication channels, and set up the project tracker repository.

During this period, I observed the community closely and studied my project in detail. To understand the idea of the SCMS correctly, I collected a small amount of data from Amazon's tweets and examined it from a social standpoint. I executed a pilot study to analyse the pros and cons of the two paths that we were considering (using ad-hoc indexes or using existing indexes). This informed further implementation steps for the project, which differed from the plan at the time of my application.

The modified timeline involved writing enrichers for the SCMS and thereby having ad-hoc indexes, instead of using existing indexes as in the proposed timeline. The proposed timeline also included one week to explore automation techniques using an AI algorithm; after discussions, we found it more important to test the system by implementing it for a community.

Week Number | Blog | Weekly Summary
WEEK 0 | Accepted for GSoC'20 |
WEEK 1 | Community Bonding week 1 | Weekly report
WEEK 2 | Community Bonding week 2 | Weekly report
WEEK 3 | Community Bonding week 3 | Weekly report

Coding Period 1 - June 1st to July 3rd, 2020

In the first coding period, I observed the primary channels for community interactions in GrimoireLab and found GitHub comments, emails on the mailing lists and IRC interactions to be predominant. Thus, we decided to focus on these channels (i.e., GitHub, mailing lists, IRC). I wrote SCMS enrichers for all these channels, along with a script to convert Elasticsearch indexes to an Excel sheet, Airtable, and/or Google Sheets (see ES2GSheet in Fig. 1).

I soon noticed that the free version of Airtable has a limit of 1200 records, which is not enough to capture all community interactions. Google Sheets, with a limit of 5 million cells per spreadsheet, seemed a better alternative.
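The comparison comes down to simple arithmetic, sketched below. The two limits are the figures quoted in this report; the column count of 15 is an assumption made purely for illustration, and the record count of 6000 is roughly the number of GrimoireLab records tagged for the dashboard.

```python
AIRTABLE_FREE_RECORD_LIMIT = 1200   # limit quoted in this report
GSHEET_CELL_LIMIT = 5_000_000       # limit quoted in this report


def fits_airtable(n_records):
    """True if the records fit within the free Airtable record limit."""
    return n_records <= AIRTABLE_FREE_RECORD_LIMIT


def max_gsheet_rows(n_columns):
    """Largest number of rows that stays under the Google Sheets cell limit."""
    return GSHEET_CELL_LIMIT // n_columns


n_records = 6000   # approximate size of the tagged GrimoireLab data set
n_columns = 15     # assumed number of columns per record (illustrative)

print(fits_airtable(n_records))                  # False
print(n_records <= max_gsheet_rows(n_columns))   # True
```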

Since we needed to look at the visualisations and the dashboard to make sure that the prototype was ready, I wrote a script to randomly tag the entire Excel sheet with different combinations of social currency tags. Then, I wrote a script to convert the tagged Excel data into a JSON file (see GSheet2Dashboard in Fig. 1), which is used to add the social currency tags as extra information to the same index.
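Random tagging of this kind can be sketched in a few lines. This is an illustration of the idea and not the script used in the project: the function names, the `scms_tags` field and the sample records are hypothetical, while the five social currencies are the ones the SCMS defines.

```python
import random

SOCIAL_CURRENCIES = ["Transparency", "Utility", "Consistency", "Merit", "Trust"]


def random_tags(rng, max_tags=5):
    """Pick between 1 and max_tags distinct social currency tags."""
    count = rng.randint(1, min(max_tags, len(SOCIAL_CURRENCIES)))
    return rng.sample(SOCIAL_CURRENCIES, count)


def tag_records(records, seed=0):
    """Return copies of the records with a random 'scms_tags' list added."""
    rng = random.Random(seed)  # fixed seed keeps the prototype reproducible
    return [dict(record, scms_tags=random_tags(rng)) for record in records]


records = [{"id": i, "body": f"message {i}"} for i in range(3)]
tagged = tag_records(records)
for row in tagged:
    print(row["id"], row["scms_tags"])
```

Seeding the generator means the same placeholder tags are produced on every run, which makes it easier to compare dashboard iterations.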

Week Number | Blog | Weekly Summary
WEEK 1 | Coding Period 1 week 1 | Weekly report
WEEK 2 | Coding Period 1 week 2 | Weekly report
WEEK 3 | Coding Period 1 week 3 | Weekly report
WEEK 4 | Coding Period 1 week 4 | Weekly report

Coding Period 2 - July 3rd to July 27th, 2020

During the second coding period, along with social currencies, we began incorporating record categories (e.g., administrative, interpersonal, community content) and a scoring system.

The Categories tag was used to categorise the data into different blocks, to get an overview of the different discussion topics in the community. The scoring system was based on the “Relevance” of records and ranged from -3 to +3: highly relevant comments get a score of +3, while comments that are not directly related to the community or project, and are not crucial for the community to track, get a score of -3.

Additionally, I learnt to make visualisations in Kibana and developed a Dashboard for SCMS. I presented the dashboard to all the mentors, and we discussed how we could make it more informative. We came up with some exciting ideas of visualisations like the median score, representing category information, having percentage indicators for social currency tags and more. Additionally, I wrote unit tests for all SCMS enrichers, and the documentation on ‘How to implement the SCMS’.

Dashboard of GrimoireLab: Figure 2 shows the dashboard of GrimoireLab obtained by taking three prominent channels (GitHub, IRC, emails) into consideration. It was the result of randomly tagging around 6000 combined records. It has a score gauge displaying the average and median score, and 3 pie charts representing the ratios of social currencies, categories, and sources of data. A time series plot shows how the tags change over time, and a bar graph shows the frequency of community interactions.

Figure 2: Dashboard of GrimoireLab
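In the dashboard these numbers are computed by Kibana aggregations. The sketch below only illustrates the arithmetic behind the score gauge and the social currency pie chart, using made-up records; the `score` and `tags` field names are assumptions.

```python
from collections import Counter
from statistics import mean, median

# Toy records: the scores and tags are invented for illustration.
records = [
    {"score": 3, "tags": ["Transparency", "Utility"]},
    {"score": -1, "tags": ["Trust"]},
    {"score": 2, "tags": ["Transparency"]},
    {"score": 0, "tags": ["Merit", "Transparency"]},
]

scores = [record["score"] for record in records]
average_score = mean(scores)    # value behind the "average" gauge
median_score = median(scores)   # value behind the "median" gauge

# Share of each social currency among all assigned tags (pie-chart ratios).
tag_counts = Counter(tag for record in records for tag in record["tags"])
total_tags = sum(tag_counts.values())
tag_share = {tag: 100 * count / total_tags for tag, count in tag_counts.items()}

print(average_score, median_score)
print(tag_share)
```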

Week Number | Blog | Weekly Summary
WEEK 5 | Coding Period 2 week 5 | Weekly report
WEEK 6 | Coding Period 2 week 6 | Weekly report
WEEK 7 | Coding Period 2 week 7 | Weekly report
WEEK 8 | Coding Period 2 week 8 | Weekly report

Coding Period 3 - July 31st to Aug 24th, 2020

In the third coding period, we initiated a case study with the CHAOSS community as a usability check. For this case study, I implemented the SCMS for CHAOSS, which involved collecting, processing and visualising community interactions from its 3 most popular channels: Twitter, IRC, and mailing lists.

We included all conversations from the beginning of the project. It was a wonderful experience working on the case study. I’d like to offer a big thanks to Elizabeth for participating in the case study, providing valuable input, and helping us enhance the user experience while tagging. The case study showed us how much views, tags and filters at the Google Sheet level can improve the tagging experience. It also made us think about automating the execution of the scripts to fetch new data every week. After this, I worked on combining, polishing and testing all the scripts used for the SCMS.

Week Number | Blog | Weekly Summary
WEEK 9 | Coding Period 3 week 9 | Weekly report
WEEK 10 | Coding Period 3 week 10 | Weekly report
WEEK 11 | Coding Period 3 week 11 | Weekly report

CASE STUDY WITH CHAOSS

The case study was a constructive way to apply the knowledge of social currencies and the system to a real community. It exposed the setbacks of the current implementation approach (e.g., the need for a database for all tagged records) and helped us enhance the user experience of the SCMS. We tackled some pain points by implementing views, filters and functions in Google Sheets. The case study also gave us room to think about actual difficulties from a community’s standpoint. I have added the limitations observed while implementing the SCMS for the CHAOSS community under Future Scope.

Figure 3: Dashboard of CHAOSS

Here are some of the inferences from CHAOSS Dashboard:

  1. Utility peaked in October 2019 with a value of 12. At all other times its value has been around 2–3, which indicates an uneven level of Utility. We could infer that measures should be taken to make Utility more uniform.

  2. The values of the other social currencies (Trust, Merit, Consistency) have been low throughout, at around 1–2 in each month. To lift the overall social currency, we’ll need to work on increasing Trust, Merit and Consistency in the community.

  3. Channel-based inferences:

    • We found that 42.86% of emails are categorised as Administrative and only 14% involve community content, which means that increasing contributor participation is imperative.
    • IRC conversations mostly involve technical support discussions (32%) or interpersonal discussions (24%), the latter owing to IRC being a chat system.
    • 38% of tweets are about community content, and 42% of tweets are about conferences (owing to the nature of the forum).
  4. On all platforms, positive weights outweigh negative weights, meaning most community discussions are relevant to the project and the people involved.

  5. Transparency has been fairly consistent throughout the last two years. Also, we see “Transparency” as the most popular social currency of the CHAOSS community.
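Channel-based percentages like the ones above are simple ratios of category counts per channel. A minimal sketch of that calculation follows; it uses toy data rather than the CHAOSS records, and the `channel` and `category` field names are assumptions.

```python
from collections import Counter, defaultdict


def category_percentages(records):
    """Map each channel to {category: percentage of that channel's records}."""
    per_channel = defaultdict(Counter)
    for record in records:
        per_channel[record["channel"]][record["category"]] += 1
    result = {}
    for channel, counts in per_channel.items():
        total = sum(counts.values())
        result[channel] = {
            category: round(100 * count / total, 2)
            for category, count in counts.items()
        }
    return result


# Toy data, not the CHAOSS records: 3 of 7 emails are Administrative.
records = (
    [{"channel": "email", "category": "Administrative"}] * 3
    + [{"channel": "email", "category": "Community Content"}] * 1
    + [{"channel": "email", "category": "Technical Support"}] * 3
)
shares = category_percentages(records)
print(shares["email"]["Administrative"])  # 42.86
```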

IMPLEMENTATION STEPS

STEP 1: ES to Google Sheets

  1. Set up GrimoireLab SirMordred. (Getting-Started)

  2. According to the channels to be analysed, set projects.json and setup.cfg as mentioned here.

    A simple example could be:

    1. Set setup.cfg as:
    	[scmspipermail]
    	raw_index = scmspipermail_raw
    	enriched_index = scmspipermail_enriched
    	no-ssl-verify = true
    
    	[scmsgithub]
    	raw_index = scmsgithub_raw
    	enriched_index = scmsgithub_enriched
    	api-token = xxxx
    	sleep-for-rate = true
    	no-archive = true
    	category = issue
    	sleep-time = 300
    
    	[scmssupybot]
    	raw_index = scmssupybot_raw
    	enriched_index = scmssupybot_enriched
    
    2. Set projects.json as:
    {
    	"chaoss": {
    		"scmsgithub": [
    			"https://github.com/chaoss/grimoirelab-perceval",
    			"https://github.com/chaoss/grimoirelab-elk"
    			],
    		"scmspipermail": [
    			"https://lists.linuxfoundation.org/pipermail/grimoirelab-discussions/"
    			],
    		"scmssupybot": [
    			"irc://chat.freenode.net/chaoss-community /irclogs/freenode/#chaoss-community",
    			"irc://chat.freenode.net/grimoirelab /irclogs/freenode/#grimoirelab"
    			]
    		}
    }
    
  3. Enrich the raw data by executing SirMordred with the following parameters:

    --enrich --panels --cfg ./setup.cfg --backends scmssupybot scmsgithub scmspipermail

    (After --backends, list every data source you are using in the SCMS.)

  4. Set an alias: This step is important because we want to refer to all the SCMS indexes together. In the example below, the alias all_scms is set for the 3 SCMS enriched indexes (scmspipermail_enriched, scmsgithub_enriched, scmssupybot_enriched).

    POST /_aliases
    {
    	"actions": [
            {
                "add": {
                    "index": "scmspipermail_enriched",
                    "alias": "all_scms"
                }
            },
            {
                "add": {
                    "index": "scmsgithub_enriched",
                    "alias": "all_scms"
                }
            },
            {
                "add": {
                    "index": "scmssupybot_enriched",
                    "alias": "all_scms"
                }
            }
        ]
    }
    
  5. Set up the SCMS-creds.json file with your credentials for using the Google Sheets API.

  6. Execute the ES2GSheet script, which converts the Elasticsearch indexes behind the all_scms alias into a Google Sheet. (Output: the enriched data from Elasticsearch is uploaded to a Google Sheet.)

    cd utils/
    python3 ES2GSheet.py 
    

We have set up our first basic instance for SCMS Implementation!
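When more channels are analysed, the alias request from step 4 grows by one `add` action per enriched index. The sketch below generates that request body for an arbitrary list of indexes; `build_alias_actions` is a hypothetical helper, and sending the body to Elasticsearch is deliberately left out.

```python
import json


def build_alias_actions(indexes, alias):
    """Build the body for POST /_aliases that adds `alias` to each index."""
    return {
        "actions": [{"add": {"index": index, "alias": alias}} for index in indexes]
    }


enriched_indexes = [
    "scmspipermail_enriched",
    "scmsgithub_enriched",
    "scmssupybot_enriched",
]
body = build_alias_actions(enriched_indexes, "all_scms")
print(json.dumps(body, indent=4))
```

The printed JSON can be pasted into the Kibana console after the `POST /_aliases` line.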

Figure 4: Conversion of Elasticsearch indexes to a Google Sheet
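The core of such a conversion is flattening each enriched document into one spreadsheet row. The following is a sketch of that idea only, not the ES2GSheet script itself: the hits are hand-written in the shape of a search response, the field and column names are hypothetical, and the rows are written to an in-memory CSV instead of the Google Sheets API.

```python
import csv
import io


def hits_to_rows(hits, columns):
    """Turn Elasticsearch-style hits into a header row plus one row per document."""
    rows = [list(columns)]
    for hit in hits:
        source = hit.get("_source", {})
        # Missing fields become empty cells, ready to be filled in while tagging.
        rows.append([source.get(column, "") for column in columns])
    return rows


# Hand-written hits; "_source" holds the document fields (names are assumptions).
hits = [
    {"_source": {"uuid": "a1", "origin": "irc", "body": "hello"}},
    {"_source": {"uuid": "b2", "origin": "github"}},
]
columns = ["uuid", "origin", "body", "category", "weight"]
rows = hits_to_rows(hits, columns)

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```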

STEP 2: Tagging & Codex formation:

  1. Set up a codex containing the definitions and use cases of all social currencies and categories.

  2. Tag all records of the spreadsheet by adding 'category', 'weight' and 'SCMS Tags' (you can add up to 5 SCMS tags: Tag 1, Tag 2, Tag 3, Tag 4, Tag 5).

The SCMS currencies are: Transparency, Utility, Consistency, Merit, Trust.

  1. The categories vary from one community to another. Some example categories are: Administrative, Interpersonal, Community Content, Conference, etc.

  2. Weight is also community dependent; there are many weighting scales and methods that one can incorporate. We usually consider weights from -3 to +3, where

    • +ve weight = comment very relevant to the project or community
    • 0 weight = neutral relevance
    • -ve weight = completely irrelevant or unnecessary discussion comment

    You can also set Weight based on a Happiness method where

    • +ve weight = positive comment,
    • 0 weight = neutral,
    • -ve weight = negative comment
  3. Keep the codex updated as the tagging proceeds.
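Before converting the sheet, it can help to check every tagged row against the codex rules above. Here is a minimal sketch of such a check; the row layout (`category`, `weight`, `tags`) and the function name are assumptions, and the category set should be replaced with the one from your own codex.

```python
SOCIAL_CURRENCIES = {"Transparency", "Utility", "Consistency", "Merit", "Trust"}
MAX_TAGS = 5
MIN_WEIGHT, MAX_WEIGHT = -3, 3


def validate_row(row, categories):
    """Return a list of problems found in one tagged row (empty if none)."""
    problems = []
    if row.get("category") not in categories:
        problems.append("unknown category")
    weight = row.get("weight")
    if not isinstance(weight, int) or not MIN_WEIGHT <= weight <= MAX_WEIGHT:
        problems.append("weight outside -3..+3")
    tags = row.get("tags", [])
    if len(tags) > MAX_TAGS:
        problems.append("more than 5 tags")
    if any(tag not in SOCIAL_CURRENCIES for tag in tags):
        problems.append("unknown social currency")
    return problems


# Example categories taken from this report; yours may differ.
categories = {"Administrative", "Interpersonal", "Community Content", "Conference"}
good = {"category": "Administrative", "weight": 2, "tags": ["Trust", "Merit"]}
bad = {"category": "Other", "weight": 5, "tags": ["Fun"]}

print(validate_row(good, categories))  # []
print(validate_row(bad, categories))
```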

Figure 5: Codex of GrimoireLab

STEP 3: Google Sheet to Dashboard

  1. Download the Google Sheet as a CSV file (.csv, current sheet).

  2. Convert the CSV file to a JSON file using the GSheet2Dashboard script.

    	cd utils/
    	python3 GSheet2Dashboard.py
    

    Output: "extra_data.json" (See Figure 6)

Figure 6: extra data json file
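The conversion from tagged CSV rows to JSON can be sketched as follows. This is not the GSheet2Dashboard script: the entry layout (`conditions` plus `set_extra_fields`), the `uuid` identifier column and the field names are assumptions about what the study expects, so check them against the study definition and Figure 6 before relying on this.

```python
import csv
import io
import json

TAG_COLUMNS = ["Tag 1", "Tag 2", "Tag 3", "Tag 4", "Tag 5"]


def csv_to_extra_data(csv_text, id_field="uuid"):
    """Convert tagged CSV rows into a list of condition/extra-field entries."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep only the tag columns that were actually filled in.
        tags = [row[column] for column in TAG_COLUMNS if row.get(column)]
        entries.append({
            # Assumed layout: match one document by its identifier...
            "conditions": [{"field": id_field, "value": row[id_field]}],
            # ...and attach the tagging information to it.
            "set_extra_fields": [
                {"field": "scms_tags", "value": tags},
                {"field": "category", "value": row["category"]},
                {"field": "weight", "value": int(row["weight"])},
            ],
        })
    return entries


csv_text = (
    "uuid,category,weight,Tag 1,Tag 2,Tag 3,Tag 4,Tag 5\n"
    "a1,Administrative,2,Trust,Merit,,,\n"
)
extra_data = csv_to_extra_data(csv_text)
print(json.dumps(extra_data, indent=2))
```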

  1. Upload this JSON file to a GitHub gist and set the raw URL of the gist in the setup.cfg file as explained below.

  2. Now, execute the enrich_extra_data study to add the tagged information back into the enriched index. The definition of this study can be found here. Enable the study by modifying setup.cfg as below.

    [scmspipermail]
    raw_index = scmspipermail_raw
    enriched_index = scmspipermail_enriched
    no-ssl-verify = true
    studies = [enrich_extra_data:scms]
    
    [scmsgithub]
    raw_index = scmsgithub_raw
    enriched_index = scmsgithub_enriched
    api-token = xxxx
    sleep-for-rate = true
    no-archive = true
    category = issue
    sleep-time = 300
    studies = [enrich_extra_data:scms]
    
    [scmssupybot]
    raw_index = scmssupybot_raw
    enriched_index = scmssupybot_enriched
    studies = [enrich_extra_data:scms]
    
    [enrich_extra_data:scms]
    json_url= https://gist.githubusercontent.com/ria18405/cdc0e16898a26e2a566fbb475cbf1a3b/raw/4ffa5f41d9bfbed88a51b952e994ca290a713719/chaoss-tagged.json
    

    (Set json_url to the URL of the gist containing all the extra data.)

  3. Execute SirMordred the same way as above:

    --enrich --panels --cfg ./setup.cfg --backends scmssupybot scmsgithub scmspipermail

Now, we can analyse the dashboard formed.
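Adding the `studies` line to every SCMS section by hand is easy to get wrong when many channels are configured. The sketch below applies the change with Python's standard `configparser`; it only manipulates INI text in memory, `add_study` is a hypothetical helper, and the example URL is a placeholder rather than a real gist.

```python
import configparser

STUDY = "enrich_extra_data:scms"


def add_study(config_text, json_url):
    """Return a ConfigParser in which every scms* section runs the study."""
    # interpolation=None so that '%' characters in URLs are taken literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    for section in config.sections():
        if section.startswith("scms"):
            config[section]["studies"] = f"[{STUDY}]"
    if STUDY not in config:
        config.add_section(STUDY)
    config[STUDY]["json_url"] = json_url
    return config


config_text = """
[scmspipermail]
raw_index = scmspipermail_raw
enriched_index = scmspipermail_enriched

[scmssupybot]
raw_index = scmssupybot_raw
enriched_index = scmssupybot_enriched
"""
config = add_study(config_text, "https://example.org/extra_data.json")
for section in config.sections():
    print(section, dict(config[section]))
```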

FUTURE SCOPE

  • Automation:
    • Tagging (keyword analysis | sentiment analysis) based on the codex
    • Automatically reflecting real-time changes in the Google Sheet on the dashboard
    • Auto-importing data week by week (running the script automatically every week)
  • UI:
    • User interface to implement the SCMS
    • Moving the tagging interface from Google Sheets to a more advanced implementation similar to Medallia or MaxQDA
    • Manual override of automatically generated tags
  • Codex:
    • Expanding the codex terms by adding subtags, subcategories, etc.
  • Database:
    • Saving tagged records in an SQL database or similar (for when the Google Sheets limit is reached)
  • Overall:
    • Chrome Extension to tag records simultaneously.

LEARNINGS:

Working with CHAOSS was a wholesome experience. I got to learn a new thing almost every day. Here are a few of the most prominent ones: ⭐

  • Collaborating, both synchronously and asynchronously, with a team spread across different locations and significantly different time zones.
  • Understanding social scientific theories and the importance of the SCMS in building healthy communities.
  • Understanding the roles and problems of community leaders, marketing experts and strategists in different communities and companies.
  • Going through the CHAOSS pilot study, which showed me how communities infer their social reputation and improve their community health by adopting several measures.
  • Working on a huge codebase and writing clean, well-commented code.
  • Performing pilot studies to analyse the pros and cons of different code architectures.
  • Understanding the architecture of the existing codebase, planning and designing code accordingly, and only then starting to code.
  • Understanding code written by someone else and modifying it, along with adding relevant tests.
  • Paying attention to detail while adding features.
  • Accessing Elasticsearch with Python and integrating it with Excel and Google Sheets.
  • Working with tools: DVCS (git), Elasticsearch, Kibana, Python, Google Sheets, GrimoireLab, and containerization/virtualization (Docker).
  • Writing weekly technical blogs for the weekly sync and taking minutes of the meetings, which helped in the timely execution of the project.

A Big THANK YOU! ❤️

  • To my mentor Valerio Cosentino for helping me with the roadblocks, and explaining every technical aspect of the project with so much patience!
  • To my mentors Georg Link, Dylan Marcy and Samantha Venia Logan for their continued support, which helped me grow and understand things with much more clarity this summer!
  • To Elizabeth Barron for participating in the pilot project of the SCMS, and being really supportive!
  • To the @CHAOSS community for being super appreciative of the work!
  • To all my friends and family, who’ve motivated and supported me throughout!
