@webchick
Created August 28, 2018 22:58
CHAOSS WG on Growth-Maturity-Decline metrics: status report
The idea behind the CHAOSS working groups is to bring together people who think about metrics and want to talk about metrics with the people who write software. There are two WGs: Growth-Maturity-Decline and Diversity & Inclusion.
This WG: every project goes through a natural cycle of growth/maturity/decline. We aim to make this cycle visible to people new to an open source project.
Work with GMD metrics, define them precisely, explain use cases, provide reference implementation.
For example: “count commits” … what do we mean by this? Count merge commits? Empty commits? etc.
1. Define the metric as precisely as possible.
2. Explain why you want the metric and what it should show.
3. Create a sample implementation that can be used across different software vendors.
The WG periodically delivers releases of the GMD metrics to the main repo:
https://github.com/chaoss/wg-gmd
Over next few months:
More definitions of metrics, more code to gather them, and more use cases for why they are desired.
Data is gathered with Perceval, which talks to the API and extracts the raw data as a JSON file.
At some point, provide some Python code to generate the metrics.
Allow people to fork and make their own metrics as well.
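For illustration only, a minimal sketch (not shown in the session) of what gathering raw data with Perceval's git backend and dumping it to a JSON file could look like. The repository URL and local paths are placeholders, and the item layout reflects Perceval's git backend as I understand it:

```python
# Minimal sketch: fetch raw commit data with Perceval's git backend
# and dump it to a JSON file. URL and paths are placeholders.
import json

from perceval.backends.core.git import Git

repo = Git(uri='https://github.com/chaoss/wg-gmd.git',  # any git repository
           gitpath='/tmp/wg-gmd.git')                    # local clone/cache path

# fetch() yields one dict per commit; the raw commit data sits under 'data'.
items = list(repo.fetch())

with open('commits.json', 'w') as f:
    json.dump(items, f, indent=2)
```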
Repo layout:
- activity-metrics/
- examples/
- 2_Growth-Maturity-Decline.md
Metrics:
- Open issues
- Closed issues
- Issue Resolution Efficiency
- Open Issue Age
- First Response to Issue Duration
- Closed Issue Resolution Duration
- Issue Resolution Duration
Each has:
1. Description
2. Use cases
3. Sample filter / visualization
4. Sample implementations
5. Known implementations
6. External References (Literature)
Examples:
How to compute code commits
Download Python notebook and run in-browser
Grabs all commits, produces JSON file.
Then, parse JSON file to produce e.g. number of commits.
Can account for edge cases: only count non-merge commits, exclude empty commits, only count commits to master.
Pandas allows for buckets/filters. Use this to pull commits from geographical area, timezone, …
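A hedged sketch of that post-processing: count commits from the raw JSON while handling a couple of the edge cases above, then use pandas to bucket by the author's timezone. The field names ('parents', 'files', 'AuthorDate') follow Perceval's git output as remembered; adjust to the actual schema:

```python
# Sketch: parse the raw JSON and count commits, handling some edge cases,
# then bucket by author timezone with pandas.
import json

import pandas as pd

with open('commits.json') as f:
    items = json.load(f)

commits = [item['data'] for item in items]

# Edge cases: skip merge commits (more than one parent) and empty commits
# (no files touched). Restricting to a single branch would need the fetch
# itself to be limited to that branch, or extra ref information.
non_merge = [c for c in commits if len(c.get('parents', [])) <= 1]
non_empty = [c for c in non_merge if c.get('files')]
print('Total commits:', len(commits))
print('Non-merge, non-empty commits:', len(non_empty))

# Buckets/filters: group commits by the author's UTC offset, taken from the
# trailing "+0200"-style field of the AuthorDate string.
df = pd.DataFrame({'author_date': [c['AuthorDate'] for c in commits]})
df['utc_offset'] = df['author_date'].str.split().str[-1]
print(df.groupby('utc_offset').size())
```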
Three “families” of metrics:
- Code development
- Issue resolution
- Community growth
Goal / Question / Metric model (borrowed from D&I)
Goal: “metrics family”
Template: description, use cases, etc.
Example implementations w/ Python notebook w/ Perceval output
Todo list:
—————
Goal is to complete Python notebooks for all metrics.
Discuss + document use cases
Discuss + decide accurate definitions
Separate fundamental from general:
- Time series
- Filters (certain period, certain repo)
- Buckets (by person, company, repo)
This matters when you put data into e.g. a dashboard; it's a bit more technical.
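A small illustration of those three mechanics in pandas, using made-up commit records (in practice they would come from the Perceval JSON above):

```python
# Sketch: time series vs. filters vs. buckets on a commits DataFrame.
# The records below are invented for illustration.
import pandas as pd

df = pd.DataFrame([
    {'author': 'Alice <alice@example.org>', 'author_date': 'Mon Aug 06 10:12:01 2018 +0200'},
    {'author': 'Bob <bob@example.org>',     'author_date': 'Tue Aug 14 18:30:37 2018 -0500'},
    {'author': 'Alice <alice@example.org>', 'author_date': 'Wed Aug 22 09:05:00 2018 +0200'},
])
df['date'] = pd.to_datetime(df['author_date'], utc=True)

# Time series: commits per week.
weekly = df.set_index('date').resample('W').size()

# Filter: only a certain period (here, everything since 2018-08-10).
recent = df[df['date'] >= pd.Timestamp('2018-08-10', tz='UTC')]

# Bucket: commits per person.
per_author = df.groupby('author').size().sort_values(ascending=False)

print(weekly, per_author, len(recent), sep='\n\n')
```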
Metrics Status dashboard: https://dev.augurlabs.io/.
====
FEEDBACK
Favourite endpoint from Augur is “contribution” endpoint; lots of data distilled into a simple picture. Saves tons of time.
Another piece of feedback: GHArchive, GHTorrent… that’s kind of like the ocean. The more granular the time box, the better. Monthly is the current granularity, but smaller chunks would be hugely valuable to the community.
One thing we need to know is in addition to the number you want, how you want to present it. For example, as a total, but also by week. These kinds of details are interesting to us.
Totals tend to be a vanity metric; it’s more useful to break them down weekly/monthly. Then I can bring that to others on the team and see how changes have moved the numbers up/down, versus doing some kind of average metric across the months in a year.
Lots of independent metrics, but lots of folks like “Red/Yellow/Green” aggregates of scores to help us make decisions quickly. Do we foresee trying to create recommended aggregates/scores? People live and die by PDSS(?) today, which is an aggregate of all kinds of metrics, as well as a weighting of said metrics.
Approach this in layers. First, need the raw metrics. Then, aggregate from there, and decide on weighting. But weighting will likely be specific to each community. However, could do recommendations by “persona”; e.g. community manager vs. developer manager.
One use case could be “For this specific goal, we can have this combination of metrics.”
But, people want opinions. They want consistency across projects. If each project makes up their own qualitative metric, it’s not useful to consumers.
Could possibly do:
- Here’s a set of metrics if you’re managing the project.
- Here’s a set of metrics for platforms
- Here’s a set of metrics for web frameworks
Find commonalities among projects/communities.
Concern is that the CHAOSS project then becomes non-agnostic. More political. “You’re doing well, you’re not doing great.”
There is tension. But for it to be valuable to customers, they need truth. We can’t be everything to everyone. It is the thing that defines commercial adoption of open source.
If we had a metric, for example more published CVEs, does that really tell you that you’re effectively growing?
OpenStack went “Big Tent” and everyone forgot about OpenStack. Kubernetes tried to build on that lesson, becoming opinionated about what they were and what they weren’t.
We need “an” opinion; it doesn’t necessarily have to be “the” opinion. But “an” opinion goes a long way.
Metrics need to be assigned some value, and we need to stand behind them for them to have value. Is there value for making a strong stand on “this is what a commit is” or is it better for another company to take what we’ve done and make those judgments and create that red/yellow/green dashboard. There’s possibly a separation of concerns here.
Weights are what the opinion is. I specify what’s important to me. Conceivably could create a tool that creates an opinion. (C is the most important, then B, then A.) We produce metrics, and individuals can generate their own meaning. Can share weighting and make this transparent to others: are you more interested in commits vs. code vs..
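A toy sketch of that shareable-weights idea; the metric values and weights below are invented, and only the mechanics matter (normalize, weight, sum, and publish the weights so the opinion is transparent):

```python
# Toy sketch of shareable, transparent weights over a set of metrics.
# All numbers are made up for illustration.

metrics = {   # values normalized to 0..1 by whatever definition the project shares
    'commits_per_week': 0.7,
    'issue_resolution_efficiency': 0.4,
    'new_contributors': 0.9,
}

weights = {   # "what matters to us" -- the opinion, published openly
    'commits_per_week': 1,
    'issue_resolution_efficiency': 3,
    'new_contributors': 2,
}

score = sum(metrics[m] * weights[m] for m in metrics) / sum(weights.values())
print(f'Weighted health score: {score:.2f}')   # 0..1
```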
Rather than us defining what healthy is, agree with making a toolkit that allows others to define what healthy is.
Rather than “50% of our projects are above/below this number,” seek agreement on what threshold separates red/yellow/green.
Sounds like we’re trying to define the scope of authority of the CHAOSS group. Is this an encyclopedia/dictionary? Or is it a coach?
- The goal is transparency. If you follow our definitions, everything will follow the same format and can be compared. Vs. every number being different and apples/oranges comparison.
- Second is just because you can capture a number doesn’t mean it’s useful. So let’s focus on those that are useful to people, not just those that are easy to produce.
One of the primary things CHAOSS could provide is guidance on how to ask the question, and which questions to ask.
- Ideally, this comes in via the use cases. “I want this and this, so use this and this metrics.”
It’s not just that people don’t know what health looks like, but that there are 7 different opinions within the same organization about what health looks like. Need alignment on that first. We can’t replace strategic planning/mission alignment.
Could turn this on its head: measure the metrics you can measure, collect as many as you think would be useful, and then when you get to answering the questions, look at the pool of data and see if you have what you need. When you’re strategizing, you don’t have all the answers. But if on day 1 you talk about the type of data you can capture, you can use that to inform the questions. Versus, if you’re only collecting a smaller subset, you may not have all of the right data.
In the same way, you might have a shared weighting: “This is CHAOSS, here are the things we value; we’re very biased towards in-person hacking events.” Vs. “We’re a web framework, we don’t care where you are.” Vs. “We’re so small, we can’t even afford to have events.” Your reluctance to take a position on what is healthy is understandable, but we all desperately want someone to come forward with “Out of all 100K open source projects, here are the ones that fall above/below X” across all projects. That’s not an opinion, that’s descriptive. And it would be very valuable, so don’t shy away from it.
One use case for CHAOSS is what are these “collections” of metrics, and what are the thresholds you might look at?
Look at LEED certification, which is an interesting analogy. They talk about all the points in the system. If you meet 80% you’re platinum, 70% you’re gold… they make the system and do the certification. You need to convince them based on your metrics. So you want to define the points without giving the certification, but maybe you want to do both.
— sharing collections/weights?
Trying to figure out how far we go in making recommendations. We’re not set on anything; trying to figure that out ourselves. Work Groups are trying to formalize a set of metrics that can help address the problems around Growth/Maturity/Decline and Diversity/Inclusion, these are valuable based on what we’re hearing in the field, and these ones aren’t. We don’t want to make this up as we go, we want to talk to you and have you tell us. Otherwise we have to guess, and we’ll guess badly. :) Can we give a broad idea on these policies? Yeah, can be done. But again, policies, we’d need to hear from you. What levels are you setting that are important to you? Why are those levels important? If we guess, it’ll be wrong. Then I become less concerned about staying agnostic, because if we have data from people about “here’s how these things are applied” in practice, that’s great, and something we can return to.
If sub-working groups define metrics and give a score based on them, say this is a “community-defined” metric, based on the entire open source community’s opinion.
I’m interested in how metrics can be implemented; there’s no use in metrics that can’t be implemented. We can take metrics and tell you whether they are implementable. That gives us a set of metrics that are not only good enough, but can also be computed. It validates the metrics.
There are things out there similar, but not necessarily trustworthy.
From project perspective (Bitergia) http://prospector.bitergia.net/project/
We gather different metrics and have an overall bar (red/yellow/green) and allow admins to change weights to adjust what’s valued over others; this impacts the color of the bar.
Colour coding helps at first glance: “this is red, what does that mean?” And clicking in, discovering “oh, I don’t care about that.”
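A minimal sketch of mapping such a weighted score onto a red/yellow/green bar; the thresholds here are arbitrary placeholders, not CHAOSS or Bitergia values:

```python
# Sketch: map a 0..1 weighted score onto a red/yellow/green bar.
# Thresholds are arbitrary placeholders; each community would pick its own.

def health_color(score, yellow_at=0.4, green_at=0.7):
    if score >= green_at:
        return 'green'
    if score >= yellow_at:
        return 'yellow'
    return 'red'

print(health_color(0.62))   # -> 'yellow'
```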
CHAOSS project is handling the metrics underneath. Which metrics are relevant? How are those metrics defined?
Could have people contributing their own answers to “what is growth/maturity/decline?” But even # of commits can mean something different in different projects. Look at “commits per line of code” or something.
We have over 50 projects at Conservancy. The number of committers is smaller in some, but commits per person are much higher. Some are academia-based, with lots of commits during the school year that then drop off sharply during summer. Is this unhealthy in a highly academic project? No. So it’s not necessarily about coming up with “the answer,” but giving people guidance on asking the right questions. “Are there reasons that you have a tight group of committers? Are there reasons that you have contribution drop-off?”
This is why weights are risky. Numbers don’t know the story.
One metric: how much time goes into supporting a community relative to how large it gets. If we open source this thing, how much time am I going to spend triaging issues, etc.? “Time sunk.” E.g., once a community gets over 100, the support needed triples. Maybe you want to stay small.
We talked about the difficulty of providing weights that are too general / too specific. Is there any way to aggregate weights in general, so an academic project can compare itself to other academic projects, vs. having one authority?
Or, a couple of “case studies” … here’s how we set our set of weights because we’re this kind of project. People can see themselves “We’re similar, but a smaller school.” Or “We’re similar, but based in X location.”
One way to do that is to publish everyone’s weights and tag them: “local team” vs. “distributed team”; “academic team” vs. “distributed team.” You’re not the authority on the “right” standard, but you’re giving people tools to discover standards within their industry.
In the same way you give guidance on a 401(k): “you’re young and your tolerance for risk is higher, try this,” or “you’re older and want to reduce risk.”