yozlet/example-graphql-README.md

## example-graphql-README.md

      
Here's an example of the kind of GraphQL query that Code.gov might
submit to the GitHub GraphQL endpoint. It also explains why the switch to using
GraphQL (in GitHub API version 4) makes things a ton
easier than all the REST calls we were doing before (in GitHub API version 3).
### Try this query out right now!

Paste the non-annotated query (nocomments-example.graphql, further down)
into the GitHub GraphQL Explorer at https://developer.github.com/v4/explorer/
to run it.
### What is this query?

This is a query I cooked up when first exploring GraphQL while doing
some work on Code.gov.
What this query is for: Get specific metadata for several thousand
repositories, across 30+ US Government organizations, without
hitting GitHub API query limits.
### Why was this a problem in GitHub API version 3?

Previously, trying to do this using v3 of the GitHub API involved
several thousand REST HTTP calls. One call would get us a list of
100 repos. We'd then need 100 * X more HTTP calls, where X is the
number of different kinds of metadata we need for each repo: one
call for the repo languages, one call for the open pull requests,
etc. We then needed code to glue it all together into a document
to feed into our Elasticsearch index. (Admittedly, if we
were storing this data relationally, this splitting of metadata
would make much more sense.)
Not only was it all deeply inefficient, but we butted up against
GitHub's API request limits.
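
To make the scale concrete, here is a rough sketch of the arithmetic described above. The function name and the page size default are illustrative assumptions; the 3000-repo and two-kinds-of-metadata figures are just the numbers mentioned elsewhere in this write-up, not measured values.

```python
import math


def rest_call_count(total_repos: int, metadata_kinds: int, page_size: int = 100) -> int:
    """Estimate the HTTP calls needed with the v3 (REST) approach.

    One call per page of repos, plus one call per repo for each kind
    of metadata (languages, open pull requests, and so on).
    """
    list_calls = math.ceil(total_repos / page_size)
    detail_calls = total_repos * metadata_kinds
    return list_calls + detail_calls


if __name__ == "__main__":
    # Up to 3000 repos, two kinds of metadata per repo.
    print(rest_call_count(3000, 2))
```

With 3000 repos and two kinds of metadata, that is 30 listing calls plus 6000 detail calls.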
### Why is this much easier in GitHub API version 4?

In version 4, the query method has changed from "make lots of little
REST API requests" to "make fewer, bigger GraphQL requests".
Not only were we making many more REST API requests before, but they
were of many different types, because we needed many different types
of data. Now we can specify all the different types of data we need
in one query. (We still need to run that query multiple times, because
GitHub only lets us fetch 100 repos at a time.)
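
The "run that query multiple times" part can be sketched roughly like this. It is not the code Code.gov runs: it pages through a single organization using the `after` cursor and `pageInfo` fields, and `run_query` is a hypothetical callable you would supply to POST the query and variables to the GraphQL endpoint and return the decoded JSON.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Pages through one organization's public repos, 100 at a time.
REPO_PAGE_QUERY = """
query($orgId: ID!, $cursor: String) {
  node(id: $orgId) {
    ... on Organization {
      repositories(first: 100, privacy: PUBLIC, after: $cursor) {
        nodes { name url }
        pageInfo { hasNextPage endCursor }
      }
    }
  }
}
"""

# Hypothetical transport: takes (query, variables), returns decoded JSON.
RunQuery = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def iter_org_repos(run_query: RunQuery, org_id: str) -> Iterator[Dict[str, Any]]:
    """Yield every repo for one org, following cursors until the last page."""
    cursor: Optional[str] = None
    while True:
        result = run_query(REPO_PAGE_QUERY, {"orgId": org_id, "cursor": cursor})
        page = result["data"]["node"]["repositories"]
        yield from page["nodes"]
        if not page["pageInfo"]["hasNextPage"]:
            break
        cursor = page["pageInfo"]["endCursor"]
```

Keeping the transport as a parameter means the paging logic can be tried out against canned responses before spending any real API budget.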
Here are the GitHub GraphQL API (v4) docs:
https://developer.github.com/v4/
You can try this query out in the GitHub GraphQL Explorer:
https://developer.github.com/v4/explorer/
This query provides a list of organization IDs, then asks for a list
of repositories for each organization, with metadata about each.
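
The list of organization IDs does not have to be hardcoded. A minimal sketch, assuming the `nodes` field takes a list of type `[ID!]!` (which is how I understand GitHub's schema): declare a GraphQL variable and send the IDs in the `variables` part of the request body. The function name is made up for illustration, and the query is trimmed down to the organization name.

```python
import json
from typing import List

# Same shape as the query in this gist, but the IDs arrive as a variable.
ORGS_QUERY = """
query($ids: [ID!]!) {
  nodes(ids: $ids) {
    id
    ... on Organization { name }
  }
}
"""


def build_request_body(org_ids: List[str]) -> str:
    """Build the JSON body for a GraphQL POST: query text plus variables."""
    return json.dumps({"query": ORGS_QUERY, "variables": {"ids": org_ids}})


if __name__ == "__main__":
    # The two organization IDs used in the example query.
    print(build_request_body([
        "MDEyOk9yZ2FuaXphdGlvbjYyMzM5OTQ=",
        "MDEyOk9yZ2FuaXphdGlvbjY0MzA3MA==",
    ]))
```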
One of the things I like about GraphQL is that the structure of
the query describes the structure of the results object. However,
while the results come back as JSON, note that a GraphQL query is
not itself valid JSON (for example, commas are optional and `#` comments are allowed).
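
Because the result mirrors the query, walking it is just a matter of following the same nesting. Here is a small sketch; the `SAMPLE` response is invented data shaped like the query's result (the org and repo names are hypothetical), not real API output.

```python
from typing import Any, Dict, List, Tuple


def repo_languages(response: Dict[str, Any]) -> List[Tuple[str, str, List[str]]]:
    """Flatten a response into (org name, repo name, languages) rows."""
    rows = []
    for org in response["data"]["nodes"]:
        if not org:
            # An ID that did not resolve can come back as null.
            continue
        for repo in org.get("repositories", {}).get("nodes", []):
            langs = [lang["name"] for lang in repo["languages"]["nodes"]]
            rows.append((org["name"], repo["name"], langs))
    return rows


# Hypothetical response: same nesting as the query, made-up values.
SAMPLE = {
    "data": {
        "nodes": [
            {
                "id": "hypothetical-org-id",
                "name": "Example Org",
                "repositories": {
                    "nodes": [
                        {
                            "name": "example-repo",
                            "languages": {"nodes": [{"name": "Python"}, {"name": "HTML"}]},
                        }
                    ]
                },
            }
        ]
    }
}

if __name__ == "__main__":
    for row in repo_languages(SAMPLE):
        print(row)
```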
I'll put more things to note inline.

  
## example.graphql
{

  # Here's the list of organization IDs.
  # I don't know if there's a limit to these lists.
  # Any time you see a plural ("nodes") it *usually* means we're
  # about to loop over a result set. This is convention rather than
  # syntax: "nodes" and "edges" are not GraphQL reserved words; they
  # come from the Relay-style connection pattern GitHub's schema follows.
  # For the moment, I've only put two organization IDs in here.
  nodes(ids: ["MDEyOk9yZ2FuaXphdGlvbjYyMzM5OTQ=", "MDEyOk9yZ2FuaXphdGlvbjY0MzA3MA=="]) {

    # Each ID resolves to a generic node. The inline fragment below
    # ("... on Organization") selects fields that only exist on organizations.
    id
    ... on Organization {
      name # The organization name

      # Further nesting: Repos.
      # Only asking for 100 because that's a GitHub-set limit.
      # When you want to filter the set of objects returned, it
      # seems to be done in this kind of way - as arguments to
      # the node list. So here, we're filtering for public repos.
      repositories(first: 100, privacy: PUBLIC) {
        nodes {

          # Metadata about the repos. I'm only fetching a few values
          # but I could, in theory, go wild with sub-queries about
          # contributors, pull requests, etc.
          # This is ALL coming back in this ONE query.
          name
          createdAt
          url
          homepageUrl
          languages(first:5) {
            nodes {
              name
            }
          }
          pullRequests(first:5, states:[OPEN]) {
            nodes {
              author {
                login
              }
              title
            }
          }
        }
      }
    }
  }

  # All the useful repo data is up above.
  # The following section is cost data about the query itself.
  # The GitHub GraphQL API gives API users a limited hourly budget,
  # in "points", to stop the system being overloaded. Every time
  # you execute a query, it subtracts the cost from your remaining
  # budget for the hour.

  # If you're experimenting with big queries, I recommend adding
  # this block to the *beginning* of your query, so you can keep
  # an eye on things. Remember that even a low query cost adds up
  # when you're running a few hundred of these an hour.
  # For more information: https://developer.github.com/v4/guides/resource-limitations/

  rateLimit {
    limit     # Your maximum budget. Your budget is reset to this every hour.
    cost      # The cost of this query.
    remaining # How much of your API budget remains.
    resetAt   # The time (a UTC ISO 8601 timestamp) when your budget will reset.
  }
}

# If it weren't for the 100-node-per-list limit, we'd be able to get
# *all* the repos for each organization with this query. Instead,
# we need to do a series of queries which page through the repos,
# a hundred at a time. However, note that we're actually getting 200
# repos back with this query - 100 per org. So the current plan is
# to provide the org IDs for each gov org in GitHub (30 or so) and
# list them all in the query, so we get up to 3000 repos back.
#
# So it's kind of horizontally-slicing across the org structure.
#
# (We've had some conversations with GitHub folks and they say that
# this kind of query behaviour is fine, as long as we keep an eye
# on our query cost limits.)

# Future work:
#
# Would it be possible to provide the list of organization IDs in
# a GraphQL variable rather than hardcoded into the query?

## nocomments-example.graphql
{
  nodes(ids: ["MDEyOk9yZ2FuaXphdGlvbjYyMzM5OTQ=", "MDEyOk9yZ2FuaXphdGlvbjY0MzA3MA=="]) {
    id
    ... on Organization {
      name

      repositories(first: 100, privacy: PUBLIC) {
        nodes {
          name
          createdAt
          url
          homepageUrl
          languages(first:5) {
            nodes {
              name
            }
          }
          pullRequests(first:5, states:[OPEN]) {
            nodes {
              author {
                login
              }
              title
            }
          }
        }
      }
    }
  }

  rateLimit {
    limit
    cost
    remaining
    resetAt
  }
}