Github GraphQL query example

Here's an example of the kind of GraphQL query that Code.gov might submit to the GitHub GraphQL endpoint. It also explains why the switch to using GraphQL (in GitHub API version 4) makes things a ton easier than all the REST calls we were doing before (in GitHub API version 3).

Try this query out right now! Paste the non-annotated version of the query (the second copy at the end of this document) into the GitHub GraphQL Explorer.

What is this query?

This is a query I cooked up when first exploring GraphQL while doing some work on Code.gov.

What this query is for: Get specific metadata for several thousand repositories, across 30+ US Government organizations, without hitting GitHub API query limits.

Why was this a problem in GitHub API version 3?

Previously, trying to do this using v3 of the GitHub API involved several thousand REST HTTP calls. One call would get us a list of 100 repos. For each such list we'd then need 100 * X more HTTP calls, where X is the number of different kinds of metadata we need for each repo: one call for the repo languages, one call for the open pull requests, etc. And then we needed code to glue it all together into a document which we fed into our Elasticsearch index. (Admittedly, if we were storing this data relationally, this splitting of metadata would make much more sense.)
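To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. The function names and the example figures (3000 repos, 2 kinds of metadata, a largest org of 450 repos) are illustrative assumptions, not measured numbers from Code.gov.

```python
import math


def rest_call_count(total_repos: int, metadata_kinds: int, page_size: int = 100) -> int:
    """Estimate REST calls: one list call per page of repos,
    plus one call per repo for each kind of metadata."""
    list_calls = math.ceil(total_repos / page_size)
    detail_calls = total_repos * metadata_kinds
    return list_calls + detail_calls


def graphql_call_count(repos_in_largest_org: int, page_size: int = 100) -> int:
    """Estimate GraphQL calls when one query covers every org at once:
    we only need as many pages as the largest org requires."""
    return math.ceil(repos_in_largest_org / page_size)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(rest_call_count(total_repos=3000, metadata_kinds=2))
    print(graphql_call_count(repos_in_largest_org=450))
```

The GraphQL estimate assumes every organization is listed in the same query, so the number of requests is driven by the biggest organization rather than by the total number of repos.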

Not only was it all deeply inefficient, but we butted up against GitHub's API request limits.

Why is this much easier in GitHub API version 4?

In version 4, the query method has changed from "make lots of little REST API requests" to "make fewer, bigger GraphQL requests".

Not only were we making many more REST API requests before, but they were of many different types, because we needed many different types of data. Now we can specify all the different types of data we need in one query. (We still need to run that query multiple times, because GitHub only lets us fetch 100 repos at a time.)
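Paging through more than 100 repos could look roughly like the sketch below. It pages a single organization using the connection's `pageInfo` cursor. The names `REPOS_PAGE_QUERY`, `iter_public_repos` and `run_query` are hypothetical; `run_query` stands in for whatever function sends the query and returns the `data` part of the response, and is not something from the original work.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Query for one page of an organization's public repos.
# $cursor is null for the first page, then the previous page's endCursor.
REPOS_PAGE_QUERY = """
query ($orgId: ID!, $cursor: String) {
  node(id: $orgId) {
    ... on Organization {
      repositories(first: 100, privacy: PUBLIC, after: $cursor) {
        nodes { name url }
        pageInfo { hasNextPage endCursor }
      }
    }
  }
}
"""

# Hypothetical transport: takes (query, variables), returns the "data" dict.
RunQuery = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def iter_public_repos(org_id: str, run_query: RunQuery) -> Iterator[Dict[str, Any]]:
    """Yield every public repo of one organization, one page at a time."""
    cursor: Optional[str] = None
    while True:
        data = run_query(REPOS_PAGE_QUERY, {"orgId": org_id, "cursor": cursor})
        connection = data["node"]["repositories"]
        yield from connection["nodes"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            return
        cursor = page_info["endCursor"]
```

Passing the transport in as a function keeps the paging logic separate from HTTP and authentication, which also makes it easy to exercise with canned pages.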

Here are the GitHub GraphQL API (v4) docs: https://developer.github.com/v4/

You can try this query out in the GitHub GraphQL Explorer: https://developer.github.com/v4/explorer/

This query provides a list of organization IDs, then asks for a list of repositories for each organization, with metadata about each.

One of the things I like about GraphQL is that the structure of the query describes the structure of the results object. However, while the results come back as JSON, note that the query itself is not valid JSON (e.g. fields are not separated by commas, keys are not quoted, and comments are allowed). I'll put more things to note inline.
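Because the response mirrors the query, turning it into flat documents (for a search index, say) is mostly a matter of walking the same nesting. This is a minimal sketch, not the actual Code.gov code: `repo_documents` and the output field names are hypothetical, and the sample response contains made-up placeholder values.

```python
from typing import Any, Dict, List


def repo_documents(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the "data" part of the response into one dict per repo.

    The loops follow the query's shape:
    nodes -> repositories.nodes -> languages.nodes / pullRequests.nodes
    """
    docs = []
    for org in data["nodes"]:
        # An entry can be null, or lack "repositories" if the ID was
        # not an Organization (the inline fragment would not match).
        if not org or "repositories" not in org:
            continue
        for repo in org["repositories"]["nodes"]:
            docs.append({
                "organization": org["name"],
                "name": repo["name"],
                "url": repo["url"],
                "languages": [lang["name"] for lang in repo["languages"]["nodes"]],
                "open_pull_requests": [pr["title"] for pr in repo["pullRequests"]["nodes"]],
            })
    return docs


# Made-up placeholder response, shaped like the query in this document.
SAMPLE_DATA = {
    "nodes": [{
        "id": "PLACEHOLDER_ID",
        "name": "Example Org",
        "repositories": {"nodes": [{
            "name": "example-repo",
            "url": "https://example.invalid/example-repo",
            "languages": {"nodes": [{"name": "Python"}]},
            "pullRequests": {"nodes": [{"author": None, "title": "Example PR"}]},
        }]},
    }]
}

if __name__ == "__main__":
    for doc in repo_documents(SAMPLE_DATA):
        print(doc)
```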

{
  # Here's the list of organization IDs.
  # I don't know if there's a limit to these lists.
  # Any time you see a plural ("nodes") it *usually* means we're
  # about to loop over a result set. This is schema convention
  # rather than language syntax: "nodes" and "edges" are not GraphQL
  # reserved words, just the field names GitHub's schema uses for
  # lists (following the Relay connection pattern).
  # For the moment, I've only put two organization IDs in here.
  nodes(ids: ["MDEyOk9yZ2FuaXphdGlvbjYyMzM5OTQ=", "MDEyOk9yZ2FuaXphdGlvbjY0MzA3MA=="]) {
    # Join to a nested list of organization objects.
    # nodes() returns generic objects, so the "... on Organization"
    # fragment below picks out the fields that organizations have.
    id
    ... on Organization {
      name # The organization name
      # Further nesting: Repos.
      # Only asking for 100 because that's a GitHub-set limit.
      # When you want to filter the set of objects returned, it
      # seems to be done in this kind of way - as arguments to
      # the node list. So here, we're filtering for public repos.
      repositories(first: 100, privacy: PUBLIC) {
        nodes {
          # Metadata about the repos. I'm only fetching a few values
          # but I could, in theory, go wild with sub-queries about
          # contributors, pull requests, etc.
          # This is ALL coming back in this ONE query.
          name
          createdAt
          url
          homepageUrl
          languages(first: 5) {
            nodes {
              name
            }
          }
          pullRequests(first: 5, states: [OPEN]) {
            nodes {
              author {
                login
              }
              title
            }
          }
        }
      }
    }
  }
  # All the useful repo data is up above.
  # This following section is cost data about the query itself.
  # The GitHub GraphQL API gives API users a limited hourly budget,
  # in "points", to stop the system being overloaded. Every time
  # you execute a query, it subtracts the cost from your remaining
  # budget for the hour.
  # If you're experimenting with big queries, I recommend adding
  # this block to the *beginning* of your query, so you can keep
  # an eye on things. Remember that even a low query cost becomes a
  # problem when you're doing a few hundred of these an hour.
  # For more information: https://developer.github.com/v4/guides/resource-limitations/
  rateLimit {
    limit # Your maximum budget. Your budget is reset to this every hour.
    cost # The cost of this query.
    remaining # How much of your API budget remains.
    resetAt # The time when your budget will reset (a UTC timestamp in ISO 8601 format).
  }
}
# If it wasn't for the 100-node-per-list limit, we'd be able to get
# *all* the repos for each organization with this query. Instead,
# we need to do a series of queries which page through the repos,
# a hundred at a time. However, note that we're actually getting 200
# repos back with this query - 100 per org. So the current plan is
# to provide the org IDs for each gov org in GitHub (30 or so) and
# list them all in the query, so we get up to 3000 repos back.
#
# So it's kind of horizontally-slicing across the org structure.
#
# (We've had some conversations with GitHub folks and they say that
# this kind of query behaviour is fine, as long as we keep an eye
# on our query cost limits.)
# Future work:
#
# Would it be possible to provide the list of organization IDs in
# a GraphQL variable rather than hardcoded into the query?
{
  nodes(ids: ["MDEyOk9yZ2FuaXphdGlvbjYyMzM5OTQ=", "MDEyOk9yZ2FuaXphdGlvbjY0MzA3MA=="]) {
    id
    ... on Organization {
      name
      repositories(first: 100, privacy: PUBLIC) {
        nodes {
          name
          createdAt
          url
          homepageUrl
          languages(first: 5) {
            nodes {
              name
            }
          }
          pullRequests(first: 5, states: [OPEN]) {
            nodes {
              author {
                login
              }
              title
            }
          }
        }
      }
    }
  }
  rateLimit {
    limit
    cost
    remaining
    resetAt
  }
}
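On the "Future work" question in the annotated query's comments (whether the organization IDs could go in a GraphQL variable instead of being hardcoded): GraphQL does support list variables, so something like the sketch below should work. I haven't run this against GitHub as part of this write-up. The names `ORG_REPOS_QUERY`, `$orgIds` and `build_request_body` are hypothetical, and the query is trimmed to a few fields for brevity.

```python
import json
from typing import Iterable

# Same shape as the query above, but the ID list is a variable.
# [ID!]! means: a required list of non-null IDs.
ORG_REPOS_QUERY = """
query ($orgIds: [ID!]!) {
  nodes(ids: $orgIds) {
    id
    ... on Organization {
      name
      repositories(first: 100, privacy: PUBLIC) {
        nodes { name url }
      }
    }
  }
}
"""


def build_request_body(org_ids: Iterable[str]) -> str:
    """Build the JSON body for a GraphQL POST request.

    The query text stays constant; only the "variables" object changes
    as organizations are added or removed.
    """
    return json.dumps({
        "query": ORG_REPOS_QUERY,
        "variables": {"orgIds": list(org_ids)},
    })


if __name__ == "__main__":
    # The two organization IDs used in the queries above.
    body = build_request_body([
        "MDEyOk9yZ2FuaXphdGlvbjYyMzM5OTQ=",
        "MDEyOk9yZ2FuaXphdGlvbjY0MzA3MA==",
    ])
    print(body)
```

The resulting string would be sent as the body of an authenticated POST to the GraphQL endpoint. Keeping the IDs out of the query text means the list of 30 or so organizations can live in configuration rather than in the query itself.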