Get the metadata and content of all files in a selected GitHub repo, using GraphQL
You might want to get a tree summary of the files in a repo without downloading the repo, or maybe you want to look up the contents of a file again without downloading the whole repo.
The approach here is to query data from GitHub using the GitHub V4 GraphQL API.
In the sample GQL file in this gist, I included some useful attributes about files in a GitHub repo. The query can be modified to work with any repo you have read access to.
- `name`
  - File or path name.
- `mode`
  - Usually `16384` (a directory, octal `040000`) or `33188` (a regular file, octal `100644`).
- `type`
  - Usually `blob` for text or binary files, or `tree` for a directory path.
- `text`
  - This is the content of your file. For larger files, this field will of course make your JSON response very long.
  - From the schema: "UTF8 text data or null if the Blob is binary".
  - Includes `\n` for line breaks in text. Note that your code might have `"\n"` in strings too.
- `isBinary`
  - Useful if you want to separate file types, or to avoid trying to count lines in a binary.
  - Binary files might be images or compiled files.
- I could not find a summary field for the number of files or the number of lines, so you have to work that out yourself.
- Regarding the `expression` value for `object`
  - See `expression` or `GitObject` in the Object reference docs: "A Git revision expression suitable for rev-parse".
  - Choose a commit reference and add a colon, e.g. `"HEAD:"`. You can use `master` or a commit ID instead.
  - You will only get objects at the repo root though, unless you use a nested query or choose a path, e.g. `"master: docs/"`.
  - You can also use a nested query to get multiple levels down, as in the second GQL file below. But I can't see a way to nest this recursively, and a Fragment doesn't let you nest it in itself.
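Since the query cannot recurse on its own, one workaround is to recurse on the client side: run one query per directory and build each `expression` from the parent path. Below is a minimal Python sketch of that idea, not something taken from the gist files. The names `make_expression`, `walk` and `fetch_entries` are hypothetical. `fetch_entries` stands for whatever function you write to call the API and return the `entries` list for a given expression; here it is faked with a dictionary so the sketch runs offline. The sketch emits the plain `ref:path` form that `git rev-parse` accepts.

```python
def make_expression(ref="HEAD", path=""):
    """Build a rev-parse style expression such as 'HEAD:' or 'master:docs'."""
    return f"{ref}:{path.strip('/')}"


def walk(fetch_entries, ref="HEAD", path=""):
    """Yield (full_path, entry) for everything below path.

    fetch_entries is a hypothetical callable: it takes an expression
    string and returns the list of tree entries for that directory.
    One call is made per directory, so deep repos mean many requests.
    """
    for entry in fetch_entries(make_expression(ref, path)):
        full_path = f"{path}/{entry['name']}".lstrip("/")
        yield full_path, entry
        if entry["type"] == "tree":
            # Descend into the sub-directory with a new expression.
            yield from walk(fetch_entries, ref, full_path)


# Made-up data standing in for API responses, keyed by expression.
fake_repo = {
    "HEAD:": [
        {"name": "README.md", "type": "blob"},
        {"name": "docs", "type": "tree"},
    ],
    "HEAD:docs": [
        {"name": "guide.md", "type": "blob"},
        {"name": "img", "type": "tree"},
    ],
    "HEAD:docs/img": [
        {"name": "logo.png", "type": "blob"},
    ],
}

paths = [full_path for full_path, _ in walk(fake_repo.__getitem__)]
print(paths)
```

The trade-off is one request per directory, so for a large repo you might prefer the nested query for the first few levels and only fall back to this for deeper paths.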
Try the query out in the explorer.
- Go to the explorer and sign in - V4 explorer
- Paste the GQL query from `get_github_files.gql` into the main pane.
- Paste the sample JSON from `sample_params.json` into the query variables pane.
- Press the play/arrow button to run.
Use `curl`, or a library in Python, Ruby, etc.
Here is a generic example from the GitHub docs. As written it will fail, because `token` is a placeholder rather than a real auth token. You must generate and pass an auth token for GraphQL, whereas the REST API lets you make requests without one (within limits).
$ curl -H "Authorization: bearer token" -X POST -d " \
{ \
\"query\": \"query { viewer { login }}\" \
} \
" https://api.github.com/graphql
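For the "library in Python" route, here is a rough equivalent of the curl call using only the standard library. Treat it as a sketch under assumptions: the query text is my reconstruction from the fields described above, not a copy of `get_github_files.gql`, and the variable names `owner`, `name` and `expression` as well as the helpers `build_request` and `run_query` are illustrative.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Assumed query shape; variable names are illustrative.
QUERY = """
query GetFiles($owner: String!, $name: String!, $expression: String!) {
  repository(owner: $owner, name: $name) {
    object(expression: $expression) {
      ... on Tree {
        entries {
          name
          type
          mode
          object {
            ... on Blob {
              byteSize
              isBinary
              text
            }
          }
        }
      }
    }
  }
}
"""


def build_request(token, variables, query=QUERY):
    """Build (but do not send) a POST request for the GraphQL endpoint."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_query(token, variables):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(token, variables)) as response:
        return json.load(response)
```

With a real token you would call something like `run_query(token, {"owner": "MichaelCurrin", "name": "learn-to-code", "expression": "HEAD:"})`. Keeping `build_request` separate makes it possible to inspect the headers and body without touching the network.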
After executing `get_github_files.gql`.
Simplified JSON output
{
  "entries": [
    {
      "name": ".gitignore",
      "type": "blob",
      "object": {
        "byteSize": 32,
        "text": "node_modules/\npackage-lock.json\n"
      }
    },
    {
      "name": ".vscode",
      "type": "tree",
      "object": {}
    },
    {
      "name": "CONTRIBUTING.md",
      "type": "blob",
      "object": {
        "byteSize": 1520,
        "text": "..."
      }
    }
  ]
}
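Because the API gives no file or line totals, they have to be computed from the entries in the response. Here is a minimal Python sketch, assuming the simplified shape above plus the `isBinary` field. `summarize_entries` is a hypothetical helper name, and the `logo.png` entry is made up to show the binary case.

```python
def summarize_entries(entries):
    """Count directories, files, binary files and text lines in tree entries."""
    summary = {"files": 0, "dirs": 0, "binary": 0, "lines": 0}
    for entry in entries:
        if entry["type"] == "tree":
            summary["dirs"] += 1
            continue
        if entry["type"] != "blob":
            # Skip anything that is neither a file nor a directory.
            continue
        summary["files"] += 1
        blob = entry.get("object") or {}
        text = blob.get("text")
        if blob.get("isBinary") or text is None:
            # The schema returns null text for binary blobs.
            summary["binary"] += 1
            continue
        summary["lines"] += len(text.splitlines())
    return summary


entries = [
    {
        "name": ".gitignore",
        "type": "blob",
        "object": {
            "byteSize": 32,
            "isBinary": False,
            "text": "node_modules/\npackage-lock.json\n",
        },
    },
    {"name": ".vscode", "type": "tree", "object": {}},
    # Made-up binary entry for illustration.
    {
        "name": "logo.png",
        "type": "blob",
        "object": {"byteSize": 2048, "isBinary": True, "text": None},
    },
]

print(summarize_entries(entries))
# {'files': 2, 'dirs': 1, 'binary': 1, 'lines': 2}
```

Note this only counts the entries in one response, so directories are counted but not descended into.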
- Thanks to this gist for getting me going with using the `Tree` and `Blob` structure.
- Intro to GraphQL
- GitHub
  - GitHub V4 GraphQL docs.
  - Forming calls
  - Target URL for query: api.github.com/graphql
- Projects
  - GraphQL guide in `MichaelCurrin/learn-to-code` repo.
  - MichaelCurrin/github-graphql-tool - A Python-based project which reports on stats around repos of a GitHub user or org, using GraphQL.