Skip to content

Instantly share code, notes, and snippets.

@jlewi
Created March 23, 2023 00:18
Show Gist options
  • Save jlewi/2e5897d28c7165e2f0299e03e065f0a9 to your computer and use it in GitHub Desktop.
Save jlewi/2e5897d28c7165e2f0299e03e065f0a9 to your computer and use it in GitHub Desktop.
example conversation with ChatGPT to write an sql query.
{
"title": "Is there a public bigquery dataset for GitHub?",
"creationTime": "2023-03-22T15:32:30.456",
"messages": [
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "Is there a public bigquery dataset for GitHub?"
}
],
"creationTime": "2023-03-22T15:32:43.508"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "Yes, there is a public BigQuery dataset for GitHub. You can find more information about it in the Google Cloud documentation here: https://cloud.google.com/blog/products/gcp/google-bigquery-public-datasets-now-include-github-archive"
}
],
"creationTime": "2023-03-22T15:32:46.469"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "Suppose you have a bigquery table with a column named language of type repeated record which has a field name. Write an SQL query to count the number of rows for each named language?"
}
],
"creationTime": "2023-03-22T15:44:22.796"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "You can write the following SQL query to count the number of rows for each named language:"
},
{
"type": "code",
"lang": "python",
"source": "\nSELECT language.name, COUNT(*) as count\nFROM `project.dataset.table`, UNNEST(language) as language\nGROUP BY language.name",
"index": 0
},
{
"type": "text",
"text": "\nThis query uses `UNNEST` to separate each element of the `language` array column, and then groups the rows by each language's name field, with `COUNT` used to count the number of rows for each language. \n\nDocument: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays"
}
],
"creationTime": "2023-03-22T15:44:29.225"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "Write an SQL query to get the number of unique PR creators from the github archive?"
}
],
"creationTime": "2023-03-22T16:53:02.452"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "Yes, here's an SQL that does this!"
},
{
"type": "code",
"lang": "SQL",
"source": "\nSELECT COUNT(DISTINCT actor.id) AS num_unique_contributors\nFROM `bigquery-public-data.github_repos.commits`\nWHERE message LIKE 'Merge pull request%'",
"index": 1
},
{
"type": "text",
"text": "\n\nThis query selects from the `commits` table in the `github_repos` dataset in BigQuery. We filter the results to only include merge commits that were generated by the GitHub UI when someone merged a pull request. Then we count the number of distinct actors (i.e. people) who created these commits, which gives us the number of unique contributors who merged pull requests.\n\nNote that this query assumes that all pull requests were merged using the GitHub UI (i.e. by clicking the \"Merge pull request\" button). If pull requests were merged in some other way (such as by merging the branch locally and then pushing the result to GitHub), those merges won't be counted. \n\nReference: https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-request-merges"
}
],
"creationTime": "2023-03-22T16:53:14.267"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "Here is the schema for a bigquery table containing GitHub data; [{\"name\":\"type\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"public\",\"type\":\"BOOLEAN\",\"mode\":\"NULLABLE\"},{\"name\":\"payload\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"repo\",\"type\":\"RECORD\",\"mode\":\"NULLABLE\",\"fields\":[{\"name\":\"id\",\"type\":\"INTEGER\",\"mode\":\"NULLABLE\"},{\"name\":\"name\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"url\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"}]},{\"name\":\"actor\",\"type\":\"RECORD\",\"mode\":\"NULLABLE\",\"fields\":[{\"name\":\"id\",\"type\":\"INTEGER\",\"mode\":\"NULLABLE\"},{\"name\":\"login\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"gravatar_id\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"avatar_url\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"url\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"}]},{\"name\":\"org\",\"type\":\"RECORD\",\"mode\":\"NULLABLE\",\"fields\":[{\"name\":\"id\",\"type\":\"INTEGER\",\"mode\":\"NULLABLE\"},{\"name\":\"login\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"gravatar_id\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"avatar_url\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"url\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"}]},{\"name\":\"created_at\",\"type\":\"TIMESTAMP\",\"mode\":\"NULLABLE\"},{\"name\":\"id\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"},{\"name\":\"other\",\"type\":\"STRING\",\"mode\":\"NULLABLE\"}]. Write an SQL query to compute the number of unique PR creators"
}
],
"creationTime": "2023-03-22T16:58:17.770"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "Here is an SQL query that computes the number of unique PR creators based on the schema for the bigquery table containing GitHub data:"
},
{
"type": "code",
"lang": "SQL",
"source": "\nSELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators\nFROM `your_project_name.your_dataset_name.your_table_name`\nWHERE type = 'PullRequestEvent' AND payload.pull_request.merged = true",
"index": 2
},
{
"type": "text",
"text": "\n\nThis query counts the number of distinct login names from the `actor` field in the table where the `type` field is equal to 'PullRequestEvent' and the `merged` field within the `payload.pull_request` object is true. This should give you the number of unique PR creators. \n\nNote that this query assumes that all pull requests were merged using the GitHub UI (i.e. by clicking the \"Merge pull request\" button). If pull requests were merged in some other way (such as by merging the branch locally and then pushing the result to GitHub), those merges won't be counted. \n\nReference: https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-request-merges"
}
],
"creationTime": "2023-03-22T16:58:27.054"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "The last SQL query is incorrect. It doesn't account for the fact that payload is a serialized JSON record. Fix the query to properly account for that."
}
],
"creationTime": "2023-03-22T17:02:34.116"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "Sure! Here's an updated SQL query that accounts for the JSON structure of the `payload` field:"
},
{
"type": "code",
"lang": "SQL",
"source": "\nSELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators\nFROM `your_project_name.your_dataset_name.your_table_name`,\nUNNEST(JSON_EXTRACT_ARRAY(payload, '$.pull_request')) PR\nWHERE type = 'PullRequestEvent' AND\nPR.merged = true",
"index": 3
},
{
"type": "text",
"text": "\n\nThis query uses `JSON_EXTRACT_ARRAY` with the `$` operator to extract the `pull_request` object from the `payload` field, and then we `UNNEST` this object to be able to access its fields as regular columns in the query. By doing this, we can access the `merged` field directly without having to specify the nested structure.\n\nNote that this query still assumes that all pull requests were merged using the GitHub UI (i.e. by clicking the \"Merge pull request\" button). If pull requests were merged in some other way (such as by merging the branch locally and then pushing the result to GitHub), those merges won't be counted.\n\nReference: https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions"
}
],
"creationTime": "2023-03-22T17:02:44.359"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "Edit the last query to include the actual table name which is githubarchive.month.202302"
}
],
"creationTime": "2023-03-22T17:04:13.278"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "Sure, here's the updated SQL query that includes the actual table name `githubarchive.month.202302`:\n```SQL\nSELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators\nFROM `githubarchive.month.202302`,\nUNNEST(JSON_EXTRACT_ARRAY(payload, '$.pull_request')) PR\nWHERE type = 'PullRequestEvent' AND\nPR.merged = true\n```\n\nThis query is similar to the previous query, but with the actual table name specified as `githubarchive.month.202302`. It also uses `JSON_EXTRACT_ARRAY` and `UNNEST` to access the `pull_request` object from the `payload` field and then count the number of unique PR creators. As before, this query assumes that all pull requests were merged using the GitHub UI."
}
],
"creationTime": "2023-03-22T17:04:19.797"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "Trying to run this query in BigQuery gives the error `Cannot access field merged on a value with type STRING at [5:4]` can you correct the query?"
}
],
"creationTime": "2023-03-22T17:05:24.343"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "Yes, you can modify the query to correctly access the JSON fields by using the `SAFE_CAST` function to convert the JSON fields to the correct data type:\n```SQL\nSELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators\nFROM `githubarchive.month.202302`,\nUNNEST(JSON_EXTRACT_ARRAY(SAFE_CAST(payload AS JSON), '$.pull_request')) PR\nWHERE type = 'PullRequestEvent' AND\nSAFE_CAST(PR.merged AS BOOLEAN) = true\n```\n\nThis query uses `SAFE_CAST` to convert the `payload` field to a JSON data type, and then `JSON_EXTRACT_ARRAY` and `UNNEST` to access the `pull_request` object from the `payload` field. The `SAFE_CAST` function is also used to convert the `merged` field to a boolean data type before comparing it to `true`.\n\nNote that this query still assumes that all pull requests were merged using the GitHub UI (i.e. by clicking the \"Merge pull request\" button). If pull requests were merged in some other way (such as by merging the branch locally and then pushing the result to GitHub), those merges won't be counted."
}
],
"creationTime": "2023-03-22T17:06:01.929"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "This query doesn't properly handle the payload field. In BigQuery if you want to extract a value from a JSON string you need to use the function JSON_EXTRACT which \"Extracts a JSON value, such as an array or object, or a JSON scalar value, such as a string, number, or boolean.\" Please fix the query to correctly extract fields from payload using json_extract."
}
],
"creationTime": "2023-03-22T17:09:15.563"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "You're absolutely right. Here is an updated SQL query that properly handles the JSON structure of the `payload` field using the `JSON_EXTRACT` function:\n```SQL\nSELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators\nFROM `githubarchive.month.202302`\nWHERE type = 'PullRequestEvent' AND\nJSON_EXTRACT(payload, '$.pull_request.merged') = 'true'\n```\n\nThis query simply uses `JSON_EXTRACT` to extract the `merged` field from the `pull_request` object in the `payload` field, and checks whether its value is equal to the string `'true'`. The query still assumes that all pull requests were merged using the GitHub UI."
}
],
"creationTime": "2023-03-22T17:09:22.829"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "Now edit the query to only consider PRs in the github organization Kubeflow."
}
],
"creationTime": "2023-03-22T17:11:03.163"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "To filter the results to only include pull requests from the Kubeflow organization, you can add an additional `WHERE` clause that checks whether the repository name starts with `'kubeflow/'`:\n\n```SQL\nSELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators\nFROM `githubarchive.month.202302`\nWHERE type = 'PullRequestEvent'\n AND JSON_EXTRACT(payload, '$.pull_request.merged') = 'true'\n AND JSON_EXTRACT(payload, '$.repository.name') LIKE 'kubeflow/%'\n```\n\nThis query checks both the `type` field and the `merged` field in the same way as the previous query, but adds an additional check for the `repository.name` field using `JSON_EXTRACT` and the `LIKE` operator to match repository names from the Kubeflow organization. Note that this assumes that all repositories in the Kubeflow organization start with the prefix `'kubeflow/'`. If this is not the case, you may need to adjust the filter accordingly.\n\nReference: https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions"
}
],
"creationTime": "2023-03-22T17:11:12.389"
},
{
"fromID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"toID": "1",
"blocks": [
{
"type": "text",
"text": "That doesn't seem right. I think the field org.login contains the name of the GitHub organization. Can we use that instead to restrict the query to the Kubeflow organization?"
}
],
"creationTime": "2023-03-22T17:17:40.566"
},
{
"fromID": "1",
"toID": "0lShIJyQbEYatxFxW6jaicnzwDb2",
"blocks": [
{
"type": "text",
"text": "You're right! The `org.login` field contains the name of the GitHub organization, so we can use that to filter the query to only include pull requests from the Kubeflow organization.\n\nHere is the updated SQL query:\n```SQL\nSELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators\nFROM `githubarchive.month.202302`\nWHERE type = 'PullRequestEvent'\n AND JSON_EXTRACT(payload, '$.pull_request.merged') = 'true'\n AND JSON_EXTRACT(payload, '$.org.login') = 'kubeflow'\n```\n\nThis query adds an additional `WHERE` clause that checks whether the value of the `login` field inside the `org` object is equal to `'kubeflow'`. This will filter the results to only include pull requests from the Kubeflow organization.\n\nReference: https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions"
}
],
"creationTime": "2023-03-22T17:17:48.886"
}
]
}
@jlewi
Copy link
Author

jlewi commented Mar 23, 2023

It does a decent job although it requires a lot of coaching. The final isn't correct. The problem is its treating org.login as a JSON string but its not.

Here's a correct query.

SELECT COUNT(DISTINCT actor.login) AS num_unique_PR_creators
FROM `githubarchive.month.202302`
WHERE type = 'PullRequestEvent'
  AND JSON_EXTRACT(payload, '$.pull_request.merged') = 'true'
  AND org.login = 'kubeflow'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment