@j6k4m8
Last active April 13, 2021 20:53
Understand gender and race distributions of the papers you cite
{
"cells": [
{
"cell_type": "markdown",
"id": "contrary-visiting",
"metadata": {},
"source": [
"# Citation Diversity Statement Generator\n",
"\n",
"\n",
"Based upon [this work](https://github.com/dalejn/cleanBib).\n",
"\n",
"This notebook is standalone and does not require any additional postprocessing. Note that the models used here differ from those used in the work referenced below, and the two are not yet interchangeable.\n",
"\n",
"> [![DOI](https://zenodo.org/badge/232916183.svg)](https://zenodo.org/badge/latestdoi/232916183)\n",
"> \n",
"> Motivated from work by:\n",
"> \n",
"> * J. D. Dworkin, K. A. Linn, E. G. Teich, P. Zurn, R. T. Shinohara, and D. S. Bassett (2020). The extent and drivers of gender imbalance in neuroscience reference lists. *Nature Neuroscience*. [doi: https://doi.org/10.1038/s41593-020-0658-y](https://doi.org/10.1038/s41593-020-0658-y)\n",
"> \n",
"> * M.A. Bertolero, J.D. Dworkin, S.U. David, C. López Lloreda, P. Srivastava, J. Stiso, D. Zhou, K. Dzirasa, D.A. Fair, A.N. Kaczkurkin, B.J. Marlin, D. Shohamy, L.Q. Uddin, P. Zurn, D.S. Bassett (2020). Racial and ethnic imbalance in neuroscience reference lists and intersections with gender. *bioRxiv*. [doi: https://doi.org/10.1101/2020.10.12.336230](https://www.biorxiv.org/content/10.1101/2020.10.12.336230v1)\n",
"> \n",
"> See also these Perspectives with actionable recommendations moving forward for scientists at all levels: \n",
"> \n",
"> * J. D. Dworkin, P. Zurn, and D. S. Bassett (2020). (In)citing Action to Realize an Equitable Future. *Neuron*. [doi: https://doi.org/10.1016/j.neuron.2020.05.011](https://doi.org/10.1016/j.neuron.2020.05.011)\n",
"> * P. Zurn, D.S. Bassett, and N.C. Rust (2020). \"The Citation Diversity Statement: A Practice of Transparency, A Way of Life.\" *Trends in Cognitive Sciences* [doi: https://doi.org/10.1016/j.tics.2020.06.009](https://doi.org/10.1016/j.tics.2020.06.009)\n",
"> \n",
"> And editorials and research highlights of this work: \n",
"> * A.L. Fairhall and E. Marder (2020). Acknowledging female voices. *Nature Neuroscience*. [doi: https://doi.org/10.1038/s41593-020-0667-x](https://www.nature.com/articles/s41593-020-0667-x) \n",
"> * Widening the scope of diversity (2020). *Nature Neuroscience*. [doi: https://doi.org/10.1038/s41593-020-0670-2](https://www.nature.com/articles/s41593-020-0670-2) \n",
"> * Z. Budrikis (2020). Growing citation gender gap. *Nature Reviews Physics*. [doi: https://doi.org/10.1038/s42254-020-0207-3](https://doi.org/10.1038/s42254-020-0207-3)\n",
"> "
]
},
{
"cell_type": "markdown",
"id": "rubber-technician",
"metadata": {},
"source": [
"## Usage Instructions\n",
"\n",
"Start by uploading your .bib file to this Jupyter notebook folder.\n",
"\n",
"Then add your gender-api API key to the cell below. \n",
"\n",
"You do not need to make any more modifications to the notebook after this cell; once you have updated the cell below, you can run the full notebook. The rest of the notebook is documented as a convenience."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "round-michael",
"metadata": {},
"outputs": [],
"source": [
"BIB_FILE = \"YOUR FILE HERE.bib\"\n",
"GENDER_API_KEY = \"YOUR TOKEN HERE\" # https://gender-api.com/en/account/overview"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "flying-pioneer",
"metadata": {},
"outputs": [],
"source": [
"!pip install bibtexparser ethnicolr pandas tqdm requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "military-belief",
"metadata": {},
"outputs": [],
"source": [
"import functools\n",
"\n",
"import requests\n",
"import bibtexparser\n",
"from ethnicolr import pred_fl_reg_name\n",
"from bibtexparser.customization import author, convert_to_unicode\n",
"import pandas as pd\n",
"from tqdm.auto import tqdm\n",
"import warnings"
]
},
{
"cell_type": "markdown",
"id": "viral-tribune",
"metadata": {},
"source": [
"## Cached (long-running) race and gender predictions\n",
"\n",
"The following functions are \"expensive\": `get_race_for_name` runs `ethnicolr` model inference locally, and `get_gender_for_name` spends a gender-api request on every new name.\n",
"\n",
"For that reason, we decorate them with an LRU cache (`functools.lru_cache`), so rerunning the cells that call these functions is fast and does not incur additional computation or API credits. The cache lives in memory only: restarting the kernel, or rerunning the cell that defines these functions, clears it."
]
},
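{
"cell_type": "code",
"execution_count": null,
"id": "modern-alaska",
"metadata": {},
"outputs": [],
"source": [
"@functools.lru_cache(maxsize=None)\n",
"def get_gender_for_name(name):\n",
"    if name is None:\n",
"        return \"unknown\"\n",
"    # Names arrive as \"Last, First M.\"; gender-api's `name` parameter expects a given name.\n",
"    given_names = name.split(\", \", 1)[-1].split()\n",
"    if not given_names:\n",
"        return \"unknown\"\n",
"    # Passing `params` lets requests URL-encode spaces and non-ASCII characters.\n",
"    response = requests.get(\n",
"        \"https://gender-api.com/get\",\n",
"        params={\"name\": given_names[0], \"key\": GENDER_API_KEY},\n",
"    )\n",
"    return response.json().get(\"gender\", \"unknown\")\n",
"\n",
"@functools.lru_cache(maxsize=None)\n",
"def get_race_for_name(name):\n",
"    if name is None:\n",
"        return \"unknown\"\n",
"    # Split \"Last, First M.\" into family and given names for ethnicolr.\n",
"    name_parts = name.split(\", \")\n",
"    lname = name_parts[0]\n",
"    fname = \", \".join(name_parts[1:])\n",
"    return pred_fl_reg_name(pd.DataFrame([{\n",
"        \"fname\": fname,\n",
"        \"lname\": lname\n",
"    }]), \"lname\", \"fname\").iloc[0]['race']"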
]
},
{
"cell_type": "markdown",
"id": "seeing-looking",
"metadata": {},
"source": [
"## Crossref API interface\n",
"\n",
"We use the [Crossref API](https://github.com/CrossRef/rest-api-doc) to fill in author information when a bibliography entry lacks it. Because this notebook only needs titles and authors, we pass the `select` parameter on each API call so that Crossref returns only those two fields; downloading the full metadata is slow."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "palestinian-faculty",
"metadata": {},
"outputs": [],
"source": [
"CROSSREF_WORKS_URL = \"https://api.crossref.org/works\"\n",
"\n",
"def search_reference(title: str = None, author: str = None, doi: str = None):\n",
"    \"\"\"\n",
"    Usage:\n",
"\n",
"        search_reference(title=\"graphs\", author=\"Erdős Rényi\")\n",
"\n",
"    \"\"\"\n",
"    # `params` lets requests URL-encode titles, names, and DOIs safely.\n",
"    params = {\"select\": \"title,author\"}\n",
"    if doi:\n",
"        # Match the DOI exactly instead of treating it as free text.\n",
"        params[\"filter\"] = f\"doi:{doi}\"\n",
"    elif title:\n",
"        params[\"query.bibliographic\"] = title\n",
"        if author:\n",
"            params[\"query.author\"] = author\n",
"    else:\n",
"        return []\n",
"    return requests.get(CROSSREF_WORKS_URL, params=params).json()['message']['items']\n",
"\n",
"_healed_document_count = 0\n",
"\n",
"def healed_from_crossref(reference_dict: dict) -> dict:\n",
"    \"\"\"\n",
"    Given a reference, try to get a list of authors,\n",
"    using as much available information as possible.\n",
"    \"\"\"\n",
"    global _healed_document_count\n",
"\n",
"    # Entries that already list authors need no healing.\n",
"    if reference_dict.get('author', None):\n",
"        return reference_dict\n",
"\n",
"    # Prefer the DOI (an exact match); fall back to a title search.\n",
"    crossref_version = []\n",
"    if reference_dict.get('doi', None):\n",
"        crossref_version = search_reference(doi=reference_dict['doi'])\n",
"    if not crossref_version and reference_dict.get('title', None):\n",
"        crossref_version = search_reference(title=reference_dict['title'])\n",
"\n",
"    # Nothing usable found: return the entry unchanged rather than crash.\n",
"    if not crossref_version or not crossref_version[0].get('author'):\n",
"        return reference_dict\n",
"\n",
"    # preferred version:\n",
"    preferred_version = crossref_version[0]\n",
"    # Organizational authors may lack 'family' or 'given'; skip those.\n",
"    authors = [\n",
"        f\"{person['family']}, {person['given']}\"\n",
"        for person in preferred_version['author']\n",
"        if 'family' in person and 'given' in person\n",
"    ]\n",
"    if not authors:\n",
"        return reference_dict\n",
"\n",
"    # Crossref returns the title as a list of strings; keep the first one.\n",
"    title = (preferred_version.get('title') or [reference_dict.get('title', '')])[0]\n",
"\n",
"    _healed_document_count += 1\n",
"    warnings.warn(f\"Healing document {title} ({authors[0]})\")\n",
"\n",
"    return {\n",
"        **reference_dict,\n",
"        \"title\": title,\n",
"        \"author\": authors,\n",
"    }"
]
},
{
"cell_type": "markdown",
"id": "funny-bristol",
"metadata": {},
"source": [
"## Extract authors from the bibliography\n",
"\n",
"These functions parse the BibTeX file and extract the author list of each entry."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "efficient-entrance",
"metadata": {},
"outputs": [],
"source": [
"def get_authors_from_bibliography():\n",
"    def customizations(record):\n",
"        record = author(record)\n",
"        record = convert_to_unicode(record)\n",
"        record = healed_from_crossref(record)\n",
"        return record\n",
"\n",
"    parser = bibtexparser.bparser.BibTexParser(\n",
"        common_strings=True,\n",
"    )\n",
"    parser.customization = customizations\n",
"\n",
"    # Use a context manager so the file is closed once parsing is done.\n",
"    with open(BIB_FILE, 'r') as bib_file:\n",
"        bib_database = parser.parse_file(bib_file)\n",
"    # bibtexparser turns a trailing \"et al.\" into the pseudo-author \"al., et\"; drop it.\n",
"    authors = [\n",
"        [a for a in entry.get('author', []) if \"al., et\" not in a]\n",
"        for entry in\n",
"        bib_database.entries\n",
"    ]\n",
"    return authors\n",
"\n",
"\n",
"def get_first_and_last_authors(author_list):\n",
"    if len(author_list) == 0:\n",
"        return None, None\n",
"    if len(author_list) > 1:\n",
"        return author_list[0], author_list[-1]\n",
"    else:\n",
"        return author_list[0], None"
]
},
{
"cell_type": "markdown",
"id": "heated-spare",
"metadata": {},
"source": [
"## The real stuff\n",
"\n",
"Here we actually start invoking functions (this is the first point at which the notebook might noticeably slow down). First, we get the authors from the bibliography: `authors` holds one list per paper, and each list contains that paper's author names as strings in the format `Last, First M.`.\n",
"\n",
"If we have to heal any of your citations (because they don't include authorship information), a red warning will appear in the output of this cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "quick-rebate",
"metadata": {},
"outputs": [],
"source": [
"authors = get_authors_from_bibliography()"
]
},
{
"cell_type": "markdown",
"id": "reserved-joshua",
"metadata": {},
"source": [
"All we care about is the first and last author for this notebook, so we can ignore the rest."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "central-hardwood",
"metadata": {},
"outputs": [],
"source": [
"first_and_last_authors = [get_first_and_last_authors(a) for a in authors]"
]
},
{
"cell_type": "markdown",
"id": "difficult-muslim",
"metadata": {},
"source": [
"## (Slow!) Get gender and race predictions for first and last authors\n",
"\n",
"This cell will be slow the first time you run it (it took about one minute to run on a manuscript of ~70 cites), but nearly instant the second time you run it (since the results will be cached by the LRU cache explained above). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "smoking-hollywood",
"metadata": {},
"outputs": [],
"source": [
"first_and_last_author_genders = [\n",
"    (get_gender_for_name(first), get_gender_for_name(last))\n",
"    for first, last in tqdm(first_and_last_authors)\n",
"]\n",
"\n",
"first_and_last_author_races = [\n",
"    (get_race_for_name(first), get_race_for_name(last))\n",
"    for first, last in tqdm(first_and_last_authors)\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "fundamental-filling",
"metadata": {},
"source": [
"## Save outputs\n",
"\n",
"We save the outputs of the predictions to a CSV which you can download for your records."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "elementary-combat",
"metadata": {},
"outputs": [],
"source": [
"author_dataframe = pd.DataFrame({\n",
"    \"first_author\": [i[0] for i in first_and_last_authors],\n",
"    \"first_author_gender\": [i[0] for i in first_and_last_author_genders],\n",
"    \"first_author_race\": [i[0] for i in first_and_last_author_races],\n",
"\n",
"    \"last_author\": [i[-1] for i in first_and_last_authors],\n",
"    \"last_author_gender\": [i[-1] for i in first_and_last_author_genders],\n",
"    \"last_author_race\": [i[-1] for i in first_and_last_author_races],\n",
"})\n",
"\n",
"author_dataframe.to_csv(\"author-gender-and-race-predictions.csv\")"
]
},
{
"cell_type": "markdown",
"id": "posted-pitch",
"metadata": {},
"source": [
"# Basic Analysis\n",
"\n",
"We perform some basic analyses here, including general statistics about the bibliography as well as authorship gender breakdowns."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "hybrid-motor",
"metadata": {},
"outputs": [],
"source": [
"total_paper_count = len(first_and_last_author_genders)\n",
"# Entries without a distinct last author: one listed author, or none at all.\n",
"single_author_paper_count = sum(1 for F, L in first_and_last_authors if L is None or F is None)\n",
"# Entries with no author information.\n",
"no_author_paper_count = sum(1 for F, L in first_and_last_authors if L is None and F is None)"
]
},
{
"cell_type": "markdown",
"id": "extensive-davis",
"metadata": {},
"source": [
"## Gender Breakdown"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "generous-smile",
"metadata": {},
"outputs": [],
"source": [
"male_male_count = sum(1 for F, L in first_and_last_author_genders if F == 'male' and L == 'male')\n",
"male_female_count = sum(1 for F, L in first_and_last_author_genders if F == 'male' and L == 'female')\n",
"female_male_count = sum(1 for F, L in first_and_last_author_genders if F == 'female' and L == 'male')\n",
"female_female_count = sum(1 for F, L in first_and_last_author_genders if F == 'female' and L == 'female')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "illegal-space",
"metadata": {},
"outputs": [],
"source": [
"# paper_count = total_paper_count\n",
"paper_count = total_paper_count - single_author_paper_count"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "korean-scott",
"metadata": {},
"outputs": [],
"source": [
"male_male_ratio = male_male_count/paper_count\n",
"male_female_ratio = male_female_count/paper_count\n",
"female_male_ratio = female_male_count/paper_count\n",
"female_female_ratio = female_female_count/paper_count\n",
"\n",
"# Pairs with an unresolved gender fall outside the four categories, so renormalize to sum to 100%.\n",
"# `or 1` avoids dividing by zero when no pair has two resolved genders.\n",
"normalized_pct = sum([male_male_ratio,male_female_ratio,female_male_ratio,female_female_ratio]) or 1\n",
"\n",
"male_male_norm_ratio = male_male_ratio/normalized_pct\n",
"male_female_norm_ratio = male_female_ratio/normalized_pct\n",
"female_male_norm_ratio = female_male_ratio/normalized_pct\n",
"female_female_norm_ratio = female_female_ratio/normalized_pct"
]
},
{
"cell_type": "markdown",
"id": "objective-kingdom",
"metadata": {},
"source": [
"## Race Breakdown"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "experimental-matthew",
"metadata": {},
"outputs": [],
"source": [
"# Skip pairs with a missing author: \"unknown\" is not a race prediction and\n",
"# would otherwise be counted as non-white.\n",
"known_race_pairs = [(F, L) for F, L in first_and_last_author_races if \"unknown\" not in (F, L)]\n",
"\n",
"nw_nw_count = sum(1 for F, L in known_race_pairs if F != 'nh_white' and L != 'nh_white')\n",
"nw_w_count = sum(1 for F, L in known_race_pairs if F != 'nh_white' and L == 'nh_white')\n",
"w_nw_count = sum(1 for F, L in known_race_pairs if F == 'nh_white' and L != 'nh_white')\n",
"w_w_count = sum(1 for F, L in known_race_pairs if F == 'nh_white' and L == 'nh_white')\n",
"\n",
"# `or 1` avoids dividing by zero when no pair has two race predictions.\n",
"normalized_race_count = sum([nw_nw_count,nw_w_count,w_nw_count,w_w_count]) or 1\n",
"\n",
"nw_nw_count_ratio = nw_nw_count / normalized_race_count\n",
"nw_w_count_ratio = nw_w_count / normalized_race_count\n",
"w_nw_count_ratio = w_nw_count / normalized_race_count\n",
"w_w_count_ratio = w_w_count / normalized_race_count"
]
},
{
"cell_type": "markdown",
"id": "reflected-particular",
"metadata": {},
"source": [
"## Outputs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "copyrighted-perception",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import HTML, Markdown\n",
"\n",
"pct = lambda x: f\"{round(x * 100, 2)}%\"\n",
"\n",
"Markdown(f\"\"\"\n",
"\n",
"## Paper Metrics\n",
"\n",
"* Number of papers: {total_paper_count}\n",
"* Number of papers healed with Crossref: {_healed_document_count}\n",
"* Papers with at most one listed author: {single_author_paper_count}\n",
"* Papers with no author information: {no_author_paper_count}\n",
"\n",
"## Authorship Gender (First/Last)\n",
"* F/F: {pct(female_female_norm_ratio)}\n",
"* M/F: {pct(male_female_norm_ratio)}\n",
"* F/M: {pct(female_male_norm_ratio)}\n",
"* M/M: {pct(male_male_norm_ratio)}\n",
"\n",
"## Authorship Race (First/Last)\n",
"* PoC/PoC: {pct(nw_nw_count_ratio)}\n",
"* PoC/White: {pct(nw_w_count_ratio)}\n",
"* White/PoC: {pct(w_nw_count_ratio)}\n",
"* White/White: {pct(w_w_count_ratio)}\n",
"\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "solar-fairy",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
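
The breakdown cells in the notebook above count (first author, last author) prediction pairs per category and then renormalize so that the reported categories sum to 100%. A minimal standalone sketch of that arithmetic is shown here. It is an illustration only: `pair_breakdown` and the sample predictions are hypothetical and do not appear in the notebook, and the sketch assumes that pairs containing an `unknown` prediction should be dropped before normalizing.

```python
from collections import Counter


def pair_breakdown(pairs, unknown="unknown"):
    """Return the share of each (first, last) label pair among fully resolved pairs."""
    # Keep only pairs in which both predictions are known.
    resolved = [(first, last) for first, last in pairs if unknown not in (first, last)]
    counts = Counter(resolved)
    total = sum(counts.values())
    if total == 0:
        # No resolved pairs: return an empty breakdown instead of dividing by zero.
        return {}
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical predictions for five cited papers (not real data).
sample_pairs = [
    ("female", "male"),
    ("male", "male"),
    ("male", "unknown"),  # e.g. a single-author paper with no last author
    ("female", "female"),
    ("male", "male"),
]

breakdown = pair_breakdown(sample_pairs)
for (first, last), share in sorted(breakdown.items()):
    print(f"{first}/{last}: {share:.0%}")
```

Because the shares are computed over resolved pairs only, they always sum to 1 regardless of how many entries had a missing or unresolved author, which is the same effect the notebook's normalization step aims for.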