Draft blog post on potential gender biases in algorithms used by commercial library search tools

Algorithmic Gender Bias in Discovery Systems

This is a working draft of a possible future blog post

At GVSU, we used the Summon discovery service to provide search across most of our resources. One of Summon 2.0's features is the "Topic Explorer," a sidebar that provides general reference information when a search is done. The Topic Explorer shows short excerpts from Reference Sources, including Wikipedia, when a search meets certain criteria. From the Summon Press Release announcing Summon 2.0 in 2013:

Developed by analyzing global Summon usage data and leveraging commercial and open access reference content, as well as librarian expertise, this new feature helps users get started (presearch) with the research process and allows librarians to help users when and where they need it most.

The Topic Explorer returns different reference articles about the superhero "batman," for instance, depending on the host library's subscriptions and settings for which reference sources are used. At GVSU, we get a blurb from the Gale Virtual Reference Library:

http://gvsu.summon.serialssolutions.com/#!/search?ho=t&l=en&q=batman

While The University of Huddersfield has an article from Wikipedia:

http://hud.summon.serialssolutions.com/#!/search?ho=t&l=en&q=batman
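The two URLs above share one pattern, varying only in the institution's subdomain, which makes it easy to run the same query against many Summon instances. A minimal sketch, assuming only the URL pattern visible above (the subdomain list is illustrative, not a complete customer list):

```python
from urllib.parse import quote

# Illustrative subdomains; each Summon customer has its own
# <institution>.summon.serialssolutions.com hostname.
INSTANCES = ["gvsu", "hud"]

def summon_search_url(instance: str, query: str) -> str:
    """Build a Summon 2.0 search URL following the pattern shown above."""
    return (f"http://{instance}.summon.serialssolutions.com"
            f"/#!/search?ho=t&l=en&q={quote(query)}")

for inst in INSTANCES:
    print(summon_search_url(inst, "stress in the workplace"))
```

Note that the `#!/search` fragment is resolved client-side by Summon's JavaScript application, so comparing Topic Explorer results across instances still requires loading each URL in a browser rather than scraping the raw HTML.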

Yesterday (November 11, 2015) I tweeted about a search in the Summon discovery service for "stress in the workplace" that brought up the topic "Women in the workforce" from Wikipedia.1 This seemed not only an inappropriate subject to match the search terms, but also one that showed elements of gender bias. I've been following the work of Safiya Noble, a faculty member in UCLA's Information Studies department. Dr. Noble's work on gender and racial bias in commercial search engines has been really influential on my own thinking about the ethical responsibilities of libraries that provide digital research tools. I had previously thought that the indexing process of commercial discovery services was not as susceptible to these problems as the commercial algorithms behind the systemic bias in Google's portrayal of women and girls of color. However, seeing this search made me rethink that position, since Summon relies on some algorithmic mechanism to match these "topic" summaries with searches.

As of this morning (11/12/2015), ProQuest has made a change to either the algorithm or the mapping of topics so that this particular topic no longer comes up. While I think that is a good thing, it makes me wonder what other topics might be exhibiting unintended gender or racial biases. In the meantime, I did have the foresight yesterday to run the search in many of the Summon instances listed in the Community Wiki (a small selection of all Summon customers; sadly, I didn't capture all the listings). I took screen captures of them, and am reproducing them here to show some interesting tendencies (for example, running an English-language search in non-English Summon instances still brings up the "Women in the workforce" Wikipedia topic). Below are the screenshots:

Grand Valley State University
Grand Valley State University, USA

University of Huddersfield
University of Huddersfield, UK

Cornerstone University
Cornerstone University, USA - Wikipedia turned off

Dartmouth College
Dartmouth College, USA

Griffith University
Griffith University, Australia - Wikipedia turned off

University of Denver
University of Denver, USA

Sheffield Hallam University
Sheffield Hallam University, UK - Wikipedia turned off? No longer active customer

Virginia Tech
Virginia Tech, USA

University of Borås
University of Borås, Sweden

Queensland University of Technology
Queensland University of Technology, Australia - Wikipedia turned off?

James Cook University
James Cook University, Australia

Edith Cowan University
Edith Cowan University, Australia - Wikipedia turned off?

Universitat Konstanz
Universität Konstanz, Germany

I'm currently exploring the Topic Explorer's results on a number of potentially problematic search terms pulled from our Summon usage logs. If you'd like to donate anonymous results from your own Summon logs, I'd appreciate the opportunity to broaden the dataset. Feel free to drop me a line at reidsmam@gvsu.edu or on Twitter at @mreidsma.

Some of the questions I am working on include, but are not limited to:

  • Are there other instances of unconscious bias reflected in the algorithm's results?
  • If so, are they distributed across different reference sources, or are they limited to a single source?
  • Are the biases inherent in the subject-to-subject matching alone, or are there instances of bias evident in the content of the Topic panes?

I do want to emphasize that I'm not necessarily putting ProQuest or Wikipedia on trial here by examining these terms (although I am sure some will see it that way; that is another problem with a library world that has decided to outsource its search and discovery services to commercial entities that do not necessarily share our values). I'm not looking at the content of the services so much as I am looking at the algorithms themselves.

I don't think that algorithms by necessity have to expose bias, but I also don't think we'll be able to create these kinds of sophisticated algorithms without first examining our own conscious and unconscious biases and making sure that these are not reflected in the way we code.


  1. I should note that these search terms are used by a colleague of mine to test new search tools, and we had noticed this discrepancy before, at least a year ago. I assumed at the time that this was a quirk in the service that would be worked out as the service matured. After reading Dr. Noble's work, I was reminded of the search, ran it again, and found that the Topic Pane continued to show the biased result.
@JohnMarkOckerbloom

commented Nov 13, 2015

Thanks for researching and critiquing this!

One way to address problems like these is to use better data. As it happens, we have the data available to suggest a better Wikipedia article for "Stress in the workplace". In particular, in LCSH, "Stress in the workplace" is an alternative term for "Job stress", per the Library of Congress' linked open data set at http://id.loc.gov/authorities/subjects/sh85070588.html . And LCSH's "Job stress" in turn is associated with Wikipedia's "Occupational stress" in an open data set I've had posted for a while on Github, at https://github.com/JohnMarkOckerbloom/ftl/blob/master/data/wikimap .
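The two-step resolution described above (variant label → authorized LCSH heading → Wikipedia article) can be sketched in a few lines of Python. The dictionaries below are hand-built from the two examples cited in this thread; the actual data lives in the LCSH linked open data at id.loc.gov and in the wikimap file on GitHub, whose real formats should be consulted before building on this:

```python
# Sample data reconstructed from the examples in this discussion.
LCSH_ALT_LABELS = {
    # variant (alternate) label -> authorized LCSH heading
    "Stress in the workplace": "Job stress",  # per sh85070588
}

LCSH_TO_WIKIPEDIA = {
    # authorized LCSH heading -> Wikipedia article title
    "Job stress": "Occupational stress",
    "Women -- Employment": "Women in the workforce",
}

def wikipedia_topic(term):
    """Resolve a search term to a Wikipedia article via LCSH.

    First normalize a variant label to its authorized heading,
    then look the heading up in the heading-to-article map.
    Returns None when no mapping is known.
    """
    heading = LCSH_ALT_LABELS.get(term, term)
    return LCSH_TO_WIKIPEDIA.get(heading)

print(wikipedia_topic("Stress in the workplace"))  # -> Occupational stress
```

The point of the sketch is that curated authority data makes the matching deterministic: "Stress in the workplace" lands on "Occupational stress" rather than on a loosely associated topic like "Women in the workforce."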

I use this dataset, along with data provided by VIAF and some fairly conservative algorithmic enhancements, to provide bidirectional links between library searches and Wikipedia articles in my Forward to Libraries project. But anyone else, whether Proquest, librarians, or interested tinkerers, is free to reuse the data to make better connections between library subjects and Wikipedia. (They're also welcome to contribute enhancements-- for instance, I had LCSH's "Women -- Employment" corresponding to Wikipedia's "Women in the workforce", but in my most recent Github checkin I also added a mapping to that Wikipedia article from LCSH's "Women employees".)

Data alone won't solve all the problems of bias in online searches. But human-compiled data like the wikimap file above can often produce better results than naive algorithms like the one ProQuest seems to be using in Summon. And we in libraries have an impressive record of compiling data over time; I hope we can continue to compile and apply data in new ways to improve discovery.
