@mwalton
Last active May 17, 2024 22:09
CSSS Notes

Bias, Fairness and Inequality in an Algorithmic Age

Chair: Sasha Johfre (UW) Moderator: Kosuke Imai (Harvard)

Algorithmic Reinterpretations: College Rankings and Socioeconomic Self-Sorting

James Chu (Columbia)

  • ways that people interpret metrics may drive self-selection
  • problems:
    • cannot capture all that is valuable
    • exacerbate racial, gender and class inequalities
    • school rankings: do not adequately capture career and social mobility
    • recidivism: do not capture historic injustices
  • if algorithms reduce dimensions to a metric, people "decode" or reinterpret these metrics back to the input domain (and importantly, are biased in how they decode)
  • Algorithmic reinterpretation = inconsistent decoding
  • decoding of a metric is not 1:1
  • hypothesis: interpretation of college ranking metrics is clustered by socioeconomic status (SES)
  • findings on impressions (n=2k):
    • high SES: interpret a high score as a signal of exclusivity, rigor, safety, stress
    • low SES: interpret a high score as a signal of exclusion
  • differences in adolescent perceptions of college tuition price

Advancing Algorithmic Fairness: A Statistical Learning Approach with Causal Constraints

Razieh Nabi (Emory)

  • legal framing: "would the employer have taken the same action if the employee had a different race and everything else was the same?" maps to causal DAG
  • is unfairness always about the direct effect of X on Y? definitions should be context specific
  • mitigation approaches:
    • pre-process the observed data
    • post-process the statistical output
    • re-train subject to fairness constraints (constrained optimization)
  • intuition: the Lagrange multiplier indexes a constraint-specific path encoding the constrained model space where the influence of the sensitive attribute is 0
  • observation: formalizations of fairness only codify legal and political desiderata (no substitute for discourse, debate & policymaking)
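
The constrained re-training idea can be written out as a rough sketch. The notation here is assumed, not taken from the talk: $L(\theta)$ is the usual training loss, $g(\theta)$ is some estimate of the effect of the sensitive attribute on the outcome along the pathway judged unfair, and $\lambda$ is the Lagrange multiplier.

```latex
\hat{\theta} \;=\; \arg\min_{\theta} \; L(\theta)
\quad \text{subject to} \quad g(\theta) = 0

\mathcal{L}(\theta, \lambda) \;=\; L(\theta) + \lambda \, g(\theta),
\qquad
\nabla_{\theta} L(\hat{\theta}) + \hat{\lambda} \, \nabla_{\theta} g(\hat{\theta}) = 0,
\quad g(\hat{\theta}) = 0
```

In practice the equality is often relaxed to a tolerance band such as $\epsilon_l \le g(\theta) \le \epsilon_u$ (again an assumed form). Which pathway $g$ measures is exactly the context-specific choice the notes flag.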

Fair inference in multilevel data analysis

Peter Hoff (Duke)

Advances in Social Network Analysis

Chair: Tyler McCormick (UW) Moderator: Yuan Hsiao (UW)

Social Networks & Health

Weihua An (Emory)

Mosuo social networks do not support universalist theories of gendered social relationships

Siobhan Mattison (U New Mexico)

  • Gender believed in the literature to constrain activities & social relationships

  • Evolutionary theorists link constraints to gendered differences in payoffs from childcare vs politicking & securing reproductive partners

  • “WEIRD” societies: Western Educated Industrialized Rich & Democratic

    • Non-“naturalistic” societies
    • Academic framing of WEIRD societies universalizes their cultural norms about gender through the literature
  • Setting: Ethnic Mosuo of China

    • Matrilineal & patrilineal subgroups
    • Lots in common (language, art, etc.) but gendered kinship is different (monogamy vs “walking” marriage, paternal vs maternal authority & inheritance)
    • Stat. sig. impacts (reversals) in health, social graph sparsity between men & women in matrilineal & patrilineal groups
    • Conjecture: differences in local ecology motivate differences in gendered social relations? (Kind of a “nothing burger”)
  • Universalism is boring. Essentialist thinking about gender is not doing the work that scientifically meaningful categories should do. Need to be more specific to social/cultural context in describing paths from sex to gender

Weijing Tang (CMU)

  • Signed networks: positive & negative degree (eg “like” or “dislike” edges)
  • Collaboration & competition in trade networks
  • The presence of signed edges introduces problems: edge signs influence each other (“the enemy of my enemy is my friend” / “the friend of my friend is my friend”)
  • Empirical evidence for balance theory in social networks, protein interactions & ecology
  • A signed network is balanced at the population level if for all triples of edges, the expected product of signs is positive
  • Authors propose an estimator in this paradigm and a generative model satisfying balanced sign constraints
  • Correlates of war dataset (CoW)
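
The triangle-level intuition behind balance can be checked empirically with a few lines of code. This is a minimal sketch on a hypothetical toy graph, not the authors' estimator: it multiplies the three edge signs of every closed triangle and averages the products, so a positive mean is a crude sample analogue of "the expected product of signs is positive". The function names and the `toy` data are illustrative assumptions.

```python
from itertools import combinations


def triangle_sign_products(signs):
    """Return the product of edge signs for every closed triangle.

    `signs` maps an undirected edge, stored as frozenset({u, v}), to +1 or -1.
    """
    nodes = sorted({node for edge in signs for node in edge})
    products = []
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset(pair) for pair in ((a, b), (b, c), (a, c))]
        # Only triples whose three edges are all present form a triangle.
        if all(edge in signs for edge in edges):
            products.append(signs[edges[0]] * signs[edges[1]] * signs[edges[2]])
    return products


def mean_triangle_product(signs):
    """Average sign product over triangles (NaN if the graph has no triangle)."""
    products = triangle_sign_products(signs)
    return sum(products) / len(products) if products else float("nan")


# Hypothetical toy network: +1 = "like", -1 = "dislike".
toy = {
    frozenset({"A", "B"}): +1,
    frozenset({"B", "C"}): +1,
    frozenset({"A", "C"}): +1,
    frozenset({"C", "D"}): -1,
    frozenset({"A", "D"}): -1,
    frozenset({"B", "D"}): +1,
}

print(triangle_sign_products(toy))
print(mean_triangle_product(toy))
```

Triangles with a product of +1 match the two slogans above (all friends, or two enemies sharing a friend); a product of -1 marks an unbalanced triangle.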

Measuring and understanding the dynamics of populations of scholars

Emilio Zagheni (Max Planck)

  • measuring migration of scholars based on info on institutional affiliation changes
  • The demography of the peripatetic researcher: Evidence on highly mobile scholars from the Web of Science
  • Scholarly Migration Database
  • provide evidence of relationship between economic development & migration propensity
  • migration trends & patterns
    • gender inequality
    • international mobility: expands networks & creates visibility
    • what might this mean for gender disparity? Gender disparity is higher among internationally mobile scholars than among non-mobile scholars (median .32 vs .47 respectively in '98-'02); this is also changing over time (.5 vs .54 in '13-'17)
    • impact of policies on migration: how did brexit affect migration of scholars?
      • p(entering UK | EU origin) no change
      • p(leaving UK | EU origin) increased
      • p(leaving UK | UK origin) decreased
      • p(entering UK | UK origin) increased
  • internal migration observation: center of gravity within the US "moves west" between 2015 and 2020; reversal of historical trend
  • survival probability (likelihood of continuing to publish): little variation between genders, strong differences between countries and entry year
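
Reading migration off affiliation changes, and then estimating quantities such as p(leaving UK | origin), might look like the sketch below. Everything in it is an illustrative assumption: the records are made up, "origin" is taken to be the first observed affiliation country, and the real Scholarly Migration Database pipeline is certainly more careful (name disambiguation, multiple affiliations, censoring).

```python
# Hypothetical records: (scholar_id, publication_year, affiliation_country).
records = [
    ("s1", 2014, "DE"), ("s1", 2016, "UK"), ("s1", 2018, "UK"),
    ("s2", 2015, "UK"), ("s2", 2017, "FR"),
    ("s3", 2015, "UK"), ("s3", 2019, "UK"),
]


def transitions(records):
    """Return (scholar, origin, year, from_country, to_country) for consecutive records."""
    histories = {}
    for scholar, year, country in sorted(records):
        histories.setdefault(scholar, []).append((year, country))
    out = []
    for scholar, history in histories.items():
        origin = history[0][1]  # assumption: origin = first observed affiliation
        for (_, prev), (year, cur) in zip(history, history[1:]):
            out.append((scholar, origin, year, prev, cur))
    return out


def p_leave(records, country, origin):
    """Share of transitions starting in `country` that end elsewhere, for a given origin."""
    at_risk = [t for t in transitions(records) if t[3] == country and t[1] == origin]
    if not at_risk:
        return float("nan")
    return sum(t[4] != country for t in at_risk) / len(at_risk)


# A migration event is a transition whose two countries differ.
moves = [t for t in transitions(records) if t[3] != t[4]]
print(moves)
print(p_leave(records, "UK", "UK"), p_leave(records, "UK", "DE"))
```

Comparing such estimates before and after a policy date is one simple way to frame the Brexit question; it does not by itself separate the policy effect from other trends.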

Text as data short course

Benjamin Mako Hill: social scientist studying online communities using NLP

NLP: subfield at intersection of computing & human language

Text as data: analyzing text to generate insight

  • Text as Data: A New Framework for Machine Learning and the Social Sciences
  • Data sources relevant to census: Govt and legal documents, annual reports, internal comms, transcripts, legislation, public records
  • What can you do w/ text as data?
    • Discovery: EDA, concept & hypothesis discovery
    • Measurement: operationalizing a concept; measuring phenomena at scale
    • Causal inference: assess effect of counterfactual intervention
    • Text as treatment, outcome or confounder

Case studies

Representation:

it’s complicated (but mostly about reducing dimensionality)

  • Tokenization: split strings up (lots of ways to do this)
  • Stop words: commonly used words (“the”, “is”, “at”, etc.) filtered out in preprocessing; not always done, because sometimes they matter
  • Stemming: reducing words to stem / root form ({changing, changed, change} —> “chang”)
  • Lemmatization: converting word to meaningful base form given context
  • Vectorization:
    • bag of words (simplest, ignores syntax / word order)
    • N-grams: capture context and word order
  • Representing many documents
    • Term/document matrix: term frequency vectors of multiple documents (often big & sparse)
    • Term frequency-inverse document frequency (TF-IDF): relevance of a word to a document, compared to other documents
    • Further reducing dimensionality: principal component analysis / singular value decomposition (often work well after TF-IDF weighting)
  • Word embeddings
    • “You shall know a word by the company it keeps” (Firth, 1957)
    • Represent a word based on words around it in small context window
    • Examples: word2vec (predict word from context or context from word) & GloVe (dim reduction of co-occurrence counts within window)
    • Intuition: words that are semantically similar (roughly) map to similar parts of feature space (be careful w/ this, manifolds are messy)
    • Vector “arithmetic”: “king” - “man” + “woman” ≈ “queen”
  • Transformers:
    • predict words in a sequence, trained on large corpora (that's it)
    • Popular proprietary models create reproducibility problems for social sciences
    • Lack of transparency in training data (also bad)
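
The classical steps in the list above (tokenization, stop-word removal, n-grams, term/document counts, TF-IDF) fit in a short standard-library sketch. The tokenizer, the tiny stop-word list, the example documents and the unsmoothed `log(N / df)` weighting are all simplifying assumptions; library implementations typically use richer tokenizers and smoothed variants of the weighting.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "at", "a", "of", "and", "on"}  # tiny illustrative list


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (one of many possible tokenizers)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Contiguous n-token windows, which retain some local word order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_document_matrix(docs):
    """Bag-of-words counts per document, after stop-word removal."""
    return [Counter(t for t in tokenize(d) if t not in STOP_WORDS) for d in docs]


def tf_idf(counts):
    """Weight each term count by log(N / document frequency)."""
    n_docs = len(counts)
    # Iterating a Counter yields each term once, so this counts documents per term.
    df = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]
counts = term_document_matrix(docs)
weights = tf_idf(counts)

print(ngrams(tokenize(docs[0]), 2))
print(weights[0])
```

In the first document, "mat" appears in no other document and so receives a larger weight than "cat" or "sat", which each appear in two of the three documents.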

Tools

  • Python: NLTK, spaCy
  • R: tm, textTinyR

Analysis

  • Dictionary methods: LIWC, VADER, Harvard-IV
  • unsupervised methods: grouping like things into partitions (clustering, topic models)
  • supervised methods: use a small sample of documents to train, make predictions / classifications of other documents
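
A dictionary method reduces to looking tokens up in a weighted word list and aggregating the matches. The sketch below uses a made-up six-word lexicon and a simple average; it does not reproduce the scoring rules of LIWC, VADER or Harvard-IV, which are larger and more elaborate.

```python
import re

# Hypothetical mini-lexicon: positive weights for positive words, negative for negative.
LEXICON = {"good": 1, "great": 2, "helpful": 1, "bad": -1, "terrible": -2, "slow": -1}


def dictionary_score(text, lexicon=LEXICON):
    """Average lexicon weight over matched tokens; 0.0 if nothing matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


for text in ["Great service, helpful staff", "Terrible and slow", "No opinion"]:
    print(text, "->", dictionary_score(text))
```

Scores like these can serve directly as a measurement, or as features for the supervised methods listed above.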