CSSS Notes
Chair: Sasha Johfre (UW) Moderator: Kosuke Imai (Harvard)
James Chu (Columbia)
- ways that people interpret metrics may drive self-selection
- problems:
- cannot capture all that is valuable
- exacerbate racial, gender and class inequalities
- school rankings: do not adequately capture career and social mobility
- recidivism: do not capture historic injustices
- if algorithms reduce dimensions to a metric, people "decode" or reinterpret these metrics back to the input domain (and importantly, are biased in how they decode)
- Algorithmic reinterpretation = inconsistent decoding
- decoding of a metric is not 1:1
- hypothesis: interpretation of college ranking metrics is clustered by socioeconomic status (SES)
- findings on impressions (n = 2k):
- high SES: interpret a high score as signaling exclusivity, rigor, safety, and stress
- low SES: interpret a high score as a signal of exclusion
- differences in adolescent perceptions of college tuition price
Razieh Nabi (Emory)
- legal framing: "would the employer have taken the same action if the employee had a different race and everything else was the same?" maps to causal DAG
- is unfairness always about the direct effect of X on Y? definitions should be context specific
- mitigation approaches:
- pre-process the observed data
- post-process the statistical output
- re-train subject to fairness constraints (constrained optimization)
- intuition: the Lagrange multiplier indexes a constraint-specific path through the constrained model space in which the influence of the sensitive attribute is 0 (see the sketch after this list)
- observation: formalizations of fairness only codify legal and political desiderata (no substitute for discourse, debate & policymaking)
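A minimal sketch of the constrained-optimization idea, not Nabi's actual formulation: a logistic regression trained with a fixed penalty weight (standing in for the Lagrange multiplier) that drives the covariance between the sensitive attribute and the model's scores toward zero. All data and variable names here are invented.

```python
# Sketch only: "fairness-constrained" logistic regression via a penalty that
# pushes cov(A, score) toward 0. The fixed weight lam stands in for a
# Lagrange multiplier; all data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

n, d = 1000, 5
X = rng.normal(size=(n, d))                 # features
A = rng.integers(0, 2, size=n)              # sensitive attribute (hypothetical)
y = (X[:, 0] + 0.5 * A + rng.normal(scale=0.5, size=n) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(d)
lam, lr = 5.0, 0.1

for _ in range(2000):
    p = sigmoid(X @ w)
    grad_loss = X.T @ (p - y) / n                          # logistic-loss gradient
    cov = np.mean((A - A.mean()) * (p - p.mean()))         # constraint target: ~0
    grad_cov = ((A - A.mean()) * p * (1 - p)) @ X / n      # d cov / d w
    w -= lr * (grad_loss + lam * 2 * cov * grad_cov)       # penalized update

scores = sigmoid(X @ w)
print("cov(A, score) after constrained training:",
      np.mean((A - A.mean()) * (scores - scores.mean())))
```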
Peter Hoff (Duke)
- goal: infer group-specific parameters from group-specific samples
- hierarchical models enable data sharing between groups (useful when within-group sample sizes are small) but are biased & lack group-specific error control
- references: "Smaller p-values via indirect information" (paper), FAB inference slides, linear FAB R package
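The FAB details are in the slides and R package referenced above; as a rough illustration of the data-sharing intuition only (this is not FAB inference), here is an empirical-Bayes partial-pooling estimate of group means on toy data:

```python
# Illustration of hierarchical "data sharing" via empirical-Bayes partial
# pooling (NOT FAB inference; see Hoff's slides / the linear FAB R package).
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 20 groups with small, unequal sample sizes
true_means = rng.normal(0, 1, size=20)
groups = [rng.normal(mu, 2.0, size=rng.integers(3, 15)) for mu in true_means]

ybar = np.array([g.mean() for g in groups])       # group sample means
n = np.array([len(g) for g in groups])
s2 = np.array([g.var(ddof=1) for g in groups])    # within-group variances

# Crude estimates of the across-group mean and variance
mu0 = ybar.mean()
tau2 = max(ybar.var(ddof=1) - np.mean(s2 / n), 1e-6)

# Shrink each group mean toward mu0, more strongly when its sample is small/noisy
weight = tau2 / (tau2 + s2 / n)
shrunk = weight * ybar + (1 - weight) * mu0

print("unpooled RMSE:        ", np.sqrt(np.mean((ybar - true_means) ** 2)))
print("partially pooled RMSE:", np.sqrt(np.mean((shrunk - true_means) ** 2)))
```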
Benjamin Mako Hill: social scientist studying online communities using NLP
- NLP: subfield at the intersection of computing & human language
- Text as data: analyzing text to generate insight
- Text as Data: A New Framework for Machine Learning and the Social Sciences
- Data sources relevant to census: Govt and legal documents, annual reports, internal comms, transcripts, legislation, public records
- What can you do w/ text as data?
- Discovery: EDA, concept & hypothesis discovery
- Measurement: operationalizing a concept; measuring phenomena at scale
- Causal inference: assess effect of counterfactual intervention
- Text as treatment, outcome or confounder
- Emotional manipulation content on Facebook: ~7k participants, sentiment analysis LIWC (Linguistic Inquiry & Word Count), attempt to measure impact on text produced by participants given emotional intervention
- Topic modeling for cultural sociology: journalistic coverage of government funding for the arts
- Taboo topics on Wikipedia: taboo article classifier; Wikipedia’s social process of production is shaped by taboo
It’s complicated (but mostly about reducing dimensionality)
- Tokenization: split strings up (lots of ways to do this)
- Stop words: commonly used words filtered out in preprocessing ("the", "is", "at", etc.); not always done, and sometimes it matters
- Stemming: reducing words to stem / root form ({changing, changed, change} —> “chang”)
- Lemmatization: converting word to meaningful base form given context
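A small preprocessing sketch with NLTK (one of the Python libraries listed below); the sentence is made up, and the needed NLTK data packages (punkt, stopwords, wordnet) are assumed to be downloaded:

```python
# Tokenize, drop stop words, then stem vs. lemmatize (NLTK; example sentence invented)
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The rankings are changing how students choose colleges."

tokens = nltk.word_tokenize(text.lower())                   # tokenization
tokens = [t for t in tokens if t.isalpha()]                 # drop punctuation
content = [t for t in tokens if t not in stopwords.words("english")]  # stop words

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in content])                   # e.g. "changing" -> "chang"
print([lemmatizer.lemmatize(t, pos="v") for t in content])  # e.g. "changing" -> "change"
```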
- Vectorization:
- bag of words (simplest, ignores syntax / word order)
- N-grams: capture context and word order
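For example, bag-of-words and n-gram counts with scikit-learn's CountVectorizer (one of many ways to do this; the documents are invented):

```python
# Bag of words ignores order; adding bigrams recovers some local word order
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "the cat chased the mouse"]

bow = CountVectorizer()                        # unigram bag of words
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

bigrams = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
bigrams.fit(docs)
print(bigrams.get_feature_names_out())
```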
- Representing many documents
- Term/document matrix: term frequency vectors of multiple documents (often big & sparse)
- Term frequency-inverse document frequency (TF-IDF): relevance of a word to a document, compared to other documents
- Further reducing dimensionality: principal component analysis / singular value decomposition (often work well after TF-IDF weighting)
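A sketch of the pipeline from term/document matrix to TF-IDF weighting to a few reduced dimensions with scikit-learn; TruncatedSVD is used here because the matrix is sparse (toy documents only):

```python
# Term/document matrix -> TF-IDF weights -> SVD to a few latent dimensions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "arts funding debated in congress",
    "congress passes arts funding bill",
    "the team won the championship game",
    "fans celebrate the championship win",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                  # documents x terms; big & sparse in practice
print(X.shape)

svd = TruncatedSVD(n_components=2, random_state=0)
print(svd.fit_transform(X))                    # documents x 2 latent dimensions
```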
- Word embeddings
- “You shall know a word by the company it keeps” - Firth, 1957
- Represent a word based on words around it in small context window
- Examples: word2vec (predict word from context or context from word) & GloVe (dimensionality reduction of co-occurrence counts within a window)
- Intuition: words that are semantically similar (roughly) map to similar parts of feature space (be careful w/ this, manifolds are messy)
- Vector “arithmetic”: “king” - “man” + “woman” ≈ “queen”
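A quick embedding sketch using pretrained GloVe vectors via gensim's downloader (any pretrained embedding would do; the first call downloads the vectors):

```python
# Pretrained 50-dimensional GloVe vectors; nearby vectors ~ similar meanings
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

print(wv.most_similar("university", topn=3))   # neighbors in embedding space

# The classic analogy: king - man + woman is closest to queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```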
- Transformers:
- predict words in a sequence, trained on large corpora (that's it)
- Popular proprietary models create reproducibility problems for social sciences
- Lack of transparency in training data (also bad)
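One way to see "predict words in a sequence" in action without a proprietary model is masked-word prediction with the open Hugging Face transformers library (sentence invented; weights download on first run):

```python
# Masked-word prediction with an open model (bert-base-uncased)
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Rankings shape how students choose a [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```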
- Python: NLTK, spaCy
- R: tm, textTinyR
- Dictionary methods: LIWC, VADER, Harvard-IV
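For instance, VADER sentiment scoring through NLTK (assumes nltk.download("vader_lexicon") has been run; example sentences invented):

```python
# Dictionary-based sentiment: VADER scores each text against a fixed lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I love this class, the lectures are great!"))
print(sia.polarity_scores("The rankings are misleading and unfair."))
```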
- unsupervised methods: grouping like things into partitions (clustering, topic models)
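A toy topic-model example with scikit-learn's LatentDirichletAllocation (real applications need far more documents and careful preprocessing):

```python
# Tiny LDA topic model: group documents by co-occurring words, unsupervised
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "arts funding cut by the legislature",
    "museum grants and public arts funding",
    "the team won the championship game",
    "fans celebrate the championship victory",
]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", [terms[i] for i in topic.argsort()[-4:][::-1]])
```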
- supervised methods: use a small sample of documents to train, make predictions / classifications of other documents
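And a toy supervised example: fit a classifier on a handful of hand-labeled documents, then predict labels for new ones (labels and texts invented):

```python
# Supervised text classification: label a small training sample, predict the rest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "congress debates arts funding",
    "new grants for public museums",
    "the team won the final game",
    "star player scores in overtime",
]
train_labels = ["politics", "politics", "sports", "sports"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)
print(clf.predict(["senate passes a museum funding bill"]))
```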