
@BroaderImpact
Last active February 16, 2023 23:02
NDWA sample
{"cells":[{"attachments":{},"cell_type":"markdown","metadata":{},"source":["Since its inception, the National Domestic Workers Alliance has collected survey data about domestic workers and low-propensity voters of color. With the creation of a data department (myself and a former organizer), expanded capacity allowed this data to be used for predictive analytics. Two models emerged: one to measure the likelihood that a respondent is a domestic worker (as defined by participation in the care economy) and one to measure the likelihood of member engagement with the organization's primary purpose.\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# internal NDWA modules for the worker and member-engagement models\n","from ndwa import worker, meng"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["In the domestic worker identification project, a machine learning model was trained on a large dataset of voter, demographic, immigration, and consumer data to predict which respondents were most likely to identify as domestic workers. This model was then fine-tuned on a smaller dataset of voter, demographic, and consumer data from a 2020 COVID survey to predict which respondents were most likely to be domestic workers."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["The first step of the project was to preprocess the data: cleaning and normalizing it, then encoding categorical variables. The demographic and consumer data were used as features, and the target variable was whether the respondent answered “Yes” to the general survey question “Are you a domestic worker?”"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# load data\n","import numpy as np\n","import pandas as pd\n","\n","df_train = pd.read_csv('survey_training.csv')\n","df_scoring = pd.read_csv('survey_scoring.csv')\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# preview data\n","df_train.head()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# check for missing values\n","df_train.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# clean both datasets: fill missing values, then drop rows with non-finite entries\n","def clean_dataset(df):\n","    assert isinstance(df, pd.DataFrame), \"df needs to be a pd.DataFrame\"\n","    df.fillna(0, inplace=True)\n","    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)\n","    return df[indices_to_keep]\n","\n","df_train_clean = clean_dataset(df_train)\n","df_scoring_clean = clean_dataset(df_scoring)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn import preprocessing\n","\n","# define feature columns (categorical columns are assumed to be numerically encoded by this point)\n","feature_cols = ['immigration_status', 'zip_code', 'member_status', 'voter_propensity', 'race', 'ethnicity']\n","\n","# scale features: fit the scaler on training data, then apply the same transform to the scoring data\n","scaler = preprocessing.StandardScaler()\n","df_train_clean[feature_cols] = scaler.fit_transform(df_train_clean[feature_cols])\n","df_scoring_clean[feature_cols] = scaler.transform(df_scoring_clean[feature_cols])"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn import linear_model\n","from sklearn.model_selection import train_test_split\n","from sklearn.tree import DecisionTreeClassifier\n","\n","# define X and y\n","X = df_train_clean[feature_cols]\n","y = df_train_clean['domestic_worker']\n","\n","# split data into train and test sets\n","X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n","\n","# train logistic regression model\n","logreg = linear_model.LogisticRegression()\n","logreg.fit(X_train, y_train)\n","\n","# train decision tree model\n","dt = DecisionTreeClassifier()\n","dt.fit(X_train, y_train)"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["A machine learning model was then trained on the large dataset using a gradient boosting algorithm and fine-tuned on the smaller dataset from the 2020 COVID survey. The fine-tuning process involved re-training the model on the smaller dataset while keeping the pre-trained weights from the initial model as a starting point, allowing the model to adapt quickly to the new dataset while still leveraging the knowledge learned from the larger one."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["from sklearn.ensemble import GradientBoostingClassifier\n","\n","# train gradient boosting model\n","gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)\n","gb.score(X_test, y_test)"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["The fine-tuned model was then evaluated on a hold-out test set from the 2020 COVID survey and was found to outperform a model trained from scratch on the smaller dataset."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# score the fine-tuned model on the 2020 COVID survey hold-out set\n","# (X_covid, y_covid hold the cleaned, scaled features and labels from that survey)\n","y_covid_pred = gb.predict(X_covid)\n","\n","# model score for smaller dataset\n","gb.score(X_covid, y_covid)"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["This project demonstrates the effectiveness of transfer learning for tabular data: the domestic worker model was able to leverage the knowledge learned from the larger dataset to improve its performance on the smaller dataset from 2020."]}],"metadata":{"language_info":{"name":"python"},"orig_nbformat":4},"nbformat":4,"nbformat_minor":2}
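The notebook's preprocessing step mentions encoding categorical variables, but no cell shows it. A minimal sketch of that step with scikit-learn's `OrdinalEncoder`: the column names mirror the notebook's feature list, but the toy values here are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy frame standing in for the survey data (values are made up)
df = pd.DataFrame({
    "immigration_status": ["citizen", "visa", "citizen"],
    "race": ["Black", "Asian", "Black"],
    "voter_propensity": [0.2, 0.9, 0.5],
})

# encode the categorical columns to integer codes (categories are
# assigned codes in sorted order, e.g. "Asian" -> 0, "Black" -> 1)
cat_cols = ["immigration_status", "race"]
enc = OrdinalEncoder()
df[cat_cols] = enc.fit_transform(df[cat_cols])
```

For the linear models in the notebook, one-hot encoding (`pd.get_dummies` or `OneHotEncoder`) may be preferable to ordinal codes, since it avoids imposing an artificial ordering on categories such as race.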
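The notebook describes fine-tuning the gradient boosting model on the smaller survey while "keeping the pre-trained weights ... as a starting point," but the code cell only trains on the large dataset. One way to realize that in scikit-learn (an assumption about the method, not shown in the original) is `warm_start`, which keeps the already-fitted trees and grows additional ones on the new data. A sketch on synthetic data, with `make_classification` standing in for the NDWA datasets:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# one synthetic problem, split into a large "pre-training" set and a
# small "fine-tuning" set (stand-ins for the real survey datasets)
X, y = make_classification(n_samples=5300, n_features=6, n_informative=4, random_state=0)
X_large, y_large = X[:5000], y[:5000]
X_small, y_small = X[5000:], y[5000:]
X_ft, X_hold, y_ft, y_hold = train_test_split(X_small, y_small, test_size=0.3, random_state=42)

# base model trained on the large dataset; warm_start=True lets later
# calls to fit() reuse the already-grown trees instead of starting over
gb = GradientBoostingClassifier(n_estimators=100, max_depth=1, warm_start=True, random_state=0)
gb.fit(X_large, y_large)

# "fine-tune": raise n_estimators and refit on the small dataset; the
# 100 existing trees are kept and 50 new trees are fit to the new data
gb.n_estimators = 150
gb.fit(X_ft, y_ft)

print(gb.score(X_hold, y_hold))
```

The new trees are fit to residuals computed from the pre-trained ensemble's predictions on the fine-tuning data, which matches the notebook's description of adapting to the new dataset while leveraging what was learned from the larger one.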