@georgehc
Last active November 7, 2025 04:21
{
"cells": [
{
"cell_type": "markdown",
"id": "22cbf3c0",
"metadata": {},
"source": [
"# UDA Preprocessing the 20 Newsgroups Dataset\n",
"\n",
"Author: George H. Chen (georgechen [at symbol] cmu.edu)\n",
"\n",
"We will be using the standard 20 Newsgroups dataset for a few upcoming demos. In this notebook, we do some simple preprocessing and save the results. Subsequent demos will load this preprocessed version of the dataset.\n",
"\n",
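"We begin by loading 10,000 posts from a shuffled copy of the 20 Newsgroups dataset (scikit-learn's default training split, shuffled with a fixed random seed), with headers, footers, and quoted replies removed."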
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a7959639",
"metadata": {},
"outputs": [],
"source": [
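"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"num_articles = 10000\n",
"\n",
"# Shuffle with a fixed seed for reproducibility, and strip headers, footers,\n",
"# and quoted replies so that only each post's own text remains\n",
"dataset = fetch_20newsgroups(shuffle=True, random_state=0,\n",
"                             remove=('headers', 'footers', 'quotes'))\n",
"raw_docs = dataset.data[:num_articles]  # keep the first 10,000 shuffled posts"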
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f7fb9cc7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"list"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(raw_docs)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d3b4f579",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10000"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(raw_docs)"
]
},
{
"cell_type": "markdown",
"id": "0d379c00",
"metadata": {},
"source": [
"We can look at an example post:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0b3c926e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"Koberg,\n",
"\n",
"\tJust a couple of minor corrections here...\n",
"\n",
"\t1) The Churches of Christ do not usually believe in speaking in\n",
"tongues, in fact many of them are known for being strongly opposed to\n",
"Pentecostal teaching. You are probably thinking of Church of God in\n",
"Christ, the largest African-American Pentecostal denomination.\n",
"\n",
"\t2) I'm not sure what you mean by \"signifying believers\" but it\n",
"should be pointed out that the Assemblies of God does not now, nor has it\n",
"ever, held that speaking in tongues is the sign that one is a Christian. \n",
"The doctrine that traditional Pentecostals (including the A/G) maintain is\n",
"that speaking in tongues is the sign of a second experience after becoming\n",
"a Christian in which one is \"Baptized in the Holy Spirit\" That may be\n",
"what you were referring to, but I point this out because Pentecostals are\n",
"frequently labeled as believing that you have to speak in tongues in order\n",
"to be a Christian. Such a position is only held by some groups and not the\n",
"majority of Pentecostals. Many Pentecostals will quote the passage in\n",
"Mark 16 about \"these signs following them that believe\" but they generally\n",
"do not interpret this as meaning if you don't pactice the signs you aren't\n",
"\"saved\".\n",
"\n",
"\t3) I know it's hard to summarize the beliefs of a movement that\n",
"has such diversity, but I think you've made some pretty big\n",
"generalizations here. Do \"Neo-Pentecostals\" only believe in tongues as a\n",
"sign and tongues as prayer but NOT tongues as revelatory with a message? \n",
"I've never heard of that before. In fact I would have characterized them\n",
"as believing the same as Pentecostals except less likely to see tongues as\n",
"a sign of Spirit Baptism. Also, while neo-Pentecostals may not be\n",
"inclined to speak in tongues in the non-Pentecostal churches they attend,\n",
"they do have their own meetings and, in many cases, a whole church will be\n",
"charismatic.\n"
]
}
],
"source": [
"print(raw_docs[0])"
]
},
{
"cell_type": "markdown",
"id": "9beb68de",
"metadata": {},
"source": [
"Next, we do some simple preprocessing: using spaCy, we replace every word with its lemma (for example, *speaking* becomes *speak* and *tongues* becomes *tongue*):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "66218826",
"metadata": {},
"outputs": [],
"source": [
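"import spacy\n",
"\n",
"# Disable the parser and named entity recognizer, which we do not use here\n",
"nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])\n",
"\n",
"def lemmatize(text):\n",
"    # Replace each token with its lemma and rejoin the tokens with single spaces\n",
"    return ' '.join([token.lemma_ for token in nlp(text)])"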
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "05213eac",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [01:43<00:00, 96.78it/s]\n"
]
}
],
"source": [
"from tqdm import tqdm\n",
"\n",
"# Lemmatize each post, showing a progress bar\n",
"docs = []\n",
"for raw_doc in tqdm(raw_docs):\n",
"    docs.append(lemmatize(raw_doc))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f90fc67d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" Koberg , \n",
"\n",
"\t just a couple of minor correction here ... \n",
"\n",
"\t 1 ) the church of Christ do not usually believe in speak in \n",
" tongue , in fact many of they be know for be strongly oppose to \n",
" pentecostal teaching . you be probably think of Church of God in \n",
" Christ , the large african - american Pentecostal denomination . \n",
"\n",
"\t 2 ) I be not sure what you mean by \" signify believer \" but it \n",
" should be point out that the assembly of God do not now , nor have it \n",
" ever , hold that speak in tongue be the sign that one be a Christian . \n",
" the doctrine that traditional pentecostal ( include the A / g ) maintain be \n",
" that speak in tongue be the sign of a second experience after become \n",
" a Christian in which one be \" baptize in the Holy Spirit \" that may be \n",
" what you be refer to , but I point this out because pentecostal be \n",
" frequently label as believe that you have to speak in tongue in order \n",
" to be a Christian . such a position be only hold by some group and not the \n",
" majority of Pentecostals . many Pentecostals will quote the passage in \n",
" Mark 16 about \" these sign follow they that believe \" but they generally \n",
" do not interpret this as mean if you do not pactice the sign you be not \n",
" \" save \" . \n",
"\n",
"\t 3 ) I know it be hard to summarize the belief of a movement that \n",
" have such diversity , but I think you have make some pretty big \n",
" generalization here . do \" Neo - Pentecostals \" only believe in tongue as a \n",
" sign and tongue as prayer but not tongue as revelatory with a message ? \n",
" I have never hear of that before . in fact I would have characterize they \n",
" as believe the same as pentecostal except less likely to see tongue as \n",
" a sign of Spirit Baptism . also , while neo - Pentecostals may not be \n",
" incline to speak in tongue in the non - pentecostal church they attend , \n",
" they do have their own meeting and , in many case , a whole church will be \n",
" charismatic .\n"
]
}
],
"source": [
"print(docs[0])"
]
},
{
"cell_type": "markdown",
"id": "a361b605",
"metadata": {},
"source": [
"Now we save the lemmatized text to disk using `pickle`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "78ffecd3",
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"# Note: this writes a plain (uncompressed) pickle, despite the .xz extension\n",
"with open('lemmatized_text.xz', 'wb') as f:\n",
"    pickle.dump(docs, f)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:base] *",
"language": "python",
"name": "conda-base-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}