@rll307
Last active September 30, 2019 09:58
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexical analysis\n",
"The purpose of this notebook is to make available the methodological steps I used in the following article:\n",
"\n",
"Lima-Lopes, RE de. **The Reaction to Social Quotas: A study of Facebook Comments in Brazilian Portuguese**.\n",
"\n",
"The paper was submitted to a major Brazilian journal, and the reference will be updated when it is published."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Collection\n",
"Data were collected using the software [Netvizz](https://wiki.digitalmethods.net/Dmi/ToolNetvizz), which scraped data from Facebook pages and communities. It could retrieve posts, comments on posts, general page statistics and the posts published in a given period. It only worked with pages whose status was set to public and, by default, it anonymised the username field when generating a \\*.tab file. The software has since been discontinued, as [Facebook](http://www.facebook.com) adopted a more conservative data-scraping policy.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objective\n",
"- To study grammatical patterns in users' comments on UFBA's announcement that its social quota programme would include immigrants, refugees and transsexual people.\n",
"- It is believed that the analysis of lexis might reveal some interesting characteristics of the discourse of these comments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Packages\n",
"library(tm)\n",
"library(plyr)\n",
"library(readtext)\n",
"library(RColorBrewer)\n",
"library(FactoMineR)\n",
"library(ggplot2)\n",
"library(readr)\n",
"library(tidyverse)\n",
"library(quanteda)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Stopwords\n",
"my.stopwords <- read_csv(\"stop_port2.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Reading the comment files (Netvizz provided tab-delimited tables)\n",
"## It generated 4 files, one for each round of data collection; three of them are read below.\n",
"## Restrictions on the number of comments were part of the Netvizz/Facebook interaction\n",
"comentarios_01 <- read_csv(\"comentarios_01.csv\")\n",
"comentarios_02 <- read_csv(\"comentarios_02.csv\")\n",
"comentarios_03 <- read_csv(\"comentarios_03.csv\")\n",
"base1 <- rbind(comentarios_01,comentarios_02,comentarios_03)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Creating a corpus using *tm*\n",
"text_string <- as.character(base1$comment_message)\n",
"corpus.cluster <- Corpus(VectorSource(text_string))\n",
"corpus.cluster <- tm_map(corpus.cluster, content_transformer(tolower))\n",
"#Helper functions that strip URLs and user mentions\n",
"removeURL <- function(x) gsub(\"http[[:alnum:][:punct:]]*\", \"\", x)\n",
"remove.users <- function(x) gsub(\"@[[:alnum:][:punct:]]*\", \"\", x)\n",
"corpus.cluster <- tm_map(corpus.cluster, content_transformer(removeURL))\n",
"corpus.cluster <- tm_map(corpus.cluster, content_transformer(remove.users))\n",
"corpus.cluster <- tm_map(corpus.cluster, stripWhitespace)\n",
"corpus.cluster <- tm_map(corpus.cluster, removePunctuation)\n",
"#Removing stopwords; assumes the first column of my.stopwords holds the extra words\n",
"corpus.cluster <- tm_map(corpus.cluster, removeWords,\n",
" c(stopwords(\"pt\"), my.stopwords[[1]]))"
]
},
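{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two regular expressions used in the cleaning step can be checked on a toy vector before they are applied to the whole corpus. The sketch below is illustrative only: `toy.comments`, `strip.url` and `strip.user` are hypothetical names, and the sample comments are invented for the demonstration. Note that the user pattern also swallows punctuation attached to the mention."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Toy check of the URL and user-mention patterns (invented data, not from the corpus)\n",
"toy.comments <- c(\"Veja http://example.com/noticia agora\",\n",
" \"Concordo com @fulano, com certeza\")\n",
"strip.url <- function(x) gsub(\"http[[:alnum:][:punct:]]*\", \"\", x)\n",
"strip.user <- function(x) gsub(\"@[[:alnum:][:punct:]]*\", \"\", x)\n",
"toy.clean <- strip.user(strip.url(tolower(toy.comments)))\n",
"#Collapsing the double spaces left behind by the removals\n",
"toy.clean <- gsub(\"[[:space:]]+\", \" \", trimws(toy.clean))\n",
"print(toy.clean)"
]
},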
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Creating a matrix of terms\n",
"cluster.tdm <- TermDocumentMatrix(corpus.cluster)\n",
"#Inspecting a sample of the term-document matrix\n",
"cluster.df <- as.data.frame(inspect(cluster.tdm))\n",
"#Converting the term-document matrix to a regular matrix\n",
"cluster.m <- as.matrix(cluster.tdm)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Word frequencies across all comments\n",
"cluster.wf <- rowSums(cluster.m)\n",
"#Keeping only the most frequent words (above the 99th percentile)\n",
"cluster.m1 <- cluster.m[cluster.wf>quantile(cluster.wf,probs=0.99), ]\n",
"#Removing empty (all-zero) columns\n",
"cluster.m1 <- cluster.m1[,colSums(cluster.m1)!=0]\n",
"#Creating binary relationships\n",
"cluster.m1[cluster.m1 > 1] <- 1\n",
"cluster.m1dist <- dist(cluster.m1, method=\"binary\")"
]
},
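{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the binarisation and the `dist(..., method = \"binary\")` step concrete, here is a minimal sketch on an invented term-by-document matrix. All names (`toy.m`, `toy.dist`, `toy.clus`), terms and counts are hypothetical. The use of `hclust` with its default complete linkage is an assumption made for the illustration, not necessarily the method used in the article. Two words that occur in exactly the same documents get a distance of 0, and two words that never co-occur get a distance of 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Invented term-by-document counts (rows are words, columns are comments)\n",
"toy.m <- matrix(c(2, 0, 1, 0,\n",
" 1, 0, 3, 0,\n",
" 0, 4, 0, 1),\n",
" nrow = 3, byrow = TRUE,\n",
" dimnames = list(c(\"cotas\", \"vagas\", \"nordeste\"), paste0(\"doc\", 1:4)))\n",
"#Presence/absence instead of raw counts\n",
"toy.m[toy.m > 1] <- 1\n",
"#Binary (Jaccard-type) distance between words\n",
"toy.dist <- dist(toy.m, method = \"binary\")\n",
"print(toy.dist)\n",
"#Hierarchical clustering (default complete linkage, an assumption) cut into two groups\n",
"toy.clus <- hclust(toy.dist)\n",
"toy.groups <- cutree(toy.clus, k = 2)\n",
"print(toy.groups)"
]
},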
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Creating the coloured dendrogram\n",
"#Hierarchical clustering; the linkage method is an assumption (hclust default, complete)\n",
"clus1 <- hclust(cluster.m1dist)\n",
"dend <- as.dendrogram(clus1)\n",
"labelColors <- c(\"#809acd\", \"#000000\", \"#EB6841\", \"#666666\",\"#80cdb3\", \n",
" \"#c5ab8a\",\"#ffa500\",\"#0000ff\", \"#523415\", \"#b882ee\")\n",
"clusMember <- cutree(clus1, 10)\n",
"#Colouring each leaf label according to its cluster\n",
"colLab <- function(n) {\n",
" if (is.leaf(n)) {\n",
" a <- attributes(n)\n",
" labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]\n",
" attr(n, \"nodePar\") <- c(a$nodePar, lab.col = labCol)\n",
" }\n",
" n\n",
"}\n",
"clusDendro <- dendrapply(dend, colLab)\n",
"plot(clusDendro,cex=0.9)\n",
"#rect.hclust() expects the hclust object rather than the dendrogram\n",
"rect.hclust(clus1,k=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the texts were processed in a dendrogram, the comments were classified into four types:\n",
"- In favour of the quota system\n",
"- Against the quota system\n",
"- Interaction amongst users\n",
"- Discrimination and racism against Northeast Brazilian citizens"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Visualizing the number of words in each kind of discourse\n",
"\n",
"polaridade.raw <- read.csv(\"polaridade_geral.csv\", stringsAsFactors = FALSE, fileEncoding = \"UTF-8\")\n",
"View(polaridade.raw)\n",
"\n",
"##Selecting columns\n",
"polaridade.raw <- polaridade.raw[, 1:2]\n",
"names(polaridade.raw) <- c(\"Label\", \"Text\")\n",
"View(polaridade.raw)\n",
"\n",
"##Converting classes into factor values\n",
"polaridade.raw$Label <- as.factor(polaridade.raw$Label)\n",
"\n",
"#Observing the proportion of each theme\n",
"prop.table(table(polaridade.raw$Label))\n",
"polaridade.raw$TextLength <- nchar(polaridade.raw$Text)\n",
"summary(polaridade.raw$TextLength)\n",
"\n",
"#Plotting text length by label\n",
"#A histogram is assumed here: the y axis is a count, and geom_quantile() would need a y aesthetic\n",
"ggplot(polaridade.raw, aes(x = TextLength, fill = Label)) +\n",
" theme_bw() +\n",
" geom_histogram(bins = 30) +\n",
" labs(caption=\"Source: Data\", y = \"Text Count\", x = \"Length of Text\")"
]
},
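{
"cell_type": "markdown",
"metadata": {},
"source": [
"The class proportions and text lengths computed above can be illustrated with a small invented data frame. The labels and texts below are hypothetical stand-ins for the contents of `polaridade_geral.csv`, and `toy.pol`, `toy.prop` and `toy.mean` are names introduced only for this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Invented labelled comments (not taken from the corpus)\n",
"toy.pol <- data.frame(Label = c(\"favour\", \"against\", \"against\", \"interaction\"),\n",
" Text = c(\"apoio total\", \"sou contra\", \"absurdo\", \"leia de novo\"),\n",
" stringsAsFactors = FALSE)\n",
"toy.pol$Label <- as.factor(toy.pol$Label)\n",
"#Share of comments in each class\n",
"toy.prop <- prop.table(table(toy.pol$Label))\n",
"print(toy.prop)\n",
"#Length of each comment in characters, then the mean length per class\n",
"toy.pol$TextLength <- nchar(toy.pol$Text)\n",
"toy.mean <- tapply(toy.pol$TextLength, toy.pol$Label, mean)\n",
"print(toy.mean)"
]
},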
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Creating the corpus for concordancing\n",
"trans <- corpus(text_string)\n",
"trans.tokens <- tokens(trans, remove_punct = TRUE, \n",
" remove_numbers = TRUE, remove_url = TRUE)\n",
"#General command for concordancing, run on the tokens (replace \"x.*\" with the search pattern)\n",
"x.kwic <- kwic(trans.tokens, pattern = \"x.*\", window = 25, \n",
" case_insensitive=TRUE, valuetype = \"regex\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}