Last active
September 30, 2019 09:58
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexical analysis\n",
"The purpose of this notebook is to make available the methodological steps I used in the following article:\n",
"\n",
"Lima-Lopes, RE de. **The Reaction to Social Quotas: A study of Facebook Comments in Brazilian Portuguese**.\n",
"\n",
"The paper was submitted to a major Brazilian journal and the reference will be updated when it is published."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Collection\n",
"Data were collected with [Netvizz](https://wiki.digitalmethods.net/Dmi/ToolNetvizz), a tool that was used to scrape data from Facebook’s pages and communities. It could retrieve information such as posts, comments on posts, general page statistics and the posts published in a given period. It only worked with pages whose status was set to public and, by default, it anonymised the username field when generating a \\*.tab file. The software has since been discontinued because [Facebook](http://www.facebook.com) adopted a more restrictive data-scraping policy.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objective\n",
"- To study grammatical patterns in users' comments on UFBA's announcement that its social quota programme would include immigrants, refugees and transsexual people.\n",
"- It is believed that the analysis of lexis might reveal some interesting characteristics of the discourse of these comments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Packages\n",
"library(tm)\n",
"library(plyr)\n",
"library(readtext)\n",
"library(RColorBrewer)\n",
"library(FactoMineR)\n",
"library(ggplot2)\n",
"library(readr)\n",
"library(tidyverse)\n",
"library(quanteda)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Stopwords\n",
"my.stopwords <- read_csv(\"stop_port2.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Reading the comment files (Netvizz provided tab-delimited \\*.tab tables)\n",
"## It generated 4 files, one for each time the comments had to be collected.\n",
"## The restrictions on the number of comments per collection came from the Netvizz/Facebook interaction\n",
"comentarios_01 <- read_csv(\"comentarios_01.csv\")\n",
"comentarios_02 <- read_csv(\"comentarios_02.csv\")\n",
"comentarios_03 <- read_csv(\"comentarios_03.csv\")\n",
"base1 <- rbind(comentarios_01, comentarios_02, comentarios_03)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Creating a corpus using *tm*\n",
"text_string <- as.character(base1$comment_message)\n",
"corpus.cluster <- Corpus(VectorSource(text_string))\n",
"corpus.cluster <- tm_map(corpus.cluster, content_transformer(tolower))\n",
"#Helpers that delete URLs and @user mentions\n",
"removeURL <- function(x) gsub(\"http[[:alnum:][:punct:]]*\", \"\", x)\n",
"remove.users <- function(x) gsub(\"@[[:alnum:][:punct:]]*\", \"\", x)\n",
"corpus.cluster <- tm_map(corpus.cluster, content_transformer(removeURL))\n",
"corpus.cluster <- tm_map(corpus.cluster, content_transformer(remove.users))\n",
"corpus.cluster <- tm_map(corpus.cluster, stripWhitespace)\n",
"corpus.cluster <- tm_map(corpus.cluster, removePunctuation)\n",
"#Removing stopwords; assumes the custom list is in the first column of my.stopwords\n",
"corpus.cluster <- tm_map(corpus.cluster, removeWords,\n",
"                         c(stopwords(\"pt\"), my.stopwords[[1]]))"
]
},
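{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two cleaning helpers above can be tried on a single string before they are applied to the whole corpus. The cell below is a minimal sketch in base R only. The sample comment and the object names `sample.comment` and `cleaned` are invented for illustration and are not part of the original data, and the last `gsub()` call is only a stand-in for what `stripWhitespace` does to a document."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative only: the sample text below is made up, not taken from the corpus\n",
"removeURL <- function(x) gsub(\"http[[:alnum:][:punct:]]*\", \"\", x)\n",
"remove.users <- function(x) gsub(\"@[[:alnum:][:punct:]]*\", \"\", x)\n",
"sample.comment <- \"Veja http://example.com/noticia @fulano concordo com as cotas\"\n",
"#Lowercase first, then drop the URL and the user mention\n",
"cleaned <- tolower(sample.comment)\n",
"cleaned <- removeURL(cleaned)\n",
"cleaned <- remove.users(cleaned)\n",
"#Collapse the runs of spaces left behind by the deletions\n",
"cleaned <- gsub(\"[[:space:]]+\", \" \", cleaned)\n",
"print(cleaned)"
]
},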
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Creating a matrix of terms\n",
"cluster.tdm <- TermDocumentMatrix(corpus.cluster)\n",
"#Inspecting a sample of the term-document matrix as a data frame\n",
"cluster.df <- as.data.frame(inspect(cluster.tdm))\n",
"#Converting the term-document matrix to a regular matrix\n",
"cluster.m <- as.matrix(cluster.tdm)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Total frequency of each term across all comments\n",
"cluster.wf <- rowSums(cluster.m)\n",
"#Keeping only the most frequent terms (above the 99th percentile of frequency)\n",
"cluster.m1 <- cluster.m[cluster.wf > quantile(cluster.wf, probs = 0.99), ]\n",
"#Removing the comments (columns) left with none of these terms\n",
"cluster.m1 <- cluster.m1[, colSums(cluster.m1) != 0]\n",
"#Converting counts to presence/absence and computing binary distances between terms\n",
"cluster.m1[cluster.m1 > 1] <- 1\n",
"cluster.m1dist <- dist(cluster.m1, method = \"binary\")"
]
},
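{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the binary distance used above more concrete, the cell below builds a tiny term-document matrix by hand and clusters it. This is only a sketch: the terms, the counts and the object names (`toy.m`, `toy.dist`, `toy.clus`, `toy.groups`) are invented for illustration, and `hclust()` is called with its default complete linkage, which is an assumption rather than a setting documented in this notebook. With `method = \"binary\"`, the distance between two terms is the share of comments containing only one of them among the comments containing at least one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative only: three made-up terms counted in four made-up comments\n",
"toy.m <- matrix(c(2, 0, 1, 0,\n",
"                  1, 1, 3, 0,\n",
"                  0, 1, 0, 2),\n",
"                nrow = 3, byrow = TRUE,\n",
"                dimnames = list(c(\"cotas\", \"universidade\", \"refugiados\"),\n",
"                                paste0(\"doc\", 1:4)))\n",
"#Presence/absence instead of counts\n",
"toy.m[toy.m > 1] <- 1\n",
"#Binary distance between terms (rows)\n",
"toy.dist <- dist(toy.m, method = \"binary\")\n",
"#Hierarchical clustering with the default (complete) linkage, an assumption\n",
"toy.clus <- hclust(toy.dist)\n",
"#Cutting the tree into two groups of terms\n",
"toy.groups <- cutree(toy.clus, k = 2)\n",
"print(round(as.matrix(toy.dist), 2))\n",
"print(toy.groups)"
]
},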
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Clustering the terms and creating the colour dendrogram\n",
"#Assumption: hclust() with its default (complete) linkage; the linkage method is not stated in this notebook\n",
"clus1 <- hclust(cluster.m1dist)\n",
"dend <- as.dendrogram(clus1)\n",
"labelColors <- c(\"#809acd\", \"#000000\", \"#EB6841\", \"#666666\",\"#80cdb3\", \n",
"                 \"#c5ab8a\",\"#ffa500\",\"#0000ff\", \"#523415\", \"#b882ee\")\n",
"#One colour per cluster (10 clusters)\n",
"clusMember <- cutree(clus1, 10)\n",
"#Colouring each leaf label according to its cluster\n",
"colLab <- function(n) {\n",
"  if (is.leaf(n)) {\n",
"    a <- attributes(n)\n",
"    labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]\n",
"    attr(n, \"nodePar\") <- c(a$nodePar, lab.col = labCol)\n",
"  }\n",
"  n\n",
"}\n",
"clusDendro <- dendrapply(dend, colLab)\n",
"plot(clusDendro, cex = 0.9)\n",
"#rect.hclust() expects the hclust object, not the dendrogram\n",
"rect.hclust(clus1, k = 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the texts were processed in a dendrogram, comments were classified into four types:\n",
"- In favour of the quotas system\n",
"- Against the quotas system\n",
"- Interaction amongst users\n",
"- Discrimination and racism against Northeast Brazilian citizens"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Visualizing the length of the texts in each kind of discourse\n",
"\n",
"polaridade.raw <- read.csv(\"polaridade_geral.csv\", stringsAsFactors = FALSE, fileEncoding = \"UTF-8\")\n",
"View(polaridade.raw)\n",
"\n",
"##Selecting columns\n",
"polaridade.raw <- polaridade.raw[, 1:2]\n",
"names(polaridade.raw) <- c(\"Label\", \"Text\")\n",
"View(polaridade.raw)\n",
"\n",
"##Converting the labels to a factor\n",
"polaridade.raw$Label <- as.factor(polaridade.raw$Label)\n",
"\n",
"#Observing the proportion of each label and the length (in characters) of each text\n",
"prop.table(table(polaridade.raw$Label))\n",
"polaridade.raw$TextLength <- nchar(polaridade.raw$Text)\n",
"summary(polaridade.raw$TextLength)\n",
"\n",
"#Plotting\n",
"#geom_quantile() needs a y aesthetic, so a histogram of text lengths is used here instead;\n",
"#this matches the \"Text Count\" axis label and is an assumption about the intended plot\n",
"ggplot(polaridade.raw, aes(x = TextLength, fill = Label)) +\n",
"  theme_bw() +\n",
"  geom_histogram(bins = 30) +\n",
"  labs(caption = \"Source: Data\", y = \"Text Count\", x = \"Length of Text\")"
]
},
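{
"cell_type": "markdown",
"metadata": {},
"source": [
"The label proportions and text lengths computed above can be checked on a small hand-made table before running them on the real file. The cell below is a sketch under stated assumptions: the data frame `toy`, its four texts and its label names are invented for illustration and do not come from `polaridade_geral.csv`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative only: four made-up comments with made-up labels\n",
"toy <- data.frame(Label = c(\"favour\", \"against\", \"against\", \"interaction\"),\n",
"                  Text = c(\"apoio as cotas\", \"sou contra\", \"contra isso\", \"oi\"),\n",
"                  stringsAsFactors = FALSE)\n",
"toy$Label <- as.factor(toy$Label)\n",
"#Share of comments in each label\n",
"toy.prop <- prop.table(table(toy$Label))\n",
"#Length of each text in characters\n",
"toy$TextLength <- nchar(toy$Text)\n",
"#Mean text length per label\n",
"toy.mean <- tapply(toy$TextLength, toy$Label, mean)\n",
"print(toy.prop)\n",
"print(toy.mean)"
]
},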
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Creating the corpus for concordancing\n",
"trans <- corpus(text_string)\n",
"trans.tokens <- tokens(trans, remove_punct = TRUE, \n",
"                       remove_numbers = TRUE, remove_url = TRUE)\n",
"#General command for concordancing; replace \"x.*\" with the pattern of interest\n",
"x.kwic <- kwic(trans.tokens, pattern = \"x.*\", window = 25, \n",
"               case_insensitive = TRUE, valuetype = \"regex\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}