rll307/Vagas_trans.ipynb

## Vagas_trans.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lexical analysis\n",
    "The purpose of this notebook is to make available the methodological steps I used in the following article:\n",
    "\n",
    "Lima-Lopes, RE de. **The Reaction to Social Quotas: A study of Facebook Comments in Brazilian Portuguese**.\n",
    "\n",
    "The paper was submitted to a major Brazilian journal, and the reference will be updated when it is published."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Collection\n",
    "Data was collected using the software [Netvizz](https://wiki.digitalmethods.net/Dmi/ToolNetvizz), which was used to scrape data from Facebook’s pages and communities. The software could retrieve information such as posts, comments on posts, general statistics of a page, and posts in a given period. It only worked with pages that had set their status to public and, by default, anonymised the username field when it generated a \\*.tab file. The software has since been discontinued because [Facebook](http://www.facebook.com) adopted a more conservative data-scraping policy.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Objective\n",
    "- To study grammatical patterns in users' comments on UFBA's announcement that its social quota programme would include immigrants, refugees and transsexual people.\n",
    "- It is believed that the analysis of lexis might reveal some interesting characteristics of the discourse of these comments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Packages\n",
    "library(tm)\n",
    "library(plyr)\n",
    "library(readtext)\n",
    "library(RColorBrewer)\n",
    "library(FactoMineR)\n",
    "library(ggplot2)\n",
    "library(readr)\n",
    "library(tidyverse)\n",
    "library(quanteda)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Stopwords\n",
    "my.stopwords <- read_csv(\"stop_port2.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Reading the comment files (Netvizz would provide tab-delimited \\*.tab tables)\n",
    "## Netvizz generated 4 files, one for each time I had to collect the comments.\n",
    "## Restrictions on the number of comments were part of the Netvizz/Facebook interaction\n",
    "comentarios_01 <- read_csv(\"comentarios_01.csv\")\n",
    "comentarios_02 <- read_csv(\"comentarios_02.csv\")\n",
    "comentarios_03 <- read_csv(\"comentarios_03.csv\")\n",
    "#Stacking the files into a single table\n",
    "base1 <- rbind(comentarios_01, comentarios_02, comentarios_03)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Creating a corpus using *tm*\n",
    "text_string <- as.character(base1$comment_message)\n",
    "corpus.cluster <- Corpus(VectorSource(text_string))\n",
    "corpus.cluster <- tm_map(corpus.cluster, content_transformer(tolower))\n",
    "#Helpers that strip URLs and @user mentions\n",
    "removeURL <- function(x) gsub(\"http[[:alnum:][:punct:]]*\", \"\", x)\n",
    "remove.users <- function(x) gsub(\"@[[:alnum:][:punct:]]*\", \"\", x)\n",
    "corpus.cluster <- tm_map(corpus.cluster, content_transformer(removeURL))\n",
    "corpus.cluster <- tm_map(corpus.cluster, content_transformer(remove.users))\n",
    "corpus.cluster <- tm_map(corpus.cluster, stripWhitespace)\n",
    "corpus.cluster <- tm_map(corpus.cluster, removePunctuation)\n",
    "#Removing stopwords; my.stopwords is a data frame, so its first column is assumed to hold the words\n",
    "corpus.cluster <- tm_map(corpus.cluster, removeWords,\n",
    "                         c(stopwords(\"pt\"), my.stopwords[[1]]))"
   ]
  },
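  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of what the URL and user-mention patterns in the cell above remove, applied to a made-up comment. The sample string and the names `toy.comment` and `cleaned` are assumptions for illustration, and the space-collapsing `gsub` is only a base R stand-in for `stripWhitespace`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Same patterns as the helpers above\n",
    "removeURL <- function(x) gsub(\"http[[:alnum:][:punct:]]*\", \"\", x)\n",
    "remove.users <- function(x) gsub(\"@[[:alnum:][:punct:]]*\", \"\", x)\n",
    "#Made-up comment, not taken from the data\n",
    "toy.comment <- \"veja http://exemplo.com/pagina @fulano concordo\"\n",
    "cleaned <- remove.users(removeURL(tolower(toy.comment)))\n",
    "#Collapsing the runs of spaces left behind by the removals\n",
    "cleaned <- gsub(\" +\", \" \", cleaned)\n",
    "cleaned"
   ]
  },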
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Creating a matrix of terms\n",
    "cluster.tdm <- TermDocumentMatrix(corpus.cluster)\n",
    "#Inspecting the term-document matrix\n",
    "cluster.df <- as.data.frame(inspect(cluster.tdm))\n",
    "#Converting the term-document matrix to a regular matrix\n",
    "cluster.m <- as.matrix(cluster.tdm)"
   ]
  },
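  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Word frequencies across all comments\n",
    "cluster.wf <- rowSums(cluster.m)\n",
    "#Keeping only the most frequent words (frequency above the 99th percentile)\n",
    "cluster.m1 <- cluster.m[cluster.wf > quantile(cluster.wf, probs = 0.99), ]\n",
    "#Removing all-zero columns\n",
    "cluster.m1 <- cluster.m1[, colSums(cluster.m1) != 0]\n",
    "#Creating binary relationships\n",
    "cluster.m1[cluster.m1 > 1] <- 1\n",
    "cluster.m1dist <- dist(cluster.m1, method = \"binary\")\n",
    "#Hierarchical clustering; the linkage method is not recorded here, so hclust's default (complete linkage) is an assumption\n",
    "clus1 <- hclust(cluster.m1dist)"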
   ]
  },
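  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small worked example of the binary distance used above, on a hypothetical matrix of 3 terms by 4 documents. For two terms, `dist(..., method = \"binary\")` gives the share of documents containing exactly one of them among the documents containing at least one. The toy terms, the matrix values and the `hclust` call with its default complete linkage are assumptions for illustration, not the data or settings of the study."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Made-up term-document matrix: rows are terms, columns are documents\n",
    "toy.m <- matrix(c(1, 1, 0, 0,\n",
    "                  1, 1, 1, 0,\n",
    "                  0, 0, 1, 1),\n",
    "                nrow = 3, byrow = TRUE,\n",
    "                dimnames = list(c(\"cotas\", \"universidade\", \"vagas\"),\n",
    "                                paste0(\"doc\", 1:4)))\n",
    "#Binary distance between terms, based on the documents they occur in\n",
    "toy.dist <- dist(toy.m, method = \"binary\")\n",
    "toy.dist\n",
    "#Agglomerative clustering on that distance (default complete linkage, an assumption)\n",
    "toy.clus <- hclust(toy.dist)\n",
    "#Cutting the tree into two groups of terms\n",
    "cutree(toy.clus, k = 2)"
   ]
  },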
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Creating the coloured dendrogram\n",
    "dend <- as.dendrogram(clus1)\n",
    "labelColors <- c(\"#809acd\", \"#000000\", \"#EB6841\", \"#666666\",\"#80cdb3\", \n",
    "                \"#c5ab8a\",\"#ffa500\",\"#0000ff\", \"#523415\", \"#b882ee\")\n",
    "#Assigning each word to one of 10 clusters\n",
    "clusMember <- cutree(clus1, 10)\n",
    "#Colouring each leaf label according to its cluster\n",
    "colLab <- function(n) {\n",
    "  if (is.leaf(n)) {\n",
    "    a <- attributes(n)\n",
    "    labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]\n",
    "    attr(n, \"nodePar\") <- c(a$nodePar, lab.col = labCol)\n",
    "  }\n",
    "  n\n",
    "}\n",
    "clusDendro <- dendrapply(dend, colLab)\n",
    "plot(clusDendro, cex = 0.9)\n",
    "#rect.hclust() expects the hclust object rather than the dendrogram\n",
    "rect.hclust(clus1, k = 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the texts were processed into a dendrogram, the comments were classified into four types:\n",
    "- In favour of the quotas system\n",
    "- Against the quotas system\n",
    "- Interaction amongst users\n",
    "- Discrimination and racism against Northeast Brazilian citizens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Visualizing the length of the comments in each kind of discourse\n",
    "\n",
    "polaridade.raw <- read.csv(\"polaridade_geral.csv\", stringsAsFactors = FALSE, fileEncoding = \"UTF-8\")\n",
    "head(polaridade.raw)\n",
    "\n",
    "##Selecting columns\n",
    "polaridade.raw <- polaridade.raw[, 1:2]\n",
    "names(polaridade.raw) <- c(\"Label\", \"Text\")\n",
    "head(polaridade.raw)\n",
    "\n",
    "##Converting the classes to a factor\n",
    "polaridade.raw$Label <- as.factor(polaridade.raw$Label)\n",
    "\n",
    "#Observing the proportion of each theme and the length (in characters) of each comment\n",
    "prop.table(table(polaridade.raw$Label))\n",
    "polaridade.raw$TextLength <- nchar(polaridade.raw$Text)\n",
    "summary(polaridade.raw$TextLength)\n",
    "\n",
    "#Plotting; geom_quantile() needs a y aesthetic, so a histogram is used here (the bin width is an assumption)\n",
    "\n",
    "ggplot(polaridade.raw, aes(x = TextLength, fill = Label)) +\n",
    "  theme_bw() +\n",
    "  geom_histogram(binwidth = 5) +\n",
    "  labs(caption=\"Source: Data\", y = \"Text Count\", x = \"Length of Text\")"
   ]
  },
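  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The label proportions and text lengths computed above can also be tried without the labelled CSV file. This is a minimal sketch on a made-up data frame: the labels, the texts and the per-label median computed with `tapply` are assumptions for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Made-up labelled comments, not taken from the data\n",
    "toy.df <- data.frame(Label = c(\"favour\", \"against\", \"against\", \"interaction\"),\n",
    "                     Text = c(\"sim\", \"nao\", \"nunca\", \"oi\"),\n",
    "                     stringsAsFactors = FALSE)\n",
    "toy.df$Label <- as.factor(toy.df$Label)\n",
    "#Share of comments per label\n",
    "toy.props <- prop.table(table(toy.df$Label))\n",
    "toy.props\n",
    "#Length of each comment in characters\n",
    "toy.df$TextLength <- nchar(toy.df$Text)\n",
    "#Median length per label\n",
    "toy.median <- tapply(toy.df$TextLength, toy.df$Label, median)\n",
    "toy.median"
   ]
  },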
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Creating the corpus for concordancing\n",
    "trans <- corpus(text_string)\n",
    "trans.tokens <- tokens(trans, remove_punct = TRUE, \n",
    "                        remove_numbers = TRUE, remove_url = TRUE)\n",
    "#General command for concordancing, run on the tokens; \"x.*\" stands in for the search pattern\n",
    "x.kwic <- kwic(trans.tokens, pattern = \"x.*\", window = 25, \n",
    "                   case_insensitive=TRUE, valuetype = \"regex\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}