@heaven00
Created March 1, 2018 05:31
{
"cells": [
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"using TextAnalysis\n",
"using DataFrames\n",
"using StatsBase\n",
"using CSV\n",
"using DataStructures"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"`CSV.read(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...)` => `typeof(sink)`\n",
"\n",
"`CSV.read(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...)` => `Data.Sink`\n",
"\n",
"parses a delimited file into a Julia structure (a DataFrame by default, but any valid `Data.Sink` may be requested).\n",
"\n",
"Minimal error-reporting happens w/ `CSV.read` for performance reasons; for problematic csv files, try [`CSV.validate`](@ref) which takes exact same arguments as `CSV.read` and provides much more information for why reading the file failed.\n",
"\n",
"Positional arguments:\n",
"\n",
" * `fullpath`; can be a file name (string) or other `IO` instance\n",
" * `sink::Type{T}`; `DataFrame` by default, but may also be other `Data.Sink` types that support streaming via `Data.Field` interface; note that the method argument can be the *type* of `Data.Sink`, plus any required arguments the sink may need (`args...`). or an already constructed `sink` may be passed (2nd method above)\n",
"\n",
"Keyword Arguments:\n",
"\n",
" * `delim::Union{Char,UInt8}`: how fields in the file are delimited; default `','`\n",
" * `quotechar::Union{Char,UInt8}`: the character that indicates a quoted field that may contain the `delim` or newlines; default `'\"'`\n",
" * `escapechar::Union{Char,UInt8}`: the character that escapes a `quotechar` in a quoted field; default `'\\'`\n",
" * `null::String`: indicates how NULL values are represented in the dataset; default `\"\"`\n",
" * `dateformat::Union{AbstractString,Dates.DateFormat}`: how dates/datetimes are represented in the dataset; default `Base.Dates.ISODateTimeFormat`\n",
" * `decimal::Union{Char,UInt8}`: character to recognize as the decimal point in a float number, e.g. `3.14` or `3,14`; default `'.'`\n",
" * `truestring`: string to represent `true::Bool` values in a csv file; default `\"true\"`. Note that `truestring` and `falsestring` cannot start with the same character.\n",
" * `falsestring`: string to represent `false::Bool` values in a csv file; default `\"false\"`\n",
" * `header`: column names can be provided manually as a complete Vector{String}, or as an Int/AbstractRange which indicates the row/rows that contain the column names\n",
" * `datarow::Int`: specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specify `datarow=1`\n",
" * `types`: column types can be provided manually as a complete Vector{Type}, or in a Dict to reference individual columns by name or number\n",
" * `nullable::Bool`: indicates whether values can be nullable or not; `true` by default. If set to `false` and missing values are encountered, a `Data.NullException` will be thrown\n",
" * `footerskip::Int`: indicates the number of rows to skip at the end of the file\n",
" * `rows_for_type_detect::Int=100`: indicates how many rows should be read to infer the types of columns\n",
" * `rows::Int`: indicates the total number of rows to read from the file; by default the file is pre-parsed to count the # of rows; `-1` can be passed to skip a full-file scan, but the `Data.Sink` must be set up to account for a potentially unknown # of rows\n",
" * `use_mmap::Bool=true`: whether the underlying file will be mmapped or not while parsing; note that on Windows machines, the underlying file will not be \"deletable\" until Julia GC has run (can be run manually via `gc()`) due to the use of a finalizer when reading the file.\n",
" * `append::Bool=false`: if the `sink` argument provided is an existing table, `append=true` will append the source's data to the existing data instead of doing a full replace\n",
" * `transforms::Dict{Union{String,Int},Function}`: a Dict of transforms to apply to values as they are parsed. Note that a column can be specified by either number or column name.\n",
" * `transpose::Bool=false`: when reading the underlying csv data, rows should be treated as columns and columns as rows, thus the resulting dataset will be the \"transpose\" of the actual csv data.\n",
" * `categorical::Bool=true`: read string column as a `CategoricalArray` ([ref](https://github.com/JuliaData/CategoricalArrays.jl)), as long as the % of unique values seen during type detection is less than 67%. This will dramatically reduce memory use in cases where the number of unique values is small.\n",
" * `weakrefstrings::Bool=true`: whether to use [`WeakRefStrings`](https://github.com/quinnj/WeakRefStrings.jl) package to speed up file parsing; can only be `=true` for the `Sink` objects that support `WeakRefStringArray` columns. Note that `WeakRefStringArray` still returns regular `String` elements.\n",
"\n",
"Example usage:\n",
"\n",
"```\n",
"julia> dt = CSV.read(\"bids.csv\")\n",
"7656334×9 DataFrames.DataFrame\n",
"│ Row │ bid_id │ bidder_id │ auction │ merchandise │ device │\n",
"├─────────┼─────────┼─────────────────────────────────────────┼─────────┼──────────────────┼─────────────┤\n",
"│ 1 │ 0 │ \"8dac2b259fd1c6d1120e519fb1ac14fbqvax8\" │ \"ewmzr\" │ \"jewelry\" │ \"phone0\" │\n",
"│ 2 │ 1 │ \"668d393e858e8126275433046bbd35c6tywop\" │ \"aeqok\" │ \"furniture\" │ \"phone1\" │\n",
"│ 3 │ 2 │ \"aa5f360084278b35d746fa6af3a7a1a5ra3xe\" │ \"wa00e\" │ \"home goods\" │ \"phone2\" │\n",
"...\n",
"```\n",
"\n",
"Other example invocations may include:\n",
"\n",
"```julia\n",
"# read in a tab-delimited file\n",
"CSV.read(file; delim='\t')\n",
"\n",
"# read in a comma-delimited file with null values represented as '\\N', such as a MySQL export\n",
"CSV.read(file; null=\"\\N\")\n",
"\n",
"# read a csv file that happens to have column names in the first column, and grouped data in rows instead of columns\n",
"CSV.read(file; transpose=true)\n",
"\n",
"# manually provided column names; must match # of columns of data in file\n",
"# this assumes there is no header row in the file itself, so data parsing will start at the very beginning of the file\n",
"CSV.read(file; header=[\"col1\", \"col2\", \"col3\"])\n",
"\n",
"# manually provided column names, even though the file itself has column names on the first row\n",
"# `datarow` is specified to ensure data parsing occurs at correct location\n",
"CSV.read(file; header=[\"col1\", \"col2\", \"col3\"], datarow=2)\n",
"\n",
"# types provided manually; as a vector, must match length of columns in actual data\n",
"CSV.read(file; types=[Int, Int, Float64])\n",
"\n",
"# types provided manually; as a Dict, can specify columns by # or column name\n",
"CSV.read(file; types=Dict(3=>Float64, 6=>String))\n",
"CSV.read(file; types=Dict(\"col3\"=>Float64, \"col6\"=>String))\n",
"\n",
"# manually provided # of rows; if known beforehand, this will improve parsing speed\n",
"# this is also a way to limit the # of rows to be read in a file if only a sample is needed\n",
"CSV.read(file; rows=10000)\n",
"\n",
"# for data files, `file` and `file2`, with the same structure, read both into a single DataFrame\n",
"# note that `df` is used as a 2nd argument in the 2nd call to `CSV.read` and the keyword argument\n",
"# `append=true` is passed\n",
"df = CSV.read(file)\n",
"df = CSV.read(file2, df; append=true)\n",
"\n",
"# manually construct a `CSV.Source` once, then stream its data to both a DataFrame\n",
"# and SQLite table `sqlite_table` in the SQLite database `db`\n",
"# note the use of `CSV.reset!` to ensure the `source` can be streamed from again\n",
"source = CSV.Source(file)\n",
"df1 = CSV.read(source, DataFrame)\n",
"CSV.reset!(source)\n",
"db = SQLite.DB()\n",
"sq1 = CSV.read(source, SQLite.Sink, db, \"sqlite_table\")\n",
"```\n"
],
"text/plain": [
"`CSV.read(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...)` => `typeof(sink)`\n",
"\n",
"`CSV.read(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...)` => `Data.Sink`\n",
"\n",
"parses a delimited file into a Julia structure (a DataFrame by default, but any valid `Data.Sink` may be requested).\n",
"\n",
"Minimal error-reporting happens w/ `CSV.read` for performance reasons; for problematic csv files, try [`CSV.validate`](@ref) which takes exact same arguments as `CSV.read` and provides much more information for why reading the file failed.\n",
"\n",
"Positional arguments:\n",
"\n",
" * `fullpath`; can be a file name (string) or other `IO` instance\n",
" * `sink::Type{T}`; `DataFrame` by default, but may also be other `Data.Sink` types that support streaming via `Data.Field` interface; note that the method argument can be the *type* of `Data.Sink`, plus any required arguments the sink may need (`args...`). or an already constructed `sink` may be passed (2nd method above)\n",
"\n",
"Keyword Arguments:\n",
"\n",
" * `delim::Union{Char,UInt8}`: how fields in the file are delimited; default `','`\n",
" * `quotechar::Union{Char,UInt8}`: the character that indicates a quoted field that may contain the `delim` or newlines; default `'\"'`\n",
" * `escapechar::Union{Char,UInt8}`: the character that escapes a `quotechar` in a quoted field; default `'\\'`\n",
" * `null::String`: indicates how NULL values are represented in the dataset; default `\"\"`\n",
" * `dateformat::Union{AbstractString,Dates.DateFormat}`: how dates/datetimes are represented in the dataset; default `Base.Dates.ISODateTimeFormat`\n",
" * `decimal::Union{Char,UInt8}`: character to recognize as the decimal point in a float number, e.g. `3.14` or `3,14`; default `'.'`\n",
" * `truestring`: string to represent `true::Bool` values in a csv file; default `\"true\"`. Note that `truestring` and `falsestring` cannot start with the same character.\n",
" * `falsestring`: string to represent `false::Bool` values in a csv file; default `\"false\"`\n",
" * `header`: column names can be provided manually as a complete Vector{String}, or as an Int/AbstractRange which indicates the row/rows that contain the column names\n",
" * `datarow::Int`: specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specify `datarow=1`\n",
" * `types`: column types can be provided manually as a complete Vector{Type}, or in a Dict to reference individual columns by name or number\n",
" * `nullable::Bool`: indicates whether values can be nullable or not; `true` by default. If set to `false` and missing values are encountered, a `Data.NullException` will be thrown\n",
" * `footerskip::Int`: indicates the number of rows to skip at the end of the file\n",
" * `rows_for_type_detect::Int=100`: indicates how many rows should be read to infer the types of columns\n",
" * `rows::Int`: indicates the total number of rows to read from the file; by default the file is pre-parsed to count the # of rows; `-1` can be passed to skip a full-file scan, but the `Data.Sink` must be set up to account for a potentially unknown # of rows\n",
" * `use_mmap::Bool=true`: whether the underlying file will be mmapped or not while parsing; note that on Windows machines, the underlying file will not be \"deletable\" until Julia GC has run (can be run manually via `gc()`) due to the use of a finalizer when reading the file.\n",
" * `append::Bool=false`: if the `sink` argument provided is an existing table, `append=true` will append the source's data to the existing data instead of doing a full replace\n",
" * `transforms::Dict{Union{String,Int},Function}`: a Dict of transforms to apply to values as they are parsed. Note that a column can be specified by either number or column name.\n",
" * `transpose::Bool=false`: when reading the underlying csv data, rows should be treated as columns and columns as rows, thus the resulting dataset will be the \"transpose\" of the actual csv data.\n",
" * `categorical::Bool=true`: read string column as a `CategoricalArray` ([ref](https://github.com/JuliaData/CategoricalArrays.jl)), as long as the % of unique values seen during type detection is less than 67%. This will dramatically reduce memory use in cases where the number of unique values is small.\n",
" * `weakrefstrings::Bool=true`: whether to use [`WeakRefStrings`](https://github.com/quinnj/WeakRefStrings.jl) package to speed up file parsing; can only be `=true` for the `Sink` objects that support `WeakRefStringArray` columns. Note that `WeakRefStringArray` still returns regular `String` elements.\n",
"\n",
"Example usage:\n",
"\n",
"```\n",
"julia> dt = CSV.read(\"bids.csv\")\n",
"7656334×9 DataFrames.DataFrame\n",
"│ Row │ bid_id │ bidder_id │ auction │ merchandise │ device │\n",
"├─────────┼─────────┼─────────────────────────────────────────┼─────────┼──────────────────┼─────────────┤\n",
"│ 1 │ 0 │ \"8dac2b259fd1c6d1120e519fb1ac14fbqvax8\" │ \"ewmzr\" │ \"jewelry\" │ \"phone0\" │\n",
"│ 2 │ 1 │ \"668d393e858e8126275433046bbd35c6tywop\" │ \"aeqok\" │ \"furniture\" │ \"phone1\" │\n",
"│ 3 │ 2 │ \"aa5f360084278b35d746fa6af3a7a1a5ra3xe\" │ \"wa00e\" │ \"home goods\" │ \"phone2\" │\n",
"...\n",
"```\n",
"\n",
"Other example invocations may include:\n",
"\n",
"```julia\n",
"# read in a tab-delimited file\n",
"CSV.read(file; delim='\t')\n",
"\n",
"# read in a comma-delimited file with null values represented as '\\N', such as a MySQL export\n",
"CSV.read(file; null=\"\\N\")\n",
"\n",
"# read a csv file that happens to have column names in the first column, and grouped data in rows instead of columns\n",
"CSV.read(file; transpose=true)\n",
"\n",
"# manually provided column names; must match # of columns of data in file\n",
"# this assumes there is no header row in the file itself, so data parsing will start at the very beginning of the file\n",
"CSV.read(file; header=[\"col1\", \"col2\", \"col3\"])\n",
"\n",
"# manually provided column names, even though the file itself has column names on the first row\n",
"# `datarow` is specified to ensure data parsing occurs at correct location\n",
"CSV.read(file; header=[\"col1\", \"col2\", \"col3\"], datarow=2)\n",
"\n",
"# types provided manually; as a vector, must match length of columns in actual data\n",
"CSV.read(file; types=[Int, Int, Float64])\n",
"\n",
"# types provided manually; as a Dict, can specify columns by # or column name\n",
"CSV.read(file; types=Dict(3=>Float64, 6=>String))\n",
"CSV.read(file; types=Dict(\"col3\"=>Float64, \"col6\"=>String))\n",
"\n",
"# manually provided # of rows; if known beforehand, this will improve parsing speed\n",
"# this is also a way to limit the # of rows to be read in a file if only a sample is needed\n",
"CSV.read(file; rows=10000)\n",
"\n",
"# for data files, `file` and `file2`, with the same structure, read both into a single DataFrame\n",
"# note that `df` is used as a 2nd argument in the 2nd call to `CSV.read` and the keyword argument\n",
"# `append=true` is passed\n",
"df = CSV.read(file)\n",
"df = CSV.read(file2, df; append=true)\n",
"\n",
"# manually construct a `CSV.Source` once, then stream its data to both a DataFrame\n",
"# and SQLite table `sqlite_table` in the SQLite database `db`\n",
"# note the use of `CSV.reset!` to ensure the `source` can be streamed from again\n",
"source = CSV.Source(file)\n",
"df1 = CSV.read(source, DataFrame)\n",
"CSV.reset!(source)\n",
"db = SQLite.DB()\n",
"sq1 = CSV.read(source, SQLite.Sink, db, \"sqlite_table\")\n",
"```\n"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"?CSV.read"
]
},
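{
"cell_type": "markdown",
"metadata": {},
"source": [
"Going by the help text above, the two `readdlm` calls below could also be written with `CSV.read`; a minimal sketch, assuming the same `data/train.csv` and `data/test.csv` paths and that each file's first row holds the column names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: CSV.read takes column names from the header row and infers column\n",
"# types from the first `rows_for_type_detect` rows (100 by default), so no manual\n",
"# name vector is needed. `categorical=false` keeps string columns as plain String\n",
"# vectors instead of the default CategoricalArray.\n",
"train_csv = CSV.read(\"data/train.csv\"; categorical=false)\n",
"test_csv = CSV.read(\"data/test.csv\"; categorical=false)"
]
},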
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>id</th><th>comment_text</th><th>toxic</th><th>severe_toxic</th><th>obscene</th><th>threat</th><th>insult</th><th>identity_hate</th></tr></thead><tbody><tr><th>1</th><td>0000997932d777bf</td><td>Explanation\\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>2</th><td>000103f0d9cfb60f</td><td>D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>3</th><td>000113f07ec002fd</td><td>Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>4</th><td>0001b41b1c6bb37e</td><td>\"\\nMore\\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of \"\"types of accidents\"\" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\\n\\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>5</th><td>0001d958c54c6e35</td><td>You, sir, are my hero. Any chance you remember what page that's on?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>6</th><td>00025465d4725e87</td><td>\"\\n\\nCongratulations from me as well, use the tools well.  · talk \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>7</th><td>0002bcb3da6cb337</td><td>COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td></tr><tr><th>8</th><td>00031b1e95af7921</td><td>Your vandalism to the Matt Shirvington article has been reverted. Please don't do it again, or you will be banned.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>9</th><td>00037261f536c51d</td><td>Sorry if the word 'nonsense' was offensive to you. Anyway, I'm not intending to write anything in the article(wow they would jump on me for vandalism), I'm merely requesting that it be more encyclopedic so one can use it for school as a reference. I have been to the selective breeding page but it's almost a stub. It points to 'animal breeding' which is a short messy article that gives you no info. There must be someone around with expertise in eugenics? 93.161.107.169</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>10</th><td>00040093b2687caa</td><td>alignment on this subject and which are contrary to those of DuLithgow</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>11</th><td>0005300084f90edc</td><td>\"\\nFair use rationale for Image:Wonju.jpg\\n\\nThanks for uploading Image:Wonju.jpg. 
I notice the image page specifies that the image is being used under fair use but there is no explanation or rationale as to why its use in Wikipedia articles constitutes fair use. In addition to the boilerplate fair use template, you must also write out on the image description page a specific explanation or rationale for why using this image in each article is consistent with fair use.\\n\\nPlease go to the image description page and edit it to include a fair use rationale.\\n\\nIf you have uploaded other fair use media, consider checking that you have specified the fair use rationale on those pages too. You can find a list of 'image' pages you have edited by clicking on the \"\"my contributions\"\" link (it is located at the very top of any Wikipedia page when you are logged in), and then selecting \"\"Image\"\" from the dropdown box. Note that any fair use images uploaded after 4 May, 2006, and lacking such an explanation will be deleted one week after they have been uploaded, as described on criteria for speedy deletion. If you have any questions please ask them at the Media copyright questions page. Thank you. (talk • contribs • ) \\nUnspecified source for Image:Wonju.jpg\\n\\nThanks for uploading Image:Wonju.jpg. I noticed that the file's description page currently doesn't specify who created the content, so the copyright status is unclear. If you did not create this file yourself, then you will need to specify the owner of the copyright. If you obtained it from a website, then a link to the website from which it was taken, together with a restatement of that website's terms of use of its content, is usually sufficient information. However, if the copyright holder is different from the website's publisher, then their copyright should also be acknowledged.\\n\\nAs well as adding the source, please add a proper copyright licensing tag if the file doesn't have one already. If you created/took the picture, audio, or video then the tag can be used to release it under the GFDL. If you believe the media meets the criteria at Wikipedia:Fair use, use a tag such as or one of the other tags listed at Wikipedia:Image copyright tags#Fair use. See Wikipedia:Image copyright tags for the full list of copyright tags that you can use.\\n\\nIf you have uploaded other files, consider checking that you have specified their source and tagged them, too. You can find a list of files you have uploaded by following [ this link]. Unsourced and untagged images may be deleted one week after they have been tagged, as described on criteria for speedy deletion. If the image is copyrighted under a non-free license (per Wikipedia:Fair use) then the image will be deleted 48 hours after . If you have any questions please ask them at the Media copyright questions page. Thank you. (talk • contribs • ) \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>12</th><td>00054a5e18b50dd4</td><td>bbq \\n\\nbe a man and lets discuss it-maybe over the phone?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>13</th><td>0005c987bdfc9d4b</td><td>Hey... what is it..\\n@ | talk .\\nWhat is it... 
an exclusive group of some WP TALIBANS...who are good at destroying, self-appointed purist who GANG UP any one who asks them questions abt their ANTI-SOCIAL and DESTRUCTIVE (non)-contribution at WP?\\n\\nAsk Sityush to clean up his behavior than issue me nonsensical warnings...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>14</th><td>0006f16e4e9f292e</td><td>Before you start throwing accusations and warnings at me, lets review the edit itself-making ad hominem attacks isn't going to strengthen your argument, it will merely make it look like you are abusing your power as an admin. \\nNow, the edit itself is relevant-this is probably the single most talked about event int he news as of late. His absence is notable, since he is the only living ex-president who did not attend. That's certainly more notable than his dedicating an aircracft carrier. \\nI intend to revert this edit, in hopes of attracting the attention of an admin that is willing to look at the issue itself, and not throw accusations around quite so liberally. Perhaps, if you achieve a level of civility where you can do this, we can have a rational discussion on the topic and resolve the matter peacefully.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>15</th><td>00070ef96486d6f9</td><td>Oh, and the girl above started her arguments with me. She stuck her nose where it doesn't belong. I believe the argument was between me and Yvesnimmo. But like I said, the situation was settled and I apologized. Thanks,</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>16</th><td>00078f8ce7eb276d</td><td>\"\\n\\nJuelz Santanas Age\\n\\nIn 2002, Juelz Santana was 18 years old, then came February 18th, which makes Juelz turn 19 making songs with The Diplomats. The third neff to be signed to Cam's label under Roc A Fella. In 2003, he was 20 years old coming out with his own singles \"\"Santana's Town\"\" and \"\"Down\"\". So yes, he is born in 1983. He really is, how could he be older then Lloyd Banks? And how could he be 22 when his birthday passed? The homie neff is 23 years old. 1983 - 2006 (Juelz death, god forbid if your thinking about that) equals 23. Go to your caculator and stop changing his year of birth. My god.\"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>17</th><td>0007e25b2121310b</td><td>Bye! \\n\\nDon't look, come or think of comming back! Tosser.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>18</th><td>000897889268bc93</td><td>REDIRECT Talk:Voydan Pop Georgiev- Chernodrinski</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>19</th><td>0009801bd85e5806</td><td>The Mitsurugi point made no sense - why not argue to include Hindi on Ryo Sakazaki's page to include more information?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>20</th><td>0009eaea3325de8c</td><td>Don't mean to bother you \\n\\nI see that you're writing something regarding removing anything posted here and if you do oh well but if not and you can acctually discuss this with me then even better.\\n\\nI'd like to ask you to take a closer look at the Premature wrestling deaths catagory and the men listed in it, surely these men belong together in some catagory. 
Is there anything that you think we can do with the catagory besides delting it?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>21</th><td>000b08c464718505</td><td>\"\\n\\n Regarding your recent edits \\n\\nOnce again, please read WP:FILMPLOT before editing any more film articles. Your edits are simply not good, with entirely too many unnecessary details and very bad writing. Please stop before you do further damage. -''''''The '45 \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>22</th><td>000bfd0867774845</td><td>\"\\nGood to know. About me, yeah, I'm studying now.(Deepu) \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>23</th><td>000c0dfd995809fa</td><td>\"\\n\\n Snowflakes are NOT always symmetrical! \\n\\nUnder Geometry it is stated that \"\"A snowflake always has six symmetric arms.\"\" This assertion is simply not true! According to Kenneth Libbrecht, \"\"The rather unattractive irregular crystals are by far the most common variety.\"\" http://www.its.caltech.edu/~atomic/snowcrystals/myths/myths.htm#perfection Someone really need to take a look at his site and get FACTS off of it because I still see a decent number of falsities on this page. (forgive me Im new at this and dont want to edit anything)\"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>24</th><td>000c6a3f0cd3ba8e</td><td>\"\\n\\n The Signpost: 24 September 2012 \\n\\n Read this Signpost in full\\n Single-page\\n Unsubscribe\\n \\n\"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>25</th><td>000cfee90f50d471</td><td>\"\\n\\nRe-considering 1st paragraph edit?\\nI don't understand the reasons for 's recent edit of this article not that I'm sure that the data are necessarily \"\"wrong.\"\" Rather, I'm persuaded that the strategy of introducing academic honors in the first paragraph is an unhelpful approach to this specific subject. I note that articles about other sitting Justices have been similarly \"\"enhanced;\"\" and I also believe those changes are no improvement. \\n\\nIn support of my view that this edit should be reverted, I would invite anyone to re-visit articles written about the following pairs of jurists.\\n A1. Benjamin Cardozo\\n A2. Learned Hand\\n\\n B1. John Marshall Harlan\\n B2. John Marshall Harlan II\\n\\nThe question becomes: Would the current version of the Wikipedia article about any one of them or either pair be improved by academic credentials in the introductory paragraph? I think not.\\n\\nPerhaps it helps to repeat a wry argument Kathleen Sullivan of Stanford Law makes when she suggests that some on the Harvard Law faculty wonder how Antonin Scalia avoided learning what others have managed to grasp about the processes of judging? I would hope this anecdote gently illustrates the point. \\n\\nLess humorous, but an even stronger argument is the one Clarence Thomas makes when he mentions wanting to return his law degree to Yale.\\n\\nAt a minimum, I'm questioning this edit? It deserves to be reconsidered. \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>26</th><td>000eefc67a2c930f</td><td>Radial symmetry \\n\\nSeveral now extinct lineages included in the Echinodermata were bilateral such as Homostelea, or even asymmetrical such as Cothurnocystis (Stylophora).\\n\\n-</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>27</th><td>000f35deef84dc4a</td><td>There's no need to apologize. 
A Wikipedia article is made for reconciling knowledge about a subject from different sources, and you've done history studies and not archaeology studies, I guess. I could scan the page, e-mail it to you, and then you could ask someone to translate the page.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>28</th><td>000ffab30195c5e1</td><td>Yes, because the mother of the child in the case against Michael Jackson was studied in here motives and reasonings and judged upon her character just as harshly as Wacko Jacko himself. Don't tell me to ignore it and incriminate myself. I am going to continue refuting the bullshit that Jayjg keeps throwing at me. 18:01, 16 Jun 2005 (UTC)</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>29</th><td>0010307a3a50a353</td><td>\"\\nOk. But it will take a bit of work but I can't quite picture it. Do you have an example I can base it on? the Duck \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>30</th><td>0010833a96e1f886</td><td>\"== A barnstar for you! ==\\n\\n The Real Life Barnstar lets us be the stars\\n \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><th>&vellip;</th><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td></tr></tbody></table>"
],
"text/plain": [
"159571×8 DataFrames.DataFrame. Omitted printing of 7 columns\n",
"│ Row │ id │\n",
"├────────┼──────────────────┤\n",
"│ 1 │ 0000997932d777bf │\n",
"│ 2 │ 000103f0d9cfb60f │\n",
"│ 3 │ 000113f07ec002fd │\n",
"│ 4 │ 0001b41b1c6bb37e │\n",
"│ 5 │ 0001d958c54c6e35 │\n",
"│ 6 │ 00025465d4725e87 │\n",
"│ 7 │ 0002bcb3da6cb337 │\n",
"│ 8 │ 00031b1e95af7921 │\n",
"│ 9 │ 00037261f536c51d │\n",
"│ 10 │ 00040093b2687caa │\n",
"│ 11 │ 0005300084f90edc │\n",
"⋮\n",
"│ 159560 │ ffca8d71d71a3fae │\n",
"│ 159561 │ ffcdcb71854f6d8a │\n",
"│ 159562 │ ffd2e85b07b3c7e4 │\n",
"│ 159563 │ ffd72e9766c09c97 │\n",
"│ 159564 │ ffe029a7c79dc7fe │\n",
"│ 159565 │ ffe897e7f7182c90 │\n",
"│ 159566 │ ffe8b9316245be30 │\n",
"│ 159567 │ ffe987279560d7ff │\n",
"│ 159568 │ ffea4adeee384e90 │\n",
"│ 159569 │ ffee36eab5c267c9 │\n",
"│ 159570 │ fff125370e4aaaf3 │\n",
"│ 159571 │ fff46fc426af1f9a │"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train = DataFrame(readdlm(\"data/train.csv\", ','; skipstart=1), [:id, :comment_text, :toxic, :severe_toxic, :obscene, :threat, :insult, :identity_hate])"
]
},
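{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `readdlm` returns a `Matrix{Any}`, the columns built from it above are untyped (`Any`) vectors. A quick sanity check on the loaded table (a sketch; it only uses the column names assigned above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# dimensions of the training table and the element type of each column\n",
"println(size(train))\n",
"println([(c, eltype(train[c])) for c in names(train)])"
]
},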
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>id</th><th>comment_text</th></tr></thead><tbody><tr><th>1</th><td>00001cee341fdb12</td><td>Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,</td></tr><tr><th>2</th><td>0000247867823ef7</td><td>== From RfC == \\n\\n The title is fine as it is, IMO.</td></tr><tr><th>3</th><td>00013b17ad220c46</td><td>\" \\n\\n == Sources == \\n\\n * Zawe Ashton on Lapland — / \"</td></tr><tr><th>4</th><td>00017563c3f7919a</td><td>:If you have a look back at the source, the information I updated was the correct form. I can only guess the source hadn't updated. I shall update the information once again but thank you for your message.</td></tr><tr><th>5</th><td>00017695ad8997eb</td><td>I don't anonymously edit articles at all.</td></tr><tr><th>6</th><td>0001ea8717f6de06</td><td>Thank you for understanding. I think very highly of you and would not revert without discussion.</td></tr><tr><th>7</th><td>00024115d4cbde0f</td><td>Please do not add nonsense to Wikipedia. Such edits are considered vandalism and quickly undone. If you would like to experiment, please use the sandbox instead. Thank you. -</td></tr><tr><th>8</th><td>000247e83dcc1211</td><td>:Dear god this site is horrible.</td></tr><tr><th>9</th><td>00025358d4737918</td><td>\" \\n Only a fool can believe in such numbers. \\n The correct number lies between 10 000 to 15 000. \\n Ponder the numbers carefully. \\n\\n This error will persist for a long time as it continues to reproduce... The latest reproduction I know is from ENCYCLOPÆDIA BRITANNICA ALMANAC 2008 wich states \\n Magnittude: 8.7 (fair enough) \\n victims: 70 000 (today 10 000 to 15 000 is not \"\"a lot\"\" so I guess people just come out with a number that impresses enough, I don't know. But I know this: it's just a shameless lucky number that they throw in the air. \\n GC \\n\\n \"</td></tr><tr><th>10</th><td>00026d1092fe71cc</td><td>== Double Redirects == \\n\\n When fixing double redirects, don't just blank the outer one, you need edit it to point it to the final target, unless you think it's inappropriate, in which case, it needs to be nominated at WP:RfD</td></tr><tr><th>11</th><td>0002eadc3b301559</td><td>I think its crap that the link to roggenbier is to this article. Somebody that knows how to do things should change it.</td></tr><tr><th>12</th><td>0002f87b16116a7f</td><td>\"::: Somebody will invariably try to add Religion? Really?? You mean, the way people have invariably kept adding \"\"Religion\"\" to the Samuel Beckett infobox? And why do you bother bringing up the long-dead completely non-existent \"\"Influences\"\" issue? You're just flailing, making up crap on the fly. \\n ::: For comparison, the only explicit acknowledgement in the entire Amos Oz article that he is personally Jewish is in the categories! \\n\\n \"</td></tr><tr><th>13</th><td>0003806b11932181</td><td>, 25 February 2010 (UTC) \\n\\n :::Looking it over, it's clear that (a banned sockpuppet of ) ignored the consensus (&amp;, fwiw, policy-appropriate) choice to leave the page at Chihuahua (Mexico) and the current page should be returned there. Anyone have the time to fix the incoming links? 
- 18:24</td></tr><tr><th>14</th><td>0003e1cccfd5a40a</td><td>\" \\n\\n It says it right there that it IS a type. The \"\"Type\"\" of institution is needed in this case because there are three levels of SUNY schools: \\n -University Centers and Doctoral Granting Institutions \\n -State Colleges \\n -Community Colleges. \\n\\n It is needed in this case to clarify that UB is a SUNY Center. It says it even in Binghamton University, University at Albany, State University of New York, and Stony Brook University. Stop trying to say it's not because I am totally right in this case.\"</td></tr><tr><th>15</th><td>00059ace3e3e9a53</td><td>\" \\n\\n == Before adding a new product to the list, make sure it's relevant == \\n\\n Before adding a new product to the list, make sure it has a wikipedia entry already, \"\"proving\"\" it's relevance and giving the reader the possibility to read more about it. \\n Otherwise it could be subject to deletion. See this article's revision history.\"</td></tr><tr><th>16</th><td>000634272d0d44eb</td><td>==Current Position== \\n Anyone have confirmation that Sir, Alfred is no longer at the airport and is hospitalised?</td></tr><tr><th>17</th><td>000663aff0fffc80</td><td>this other one from 1897</td></tr><tr><th>18</th><td>000689dd34e20979</td><td>== Reason for banning throwing == \\n\\n This article needs a section on /why/ throwing is banned. At the moment, to a non-cricket fan, it seems kind of arbitrary.</td></tr><tr><th>19</th><td>000834769115370c</td><td>:: Wallamoose was changing the cited material to say things the original source did not say. In response to his objections, I modified the article as we went along. I was not just reverting him. I repeatedly asked him to use the talk page. I've been trying to add to the article for a long time. It's so thin on content. This is wrong.</td></tr><tr><th>20</th><td>000844b52dee5f3f</td><td>|blocked]] from editing Wikipedia. |</td></tr><tr><th>21</th><td>00084da5d4ead7aa</td><td>==Indefinitely blocked== \\n I have indefinitely blocked this account.</td></tr><tr><th>22</th><td>00091c35fa9d0465</td><td>== Arabs are committing genocide in Iraq, but no protests in Europe. == \\n\\n May Europe also burn in hell.</td></tr><tr><th>23</th><td>000968ce11f5ee34</td><td>Please stop. If you continue to vandalize Wikipedia, as you did to Homosexuality, you will be blocked from editing.</td></tr><tr><th>24</th><td>0009734200a85047</td><td>== Energy == \\n\\n I have edited the introduction, because previously it said that passive transport does not use any kind of energy. This is not true. Passive transport relies on the kinetic energy of the substance that is being transported. This kinetic energy is what causes it to move around and (by random chance) cross the membrane. The difference is that active transport actually uses the cell's energy (ATP or electrochemical gradient) to pump the substance across the membrane.</td></tr><tr><th>25</th><td>00097b6214686db5</td><td>:yeah, thanks for reviving the tradition of pissing all over articles because you want to live out your ethnic essentialism. 
Why let mere facts get into the way of enjoying that.</td></tr><tr><th>26</th><td>0009aef4bd9e1697</td><td>MLM Software,NBFC software,Non Banking Financial Company,NBFC software company,NBFC software in india,software for banking,Gold loan software.MLM Software \\n\\n '''SEO Services \\n Search Engine Optimization \\n www.liveindiatech.com \\n\\n According to a recenBold textt survey people have moved away from searching print media for their needs. They use search engines to find the products and services. The first step to have a successful presence over the internet is creating your own website but that is not enough. When someone searches for the products/services that you offer your name needs to be listed high in the search engine. \\n\\n Live India Tech guarantees you search engine optimization using which your organizations name will feature in the top ten listing in all the search engines. This will ensure that the traffic to your site will increase exponentially aiding you in the sales of more products and services. \\n\\n We have invested enough resources to develop an SEO system that is far superior to anything else in the market. \\n\\n www.liveindiatech.com</td></tr><tr><th>27</th><td>000a02d807ae0254</td><td>@RedSlash, cut it short. If you have sources stating the RoK is sovereign post them. Otherwise please aknowledge WP is not the place to make OR.</td></tr><tr><th>28</th><td>000a6c6d4e89b9bc</td><td>==================== \\n Deception is the way of the Ninja..... \\n\\n Hence, Frank Dux is an amazing Ninja</td></tr><tr><th>29</th><td>000bafe2080bba82</td><td>. \\n\\n Jews are not a race because you can only get it from your mother. Your own mention of Ethiopian Jews not testing \\n as Jews proves it is not, as well as the fact that we accept converts</td></tr><tr><th>30</th><td>000bf0a9894b2807</td><td>:::If Ollie or others think that one list of the oldest people we know about is too long, the easy answer is to raise the cutoff age. 110 is purely a round number and a full 12 years shorter then the record. We can make it the top 1000 or top 500 or everyone above 115 - tell us what the maximum list size is and we can set a threshold.</td></tr><tr><th>&vellip;</th><td>&vellip;</td><td>&vellip;</td></tr></tbody></table>"
],
"text/plain": [
"153164×2 DataFrames.DataFrame. Omitted printing of 1 columns\n",
"│ Row │ id │\n",
"├────────┼──────────────────┤\n",
"│ 1 │ 00001cee341fdb12 │\n",
"│ 2 │ 0000247867823ef7 │\n",
"│ 3 │ 00013b17ad220c46 │\n",
"│ 4 │ 00017563c3f7919a │\n",
"│ 5 │ 00017695ad8997eb │\n",
"│ 6 │ 0001ea8717f6de06 │\n",
"│ 7 │ 00024115d4cbde0f │\n",
"│ 8 │ 000247e83dcc1211 │\n",
"│ 9 │ 00025358d4737918 │\n",
"│ 10 │ 00026d1092fe71cc │\n",
"│ 11 │ 0002eadc3b301559 │\n",
"⋮\n",
"│ 153153 │ fff9fa508f400ee6 │\n",
"│ 153154 │ fffa3fae1890b40a │\n",
"│ 153155 │ fffa8a11c4378854 │\n",
"│ 153156 │ fffac2a094c8e0e2 │\n",
"│ 153157 │ fffb5451268fb5ba │\n",
"│ 153158 │ fffc2b34bbe61c8d │\n",
"│ 153159 │ fffc489742ffe69b │\n",
"│ 153160 │ fffcd0960ee309b5 │\n",
"│ 153161 │ fffd7a9a6eb32c16 │\n",
"│ 153162 │ fffda9e8d6fafa9e │\n",
"│ 153163 │ fffe8f1340a79fc2 │\n",
"│ 153164 │ ffffce3fb183ee80 │"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test = DataFrame(readdlm(\"data/test.csv\", ','; skipstart=1), [:id, :comment_text])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"search: \u001b[1mD\u001b[22m\u001b[1ma\u001b[22m\u001b[1mt\u001b[22m\u001b[1ma\u001b[22m\u001b[1mF\u001b[22m\u001b[1mr\u001b[22m\u001b[1ma\u001b[22m\u001b[1mm\u001b[22m\u001b[1me\u001b[22m \u001b[1mD\u001b[22m\u001b[1ma\u001b[22m\u001b[1mt\u001b[22m\u001b[1ma\u001b[22m\u001b[1mF\u001b[22m\u001b[1mr\u001b[22m\u001b[1ma\u001b[22m\u001b[1mm\u001b[22m\u001b[1me\u001b[22ms \u001b[1mD\u001b[22m\u001b[1ma\u001b[22m\u001b[1mt\u001b[22m\u001b[1ma\u001b[22m\u001b[1mF\u001b[22m\u001b[1mr\u001b[22m\u001b[1ma\u001b[22m\u001b[1mm\u001b[22m\u001b[1me\u001b[22mRow Sub\u001b[1mD\u001b[22m\u001b[1ma\u001b[22m\u001b[1mt\u001b[22m\u001b[1ma\u001b[22m\u001b[1mF\u001b[22m\u001b[1mr\u001b[22m\u001b[1ma\u001b[22m\u001b[1mm\u001b[22m\u001b[1me\u001b[22m Groupe\u001b[1md\u001b[22mD\u001b[1ma\u001b[22m\u001b[1mt\u001b[22m\u001b[1ma\u001b[22m\u001b[1mF\u001b[22m\u001b[1mr\u001b[22m\u001b[1ma\u001b[22m\u001b[1mm\u001b[22m\u001b[1me\u001b[22m\n",
"\n"
]
},
{
"data": {
"text/markdown": [
"An AbstractDataFrame that stores a set of named columns\n",
"\n",
"The columns are normally AbstractVectors stored in memory, particularly a Vector or CategoricalVector.\n",
"\n",
"**Constructors**\n",
"\n",
"```julia\n",
"DataFrame(columns::Vector, names::Vector{Symbol}; makeunique::Bool=false)\n",
"DataFrame(columns::Matrix, names::Vector{Symbol}; makeunique::Bool=false)\n",
"DataFrame(kwargs...)\n",
"DataFrame(pairs::Pair{Symbol}...; makeunique::Bool=false)\n",
"DataFrame() # an empty DataFrame\n",
"DataFrame(t::Type, nrows::Integer, ncols::Integer) # an empty DataFrame of arbitrary size\n",
"DataFrame(column_eltypes::Vector, names::Vector, nrows::Integer; makeunique::Bool=false)\n",
"DataFrame(column_eltypes::Vector, cnames::Vector, categorical::Vector, nrows::Integer;\n",
" makeunique::Bool=false)\n",
"DataFrame(ds::Vector{Associative})\n",
"```\n",
"\n",
"**Arguments**\n",
"\n",
" * `columns` : a Vector with each column as contents or a Matrix\n",
" * `names` : the column names\n",
" * `makeunique` : if `false` (the default), an error will be raised if duplicates in `names` are found; if `true`, duplicate names will be suffixed with `_i` (`i` starting at 1 for the first duplicate).\n",
" * `kwargs` : the key gives the column names, and the value is the column contents\n",
" * `t` : elemental type of all columns\n",
" * `nrows`, `ncols` : number of rows and columns\n",
" * `column_eltypes` : elemental type of each column\n",
" * `categorical` : `Vector{Bool}` indicating which columns should be converted to `CategoricalVector`\n",
" * `ds` : a vector of Associatives\n",
"\n",
"Each column in `columns` should be the same length.\n",
"\n",
"**Notes**\n",
"\n",
"A `DataFrame` is a lightweight object. As long as columns are not manipulated, creation of a DataFrame from existing AbstractVectors is inexpensive. For example, indexing on columns is inexpensive, but indexing by rows is expensive because copies are made of each column.\n",
"\n",
"Because column types can vary, a DataFrame is not type stable. For performance-critical code, do not index into a DataFrame inside of loops.\n",
"\n",
"**Examples**\n",
"\n",
"```julia\n",
"df = DataFrame()\n",
"v = [\"x\",\"y\",\"z\"][rand(1:3, 10)]\n",
"df1 = DataFrame(Any[collect(1:10), v, rand(10)], [:A, :B, :C])\n",
"df2 = DataFrame(A = 1:10, B = v, C = rand(10))\n",
"dump(df1)\n",
"dump(df2)\n",
"describe(df2)\n",
"head(df1)\n",
"df1[:A] + df2[:C]\n",
"df1[1:4, 1:2]\n",
"df1[[:A,:C]]\n",
"df1[1:2, [:A,:C]]\n",
"df1[:, [:A,:C]]\n",
"df1[:, [1,3]]\n",
"df1[1:4, :]\n",
"df1[1:4, :C]\n",
"df1[1:4, :C] = 40. * df1[1:4, :C]\n",
"[df1; df2] # vcat\n",
"[df1 df2] # hcat\n",
"size(df1)\n",
"```\n"
],
"text/plain": [
"An AbstractDataFrame that stores a set of named columns\n",
"\n",
"The columns are normally AbstractVectors stored in memory, particularly a Vector or CategoricalVector.\n",
"\n",
"**Constructors**\n",
"\n",
"```julia\n",
"DataFrame(columns::Vector, names::Vector{Symbol}; makeunique::Bool=false)\n",
"DataFrame(columns::Matrix, names::Vector{Symbol}; makeunique::Bool=false)\n",
"DataFrame(kwargs...)\n",
"DataFrame(pairs::Pair{Symbol}...; makeunique::Bool=false)\n",
"DataFrame() # an empty DataFrame\n",
"DataFrame(t::Type, nrows::Integer, ncols::Integer) # an empty DataFrame of arbitrary size\n",
"DataFrame(column_eltypes::Vector, names::Vector, nrows::Integer; makeunique::Bool=false)\n",
"DataFrame(column_eltypes::Vector, cnames::Vector, categorical::Vector, nrows::Integer;\n",
" makeunique::Bool=false)\n",
"DataFrame(ds::Vector{Associative})\n",
"```\n",
"\n",
"**Arguments**\n",
"\n",
" * `columns` : a Vector with each column as contents or a Matrix\n",
" * `names` : the column names\n",
" * `makeunique` : if `false` (the default), an error will be raised if duplicates in `names` are found; if `true`, duplicate names will be suffixed with `_i` (`i` starting at 1 for the first duplicate).\n",
" * `kwargs` : the key gives the column names, and the value is the column contents\n",
" * `t` : elemental type of all columns\n",
" * `nrows`, `ncols` : number of rows and columns\n",
" * `column_eltypes` : elemental type of each column\n",
" * `categorical` : `Vector{Bool}` indicating which columns should be converted to `CategoricalVector`\n",
" * `ds` : a vector of Associatives\n",
"\n",
"Each column in `columns` should be the same length.\n",
"\n",
"**Notes**\n",
"\n",
"A `DataFrame` is a lightweight object. As long as columns are not manipulated, creation of a DataFrame from existing AbstractVectors is inexpensive. For example, indexing on columns is inexpensive, but indexing by rows is expensive because copies are made of each column.\n",
"\n",
"Because column types can vary, a DataFrame is not type stable. For performance-critical code, do not index into a DataFrame inside of loops.\n",
"\n",
"**Examples**\n",
"\n",
"```julia\n",
"df = DataFrame()\n",
"v = [\"x\",\"y\",\"z\"][rand(1:3, 10)]\n",
"df1 = DataFrame(Any[collect(1:10), v, rand(10)], [:A, :B, :C])\n",
"df2 = DataFrame(A = 1:10, B = v, C = rand(10))\n",
"dump(df1)\n",
"dump(df2)\n",
"describe(df2)\n",
"head(df1)\n",
"df1[:A] + df2[:C]\n",
"df1[1:4, 1:2]\n",
"df1[[:A,:C]]\n",
"df1[1:2, [:A,:C]]\n",
"df1[:, [:A,:C]]\n",
"df1[:, [1,3]]\n",
"df1[1:4, :]\n",
"df1[1:4, :C]\n",
"df1[1:4, :C] = 40. * df1[1:4, :C]\n",
"[df1; df2] # vcat\n",
"[df1 df2] # hcat\n",
"size(df1)\n",
"```\n"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"?DataFrame"
]
},
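{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny runnable version of the constructor examples from the help text above (the names and values here are illustrative only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# keyword constructor: keys become column names, values become columns\n",
"df_demo = DataFrame(A = 1:3, B = [\"x\", \"y\", \"z\"], C = rand(3))\n",
"# column-vector constructor from the same help text, equivalent to df_demo\n",
"df_demo2 = DataFrame(Any[collect(1:3), [\"x\", \"y\", \"z\"], rand(3)], [:A, :B, :C])\n",
"size(df_demo), names(df_demo)"
]
},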
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Check ditribution of training samples"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"159571-element BitArray{1}:\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" true\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" true\n",
" ⋮\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false\n",
" false"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train[:bad_ones] = train[:toxic] + train[:severe_toxic] + train[:obscene] + train[:threat] + train[:insult] + train[:identity_hate] .> 0"
]
},
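{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short summary of that distribution; a sketch using `countmap` from the StatsBase package loaded at the top (the printed counts depend on the data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# overall flagged/clean split from the bad_ones column computed above\n",
"println(countmap(train[:bad_ones]))\n",
"# positive count for each of the six annotation columns\n",
"for label in [:toxic, :severe_toxic, :obscene, :threat, :insult, :identity_hate]\n",
"    println(label, \" => \", sum(train[label]))\n",
"end"
]
},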
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"ename": "LoadError",
"evalue": "\u001b[91mUndefVarError: @from not defined\u001b[39m",
"output_type": "error",
"traceback": [
"\u001b[91mUndefVarError: @from not defined\u001b[39m",
"",
"Stacktrace:",
" [1] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m"
]
}
],
"source": [
"@time bad_ones = @from i in train begin\n",
" @where i.bad_ones > 0\n",
" @select i\n",
" @collect DataFrame\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0.376406 seconds (151.12 k allocations: 8.744 MiB)\n"
]
},
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>id</th><th>comment_text</th><th>toxic</th><th>severe_toxic</th><th>obscene</th><th>threat</th><th>insult</th><th>identity_hate</th><th>bad_ones</th></tr></thead><tbody><tr><th>1</th><td>0002bcb3da6cb337</td><td>COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>2</th><td>0005c987bdfc9d4b</td><td>Hey... what is it..\\n@ | talk .\\nWhat is it... an exclusive group of some WP TALIBANS...who are good at destroying, self-appointed purist who GANG UP any one who asks them questions abt their ANTI-SOCIAL and DESTRUCTIVE (non)-contribution at WP?\\n\\nAsk Sityush to clean up his behavior than issue me nonsensical warnings...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>3</th><td>0007e25b2121310b</td><td>Bye! \\n\\nDon't look, come or think of comming back! Tosser.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>4</th><td>001810bf8c45bf5f</td><td>You are gay or antisemmitian? \\n\\nArchangel WHite Tiger\\n\\nMeow! Greetingshhh!\\n\\nUh, there are two ways, why you do erased my comment about WW2, that holocaust was brutally slaying of Jews and not gays/Gypsys/Slavs/anyone...\\n\\n1 - If you are anti-semitian, than shave your head bald and go to the skinhead meetings!\\n\\n2 - If you doubt words of the Bible, that homosexuality is a deadly sin, make a pentagram tatoo on your forehead go to the satanistic masses with your gay pals!\\n\\n3 - First and last warning, you fucking gay - I won't appreciate if any more nazi shwain would write in my page! I don't wish to talk to you anymore!\\n\\nBeware of the Dark Side!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>5</th><td>00190820581d90ce</td><td>FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>6</th><td>001956c382006abd</td><td>I'm Sorry \\n\\nI'm sorry I screwed around with someones talk page. It was very bad to do. I know how having the templates on their talk page helps you assert your dominance over them. I know I should bow down to the almighty administrators. But then again, I'm going to go play outside....with your mom. 76.122.79.82</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>7</th><td>001dc38a83d420cf</td><td>GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK THAT YOU CANT PUT DOWN???/ GET FUCK UP GET FUCKED UP. I'M FUCKED UP RIGHT NOW!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>8</th><td>0020e7119b96eeeb</td><td>Stupid peace of shit stop deleting my stuff asshole go die and fall in a hole go to hell!</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>9</th><td>0020fd96ed3b8c8b</td><td>=Tony Sidaway is obviously a fistfuckee. He loves an arm up his ass.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>10</th><td>0021fe88bc4da3e6</td><td>My Band Page's deletion. You thought I was gone. \\n\\nDeleting the comment I posted on your 'talk page' does not delete my feelings for your hasty decision to delete my page. You, sir, are still a cock-sucking douche fuck. Sit back for a second and think about what your life has become. ............ Done? That didn't take long did it? Nope. 
Because, as I am most certainly aware, your life is a shitstorm of sitting in front of the computer masturbating to fictional creatures. Your attempts to get rid of me are mediocre at best. You are constantly sucking the dick of failure. You don't want a personal attack, huh? Well, too bad, Fuckcock McDickerson. You've got one. From me. You can go ahead and delete my profile from wikipedia. I'll just make another one and come right back to 'Syrthiss's talk page and insult the dick off of you. How could you shatter the dreams of an innocent eighteen year old college freshman trying to make a name for his band. Does that make you happy? Fucking with people because you're an overweight, single, old man in a dead-end job. Did you spot that perhaps someone else was going to follow his dreams and you were trying to hold him back so somebody else could suffer like you? Yes you did. I don't make empty threats, so I won't be saying anything along the lines of 'i'll hurt you' or 'i'll eat the children from within your sister's womb', but I will say that you are a asshole, son-of-a-bitch, mother fucking cock sucker. So, go eat some more food and drown your sorrows you premature ejaculating, bald headed fuck.\\n\\nYou should do something nice for yourself, maybe go grab a couple of Horny Goat Weeds from your local convenience store and jack off for a little longer than three minutes tonight.\\n\\nSincerely,\\nAn Asshole That's Better Than You In Every Way.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>11</th><td>002264ea4d5f2887</td><td>Why can't you believe how fat Artie is? Did you see him on his recent appearence on the Tonight Show with Jay Leno? He looks absolutely AWFUL! If I had to put money on it, I'd say that Artie Lange is a can't miss candidate for the 2007 Dead pool! \\n\\n \\nKindly keep your malicious fingers off of my above comment, . Everytime you remove it, I will repost it!!!</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>12</th><td>0028d62e8a5629aa</td><td>All of my edits are good. Cunts like you who revert good edits because you're too stupid to understand how to write well , and then revert other edits just because you've decided to bear a playground grudge, are the problem. Maybe one day you'll realise the damage you did to a noble project. 201.215.187.159</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>13</th><td>003217c3eb469ba9</td><td>Hi! I am back again!\\nLast warning!\\nStop undoing my edits or die!</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>true</td></tr><tr><th>14</th><td>0036621e4c7e10b5</td><td>Would you both shut up, you don't run wikipedia, especially a stupid kid.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>15</th><td>00472b8e2d38d1ea</td><td>A pair of jew-hating weiner nazi schmucks.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>16</th><td>005f59485fcddeb0</td><td>\"\\n\\nSORRY PUCK BUT NO ONE EVER SAID DICK WAS NUMBER ONE!! But he is clearly listed as being \"\"second only to Batman\"\" and a \"\"master detective and martial artist\"\" and been trained in all the mentioned combat discplines. No need to mention in skills/abilities about the battles he's lost and won because thats not what an encylopia is supposed to do. 
\\n\\nAlso, what your problem is your going by low end showings and not HIGH end showings where he's proven more than a match for a meta like Slade and pinned Shiva down in one of his earliest appearences.\\n \"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>17</th><td>00637960a7ec3436</td><td>\"\\n\\nUNBLOCK ME OR I'LL GET MY LAWYERS ON TO YOU FOR BLOCKING MY CONSTITUTIONAL RIGHT TO FREE SPEECH\"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>18</th><td>00686325bcc16080</td><td>You should be fired, you're a moronic wimp who is too lazy to do research. It makes me sick that people like you exist in this world.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>19</th><td>006b94add72ed61c</td><td>I think that your a Fagget get a oife and burn in Hell I hate you 'm sorry we cant have any more sex i'm running out of conndoms</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>true</td></tr><tr><th>20</th><td>006d11791d76b9f3</td><td>REPLY ABOVE:\\nThat was me, loser. The UN defines Vietnam to be part of Southeast Asia. And far as I know Vietnam is part of ASEAN, and used to be part of French Indochina with Laos and all those shit countries Anyway your culture has always been more influenced by SEA than Han Chinese (as in proper Yangtze Han Chinese, not the fringe indigenous tribes in Guangzhou/Guangxi). \\n\\nJust admit that you vietnamese are all a bunch of wannabe crap people. ALL the east asian people I've spoken to thinks of Vietnam as a very integral part of SEA, and we all think you're backward, dirty and speak in a horrible swearing language. Doesn't matter what crap you spout on Wikipedia, won't change the way people in the real world think.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>21</th><td>006e87872c8b370c</td><td>you are a stupid fuck \\n\\nand your mother's cunt stinks</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>22</th><td>007f1839ada915e6</td><td>Your blatant POV pushing \\n\\nNeither of you guys has made any contribution to this Italian history article other than to shove your unhistorical unconstructive modern POV in my face.\\nThis is a HISTORY article. HISTORY.\\nHave you heard of that?\\nThis is the reason why so many people get pissed off about the pedantry and idiocy and triviality of Wikipedia.\\nJ*sus. Get a f*cking life.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>23</th><td>0082b5a7b4a67da2</td><td>Give me a permanat block raseac....!!! remember me....!!! 
Muslims dont you worry for these images they are trying to hurt us but they cant look these images are fake and have nothing to do with Islam or our prophet even their article is so lame wikipedia is not an authentic source its an encyclopedia for the kids which has not authenticity anyone can edit or write whatever they think...just like raeasc you can see he does not knows what Islam is and he talks about it so forget about wiki its lame they have no authenticity...i recommend they should add pissed christ in jessus article....they should...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>24</th><td>0086998b34865f93</td><td>Fuck you, block me, you faggot pussy!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>25</th><td>008e0818dde894fb</td><td>Kill all niggers. \\n\\nI have hard, that others have said this.. should this be included? That racists sometimes say these.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>26</th><td>009371b0ef213487</td><td>Burn Deck \\n\\nIf that'd guy's burn deck is like what I think it is than he is in for a blrude awakening!!! Leviathan motherfucker!!!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>27</th><td>0097dd5c29bf7a15</td><td>u r a tw@ fuck off u gay boy.U r smelly.Fuck ur mum poopie</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>28</th><td>00ab65775c601cf9</td><td>Atheism is full of bias shit</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>29</th><td>00afb4dec99a231f</td><td>Hey why you are spreading misconceptions and trying to spread false information to the people.You not even know who won the battle and who lost.you are the cheapest person on the earth.Bloody ignorant fool.Check sources before spreading rumors</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>30</th><td>00b77cb600c897b4</td><td>\"\\n\\nAnd you are? Let me know when you've crawled out of their arses. '''''' Dick Laurent is dead \"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>&vellip;</th><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td></tr></tbody></table>"
],
"text/plain": [
"16225×9 DataFrames.DataFrame. Omitted printing of 8 columns\n",
"│ Row │ id │\n",
"├───────┼──────────────────┤\n",
"│ 1 │ 0002bcb3da6cb337 │\n",
"│ 2 │ 0005c987bdfc9d4b │\n",
"│ 3 │ 0007e25b2121310b │\n",
"│ 4 │ 001810bf8c45bf5f │\n",
"│ 5 │ 00190820581d90ce │\n",
"│ 6 │ 001956c382006abd │\n",
"│ 7 │ 001dc38a83d420cf │\n",
"│ 8 │ 0020e7119b96eeeb │\n",
"│ 9 │ 0020fd96ed3b8c8b │\n",
"│ 10 │ 0021fe88bc4da3e6 │\n",
"│ 11 │ 002264ea4d5f2887 │\n",
"⋮\n",
"│ 16214 │ fd052883fa6a8697 │\n",
"│ 16215 │ fd2f53aafe8eefcc │\n",
"│ 16216 │ fd68ef478b3dfd05 │\n",
"│ 16217 │ fdc92e571d39e7e1 │\n",
"│ 16218 │ fdce660ddcd6d7ca │\n",
"│ 16219 │ feb5637c531f933d │\n",
"│ 16220 │ fef142420a215b90 │\n",
"│ 16221 │ fef4cf7ba0012866 │\n",
"│ 16222 │ ff39a2895fc3b40e │\n",
"│ 16223 │ ffa33d3122b599d6 │\n",
"│ 16224 │ ffb47123b2d82762 │\n",
"│ 16225 │ ffbdbb0483ed0841 │"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@time bad_ones = train[(train[:bad_ones] .> 0),:]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10.167887648758233"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(size(bad_ones)[1] / size(train)[1]) * 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Dataset is completely unbalanced,agro comments are only 10% of the total dataset.\n",
"\n",
"1. Balance the dataset \n",
"2. "
]
},
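{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the undersampling step, assuming `good_ones` holds the non-toxic rows (it is constructed a few cells below) and using `StatsBase.sample` to draw clean comments without replacement:\n",
"\n",
"```\n",
"using StatsBase                                            # provides sample()\n",
"\n",
"n_bad = size(bad_ones, 1)                                  # 16225 toxic comments\n",
"idx = sample(1:size(good_ones, 1), n_bad, replace=false)   # random clean rows\n",
"balanced = vcat(bad_ones, good_ones[idx, :])               # 50/50 toxic vs. clean\n",
"```"
]
},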
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>bad_ones</th><th>x1</th></tr></thead><tbody><tr><th>1</th><td>false</td><td>143346</td></tr><tr><th>2</th><td>true</td><td>16225</td></tr></tbody></table>"
],
"text/plain": [
"2×2 DataFrames.DataFrame\n",
"│ Row │ bad_ones │ x1 │\n",
"├─────┼──────────┼────────┤\n",
"│ 1 │ false │ 143346 │\n",
"│ 2 │ true │ 16225 │"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"by(train, [:bad_ones], df -> length(df[:id]))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>id</th><th>comment_text</th><th>toxic</th><th>severe_toxic</th><th>obscene</th><th>threat</th><th>insult</th><th>identity_hate</th><th>bad_ones</th></tr></thead><tbody><tr><th>1</th><td>0000997932d777bf</td><td>Explanation\\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>2</th><td>000103f0d9cfb60f</td><td>D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>3</th><td>000113f07ec002fd</td><td>Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>4</th><td>0001b41b1c6bb37e</td><td>\"\\nMore\\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of \"\"types of accidents\"\" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\\n\\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>5</th><td>0001d958c54c6e35</td><td>You, sir, are my hero. Any chance you remember what page that's on?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>6</th><td>00025465d4725e87</td><td>\"\\n\\nCongratulations from me as well, use the tools well.  · talk \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>7</th><td>00031b1e95af7921</td><td>Your vandalism to the Matt Shirvington article has been reverted. Please don't do it again, or you will be banned.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>8</th><td>00037261f536c51d</td><td>Sorry if the word 'nonsense' was offensive to you. Anyway, I'm not intending to write anything in the article(wow they would jump on me for vandalism), I'm merely requesting that it be more encyclopedic so one can use it for school as a reference. I have been to the selective breeding page but it's almost a stub. It points to 'animal breeding' which is a short messy article that gives you no info. There must be someone around with expertise in eugenics? 93.161.107.169</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>9</th><td>00040093b2687caa</td><td>alignment on this subject and which are contrary to those of DuLithgow</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>10</th><td>0005300084f90edc</td><td>\"\\nFair use rationale for Image:Wonju.jpg\\n\\nThanks for uploading Image:Wonju.jpg. 
I notice the image page specifies that the image is being used under fair use but there is no explanation or rationale as to why its use in Wikipedia articles constitutes fair use. In addition to the boilerplate fair use template, you must also write out on the image description page a specific explanation or rationale for why using this image in each article is consistent with fair use.\\n\\nPlease go to the image description page and edit it to include a fair use rationale.\\n\\nIf you have uploaded other fair use media, consider checking that you have specified the fair use rationale on those pages too. You can find a list of 'image' pages you have edited by clicking on the \"\"my contributions\"\" link (it is located at the very top of any Wikipedia page when you are logged in), and then selecting \"\"Image\"\" from the dropdown box. Note that any fair use images uploaded after 4 May, 2006, and lacking such an explanation will be deleted one week after they have been uploaded, as described on criteria for speedy deletion. If you have any questions please ask them at the Media copyright questions page. Thank you. (talk • contribs • ) \\nUnspecified source for Image:Wonju.jpg\\n\\nThanks for uploading Image:Wonju.jpg. I noticed that the file's description page currently doesn't specify who created the content, so the copyright status is unclear. If you did not create this file yourself, then you will need to specify the owner of the copyright. If you obtained it from a website, then a link to the website from which it was taken, together with a restatement of that website's terms of use of its content, is usually sufficient information. However, if the copyright holder is different from the website's publisher, then their copyright should also be acknowledged.\\n\\nAs well as adding the source, please add a proper copyright licensing tag if the file doesn't have one already. If you created/took the picture, audio, or video then the tag can be used to release it under the GFDL. If you believe the media meets the criteria at Wikipedia:Fair use, use a tag such as or one of the other tags listed at Wikipedia:Image copyright tags#Fair use. See Wikipedia:Image copyright tags for the full list of copyright tags that you can use.\\n\\nIf you have uploaded other files, consider checking that you have specified their source and tagged them, too. You can find a list of files you have uploaded by following [ this link]. Unsourced and untagged images may be deleted one week after they have been tagged, as described on criteria for speedy deletion. If the image is copyrighted under a non-free license (per Wikipedia:Fair use) then the image will be deleted 48 hours after . If you have any questions please ask them at the Media copyright questions page. Thank you. (talk • contribs • ) \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>11</th><td>00054a5e18b50dd4</td><td>bbq \\n\\nbe a man and lets discuss it-maybe over the phone?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>12</th><td>0006f16e4e9f292e</td><td>Before you start throwing accusations and warnings at me, lets review the edit itself-making ad hominem attacks isn't going to strengthen your argument, it will merely make it look like you are abusing your power as an admin. \\nNow, the edit itself is relevant-this is probably the single most talked about event int he news as of late. 
His absence is notable, since he is the only living ex-president who did not attend. That's certainly more notable than his dedicating an aircracft carrier. \\nI intend to revert this edit, in hopes of attracting the attention of an admin that is willing to look at the issue itself, and not throw accusations around quite so liberally. Perhaps, if you achieve a level of civility where you can do this, we can have a rational discussion on the topic and resolve the matter peacefully.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>13</th><td>00070ef96486d6f9</td><td>Oh, and the girl above started her arguments with me. She stuck her nose where it doesn't belong. I believe the argument was between me and Yvesnimmo. But like I said, the situation was settled and I apologized. Thanks,</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>14</th><td>00078f8ce7eb276d</td><td>\"\\n\\nJuelz Santanas Age\\n\\nIn 2002, Juelz Santana was 18 years old, then came February 18th, which makes Juelz turn 19 making songs with The Diplomats. The third neff to be signed to Cam's label under Roc A Fella. In 2003, he was 20 years old coming out with his own singles \"\"Santana's Town\"\" and \"\"Down\"\". So yes, he is born in 1983. He really is, how could he be older then Lloyd Banks? And how could he be 22 when his birthday passed? The homie neff is 23 years old. 1983 - 2006 (Juelz death, god forbid if your thinking about that) equals 23. Go to your caculator and stop changing his year of birth. My god.\"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>15</th><td>000897889268bc93</td><td>REDIRECT Talk:Voydan Pop Georgiev- Chernodrinski</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>16</th><td>0009801bd85e5806</td><td>The Mitsurugi point made no sense - why not argue to include Hindi on Ryo Sakazaki's page to include more information?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>17</th><td>0009eaea3325de8c</td><td>Don't mean to bother you \\n\\nI see that you're writing something regarding removing anything posted here and if you do oh well but if not and you can acctually discuss this with me then even better.\\n\\nI'd like to ask you to take a closer look at the Premature wrestling deaths catagory and the men listed in it, surely these men belong together in some catagory. Is there anything that you think we can do with the catagory besides delting it?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>18</th><td>000b08c464718505</td><td>\"\\n\\n Regarding your recent edits \\n\\nOnce again, please read WP:FILMPLOT before editing any more film articles. Your edits are simply not good, with entirely too many unnecessary details and very bad writing. Please stop before you do further damage. -''''''The '45 \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>19</th><td>000bfd0867774845</td><td>\"\\nGood to know. About me, yeah, I'm studying now.(Deepu) \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>20</th><td>000c0dfd995809fa</td><td>\"\\n\\n Snowflakes are NOT always symmetrical! \\n\\nUnder Geometry it is stated that \"\"A snowflake always has six symmetric arms.\"\" This assertion is simply not true! 
According to Kenneth Libbrecht, \"\"The rather unattractive irregular crystals are by far the most common variety.\"\" http://www.its.caltech.edu/~atomic/snowcrystals/myths/myths.htm#perfection Someone really need to take a look at his site and get FACTS off of it because I still see a decent number of falsities on this page. (forgive me Im new at this and dont want to edit anything)\"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>21</th><td>000c6a3f0cd3ba8e</td><td>\"\\n\\n The Signpost: 24 September 2012 \\n\\n Read this Signpost in full\\n Single-page\\n Unsubscribe\\n \\n\"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>22</th><td>000cfee90f50d471</td><td>\"\\n\\nRe-considering 1st paragraph edit?\\nI don't understand the reasons for 's recent edit of this article not that I'm sure that the data are necessarily \"\"wrong.\"\" Rather, I'm persuaded that the strategy of introducing academic honors in the first paragraph is an unhelpful approach to this specific subject. I note that articles about other sitting Justices have been similarly \"\"enhanced;\"\" and I also believe those changes are no improvement. \\n\\nIn support of my view that this edit should be reverted, I would invite anyone to re-visit articles written about the following pairs of jurists.\\n A1. Benjamin Cardozo\\n A2. Learned Hand\\n\\n B1. John Marshall Harlan\\n B2. John Marshall Harlan II\\n\\nThe question becomes: Would the current version of the Wikipedia article about any one of them or either pair be improved by academic credentials in the introductory paragraph? I think not.\\n\\nPerhaps it helps to repeat a wry argument Kathleen Sullivan of Stanford Law makes when she suggests that some on the Harvard Law faculty wonder how Antonin Scalia avoided learning what others have managed to grasp about the processes of judging? I would hope this anecdote gently illustrates the point. \\n\\nLess humorous, but an even stronger argument is the one Clarence Thomas makes when he mentions wanting to return his law degree to Yale.\\n\\nAt a minimum, I'm questioning this edit? It deserves to be reconsidered. \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>23</th><td>000eefc67a2c930f</td><td>Radial symmetry \\n\\nSeveral now extinct lineages included in the Echinodermata were bilateral such as Homostelea, or even asymmetrical such as Cothurnocystis (Stylophora).\\n\\n-</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>24</th><td>000f35deef84dc4a</td><td>There's no need to apologize. A Wikipedia article is made for reconciling knowledge about a subject from different sources, and you've done history studies and not archaeology studies, I guess. I could scan the page, e-mail it to you, and then you could ask someone to translate the page.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>25</th><td>000ffab30195c5e1</td><td>Yes, because the mother of the child in the case against Michael Jackson was studied in here motives and reasonings and judged upon her character just as harshly as Wacko Jacko himself. Don't tell me to ignore it and incriminate myself. I am going to continue refuting the bullshit that Jayjg keeps throwing at me. 18:01, 16 Jun 2005 (UTC)</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>26</th><td>0010307a3a50a353</td><td>\"\\nOk. 
But it will take a bit of work but I can't quite picture it. Do you have an example I can base it on? the Duck \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>27</th><td>0010833a96e1f886</td><td>\"== A barnstar for you! ==\\n\\n The Real Life Barnstar lets us be the stars\\n \"</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>28</th><td>0011cc71398479c4</td><td>How could I post before the block expires? The funny thing is, you think I'm being uncivil!</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>29</th><td>00128363e367d703</td><td>Not sure about a heading of 'Fight for Freedom' what will it contain?</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>30</th><td>001325b8b20ea8aa</td><td>Praise \\n\\nlooked at this article about 6 months ago -much improved. ]</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>false</td></tr><tr><th>&vellip;</th><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td></tr></tbody></table>"
],
"text/plain": [
"143346×9 DataFrames.DataFrame. Omitted printing of 8 columns\n",
"│ Row │ id │\n",
"├────────┼──────────────────┤\n",
"│ 1 │ 0000997932d777bf │\n",
"│ 2 │ 000103f0d9cfb60f │\n",
"│ 3 │ 000113f07ec002fd │\n",
"│ 4 │ 0001b41b1c6bb37e │\n",
"│ 5 │ 0001d958c54c6e35 │\n",
"│ 6 │ 00025465d4725e87 │\n",
"│ 7 │ 00031b1e95af7921 │\n",
"│ 8 │ 00037261f536c51d │\n",
"│ 9 │ 00040093b2687caa │\n",
"│ 10 │ 0005300084f90edc │\n",
"│ 11 │ 00054a5e18b50dd4 │\n",
"⋮\n",
"│ 143335 │ ffca8d71d71a3fae │\n",
"│ 143336 │ ffcdcb71854f6d8a │\n",
"│ 143337 │ ffd2e85b07b3c7e4 │\n",
"│ 143338 │ ffd72e9766c09c97 │\n",
"│ 143339 │ ffe029a7c79dc7fe │\n",
"│ 143340 │ ffe897e7f7182c90 │\n",
"│ 143341 │ ffe8b9316245be30 │\n",
"│ 143342 │ ffe987279560d7ff │\n",
"│ 143343 │ ffea4adeee384e90 │\n",
"│ 143344 │ ffee36eab5c267c9 │\n",
"│ 143345 │ fff125370e4aaaf3 │\n",
"│ 143346 │ fff46fc426af1f9a │"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"good_ones = train[(train[:bad_ones] .== 0), :]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(143346, 9)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"size(good_ones)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>id</th><th>comment_text</th><th>toxic</th><th>severe_toxic</th><th>obscene</th><th>threat</th><th>insult</th><th>identity_hate</th><th>bad_ones</th></tr></thead><tbody><tr><th>1</th><td>0002bcb3da6cb337</td><td>COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>4</td></tr><tr><th>2</th><td>0005c987bdfc9d4b</td><td>Hey... what is it..\\n@ | talk .\\nWhat is it... an exclusive group of some WP TALIBANS...who are good at destroying, self-appointed purist who GANG UP any one who asks them questions abt their ANTI-SOCIAL and DESTRUCTIVE (non)-contribution at WP?\\n\\nAsk Sityush to clean up his behavior than issue me nonsensical warnings...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>3</th><td>0007e25b2121310b</td><td>Bye! \\n\\nDon't look, come or think of comming back! Tosser.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>4</th><td>001810bf8c45bf5f</td><td>You are gay or antisemmitian? \\n\\nArchangel WHite Tiger\\n\\nMeow! Greetingshhh!\\n\\nUh, there are two ways, why you do erased my comment about WW2, that holocaust was brutally slaying of Jews and not gays/Gypsys/Slavs/anyone...\\n\\n1 - If you are anti-semitian, than shave your head bald and go to the skinhead meetings!\\n\\n2 - If you doubt words of the Bible, that homosexuality is a deadly sin, make a pentagram tatoo on your forehead go to the satanistic masses with your gay pals!\\n\\n3 - First and last warning, you fucking gay - I won't appreciate if any more nazi shwain would write in my page! I don't wish to talk to you anymore!\\n\\nBeware of the Dark Side!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>4</td></tr><tr><th>5</th><td>00190820581d90ce</td><td>FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>3</td></tr><tr><th>6</th><td>001956c382006abd</td><td>I'm Sorry \\n\\nI'm sorry I screwed around with someones talk page. It was very bad to do. I know how having the templates on their talk page helps you assert your dominance over them. I know I should bow down to the almighty administrators. But then again, I'm going to go play outside....with your mom. 76.122.79.82</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>7</th><td>001dc38a83d420cf</td><td>GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK THAT YOU CANT PUT DOWN???/ GET FUCK UP GET FUCKED UP. I'M FUCKED UP RIGHT NOW!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>2</td></tr><tr><th>8</th><td>0020e7119b96eeeb</td><td>Stupid peace of shit stop deleting my stuff asshole go die and fall in a hole go to hell!</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>4</td></tr><tr><th>9</th><td>0020fd96ed3b8c8b</td><td>=Tony Sidaway is obviously a fistfuckee. He loves an arm up his ass.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>3</td></tr><tr><th>10</th><td>0021fe88bc4da3e6</td><td>My Band Page's deletion. You thought I was gone. \\n\\nDeleting the comment I posted on your 'talk page' does not delete my feelings for your hasty decision to delete my page. You, sir, are still a cock-sucking douche fuck. Sit back for a second and think about what your life has become. ............ Done? That didn't take long did it? Nope. 
Because, as I am most certainly aware, your life is a shitstorm of sitting in front of the computer masturbating to fictional creatures. Your attempts to get rid of me are mediocre at best. You are constantly sucking the dick of failure. You don't want a personal attack, huh? Well, too bad, Fuckcock McDickerson. You've got one. From me. You can go ahead and delete my profile from wikipedia. I'll just make another one and come right back to 'Syrthiss's talk page and insult the dick off of you. How could you shatter the dreams of an innocent eighteen year old college freshman trying to make a name for his band. Does that make you happy? Fucking with people because you're an overweight, single, old man in a dead-end job. Did you spot that perhaps someone else was going to follow his dreams and you were trying to hold him back so somebody else could suffer like you? Yes you did. I don't make empty threats, so I won't be saying anything along the lines of 'i'll hurt you' or 'i'll eat the children from within your sister's womb', but I will say that you are a asshole, son-of-a-bitch, mother fucking cock sucker. So, go eat some more food and drown your sorrows you premature ejaculating, bald headed fuck.\\n\\nYou should do something nice for yourself, maybe go grab a couple of Horny Goat Weeds from your local convenience store and jack off for a little longer than three minutes tonight.\\n\\nSincerely,\\nAn Asshole That's Better Than You In Every Way.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>2</td></tr><tr><th>11</th><td>002264ea4d5f2887</td><td>Why can't you believe how fat Artie is? Did you see him on his recent appearence on the Tonight Show with Jay Leno? He looks absolutely AWFUL! If I had to put money on it, I'd say that Artie Lange is a can't miss candidate for the 2007 Dead pool! \\n\\n \\nKindly keep your malicious fingers off of my above comment, . Everytime you remove it, I will repost it!!!</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>12</th><td>0028d62e8a5629aa</td><td>All of my edits are good. Cunts like you who revert good edits because you're too stupid to understand how to write well , and then revert other edits just because you've decided to bear a playground grudge, are the problem. Maybe one day you'll realise the damage you did to a noble project. 201.215.187.159</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>3</td></tr><tr><th>13</th><td>003217c3eb469ba9</td><td>Hi! I am back again!\\nLast warning!\\nStop undoing my edits or die!</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>2</td></tr><tr><th>14</th><td>0036621e4c7e10b5</td><td>Would you both shut up, you don't run wikipedia, especially a stupid kid.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>2</td></tr><tr><th>15</th><td>00472b8e2d38d1ea</td><td>A pair of jew-hating weiner nazi schmucks.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>4</td></tr><tr><th>16</th><td>005f59485fcddeb0</td><td>\"\\n\\nSORRY PUCK BUT NO ONE EVER SAID DICK WAS NUMBER ONE!! But he is clearly listed as being \"\"second only to Batman\"\" and a \"\"master detective and martial artist\"\" and been trained in all the mentioned combat discplines. No need to mention in skills/abilities about the battles he's lost and won because thats not what an encylopia is supposed to do. 
\\n\\nAlso, what your problem is your going by low end showings and not HIGH end showings where he's proven more than a match for a meta like Slade and pinned Shiva down in one of his earliest appearences.\\n \"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>17</th><td>00637960a7ec3436</td><td>\"\\n\\nUNBLOCK ME OR I'LL GET MY LAWYERS ON TO YOU FOR BLOCKING MY CONSTITUTIONAL RIGHT TO FREE SPEECH\"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>18</th><td>00686325bcc16080</td><td>You should be fired, you're a moronic wimp who is too lazy to do research. It makes me sick that people like you exist in this world.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>2</td></tr><tr><th>19</th><td>006b94add72ed61c</td><td>I think that your a Fagget get a oife and burn in Hell I hate you 'm sorry we cant have any more sex i'm running out of conndoms</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>5</td></tr><tr><th>20</th><td>006d11791d76b9f3</td><td>REPLY ABOVE:\\nThat was me, loser. The UN defines Vietnam to be part of Southeast Asia. And far as I know Vietnam is part of ASEAN, and used to be part of French Indochina with Laos and all those shit countries Anyway your culture has always been more influenced by SEA than Han Chinese (as in proper Yangtze Han Chinese, not the fringe indigenous tribes in Guangzhou/Guangxi). \\n\\nJust admit that you vietnamese are all a bunch of wannabe crap people. ALL the east asian people I've spoken to thinks of Vietnam as a very integral part of SEA, and we all think you're backward, dirty and speak in a horrible swearing language. Doesn't matter what crap you spout on Wikipedia, won't change the way people in the real world think.</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td></tr><tr><th>21</th><td>006e87872c8b370c</td><td>you are a stupid fuck \\n\\nand your mother's cunt stinks</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>4</td></tr><tr><th>22</th><td>007f1839ada915e6</td><td>Your blatant POV pushing \\n\\nNeither of you guys has made any contribution to this Italian history article other than to shove your unhistorical unconstructive modern POV in my face.\\nThis is a HISTORY article. HISTORY.\\nHave you heard of that?\\nThis is the reason why so many people get pissed off about the pedantry and idiocy and triviality of Wikipedia.\\nJ*sus. Get a f*cking life.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>2</td></tr><tr><th>23</th><td>0082b5a7b4a67da2</td><td>Give me a permanat block raseac....!!! remember me....!!! Muslims dont you worry for these images they are trying to hurt us but they cant look these images are fake and have nothing to do with Islam or our prophet even their article is so lame wikipedia is not an authentic source its an encyclopedia for the kids which has not authenticity anyone can edit or write whatever they think...just like raeasc you can see he does not knows what Islam is and he talks about it so forget about wiki its lame they have no authenticity...i recommend they should add pissed christ in jessus article....they should...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>24</th><td>0086998b34865f93</td><td>Fuck you, block me, you faggot pussy!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>3</td></tr><tr><th>25</th><td>008e0818dde894fb</td><td>Kill all niggers. 
\\n\\nI have hard, that others have said this.. should this be included? That racists sometimes say these.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>4</td></tr><tr><th>26</th><td>009371b0ef213487</td><td>Burn Deck \\n\\nIf that'd guy's burn deck is like what I think it is than he is in for a blrude awakening!!! Leviathan motherfucker!!!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>3</td></tr><tr><th>27</th><td>0097dd5c29bf7a15</td><td>u r a tw@ fuck off u gay boy.U r smelly.Fuck ur mum poopie</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>4</td></tr><tr><th>28</th><td>00ab65775c601cf9</td><td>Atheism is full of bias shit</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>29</th><td>00afb4dec99a231f</td><td>Hey why you are spreading misconceptions and trying to spread false information to the people.You not even know who won the battle and who lost.you are the cheapest person on the earth.Bloody ignorant fool.Check sources before spreading rumors</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>30</th><td>00b77cb600c897b4</td><td>\"\\n\\nAnd you are? Let me know when you've crawled out of their arses. '''''' Dick Laurent is dead \"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr><tr><th>&vellip;</th><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td></tr></tbody></table>"
],
"text/plain": [
"32450×9 DataFrames.DataFrame. Omitted printing of 8 columns\n",
"│ Row │ id │\n",
"├───────┼──────────────────┤\n",
"│ 1 │ 0002bcb3da6cb337 │\n",
"│ 2 │ 0005c987bdfc9d4b │\n",
"│ 3 │ 0007e25b2121310b │\n",
"│ 4 │ 001810bf8c45bf5f │\n",
"│ 5 │ 00190820581d90ce │\n",
"│ 6 │ 001956c382006abd │\n",
"│ 7 │ 001dc38a83d420cf │\n",
"│ 8 │ 0020e7119b96eeeb │\n",
"│ 9 │ 0020fd96ed3b8c8b │\n",
"│ 10 │ 0021fe88bc4da3e6 │\n",
"│ 11 │ 002264ea4d5f2887 │\n",
"⋮\n",
"│ 32439 │ dcb3f3bc022a4ba8 │\n",
"│ 32440 │ e7d3543815292a53 │\n",
"│ 32441 │ dd3ef00751a8710b │\n",
"│ 32442 │ 7506b56eadf7ff6d │\n",
"│ 32443 │ 1a67a4f7d445746c │\n",
"│ 32444 │ 913e2989362cafb4 │\n",
"│ 32445 │ 445b71f7a6592276 │\n",
"│ 32446 │ e7e0bc7bc9ec48ed │\n",
"│ 32447 │ e34ba2f6960d2e88 │\n",
"│ 32448 │ 3c23fd11d919bfc9 │\n",
"│ 32449 │ a6bdff2b526ec26d │\n",
"│ 32450 │ 7963f7aa42dd1709 │"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"balanced_train = vcat(bad_ones, good_ones[sample(1:size(good_ones, 1), size(bad_ones, 1), replace=false, ordered=false), :])"
]
},
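{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch reusing the `by` call from above): the balanced set should now contain 16225 rows of each class, i.e. 32450 rows in total, matching the shape printed above.\n",
"\n",
"```\n",
"by(balanced_train, [:bad_ones], df -> length(df[:id]))  # expect 16225 per class\n",
"```"
]
},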
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"search: \u001b[1mT\u001b[22m\u001b[1me\u001b[22m\u001b[1mx\u001b[22m\u001b[1mt\u001b[22m\u001b[1mA\u001b[22m\u001b[1mn\u001b[22m\u001b[1ma\u001b[22m\u001b[1ml\u001b[22m\u001b[1my\u001b[22m\u001b[1ms\u001b[22m\u001b[1mi\u001b[22m\u001b[1ms\u001b[22m\n",
"\n"
]
},
{
"data": {
"text/markdown": [
"No documentation found.\n",
"\n",
"Displaying the `README.md` for the module instead.\n",
"\n",
"---\n",
"\n",
"# TextAnalysis.jl\n",
"\n",
"[![Build Status](https://api.travis-ci.org/JuliaText/TextAnalysis.jl.svg)](https://travis-ci.org/JuliaText/TextAnalysis.jl) [![TextAnalysis](http://pkg.julialang.org/badges/TextAnalysis_0.5.svg)](http://pkg.julialang.org/?pkg=TextAnalysis) [![TextAnalysis](http://pkg.julialang.org/badges/TextAnalysis_0.6.svg)](http://pkg.julialang.org/?pkg=TextAnalysis)\n",
"\n",
"# Preface\n",
"\n",
"This manual is designed to get you started doing text analysis in Julia. It assumes that you already familiar with the basic methods of text analysis.\n",
"\n",
"# Outline\n",
"\n",
" * [Installation](#installation)\n",
" * [Getting Started](#getting-started)\n",
" * [Creating Documents](#creating-documents)\n",
"\n",
" * StringDocument\n",
" * FileDocument\n",
" * TokenDocument\n",
" * NGramDocument\n",
" * [Basic Functions for Working with Documents](#basic-functions-for-working-with-documents)\n",
"\n",
" * text\n",
" * tokens\n",
" * ngrams\n",
" * [Document Metadata](#document-metadata)\n",
"\n",
" * language\n",
" * name\n",
" * author\n",
" * timestamp\n",
" * [Preprocessing Documents](#preprocessing-documents)\n",
"\n",
" * Removing Corrupt UTF8\n",
" * Removing Punctuation\n",
" * Removing Case Distinctions\n",
" * Removing Words\n",
"\n",
" * Stop Words\n",
" * Articles\n",
" * Indefinite Articles\n",
" * Definite Articles\n",
" * Prepositions\n",
" * Pronouns\n",
" * Stemming\n",
" * Removing Rare Words\n",
" * Removing Sparse Words\n",
" * [Creating a Corpus](#creating-a-corpus)\n",
" * [Processing a Corpus](#processing-a-corpus)\n",
" * [Corpus Statistics](#corpus-statistics)\n",
"\n",
" * Lexicon\n",
" * Inverse Index\n",
" * [Creating a Document Term Matrix](#creating-a-document-term-matrix)\n",
" * [Creating Individual Rows of a Document Term Matrix](#creating-individual-rows-of-a-document-term-matrix)\n",
" * [The Hash Trick](#the-hash-trick)\n",
"\n",
" * Hashed DTV's\n",
" * Hashed DTM's\n",
" * [TF-IDF](#tf-idf)\n",
" * LSA: [Latent Semantic Analysis](#lsa-latent-semantic-analysis)\n",
" * LDA: [Latent Dirichlet Allocation](#lda-latent-dirichlet-allocation)\n",
" * [Extended Usage Example](#extended-usage-example): Analyzing the State of the Union Addresses\n",
"\n",
"# Installation\n",
"\n",
"The TextAnalysis package can be installed using Julia's package manager:\n",
"\n",
"```\n",
"Pkg.add(\"TextAnalysis\")\n",
"```\n",
"\n",
"# Getting Started\n",
"\n",
"In all of the examples that follow, we'll assume that you have the TextAnalysis package fully loaded. This means that we think you've implicily typed\n",
"\n",
"```\n",
"using TextAnalysis\n",
"```\n",
"\n",
"before every snippet of code.\n",
"\n",
"# Creating Documents\n",
"\n",
"The basic unit of text analysis is a document. The TextAnalysis package allows one to work with documents stored in a variety of formats:\n",
"\n",
" * _FileDocument_: A document represented using a plain text file on disk\n",
" * _StringDocument_: A document represented using a UTF8 String stored in RAM\n",
" * _TokenDocument_: A document represented as a sequence of UTF8 tokens\n",
" * _NGramDocument_: A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts\n",
"\n",
"These format represent a hierarchy: you can always move down the hierachy, but can generally not move up the hierachy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.\n",
"\n",
"Creating any of the four basic types of documents is very easy:\n",
"\n",
"```\n",
"str = \"To be or not to be...\"\n",
"sd = StringDocument(str)\n",
"\n",
"pathname = \"/usr/share/dict/words\"\n",
"fd = FileDocument(pathname)\n",
"\n",
"my_tokens = String[\"To\", \"be\", \"or\", \"not\", \"to\", \"be...\"]\n",
"td = TokenDocument(my_tokens)\n",
"\n",
"my_ngrams = Dict{String, Int}(\"To\" => 1, \"be\" => 2,\n",
" \"or\" => 1, \"not\" => 1,\n",
" \"to\" => 1, \"be...\" => 1)\n",
"ngd = NGramDocument(my_ngrams)\n",
"```\n",
"\n",
"For every type of document except a `FileDocument`, you can also construct a new document by simply passing in a string of text:\n",
"\n",
"```\n",
"sd = StringDocument(\"To be or not to be...\")\n",
"td = TokenDocument(\"To be or not to be...\")\n",
"ngd = NGramDocument(\"To be or not to be...\")\n",
"```\n",
"\n",
"The system will automatically perform tokenization or n-gramization in order to produce the required data. Unfortunately, `FileDocument`'s cannot be constructed this way because filenames are themselves strings. It would cause chaos if filenames were treated as the text contents of a document.\n",
"\n",
"That said, there is one way around this restriction: you can use the generic `Document()` constructor function, which will guess at the type of the inputs and construct the appropriate type of document object:\n",
"\n",
"```\n",
"Document(\"To be or not to be...\")\n",
"Document(\"/usr/share/dict/words\")\n",
"Document(String[\"To\", \"be\", \"or\", \"not\", \"to\", \"be...\"])\n",
"Document(Dict{String, Int}(\"a\" => 1, \"b\" => 3))\n",
"```\n",
"\n",
"This constructor is very convenient for working in the REPL, but should be avoided in permanent code because, unlike the other constructors, the return type of the `Document` function cannot be known at compile-time.\n",
"\n",
"# Basic Functions for Working with Documents\n",
"\n",
"Once you've created a document object, you can work with it in many ways. The most obvious thing is to access its text using the `text()` function:\n",
"\n",
"```\n",
"text(sd)\n",
"```\n",
"\n",
"This function works without warnings on `StringDocument`'s and `FileDocument`'s. For `TokenDocument`'s it is not possible to know if the text can be reconstructed perfectly, so calling `text(TokenDocument(\"This is text\"))` will produce a warning message before returning an approximate reconstruction of the text as it existed before tokenization. It is entirely impossible to reconstruct the text of an `NGramDocument`, so `text(NGramDocument(\"This is text\"))` raises an error.\n",
"\n",
"Instead of working with the text itself, you can work with the tokens or n-grams of a document using the `tokens()` and `ngrams()` functions:\n",
"\n",
"```\n",
"tokens(sd)\n",
"ngrams(sd)\n",
"```\n",
"\n",
"By default the `ngrams()` function produces unigrams. If you would like to produce bigrams or trigrams, you can specify that directly using a numeric argument to the `ngrams()` function:\n",
"\n",
"```\n",
"ngrams(sd, 2)\n",
"```\n",
"\n",
"If you have a `NGramDocument`, you can determine whether an `NGramDocument` contains unigrams, bigrams or a higher-order representation using the `ngram_complexity()` function:\n",
"\n",
"```\n",
"ngram_complexity(ngd)\n",
"```\n",
"\n",
"This information is not available for other types of `Document` objects because it is possible to produce any level of complexity when constructing n-grams from raw text or tokens.\n",
"\n",
"# Document Metadata\n",
"\n",
"In addition to methods for manipulating the representation of the text of a document, every document object also stores basic metadata about itself, including the following pieces of information:\n",
"\n",
" * `language()`: What language is the document in? Defaults to `EnglishLanguage`, a Language type defined by the Languages package.\n",
" * `name()`: What is the name of the document? Defaults to `\"Unnamed Document\"`.\n",
" * `author()`: Who wrote the document? Defaults to `\"Unknown Author\"`.\n",
" * `timestamp()`: When was the document written? Defaults to `\"Unknown Time\"`.\n",
"\n",
"Try these functions out on a `StringDocument` to see how the defaults work in practice:\n",
"\n",
"```\n",
"language(sd)\n",
"name(sd)\n",
"author(sd)\n",
"timestamp(sd)\n",
"```\n",
"\n",
"If you need reset these fields, you can use the mutating versions of the same functions:\n",
"\n",
"```\n",
"language!(sd, Languages.SpanishLanguage)\n",
"name!(sd, \"El Cid\")\n",
"author!(sd, \"Desconocido\")\n",
"timestamp!(sd, \"Desconocido\")\n",
"```\n",
"\n",
"# Preprocessing Documents\n",
"\n",
"Having easy access to the text of a document and its metadata is very important, but most text analysis tasks require some amount of preprocessing.\n",
"\n",
"At a minimum, your text source may contain corrupt characters. You can remove these using the `remove_corrupt_utf8!()` function:\n",
"\n",
"```\n",
"remove_corrupt_utf8!(sd)\n",
"```\n",
"\n",
"Alternatively, you may want to edit the text to remove items that are hard to process automatically. For example, our sample text sentence taken from Hamlet has three periods that we might like to discard. We can remove this kind of punctuation using the `remove_punctuation!()` function:\n",
"\n",
"```\n",
"remove_punctuation!(sd)\n",
"```\n",
"\n",
"Like punctuation, numbers and case distinctions are often easier removed than dealt with. To remove numbers or case distinctions, use the `remove_numbers!()` and `remove_case!()` functions:\n",
"\n",
"```\n",
"remove_numbers!(sd)\n",
"remove_case!(sd)\n",
"```\n",
"\n",
"At times you'll want to remove specific words from a document like a person's name. To do that, use the `remove_words!()` function:\n",
"\n",
"```\n",
"sd = StringDocument(\"Lear is mad\")\n",
"remove_words!(sd, [\"Lear\"])\n",
"```\n",
"\n",
"At other times, you'll want to remove whole classes of words. To make this easier, we can use several classes of basic words defined by the Languages.jl package:\n",
"\n",
" * _Articles_: \"a\", \"an\", \"the\"\n",
" * _Indefinite Articles_: \"a\", \"an\"\n",
" * _Definite Articles_: \"the\"\n",
" * _Prepositions_: \"across\", \"around\", \"before\", ...\n",
" * _Pronouns_: \"I\", \"you\", \"he\", \"she\", ...\n",
" * _Stop Words_: \"all\", \"almost\", \"alone\", ...\n",
"\n",
"These special classes can all be removed using specially-named functions:\n",
"\n",
" * `remove_articles!()`\n",
" * `remove_indefinite_articles!()`\n",
" * `remove_definite_articles!()`\n",
" * `remove_prepositions!()`\n",
" * `remove_pronouns!()`\n",
" * `remove_stop_words!()`\n",
"\n",
"These functions use words lists, so they are capable of working for many different languages without change:\n",
"\n",
"```\n",
"remove_articles!(sd)\n",
"remove_indefinite_articles!(sd)\n",
"remove_definite_articles!(sd)\n",
"remove_prepositions!(sd)\n",
"remove_pronouns!(sd)\n",
"remove_stop_words!(sd)\n",
"```\n",
"\n",
"In addition to removing words, it is also common to take words that are closely related like \"dog\" and \"dogs\" and stem them in order to produce a smaller set of words for analysis. We can do this using the `stem!()` function:\n",
"\n",
"```\n",
"stem!(sd)\n",
"```\n",
"\n",
"# Creating a Corpus\n",
"\n",
"Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type:\n",
"\n",
"```\n",
"crps = Corpus(Any[StringDocument(\"Document 1\"),\n",
" StringDocument(\"Document 2\")])\n",
"```\n",
"\n",
"# Standardizing a Corpus\n",
"\n",
"A `Corpus` may contain many different types of documents:\n",
"\n",
"```\n",
"crps = Corpus(Any[StringDocument(\"Document 1\"),\n",
" TokenDocument(\"Document 2\"),\n",
" NGramDocument(\"Document 3\")])\n",
"```\n",
"\n",
"It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the `standardize!` function:\n",
"\n",
"```\n",
"standardize!(crps, NGramDocument)\n",
"```\n",
"\n",
"After this step, you can check that the corpus only contains `NGramDocument`'s:\n",
"\n",
"```\n",
"crps\n",
"```\n",
"\n",
"# Processing a Corpus\n",
"\n",
"We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once:\n",
"\n",
"```\n",
"crps = Corpus(Any[StringDocument(\"Document 1\"),\n",
" StringDocument(\"Document 2\")])\n",
"remove_punctuation!(crps)\n",
"```\n",
"\n",
"These operations are run on each document in the corpus individually.\n",
"\n",
"# Corpus Statistics\n",
"\n",
"Often we wish to think broadly about properties of an entire corpus at once. In particular, we want to work with two constructs:\n",
"\n",
" * _Lexicon_: The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all of the documents. Often the most interesting words in a document are those words whose frequency within a document is higher than their frequency in the corpus as a whole.\n",
" * _Inverse Index_: If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.\n",
"\n",
"Because computations involving the lexicon can take a long time, a `Corpus`'s default lexicon is blank:\n",
"\n",
"```\n",
"lexicon(crps)\n",
"```\n",
"\n",
"In order to work with the lexicon, you have to update it and then access it:\n",
"\n",
"```\n",
"update_lexicon!(crps)\n",
"lexicon(crps)\n",
"```\n",
"\n",
"But once this work is done, you can easier address lots of interesting questions about a corpus:\n",
"\n",
"```\n",
"lexical_frequency(crps, \"Summer\")\n",
"lexical_frequency(crps, \"Document\")\n",
"```\n",
"\n",
"Like the lexicon, the inverse index for a corpus is blank by default:\n",
"\n",
"```\n",
"inverse_index(crps)\n",
"```\n",
"\n",
"Again, you need to update it before you can work with it:\n",
"\n",
"```\n",
"update_inverse_index!(crps)\n",
"inverse_index(crps)\n",
"```\n",
"\n",
"But once you've updated the inverse index, you can easily search the entire corpus:\n",
"\n",
"```\n",
"crps[\"Document\"]\n",
"crps[\"1\"]\n",
"crps[\"Summer\"]\n",
"```\n",
"\n",
"# Converting a DataFrame from a Corpus\n",
"\n",
"Sometimes we want to apply non-text specific data analysis operations to a corpus. The easiest way to do this is to convert a `Corpus` object into a `DataFrame`:\n",
"\n",
"```\n",
"convert(DataFrame, crps)\n",
"```\n",
"\n",
"# Creating a Document Term Matrix\n",
"\n",
"Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:\n",
"\n",
"```\n",
"update_lexicon!(crps)\n",
"m = DocumentTermMatrix(crps)\n",
"```\n",
"\n",
"A `DocumentTermMatrix` object is a special type. If you would like to use a simple sparse matrix, call `dtm()` on this object:\n",
"\n",
"```\n",
"dtm(m)\n",
"```\n",
"\n",
"If you would like to use a dense matrix instead, you can pass this as an argument to the `dtm` function:\n",
"\n",
"```\n",
"dtm(m, :dense)\n",
"```\n",
"\n",
"# Creating Individual Rows of a Document Term Matrix\n",
"\n",
"In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the `dtv` function. Because individual's document do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument:\n",
"\n",
"```\n",
"dtv(crps[1], lexicon(crps))\n",
"```\n",
"\n",
"# The Hash Trick\n",
"\n",
"The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the \"Hash Trick\" in which we replace terms with their hashed valued using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the `TextHashFunction(N)` constructor:\n",
"\n",
"```\n",
"h = TextHashFunction(10)\n",
"```\n",
"\n",
"You can see how this function maps strings to numbers by calling the `index_hash` function:\n",
"\n",
"```\n",
"index_hash(\"a\", h)\n",
"index_hash(\"b\", h)\n",
"```\n",
"\n",
"Using a text hash function, we can represent a document as a vector with N entries by calling the `hash_dtv` function:\n",
"\n",
"```\n",
"hash_dtv(crps[1], h)\n",
"```\n",
"\n",
"This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance:\n",
"\n",
"```\n",
"hash_dtm(crps, h)\n",
"```\n",
"\n",
"Every corpus has a hash function built-in, so this function can be called using just one argument:\n",
"\n",
"```\n",
"hash_dtm(crps)\n",
"```\n",
"\n",
"Moreover, if you do not specify a hash function for just one row of the hash DTM, a default hash function will be constructed for you:\n",
"\n",
"```\n",
"hash_dtv(crps[1])\n",
"```\n",
"\n",
"# TF-IDF\n",
"\n",
"In many cases, raw word counts are not appropriate for use because:\n",
"\n",
" * (A) Some documents are longer than other documents\n",
" * (B) Some words are more frequent than other words\n",
"\n",
"You can work around this by performing TF-IDF on a DocumentTermMatrix:\n",
"\n",
"```\n",
"m = DocumentTermMatrix(crps)\n",
"tf_idf(m)\n",
"```\n",
"\n",
"As you can see, TF-IDF has the effect of inserting 0's into the columns of words that occur in all documents. This is a useful way to avoid having to remove those words during preprocessing.\n",
"\n",
"# LSA: Latent Semantic Analysis\n",
"\n",
"Often we want to think about documents from the perspective of semantic content. One standard approach to doing this is to perform Latent Semantic Analysis or LSA on the corpus. You can do this using the `lsa` function:\n",
"\n",
"```\n",
"lsa(crps)\n",
"```\n",
"\n",
"# LDA: Latent Dirichlet Allocation\n",
"\n",
"Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:\n",
"\n",
"```\n",
"m = DocumentTermMatrix(crps)\n",
"k = 2 # number of topics\n",
"iteration = 1000 # number of gibbs sampling iterations\n",
"alpha = 0.1 # hyper parameter\n",
"beta = 0.1 # hyber parameter\n",
"l = lda(m, k, iteration, alpha, beta) # l is k x word matrix.\n",
" # value is probablity of occurrence of a word in a topic.\n",
"```\n",
"\n",
"# Extended Usage Example\n",
"\n",
"To show you how text analysis might work in practice, we're going to work with a text corpus composed of political speeches from American presidents given as part of the State of the Union Address tradition.\n",
"\n",
"```\n",
"using TextAnalysis, DimensionalityReduction, Clustering\n",
"\n",
"crps = DirectoryCorpus(\"sotu\")\n",
"\n",
"standardize!(crps, StringDocument)\n",
"\n",
"crps = Corpus(crps[1:30])\n",
"\n",
"remove_case!(crps)\n",
"remove_punctuation!(crps)\n",
"\n",
"update_lexicon!(crps)\n",
"update_inverse_index!(crps)\n",
"\n",
"crps[\"freedom\"]\n",
"\n",
"m = DocumentTermMatrix(crps)\n",
"\n",
"D = dtm(m, :dense)\n",
"\n",
"T = tf_idf(D)\n",
"\n",
"cl = kmeans(T, 5)\n",
"```\n"
],
"text/plain": [
"No documentation found.\n",
"\n",
"Displaying the `README.md` for the module instead.\n",
"\n",
"---\n",
"\n",
"# TextAnalysis.jl\n",
"\n",
"[![Build Status](https://api.travis-ci.org/JuliaText/TextAnalysis.jl.svg)](https://travis-ci.org/JuliaText/TextAnalysis.jl) [![TextAnalysis](http://pkg.julialang.org/badges/TextAnalysis_0.5.svg)](http://pkg.julialang.org/?pkg=TextAnalysis) [![TextAnalysis](http://pkg.julialang.org/badges/TextAnalysis_0.6.svg)](http://pkg.julialang.org/?pkg=TextAnalysis)\n",
"\n",
"# Preface\n",
"\n",
"This manual is designed to get you started doing text analysis in Julia. It assumes that you already familiar with the basic methods of text analysis.\n",
"\n",
"# Outline\n",
"\n",
" * [Installation](#installation)\n",
" * [Getting Started](#getting-started)\n",
" * [Creating Documents](#creating-documents)\n",
"\n",
" * StringDocument\n",
" * FileDocument\n",
" * TokenDocument\n",
" * NGramDocument\n",
" * [Basic Functions for Working with Documents](#basic-functions-for-working-with-documents)\n",
"\n",
" * text\n",
" * tokens\n",
" * ngrams\n",
" * [Document Metadata](#document-metadata)\n",
"\n",
" * language\n",
" * name\n",
" * author\n",
" * timestamp\n",
" * [Preprocessing Documents](#preprocessing-documents)\n",
"\n",
" * Removing Corrupt UTF8\n",
" * Removing Punctuation\n",
" * Removing Case Distinctions\n",
" * Removing Words\n",
"\n",
" * Stop Words\n",
" * Articles\n",
" * Indefinite Articles\n",
" * Definite Articles\n",
" * Prepositions\n",
" * Pronouns\n",
" * Stemming\n",
" * Removing Rare Words\n",
" * Removing Sparse Words\n",
" * [Creating a Corpus](#creating-a-corpus)\n",
" * [Processing a Corpus](#processing-a-corpus)\n",
" * [Corpus Statistics](#corpus-statistics)\n",
"\n",
" * Lexicon\n",
" * Inverse Index\n",
" * [Creating a Document Term Matrix](#creating-a-document-term-matrix)\n",
" * [Creating Individual Rows of a Document Term Matrix](#creating-individual-rows-of-a-document-term-matrix)\n",
" * [The Hash Trick](#the-hash-trick)\n",
"\n",
" * Hashed DTV's\n",
" * Hashed DTM's\n",
" * [TF-IDF](#tf-idf)\n",
" * LSA: [Latent Semantic Analysis](#lsa-latent-semantic-analysis)\n",
" * LDA: [Latent Dirichlet Allocation](#lda-latent-dirichlet-allocation)\n",
" * [Extended Usage Example](#extended-usage-example): Analyzing the State of the Union Addresses\n",
"\n",
"# Installation\n",
"\n",
"The TextAnalysis package can be installed using Julia's package manager:\n",
"\n",
"```\n",
"Pkg.add(\"TextAnalysis\")\n",
"```\n",
"\n",
"# Getting Started\n",
"\n",
"In all of the examples that follow, we'll assume that you have the TextAnalysis package fully loaded. This means that we think you've implicily typed\n",
"\n",
"```\n",
"using TextAnalysis\n",
"```\n",
"\n",
"before every snippet of code.\n",
"\n",
"# Creating Documents\n",
"\n",
"The basic unit of text analysis is a document. The TextAnalysis package allows one to work with documents stored in a variety of formats:\n",
"\n",
" * _FileDocument_: A document represented using a plain text file on disk\n",
" * _StringDocument_: A document represented using a UTF8 String stored in RAM\n",
" * _TokenDocument_: A document represented as a sequence of UTF8 tokens\n",
" * _NGramDocument_: A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts\n",
"\n",
"These format represent a hierarchy: you can always move down the hierachy, but can generally not move up the hierachy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`.\n",
"\n",
"Creating any of the four basic types of documents is very easy:\n",
"\n",
"```\n",
"str = \"To be or not to be...\"\n",
"sd = StringDocument(str)\n",
"\n",
"pathname = \"/usr/share/dict/words\"\n",
"fd = FileDocument(pathname)\n",
"\n",
"my_tokens = String[\"To\", \"be\", \"or\", \"not\", \"to\", \"be...\"]\n",
"td = TokenDocument(my_tokens)\n",
"\n",
"my_ngrams = Dict{String, Int}(\"To\" => 1, \"be\" => 2,\n",
" \"or\" => 1, \"not\" => 1,\n",
" \"to\" => 1, \"be...\" => 1)\n",
"ngd = NGramDocument(my_ngrams)\n",
"```\n",
"\n",
"For every type of document except a `FileDocument`, you can also construct a new document by simply passing in a string of text:\n",
"\n",
"```\n",
"sd = StringDocument(\"To be or not to be...\")\n",
"td = TokenDocument(\"To be or not to be...\")\n",
"ngd = NGramDocument(\"To be or not to be...\")\n",
"```\n",
"\n",
"The system will automatically perform tokenization or n-gramization in order to produce the required data. Unfortunately, `FileDocument`'s cannot be constructed this way because filenames are themselves strings. It would cause chaos if filenames were treated as the text contents of a document.\n",
"\n",
"That said, there is one way around this restriction: you can use the generic `Document()` constructor function, which will guess at the type of the inputs and construct the appropriate type of document object:\n",
"\n",
"```\n",
"Document(\"To be or not to be...\")\n",
"Document(\"/usr/share/dict/words\")\n",
"Document(String[\"To\", \"be\", \"or\", \"not\", \"to\", \"be...\"])\n",
"Document(Dict{String, Int}(\"a\" => 1, \"b\" => 3))\n",
"```\n",
"\n",
"This constructor is very convenient for working in the REPL, but should be avoided in permanent code because, unlike the other constructors, the return type of the `Document` function cannot be known at compile-time.\n",
"\n",
"# Basic Functions for Working with Documents\n",
"\n",
"Once you've created a document object, you can work with it in many ways. The most obvious thing is to access its text using the `text()` function:\n",
"\n",
"```\n",
"text(sd)\n",
"```\n",
"\n",
"This function works without warnings on `StringDocument`'s and `FileDocument`'s. For `TokenDocument`'s it is not possible to know if the text can be reconstructed perfectly, so calling `text(TokenDocument(\"This is text\"))` will produce a warning message before returning an approximate reconstruction of the text as it existed before tokenization. It is entirely impossible to reconstruct the text of an `NGramDocument`, so `text(NGramDocument(\"This is text\"))` raises an error.\n",
"\n",
"Instead of working with the text itself, you can work with the tokens or n-grams of a document using the `tokens()` and `ngrams()` functions:\n",
"\n",
"```\n",
"tokens(sd)\n",
"ngrams(sd)\n",
"```\n",
"\n",
"By default the `ngrams()` function produces unigrams. If you would like to produce bigrams or trigrams, you can specify that directly using a numeric argument to the `ngrams()` function:\n",
"\n",
"```\n",
"ngrams(sd, 2)\n",
"```\n",
"\n",
"If you have a `NGramDocument`, you can determine whether an `NGramDocument` contains unigrams, bigrams or a higher-order representation using the `ngram_complexity()` function:\n",
"\n",
"```\n",
"ngram_complexity(ngd)\n",
"```\n",
"\n",
"This information is not available for other types of `Document` objects because it is possible to produce any level of complexity when constructing n-grams from raw text or tokens.\n",
"\n",
"# Document Metadata\n",
"\n",
"In addition to methods for manipulating the representation of the text of a document, every document object also stores basic metadata about itself, including the following pieces of information:\n",
"\n",
" * `language()`: What language is the document in? Defaults to `EnglishLanguage`, a Language type defined by the Languages package.\n",
" * `name()`: What is the name of the document? Defaults to `\"Unnamed Document\"`.\n",
" * `author()`: Who wrote the document? Defaults to `\"Unknown Author\"`.\n",
" * `timestamp()`: When was the document written? Defaults to `\"Unknown Time\"`.\n",
"\n",
"Try these functions out on a `StringDocument` to see how the defaults work in practice:\n",
"\n",
"```\n",
"language(sd)\n",
"name(sd)\n",
"author(sd)\n",
"timestamp(sd)\n",
"```\n",
"\n",
"If you need reset these fields, you can use the mutating versions of the same functions:\n",
"\n",
"```\n",
"language!(sd, Languages.SpanishLanguage)\n",
"name!(sd, \"El Cid\")\n",
"author!(sd, \"Desconocido\")\n",
"timestamp!(sd, \"Desconocido\")\n",
"```\n",
"\n",
"# Preprocessing Documents\n",
"\n",
"Having easy access to the text of a document and its metadata is very important, but most text analysis tasks require some amount of preprocessing.\n",
"\n",
"At a minimum, your text source may contain corrupt characters. You can remove these using the `remove_corrupt_utf8!()` function:\n",
"\n",
"```\n",
"remove_corrupt_utf8!(sd)\n",
"```\n",
"\n",
"Alternatively, you may want to edit the text to remove items that are hard to process automatically. For example, our sample text sentence taken from Hamlet has three periods that we might like to discard. We can remove this kind of punctuation using the `remove_punctuation!()` function:\n",
"\n",
"```\n",
"remove_punctuation!(sd)\n",
"```\n",
"\n",
"Like punctuation, numbers and case distinctions are often easier removed than dealt with. To remove numbers or case distinctions, use the `remove_numbers!()` and `remove_case!()` functions:\n",
"\n",
"```\n",
"remove_numbers!(sd)\n",
"remove_case!(sd)\n",
"```\n",
"\n",
"At times you'll want to remove specific words from a document like a person's name. To do that, use the `remove_words!()` function:\n",
"\n",
"```\n",
"sd = StringDocument(\"Lear is mad\")\n",
"remove_words!(sd, [\"Lear\"])\n",
"```\n",
"\n",
"At other times, you'll want to remove whole classes of words. To make this easier, we can use several classes of basic words defined by the Languages.jl package:\n",
"\n",
" * _Articles_: \"a\", \"an\", \"the\"\n",
" * _Indefinite Articles_: \"a\", \"an\"\n",
" * _Definite Articles_: \"the\"\n",
" * _Prepositions_: \"across\", \"around\", \"before\", ...\n",
" * _Pronouns_: \"I\", \"you\", \"he\", \"she\", ...\n",
" * _Stop Words_: \"all\", \"almost\", \"alone\", ...\n",
"\n",
"These special classes can all be removed using specially-named functions:\n",
"\n",
" * `remove_articles!()`\n",
" * `remove_indefinite_articles!()`\n",
" * `remove_definite_articles!()`\n",
" * `remove_prepositions!()`\n",
" * `remove_pronouns!()`\n",
" * `remove_stop_words!()`\n",
"\n",
"These functions use words lists, so they are capable of working for many different languages without change:\n",
"\n",
"```\n",
"remove_articles!(sd)\n",
"remove_indefinite_articles!(sd)\n",
"remove_definite_articles!(sd)\n",
"remove_prepositions!(sd)\n",
"remove_pronouns!(sd)\n",
"remove_stop_words!(sd)\n",
"```\n",
"\n",
"In addition to removing words, it is also common to take words that are closely related like \"dog\" and \"dogs\" and stem them in order to produce a smaller set of words for analysis. We can do this using the `stem!()` function:\n",
"\n",
"```\n",
"stem!(sd)\n",
"```\n",
"\n",
"# Creating a Corpus\n",
"\n",
"Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type:\n",
"\n",
"```\n",
"crps = Corpus(Any[StringDocument(\"Document 1\"),\n",
" StringDocument(\"Document 2\")])\n",
"```\n",
"\n",
"# Standardizing a Corpus\n",
"\n",
"A `Corpus` may contain many different types of documents:\n",
"\n",
"```\n",
"crps = Corpus(Any[StringDocument(\"Document 1\"),\n",
" TokenDocument(\"Document 2\"),\n",
" NGramDocument(\"Document 3\")])\n",
"```\n",
"\n",
"It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the `standardize!` function:\n",
"\n",
"```\n",
"standardize!(crps, NGramDocument)\n",
"```\n",
"\n",
"After this step, you can check that the corpus only contains `NGramDocument`'s:\n",
"\n",
"```\n",
"crps\n",
"```\n",
"\n",
"# Processing a Corpus\n",
"\n",
"We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once:\n",
"\n",
"```\n",
"crps = Corpus(Any[StringDocument(\"Document 1\"),\n",
" StringDocument(\"Document 2\")])\n",
"remove_punctuation!(crps)\n",
"```\n",
"\n",
"These operations are run on each document in the corpus individually.\n",
"\n",
"# Corpus Statistics\n",
"\n",
"Often we wish to think broadly about properties of an entire corpus at once. In particular, we want to work with two constructs:\n",
"\n",
" * _Lexicon_: The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all of the documents. Often the most interesting words in a document are those words whose frequency within a document is higher than their frequency in the corpus as a whole.\n",
" * _Inverse Index_: If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.\n",
"\n",
"Because computations involving the lexicon can take a long time, a `Corpus`'s default lexicon is blank:\n",
"\n",
"```\n",
"lexicon(crps)\n",
"```\n",
"\n",
"In order to work with the lexicon, you have to update it and then access it:\n",
"\n",
"```\n",
"update_lexicon!(crps)\n",
"lexicon(crps)\n",
"```\n",
"\n",
"But once this work is done, you can easier address lots of interesting questions about a corpus:\n",
"\n",
"```\n",
"lexical_frequency(crps, \"Summer\")\n",
"lexical_frequency(crps, \"Document\")\n",
"```\n",
"\n",
"Like the lexicon, the inverse index for a corpus is blank by default:\n",
"\n",
"```\n",
"inverse_index(crps)\n",
"```\n",
"\n",
"Again, you need to update it before you can work with it:\n",
"\n",
"```\n",
"update_inverse_index!(crps)\n",
"inverse_index(crps)\n",
"```\n",
"\n",
"But once you've updated the inverse index, you can easily search the entire corpus:\n",
"\n",
"```\n",
"crps[\"Document\"]\n",
"crps[\"1\"]\n",
"crps[\"Summer\"]\n",
"```\n",
"\n",
"# Converting a DataFrame from a Corpus\n",
"\n",
"Sometimes we want to apply non-text specific data analysis operations to a corpus. The easiest way to do this is to convert a `Corpus` object into a `DataFrame`:\n",
"\n",
"```\n",
"convert(DataFrame, crps)\n",
"```\n",
"\n",
"# Creating a Document Term Matrix\n",
"\n",
"Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:\n",
"\n",
"```\n",
"update_lexicon!(crps)\n",
"m = DocumentTermMatrix(crps)\n",
"```\n",
"\n",
"A `DocumentTermMatrix` object is a special type. If you would like to use a simple sparse matrix, call `dtm()` on this object:\n",
"\n",
"```\n",
"dtm(m)\n",
"```\n",
"\n",
"If you would like to use a dense matrix instead, you can pass this as an argument to the `dtm` function:\n",
"\n",
"```\n",
"dtm(m, :dense)\n",
"```\n",
"\n",
"# Creating Individual Rows of a Document Term Matrix\n",
"\n",
"In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the `dtv` function. Because individual's document do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument:\n",
"\n",
"```\n",
"dtv(crps[1], lexicon(crps))\n",
"```\n",
"\n",
"# The Hash Trick\n",
"\n",
"The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the \"Hash Trick\" in which we replace terms with their hashed valued using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the `TextHashFunction(N)` constructor:\n",
"\n",
"```\n",
"h = TextHashFunction(10)\n",
"```\n",
"\n",
"You can see how this function maps strings to numbers by calling the `index_hash` function:\n",
"\n",
"```\n",
"index_hash(\"a\", h)\n",
"index_hash(\"b\", h)\n",
"```\n",
"\n",
"Using a text hash function, we can represent a document as a vector with N entries by calling the `hash_dtv` function:\n",
"\n",
"```\n",
"hash_dtv(crps[1], h)\n",
"```\n",
"\n",
"This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance:\n",
"\n",
"```\n",
"hash_dtm(crps, h)\n",
"```\n",
"\n",
"Every corpus has a hash function built-in, so this function can be called using just one argument:\n",
"\n",
"```\n",
"hash_dtm(crps)\n",
"```\n",
"\n",
"Moreover, if you do not specify a hash function for just one row of the hash DTM, a default hash function will be constructed for you:\n",
"\n",
"```\n",
"hash_dtv(crps[1])\n",
"```\n",
"\n",
"# TF-IDF\n",
"\n",
"In many cases, raw word counts are not appropriate for use because:\n",
"\n",
" * (A) Some documents are longer than other documents\n",
" * (B) Some words are more frequent than other words\n",
"\n",
"You can work around this by performing TF-IDF on a DocumentTermMatrix:\n",
"\n",
"```\n",
"m = DocumentTermMatrix(crps)\n",
"tf_idf(m)\n",
"```\n",
"\n",
"As you can see, TF-IDF has the effect of inserting 0's into the columns of words that occur in all documents. This is a useful way to avoid having to remove those words during preprocessing.\n",
"\n",
"# LSA: Latent Semantic Analysis\n",
"\n",
"Often we want to think about documents from the perspective of semantic content. One standard approach to doing this is to perform Latent Semantic Analysis or LSA on the corpus. You can do this using the `lsa` function:\n",
"\n",
"```\n",
"lsa(crps)\n",
"```\n",
"\n",
"# LDA: Latent Dirichlet Allocation\n",
"\n",
"Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:\n",
"\n",
"```\n",
"m = DocumentTermMatrix(crps)\n",
"k = 2 # number of topics\n",
"iteration = 1000 # number of gibbs sampling iterations\n",
"alpha = 0.1 # hyper parameter\n",
"beta = 0.1 # hyber parameter\n",
"l = lda(m, k, iteration, alpha, beta) # l is k x word matrix.\n",
" # value is probablity of occurrence of a word in a topic.\n",
"```\n",
"\n",
"# Extended Usage Example\n",
"\n",
"To show you how text analysis might work in practice, we're going to work with a text corpus composed of political speeches from American presidents given as part of the State of the Union Address tradition.\n",
"\n",
"```\n",
"using TextAnalysis, DimensionalityReduction, Clustering\n",
"\n",
"crps = DirectoryCorpus(\"sotu\")\n",
"\n",
"standardize!(crps, StringDocument)\n",
"\n",
"crps = Corpus(crps[1:30])\n",
"\n",
"remove_case!(crps)\n",
"remove_punctuation!(crps)\n",
"\n",
"update_lexicon!(crps)\n",
"update_inverse_index!(crps)\n",
"\n",
"crps[\"freedom\"]\n",
"\n",
"m = DocumentTermMatrix(crps)\n",
"\n",
"D = dtm(m, :dense)\n",
"\n",
"T = tf_idf(D)\n",
"\n",
"cl = kmeans(T, 5)\n",
"```\n"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"?TextAnalysis"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"search: \u001b[1ma\u001b[22m\u001b[1mp\u001b[22m\u001b[1mp\u001b[22m\u001b[1me\u001b[22m\u001b[1mn\u001b[22m\u001b[1md\u001b[22m\u001b[1m!\u001b[22m\n",
"\n"
]
},
{
"data": {
"text/markdown": [
"```\n",
"append!(collection, collection2) -> collection.\n",
"```\n",
"\n",
"Add the elements of `collection2` to the end of `collection`.\n",
"\n",
"# Examples\n",
"\n",
"```jldoctest\n",
"julia> append!([1],[2,3])\n",
"3-element Array{Int64,1}:\n",
" 1\n",
" 2\n",
" 3\n",
"\n",
"julia> append!([1, 2, 3], [4, 5, 6])\n",
"6-element Array{Int64,1}:\n",
" 1\n",
" 2\n",
" 3\n",
" 4\n",
" 5\n",
" 6\n",
"```\n",
"\n",
"Use [`push!`](@ref) to add individual items to `collection` which are not already themselves in another collection. The result is of the preceding example is equivalent to `push!([1, 2, 3], 4, 5, 6)`.\n"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"?append!"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>id</th><th>comment_text</th><th>toxic</th><th>severe_toxic</th><th>obscene</th><th>threat</th><th>insult</th><th>identity_hate</th><th>bad_ones</th></tr></thead><tbody><tr><th>1</th><td>0002bcb3da6cb337</td><td>COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>2</th><td>0005c987bdfc9d4b</td><td>Hey... what is it..\\n@ | talk .\\nWhat is it... an exclusive group of some WP TALIBANS...who are good at destroying, self-appointed purist who GANG UP any one who asks them questions abt their ANTI-SOCIAL and DESTRUCTIVE (non)-contribution at WP?\\n\\nAsk Sityush to clean up his behavior than issue me nonsensical warnings...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>3</th><td>0007e25b2121310b</td><td>Bye! \\n\\nDon't look, come or think of comming back! Tosser.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>4</th><td>001810bf8c45bf5f</td><td>You are gay or antisemmitian? \\n\\nArchangel WHite Tiger\\n\\nMeow! Greetingshhh!\\n\\nUh, there are two ways, why you do erased my comment about WW2, that holocaust was brutally slaying of Jews and not gays/Gypsys/Slavs/anyone...\\n\\n1 - If you are anti-semitian, than shave your head bald and go to the skinhead meetings!\\n\\n2 - If you doubt words of the Bible, that homosexuality is a deadly sin, make a pentagram tatoo on your forehead go to the satanistic masses with your gay pals!\\n\\n3 - First and last warning, you fucking gay - I won't appreciate if any more nazi shwain would write in my page! I don't wish to talk to you anymore!\\n\\nBeware of the Dark Side!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>5</th><td>00190820581d90ce</td><td>FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>6</th><td>001956c382006abd</td><td>I'm Sorry \\n\\nI'm sorry I screwed around with someones talk page. It was very bad to do. I know how having the templates on their talk page helps you assert your dominance over them. I know I should bow down to the almighty administrators. But then again, I'm going to go play outside....with your mom. 76.122.79.82</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>7</th><td>001dc38a83d420cf</td><td>GET FUCKED UP. GET FUCKEEED UP. GOT A DRINK THAT YOU CANT PUT DOWN???/ GET FUCK UP GET FUCKED UP. I'M FUCKED UP RIGHT NOW!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>8</th><td>0020e7119b96eeeb</td><td>Stupid peace of shit stop deleting my stuff asshole go die and fall in a hole go to hell!</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>9</th><td>0020fd96ed3b8c8b</td><td>=Tony Sidaway is obviously a fistfuckee. He loves an arm up his ass.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>10</th><td>0021fe88bc4da3e6</td><td>My Band Page's deletion. You thought I was gone. \\n\\nDeleting the comment I posted on your 'talk page' does not delete my feelings for your hasty decision to delete my page. You, sir, are still a cock-sucking douche fuck. Sit back for a second and think about what your life has become. ............ Done? That didn't take long did it? Nope. 
Because, as I am most certainly aware, your life is a shitstorm of sitting in front of the computer masturbating to fictional creatures. Your attempts to get rid of me are mediocre at best. You are constantly sucking the dick of failure. You don't want a personal attack, huh? Well, too bad, Fuckcock McDickerson. You've got one. From me. You can go ahead and delete my profile from wikipedia. I'll just make another one and come right back to 'Syrthiss's talk page and insult the dick off of you. How could you shatter the dreams of an innocent eighteen year old college freshman trying to make a name for his band. Does that make you happy? Fucking with people because you're an overweight, single, old man in a dead-end job. Did you spot that perhaps someone else was going to follow his dreams and you were trying to hold him back so somebody else could suffer like you? Yes you did. I don't make empty threats, so I won't be saying anything along the lines of 'i'll hurt you' or 'i'll eat the children from within your sister's womb', but I will say that you are a asshole, son-of-a-bitch, mother fucking cock sucker. So, go eat some more food and drown your sorrows you premature ejaculating, bald headed fuck.\\n\\nYou should do something nice for yourself, maybe go grab a couple of Horny Goat Weeds from your local convenience store and jack off for a little longer than three minutes tonight.\\n\\nSincerely,\\nAn Asshole That's Better Than You In Every Way.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>11</th><td>002264ea4d5f2887</td><td>Why can't you believe how fat Artie is? Did you see him on his recent appearence on the Tonight Show with Jay Leno? He looks absolutely AWFUL! If I had to put money on it, I'd say that Artie Lange is a can't miss candidate for the 2007 Dead pool! \\n\\n \\nKindly keep your malicious fingers off of my above comment, . Everytime you remove it, I will repost it!!!</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>12</th><td>0028d62e8a5629aa</td><td>All of my edits are good. Cunts like you who revert good edits because you're too stupid to understand how to write well , and then revert other edits just because you've decided to bear a playground grudge, are the problem. Maybe one day you'll realise the damage you did to a noble project. 201.215.187.159</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>13</th><td>003217c3eb469ba9</td><td>Hi! I am back again!\\nLast warning!\\nStop undoing my edits or die!</td><td>1</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>true</td></tr><tr><th>14</th><td>0036621e4c7e10b5</td><td>Would you both shut up, you don't run wikipedia, especially a stupid kid.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>15</th><td>00472b8e2d38d1ea</td><td>A pair of jew-hating weiner nazi schmucks.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>16</th><td>005f59485fcddeb0</td><td>\"\\n\\nSORRY PUCK BUT NO ONE EVER SAID DICK WAS NUMBER ONE!! But he is clearly listed as being \"\"second only to Batman\"\" and a \"\"master detective and martial artist\"\" and been trained in all the mentioned combat discplines. No need to mention in skills/abilities about the battles he's lost and won because thats not what an encylopia is supposed to do. 
\\n\\nAlso, what your problem is your going by low end showings and not HIGH end showings where he's proven more than a match for a meta like Slade and pinned Shiva down in one of his earliest appearences.\\n \"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>17</th><td>00637960a7ec3436</td><td>\"\\n\\nUNBLOCK ME OR I'LL GET MY LAWYERS ON TO YOU FOR BLOCKING MY CONSTITUTIONAL RIGHT TO FREE SPEECH\"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>18</th><td>00686325bcc16080</td><td>You should be fired, you're a moronic wimp who is too lazy to do research. It makes me sick that people like you exist in this world.</td><td>1</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>19</th><td>006b94add72ed61c</td><td>I think that your a Fagget get a oife and burn in Hell I hate you 'm sorry we cant have any more sex i'm running out of conndoms</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td><td>true</td></tr><tr><th>20</th><td>006e87872c8b370c</td><td>you are a stupid fuck \\n\\nand your mother's cunt stinks</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>21</th><td>007f1839ada915e6</td><td>Your blatant POV pushing \\n\\nNeither of you guys has made any contribution to this Italian history article other than to shove your unhistorical unconstructive modern POV in my face.\\nThis is a HISTORY article. HISTORY.\\nHave you heard of that?\\nThis is the reason why so many people get pissed off about the pedantry and idiocy and triviality of Wikipedia.\\nJ*sus. Get a f*cking life.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>22</th><td>0082b5a7b4a67da2</td><td>Give me a permanat block raseac....!!! remember me....!!! Muslims dont you worry for these images they are trying to hurt us but they cant look these images are fake and have nothing to do with Islam or our prophet even their article is so lame wikipedia is not an authentic source its an encyclopedia for the kids which has not authenticity anyone can edit or write whatever they think...just like raeasc you can see he does not knows what Islam is and he talks about it so forget about wiki its lame they have no authenticity...i recommend they should add pissed christ in jessus article....they should...</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>23</th><td>0086998b34865f93</td><td>Fuck you, block me, you faggot pussy!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>24</th><td>008e0818dde894fb</td><td>Kill all niggers. \\n\\nI have hard, that others have said this.. should this be included? That racists sometimes say these.</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>25</th><td>009371b0ef213487</td><td>Burn Deck \\n\\nIf that'd guy's burn deck is like what I think it is than he is in for a blrude awakening!!! 
Leviathan motherfucker!!!</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>26</th><td>0097dd5c29bf7a15</td><td>u r a tw@ fuck off u gay boy.U r smelly.Fuck ur mum poopie</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>true</td></tr><tr><th>27</th><td>00ab65775c601cf9</td><td>Atheism is full of bias shit</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>28</th><td>00afb4dec99a231f</td><td>Hey why you are spreading misconceptions and trying to spread false information to the people.You not even know who won the battle and who lost.you are the cheapest person on the earth.Bloody ignorant fool.Check sources before spreading rumors</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>29</th><td>00b77cb600c897b4</td><td>\"\\n\\nAnd you are? Let me know when you've crawled out of their arses. '''''' Dick Laurent is dead \"</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>true</td></tr><tr><th>30</th><td>00be7dcac98dc95d</td><td>this user is such a worthless goddamn faggot fuck you faggot</td><td>1</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>true</td></tr><tr><th>&vellip;</th><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td></tr></tbody></table>"
],
"text/plain": [
"15294×9 DataFrames.DataFrame. Omitted printing of 8 columns\n",
"│ Row │ id │\n",
"├───────┼──────────────────┤\n",
"│ 1 │ 0002bcb3da6cb337 │\n",
"│ 2 │ 0005c987bdfc9d4b │\n",
"│ 3 │ 0007e25b2121310b │\n",
"│ 4 │ 001810bf8c45bf5f │\n",
"│ 5 │ 00190820581d90ce │\n",
"│ 6 │ 001956c382006abd │\n",
"│ 7 │ 001dc38a83d420cf │\n",
"│ 8 │ 0020e7119b96eeeb │\n",
"│ 9 │ 0020fd96ed3b8c8b │\n",
"│ 10 │ 0021fe88bc4da3e6 │\n",
"│ 11 │ 002264ea4d5f2887 │\n",
"⋮\n",
"│ 15283 │ fd052883fa6a8697 │\n",
"│ 15284 │ fd2f53aafe8eefcc │\n",
"│ 15285 │ fd68ef478b3dfd05 │\n",
"│ 15286 │ fdc92e571d39e7e1 │\n",
"│ 15287 │ fdce660ddcd6d7ca │\n",
"│ 15288 │ feb5637c531f933d │\n",
"│ 15289 │ fef142420a215b90 │\n",
"│ 15290 │ fef4cf7ba0012866 │\n",
"│ 15291 │ ff39a2895fc3b40e │\n",
"│ 15292 │ ffa33d3122b599d6 │\n",
"│ 15293 │ ffb47123b2d82762 │\n",
"│ 15294 │ ffbdbb0483ed0841 │"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"toxic_comments = bad_ones[bad_ones[:toxic] .== 1, :]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"words = []\n",
"\n",
"for row in eachrow(toxic_comments)\n",
" append!(words, split(row[:comment_text]))\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DataStructures.Accumulator{Any,Int64} with 73700 entries:\n",
" \"(1986)\" => 2\n",
" \"stupid,\\\"\\\"\" => 1\n",
" \"MORALS,\" => 1\n",
" \"um,\" => 4\n",
" \"gathered\" => 2\n",
" \"cannibalistic\" => 1\n",
" \"*will*\" => 1\n",
" \"denominator:\" => 1\n",
" \"little,\" => 6\n",
" \"ADVANCE\" => 1\n",
" \"methods\" => 10\n",
" \"PSYCHO\" => 1\n",
" \"GREEKWARRIOR\" => 1\n",
" \"Stupid,\" => 1\n",
" \"Queers\" => 1\n",
" \"tags.\" => 1\n",
" \"lulz\" => 15\n",
" \"arguement\" => 2\n",
" \"333\" => 3\n",
" \"premature\" => 4\n",
" \"crib\" => 1\n",
" \"playlist.\" => 1\n",
" \"(UTC)\" => 149\n",
" \"hicks\" => 1\n",
" \"70.251.71.245\" => 2\n",
" ⋮ => ⋮"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"toxic_word_counter = counter(words)"
]
},
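{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: show the twenty most frequent tokens among the toxic comments.\n",
"# `countmap` comes from StatsBase (loaded above); we recount `words` directly\n",
"# rather than relying on a particular Accumulator API. `word_freqs` is a new\n",
"# name introduced here.\n",
"word_freqs = countmap(words)\n",
"sort(collect(word_freqs), by = p -> p.second, rev = true)[1:20]"
]
},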
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A Corpus"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crps = Corpus(document_array)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Compat.UTF8String is deprecated, use String instead.\n",
" likely near In[19]:1\n",
"\u001b[1m\u001b[33mWARNING: \u001b[39m\u001b[22m\u001b[33mremove_articles! is deprecated, Use prepare! instead.\u001b[39m\n"
]
}
],
"source": [
"remove_articles!(crps)\n",
"remove_case!(crps)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"update_lexicon!(crps)\n",
"update_inverse_index!(crps)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DataFrames.DataFrame"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crps_df = DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"ename": "LoadError",
"evalue": "\u001b[91mUndefVarError: tfidf not defined\u001b[39m",
"output_type": "error",
"traceback": [
"\u001b[91mUndefVarError: tfidf not defined\u001b[39m",
"",
"Stacktrace:",
" [1] \u001b[1minclude_string\u001b[22m\u001b[22m\u001b[1m(\u001b[22m\u001b[22m::String, ::String\u001b[1m)\u001b[22m\u001b[22m at \u001b[1m./loading.jl:522\u001b[22m\u001b[22m"
]
}
],
"source": []
},
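{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch: the UndefVarError above came from a misspelled name; the\n",
"# TextAnalysis function is `tf_idf`, and it accepts a DocumentTermMatrix.\n",
"m = DocumentTermMatrix(crps)\n",
"tfidf_matrix = tf_idf(m)"
]
},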
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 0.6.2",
"language": "julia",
"name": "julia-0.6"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}