@amqdn
Last active April 1, 2021 20:01
A basic showcase of Python/Julia inter-op in bag-of-words classification
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic NLP\n",
"\n",
"Attempting basic NLP classification with a real dataset and Julia/Python inter-op. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using PyCall, CSV, DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PyCall allows us to import Python libraries directly into the Julia environment, CSV.jl helps us open CSVs, and we can pass a CSV.File directly to the DataFrame constructor from DataFrames.jl.\n",
"\n",
"Let's first load our dataset:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Source: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset\n",
"news_fake = DataFrame(CSV.File(\"fake.csv\"))\n",
"news_true = DataFrame(CSV.File(\"true.csv\"));"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>title</th><th>text</th><th>subject</th><th>date</th></tr><tr><th></th><th>String</th><th>String</th><th>String</th><th>String</th></tr></thead><tbody><p>2 rows × 4 columns</p><tr><th>1</th><td> Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing</td><td>Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and the very dishonest fake news media. The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year, President Angry Pants tweeted. 2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America! Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year! Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress. Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me? Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish?? Marlene (@marlene399) December 31, 2017You can t just say happy new year? 
Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love! Donald J. Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. He s been doing this for years.Trump has directed messages to his enemies and haters for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President? Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down. Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters? Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images.</td><td>News</td><td>December 31, 2017</td></tr><tr><th>2</th><td> Drunk Bragging Trump Staffer Started Russian Collusion Investigation</td><td>House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. 
As it happens, the dossier is not what started the investigation, according to documents obtained by the New York Times.Former Trump campaign adviser George Papadopoulos was drunk in a wine bar when he revealed knowledge of Russian opposition research on Hillary Clinton.On top of that, Papadopoulos wasn t just a covfefe boy for Trump, as his administration has alleged. He had a much larger role, but none so damning as being a drunken fool in a wine bar. Coffee boys don t help to arrange a New York meeting between Trump and President Abdel Fattah el-Sisi of Egypt two months before the election. It was known before that the former aide set up meetings with world leaders for Trump, but team Trump ran with him being merely a coffee boy.In May 2016, Papadopoulos revealed to Australian diplomat Alexander Downer that Russian officials were shopping around possible dirt on then-Democratic presidential nominee Hillary Clinton. Exactly how much Mr. Papadopoulos said that night at the Kensington Wine Rooms with the Australian, Alexander Downer, is unclear, the report states. But two months later, when leaked Democratic emails began appearing online, Australian officials passed the information about Mr. Papadopoulos to their American counterparts, according to four current and former American and foreign officials with direct knowledge of the Australians role. Papadopoulos pleaded guilty to lying to the F.B.I. and is now a cooperating witness with Special Counsel Robert Mueller s team.This isn t a presidency. It s a badly scripted reality TV show.Photo by Win McNamee/Getty Images.</td><td>News</td><td>December 31, 2017</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccc}\n",
"\t& title & text & subject & date\\\\\n",
"\t\\hline\n",
"\t& String & String & String & String\\\\\n",
"\t\\hline\n",
"\t1 & Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing & Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and the very dishonest fake news media. The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year, President Angry Pants tweeted. 2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America! Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year! Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress. Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me? Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish?? Marlene (@marlene399) December 31, 2017You can t just say happy new year? Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love! Donald J. 
Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. He s been doing this for years.Trump has directed messages to his enemies and haters for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President? Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down. Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters? Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images. & News & December 31, 2017 \\\\\n",
"\t2 & Drunk Bragging Trump Staffer Started Russian Collusion Investigation & House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. As it happens, the dossier is not what started the investigation, according to documents obtained by the New York Times.Former Trump campaign adviser George Papadopoulos was drunk in a wine bar when he revealed knowledge of Russian opposition research on Hillary Clinton.On top of that, Papadopoulos wasn t just a covfefe boy for Trump, as his administration has alleged. He had a much larger role, but none so damning as being a drunken fool in a wine bar. Coffee boys don t help to arrange a New York meeting between Trump and President Abdel Fattah el-Sisi of Egypt two months before the election. It was known before that the former aide set up meetings with world leaders for Trump, but team Trump ran with him being merely a coffee boy.In May 2016, Papadopoulos revealed to Australian diplomat Alexander Downer that Russian officials were shopping around possible dirt on then-Democratic presidential nominee Hillary Clinton. Exactly how much Mr. Papadopoulos said that night at the Kensington Wine Rooms with the Australian, Alexander Downer, is unclear, the report states. But two months later, when leaked Democratic emails began appearing online, Australian officials passed the information about Mr. Papadopoulos to their American counterparts, according to four current and former American and foreign officials with direct knowledge of the Australians role. Papadopoulos pleaded guilty to lying to the F.B.I. and is now a cooperating witness with Special Counsel Robert Mueller s team.This isn t a presidency. It s a badly scripted reality TV show.Photo by Win McNamee/Getty Images. 
& News & December 31, 2017 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m2×4 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m title \u001b[0m\u001b[1m text \u001b[0m\u001b[1m subject \u001b[0m\u001b[1m date \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m String \u001b[0m\u001b[90m String \u001b[0m\u001b[90m String \u001b[0m\u001b[90m String \u001b[0m\n",
"─────┼──────────────────────────────────────────────────────────────────────────────────────────────────\n",
" 1 │ Donald Trump Sends Out Embarras… Donald Trump just couldn t wish … News December 31, 2017\n",
" 2 │ Drunk Bragging Trump Staffer St… House Intelligence Committee Cha… News December 31, 2017"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ENV[\"COLUMNS\"] = 10000; # Allows us to display all columns for this dataset in the notebook\n",
"first(news_fake, 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, the data has 4 columns: `title`, `text`, `subject` and `date`. Since I'm only building a simple classifier here, let's drop the latter two columns. The following code does that, combines the `title` and `text` columns into a single `text` column, and sets the target labels."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"combine_text_and_set_target (generic function with 1 method)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Merge each row's title and text into one string and attach the same Bool label to every row.\n",
"function combine_text_and_set_target(df::DataFrame, label::Bool)\n",
"    return DataFrame(\n",
"        Dict(\n",
"            \"text\" => [\"<title> $(row.title) <text> $(row.text)\" for row in eachrow(df)],\n",
"            \"fake\" => fill(label, nrow(df))\n",
"        )\n",
"    )\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"news_fake = combine_text_and_set_target(news_fake[:, [:title, :text]], true)\n",
"news_true = combine_text_and_set_target(news_true[:, [:title, :text]], false);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we'll combine them into one DataFrame and set the original variables to `nothing` so their memory can be reclaimed:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>fake</th><th>text</th></tr><tr><th></th><th>Bool</th><th>String</th></tr></thead><tbody><p>2 rows × 2 columns</p><tr><th>1</th><td>1</td><td>&lt;title&gt; Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing &lt;text&gt; Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and the very dishonest fake news media. The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year, President Angry Pants tweeted. 2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America! Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year! Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress. Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me? Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish?? Marlene (@marlene399) December 31, 2017You can t just say happy new year? 
Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love! Donald J. Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. He s been doing this for years.Trump has directed messages to his enemies and haters for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President? Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down. Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters? Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images.</td></tr><tr><th>2</th><td>1</td><td>&lt;title&gt; Drunk Bragging Trump Staffer Started Russian Collusion Investigation &lt;text&gt; House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. 
As it happens, the dossier is not what started the investigation, according to documents obtained by the New York Times.Former Trump campaign adviser George Papadopoulos was drunk in a wine bar when he revealed knowledge of Russian opposition research on Hillary Clinton.On top of that, Papadopoulos wasn t just a covfefe boy for Trump, as his administration has alleged. He had a much larger role, but none so damning as being a drunken fool in a wine bar. Coffee boys don t help to arrange a New York meeting between Trump and President Abdel Fattah el-Sisi of Egypt two months before the election. It was known before that the former aide set up meetings with world leaders for Trump, but team Trump ran with him being merely a coffee boy.In May 2016, Papadopoulos revealed to Australian diplomat Alexander Downer that Russian officials were shopping around possible dirt on then-Democratic presidential nominee Hillary Clinton. Exactly how much Mr. Papadopoulos said that night at the Kensington Wine Rooms with the Australian, Alexander Downer, is unclear, the report states. But two months later, when leaked Democratic emails began appearing online, Australian officials passed the information about Mr. Papadopoulos to their American counterparts, according to four current and former American and foreign officials with direct knowledge of the Australians role. Papadopoulos pleaded guilty to lying to the F.B.I. and is now a cooperating witness with Special Counsel Robert Mueller s team.This isn t a presidency. It s a badly scripted reality TV show.Photo by Win McNamee/Getty Images.</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& fake & text\\\\\n",
"\t\\hline\n",
"\t& Bool & String\\\\\n",
"\t\\hline\n",
"\t1 & 1 & <title> Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing <text> Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and the very dishonest fake news media. The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year, President Angry Pants tweeted. 2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America! Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year! Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress. Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me? Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish?? Marlene (@marlene399) December 31, 2017You can t just say happy new year? Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love! 
Donald J. Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. He s been doing this for years.Trump has directed messages to his enemies and haters for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President? Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down. Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters? Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images. \\\\\n",
"\t2 & 1 & <title> Drunk Bragging Trump Staffer Started Russian Collusion Investigation <text> House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under the assumption, like many of us, that the Christopher Steele-dossier was what prompted the Russia investigation so he s been lashing out at the Department of Justice and the FBI in order to protect Trump. As it happens, the dossier is not what started the investigation, according to documents obtained by the New York Times.Former Trump campaign adviser George Papadopoulos was drunk in a wine bar when he revealed knowledge of Russian opposition research on Hillary Clinton.On top of that, Papadopoulos wasn t just a covfefe boy for Trump, as his administration has alleged. He had a much larger role, but none so damning as being a drunken fool in a wine bar. Coffee boys don t help to arrange a New York meeting between Trump and President Abdel Fattah el-Sisi of Egypt two months before the election. It was known before that the former aide set up meetings with world leaders for Trump, but team Trump ran with him being merely a coffee boy.In May 2016, Papadopoulos revealed to Australian diplomat Alexander Downer that Russian officials were shopping around possible dirt on then-Democratic presidential nominee Hillary Clinton. Exactly how much Mr. Papadopoulos said that night at the Kensington Wine Rooms with the Australian, Alexander Downer, is unclear, the report states. But two months later, when leaked Democratic emails began appearing online, Australian officials passed the information about Mr. Papadopoulos to their American counterparts, according to four current and former American and foreign officials with direct knowledge of the Australians role. Papadopoulos pleaded guilty to lying to the F.B.I. and is now a cooperating witness with Special Counsel Robert Mueller s team.This isn t a presidency. It s a badly scripted reality TV show.Photo by Win McNamee/Getty Images. \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m2×2 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m fake \u001b[0m\u001b[1m text \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m Bool \u001b[0m\u001b[90m String \u001b[0m\n",
"─────┼─────────────────────────────────────────\n",
" 1 │ true <title> Donald Trump Sends Out …\n",
" 2 │ true <title> Drunk Bragging Trump St…"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = vcat(news_fake, news_true)\n",
"news_fake = news_true = nothing\n",
"first(data, 2)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>fake</th><th>text</th></tr><tr><th></th><th>Bool</th><th>String</th></tr></thead><tbody><p>2 rows × 2 columns</p><tr><th>1</th><td>0</td><td>&lt;title&gt; Vatican upbeat on possibility of Pope Francis visiting Russia &lt;text&gt; MOSCOW (Reuters) - Vatican Secretary of State Cardinal Pietro Parolin said on Tuesday that there was positive momentum behind the idea of Pope Francis visiting Russia, but suggested there was more work to be done if it were to happen. Parolin, speaking at a joint news conference in Moscow alongside Russian Foreign Minister Sergei Lavrov, did not give any date for such a possible visit. The Eastern and Western branches of Christianity split apart in 1054. The pope, leader of the world s 1.2 billion Catholics, is seeking to improve ties, and last year in Cuba held what was the first ever meeting between a Roman Catholic pope and a Russian Orthodox patriarch. Parolin said he had also used his talks in the Russian capital to also raise certain difficulties faced by the Catholic Church in Russia. He said that Moscow and the Vatican disagreed about the plight of Christians in certain parts of the world. He did not elaborate. Parolin, who is due later on Tuesday to meet Patriarch Kirill, the head of the Russian Orthodox Church, said he also believed Russia could play an important role when it came to helping solve a crisis in Venezuela because of its close relations with Caracas. </td></tr><tr><th>2</th><td>0</td><td>&lt;title&gt; Indonesia to buy $1.14 billion worth of Russian jets &lt;text&gt; JAKARTA (Reuters) - Indonesia will buy 11 Sukhoi fighter jets worth $1.14 billion from Russia in exchange for cash and Indonesian commodities, two cabinet ministers said on Tuesday. The Southeast Asian country has pledged to ship up to $570 million worth of commodities in addition to cash to pay for the Suhkoi SU-35 fighter jets, which are expected to be delivered in stages starting in two years. 
Indonesian Trade Minister Enggartiasto Lukita said in a joint statement with Defence Minister Ryamizard Ryacudu that details of the type and volume of commodities were still being negotiated . Previously he had said the exports could include palm oil, tea, and coffee. The deal is expected to be finalised soon between Indonesian state trading company PT Perusahaan Perdangangan Indonesia and Russian state conglomerate Rostec. Russia is currently facing a new round of U.S.-imposed trade sanctions. Meanwhile, Southeast Asia s largest economy is trying to promote its palm oil products amid threats of a cut in consumption by European Union countries. Indonesia is also trying to modernize its ageing air force after a string of military aviation accidents. Indonesia, which had a $411 million trade surplus with Russia in 2016, wants to expand bilateral cooperation in tourism, education, energy, technology and aviation among others. </td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& fake & text\\\\\n",
"\t\\hline\n",
"\t& Bool & String\\\\\n",
"\t\\hline\n",
"\t1 & 0 & <title> Vatican upbeat on possibility of Pope Francis visiting Russia <text> MOSCOW (Reuters) - Vatican Secretary of State Cardinal Pietro Parolin said on Tuesday that there was positive momentum behind the idea of Pope Francis visiting Russia, but suggested there was more work to be done if it were to happen. Parolin, speaking at a joint news conference in Moscow alongside Russian Foreign Minister Sergei Lavrov, did not give any date for such a possible visit. The Eastern and Western branches of Christianity split apart in 1054. The pope, leader of the world s 1.2 billion Catholics, is seeking to improve ties, and last year in Cuba held what was the first ever meeting between a Roman Catholic pope and a Russian Orthodox patriarch. Parolin said he had also used his talks in the Russian capital to also raise certain difficulties faced by the Catholic Church in Russia. He said that Moscow and the Vatican disagreed about the plight of Christians in certain parts of the world. He did not elaborate. Parolin, who is due later on Tuesday to meet Patriarch Kirill, the head of the Russian Orthodox Church, said he also believed Russia could play an important role when it came to helping solve a crisis in Venezuela because of its close relations with Caracas. \\\\\n",
"\t2 & 0 & <title> Indonesia to buy \\$1.14 billion worth of Russian jets <text> JAKARTA (Reuters) - Indonesia will buy 11 Sukhoi fighter jets worth \\$1.14 billion from Russia in exchange for cash and Indonesian commodities, two cabinet ministers said on Tuesday. The Southeast Asian country has pledged to ship up to \\$570 million worth of commodities in addition to cash to pay for the Suhkoi SU-35 fighter jets, which are expected to be delivered in stages starting in two years. Indonesian Trade Minister Enggartiasto Lukita said in a joint statement with Defence Minister Ryamizard Ryacudu that details of the type and volume of commodities were still being negotiated . Previously he had said the exports could include palm oil, tea, and coffee. The deal is expected to be finalised soon between Indonesian state trading company PT Perusahaan Perdangangan Indonesia and Russian state conglomerate Rostec. Russia is currently facing a new round of U.S.-imposed trade sanctions. Meanwhile, Southeast Asia s largest economy is trying to promote its palm oil products amid threats of a cut in consumption by European Union countries. Indonesia is also trying to modernize its ageing air force after a string of military aviation accidents. Indonesia, which had a \\$411 million trade surplus with Russia in 2016, wants to expand bilateral cooperation in tourism, education, energy, technology and aviation among others. \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"\u001b[1m2×2 DataFrame\u001b[0m\n",
"\u001b[1m Row \u001b[0m│\u001b[1m fake \u001b[0m\u001b[1m text \u001b[0m\n",
"\u001b[1m \u001b[0m│\u001b[90m Bool \u001b[0m\u001b[90m String \u001b[0m\n",
"─────┼──────────────────────────────────────────\n",
" 1 │ false <title> Vatican upbeat on possib…\n",
" 2 │ false <title> Indonesia to buy $1.14 b…"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"last(data, 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From here, we'll use Python's scikit-learn to help us with the NLP tasks. There are solutions available in the Julia ecosystem, but we'll leave those for another notebook.\n",
"\n",
"The approach below is simply to create a bag-of-words representation of the data and then train a linear SVM classifier to predict whether a given title/text combination is \"fake\" or not, according to the dataset annotator(s). Don't worry if that last sentence confused you.\n",
"\n",
"What is a bag of words? It is a way of representing a string of text as numbers.\n",
"1. Think of take-a-number systems at the grocery store or DMV. First, we go through all of the words once. Each time we see a word we haven't seen before, we take a number, write the word on the back of the ticket, and put the ticket in a box. This box is our _vocabulary_.\n",
"2. Next, we count how many tickets we ended up with (say, 100 unique tickets). This is the _size_ of our vocabulary.\n",
"3. Then, we go through every distinct piece of text in our dataset. For each piece of text, we start by lining up a row of small tins, as many tins as there are tickets. For every word in that piece of text, we pluck its ticket from the box and drop it in the tin with the same number as the ticket. Each piece of text ends up with many empty tins (represented by 0s) and some tins with tickets (represented by 1s). It seems like a waste to keep all those empty tins lined up when only some of them contain tickets (it is!), so let's take the word-tickets out of the tins and put them in a bag, noting only the ticket numbers. These are still only numbers, so our machine can work with them. A word's number tells us nothing about where the word appears in the text, only that it does appear, so we're left with a \"bag\" of words for every piece of text."
]
},
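{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the analogy concrete, here is a minimal pure-Julia sketch of the same idea on two made-up sentences. The names `docs`, `vocab`, `ticket`, and `tins` are hypothetical and only for illustration. This is not how scikit-learn implements it, just the tickets-and-tins picture written out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy corpus (made up for illustration)\n",
"docs = [\"the cat sat\", \"the dog sat down\"]\n",
"\n",
"# Steps 1 and 2: one ticket per unique word; the ticket number is its position in `vocab`\n",
"vocab = sort(unique(w for d in docs for w in split(d)))\n",
"ticket = Dict(w => i for (i, w) in enumerate(vocab))\n",
"\n",
"# Step 3: one row of tins per text, with a 1 wherever that word's ticket was dropped\n",
"function tins(doc)\n",
"    row = zeros(Int, length(vocab))\n",
"    for w in split(doc)\n",
"        row[ticket[w]] = 1\n",
"    end\n",
"    return row\n",
"end\n",
"\n",
"tins.(docs) # one 0/1 vector of length `length(vocab)` per text"
]
},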
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"sktext = pyimport(\"sklearn.feature_extraction.text\")\n",
"vectorizer = sktext.CountVectorizer(binary=true) # binary means 1s and 0s\n",
"vectorizer.fit(data.text); # create the vocabulary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MLJ.jl has some simple functions for dealing with data, so we'll use those as well."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"using MLJ"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`x` is input data, `y` is target labels"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"y, x = unpack(data, ==(:fake), colname -> true);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll split the data into training and holdout sets (70% and 30%, respectively). \"Stratify\" means keep the same balance between classes as found in the original data."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Bool[0, 0, 1, 0, 1, 1, 0, 1, 0, 1 … 0, 1, 1, 1, 1, 0, 1, 1, 1, 1], Bool[0, 1, 0, 0, 1, 1, 0, 1, 0, 1 … 1, 0, 1, 1, 0, 1, 1, 1, 1, 1])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idxt, idxv = partition(eachindex(y), 0.7, shuffle=true, stratify=y)\n",
"xt, xv = x[idxt], x[idxv]\n",
"yt, yv = y[idxt], y[idxv]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is, conceptually, where we go from tickets to tins to \"bags of words\":"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"xt, xv = vectorizer.transform(xt), vectorizer.transform(xv);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's all we need in order to classify -- bags and labels! We'll use a simple classifier from scikit-learn. It's called a Support Vector Machine. If you're not familiar, don't worry too much about the details. Just know that it works in the case where we have numbers as input and are looking to classify something (\"fake\" or not). We'll train it on the selected 70% of data."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PyObject SVC(kernel='linear')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"svm = pyimport(\"sklearn.svm\")\n",
"clf_svm = svm.SVC(kernel=\"linear\") # \"clf\" means classifier\n",
"clf_svm.fit(xt, yt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we'll make predictions on our holdout set:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"ŷv = clf_svm.predict(xv);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's see how we did. In order to use the performance measures that come with MLJ, we'll have to coerce the arrays into something MLJ will like:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.9982923750835251\n",
"Precision: 0.9990049751243781\n",
"Recall: 0.9977285633162976\n"
]
},
{
"data": {
"text/plain": [
" ┌───────────────────────────┐\n",
" │ Ground Truth │\n",
"┌─────────────┼─────────────┬─────────────┤\n",
"│ Predicted │ false │ true │\n",
"├─────────────┼─────────────┼─────────────┤\n",
"│ false │ 6418 │ 16 │\n",
"├─────────────┼─────────────┼─────────────┤\n",
"│ true │ 7 │ 7028 │\n",
"└─────────────┴─────────────┴─────────────┘\n"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ŷv, yv = coerce(ŷv, OrderedFactor), coerce(yv, OrderedFactor)\n",
"println(\"Accuracy: $(accuracy(ŷv, yv))\")\n",
"println(\"Precision: $(ppv(ŷv, yv))\")\n",
"println(\"Recall: $(tpr(ŷv, yv))\")\n",
"confusion_matrix(ŷv, yv)"
]
},
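{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, the three scores can be recomputed by hand from the four counts in the confusion matrix. This is a small sketch with the counts copied from the output above. The names `tn`, `fn`, `fp`, and `tp` are introduced here for illustration, and the positive class is assumed to be `true` (fake)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Counts copied from the confusion matrix above (positive class: true, i.e. fake)\n",
"tn, fn = 6418, 16 # predicted false: correct, missed\n",
"fp, tp = 7, 7028 # predicted true: false alarm, correct\n",
"\n",
"println(\"Accuracy: \", (tp + tn) / (tp + tn + fp + fn))\n",
"println(\"Precision: \", tp / (tp + fp))\n",
"println(\"Recall: \", tp / (tp + fn))"
]
},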
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, that is much too good! Something is fishy. We'll have to figure out what happened! Spoiler: https://www.kaggle.com/mosewintner/5-data-leaks-100-acc-1-word-99-6-acc"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.5.4",
"language": "julia",
"name": "julia-1.5"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}