@tuliocasagrande
Last active September 22, 2015 17:30

YouTube Spam Collection v. 1

The YouTube Spam Collection v. 1 is a public set of labeled YouTube comments collected for spam research. It comprises five datasets with a total of 1,956 real, non-encoded messages, each tagged as legitimate (ham) or spam.

Composition

This corpus was collected using the YouTube Data API v3.

The samples were extracted from the comments section of 5 videos that were among the 10 most viewed on YouTube during the collection period. The table below lists the 5 datasets collected, the YouTube video ID, the number of samples in each class and the total number of samples per dataset.

Dataset    YouTube ID   # Spam  # Ham  Total  Link
Psy        9bZkp7q19f0  175     175    350    Link 1
KatyPerry  CevxZvSJLk8  175     175    350    Link 2
LMFAO      KQ6zr6kCPj8  236     202    438    Link 3
Eminem     uelHwf8o7_U  245     203    448    Link 4
Shakira    pRpeEdMmmQ0  174     196    370    Link 5

Note: the comments are chronologically sorted.
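Because the comments are already in chronological order, one natural way to evaluate a filter is to train on the earlier comments and test on the later ones. The sketch below illustrates that idea only; the function name `chronological_split` and the 70/30 ratio are assumptions made for this example, not part of the collection.

```python
def chronological_split(rows, train_percent=70):
    """Split time-ordered rows into an earlier part and a later part.

    Assumes `rows` is already sorted chronologically, as the datasets are.
    The 70% default is an illustrative choice, not prescribed by the corpus.
    """
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


# Stand-in for the 350 comments of the Psy dataset.
rows = list(range(350))
train, test = chronological_split(rows)
print(len(train), len(test))
```

No shuffling is done, so every test comment is newer than every training comment.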

Usage

The collection consists of one CSV file per dataset, where each line has the following attributes:

COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS

One example is shown below:

z12oglnpoq3gjh4om04cfdlbgp2uepyytpw0k,Francisco Nora,2013-11-28T19:52:35,please like :D https://premium.easypromosapp.com/voteme/19924/616375350,1
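A minimal sketch of reading this format with Python's standard `csv` module follows. It makes two assumptions not stated above: that each file starts with a header row naming the five attributes, and that a CLASS value of 1 marks spam while 0 marks ham. The `Comment` class and `read_comments` function are hypothetical names used only for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Comment:
    comment_id: str
    author: str
    date: str
    content: str
    is_spam: bool


def read_comments(lines):
    """Parse rows with the attributes COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS.

    Assumes the first line is a header row with those column names.
    """
    reader = csv.DictReader(lines)
    return [
        Comment(
            comment_id=row["COMMENT_ID"],
            author=row["AUTHOR"],
            date=row["DATE"],
            content=row["CONTENT"],
            # Assumption: CLASS "1" marks spam and "0" marks ham.
            is_spam=row["CLASS"].strip() == "1",
        )
        for row in reader
    ]


# In-memory stand-in for one dataset file, using the example line above.
sample = io.StringIO(
    "COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n"
    "z12oglnpoq3gjh4om04cfdlbgp2uepyytpw0k,Francisco Nora,2013-11-28T19:52:35,"
    "please like :D https://premium.easypromosapp.com/voteme/19924/616375350,1\n"
)
comments = read_comments(sample)
print(comments[0].author, comments[0].is_spam)
```

To read a real file, pass an open file object instead of the `io.StringIO` stand-in. Using a CSV parser rather than splitting on commas matters because comment text may itself contain commas.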

If you use this collection, we would appreciate it if you:

  1. Cite the paper below and the web page: http://dcomp.sor.ufscar.br/talmeida/youtubespamcollecion/.
  2. Send a message to either talmeida < AT > ufscar.br or tuliocasagrande < AT > acm.org to let us know you are using the corpus.

Publication and More Information

The following paper offers a comprehensive study of this corpus, including statistics and baseline results for several machine learning methods.

Alberto, T.C., Lochter, J.V., Almeida, T.A. Filtragem Automática de Spam nos Comentários do YouTube. Anais do XII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC'15), Natal, RN, Brazil, 2015. (pending approval)

About

The YouTube Spam Collection was created by Tiago A. Almeida, Tulio C. Alberto and Johannes V. Lochter.

© Tiago A. Almeida, Tulio C. Alberto and Johannes V. Lochter, 2015.
