
@dcslin
Last active December 11, 2018 06:07
news article clustering

Cluster a list of news articles, grouping those that cover the same story.

online resources

  • google news articles

    • no timing feature in the dataset: link
    • clustering stories
    • The source ranking involves many things. Is there original content? The timeliness. Coverage of recent developments? The relevancy to the cluster at hand. In some cases, is there local relevancy? Is there content from a local source with local content? link
  • is a topic modeling problem link

  • modeling link

  • news algorithm architecture link

  • links link

  • links link

  • on techmeme link

  • incremental clustering link

  • factor link

  • link.. link

  • link2

  • duplicate news link

  • headline clustering link
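
A rough idea of what incremental clustering could look like for a stream of headlines. This is a minimal sketch, not the method from the linked resources: each new headline joins the most similar existing cluster or opens a new one. The `assign` helper, the bag-of-words similarity, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokens(text):
    """Bag-of-words term counts for a headline (whitespace tokenisation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(clusters, text, threshold=0.5):
    """Put `text` in the most similar cluster, or start a new one.

    `threshold` is a hypothetical cut-off; it would need tuning on real data.
    Returns the index of the cluster the text ended up in.
    """
    vec = tokens(text)
    best_idx, best_sim = None, 0.0
    for idx, cluster in enumerate(clusters):
        sim = cosine(vec, cluster["centroid"])
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is not None and best_sim >= threshold:
        # Summed counts point in the same direction as the mean vector,
        # so they work as a centroid for cosine similarity.
        clusters[best_idx]["centroid"].update(vec)
        clusters[best_idx]["members"].append(text)
        return best_idx
    clusters.append({"centroid": vec, "members": [text]})
    return len(clusters) - 1


# Hypothetical stream of headlines arriving one at a time.
clusters = []
stream = [
    "central bank raises rates",
    "bank raises rates again",
    "team wins championship final",
]
assigned = [assign(clusters, headline) for headline in stream]
```

A single pass like this depends on arrival order and has no notion of time, so old clusters never expire.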

TF-IDF link

sklearn tfidf example

  • sklearn tfidf vectorizer link
  • Clustering text documents using k-means link
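
A minimal sketch of the TF-IDF plus k-means combination from the two sklearn resources above. The headlines are made-up toy data, and `n_clusters=2` is an assumption; on real news the number of stories is not known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy headlines: two stories, two articles each.
headlines = [
    "Central bank raises interest rates again",
    "Interest rates raised by central bank",
    "Local team wins football championship final",
    "Football championship final won by local team",
]

# TfidfVectorizer returns a sparse matrix; rows are L2-normalised by default.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(headlines)

# n_clusters=2 is assumed here only because the toy data has two stories.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, headline in zip(labels, headlines):
    print(label, headline)
```

Because the rows are L2-normalised, Euclidean k-means on them behaves much like clustering by cosine similarity.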

Hierarchical Clustering

  • handle non-convex datasets by combining hierarchical clustering and k-means link
  • tf-idf clustering, multiple approaches link
  • cosine distance vs. Euclidean distance link
  • retrieve clusters from a hierarchical clustering model link
  • determining the number of clusters: CH (Calinski-Harabasz) index link
  • determining the number of clusters, tutorial link
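
A sketch tying several of the points above together: hierarchical clustering on cosine distances, retrieving flat clusters from the tree, and scoring candidate cluster counts with the CH index. The toy headlines, average linkage, and the candidate range 2 to 4 are assumptions, not taken from the linked resources.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import calinski_harabasz_score

# Hypothetical toy headlines: three stories, two articles each.
headlines = [
    "Central bank raises rates again",
    "Rates raised by central bank",
    "Local team wins football championship final",
    "Football championship final won by local team",
    "Storm floods coastal town overnight",
    "Coastal town flooded after overnight storm",
]

# Dense array is fine for toy data; pdist does not accept sparse input.
X = TfidfVectorizer(stop_words="english").fit_transform(headlines).toarray()

# Average linkage over pairwise cosine distances.
Z = linkage(pdist(X, metric="cosine"), method="average")

# Cut the tree at each candidate k and score the result with the CH index.
scores = {}
for k in range(2, 5):
    labels = fcluster(Z, t=k, criterion="maxclust")
    if len(np.unique(labels)) < 2:
        # Tied merge heights can collapse the cut to one cluster,
        # for which the CH index is undefined.
        continue
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)
best_labels = fcluster(Z, t=best_k, criterion="maxclust")

# Flat clusters for an explicit choice of three clusters.
labels3 = fcluster(Z, t=3, criterion="maxclust")
```

Note that `calinski_harabasz_score` measures Euclidean dispersion, so the tree is built on cosine distance while the score is not; on L2-normalised TF-IDF rows the two orderings agree.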

sparse matrix scipy

intro link
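
TF-IDF matrices are mostly zeros, which is why scipy sparse formats matter here. A small sketch, with a made-up 3-document by 5-term count matrix, of building a CSR matrix and computing cosine similarities without densifying the input:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Hypothetical term counts given as (row, col, value) triplets.
rows = np.array([0, 0, 1, 1, 2])
cols = np.array([0, 1, 0, 1, 4])
vals = np.array([1.0, 2.0, 2.0, 4.0, 3.0])
counts = csr_matrix((vals, (rows, cols)), shape=(3, 5))

# Row L2 norms, computed on the sparse matrix.
norms = np.sqrt(np.asarray(counts.multiply(counts).sum(axis=1)).ravel())

# Scale each row by 1/norm via a diagonal matrix; the result stays sparse.
normalised = diags(1.0 / norms) @ counts

# Cosine similarity between all document pairs (small, so densify the result).
sim = (normalised @ normalised.T).toarray()
```

Only the five non-zero entries are stored, and documents 0 and 1 come out as identical in direction because one is a scaled copy of the other.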

gensim python lib link

datasets - ideally one with near-duplicate articles

  • kaggle search link
  • one week of global link
  • all the news of american link

TOREAD:

  • topic modeling + deep learning?
  • hierarchical clustering unsupervised?

TODO:

  • missing time feature from dataset