Maziyar Panahi maziyarpanahi

@maziyarpanahi
maziyarpanahi / top-500-enwiki.txt
Created October 22, 2017 14:26
Top 500 phrases in English Wikipedia
Phrases were extracted with Stanford CoreNLP and Spark 2.2 (6 minutes) from English Wikipedia (5+ million pages)
+---------------------------+-----+
|value |count|
+---------------------------+-----+
|square miles |59821|
|unique feature |46463|
|id form |46101|
|administrative district |45963|
|first time |41423|
@maziyarpanahi
maziyarpanahi / pubmed-cancer-LDA-results.txt
Last active October 22, 2017 13:49
Results of LDA over the "cancer" sub-corpus of the PubMed dataset
Stanford CoreNLP (Sentence splitter and POS Tagging - extract noun phrases), StopWordsRemover, TF-IDF, word2vec and OnlineLDAOptimizer
==========
Query: "cancer"
Sample: 500K abstracts
Dataset: PubMed
==========
val numTopics: Int = 50
val maxIterations: Int = 100
val vocabSize: Int = 10000
@maziyarpanahi
maziyarpanahi / enwiki-gas-emissions-LDA-results.txt
Last active July 3, 2017 17:02
The results of Spark LDA run over English Wikipedia pages (different queries). The topics are sorted by the coherence of each topic (Word2Vec).
Stanford CoreNLP (Sentence splitter and POS Tagging - extract noun phrases), StopWordsRemover, TF-IDF, word2vec and OnlineLDAOptimizer
Query: Global Warming (5000 pages)
==========Parameters==========
val numTopics: Int = 50
val maxIterations: Int = 100
val vocabSize: Int = 10000
val minDF: Int = 1
val minTF: Int = 1
val maxItems: Int = 15
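
The parameters above could plausibly be wired into a Spark ML pipeline along the following lines. This is a minimal sketch, not the gist author's code: the toy `docs` data, the column names, the use of `CountVectorizer` for term counts, and the reading of `maxItems` as "terms shown per topic" are all assumptions. The CoreNLP noun-phrase extraction and the word2vec step are left out; the input is assumed to be tokenized already.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.{LDA, LDAModel}
import org.apache.spark.ml.feature.{CountVectorizer, IDF, StopWordsRemover}
import org.apache.spark.sql.SparkSession

object LdaPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lda-pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Parameter values as listed in the gist preview.
    val numTopics: Int = 50
    val maxIterations: Int = 100
    val vocabSize: Int = 10000
    val minDF: Int = 1
    val minTF: Int = 1
    val maxItems: Int = 15 // assumption: number of terms reported per topic

    // Hypothetical toy corpus standing in for the extracted phrases.
    val docs = Seq(
      Seq("greenhouse", "gas", "emissions", "the", "climate"),
      Seq("carbon", "dioxide", "emissions", "and", "climate"),
      Seq("sea", "level", "rise", "of", "the", "ocean")
    ).toDF("tokens")

    val remover = new StopWordsRemover()
      .setInputCol("tokens").setOutputCol("filtered")
    val counts = new CountVectorizer()
      .setInputCol("filtered").setOutputCol("tf")
      .setVocabSize(vocabSize).setMinDF(minDF).setMinTF(minTF)
    val idf = new IDF().setInputCol("tf").setOutputCol("features")
    val lda = new LDA()
      .setK(numTopics).setMaxIter(maxIterations)
      .setOptimizer("online") // selects the online variational optimizer
      .setFeaturesCol("features")

    val model = new Pipeline()
      .setStages(Array(remover, counts, idf, lda))
      .fit(docs)

    // The fitted LDA model is the last pipeline stage.
    val ldaModel = model.stages.last.asInstanceOf[LDAModel]
    ldaModel.describeTopics(maxItems).show(false)

    spark.stop()
  }
}
```

Running it requires Spark on the classpath, for example through `spark-submit` or an sbt project that depends on `spark-mllib`.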
@maziyarpanahi
maziyarpanahi / enwiki-global-warming-LDA-results.txt
Last active October 22, 2017 13:48
The results of Spark LDA run over English Wikipedia pages (different queries). The topics are sorted by the coherence of each topic (Word2Vec).
====================
Stanford CoreNLP (Sentence splitter and POS Tagging - NN and NNS), StopWordsRemover, TF-IDF, word2vec and OnlineLDAOptimizer
Query: Global Warming (5000 pages)
==========Parameters==========
val numTopics: Int = 50
val maxIterations: Int = 100
val vocabSize: Int = 10000
val minDF: Int = 10
val minTF: Int = 1
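
The descriptions above say that topics are sorted by a Word2Vec-based coherence, but the previews do not show how that score is computed. One common choice, used here purely as an assumption, is the mean pairwise cosine similarity between the embedding vectors of a topic's top terms. The sketch below uses plain Scala with hand-written toy vectors; `TopicCoherence`, `coherence`, and `sortByCoherence` are hypothetical names, not taken from the gists.

```scala
object TopicCoherence {
  type Vec = Array[Double]

  // Cosine similarity; returns 0.0 if either vector has zero norm.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na = math.sqrt(a.map(x => x * x).sum)
    val nb = math.sqrt(b.map(x => x * x).sum)
    if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
  }

  // Assumed coherence: mean cosine similarity over all pairs of top terms.
  // Terms without a vector are skipped; fewer than two vectors gives 0.0.
  def coherence(terms: Seq[String], vectors: Map[String, Vec]): Double = {
    val vs = terms.flatMap(vectors.get)
    val sims = for {
      i <- vs.indices
      j <- (i + 1) until vs.length
    } yield cosine(vs(i), vs(j))
    if (sims.isEmpty) 0.0 else sims.sum / sims.size
  }

  // Most coherent topic first.
  def sortByCoherence(topics: Seq[Seq[String]],
                      vectors: Map[String, Vec]): Seq[Seq[String]] =
    topics.sortBy(t => -coherence(t, vectors))

  def main(args: Array[String]): Unit = {
    // Toy 2-dimensional vectors standing in for a trained Word2Vec model.
    val vectors: Map[String, Vec] = Map(
      "carbon"   -> Array(1.0, 0.0),
      "emission" -> Array(0.9, 0.1),
      "football" -> Array(0.0, 1.0)
    )
    val topics = Seq(Seq("carbon", "football"), Seq("carbon", "emission"))
    sortByCoherence(topics, vectors).foreach(t => println(t.mkString(", ")))
  }
}
```

With these toy vectors the topic containing "carbon" and "emission" sorts ahead of the one mixing "carbon" and "football", since its two terms point in nearly the same direction.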