keamanansiber/Google Summer of Code 2022 Final Report.md

## Google Summer of Code 2022 Final Report.md

      
    Raw
  

              Google Summer of Code 2022 Final Report.md
            
          
    Google Summer of Code 2022 Final Report

Name: Hatma Suryotrisongko • Project: OWASP Maryam

Proposal Topic

Deep Learning for OWASP Maryam's NLP operations

Work was Done

Designed and implemented a topic modeling module (PR #269)  (Personal repo link).
Milestones Achieved


Learning and implementing all existing Maryam's modules. (Jupyter Notebooks).


The first prototype.


Designed and implemented the first prototype for the core/util/iris/topicmodeling.py: K-Means algorithm and Wordcloud. (notebook),
output = tm.perform_kmeans()


The second prototype.


Designed and implemented topic modeling using Mr. Kaushik's dataset, with NMF and LDA algorithm (notebook),
tm = topicmodeling(jsonfile)
tm.perform_nmf1()

tm.perform_nmf2()
)
tm.perform_lda()


The third prototype.


Implemented BERT algorithm using BERTopic and SentenceTransformer. (notebook), users will be able to choose their preferred word embedding model, from repositories, as for a particular dataset user may need to select the best word embedding model which has the most superior performance.

tm.run_topic_modeling("paraphrase-distilroberta-base-v1")


The fourth prototype.


Adding CVF file support and stopwords removal. (notebook),
tm = topicmodeling(inputfile="testdataset.csv", filetype="csv", verbose=True)


The initial creation of the PR for topicmodeling module (PR).


Maryam/maryam/core/util/iris/topic.py


Maryam/maryam/modules/iris/topicmodeling.py


Fix bug and add new feature (commit).


Renaming the plotly.py to solve the dependencies problem. In addition, I also added the --output and --api functionality. Tested by using =
topicmodeling -i mixed.json -t json -m all-distilroberta-v1  --output
report json testreport iris/topicmodeling
topicmodeling -i mixed.json -t json -m all-distilroberta-v1  --api


Added a new feature:


Search Topics Using a Keyword (Top2Vec). (notebook),
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["music"], num_topics=2) for topic in topic_nums: model.generate_topic_wordcloud(topic)


To Continue My Work


Top2Vec has a limitation: that it might not work well for a small dataset. Therefore, when using a small dataset (such as Mr. Kaushik's mixed.json) sometimes it will not work. See the demonstration on jupyter notebook.


The same notebook as the previous notebook, but when we re-run the notebook sometimes it will fail.


Top2Vec algorithm has a good performance, but still has this problem when using a small dataset. To continue my work, we need to find an alternative algorithm or improving the current Top2Vec implementation.