@keamanansiber
Last active September 8, 2022 08:30

Google Summer of Code 2022 Final Report

Name: Hatma Suryotrisongko • Project: OWASP Maryam

Proposal Topic

Deep Learning for OWASP Maryam's NLP operations

Work Done

Designed and implemented a topic modeling module (PR #269) (Personal repo link).

Milestones Achieved

  • Studied and ran all of Maryam's existing modules (Jupyter Notebooks).

  • The first prototype.

    • Designed and implemented the first prototype of core/util/iris/topicmodeling.py, covering the K-Means algorithm and a word cloud (notebook):

      output = tm.perform_kmeans()

      Screenshot1

  • The second prototype.

    • Designed and implemented topic modeling on Mr. Kaushik's dataset, using the NMF and LDA algorithms (notebook):

      tm = topicmodeling(jsonfile)

      tm.perform_nmf1()

      Screenshot2

      tm.perform_nmf2()

      Screenshot3

      tm.perform_lda()

      Screenshot4

  • The third prototype.

    • Implemented the BERT-based approach using BERTopic and SentenceTransformer (notebook). Users can choose their preferred word embedding model from the available repositories, since the best-performing embedding model may differ from one dataset to another.

      Screenshot5

      tm.run_topic_modeling("paraphrase-distilroberta-base-v1")

      Screenshot6
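The sketch below shows one way a user-selected embedding model could be passed to BERTopic. It is an assumption about how such a call might look, not the body of tm.run_topic_modeling(); the function name run_bertopic is hypothetical, and the bertopic and sentence-transformers packages must be installed (the model is downloaded on first use).

```python
def run_bertopic(docs, embedding_model_name="paraphrase-distilroberta-base-v1"):
    """Fit BERTopic on docs using the named sentence-transformers model."""
    # Imported lazily because both packages are heavy to load.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # Load the embedding model the user asked for.
    embedder = SentenceTransformer(embedding_model_name)

    # Hand the embedder to BERTopic and fit it on the documents.
    topic_model = BERTopic(embedding_model=embedder)
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics
```

With a reasonably large list of documents, run_bertopic(docs, "all-distilroberta-v1") would return the fitted model and one topic id per document, and topic_model.get_topic_info() would then summarize the topics found.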

  • The fourth prototype.

    • Added CSV file support and stopword removal (notebook):

      tm = topicmodeling(inputfile="testdataset.csv", filetype="csv", verbose=True)

      Screenshot7
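As a rough illustration of this step, the sketch below reads documents from a CSV file and removes stopwords using only the standard library. The column name "text", the tiny stopword set, and both function names are assumptions made for the example; the module's actual CSV layout and stopword list are not shown in this report.

```python
import csv
import io

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def load_csv_texts(handle, column="text"):
    """Read one document per row from the given column of a CSV file object."""
    return [row[column] for row in csv.DictReader(handle) if row.get(column)]


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text and drop tokens that appear in the stopword set."""
    return " ".join(t for t in text.lower().split() if t not in stopwords)


# An in-memory file stands in for something like testdataset.csv.
sample = io.StringIO("text\nThe history of the guitar\nAn exploit in the firewall\n")
cleaned = [remove_stopwords(t) for t in load_csv_texts(sample)]
print(cleaned)  # ['history guitar', 'exploit firewall']
```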

  • Created the initial PR for the topicmodeling module (PR).

  • Fixed a bug and added new features (commit).

    • Renamed plotly.py to resolve the dependency problem, and added the --output and --api functionality. Tested using:

      topicmodeling -i mixed.json -t json -m all-distilroberta-v1 --output

      report json testreport iris/topicmodeling

      topicmodeling -i mixed.json -t json -m all-distilroberta-v1 --api

  • Added a new feature:

    • Search topics using a keyword (Top2Vec) (notebook):

      topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["music"], num_topics=2)
      for topic in topic_nums:
          model.generate_topic_wordcloud(topic)

      Screenshot9

      Screenshot8

To Continue My Work

  • Top2Vec has a limitation: it may not work well on a small dataset. With a small dataset (such as Mr. Kaushik's mixed.json), it sometimes fails. See the demonstration in the Jupyter notebook.

    • This is the same notebook as the previous one, but re-running it sometimes fails.

    • Top2Vec performs well overall but still has this problem with small datasets. Continuing this work means either finding an alternative algorithm or improving the current Top2Vec implementation.
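One possible direction, sketched here purely as an assumption and not as part of the current module, is to wrap the Top2Vec step so that a failure on a small dataset falls back to another algorithm. The function run_with_fallback and the two stand-in callables are hypothetical.

```python
def run_with_fallback(docs, primary, fallback):
    """Run primary(docs); if it raises, run fallback(docs) instead."""
    try:
        return "primary", primary(docs)
    except Exception as error:
        # Caught broadly on purpose: the exact failure mode is not specified.
        print(f"primary algorithm failed ({error!r}); using fallback")
        return "fallback", fallback(docs)


def failing_algorithm(docs):
    """Stand-in for a step that fails on a small dataset."""
    raise ValueError("dataset too small")


def simple_algorithm(docs):
    """Stand-in for any alternative that tolerates small datasets."""
    return len(docs)


print(run_with_fallback(["doc one", "doc two"], failing_algorithm, simple_algorithm))
```

In practice, primary would wrap the Top2Vec training call and fallback would wrap whichever alternative is chosen.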
