text_classification_demo.ipynb
@analyticascent (Author)

Today I uploaded what comes very close to being the final version of the notebook.

There may be enough spelling/grammatical errors to warrant another revision, but one thing I'm tempted to do is include links to the official documentation for each of the libraries used so people can learn more about what they do and which parameters can be changed. I'm also trying to think of a clearer, more concise way to describe to readers what document-term matrices are.
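
For what it's worth, here's the kind of tiny example I might add to illustrate a document-term matrix. The two sample sentences are just made up for illustration, but the `CountVectorizer` usage is the real scikit-learn API:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "documents" - each one becomes a row of the matrix
docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.vocabulary_)  # maps each unique word to a column index
print(dtm.toarray())           # rows = documents, cells = word counts
```

Each column corresponds to one word in the combined vocabulary, and each cell is how many times that word appears in a given document - that's really all a document-term matrix is.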

At this point I can't really think of any other major ways to improve it without making it either too wordy for newcomers or not detailed enough for them.

@2112bytes - Those are more or less the three main libraries used in most machine learning projects (although which specific tools you need from sklearn will depend on what you're trying to do).

When it comes to random_state values, it's been hard for me to find one that results in accuracy below 91% or above 93%. For example, setting it to 15 results in 91.06%, and 100 results in 92.98%. Part of why the results appear so consistent is that the two users really do have very different ways of talking about the same thing (such as how often they link to things), along with the fact that the number of available training samples is quite high.
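
If anyone wants to sanity-check that stability themselves, the loop I have in mind looks roughly like this - note that `docs` and `labels` are placeholders for the actual data, and the Naive Bayes pipeline here is just a stand-in for whatever the notebook uses:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def accuracy_for_seed(docs, labels, seed):
    """Train/test once with the given random_state and return accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        docs, labels, random_state=seed)
    vect = CountVectorizer()
    X_train_dtm = vect.fit_transform(X_train)  # fit vocab on training data only
    X_test_dtm = vect.transform(X_test)        # reuse the same vocab for test
    model = MultinomialNB().fit(X_train_dtm, y_train)
    return accuracy_score(y_test, model.predict(X_test_dtm))

# e.g. see how much the split alone moves the score:
# for seed in (15, 42, 100):
#     print(seed, accuracy_for_seed(docs, labels, seed))
```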

By default, train_test_split will use 75% of the sample data to train the model and 25% to test it. That's one of the things the online documentation makes clear, and something I should probably mention in the next revision.
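
Concretely, the default proportions are easy to verify with dummy data (a minimal sketch, nothing from the notebook itself):

```python
from sklearn.model_selection import train_test_split

data = list(range(100))                   # 100 dummy samples
train, test = train_test_split(data, random_state=15)

# With no test_size argument, sklearn holds out 25% for testing
print(len(train), len(test))              # 75 25
```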

I won't post a new version until I'm sure it'll be the last. Thus far two coders (yourself included) and one non-coder have given me feedback, so I plan on seeking more input (especially from non-coders) until I can't find any more room for improvement. Really appreciate the feedback I've gotten from you and others thus far!
