text_classification_demo.ipynb
@analyticascent
Author

analyticascent commented Dec 22, 2018

Some context for anyone that happens to stumble across this:

I've put this together as a way to give a few people I know (most of whom have never coded before) an idea of how text classification works. Reading through the markdown documentation and the code cells in the notebook above should take most people 30 minutes or so, but I'm not yet sure how long it would take a non-coder to understand why the code does what it does.

Any questions from people who happen to stumble upon this page are more than welcome. I plan on including this as part of a GitHub repo meant to serve as a Python crash course: https://github.com/analyticascent/python-inflection

The end goal of that repo is for someone with little or no coding experience to be able to pick up enough Python to be dangerous in ten days or less.

@2112bytes

I ran through this and it was very instructional. I came away with a few things:

This was super helpful; having workable, explained code is meaningful, and the concepts are explained well. So, on understanding that code...

  • Pandas, numpy, and sklearn are all new to me, and the intros here help steer me on what to learn next. Googling sklearn led me to scikit-learn.org, which has some introductory information other beginners might gain from (I haven't gone through it yet).

  • There is more for me to learn about random_state. Changing that int seems to produce very reliable accuracy results. I was expecting a more random train/test selection to generate slightly different metric scores on each run; in fact, they are consistent on every run for a given state value (see the sketch after this list).

  • The "brains" of it appear to be in the vectorizing and the algorithm, and the choices made in those would affect the outcome in more complex scenarios.
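
To check my understanding, here is a minimal sketch of that vectorize/split/fit/score flow (the documents and labels below are made up for illustration, not the notebook's dataset):

```python
# Minimal text-classification sketch: vectorize -> split -> fit -> score.
# The toy documents and labels are hypothetical, not the notebook's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

texts = ["free money now", "meeting at noon", "win a free prize",
         "lunch meeting today", "free prize inside", "project meeting notes"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam-like, 0 = normal

# Turn each document into word counts (rows of a document-term matrix)
vect = CountVectorizer()
X = vect.fit_transform(texts)

# The same random_state gives the same shuffle, hence the same split
# and the same score on every run; change it and the split changes.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```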

@analyticascent
Author

Today I uploaded what comes very close to being the final version of the notebook.

There may be enough spelling/grammatical errors to warrant another revision, but one thing I'm tempted to do is include links to the official documentation for each of the libraries used, so people can learn more about what they do and which parameters can be changed. I'm also trying to think of a clearer and more concise way to describe to readers what document-term matrices are.
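
Something along these lines might work as an illustration (a sketch with made-up sentences, not text from the notebook):

```python
# A document-term matrix has one row per document and one column per
# word in the vocabulary; each cell counts how often that word appears.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran", "the dog ran"]
vect = CountVectorizer()
dtm = vect.fit_transform(docs)

# Label the columns with the vocabulary to make the structure visible
# (on older scikit-learn versions, use vect.get_feature_names() instead)
print(pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out()))
#    cat  dog  ran  sat  the
# 0    1    0    0    1    1
# 1    1    0    1    0    1
# 2    0    1    1    0    1
```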

At this point I can't really think of any other major ways to improve it without making it either too wordy for newcomers or not detailed enough for that same group.

@2112bytes - Those are more or less the three main libraries used in most machine learning projects (although which specific tools you'll need from sklearn will depend on what you're trying to do).

When it comes to random_state values, it's been hard for me to find one that results in accuracy below 91% or above 93%. For example, setting it to 15 results in 91.06%, and 100 results in 92.98%. Part of why the results are so consistent is that the two users truly do have very different ways of talking about the same thing (such as how often they link to things), along with the fact that the number of available training samples is quite high.

By default, train_test_split uses 75% of the sample data to train the model and 25% to test it. That's one of the things made clear in the online documentation that I should probably add in the next revision.
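
In sketch form (with throwaway data, just to show the defaults and the effect of random_state):

```python
from sklearn.model_selection import train_test_split

data = list(range(100))

# With no test_size/train_size given, the split defaults to 75% / 25%
train, test = train_test_split(data, random_state=0)
print(len(train), len(test))  # 75 25

# Different random_state values shuffle differently but keep the sizes;
# fixing the value makes the same split (and the same score) reproducible
for state in (15, 100):
    train, test = train_test_split(data, random_state=state)
    print(state, train[:5])
```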

I won't post a new version until I'm sure it'll be the last. Thus far two coders (yourself included) and one non-coder have given me feedback, so I plan on seeking more input (especially from non-coders) until I can't find any more room for improvement. Really appreciate the feedback I've gotten from you and others thus far!
