Today I uploaded what comes very close to being the final version of the notebook.
There may be enough spelling and grammatical errors to warrant another revision, but one thing I'm tempted to do is include links to the official documentation for each of the libraries used, so people can learn more about what they do and which parameters can be changed. I'm also trying to think of a clearer, more concise way to describe to readers what document-term matrices are.
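One way to show the idea might be a tiny hand-rolled example. The sketch below is only an illustration and is not code from the notebook: the two sentences are made up, and it uses plain Python rather than a vectorizer from `sklearn`. Each row is a document, each column is a word in the vocabulary, and each cell counts how often that word appears in that document.

```python
from collections import Counter

# Two made-up documents, purely for illustration.
docs = [
    "check out this link",
    "this link is great",
]

# Split each document into lowercase words.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is every distinct word across all documents, sorted
# so the column order is stable.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

A real vectorizer does more than this (handling punctuation, for instance), but the resulting table has the same shape.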
At this point I can't really think of any other major ways to improve it without making it too wordy for newcomers or not detailed enough for the same group.
@2112bytes - Those three libraries are more or less the three main libraries used in most machine learning projects (although what specific tools are needed from `sklearn` will depend on what you're trying to do).
When it comes to `random_state` values, it's been hard for me to find a value that results in accuracy below 91% or above 93%. For example, setting it to `15` results in `91.06%` and `100` results in `92.98%`. Part of why the results appear to be so consistent is that the two users truly do have very different ways of talking about the same thing (such as how often they link to things), along with the fact that the number of available training samples is quite high.
By default, `train_test_split` will use 75% of the sample data to train the model and 25% to test it. That's one of the things made clear in the online documentation that I probably need to add in the next revision.
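A quick way to see both behaviors is a small sketch like the one below. It is not taken from the notebook: the 100 placeholder samples and their labels are invented stand-ins for the real data. It checks the default 75/25 split and shows that reusing the same `random_state` reproduces the same split, which is why the accuracy stays identical from run to run for a given value.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 fake samples with alternating labels.
samples = list(range(100))
labels = [i % 2 for i in samples]

# No test_size given, so the default 25% test share applies.
X_train, X_test, y_train, y_test = train_test_split(
    samples, labels, random_state=15
)
print(len(X_train), len(X_test))

# Repeating the call with the same random_state gives the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    samples, labels, random_state=15
)
print(X_test == X_test_again)

# A different random_state shuffles the data differently, so the model
# is trained and tested on a different partition.
_, X_test_other, _, _ = train_test_split(samples, labels, random_state=100)
print(X_test == X_test_other)
```

Leaving `random_state` unset should give a different shuffle on each run, which would produce the slightly varying scores one might otherwise expect.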
I won't post a new version until I'm sure it'll be the last. Thus far two coders (yourself included) and one non-coder have given me feedback, so I plan on seeking more input (especially from non-coders) until I can't find any more room for improvement. Really appreciate the feedback I've gotten from you and others thus far!
I ran through this and it was very instructional. I came away with a few things:
This was super helpful; having workable, explained code is meaningful. The concepts are explained well. So on understanding that code...
Pandas, numpy, and sklearn are all new to me, and the intros here help steer me on what to learn next. Googling sklearn led me to scikit-learn.org, which has some introductory information other beginners might gain from (I haven't gone through it yet).
There is more for me to learn about random_state. Changing that int seems to create very reliable accuracy results. I was thinking that a more random train/test selection would generate very slightly different metric scores on each run; in fact, they are consistent on every run for a given state value.
The "brains" of it appear to be in the vectorizing and algorithm, and the choices made in those would affect the outcome in more complex scenarios.