- ML/DM question --> ML/DM solutions
- Real world problem --> ML/DM question
# Workflow
- ask question
- collect data
- formulation
- Real world problem --> math/data problem
File --> Data Table (Report)
File --> Scatterplot
File --> Distributions
File --> Classification Tree --> tree graph
- In-sample error
- Out-of-sample error
- Model error
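
A minimal sketch of the gap between in-sample and out-of-sample error, using a 1-nearest-neighbour model that simply memorises its training rows. The data, the 20% label noise, and the helper names (`one_nn_predict`, `error_rate`) are made up for illustration. Because every training point is its own nearest neighbour, the in-sample error comes out as 0, which says nothing about how the model does on unseen rows.

```python
import random

def one_nn_predict(train, x):
    """Predict the label of x as the label of its nearest training point."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def error_rate(train, data):
    """Fraction of rows in `data` that the 1-NN model built on `train` gets wrong."""
    wrong = sum(1 for x, y in data if one_nn_predict(train, x) != y)
    return wrong / len(data)

random.seed(0)
# Hypothetical data: the label is 1 when x > 0.5, with 20% of labels flipped as noise.
rows = []
for _ in range(200):
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.2:
        y = 1 - y
    rows.append((x, y))

train, test = rows[:100], rows[100:]
in_sample = error_rate(train, train)      # rows the model has memorised
out_of_sample = error_rate(train, test)   # rows the model has never seen
print("in-sample:", in_sample, "out-of-sample:", out_of_sample)
```

The model fits the noise in the training rows perfectly, so only the held-out rows give an honest error estimate.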
To avoid overfitting
File --> Data Sampler --> Data Table (to view the sample; but Data Table doesn't output data, so Test Learner needs to connect to Data Sampler directly)
File --> Data Sampler --> Test Learner (remember to pass both train data and test data through this link)
Data + Learner --> Test Learner --> Confusion Matrix, ROC Analysis
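
What the confusion matrix step computes can be sketched by hand. The labels and predictions below are hypothetical, and `confusion_matrix` is an illustrative helper, not the tool's own code. The true/false positive rates at the end give one point on a ROC curve.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return counts[a][p]: how often true class a was predicted as class p."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

# Hypothetical test-set labels and the predictions of some learner.
actual    = ["yes", "yes", "no", "no", "yes", "no",  "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "no", "yes"]

cm = confusion_matrix(actual, predicted, labels=["yes", "no"])
tp, fn = cm["yes"]["yes"], cm["yes"]["no"]
fp, tn = cm["no"]["yes"], cm["no"]["no"]

accuracy = (tp + tn) / len(actual)
true_positive_rate = tp / (tp + fn)    # y-axis of a ROC curve
false_positive_rate = fp / (fp + tn)   # x-axis of a ROC curve
print(cm, accuracy, true_positive_rate, false_positive_rate)
```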
You can evaluate a model with data
Paint Data
Find Correct Model for Data, Find Correct Data for Model
File --> Data Sampler --(Data Sample)--> Model --> Prediction
File --> Data Sampler --(Remaining Data)--> Prediction
File --> PCA (optional) --> Distance --> Hierarchical Clustering, Distance Map
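
A rough sketch of the Distance --> Hierarchical Clustering part of this pipeline (the optional PCA step is skipped). It assumes Euclidean distance and single linkage, which may differ from the settings used in the tool. The points and function names are hypothetical.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances between all points."""
    return [[math.dist(p, q) for q in points] for p in points]

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    dist = distance_matrix(points)
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: cluster distance = distance of the closest pair
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
clusters = single_linkage(points, n_clusters=2)
print(clusters)
```

The matrix returned by `distance_matrix` is the same thing a distance map visualises: small values mark rows that end up in the same cluster.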
State Table
|year|user|n
The meaning of pivot table is
Why not quantize your data? Low resolution looks good!
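
One simple way to quantize is equal-width binning. This is a sketch under that assumption; the `quantize` helper and the `ages` values are invented for illustration.

```python
def quantize(values, n_bins):
    """Map each value to the index of its equal-width bin (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # the maximum falls on the upper edge, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 25, 31, 38, 44, 52, 67]   # hypothetical raw values
bins = quantize(ages, n_bins=3)
print(bins)
```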
Probability-based (mainstream but less intuitive) and distance-based
Laplace smoothing spreads out your distribution; the downside is that it makes your sparse matrix dense.
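
A small sketch of the sparse-to-dense effect, using add-one (Laplace) smoothing on word counts. The vocabulary, the counts, and the `laplace_smooth` helper are hypothetical.

```python
def laplace_smooth(counts, vocabulary, alpha=1.0):
    """Return P(word) with add-alpha smoothing over the whole vocabulary."""
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocabulary}

vocabulary = ["data", "model", "tree", "error", "cluster"]  # hypothetical
counts = {"data": 3, "model": 1}   # sparse: only 2 of 5 words were observed

unsmoothed = {w: counts.get(w, 0) / sum(counts.values()) for w in vocabulary}
smoothed = laplace_smooth(counts, vocabulary)

nonzero_before = sum(1 for p in unsmoothed.values() if p > 0)
nonzero_after = sum(1 for p in smoothed.values() if p > 0)
print(nonzero_before, nonzero_after)
```

Before smoothing only the observed words carry probability; afterwards every word in the vocabulary has a non-zero entry, so nothing can be stored as an implicit zero any more.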
Keyword features are crucial! Keywords in common pull articles together; keywords also push the articles that have them away from those that don't.
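
The pull/push effect can be illustrated with a keyword-overlap score. Jaccard similarity is used here as an assumption; the notes do not say which measure was actually used, and the articles and keywords are made up.

```python
def jaccard(a, b):
    """Share of keywords two articles have in common (0 = none, 1 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical keyword sets for three articles.
articles = {
    "A": {"python", "pandas", "dataframe"},
    "B": {"python", "pandas", "plotting"},
    "C": {"football", "league", "transfer"},
}
sim_ab = jaccard(articles["A"], articles["B"])   # shared keywords pull A and B together
sim_ac = jaccard(articles["A"], articles["C"])   # no shared keywords push A and C apart
print(sim_ab, sim_ac)
```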