
Agenda

  1. ML/DM question --> ML/DM solution
  2. Real-world problem --> ML/DM question

# Workflow

  1. Ask a question
  2. Collect data
  3. Formulate the problem
  4. Translate the real-world problem into a math/data problem

Orange

File --> Data Table (Report)

File --> Scatter Plot

File --> Distributions

File --> Classification Tree --> Classification Tree Graph
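
These are GUI widget chains; a rough scripting analogue of the same exploration steps (my own sketch, using scikit-learn and matplotlib rather than Orange itself, on the built-in iris data) might look like:

```python
# Rough scripting analogue of the Scatter Plot / Distributions /
# Classification Tree workflows -- not Orange itself, just the same steps.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X, y = iris.data, iris.target

# "Scatter Plot": two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()

# "Distributions": histogram of one feature
plt.hist(X[:, 2], bins=20)
plt.xlabel(iris.feature_names[2])
plt.show()

# "Classification Tree --> Tree Graph": fit a tree and draw it
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
plot_tree(tree, feature_names=iris.feature_names, class_names=list(iris.target_names))
plt.show()
```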

Types of Errors

  • In-sample error
  • Out-of-sample error
  • Model error
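
A small sketch (my own illustration, not from the notes) of how in-sample and out-of-sample error diverge when a model overfits:

```python
# An unconstrained tree memorizes the training set (near-zero in-sample
# error) but does noticeably worse on held-out data (out-of-sample error).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("in-sample error:    ", 1 - model.score(X_train, y_train))
print("out-of-sample error:", 1 - model.score(X_test, y_test))
```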

To avoid overfitting

File --> Data Sampler --> Test Learner (remember to pass both the train data and the test data through this link)

You can connect the Data Sampler to a Data Table to view the sample, but the Data Table does not output data, so the Test Learner must be connected to the Data Sampler directly.

Data + Learner --> Test Learner --> Confusion Matrix, ROC Analysis
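
A hedged scripting equivalent of the Data Sampler --> Test Learner --> Confusion Matrix / ROC Analysis chain (again scikit-learn instead of Orange; the dataset and learner are my own choices):

```python
# Split the data, train a learner, then evaluate it with a confusion matrix
# and a ROC curve -- roughly what Data Sampler + Test Learner + Confusion
# Matrix + ROC Analysis do in the Orange GUI.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print(confusion_matrix(y_test, clf.predict(X_test)))

scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
print("AUC:", auc(fpr, tpr))
```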

You can also evaluate a model on data you draw yourself with the Paint Data widget.

Find the correct model for the data; find the correct data for the model.

How can I plot the predicted label?

File --> Data Sampler --(Data Sample)--> Model --> Predictions, and Data Sampler --(Remaining Data)--> Predictions
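
One way to do this in a script (a sketch, not the Orange recipe itself): train on the data sample, predict the remaining data, and color the scatter plot by the predicted label.

```python
# Train on a sample, predict the remaining rows, and plot the points colored
# by the *predicted* label (analogous to feeding Remaining Data into Predictions).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_sample, X_remaining, y_sample, _ = train_test_split(X, y, train_size=0.5, random_state=0)

model = DecisionTreeClassifier().fit(X_sample, y_sample)
y_pred = model.predict(X_remaining)

plt.scatter(X_remaining[:, 0], X_remaining[:, 1], c=y_pred)
plt.title("Remaining data colored by predicted label")
plt.show()
```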

PCA

Clustering

File --> PCA (optional) --> Distance --> Hierarchical Clustering, Distance Map
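
A rough scripted version (my assumption, with scikit-learn and SciPy in place of the Orange widgets): optionally reduce with PCA, compute pairwise distances, then build the hierarchical clustering and a distance map.

```python
# File --> PCA (optional) --> Distance --> Hierarchical Clustering / Distance Map,
# sketched with scikit-learn and SciPy instead of the Orange widgets.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_reduced = PCA(n_components=2).fit_transform(X)   # the optional PCA step

distances = pdist(X_reduced)                        # pairwise Euclidean distances

# Hierarchical Clustering
dendrogram(linkage(distances, method="average"))
plt.show()

# Distance Map
plt.imshow(squareform(distances))
plt.colorbar()
plt.show()
```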

IPython Notebook

State Table

| year | user | n |
|------|------|---|

The pivot table can be read as the joint distribution $P(\text{year}, \text{user})$.
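
In a notebook, the same pivot can be sketched with pandas; the year/user/n columns follow the table above, and the numbers are made up:

```python
# Pivot the (year, user, n) counts into a year x user table; normalizing the
# counts gives an estimate of the joint distribution P(year, user).
import pandas as pd

df = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014],
    "user": ["alice", "bob", "alice", "bob"],
    "n":    [10, 5, 3, 12],
})

pivot = df.pivot_table(index="year", columns="user", values="n", aggfunc="sum", fill_value=0)
print(pivot)
print(pivot / pivot.values.sum())   # estimate of P(year, user)
```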

Stock Example

Why not quantize your data? A low-resolution view often looks good.
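
A small pandas sketch of quantizing (binning) a series such as a stock price; the series here is synthetic:

```python
# Quantize a noisy series into a few discrete levels -- the low-resolution
# version is often easier to read than the raw values.
import numpy as np
import pandas as pd

np.random.seed(0)
price = pd.Series(100 + np.random.randn(250).cumsum())  # synthetic "stock" series

quantized = pd.cut(price, bins=5, labels=False)  # 5 equal-width bins, labeled 0..4
print(price.head())
print(quantized.head())
```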

NLP

Two families of approaches: probability-based (mainstream, but less intuitive) and distance-based.

Laplace smoothing spreads out your distribution; the downside is that it makes your sparse matrix dense.
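
A tiny numerical illustration (my own, with made-up counts) of how Laplace (add-one) smoothing spreads out the distribution and turns zero entries nonzero:

```python
# Laplace (add-one) smoothing: zero counts become small positive probabilities,
# so the distribution flattens out -- and a sparse count matrix becomes dense.
import numpy as np

counts = np.array([
    [3, 0, 0, 1],   # word counts for document/class 1
    [0, 5, 0, 0],   # word counts for document/class 2
], dtype=float)

unsmoothed = counts / counts.sum(axis=1, keepdims=True)
smoothed = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)

print(unsmoothed)   # still has zeros
print(smoothed)     # every entry is now nonzero
```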

Keyword features are crucial! Keywords in common pull articles together, while keywords present in one article but not another push them apart.
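
To see this pull-together / push-apart effect numerically, one can compare cosine similarities of bag-of-words keyword vectors; the three toy articles below are my own.

```python
# Articles sharing keywords get a high cosine similarity (pulled together);
# articles with disjoint keywords get a similarity near zero (pushed apart).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "stock market price index",
    "stock price crash market panic",
    "protein folding molecular biology",
]
X = CountVectorizer().fit_transform(articles)
print(cosine_similarity(X))
```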
