
Agenda

  1. ML/DM question --> ML/DM solution
  2. Real-world problem --> ML/DM question

# Workflow

  1. Ask a question
  2. Collect data
  3. Formulate the problem
  4. Translate the real-world problem into a math/data problem

Orange

File --> Data Table (Report)

File --> Scatter Plot

File --> Distributions

File --> Classification Tree --> Classification Tree Graph
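
These are GUI widget chains; a rough scripting analogue of the same exploration steps (my own sketch, using scikit-learn and matplotlib rather than Orange itself, on the built-in iris data) might look like:

```python
# Rough scripting analogue of the Scatter Plot / Distributions /
# Classification Tree workflows -- not Orange itself, just the same steps.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X, y = iris.data, iris.target

# "Scatter Plot": two features, colored by class
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()

# "Distributions": histogram of one feature
plt.hist(X[:, 2], bins=20)
plt.xlabel(iris.feature_names[2])
plt.show()

# "Classification Tree --> Tree Graph": fit a tree and draw it
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
plot_tree(tree, feature_names=iris.feature_names, class_names=list(iris.target_names))
plt.show()
```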

Types of Errors

  • In-sample error
  • Out-of-sample error
  • Model error
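
A small sketch (my own illustration, not from the notes) of how in-sample and out-of-sample error diverge when a model overfits:

```python
# An unconstrained tree memorizes the training set (near-zero in-sample
# error) but does noticeably worse on held-out data (out-of-sample error).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("in-sample error:    ", 1 - model.score(X_train, y_train))
print("out-of-sample error:", 1 - model.score(X_test, y_test))
```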

To avoid overfitting

File --> Data Sampler --> Test Learner (remember to pass both the train data and the test data through this link)

You can connect the Data Sampler to a Data Table to view the sample, but the Data Table does not output data, so the Test Learner must be connected to the Data Sampler directly.

Data + Learner --> Test Learner --> Confusion Matrix, ROC Analysis
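
A hedged scripting equivalent of the Data Sampler --> Test Learner --> Confusion Matrix / ROC Analysis chain (again scikit-learn instead of Orange; the dataset and learner are my own choices):

```python
# Split the data, train a learner, then evaluate it with a confusion matrix
# and a ROC curve -- roughly what Data Sampler + Test Learner + Confusion
# Matrix + ROC Analysis do in the Orange GUI.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print(confusion_matrix(y_test, clf.predict(X_test)))

scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
print("AUC:", auc(fpr, tpr))
```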

You can also evaluate a model on data you draw yourself with the Paint Data widget.

Find the correct model for the data; find the correct data for the model.

How can I plot the predicted label?

File --> Data Sampler --(Data Sample)--> Model --> Predictions, and Data Sampler --(Remaining Data)--> Predictions
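
One way to do this in a script (a sketch, not the Orange recipe itself): train on the data sample, predict the remaining data, and color the scatter plot by the predicted label.

```python
# Train on a sample, predict the remaining rows, and plot the points colored
# by the *predicted* label (analogous to feeding Remaining Data into Predictions).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_sample, X_remaining, y_sample, _ = train_test_split(X, y, train_size=0.5, random_state=0)

model = DecisionTreeClassifier().fit(X_sample, y_sample)
y_pred = model.predict(X_remaining)

plt.scatter(X_remaining[:, 0], X_remaining[:, 1], c=y_pred)
plt.title("Remaining data colored by predicted label")
plt.show()
```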

PCA

Clustering

File --> PCA (optional) --> Distance --> Hierarchical Clustering, Distance Map
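
A rough scripted version (my assumption, with scikit-learn and SciPy in place of the Orange widgets): optionally reduce with PCA, compute pairwise distances, then build the hierarchical clustering and a distance map.

```python
# File --> PCA (optional) --> Distance --> Hierarchical Clustering / Distance Map,
# sketched with scikit-learn and SciPy instead of the Orange widgets.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_reduced = PCA(n_components=2).fit_transform(X)   # the optional PCA step

distances = pdist(X_reduced)                        # pairwise Euclidean distances

# Hierarchical Clustering
dendrogram(linkage(distances, method="average"))
plt.show()

# Distance Map
plt.imshow(squareform(distances))
plt.colorbar()
plt.show()
```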

IPython Notebook

State Table

| year | user | n |
|------|------|---|

The pivot table can be read as the joint distribution $P(\text{year}, \text{user})$.
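
In a notebook, the same pivot can be sketched with pandas; the year/user/n columns follow the table above, and the numbers are made up:

```python
# Pivot the (year, user, n) counts into a year x user table; normalizing the
# counts gives an estimate of the joint distribution P(year, user).
import pandas as pd

df = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014],
    "user": ["alice", "bob", "alice", "bob"],
    "n":    [10, 5, 3, 12],
})

pivot = df.pivot_table(index="year", columns="user", values="n", aggfunc="sum", fill_value=0)
print(pivot)
print(pivot / pivot.values.sum())   # estimate of P(year, user)
```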

Stock Example

Why not quantize your data? A low-resolution view often looks good.
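
A small pandas sketch of quantizing (binning) a series such as a stock price; the series here is synthetic:

```python
# Quantize a noisy series into a few discrete levels -- the low-resolution
# version is often easier to read than the raw values.
import numpy as np
import pandas as pd

np.random.seed(0)
price = pd.Series(100 + np.random.randn(250).cumsum())  # synthetic "stock" series

quantized = pd.cut(price, bins=5, labels=False)  # 5 equal-width bins, labeled 0..4
print(price.head())
print(quantized.head())
```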

NLP

Two families of approaches: probability-based (mainstream, but less intuitive) and distance-based.

Laplace smoothing spreads out your distribution; the downside is that it makes your sparse matrix dense.
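
A tiny numerical illustration (my own, with made-up counts) of how Laplace (add-one) smoothing spreads out the distribution and turns zero entries nonzero:

```python
# Laplace (add-one) smoothing: zero counts become small positive probabilities,
# so the distribution flattens out -- and a sparse count matrix becomes dense.
import numpy as np

counts = np.array([
    [3, 0, 0, 1],   # word counts for document/class 1
    [0, 5, 0, 0],   # word counts for document/class 2
], dtype=float)

unsmoothed = counts / counts.sum(axis=1, keepdims=True)
smoothed = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)

print(unsmoothed)   # still has zeros
print(smoothed)     # every entry is now nonzero
```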

Keyword features are crucial! Keywords in common pull articles together, while keywords present in one article but not another push them apart.
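
To see this pull-together / push-apart effect numerically, one can compare cosine similarities of bag-of-words keyword vectors; the three toy articles below are my own.

```python
# Articles sharing keywords get a high cosine similarity (pulled together);
# articles with disjoint keywords get a similarity near zero (pushed apart).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "stock market price index",
    "stock price crash market panic",
    "protein folding molecular biology",
]
X = CountVectorizer().fit_transform(articles)
print(cosine_similarity(X))
```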
