Vijay Anand Pandian vijayanandrp

vijayanandrp / tutorial_with_solutions.md
Created December 15, 2017 07:42
PyCon 2016 tutorial by Kevin Markham.

Tutorial: Machine Learning with Text in scikit-learn

Agenda

  1. Model building in scikit-learn (refresher)
  2. Representing text as numerical data
  3. Reading a text-based dataset into pandas
  4. Vectorizing our dataset
  5. Building and evaluating a model
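Agenda item 2, representing text as numerical data, can be illustrated with a bag-of-words count matrix. What follows is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the function name `build_count_matrix`, the whitespace tokenizer, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def build_count_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Lowercase and split on whitespace: a deliberately simple tokenizer (assumption).
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of every token seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical sample documents.
vocabulary, matrix = build_count_matrix(
    ["call you tonight", "call me a cab", "please call me"]
)
print(vocabulary)
for row in matrix:
    print(row)
```

Each document becomes a fixed-length row of counts, which is the kind of numeric input a classifier can be trained on.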

Data wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format to make it more appropriate and valuable for downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations. (Source: Wikipedia)

Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data. (Source: Stanford)

Example - 1

This is a manual mode in which you can test the prediction model with names entered at runtime.

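import random
import sys


def model_evaluation(classifier):
    """Interactively classify names typed in by the user."""
    print('<<<  Testing Module   >>> ')
    print('Enter "q" or "quit" to end testing module')
    while True:
        test_name = input('\n Enter name to classify: ')
        if test_name.lower() in ('q', 'quit'):
            print('End')
            sys.exit(0)  # a normal quit should not report an error status
        # Assumed continuation (not part of the original listing): classify the
        # name using the extract_feature() helper shown later in this document.
        print(classifier.classify(extract_feature(test_name)))


def train_and_test(train_percent=0.80):
    # prepare_data_set() and validate_data_set() are defined elsewhere in the gist.
    feature_set = prepare_data_set()
    validate_data_set(feature_set)
    random.shuffle(feature_set)
    total = len(feature_set)
    cut_point = int(total * train_percent)
    # Split the dataset into training and test portions.
    train_set = feature_set[:cut_point]
    test_set = feature_set[cut_point:]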
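The shuffle-then-slice split used in `train_and_test` can be tried on its own. The sketch below assumes a hypothetical helper named `split_data` and a `seed` parameter for repeatable shuffles; neither appears in the original gist.

```python
import random


def split_data(items, train_percent=0.80, seed=None):
    """Shuffle a copy of items and split it into (train_set, test_set)."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    shuffled = list(items)     # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut_point = int(len(shuffled) * train_percent)
    return shuffled[:cut_point], shuffled[cut_point:]


# Hypothetical data: ten integers standing in for labelled feature sets.
train_set, test_set = split_data(range(10), train_percent=0.80, seed=42)
print(len(train_set), len(test_set))
```

Shuffling before slicing matters when the source data is ordered, for example with all names of one class listed first; without it the test portion could contain a single class only.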

Extracts features (also called attributes, inputs, or predictors) from a given name string.

def extract_feature(name: str):
    name = name.upper()
    feature = dict()
    
    # additional feature extraction
    # feature["first_1"] = name[0]
    # for letter in 'abcdefghijklmnopqrstuvwxyz'.upper():
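The listing above is cut off before any features are filled in. As a rough idea of what name features could look like, here is a self-contained sketch. Only the `first_1` key comes from the commented-out line in the original; the function name `extract_name_features` and the keys `last_1`, `last_2`, `length`, and `count_<LETTER>` are assumptions, not the author's actual feature set.

```python
import string
from collections import Counter


def extract_name_features(name):
    """Build a feature dictionary from a name (illustrative feature choices)."""
    name = name.upper()
    features = {
        # Slices are used instead of indexing so an empty name does not raise.
        "first_1": name[:1],
        "last_1": name[-1:],
        "last_2": name[-2:],
        "length": len(name),
    }
    counts = Counter(name)
    # One count feature per letter of the alphabet.
    for letter in string.ascii_uppercase:
        features["count_" + letter] = counts[letter]
    return features


features = extract_name_features("Anand")
print(features["first_1"], features["last_2"], features["count_A"])
```

A dictionary of this shape is the feature-set format that NLTK classifiers such as `NaiveBayesClassifier` accept for training and classification.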

You can download the dataset here.

#!/usr/bin/env python3.5
# -*- coding: utf-8 -*-

import os
import random
from zipfile import ZipFile
from nltk import NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier, classify
vijayanandrp / read_email.py
Last active November 2, 2022 12:56
Function for reading email files (*.eml only) using Python - https://informationcorners.com/read-send-emails-python/
# -*- coding: utf-8 -*-
import re
import email
import smtplib
import mimetypes
from email.mime.multipart import MIMEMultipart
from email import encoders
from email.mime.audio import MIMEAudio
from email.mime.base import MIMEBase