
Prompt Driven Process for Python-based Data Science using CRISP-DM Framework

RPI - 2024

Framework: https://app.myeducator.com/reader/web/1421a/2/qk5s5/

Resources: https://colab.research.google.com

Data/Problems: https://www.kaggle.com

Examples: https://introml.analyticsdojo.com/intro.html

Today's Notebook: https://colab.research.google.com/notebooks/intro.ipynb

Step 0: Get the Data

import pandas as pd
# Load the training and test sets into Pandas DataFrames
data = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
data_test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')

Step 1: Understand the Data

  1. Define the Project Objectives:

    • Identify the business or research objectives.
    • Understand the context and define what questions you aim to answer or what predictions you want to make.
  2. Acquire the Data:

    • Collect the dataset from relevant sources.
    • Understand the structure and content of the data.
  3. Explore the Data:

    • Perform initial data exploration using descriptive statistics.
    • Identify the types of variables (numerical, categorical, datetime, etc.).
    • Handle missing values, if any.
  4. Document Findings:

    • Summarize your initial findings.
    • Note any interesting patterns, anomalies, or data quality issues.
import pandas as pd

# Load the data (Already done before)
# data = pd.read_csv('your_dataset.csv')

# Initial exploration
print(data.info())
print(data.describe())
print(data.head())
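
The code above covers exploration but not the "handle missing values" item from the list; before deciding on a strategy it helps to quantify how much is missing (the actual filling is handled later by SimpleImputer in Step 2). A minimal sketch:

# Count and share of missing values per column
missing = data.isnull().sum()
print(missing[missing > 0])
print((missing[missing > 0] / len(data)).round(3))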

Step 2: Data Preparation - Turn All Variables to Numbers

  1. Handle Categorical Variables:

    • Convert categorical variables to numerical values using techniques like one-hot encoding or label encoding.
  2. Normalize/Standardize Numerical Variables:

    • Scale numerical variables to zero mean and unit variance (standardization) or normalize them to a specific range (e.g., 0 to 1).
  3. Create a Feature Matrix:

    • Combine all variables into a single feature matrix that can be used for modeling.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Define the transformations (replace these placeholder names with actual columns from your dataset)
numeric_features = ['numeric_column1', 'numeric_column2']
categorical_features = ['categorical_column1', 'categorical_column2']

# Preprocessing pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply transformations
data_prepared = preprocessor.fit_transform(data)
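
If you want to inspect data_prepared with readable column names, ColumnTransformer exposes get_feature_names_out() in recent scikit-learn versions. A minimal sketch, reusing the placeholder feature lists above:

# Wrap the transformed array in a DataFrame with generated column names
feature_names = preprocessor.get_feature_names_out()
data_prepared_df = pd.DataFrame(
    data_prepared.toarray() if hasattr(data_prepared, 'toarray') else data_prepared,
    columns=feature_names)
print(data_prepared_df.head())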

Step 3: Visualize General Relationships

  1. Correlation Matrix:

    • Visualize correlations between numerical variables.
  2. Mean Differences:

    • Compare means of numerical variables across different categories.
  3. Pivot Tables:

    • Create pivot tables to explore the relationships between categorical variables and numerical outcomes.
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix (computed on the original numeric columns; data_prepared may be a
# sparse array with one-hot columns, which is harder to read in a heatmap)
correlation_matrix = data.select_dtypes(include='number').corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()

# Mean differences
data.groupby('categorical_column1')['numeric_column1'].mean().plot(kind='bar')
plt.show()

# Pivot table
pivot_table = pd.pivot_table(data, values='numeric_column1', index='categorical_column1', columns='categorical_column2')
print(pivot_table)
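
If the pivot table is easier to read visually, it can be passed straight to a heatmap (an optional extra, reusing the placeholder column names):

# Visualize the pivot table as a heatmap
sns.heatmap(pivot_table, annot=True, fmt='.1f')
plt.show()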

Step 4: Define the Prediction Task

  1. Identify the Target Variable:

    • Determine which variable you want to predict (the target variable).
  2. Classify the Problem Type:

    • Is it a classification problem (predicting categories) or a regression problem (predicting numerical values)?
# Define target and features
target = 'target_column'
features = data.drop(columns=[target]).columns

# Simple heuristic: treat object-dtype or low-cardinality targets as classification
if data[target].dtype == 'object' or data[target].nunique() < 10:
    problem_type = 'classification'
else:
    problem_type = 'regression'

print(f"Problem type: {problem_type}")

Step 5: Train-Test Split and Modeling

  1. Split the Data:

    • Split the dataset into training and testing sets.
  2. Select and Train the Model:

    • Choose appropriate algorithms for your problem type.
    • Train the model on the training data.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_prepared, data[target], test_size=0.2, random_state=42)

# Model selection and training
if problem_type == 'classification':
    model = LogisticRegression()
elif problem_type == 'regression':
    model = RandomForestRegressor()

model.fit(X_train, y_train)
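
One design choice worth noting (not shown in the original code): wrapping the preprocessor and the model in a single Pipeline lets cross-validation re-fit the preprocessing inside each fold, which avoids leaking test-fold statistics into the training folds. A minimal sketch under that assumption:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Preprocessing and model in one estimator, so CV refits both per fold
full_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', model)])

scoring = 'accuracy' if problem_type == 'classification' else 'neg_root_mean_squared_error'
scores = cross_val_score(full_pipeline, data[numeric_features + categorical_features],
                         data[target], cv=5, scoring=scoring)
print(f"CV score: {scores.mean():.3f} (+/- {scores.std():.3f})")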

Step 6: Evaluate the Model

  1. Make Predictions:

    • Use the trained model to make predictions on the test set.
  2. Evaluate Performance:

    • Use appropriate metrics to evaluate the model performance (accuracy, precision, recall for classification; RMSE, MAE for regression).
from sklearn.metrics import accuracy_score, precision_score, recall_score, mean_squared_error, mean_absolute_error

# Predictions
y_pred = model.predict(X_test)

# Evaluation
if problem_type == 'classification':
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    print(f"Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}")

elif problem_type == 'regression':
    rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE; avoids the deprecated squared=False argument
    mae = mean_absolute_error(y_test, y_pred)
    print(f"RMSE: {rmse}\nMAE: {mae}")

Summary

This multi-step process guides you through understanding your data, preparing it for analysis, exploring relationships, defining your predictive task, modeling, and evaluating your results. Adhering to the CRISP-DM framework ensures a structured approach to data science projects.

Next Steps

Part 1: Go to Kaggle.com and find an interesting data science project. Look for notebooks that have solved the prediction problem.

Part 2: Create a copy of the notebook and make a meaningful change. Consider a new visualization, summary table, or predictive model. Analyze how your results differ from others'.

https://www.kaggle.com/code/apapiu/regularized-linear-models
