
Prompt Driven Process for Python-based Data Science using CRISP-DM Framework

RPI - 2024

Framework: https://app.myeducator.com/reader/web/1421a/2/qk5s5/

Resources: https://colab.research.google.com

Data/Problems: https://www.kaggle.com

Examples: https://introml.analyticsdojo.com/intro.html

Today's Notebook: https://colab.research.google.com/notebooks/intro.ipynb

Step 0: Get the Data

import pandas as pd
# Load the training and test sets into Pandas DataFrames
data = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
data_test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')

Step 1: Understand the Data

  1. Define the Project Objectives:

    • Identify the business or research objectives.
    • Understand the context and define what questions you aim to answer or what predictions you want to make.
  2. Acquire the Data:

    • Collect the dataset from relevant sources.
    • Understand the structure and content of the data.
  3. Explore the Data:

    • Perform initial data exploration using descriptive statistics.
    • Identify the types of variables (numerical, categorical, datetime, etc.).
    • Handle missing values, if any.
  4. Document Findings:

    • Summarize your initial findings.
    • Note any interesting patterns, anomalies, or data quality issues.
import pandas as pd

# Load the data (Already done before)
# data = pd.read_csv('your_dataset.csv')

# Initial exploration
print(data.info())
print(data.describe())
print(data.head())
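
The code above covers exploration but not the "handle missing values" item from the list; before deciding on a strategy it helps to quantify how much is missing (the actual filling is handled later by SimpleImputer in Step 2). A minimal sketch:

# Count and share of missing values per column
missing = data.isnull().sum()
print(missing[missing > 0])
print((missing[missing > 0] / len(data)).round(3))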

Step 2: Data Preparation - Turn All Variables to Numbers

  1. Handle Categorical Variables:

    • Convert categorical variables to numerical values using techniques like one-hot encoding or label encoding.
  2. Normalize/Standardize Numerical Variables:

    • Scale numerical variables to zero mean and unit variance (standardization) or normalize them to a specific range (e.g., 0 to 1).
  3. Create a Feature Matrix:

    • Combine all variables into a single feature matrix that can be used for modeling.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Define the transformations (replace these placeholder names with actual columns from your dataset)
numeric_features = ['numeric_column1', 'numeric_column2']
categorical_features = ['categorical_column1', 'categorical_column2']

# Preprocessing pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply transformations
data_prepared = preprocessor.fit_transform(data)
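
If you want to inspect data_prepared with readable column names, ColumnTransformer exposes get_feature_names_out() in recent scikit-learn versions. A minimal sketch, reusing the placeholder feature lists above:

# Wrap the transformed array in a DataFrame with generated column names
feature_names = preprocessor.get_feature_names_out()
data_prepared_df = pd.DataFrame(
    data_prepared.toarray() if hasattr(data_prepared, 'toarray') else data_prepared,
    columns=feature_names)
print(data_prepared_df.head())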

Step 3: Visualize General Relationships

  1. Correlation Matrix:

    • Visualize correlations between numerical variables.
  2. Mean Differences:

    • Compare means of numerical variables across different categories.
  3. Pivot Tables:

    • Create pivot tables to explore the relationships between categorical variables and numerical outcomes.
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix (computed on the original numeric columns; data_prepared may be a
# sparse array with one-hot columns, which is harder to read in a heatmap)
correlation_matrix = data.select_dtypes(include='number').corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()

# Mean differences
data.groupby('categorical_column1')['numeric_column1'].mean().plot(kind='bar')
plt.show()

# Pivot table
pivot_table = pd.pivot_table(data, values='numeric_column1', index='categorical_column1', columns='categorical_column2')
print(pivot_table)
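
If the pivot table is easier to read visually, it can be passed straight to a heatmap (an optional extra, reusing the placeholder column names):

# Visualize the pivot table as a heatmap
sns.heatmap(pivot_table, annot=True, fmt='.1f')
plt.show()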

Step 4: Define the Prediction Task

  1. Identify the Target Variable:

    • Determine which variable you want to predict (the target variable).
  2. Classify the Problem Type:

    • Is it a classification problem (predicting categories) or a regression problem (predicting numerical values)?
# Define target and features
target = 'target_column'
features = data.drop(columns=[target]).columns

# Simple heuristic: treat object-dtype or low-cardinality targets as classification
if data[target].dtype == 'object' or data[target].nunique() < 10:
    problem_type = 'classification'
else:
    problem_type = 'regression'

print(f"Problem type: {problem_type}")

Step 5: Train-Test Split and Modeling

  1. Split the Data:

    • Split the dataset into training and testing sets.
  2. Select and Train the Model:

    • Choose appropriate algorithms for your problem type.
    • Train the model on the training data.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_prepared, data[target], test_size=0.2, random_state=42)

# Model selection and training
if problem_type == 'classification':
    model = LogisticRegression()
elif problem_type == 'regression':
    model = RandomForestRegressor()

model.fit(X_train, y_train)
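
One design choice worth noting (not shown in the original code): wrapping the preprocessor and the model in a single Pipeline lets cross-validation re-fit the preprocessing inside each fold, which avoids leaking test-fold statistics into the training folds. A minimal sketch under that assumption:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Preprocessing and model in one estimator, so CV refits both per fold
full_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', model)])

scoring = 'accuracy' if problem_type == 'classification' else 'neg_root_mean_squared_error'
scores = cross_val_score(full_pipeline, data[numeric_features + categorical_features],
                         data[target], cv=5, scoring=scoring)
print(f"CV score: {scores.mean():.3f} (+/- {scores.std():.3f})")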

Step 6: Evaluate the Model

  1. Make Predictions:

    • Use the trained model to make predictions on the test set.
  2. Evaluate Performance:

    • Use appropriate metrics to evaluate the model performance (accuracy, precision, recall for classification; RMSE, MAE for regression).
from sklearn.metrics import accuracy_score, precision_score, recall_score, mean_squared_error, mean_absolute_error

# Predictions
y_pred = model.predict(X_test)

# Evaluation
if problem_type == 'classification':
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    print(f"Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}")

elif problem_type == 'regression':
    rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE; avoids the deprecated squared=False argument
    mae = mean_absolute_error(y_test, y_pred)
    print(f"RMSE: {rmse}\nMAE: {mae}")

Summary

This multi-step process guides you through understanding your data, preparing it for analysis, exploring relationships, defining your predictive task, modeling, and evaluating your results. Adhering to the CRISP-DM framework ensures a structured approach to data science projects.

Next Steps

Part 1: Go to Kaggle.com and find an interesting data science project. Look for notebooks that have solved the prediction problem.

Part 2: Create a copy of the notebook and make a meaningful change. Consider a new visualization, summary table, or predictive model. Analyze how your results differ from others'.

https://www.kaggle.com/code/apapiu/regularized-linear-models
