RPI - 2024
Framework: https://app.myeducator.com/reader/web/1421a/2/qk5s5/
Resources: https://colab.research.google.com
Data/Problems: https://www.kaggle.com
Examples: https://introml.analyticsdojo.com/intro.html
Today's Notebook: https://colab.research.google.com/notebooks/intro.ipynb
import pandas as pd
# Load the training and test sets into pandas DataFrames
data = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')
data_test = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/test.csv')
-
Define the Project Objectives:
- Identify the business or research objectives.
- Understand the context and define what questions you aim to answer or what predictions you want to make.
-
Acquire the Data:
- Collect the dataset from relevant sources.
- Understand the structure and content of the data.
-
Explore the Data:
- Perform initial data exploration using descriptive statistics.
- Identify the types of variables (numerical, categorical, datetime, etc.).
- Handle missing values, if any.
-
Document Findings:
- Summarize your initial findings.
- Note any interesting patterns, anomalies, or data quality issues.
import pandas as pd
# Load the data (Already done before)
# data = pd.read_csv('your_dataset.csv')
# Initial exploration
print(data.info())
print(data.describe())
print(data.head())
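The exploration bullets also mention identifying variable types and checking for missing values. A minimal sketch of both checks follows, using a small hypothetical DataFrame (`toy` and its column names are made up for illustration and are not part of the course dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `data`; column names are illustrative only.
toy = pd.DataFrame({
    "height": [1.62, 1.80, np.nan, 1.75],
    "weight": [55.0, 82.5, 70.1, 68.0],
    "color": ["red", "blue", None, "red"],
})

# Count missing values in each column.
missing_counts = toy.isna().sum()

# Split the columns into numerical and non-numerical (categorical) groups by dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

The same calls should work on `data` directly, for example `data.isna().sum()`.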
-
Handle Categorical Variables:
- Convert categorical variables to numerical values using techniques like one-hot encoding or label encoding.
-
Normalize/Standardize Numerical Variables:
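- Standardize numerical variables to zero mean and unit standard deviation (mean = 0, std = 1), or normalize them to a specific range (for example, 0 to 1). Standardizing rescales the values but does not change the shape of their distribution.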
-
Create a Feature Matrix:
- Combine all variables into a single feature matrix that can be used for modeling.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Define the transformations
numeric_features = ['numeric_column1', 'numeric_column2']
categorical_features = ['categorical_column1', 'categorical_column2']
# Preprocessing pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# Apply transformations
data_prepared = preprocessor.fit_transform(data)
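The bullets above also mention range normalization and label encoding, which the pipeline does not use. As a small illustration on made-up values (the arrays below are hypothetical), this sketch compares `StandardScaler` with `MinMaxScaler` and shows `OrdinalEncoder` as one way to map categories to integer codes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, StandardScaler

# Hypothetical single numeric column.
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(values)

# Min-max normalization: rescale so the minimum maps to 0 and the maximum to 1.
normalized = MinMaxScaler().fit_transform(values)

# Integer encoding of a hypothetical categorical column.
# Categories are sorted alphabetically, so blue=0, green=1, red=2.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
encoded = OrdinalEncoder().fit_transform(colors)

print(standardized.ravel())
print(normalized.ravel())
print(encoded.ravel())
```

Integer codes imply an ordering between categories, which is one reason one-hot encoding is often preferred for linear models.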
-
Correlation Matrix:
- Visualize correlations between numerical variables.
-
Mean Differences:
- Compare means of numerical variables across different categories.
-
Pivot Tables:
- Create pivot tables to explore the relationships between categorical variables and numerical outcomes.
import seaborn as sns
import matplotlib.pyplot as plt
# Correlation matrix
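# Correlate the original numeric columns; data_prepared also holds one-hot columns and may be a sparse matrix
correlation_matrix = data[numeric_features].corr()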
sns.heatmap(correlation_matrix, annot=True)
plt.show()
# Mean differences
data.groupby('categorical_column1')['numeric_column1'].mean().plot(kind='bar')
plt.show()
# Pivot table
pivot_table = pd.pivot_table(data, values='numeric_column1', index='categorical_column1', columns='categorical_column2')
print(pivot_table)
-
Identify the Target Variable:
- Determine which variable you want to predict (the target variable).
-
Classify the Problem Type:
- Is it a classification problem (predicting categories) or a regression problem (predicting numerical values)?
# Define target and features
target = 'target_column'
features = data.drop(columns=[target]).columns
# Classification or Regression problem
if data[target].dtype == 'object' or data[target].nunique() < 10:
    problem_type = 'classification'
else:
    problem_type = 'regression'
print(f"Problem type: {problem_type}")
-
Split the Data:
- Split the dataset into training and testing sets.
-
Select and Train the Model:
- Choose appropriate algorithms for your problem type.
- Train the model on the training data.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_prepared, data[target], test_size=0.2, random_state=42)
# Model selection and training
if problem_type == 'classification':
    model = LogisticRegression()
elif problem_type == 'regression':
    model = RandomForestRegressor()
model.fit(X_train, y_train)
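Note that the preprocessing step earlier is fitted on the full dataset before the split, so statistics from the test rows (means, category lists) can leak into training. One common way to avoid this is to bundle preprocessing and the model into a single `Pipeline` and fit it on the training rows only. The sketch below assumes a tiny hypothetical dataset (`toy`, `size`, `color`, `label` are illustrative names):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset; column names are illustrative only.
toy = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "color": ["red", "blue", "red", "blue", "red", "blue", "red", "blue"],
    "label": [0, 0, 0, 0, 1, 1, 1, 1],
})
X = toy[["size", "color"]]
y = toy["label"]

# Split the raw rows first, keeping class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model in one estimator.
preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])
clf = Pipeline(steps=[("preprocess", preprocess), ("model", LogisticRegression())])

# fit() learns scaling and encoding from the training rows only.
clf.fit(X_tr, y_tr)

# predict() applies the already-fitted transformations to the test rows.
preds = clf.predict(X_te)
print(preds)
```

With this layout the test rows are only ever transformed, never used to fit the scaler or encoder.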
-
Make Predictions:
- Use the trained model to make predictions on the test set.
-
Evaluate Performance:
- Use appropriate metrics to evaluate the model performance (accuracy, precision, recall for classification; RMSE, MAE for regression).
from sklearn.metrics import accuracy_score, precision_score, recall_score, mean_squared_error, mean_absolute_error
# Predictions
y_pred = model.predict(X_test)
# Evaluation
if problem_type == 'classification':
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    print(f"Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}")
elif problem_type == 'regression':
    # RMSE is the square root of MSE; the squared=False argument is deprecated in newer scikit-learn versions
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    mae = mean_absolute_error(y_test, y_pred)
    print(f"RMSE: {rmse}\nMAE: {mae}")
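To make the metrics concrete, here is a small worked example that computes them by hand with NumPy. All of the labels and values are made up for illustration. The classification part treats class 1 as the positive class, which is simpler than the weighted average used above:

```python
import numpy as np

# Hypothetical binary labels: true values and model predictions.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_hat = np.array([1, 0, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives

accuracy = float(np.mean(y_true == y_hat))  # share of correct predictions
precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found

# Hypothetical regression targets and predictions.
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

errors = predicted - actual
mae = float(np.mean(np.abs(errors)))        # mean absolute error
rmse = float(np.sqrt(np.mean(errors ** 2))) # root mean squared error

print(accuracy, precision, recall)
print(mae, rmse)
```

RMSE squares the errors before averaging, so the single error of 2 weighs more heavily in RMSE than in MAE.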
This multi-step process guides you through understanding your data, preparing it for analysis, exploring relationships, defining your predictive task, modeling, and evaluating your results. Adhering to the CRISP-DM framework ensures a structured approach to data science projects.
Part 1: Go to Kaggle.com and find an interesting data science project. Look for notebooks that have solved the prediction problem.
Part 2: Create a copy of the notebook and make a meaningful change. Consider a new visualization, summary table, or predictive model. Analyze how your results differ from others.
https://www.kaggle.com/code/apapiu/regularized-linear-models