kashaziz/logistic_regression_customer_prediction.py

## logistic_regression_customer_prediction.py
"""
This Python script demonstrates the usage of logistic regression to predict whether customers will make the next purchase on an e-commerce site.
The code performs the following steps:

1. Load and Split Data:
    - Loads an e-commerce dataset with the features 'customer_id', 'time_on_site', 'total_spent', and 'is_returning_customer', plus the target 'will_make_next_purchase'.
    - Splits the data into training (80%) and testing (20%) sets.

2. Model Training:
    - Creates a logistic regression model using scikit-learn.
    - Trains the model on the training set, where 'will_make_next_purchase' is the target variable.

3. Model Evaluation:
    - Predicts the target variable on the testing set and calculates accuracy.
    - Displays the confusion matrix to provide a detailed view of model performance, including true positives, true negatives, false positives, and false negatives.

4. Making Predictions on New Data:
    - Demonstrates how to use the trained model to make predictions on new data.
    - Creates a new DataFrame ('new_data') with hypothetical customer features: 'customer_id', 'time_on_site', 'total_spent', and 'is_returning_customer'.
    - Outputs predictions for whether these new customers will make the next purchase.

Note: The script assumes that 'will_make_next_purchase' is a binary target variable (0 or 1) indicating whether a customer makes the next purchase.
Additionally, 'customer_id' is included as a feature. Because it is an arbitrary identifier rather than a behavioral measurement, any weight the model assigns to it is unlikely to generalize to new customers.
"""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
ecommerce_data = pd.read_csv('data/shopping_data.csv')

# 'will_make_next_purchase' is the target variable; the remaining columns are the features
X = ecommerce_data[['customer_id', 'time_on_site', 'total_spent', 'is_returning_customer']]
y = ecommerce_data['will_make_next_purchase']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Compute the confusion matrix (labels are fixed so it stays 2x2 even if the test set contains only one class)
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Display results
print(f'Training Accuracy: {model.score(X_train, y_train):.2f}')
print(f'Test Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)

# Make predictions on new data
# 'new_data' must have the same columns, in the same order, as the training features:
# 'customer_id', 'time_on_site', 'total_spent', 'is_returning_customer'. Replace it with your actual new data.
new_data = pd.DataFrame({
    'customer_id': [9, 10, 11],
    'time_on_site': [12, 8, 15],
    'total_spent': [60, 30, 75],
    'is_returning_customer': [1, 0, 1]
})

# Make predictions on the new data
new_data_predictions = model.predict(new_data)

# Display predictions for the new data
new_data_with_predictions = new_data.copy()
new_data_with_predictions['will_make_next_purchase'] = new_data_predictions
print('Predictions for New Data:')
print(new_data_with_predictions[['customer_id', 'will_make_next_purchase']])
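
The confusion matrix printed by the script packs four counts into a 2x2 grid. As a minimal sketch of how to read it, the block below counts true negatives, false positives, false negatives, and true positives by hand and arranges them in the layout scikit-learn uses for labels [0, 1] (rows are actual classes, columns are predicted classes). The `y_hat` values and the `confusion_counts` helper are hypothetical, made up for illustration; they are not output from the model above.

```python
# Actual labels: the 'will_make_next_purchase' column of shopping_data.csv.
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
# Hypothetical predictions, invented for illustration (not real model output).
y_hat = [1, 0, 0, 1, 0, 1, 1, 1]


def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels, with 1 as the positive class."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1      # predicted a purchase, and it happened
        elif a == 0 and p == 0:
            tn += 1      # predicted no purchase, and none happened
        elif a == 0 and p == 1:
            fp += 1      # predicted a purchase that did not happen
        else:
            fn += 1      # missed a purchase that did happen
    return tn, fp, fn, tp


tn, fp, fn, tp = confusion_counts(y_true, y_hat)
accuracy = (tp + tn) / len(y_true)

# Rows = actual class (0, 1); columns = predicted class (0, 1).
matrix = [[tn, fp], [fn, tp]]
print(matrix)
print(f'Accuracy: {accuracy:.2f}')
```

Accuracy uses only the diagonal of the matrix; the off-diagonal cells show which kind of mistake the model makes.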

## shopping_data.csv

          
customer_id,time_on_site,total_spent,is_returning_customer,will_make_next_purchase
1,10,50,1,1
2,15,75,0,1
3,8,30,1,0
4,20,100,1,1
5,5,20,0,0
6,12,60,1,1
7,18,90,1,1
8,7,35,0,0
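
With only eight rows, it helps to check how much data each step of the script actually sees. The sketch below parses the sample rows with the standard library and reports the class balance and the approximate test-set size for `test_size=0.2`. It assumes the file is comma-separated (the default expected by `pd.read_csv`) and that the split rounds the test share up; `CSV_TEXT` is a hypothetical inline copy of the data, used so the example runs without the file.

```python
import csv
import io
import math

# Inline copy of the sample rows; comma separation is an assumption about the file layout.
CSV_TEXT = """customer_id,time_on_site,total_spent,is_returning_customer,will_make_next_purchase
1,10,50,1,1
2,15,75,0,1
3,8,30,1,0
4,20,100,1,1
5,5,20,0,0
6,12,60,1,1
7,18,90,1,1
8,7,35,0,0
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
n_rows = len(rows)

# Count customers labeled 1 (made the next purchase).
n_positive = sum(int(r['will_make_next_purchase']) for r in rows)

# Approximate test-set size for test_size=0.2, assuming the share is rounded up.
n_test = math.ceil(0.2 * n_rows)

print(f'Rows: {n_rows}, positives: {n_positive}, negatives: {n_rows - n_positive}')
print(f'Approximate test rows: {n_test}')
```

A test set this small can only yield a handful of distinct accuracy values, so the printed test accuracy is best read as a smoke test rather than a performance estimate.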