@gingerwizard
Created April 10, 2024 13:10

Download data

wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/nyc-taxi-vectors.csv.gz
gzip -d nyc-taxi-vectors.csv.gz
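
Before loading the whole file into pandas, it can help to confirm the column layout. The sketch below is one way to look at the header and the first few rows. It assumes the file is a plain CSV with a header row; `peek_csv` is a hypothetical helper that is not part of the original gist, and it reads either the `.gz` download or the decompressed file.

```python
import csv
import gzip
from itertools import islice


def peek_csv(path, n=3):
    """Return the header and the first n data rows of a CSV file.

    Hypothetical helper: paths ending in .gz are read through gzip,
    anything else is opened as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])       # empty list if the file has no rows
        rows = list(islice(reader, n))  # read at most n data rows
    return header, rows


# Example call once the download has finished:
#   header, rows = peek_csv("nyc-taxi-vectors.csv")
#   print(header)
```

The script further down expects a column named `vector`, so that name should appear in the returned header.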

Install dependencies

pip install scikit-learn
pip install pandas

Run the following

import pandas as pd
from sklearn.cluster import KMeans
from ast import literal_eval
import time

start_time = time.time()
# Load the CSV file into a DataFrame
df = pd.read_csv('nyc-taxi-vectors.csv')

# Convert the string representation of vectors to actual lists
df['vector'] = df['vector'].apply(literal_eval)

# Convert lists to a list of lists for fitting the model
vectors = list(df['vector'])

# Perform KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42, n_init=1)
df['cluster'] = kmeans.fit_predict(vectors)

# Elapsed wall-clock time covers loading, parsing, and clustering
execution_time = time.time() - start_time
print(f'Execution time in seconds: {execution_time}')
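
The script above only reports the elapsed time. To see what the clustering produced, the fitted model and the new `cluster` column can be inspected. The following is a minimal sketch on synthetic stand-in data (two small 2-D blobs, an assumption chosen so the example runs without the download); the names `demo` and `cluster_sizes` are illustrative and do not come from the gist.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption, not the NYC taxi vectors):
# two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])
demo = pd.DataFrame({"vector": points.tolist()})

# Same KMeans calls as in the script above, with 2 clusters instead of 5.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=1)
demo["cluster"] = kmeans.fit_predict(np.array(demo["vector"].tolist()))

# Number of rows assigned to each cluster label.
cluster_sizes = demo["cluster"].value_counts().sort_index()
print(cluster_sizes)

# One centroid per cluster, with the same dimensionality as the input vectors.
print(kmeans.cluster_centers_)

# Sum of squared distances from each point to its nearest centroid.
print(kmeans.inertia_)
```

On the real dataset the same three inspections apply to `df` and `kmeans` from the script above, after the fit has completed.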