# Collect the names of all columns in sd (an existing DataFrame) that start with 'subjective'
subjective_cols = [col for col in sd.columns if col.startswith('subjective')]
print(subjective_cols)
# Boolean mask with the same shape as labels, initialised to all False
core_samples = np.zeros_like(labels, dtype=bool)
# Flag the points that DBSCAN identified as core samples
core_samples[dbscn.core_sample_indices_] = True
print(core_samples)
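The core-sample mask can be checked on a toy dataset. This is only a sketch, not the original data: the seven points, `eps=0.5`, and `min_samples=3` are assumptions chosen so that two tight groups become core samples and one distant point is labelled noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (assumed for illustration): two tight groups and one far-away point
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
              [20.0, 20.0]])

dbscn = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = dbscn.labels_

# Start with an all-False boolean array, then flag the core samples
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbscn.core_sample_indices_] = True

# Noise points get the label -1 and are never core samples
noise = labels == -1
print(core_samples)
print(noise)
```

The mask is handy for plotting core points differently from border and noise points, for example `X[core_samples]` versus `X[~core_samples]`.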
# Helper that fits MiniBatchKMeans and reports the results for a given number of clusters k
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def cluster_batch(k, data=x_df):  # x_df must already exist when this function is defined
    k_mean = MiniBatchKMeans(n_clusters=k)
    fitted = k_mean.fit(data)
    labels = fitted.labels_
    print("Labels: " + str(labels))
    print("Centroids: " + str(fitted.cluster_centers_))
    # Silhouette score estimated on random 10% and 20% subsamples of the rows
    print("Silhouette Score (10% sample): " + str(silhouette_score(data, labels, sample_size=int(data.shape[0] * .1))))
    print("Silhouette Score (20% sample): " + str(silhouette_score(data, labels, sample_size=int(data.shape[0] * .2))))
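To compare several values of k, it can help to return the scores instead of printing them. The sketch below assumes synthetic data from `make_blobs`. The helper name `silhouette_by_k`, the fixed seed, and `n_init=3` are illustrative choices, not part of the original notebook.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumed): three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

def silhouette_by_k(data, ks, seed=0):
    """Fit MiniBatchKMeans for each k and return {k: silhouette score}."""
    scores = {}
    for k in ks:
        model = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=seed).fit(data)
        # Scored on the full data here; pass sample_size=... to subsample large sets
        scores[k] = silhouette_score(data, model.labels_)
    return scores

scores = silhouette_by_k(X, ks=[2, 3, 4])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Silhouette scores range from -1 to 1, and higher values indicate tighter, better-separated clusters.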
# Binary indicator: 1 if the person's native country is the United States, else 0
adults_new['native_born'] = [1 if i == 'United-States' else 0 for i in adults['native-country']]
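The same indicator can be built without a Python-level loop by comparing the whole column at once. In this sketch the three-row `adults` frame is a hypothetical stand-in for the real data. Unlike the list version, Series assignment aligns on the index, so it assumes `adults_new` and `adults` share the same index.

```python
import pandas as pd

# Hypothetical stand-in for the adults DataFrame
adults = pd.DataFrame({'native-country': ['United-States', 'Mexico', 'United-States']})
adults_new = pd.DataFrame(index=adults.index)

# Compare the whole column at once, then cast True/False to 1/0
adults_new['native_born'] = (adults['native-country'] == 'United-States').astype(int)
print(adults_new['native_born'].tolist())
```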
# Rank features by the fitted model's importance scores, highest first
feature_importances = pd.DataFrame(model.feature_importances_,
                                   index=X_train_simpler.columns,
                                   columns=['importance']).sort_values('importance',
                                                                       ascending=False)
feature_importances
# Express the importances as percentages
feature_importances['importance'] = feature_importances['importance'] * 100
import seaborn as sns
plt.figure(figsize=(30,15))  # this creates a figure 30 inches wide, 15 inches high
from sklearn.feature_extraction.text import HashingVectorizer
hvec = HashingVectorizer(stop_words='english')
# HashingVectorizer is stateless, so fit() does not learn a vocabulary
hvec.fit(data_train['data'])
hvecdata = hvec.transform(data_train['data'])
# Note: with the default n_features (2**20), todense() builds a very wide dense matrix
X_train = pd.DataFrame(hvecdata.todense())
print(X_train.shape)
X_test = pd.DataFrame(hvec.transform(data_test['data']).todense())
print(X_test.shape)
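If memory becomes a concern, one option is to keep the hashed features sparse and cap the number of columns. In this sketch the three-document corpus is a hypothetical stand-in for `data_train['data']`, and `n_features=2**10` is an assumed value rather than a recommendation.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical mini-corpus standing in for data_train['data']
docs = ["the cat sat on the mat", "dogs chase cats", "stock prices fell sharply"]

# n_features=2**10 is an assumed value; the default is 2**20 columns
hvec = HashingVectorizer(stop_words='english', n_features=2**10)
# fit learns nothing here because hashing is stateless; the output stays sparse
X_sparse = hvec.fit_transform(docs)

print(X_sparse.shape)
print(sparse.issparse(X_sparse))
```

Most scikit-learn estimators, including `LogisticRegression`, accept sparse input directly, so the dense DataFrame is often unnecessary.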
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english')
# Learn the vocabulary and IDF weights from the training text only
tvec.fit(data_train['data'])
tvecdata = tvec.transform(data_train['data'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_train = pd.DataFrame(tvecdata.todense(), columns=tvec.get_feature_names_out())
print(X_train.shape)
# Reuse the fitted vectorizer so the test columns match the training columns
X_test = pd.DataFrame(tvec.transform(data_test['data']).todense(), columns=tvec.get_feature_names_out())
from sklearn import cluster

# Fit k-means with k clusters on the scaled features
k = 3
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(X_scaled)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Sum of squared distances from each point to its nearest centroid
inertia = kmeans.inertia_
print('Centroids:', centroids)
print('')
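Inertia is most useful when compared across several values of k. The sketch below assumes synthetic blobs standardised with `StandardScaler` to play the role of `X_scaled`. The seed and `n_init=10` are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed), standardised to play the role of X_scaled
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

inertias = {}
for k in [1, 2, 3]:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    # inertia_ is the sum of squared distances from each point to its nearest centroid
    inertias[k] = kmeans.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting these values against k gives the usual elbow curve: the point where the drop in inertia flattens out is a common heuristic for choosing k.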
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()
model = logit.fit(X_train, y_train)
#predictions = model.predict(X_test)
# score() returns the mean accuracy on the given test data and labels
print("Score:", model.score(X_test, y_test))
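For a classifier, `score()` is simply the accuracy of `predict()`. The sketch below shows this on synthetic data from `make_classification`. The dataset, split size, and seeds are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumed for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# score() is the mean accuracy, i.e. the share of test rows predicted correctly
score = model.score(X_test, y_test)
manual_accuracy = (predictions == y_test).mean()
print("Score:", score)
print("Manual accuracy:", manual_accuracy)
```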
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
plt.style.use('fivethirtyeight')
# plt.style.use('ggplot')
# The two lines below are IPython magics and only work inside a Jupyter notebook
%matplotlib inline
%config InlineBackend.figure_format = 'retina'