Sharath G sharathgrao

sharathgrao / rdd_to_dataframe_spark.py
Created March 12, 2024 17:36
RDD to a DataFrame in Python using Spark
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("RDD to DataFrame").getOrCreate()
# Create an example RDD
data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]
rdd = spark.sparkContext.parallelize(data)
# Define column names
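The preview stops at the column-name step. As a minimal sketch of how the conversion could be finished, assuming the remaining step is a call to `RDD.toDF` with a list of column names: the helper name `rdd_to_dataframe` and the column names `"name"` and `"age"` are hypothetical and do not come from the gist.

```python
def rdd_to_dataframe(spark, rows, columns):
    """Parallelize `rows` into an RDD and convert it to a DataFrame.

    `spark` is an existing SparkSession, `rows` is a list of tuples,
    and `columns` is a list of column names (hypothetical here, since
    the gist preview does not show the names it uses).
    """
    # Distribute the local Python list as an RDD.
    rdd = spark.sparkContext.parallelize(rows)
    # toDF needs an active SparkSession; column types are inferred from the data.
    return rdd.toDF(columns)
```

With the `spark` session and `data` list defined above, `rdd_to_dataframe(spark, data, ["name", "age"]).show()` would display the three rows. `spark.createDataFrame(rdd, columns)` is an equivalent route.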
sharathgrao / numpy_seed.py
Created February 22, 2024 05:53
numpy seed
import numpy as np
# Set the seed to 50
np.random.seed(50)
# Generate two arrays of random numbers
array1 = np.random.rand(10)
array2 = np.random.rand(10)
print("Array 1:", array1)
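The two arrays differ from each other because the generator state advances between calls, but resetting the seed replays the same sequence. A small sketch of that behaviour follows; the variable names are illustrative and not taken from the gist.

```python
import numpy as np

# Seed the global generator, then draw twice.
np.random.seed(50)
first = np.random.rand(10)
second = np.random.rand(10)

# Re-seeding with the same value restarts the sequence.
np.random.seed(50)
first_again = np.random.rand(10)

print(np.array_equal(first, first_again))  # True: same seed, same draw
print(np.array_equal(first, second))       # False: state advanced between draws
```

For new code, NumPy's documentation recommends a local generator such as `np.random.default_rng(50)` over the global seed, since it avoids shared state. Note that it produces different numbers from `np.random.seed(50)`.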
sharathgrao / pandas-profiling-intro.py
Last active November 19, 2020 03:17
pandas-profiling-gist
import pandas as pd
import pandas_profiling as pp  # later releases of this package are published as ydata-profiling
## read the csv data into pandas dataframe
data = pd.read_csv("query-hive-10382804.csv")
## run pandas profiling on data
profile = pp.ProfileReport(data)
## output html file with profiling report of the data
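The preview ends before the report is written. A minimal sketch of that last step, assuming it uses the report object's `to_file` method: the helper name `save_profile_report` and the default file name are hypothetical.

```python
def save_profile_report(profile, output_path="profiling_report.html"):
    """Write a profiling report to disk and return the path used.

    `profile` is expected to be a ProfileReport instance; the default
    file name is an assumption, not something the gist specifies.
    """
    # to_file renders the report and saves it at the given location.
    profile.to_file(output_path)
    return output_path
```

With the `profile` object created above, `save_profile_report(profile)` would produce an HTML file that can be opened in a browser.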
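# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet needs an older release to run as written.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
# Load the data and hold out a test split (25% by default)
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=1)
# Fit an ordinary least-squares model on the training split
regr = LinearRegression()
regr.fit(X_train, y_train)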
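The snippet imports `median_absolute_error` and `r2_score` but the preview ends before they are used. The sketch below shows one way the fitted model might be scored on the held-out split. Because `load_boston` is no longer available in current scikit-learn, it swaps in synthetic data from `make_regression`; that substitution and its parameters are assumptions made only so the example runs.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data (an assumption, not the gist's dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit on the training split, as in the original snippet.
regr = LinearRegression()
regr.fit(X_train, y_train)

# Score predictions on the held-out split with the two imported metrics.
y_pred = regr.predict(X_test)
mae = median_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Median absolute error:", mae)
print("R^2:", r2)
```

Median absolute error is less sensitive to a few large residuals than the mean, while R^2 reports the share of variance the model explains, so the two give complementary views of fit quality.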