Arunkumar Mathiyazhagan auhuman

@auhuman
auhuman / snippet.sql
Created April 30, 2026 09:57
Spark Connect vs RDD: Understanding Modern Spark Architecture — snippet 4
SELECT url, COUNT(*) AS page_count
FROM logs
GROUP BY url
ORDER BY page_count DESC
LIMIT 10
@auhuman
auhuman / snippet.py
Created April 30, 2026 09:57
Spark Connect vs RDD: Understanding Modern Spark Architecture — snippet 3
# Spark Connect — thin client connecting to a remote cluster
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, regexp_extract
spark = SparkSession.builder.remote("sc://cluster:15002").getOrCreate()
logs_df = spark.read.text("s3://bucket/logs/*.log")
# Parse log fields from raw text (the lines inside select() are an assumed
# completion of the truncated preview; the regexes are illustrative only)
parsed = logs_df.select(
    regexp_extract(col("value"), r'"[A-Z]+ (\S+)', 1).alias("url"),
    regexp_extract(col("value"), r'" (\d{3})', 1).alias("status"),
)
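A minimal follow-on sketch (not part of the gist): assuming the parsed DataFrame above is registered as a temp view named logs, the SQL snippet in this listing can run over the same Spark Connect session.
parsed.createOrReplaceTempView("logs")  # expose the parsed DataFrame to SQL
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS page_count
    FROM logs
    GROUP BY url
    ORDER BY page_count DESC
    LIMIT 10
""")
top_pages.show()  # executes on the remote cluster; only results return to the thin client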
@auhuman
auhuman / snippet.py
Created April 30, 2026 09:57
Spark Connect vs RDD: Understanding Modern Spark Architecture — snippet 2
# Traditional Spark — requires full Spark runtime locally
from pyspark import SparkContext
sc = SparkContext()
logs_rdd = sc.textFile("s3://bucket/logs/*.log")
parsed = logs_rdd.map(parse_log)  # parse_log returns a dict
errors = parsed.filter(lambda x: x['status'] == '404').count()
top_pages = (  # assumed completion of the truncated preview: top 10 URLs by hit count
    parsed.map(lambda x: (x['url'], 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda kv: kv[1], ascending=False)
    .take(10)
)
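For comparison, a hedged sketch (not from the gist) of the same job in the DataFrame API, which, unlike the RDD code above, also runs unchanged over a Spark Connect session; it assumes the parsed DataFrame with url and status columns from the Spark Connect snippet.
from pyspark.sql import functions as F
errors = parsed.filter(F.col("status") == "404").count()  # DataFrame analogue of the RDD 404 count
top_pages = (
    parsed.groupBy("url")
    .count()
    .withColumnRenamed("count", "page_count")
    .orderBy(F.col("page_count").desc())
    .limit(10)
)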
@auhuman
auhuman / snippet.txt
Created April 30, 2026 09:57
Spark Connect vs RDD: Understanding Modern Spark Architecture — snippet 1
TRADITIONAL SPARK SPARK CONNECT
================= =============
┌──────────────────┐ ┌──────────────────┐
│ Your Laptop │ │ Your Laptop │
│ (Full Runtime) │ │ (Thin Client) │
│ │ │ │
│ ┌────────────┐ │ │ ┌────────────┐ │
│ │ Full Spark │ │ │ │ Client │ │
│ │ Runtime │ │ vs. │ │ Library │ │