@devender-yadav
devender-yadav / test.py
Created January 19, 2020 07:17
Pyspark RDD Checkpointing
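# Assumes an existing SparkSession named `spark` (for example, the pyspark shell).
df = spark.range(1, 7, 2)  # ids 1, 3, 5
df.show()
rdd = df.rdd
rdd = rdd.cache()  # marks the RDD for caching; this is not a checkpoint
print("Storage Level - {}".format(rdd.getStorageLevel()))
print("Is Checkpointed - {}".format(rdd.isCheckpointed()))
print("Checkpoint File - {}".format(rdd.getCheckpointFile()))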
devender-yadav / spark-svd.scala
Created November 14, 2019 15:53 — forked from vrilleup/spark-svd.scala
Spark/mllib SVD example
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}
// To use the latest sparse SVD implementation, please build your spark-assembly after this
// change: https://github.com/apache/spark/pull/1378
// Input tsv with 3 fields: rowIndex(Long), columnIndex(Long), weight(Double), indices start with 0
// Assume the number of rows is larger than the number of columns, and the number of columns is
// smaller than Int.MaxValue
devender-yadav / coverage.md
Last active August 8, 2019 10:18
code coverage in python

Project Structure

-myproj
  -proj
  -xx.py
  -dir1
    -xx.py
  -dir2
    -xx.py

-tests

devender-yadav / code_quality.md
Created July 21, 2019 19:35
My python notes

Static code analysis for python application

Pylint

How to install

pip install pylint

How to run

devender-yadav / deb.md
Created July 21, 2019 19:31
My linux notes

Use local debian mirror - https://www.debian.org/mirror/sponsors.html

RUN echo \
   'deb http://mirror.cse.iitk.ac.in/debian/ stretch main\n \
    deb http://security.debian.org/debian-security stretch/updates main\n \
    deb http://mirror.cse.iitk.ac.in/debian/ stretch-updates main\n' \
    > /etc/apt/sources.list
devender-yadav / cx_Oracle_install.md
Last active November 15, 2019 10:52
Install cx_Oracle 5.3 python package on Mac
devender-yadav / athena_sagamaker.py
Created May 31, 2019 09:58
Running AWS Athena SQL queries
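import sys
# Notebook-only shell escape (Jupyter/SageMaker): installs PyAthena into the running kernel's environment.
!{sys.executable} -m pip install PyAthena
from pyathena import connect
import pandas as pd
# Athena writes query results to the S3 staging directory given here.
conn = connect(s3_staging_dir='s3://dev-sample-data123/temp',
               region_name='ap-south-1')
df = pd.read_sql("SELECT * FROM rank_table limit 10;", conn)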
devender-yadav / rss_feeds_scraper.py
Created May 23, 2019 15:57
Fetch RSS feeds from TOI
import feedparser
NewsFeed = feedparser.parse("https://timesofindia.indiatimes.com/rssfeedstopstories.cms")
print('Number of RSS posts :', len(NewsFeed.entries))
# entries is a list; index 1 selects the second post in the feed
entry = NewsFeed.entries[1]
print('Post Title :', entry.title)
devender-yadav / mongo.md
Created May 22, 2019 17:28
Merge mongo data distributed on 2 locations

Let us assume mongo initially used /data/db1 as its dbpath and later switched to /data/db2

Start mongo with /data/db1

sudo mongod --dbpath=/data/db1

Export data from mongo /data/db1

mongoexport --db test_db --collection collection1 --out tb_collection1_db1.json
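
Once each dbpath has been exported this way, the files can be combined before importing them back. The following is only a minimal sketch of one way to do that, not necessarily the approach these notes go on to use. It assumes both files are in mongoexport's default format (one JSON document per line) and that documents sharing an `_id` are duplicates, in which case the first one seen is kept. The function name `merge_exports` and the file names in the usage note are hypothetical.

```python
import json


def merge_exports(paths, out_path):
    """Merge line-delimited JSON exports, keeping the first document seen per _id.

    Hypothetical helper: returns the number of documents written to out_path.
    """
    seen = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    doc = json.loads(line)
                    if "_id" in doc:
                        # _id may be a dict such as {"$oid": "..."};
                        # serialise it to obtain a hashable key.
                        key = json.dumps(doc["_id"], sort_keys=True)
                        if key in seen:
                            continue  # assumed duplicate, keep the earlier copy
                        seen.add(key)
                    out.write(json.dumps(doc) + "\n")
                    kept += 1
    return kept
```

For example, `merge_exports(["tb_collection1_db1.json", "tb_collection1_db2.json"], "merged.json")` would write the combined documents to `merged.json`, which could then be loaded with `mongoimport --db test_db --collection collection1 --file merged.json`. If the two locations can hold different versions of the same document, keeping the first copy may not be the right rule and the deduplication step would need to change.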