@dixitm20
dixitm20 / readme.md
Last active December 11, 2023 10:13
Script for easy identification, restoration & validation of deleted objects in Amazon S3 buckets.

restore-s3-deletes

Overview

A bash script to list or delete delete markers in a versioning-enabled Amazon S3 bucket. Deleting a delete marker restores the deleted object.

This script is useful for identifying, restoring & validating accidental deletes on Amazon S3.
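For illustration, here is a minimal boto3 sketch of the same idea; it is the underlying API pattern rather than the gist's bash script itself, and the bucket name is a placeholder:

import boto3

s3 = boto3.client('s3')
bucket = 'my-versioned-bucket'  # placeholder bucket name

# A delete marker that is the latest version hides the object;
# deleting the marker itself makes the object visible again.
paginator = s3.get_paginator('list_object_versions')
for page in paginator.paginate(Bucket=bucket):
    for marker in page.get('DeleteMarkers', []):
        if marker['IsLatest']:
            s3.delete_object(Bucket=bucket,
                             Key=marker['Key'],
                             VersionId=marker['VersionId'])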

Requirements

@dixitm20
dixitm20 / ZeppelinSetup.md
Created September 21, 2022 13:02
Zeppelin Setup For Use With Local Spark MetaStore

Zeppelin Setup For Use With Local Spark MetaStore

Download and Set Up Zeppelin

Step 1) Define the environment variables and aliases below in ~/.bash_profile or ~/.bashrc:

export SPARK_CONF_DIR='/Users/dixitm/Workspace/conf/spark-conf-dir'

# DATA_PLATFORM_ROOT: Local root dir where spark catalog & metastore is setup
export DATA_PLATFORM_ROOT="/Users/dixitm/Workspace/data/local-data-platform"
@dixitm20
dixitm20 / MultiplePythonVersionsWithPyenv.md
Created June 19, 2022 19:10
Managing Multiple Python Versions With Pyenv
@dixitm20
dixitm20 / Spark-Data-Engineer-Assignment.md
Last active July 18, 2024 10:30
Assignment For Data Engineer

Spark: Scala / PySpark Exercise

Create a Spark application written in Scala or PySpark that reads in the provided signals dataset, processes the data, and stores the entire output as specified below.

For each entity_id in the signals dataset, find the item_id with the oldest month_id and the item_id with the newest month_id. In some cases these may be the same item. If two different items share the same month_id, take the item with the lower item_id. Finally, sum the count of signals for each entity and output it as total_signals. The correct output should contain one row per unique entity_id. (An illustrative PySpark sketch follows the requirements list below.)

Requirements:

  1. Create a Scala SBT project or a PySpark project (if you know Scala, please use it, as we give that higher preference).
  2. Use the Spark Scala/PySpark API and DataFrames/Datasets
  • Please do not use Spark SQL with a SQL string!
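For orientation, a minimal PySpark sketch of the logic above; the column name signal_count and the input/output paths are assumptions, and a real submission would follow the project requirements:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('signals-exercise').getOrCreate()

# Assumed columns: entity_id, item_id, month_id, signal_count
signals = spark.read.parquet('signals/')  # placeholder input path

# Order by month_id, breaking ties with the lower item_id
oldest = Window.partitionBy('entity_id').orderBy(F.col('month_id').asc(), F.col('item_id').asc())
newest = Window.partitionBy('entity_id').orderBy(F.col('month_id').desc(), F.col('item_id').asc())

result = (signals
          .withColumn('oldest_item_id', F.first('item_id').over(oldest))
          .withColumn('newest_item_id', F.first('item_id').over(newest))
          .groupBy('entity_id', 'oldest_item_id', 'newest_item_id')
          .agg(F.sum('signal_count').alias('total_signals')))

result.write.mode('overwrite').parquet('output/')  # placeholder output path
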
@dixitm20
dixitm20 / lambda_function.py
Last active August 17, 2021 15:43
Truncate Dynamodb Tables
import json
import os

import boto3

dynamodb = boto3.resource('dynamodb')

def truncate_table(table_name):
    table = dynamodb.Table(table_name)
    # Scan only the key attributes, which is all delete_item needs
    key_names = [key['AttributeName'] for key in table.key_schema]
    scan_kwargs = {'ProjectionExpression': ', '.join(key_names)}
    # The gist preview is truncated here; the loop below is a conventional
    # completion (page through the scan, batch-delete each key), not
    # necessarily the author's original code.
    with table.batch_writer() as batch:
        while True:
            page = table.scan(**scan_kwargs)
            for item in page['Items']:
                batch.delete_item(Key=item)
            if 'LastEvaluatedKey' not in page:
                break
            scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']
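
A minimal sketch of how truncate_table might be wired into the Lambda entry point; the TABLE_NAMES environment variable and the response shape are assumptions, not taken from the gist:

def lambda_handler(event, context):
    # TABLE_NAMES is an assumed comma-separated env var, e.g. 'users,orders'
    for name in os.environ['TABLE_NAMES'].split(','):
        truncate_table(name.strip())
    return {'statusCode': 200, 'body': json.dumps('tables truncated')}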