Anjaiah Methuku (anjijava16)
View parquet_compression.md

## Parquet compression options

Parquet is designed for large-scale data and supports several compression codecs. Depending on your data and workload, a different codec may be the better fit.

- **Snappy**: the default compression for Parquet files.
- **GZIP**: codec based on the GZIP format (not the closely related "zlib" or "deflate" formats) defined by RFC 1952.
- **ZSTD**: codec with the highest compression ratio of the supported options, based on the Zstandard format defined by RFC 8478.
- **LZ4**: codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
- **LZO**: codec based on or interoperable with the LZO compression library.
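Why codec choice matters can be felt even without Parquet: columnar data is often highly repetitive, and that is exactly what these codecs exploit. The sketch below uses only stdlib codecs as stand-ins (gzip for GZIP; lzma playing the role of a high-ratio codec like ZSTD — it is *not* Zstandard itself), on made-up repetitive data:

```python
import gzip
import lzma

# Made-up, highly repetitive "columnar" payload, the kind Parquet stores.
raw = b"user_id,event,ts\n" + b"42,click,1700000000\n" * 5000

gz = gzip.compress(raw)   # stdlib stand-in for Parquet's GZIP codec
xz = lzma.compress(raw)   # stand-in for a high-ratio codec like ZSTD

# Both shrink the repetitive payload dramatically; the high-ratio codec
# typically trades extra CPU time for a smaller output.
print(len(raw), len(gz), len(xz))
```

The same trade-off (speed vs ratio) is what drives picking Snappy for hot data and GZIP/ZSTD for cold storage.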

View Top
1. Hive joins
2. SQL functions and window functions (write out one example by hand)
3. Top 3 records, or top-n records
4. Best file format for Hive (answer should be Parquet — and why)
5. Map-side vs reduce-side join
6. Spark connectors, Spark with Hive connectors
7. reduceByKey vs groupByKey (good answer: groupByKey causes more shuffle, reduceByKey less)
8. cache vs persist
9. repartition vs coalesce
10. RDD vs DataFrame
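The reduceByKey-vs-groupByKey shuffle difference mentioned above can be sketched in plain Python (a toy model, not Spark; the partition layout and data are made up). reduceByKey combines values locally within each partition before the shuffle, so fewer records cross the network:

```python
from collections import defaultdict

# Two made-up input partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

# groupByKey-style: every record is shuffled to the reducer as-is.
shuffled_group = [rec for part in partitions for rec in part]

# reduceByKey-style: each partition combines locally first (map-side combine),
# so at most one record per key per partition is shuffled.
shuffled_reduce = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    shuffled_reduce.extend(local.items())

print(len(shuffled_group), len(shuffled_reduce))  # 5 records vs 4
```

Both paths produce the same final sums per key; only the amount of shuffled data differs, which is the point of the interview answer.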
View sql_interv.sql
1. https://www.youtube.com/watch?v=AK7_m-aThfw
2. https://www.youtube.com/watch?v=Oo2FoYgRBvE&list=PLaIYQ9kvDKjroSixUUJJEdlvh-jr8X3ER
3. https://www.youtube.com/watch?v=-WEpWH1NHGU
4. https://www.mygreatlearning.com/blog/sql-interview-questions/
# Hive
1. https://www.youtube.com/watch?v=8tFyr02GYzc&list=PLIeFQEzpZM9gDh5UWCPeI11M-Ykm4scq1
View AWS_ETL.py
Incoming/Sources/Inputs: S3 files (CSV, Parquet) and RDBMS (Oracle, MySQL) data
Languages: Python, shell script
Cluster: AWS EMR (similar to Hortonworks)
Linux: Linux server (an EC2 instance)
Processing:
1. Hive SQL
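The ingest-and-process flow above can be sketched as a toy, stdlib-only transform step (the data, column names, and filtering rule are made up; a real job would pull from S3/RDBMS and run Hive SQL or Spark on EMR):

```python
import csv
import io

# Stand-in for a CSV file landing from S3 (contents are made up).
source = io.StringIO("id,amount\n1,10\n2,-3\n3,25\n")

rows = list(csv.DictReader(source))

# Transform step: cast types and drop invalid (negative-amount) records,
# the kind of cleanup a Hive SQL WHERE clause would do downstream.
clean = [
    {"id": int(r["id"]), "amount": int(r["amount"])}
    for r in rows
    if int(r["amount"]) > 0
]

print(clean)
```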
View .zprofile
# Setting PATH for Python 3.8
# The original version is saved in .zprofile.pysave
PATH="/Library/Frameworks/Python.framework/Versions/3.8/bin:${PATH}"
# Hadoop
export HADOOP_HOME=/Users/welcome/Desktop/hadoop/hadoop-3.2.1/
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
View python_environment_s.sh
# Virtual Environment
There are other great third-party tools for creating virtual environments, such as:
1. virtualenv
2. conda
3. Pipenv
4. Poetry
# Step1
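Before reaching for the third-party tools above, the first step is usually the standard library's `venv` module (a sketch — assuming this is what Step 1 covers; the directory name `.venv` is just a convention):

```shell
# Create an isolated environment in ./.venv using the stdlib venv module,
# then activate it so `python` and `pip` resolve inside it.
python3 -m venv .venv
source .venv/bin/activate
python -m pip --version
```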
View mac_shortcuts.txt
Jump to beginning of a line – Command+Left Arrow
Jump to end of a line – Command+Right Arrow
Jump to beginning of current word – Option+Left Arrow
Jump to end of current word – Option+Right Arrow
To forward-delete on a Mac, press the Fn and Delete keys together
Command+Space ==> Quick search (Spotlight)
Control+Left Arrow ==> switch to the open space on the left
Control+Right Arrow ==> switch to the open space on the right
Command+C ==> Copy
Command+V ==> Paste
View hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0