Anjaiah Methuku (anjijava16)

GoogleCloudSQL_Imp.txt
https://app.pluralsight.com/library/courses/preparing-google-cloud-professional-data-engineer-exam-1/recommended-courses ---> ML
https://app.pluralsight.com/profile/author/vitthal-srinivasan
https://app.pluralsight.com/profile/author/james-wilson
https://app.pluralsight.com/profile/author/janani-ravi
hive_serde_isse.sql
Table:
====================
CREATE EXTERNAL TABLE tweets (
  createddate string,
  geolocation string,
  tweetmessage string,
  user_name struct<geoenabled:boolean, id:int, name:string, screenname:string, userlocation:string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 'gs://iwinner-data/json_data';

Query:
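A hypothetical illustration (not the gist's original query) of how the user_name struct column can be addressed with dot notation, e.g. from Spark SQL in Scala:

// Hypothetical example query against the tweets table (field names taken from the DDL above).
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("tweets_struct_query")
  .enableHiveSupport()
  .getOrCreate()

spark.sql(
  """SELECT user_name.name, user_name.screenname, tweetmessage
    |FROM tweets
    |WHERE user_name.geoenabled = true""".stripMargin
).show()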
DWH_ML.sh
Database pioneer and Turing Award winner Jim Gray gave a famous adage: "When you have lots of data, bring [machine learning] computations to the data, rather than data to the computations."
According to him, nothing is closer to the data than the database itself, so the computations should be done inside the database.
Today, all major cloud and database vendors are:
🔸 offering SQL data pipelines in the data warehouse
🔸 expanding their in-database ML offerings
Running ML and analytics inside the data warehouse is cheaper and more efficient than shipping the data out to a separate system.
gcloud_config
C:\Users\anjai>gcloud config get-value project
iwinner-data
Updates are available for some Cloud SDK components. To install them,
please run:
$ gcloud components update
Spark_Strcutured_Streaming_Write_MySQL.scala
package com.mts.matrix.spark.stream
import com.mts.matrix.spark.utils.SparkUtils
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col,lit, from_json}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
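The gist preview stops at the imports. A minimal sketch of the usual pattern behind them, writing a stream to MySQL via foreachBatch(); the Kafka source, schema, JDBC URL and table name are all assumptions, not the original job:

object StreamWriteMySql {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("stream_write_mysql")
      .master("local[*]")
      .getOrCreate()

    // Assumed schema for the incoming JSON payload.
    val schema = new StructType()
      .add("id", IntegerType)
      .add("name", StringType)

    val stream: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "events")                       // assumed topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("data"))
      .select("data.*")

    // foreachBatch exposes each micro-batch as a plain DataFrame,
    // so the regular JDBC batch writer can be reused for MySQL.
    val query: StreamingQuery = stream.writeStream
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .mode(SaveMode.Append)
          .format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/retaildb") // assumed URL
          .option("driver", "com.mysql.cj.jdbc.Driver")
          .option("dbtable", "events_tbl")                       // assumed table
          .option("user", "root")
          .option("password", "root")
          .save()
      }
      .start()

    query.awaitTermination()
  }
}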
read_jdbc_parquet_write_mongo.scala
import org.apache.spark.sql.SparkSession

// Builds a local SparkSession preconfigured for the MongoDB Spark connector.
def getSparkSessionMongoDbConfig(parms: Map[String, String]): SparkSession = {
  val spark = SparkSession
    .builder
    .appName(parms("JOB_NAME"))
    .master("local[*]")
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/retaildb.orders?authSource=admin")
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/retaildb.orders?authSource=admin")
    .getOrCreate()
  // Flag for the optional S3 read/write steps used later in the job.
  val isS3Enable = parms("S3_OPERATION_ENABLE").toBoolean
  spark
}
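The rest of the file is truncated; a sketch of the read-JDBC/Parquet, write-MongoDB flow the filename suggests (the URL, path and table names are assumptions):

// Hypothetical continuation: read from JDBC and Parquet, write to MongoDB.
val spark = getSparkSessionMongoDbConfig(Map(
  "JOB_NAME" -> "read_jdbc_parquet_write_mongo",
  "S3_OPERATION_ENABLE" -> "false"))

val ordersJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/retaildb") // assumed source DB
  .option("dbtable", "orders")
  .option("user", "root")
  .option("password", "root")
  .load()

val ordersParquet = spark.read.parquet("/tmp/orders_parquet") // assumed path

// "mongo" is the short name for the MongoDB Spark connector 3.x; it picks up
// spark.mongodb.output.uri from the session config. union assumes both
// sources share the same schema.
ordersJdbc.union(ordersParquet)
  .write
  .format("mongo")
  .mode("append")
  .save()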
udf_udaf_udtf.sql
####################################################################################
UDF VS UDAF VS UDTF
1. UDF: A UDF works on a single row in a table and produces a single row as output; a one-to-one relationship between the input and output of the function, e.g. Hive's built-in TRIM() function.
Extends UDF.
We have to write (and may overload) a method called evaluate() inside our class.
2. UDAF: User defined aggregate functions work on more than one row and give a single row as output, e.g. Hive's built-in MAX() or COUNT() functions.
Extends UDAF.
We need to override five methods: init(), iterate(), terminatePartial(), merge() and terminate().
3. UDTF: User defined table-generating functions work on a single row and produce multiple rows (a table) as output, e.g. Hive's built-in EXPLODE() function.
Extends GenericUDTF.
We need to override initialize(), process() and close().
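A minimal sketch of a custom one-to-one UDF in Scala (class name and logic are illustrative, not from the gist):

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Illustrative UDF: one input value in, one value out (like TRIM()).
class UpperUdf extends UDF {
  // Hive finds evaluate() by reflection; overload it for other argument types.
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.toUpperCase)
  }
}

It would then be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before being called in a query.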
mongo_db_windows_setup.txt
MongoDB:
Host: localhost
Port: 27017
Username: admin
Password: admin
Database name: meetup
Collection name (table name): meetup_rsvp_tbl
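Taken together, these settings correspond to a connection URI of roughly this shape (authSource=admin is an assumption, matching the URIs used in the Spark config above):
mongodb://admin:admin@localhost:27017/meetup?authSource=admin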
Spark_write_Nosql.py
Write to Cassandra using foreachBatch() in Scala
import org.apache.spark.sql._
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf
import com.datastax.spark.connector.rdd.ReadConf
import com.datastax.spark.connector._
val host = "<ip address>"
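The snippet ends at the host; a sketch of how the pattern typically continues (an existing SparkSession spark, a streaming DataFrame streamingDF, and the keyspace and table names are all assumptions):

// Hypothetical continuation of the snippet above.
spark.conf.set("spark.cassandra.connection.host", host)
spark.conf.set("spark.cassandra.connection.port", "9042") // default Cassandra port

// cassandraFormat comes from the org.apache.spark.sql.cassandra._ implicits imported above.
val query = streamingDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .cassandraFormat("events_tbl", "retail_ks") // assumed table and keyspace
      .mode("append")
      .save()
  }
  .start()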