Example configuration of custom prompts for the SMA Assistant
prompts:
- name: "DataFrameReader.load"
  prompt: |
    Explain this tag:
    ------ TAG
    @@firstline
    and how it affects the following line of code:
    ------
    **CODE**
    @@secondline
    ------
    Consider these notes:
    ------
    **NOTES**
    The DataFrameReader API in Spark follows a pattern of
    df.read.option(...).option(...).format(format_name).load(path, options)
    and the Snowpark API uses a pattern of
    df.read.option(...).option(...).[csv|parquet|json|orc|...](path, options)
    When migrating these statements you will need to adjust the order of the calls
    and replace the load method with the appropriate format method. Also, some
    option names might be different.
    For example:
    * in Spark the option name for skipping a header is "header", false
    * while in Snowpark it is "SKIP_HEADER", 1
    * in Spark you use inferSchema; in Snowpark you use INFER_SCHEMA true and PARSE_HEADER true
    * in Spark you use nullValue; in Snowpark you use NULL_IF
    * in Spark you use sep; in Snowpark you use FIELD_DELIMITER
    * the PARSE_HEADER and SKIP_HEADER options are not used together
    * the default for FIELD_DELIMITER and sep is ','; if it is not specified in the original code you don't need to pass it
    * in Snowpark you can specify a FILE_FORMAT, but when you translate from .load to .csv|.json|.parquet and others you don't need to specify it
    ------
    And based on that information, provide some rewrite suggestions.
  matcher: ".*pyspark.sql.readwriter.DataFrameReader.load.*"
- name: "RDD.map"
prompt: |
Explain this tag:
------ TAG
@@firstline
and how it affects the following line of code:
------
**CODE**
@@secondline
------
Consider these notes:
------
**NOTES**
Spark has a concept of RDD which is not present in
In Spark you can use the map function to transform an RDD,
In snowpark you need to use dataframe operations.
The following examples show how these operations can be translated from Spark to Snowpark.
`rdd=df.rdd.map(lambda x: (x[0]+","+x[1],x[2],x[3]*2))`
can be rewritten as:
`df.select(df[0]+","+df[1],df[2],df[3]*2)`
operations like:
`rdd2=rdd.reduceByKey(lambda a,b: a+b)`
can be rewritten as:
`df.groupBy(df[0]).agg(sum(df[1]))`
operations like:
`rdd3=rdd2.filter(lambda x: x[1]>10)`
can be rewritten as:
`df.filter(df[1]>10)`
operations like:
`rdd.flatMap(lambda x: range(1, x))`
Return a new RDD by first applying a function to all elements of this
RDD, and then flattening the results.
A similar approach in snowpark will
`df=df.select(sequence(lit(1), df[0]).alias("flatmap"))`
`df=df.select(explode("flatmap"))`
------
And based on that information provided some rewrite suggestions.
matcher: ".*pyspark.rdd.RDD..*map.*"