Created July 23, 2020 21:18
# Feature Engineering in Splice Machine

#### Let's start our Spark Session

```python
# Set up the Spark session and the Splice Machine context
from pyspark.sql import SparkSession
from splicemachine.spark import PySpliceContext

spark = SparkSession.builder.getOrCreate()
splice = PySpliceContext(spark)
```

## Ingesting Data

<blockquote><p class='quotation'><span style='font-size:15px'> Using the table created in <a href='./7.3 Data Exploration.ipynb'>7.3 Data Exploration</a>, we will first create features with <code>SQL</code> and then ingest the result into <code>PySpark</code> for further analysis. <footer>Splice Machine</footer>
</blockquote>

#### Here we'll calculate a few simple features in SQL

We compute scaled difference features for the weekly and daily transaction counts and amounts: each observed value minus its expected value, divided by the expected value.

```sql
%%sql
SELECT 
 time_offset,
 expected_weekly_trans_cnt,
 expected_weekly_trans_amnt,
 expected_daily_trans_cnt,
 expected_daily_trans_amnt,
 weekly_trans_cnt,
 weekly_trans_amnt,
 daily_trans_cnt,
 daily_trans_amnt,
 rolling_avg_weekly_trans_cnt,
 rolling_avg_weekly_trans_amnt,
 rolling_avg_daily_trans_cnt,
 rolling_avg_daily_trans_amnt,
 MACD_trans_amnt,
 MACD_trans_cnt,
 RSI_trans_amnt,
 RSI_trans_cnt,
 Aroon_trans_amnt,
 Aroon_trans_cnt,
 ADX_trans_amnt,
 ADX_trans_cnt,
 current_balance,
 rolling_avg_balance,
 MACD_balance,
 Aroon_balance,
 RSI_balance,
 ADX_balance,
 credit_score,
 credit_limit,
 amount,
 (weekly_trans_cnt - expected_weekly_trans_cnt)/expected_weekly_trans_cnt AS weekly_trans_cnt_DIFF,
 (weekly_trans_amnt - expected_weekly_trans_amnt)/expected_weekly_trans_amnt AS weekly_trans_amnt_DIFF,
 (daily_trans_cnt - expected_daily_trans_cnt)/expected_daily_trans_cnt AS daily_trans_cnt_DIFF,
 (daily_trans_amnt - expected_daily_trans_amnt)/expected_daily_trans_amnt AS daily_trans_amnt_DIFF
FROM CC_FRAUD_DATA
{LIMIT 10}
```

## Ingesting the data with these new features into Splice Machine

Note that here the `_DIFF` features are raw differences (observed minus expected), not the scaled versions from the query above, and we also pull in the `CLASS_RESULT` label.

```python
df = splice.df("""
SELECT 
 time_offset,
 expected_weekly_trans_cnt,
 expected_weekly_trans_amnt,
 expected_daily_trans_cnt,
 expected_daily_trans_amnt,
 weekly_trans_cnt,
 weekly_trans_amnt,
 daily_trans_cnt,
 daily_trans_amnt,
 rolling_avg_weekly_trans_cnt,
 rolling_avg_weekly_trans_amnt,
 rolling_avg_daily_trans_cnt,
 rolling_avg_daily_trans_amnt,
 MACD_trans_amnt,
 MACD_trans_cnt,
 RSI_trans_amnt,
 RSI_trans_cnt,
 Aroon_trans_amnt,
 Aroon_trans_cnt,
 ADX_trans_amnt,
 ADX_trans_cnt,
 current_balance,
 rolling_avg_balance,
 MACD_balance,
 Aroon_balance,
 RSI_balance,
 ADX_balance,
 credit_score,
 credit_limit,
 amount,
 (weekly_trans_cnt - expected_weekly_trans_cnt) AS weekly_trans_cnt_DIFF,
 (weekly_trans_amnt - expected_weekly_trans_amnt) AS weekly_trans_amnt_DIFF,
 (daily_trans_cnt - expected_daily_trans_cnt) AS daily_trans_cnt_DIFF,
 (daily_trans_amnt - expected_daily_trans_amnt) AS daily_trans_amnt_DIFF,
 CLASS_RESULT
FROM CC_FRAUD_DATA""")
```

## Engineering features with PySpark

Here we calculate the z-score of the `AMOUNT` feature within each `CLASS_RESULT` group.

```python
# Render pandas DataFrames with the interactive BeakerX table widget
from beakerx.object import beakerx
beakerx.pandas_display_table()
```

```python
from pyspark.sql import functions as F

# Compute the per-class mean and population standard deviation of AMOUNT,
# broadcast the small aggregate back onto every row, then z-score AMOUNT
df = df.join(
        F.broadcast(df.groupBy("CLASS_RESULT").agg(
            F.stddev_pop("AMOUNT").alias("AMOUNT_sd"),
            F.avg("AMOUNT").alias("AMOUNT_avg"))),
        "CLASS_RESULT") \
    .withColumn("AMOUNT_Z_SCORE",
                (F.col("AMOUNT") - F.col("AMOUNT_avg")) / F.col("AMOUNT_sd"))

df.toPandas()
```

# Fantastic!

<blockquote> 
This shows how the platform can be used for both basic and advanced feature engineering in SQL and PySpark! <br>
 Next up: <a href='./7.5 Model Creation.ipynb'>Using MLManager to create basic machine learning models.</a>
<footer>Splice Machine</footer>
</blockquote>
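The scaled-difference features from the SQL cell above can be sanity-checked outside the database. This is a minimal pure-Python sketch of the same `(observed - expected) / expected` formula, using hypothetical transaction counts (the values are illustrative, not from `CC_FRAUD_DATA`):

```python
def scaled_diff(observed: float, expected: float) -> float:
    """(observed - expected) / expected, as in weekly_trans_cnt_DIFF."""
    return (observed - expected) / expected

# Hypothetical weekly transaction counts: 12 observed vs. 10 expected
print(scaled_diff(12, 10))   # 20% above expectation
print(scaled_diff(10, 10))   # exactly as expected
print(scaled_diff(5, 10))    # 50% below expectation
```

A value of 0 means the customer behaved exactly as expected; large positive or negative values flag unusual activity, which is why these features are useful for fraud detection.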
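Similarly, the group-wise z-score computed in the PySpark cell can be mirrored in plain Python to see what the `groupBy`/`agg`/`join` pipeline does per row. This sketch uses hypothetical `(class_result, amount)` pairs and the same population standard deviation that `F.stddev_pop` computes:

```python
from statistics import mean, pstdev

# Hypothetical transactions: (class_result, amount)
rows = [(0, 10.0), (0, 20.0), (0, 30.0), (1, 100.0), (1, 300.0)]

# Per-class mean and population standard deviation,
# mirroring F.avg and F.stddev_pop in the PySpark cell
stats = {}
for cls in {c for c, _ in rows}:
    amounts = [a for c, a in rows if c == cls]
    stats[cls] = (mean(amounts), pstdev(amounts))

# Z-score each amount within its class, mirroring the withColumn step
z_scores = [(c, a, (a - stats[c][0]) / stats[c][1]) for c, a in rows]
for c, a, z in z_scores:
    print(c, a, round(z, 3))
```

The broadcast join in the notebook exists because the aggregate table has only one row per class, so shipping it to every executor is far cheaper than a shuffle join of the full transaction table.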