Andresmup/ARIMA_PLUS_XREG_TIME_SERIES_FORECASTING_BIG_QUERY.md

## ARIMA_PLUS_XREG_TIME_SERIES_FORECASTING_BIG_QUERY.md

      
    Raw
  

              ARIMA_PLUS_XREG_TIME_SERIES_FORECASTING_BIG_QUERY.md
            
          
    BIG QUERY TIME SERIES FORECASTING W/ ARIMA-PLUS-XREG


✔️ INTRODUCCIÓN

🔥 Bienvenido a mi Gist donde muestro como modelar un TIME SERIES FORECASTING  con ARIMA PLUS XREG en BIG QUERY
✔️ ABSTRACT

El objetivo principal es poder demostrar como utilizando únicamente sql en big query se puede realizar un modelo, en este caso implementando
un forecasting para una serie de tiempo. La ejecución de este código se hizo en Big Query Studio de Google Cloud.
✔️ ORIGEN DATOS

Los datos utilizados para esta demostración fueron obtenidos de https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page,
donde podemos encontrar archivos .parquet para cada mes; que contienen un registro de los viajes realizados por los taxis en New York.
En este caso usaremos los registros para los taxis verdes del 2022. Junto con el registro donde se detallan las ubicaciones.
Una vez descargados simplemente hay que subirlos a un bucket en Cloud Storage.
✔️ QUERIES STEP BY STEP

➡️ IMPORTACIÓN TABLAS

Los primero será crear una tabla externa a partir de las particiones que contienen los registros de los viajes realizados por los taxis
verdes en 2022. Los cuales subimos a un bucket en Cloud Storage.
-- Creation of external table from parquet green taxi dataset
CREATE OR REPLACE EXTERNAL TABLE `dataengineeringcourse-0001.ny_green_taxi.green_tripdata_2022` #Location of green taxi dataset
OPTIONS (
  format = 'PARQUET', #Because the data is partitioned in parquet batchs
  uris = ['gs://green_tripdata_2022/*.parquet'] #Select all 2022 values (1 for every month)
);
Realizamos un select para ver que la tabla green_tripdata_2022 esté correcta.
-- Checking external green_tripdata_2022 table
SELECT * FROM dataengineeringcourse-0001.ny_green_taxi.green_tripdata_2022 limit 5;


VendorID
lpep_pickup_datetime
lpep_dropoff_datetime
store_and_fwd_flag
RatecodeID
PULocationID
DOLocationID
passenger_count
trip_distance
fare_amount
extra
mta_tax
tip_amount
tolls_amount
ehail_fee
improvement_surcharge
total_amount
payment_type
trip_type
congestion_surcharge


2
2022-01-01 00:14:21.000000 UTC
2022-01-01 00:15:33.000000 UTC
N
1.0
42
42
1.0
0.44
3.5
0.5
0.5
0.0
0.0

0.3
4.8
2.0
1.0
0.0


1
2022-01-01 00:20:55.000000 UTC
2022-01-01 00:29:38.000000 UTC
N
1.0
116
41
1.0
2.1
9.5
0.5
0.5
0.0
0.0

0.3
10.8
2.0
1.0
0.0


1
2022-01-01 00:57:02.000000 UTC
2022-01-01 01:13:14.000000 UTC
N
1.0
41
140
1.0
3.7
14.5
3.25
0.5
4.6
0.0

0.3
23.15
1.0
1.0
2.75


2
2022-01-01 00:07:42.000000 UTC
2022-01-01 00:15:57.000000 UTC
N
1.0
181
181
1.0
1.69
8.0
0.5
0.5
0.0
0.0

0.3
9.3
2.0
1.0
0.0


2
2022-01-01 00:07:50.000000 UTC
2022-01-01 00:28:52.000000 UTC
N
1.0
33
170
1.0
6.26
22.0
0.5
0.5
5.21
0.0

0.3
31.26
1.0
1.0
2.75


Repetimos este proceso para la tabla que contiene la información de las ubicaciones.
-- Creation of external table from taxi_zone_lookup
CREATE OR REPLACE EXTERNAL TABLE `dataengineeringcourse-0001.ny_green_taxi.taxi_zone_lookup` #Location of taxi zone data
OPTIONS (
  format = 'CSV', #This is a csv file
  uris = ['gs://green_tripdata_2022/taxi_zone_lookup.csv'] #Table with the id Borough, Zone and Service Zone
);
Realizamos un select para ver que la tabla taxi_zone_lookup esté correcta.


LocationID
Borough
Zone
service_zone


1
EWR
Newark Airport
EWR


2
Queens
Jamaica Bay
Boro Zone


3
Bronx
Allerton/Pelham Gardens
Boro Zone


4
Manhattan
Alphabet City
Yellow Zone


5
Staten Island
Arden Heights
Boro Zone


➡️ CREACIÓN TABLA CON DATOS A TRABAJAR

Una vez están las dos tablas externas, procedemos a preparar los datos para el modelo, de acuerdo con las necesidades del caso.
Detallando las columnas que se van a utilizar, definiendo el formato de las columnas necesarias y filtrando aquellas filas que cumplan con ciertos criterios.
-- Create ml table for bronx with sum and round of passenger_count, fare_amount and trip_distance for trips on 2022
CREATE OR REPLACE TABLE `dataengineeringcourse-0001.ny_green_taxi.bronx_green_tripdata_2022_ml` AS (
  SELECT
    DATE(TIMESTAMP(g.lpep_pickup_datetime)) AS pickup_date, #Convert TIMESTAMP into DATE
    ROUND(SUM(g.passenger_count), 2) AS total_passengers, #Sum and round passengers count per trip values
    ROUND(SUM(g.fare_amount), 2) AS total_amount, #Sum and round fare amount 
    ROUND(SUM(g.trip_distance), 2) AS total_distance #Sum and round trip distance
  FROM
    `dataengineeringcourse-0001.ny_green_taxi.green_tripdata_2022` g
  JOIN
    `dataengineeringcourse-0001.ny_green_taxi.taxi_zone_lookup` tzl
  ON
    CAST(g.PULocationID AS STRING) = CAST(tzl.LocationID AS STRING)
  WHERE
    g.passenger_count > 0 AND #Filter trips with passenger count higher than 0
    g.fare_amount > 0 AND #Filter trips with fare amount higher than 0
    g.trip_distance > 0 AND  #Filter trips with distance higher than 0
    EXTRACT(YEAR FROM TIMESTAMP(g.lpep_pickup_datetime)) = 2022 AND #Only values for 2022
    tzl.Borough = 'Bronx' #Filter values for our borough targer which is Bronx
  GROUP BY
    pickup_date #Group by date
  ORDER BY
    pickup_date #Order by date January first
);
Realizamos un select para ver que la tabla bronx_green_tripdata_2022_ml esté correcta.
-- Checking external table
SELECT * FROM ny_green_taxi.bronx_green_tripdata_2022_ml limit 5;


pickup_date
total_passengers
total_amount
total_distance


2022-01-01
59.0
731.7
169.7


2022-01-02
52.0
524.5
111.04


2022-01-03
55.0
547.0
135.83


2022-01-04
45.0
490.5
115.68


2022-01-05
74.0
733.0
183.44


Con una visualización gráfica podemos comprender mejor los datos.

➡️ CREACIÓN Y ENTRENAMIENTO DEL MODELO ARIMA_PLUS_XREG

Procedemos a crear y entrenar un modelo arima plus xreg, el cual guardaremos.
Dicho modelo se entrenará con los datos de enero a noviembre de 2022, ya que el objetivo es poder predecir la cantidad de pasajeros
que usaran el servicio de taxis green de new york en diciembre para el distrito del Bronx.
#ARIMA_PLUS_XREG
CREATE OR REPLACE MODEL ny_green_taxi.bronx_trip_model #Name of model
 OPTIONS (
   MODEL_TYPE = 'ARIMA_PLUS_XREG', #Select type model
   time_series_timestamp_col = 'pickup_date', #Indicate date column
   time_series_data_col = 'total_passengers') AS #Target column
SELECT  pickup_date, total_passengers, total_distance, total_amount 
FROM ny_green_taxi.bronx_green_tripdata_2022_ml
WHERE  pickup_date BETWEEN DATE('2022-01-01') AND DATE('2022-11-30') #We are gone train with data from january to november

➡️ REVISIÓN DEL MODELO

Procedemos a chequear con unos select cuales son las medidas de tendencia central, las métricas y los coeficientes  del modelo.
-- Check features
SELECT * FROM ML.FEATURE_INFO(MODEL ny_green_taxi.bronx_trip_model)


input
min
max
mean
median
stddev
category_count
null_count


total_passengers
14.0
103.0
47.619760479041915
46.0
14.17699248283547
null
0


pickup_date
null
null
null
null
null
334
0


total_distance
28.6
307.13
137.1334431137725
132.36
50.129443810579161
null
0


total_amount
180.5
1327.5
616.10883233533013
585.0
212.53652820434178
null
0


-- Check metrics
SELECT * FROM ML.ARIMA_EVALUATE(MODEL ny_green_taxi.bronx_trip_model) LIMIT 5


non_seasonal_p
non_seasonal_d
non_seasonal_q
has_drift
log_likelihood
AIC
variance
seasonal_periods
has_holiday_effect
has_spikes_and_dips
has_step_changes
error_message


0
1
5
true
-1076.59
2167.18
36.17
[WEEKLY]
false
true
false


0
1
5
false
-1080.169
2172.33
37.212
[WEEKLY]
false
true
false


3
1
2
true
-1091.47
2196.95
40.53
[WEEKLY]
false
true
false


0
1
4
true
-1093.20
2198.41
40.94
[WEEKLY]
false
true
false


3
1
1
true
-1093.50
2199
41.04
[WEEKLY]
false
true
false


-- Check coheficients
SELECT * FROM ML.ARIMA_COEFFICIENTS(MODEL ny_green_taxi.bronx_trip_model)


ar_coefficients
ma_coefficients
intercept_or_drift
processed_input
weight


[]
[]
null
total_distance
0.0012


[]
[]

total_amount
0.048


[]
[]
null
INTERCEPT
191.51


[]
[-0.86,-0.25,0.75,-0.03,-0.6]
-0.04
null
null


➡️ REALIZAR PREDICCIONES PARA DICIEMBRE

Es momento de realizar las predicciones, para ello se selecciona el modelo guardado y entrenado, definiendo un intervalo
de confianza del 80% para las predicciones. Definiendo un periodo de 31 días como la cantidad a predecir
(cantidad días de diciembre). Y el punto de inicio el día siguiente al 30 de octubre.
-- Make predictions for december 
SELECT *
FROM ML.FORECAST(
MODEL ny_green_taxi.bronx_trip_model,#Define model
STRUCT(31 AS horizon, 0.8 AS confidence_level), #Set prediction of 31 days (december), 80% of confidence interval
(
  SELECT
  pickup_date ,total_amount, total_distance #Set columns to use into the prediction
  FROM
    ny_green_taxi.bronx_green_tripdata_2022_ml #Dataset
  WHERE
    pickup_date > DATE('2022-11-30') #Start prediction from last day of november
))
Esta query arrojara una tabla como resultado en la cual podemos observar la cantidad de pasajeros que se predicen usaran el
servicio cada día para el Bronx, el error estándar y cuáles son los valores mínimos y máximos para un intervalo de confianza del 80%.


forecast_timestamp
forecast_value
standard_error
confidence_level
prediction_interval_lower_bound
prediction_interval_upper_bound


2022-12-01
35
6.04
0.8
27
42


2022-12-02
43
6.1
0.8
35
50


2022-12-03
35
6.13
0.8
27
43


2022-12-04
32
7.22
0.8
22
41


2022-12-05
34
8.08
0.8
24
45


➡️ EVALUACIÓN MODELO

Ahora se procede a obtener las métricas más comunes (MAE, MSE, RMSE, MAPE y SMAPE) para conocer como de precisa fue la predicción
y poder compararla a futuro a medida que implementen mejoras en el modelo.
-- Evaluate the model predictions
SELECT *
FROM ML.EVALUATE(
MODEL ny_green_taxi.bronx_trip_model, #Define model
(
  SELECT
    pickup_date,total_passengers, total_distance, total_amount #Set columns to evaluate the prediction
  FROM
    ny_green_taxi.bronx_green_tripdata_2022_ml #Source for the true values
  WHERE
    pickup_date > DATE('2022-11-30') #Start evaluation from las day of november
),
STRUCT(
  TRUE AS perform_aggregation,  31 AS horizon)) #Set evaluation to prediction 31 days (december)


mean_absolute_error
mean_squared_error
root_mean_squared_error
mean_absolute_percentage_error
symmetric_mean_absolute_percentage_error


4.34
26.86
5.18
13.83
13.68


✔️ CONCLUSIÓN

Este trabajo buscaba mostrar con un ejemplo sencillo como realizar un modelo en big query utilizando únicamente sql, usando datos
alojados en un bucket de google cloud storage y ejecutando las queries en big query studio.
Recapitulando las queries que se mostraron:

Crear tablas externas a partir de datos en buckets de google cloud storage
Crear y procesar los datos que necesitemos.
Visualizar los datos y graficarlos.
Crear el modelo que mejor se adapte a nuestro y entrenarlo.
Realizar predicciones.
Evaluar el desempeño del modelo.


♻️ La estructura de este código se puede adaptar a otros casos de uso.
🗂️ Estas queries son la base para implementar otros modelos y/o mejorar este mismo.
📜 Puedes hacer un fork y mejorar el trabajo.

💬 Gracias por la visita. Si tienes alguna duda o sugerencia, no dudes en contactar o dejar tu comentario aqui abajo. 💬
👨‍💻 Andrés Muñoz Pampillón
VendorID	lpep_pickup_datetime	lpep_dropoff_datetime	store_and_fwd_flag	RatecodeID	PULocationID	DOLocationID	passenger_count	trip_distance	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	payment_type	trip_type	congestion_surcharge
2	2022-01-01 00:14:21.000000 UTC	2022-01-01 00:15:33.000000 UTC	N	1.0	42	42	1.0	0.44	3.5	0.5	0.5	0.0	0.3	4.8	2.0	1.0	0.0
1	2022-01-01 00:20:55.000000 UTC	2022-01-01 00:29:38.000000 UTC	N	1.0	116	41	1.0	2.1	9.5	0.5	0.5	0.0	0.3	10.8	2.0	1.0	0.0
1	2022-01-01 00:57:02.000000 UTC	2022-01-01 01:13:14.000000 UTC	N	1.0	41	140	1.0	3.7	14.5	3.25	0.5	4.6	0.3	23.15	1.0	1.0	2.75
2	2022-01-01 00:07:42.000000 UTC	2022-01-01 00:15:57.000000 UTC	N	1.0	181	181	1.0	1.69	8.0	0.5	0.5	0.0	0.3	9.3	2.0	1.0	0.0
2	2022-01-01 00:07:50.000000 UTC	2022-01-01 00:28:52.000000 UTC	N	1.0	33	170	1.0	6.26	22.0	0.5	0.5	5.21	0.3	31.26	1.0	1.0	2.75
LocationID	Borough	Zone	service_zone
1	EWR	Newark Airport	EWR
2	Queens	Jamaica Bay	Boro Zone
3	Bronx	Allerton/Pelham Gardens	Boro Zone
4	Manhattan	Alphabet City	Yellow Zone
5	Staten Island	Arden Heights	Boro Zone
pickup_date	total_passengers	total_amount	total_distance
2022-01-01	59.0	731.7	169.7
2022-01-02	52.0	524.5	111.04
2022-01-03	55.0	547.0	135.83
2022-01-04	45.0	490.5	115.68
2022-01-05	74.0	733.0	183.44
input	min	max	mean	median	stddev	category_count
total_passengers	14.0	103.0	47.619760479041915	46.0	14.17699248283547	null
pickup_date	null	null	null	null	null	334
total_distance	28.6	307.13	137.1334431137725	132.36	50.129443810579161	null
total_amount	180.5	1327.5	616.10883233533013	585.0	212.53652820434178	null
non_seasonal_p	non_seasonal_d	non_seasonal_q	has_drift	log_likelihood	AIC	variance	seasonal_periods	has_holiday_effect	has_spikes_and_dips	has_step_changes
0	1	5	true	-1076.59	2167.18	36.17	[WEEKLY]	false	true	false
0	1	5	false	-1080.169	2172.33	37.212	[WEEKLY]	false	true	false
3	1	2	true	-1091.47	2196.95	40.53	[WEEKLY]	false	true	false
0	1	4	true	-1093.20	2198.41	40.94	[WEEKLY]	false	true	false
3	1	1	true	-1093.50	2199	41.04	[WEEKLY]	false	true	false
ar_coefficients	ma_coefficients	intercept_or_drift	processed_input	weight
[]	[]	null	total_distance	0.0012
[]	[]		total_amount	0.048
[]	[]	null	INTERCEPT	191.51
[]	[-0.86,-0.25,0.75,-0.03,-0.6]	-0.04	null	null
forecast_timestamp	forecast_value	standard_error	confidence_level	prediction_interval_lower_bound	prediction_interval_upper_bound
2022-12-01	35	6.04	0.8	27	42
2022-12-02	43	6.1	0.8	35	50
2022-12-03	35	6.13	0.8	27	43
2022-12-04	32	7.22	0.8	22	41
2022-12-05	34	8.08	0.8	24	45