Skip to content

Instantly share code, notes, and snippets.

@andrewgross
Created January 29, 2018 23:17
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save andrewgross/300286593b0bd2c2cc0ace5db819095e to your computer and use it in GitHub Desktop.
Save andrewgross/300286593b0bd2c2cc0ace5db819095e to your computer and use it in GitHub Desktop.
Bad Floating Point
Hey,
This is going to be a bit of an esoteric ticket. I noticed some strange behavior recently when comparing Spectrum and Redshift results on the same dataset.
Redshift Data: fare.txt
Parquet Data: fare.parquet
The parquet data was generated from fare.txt with PySpark using convert.py on Spark 2.2.0
Redshift Table Schema:
CREATE TABLE public.fare (
"id" VARCHAR,
"fare" DOUBLE PRECISION
)
Spectrum Table Schema:
CREATE EXTERNAL TABLE spectrum.fare (
"id" VARCHAR,
"fare" DOUBLE PRECISION
)
STORED AS PARQUET
LOCATION 's3://redshift-scratch/fare_parquet/';
NOTE:
I queried the data in several ways and got some very strange results. The following query gets the count of distinct ids and the sum of the fare field. Spectrum and Redshift get slightly different results, but I chalk this up to floating point issues:
select
count(distinct o.id) as redshift_count,
count(distinct s_o.id) as spectrum_count,
sum(o.fare) as redshift_sum,
sum(s_o.fare) as spectrum_count
from
public.fare as o,
spectrum.fare as s_o;
https://captured-pics.s3.amazonaws.com/yYPqvSHHhfn74XGRvvuVj7zFYIVCal3xgU.png
--------------------------------------------------------------------------------------------------
However, when you only do a count on one of the fields, the sums come back exactly equal:
select
count(distinct o.id) as redshift_count,
-- count(distinct s_o.id) as spectrum_count,
sum(o.fare) as redshift_sum,
sum(s_o.fare) as spectrum_count
from
public.fare as o,
spectrum.fare as s_o;
https://captured-pics.s3.amazonaws.com/q2dlKsU1gCu0kmCy4p9pBURJwzVDJFoWiO.png
--------------------------------------------------------------------------------------------------
It doesn't matter which column you use for count, they both have the same results.
select
-- count(distinct o.id) as redshift_count,
count(distinct s_o.id) as spectrum_count,
sum(o.fare) as redshift_sum,
sum(s_o.fare) as spectrum_count
from
public.fare as o,
spectrum.fare as s_o;
https://captured-pics.s3.amazonaws.com/XYbAMGIrqbnhwi9SGbId2B66il3dNiVIJZ.png
--------------------------------------------------------------------------------------------------
However, if you change the count from distinct, the results no longer match!
select
count(o.*) as redshift_count,
count(s_o.*) as spectrum_count,
sum(o.fare) as redshift_sum,
sum(s_o.fare) as spectrum_count
from
public.fare as o,
spectrum.fare as s_o;
https://captured-pics.s3.amazonaws.com/Blw1pUfMdBm8j2ZLxFZRod4aXO6rCwLOlr.png
--------------------------------------------------------------------------------------------------
And the results will no longer match if you use just one column:
select
count(o.*) as redshift_count,
-- count(s_o.*) as spectrum_count,
sum(o.fare) as redshift_sum,
sum(s_o.fare) as spectrum_count
from
public.fare as o,
spectrum.fare as s_o;
https://captured-pics.s3.amazonaws.com/byAIEQmXA0PntSKXATHZuyeQ8GJZ5PIW7y.png
--------------------------------------------------------------------------------------------------
select
-- count(o.*) as redshift_count,
count(s_o.*) as spectrum_count,
sum(o.fare) as redshift_sum,
sum(s_o.fare) as spectrum_count
from
public.fare as o,
spectrum.fare as s_o;
https://captured-pics.s3.amazonaws.com/hempPkPhC0frjcFJXQiKE8cWAIORnOArpN.png
--------------------------------------------------------------------------------------------------
There is a lot of weirdness going on here. I would like to be able to rely on floating point calculations being at least consistent and not dependent on the number of columns? At the moment our workaround is to ensure we use appropriate significant figures and truncate results, but this is a very confusing bug to be sure. I even had occasionally instances of the same query returning slightly different results (out at the 10th+ decimal place) on subsequent runs, but I have not been able to reproduce this reliably yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment