andrewgross/gist:300286593b0bd2c2cc0ace5db819095e

## gistfile1.txt
Hey,

This is going to be a bit of an esoteric ticket.  I noticed some strange behavior recently when comparing Spectrum and Redshift results on the same dataset.

Redshift Data: fare.txt
Parquet Data: fare.parquet

The parquet data was generated from fare.txt with PySpark using convert.py on Spark 2.2.0

Redshift Table Schema:

CREATE TABLE public.fare (
        "id" VARCHAR,
        "fare" DOUBLE PRECISION
)

Spectrum Table Schema:

CREATE EXTERNAL TABLE spectrum.fare (
        "id" VARCHAR,
        "fare" DOUBLE PRECISION
)
STORED AS PARQUET
LOCATION 's3://redshift-scratch/fare_parquet/';

NOTE:

I queried the data in several ways and got some very strange results.  The following query gets the count of distinct ids and the sum of the fare field.  Spectrum and Redshift get slightly different results, but I chalk this up to floating point issues:

select
    count(distinct o.id) as redshift_count,
    count(distinct s_o.id) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_count
from
    public.fare as o,
    spectrum.fare as s_o;


https://captured-pics.s3.amazonaws.com/yYPqvSHHhfn74XGRvvuVj7zFYIVCal3xgU.png

--------------------------------------------------------------------------------------------------

However, when you only do a count on one of the fields, the sums come back exactly equal:

select
    count(distinct o.id) as redshift_count,
    -- count(distinct s_o.id) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_count
from
    public.fare as o,
    spectrum.fare as s_o;


https://captured-pics.s3.amazonaws.com/q2dlKsU1gCu0kmCy4p9pBURJwzVDJFoWiO.png

--------------------------------------------------------------------------------------------------

It doesn't matter which column you use for count, they both have the same results.

select
    -- count(distinct o.id) as redshift_count,
    count(distinct s_o.id) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_count
from
    public.fare as o,
    spectrum.fare as s_o;


https://captured-pics.s3.amazonaws.com/XYbAMGIrqbnhwi9SGbId2B66il3dNiVIJZ.png


--------------------------------------------------------------------------------------------------


However, if you change the count from distinct, the results no longer match!

select
    count(o.*) as redshift_count,
    count(s_o.*) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_count
from
    public.fare as o,
    spectrum.fare as s_o;


https://captured-pics.s3.amazonaws.com/Blw1pUfMdBm8j2ZLxFZRod4aXO6rCwLOlr.png

--------------------------------------------------------------------------------------------------

And the results will no longer match if you use just one column:

select
    count(o.*) as redshift_count,
    -- count(s_o.*) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_count
from
    public.fare as o,
    spectrum.fare as s_o;


https://captured-pics.s3.amazonaws.com/byAIEQmXA0PntSKXATHZuyeQ8GJZ5PIW7y.png

--------------------------------------------------------------------------------------------------

select
    -- count(o.*) as redshift_count,
    count(s_o.*) as spectrum_count,
    sum(o.fare) as redshift_sum,
    sum(s_o.fare) as spectrum_count
from
    public.fare as o,
    spectrum.fare as s_o;


https://captured-pics.s3.amazonaws.com/hempPkPhC0frjcFJXQiKE8cWAIORnOArpN.png


--------------------------------------------------------------------------------------------------


There is a lot of weirdness going on here.  I would like to be able to rely on floating point calculations being at least consistent and not dependent on the number of columns?  At the moment our workaround is to ensure we use appropriate significant figures and truncate results, but this is a very confusing bug to be sure.  I even had occasionally instances of the same query returning slightly different results (out at the 10th+ decimal place) on subsequent runs, but I have not been able to reproduce this reliably yet.
	Hey,

	This is going to be a bit of an esoteric ticket. I noticed some strange behavior recently when comparing Spectrum and Redshift results on the same dataset.

	Redshift Data: fare.txt
	Parquet Data: fare.parquet

	The parquet data was generated from fare.txt with PySpark using convert.py on Spark 2.2.0

	Redshift Table Schema:

	CREATE TABLE public.fare (
	"id" VARCHAR,
	"fare" DOUBLE PRECISION
	)

	Spectrum Table Schema:

	CREATE EXTERNAL TABLE spectrum.fare (
	"id" VARCHAR,
	"fare" DOUBLE PRECISION
	)
	STORED AS PARQUET
	LOCATION 's3://redshift-scratch/fare_parquet/';

	NOTE:

	I queried the data in several ways and got some very strange results. The following query gets the count of distinct ids and the sum of the fare field. Spectrum and Redshift get slightly different results, but I chalk this up to floating point issues:

	select
	count(distinct o.id) as redshift_count,
	count(distinct s_o.id) as spectrum_count,
	sum(o.fare) as redshift_sum,
	sum(s_o.fare) as spectrum_count
	from
	public.fare as o,
	spectrum.fare as s_o;


	https://captured-pics.s3.amazonaws.com/yYPqvSHHhfn74XGRvvuVj7zFYIVCal3xgU.png

	--------------------------------------------------------------------------------------------------

	However, when you only do a count on one of the fields, the sums come back exactly equal:

	select
	count(distinct o.id) as redshift_count,
	-- count(distinct s_o.id) as spectrum_count,
	sum(o.fare) as redshift_sum,
	sum(s_o.fare) as spectrum_count
	from
	public.fare as o,
	spectrum.fare as s_o;


	https://captured-pics.s3.amazonaws.com/q2dlKsU1gCu0kmCy4p9pBURJwzVDJFoWiO.png

	--------------------------------------------------------------------------------------------------

	It doesn't matter which column you use for count, they both have the same results.

	select
	-- count(distinct o.id) as redshift_count,
	count(distinct s_o.id) as spectrum_count,
	sum(o.fare) as redshift_sum,
	sum(s_o.fare) as spectrum_count
	from
	public.fare as o,
	spectrum.fare as s_o;


	https://captured-pics.s3.amazonaws.com/XYbAMGIrqbnhwi9SGbId2B66il3dNiVIJZ.png


	--------------------------------------------------------------------------------------------------


	However, if you change the count from distinct, the results no longer match!

	select
	count(o.*) as redshift_count,
	count(s_o.*) as spectrum_count,
	sum(o.fare) as redshift_sum,
	sum(s_o.fare) as spectrum_count
	from
	public.fare as o,
	spectrum.fare as s_o;


	https://captured-pics.s3.amazonaws.com/Blw1pUfMdBm8j2ZLxFZRod4aXO6rCwLOlr.png

	--------------------------------------------------------------------------------------------------

	And the results will no longer match if you use just one column:

	select
	count(o.*) as redshift_count,
	-- count(s_o.*) as spectrum_count,
	sum(o.fare) as redshift_sum,
	sum(s_o.fare) as spectrum_count
	from
	public.fare as o,
	spectrum.fare as s_o;


	https://captured-pics.s3.amazonaws.com/byAIEQmXA0PntSKXATHZuyeQ8GJZ5PIW7y.png

	--------------------------------------------------------------------------------------------------

	select
	-- count(o.*) as redshift_count,
	count(s_o.*) as spectrum_count,
	sum(o.fare) as redshift_sum,
	sum(s_o.fare) as spectrum_count
	from
	public.fare as o,
	spectrum.fare as s_o;


	https://captured-pics.s3.amazonaws.com/hempPkPhC0frjcFJXQiKE8cWAIORnOArpN.png


	--------------------------------------------------------------------------------------------------


	There is a lot of weirdness going on here. I would like to be able to rely on floating point calculations being at least consistent and not dependent on the number of columns? At the moment our workaround is to ensure we use appropriate significant figures and truncate results, but this is a very confusing bug to be sure. I even had occasionally instances of the same query returning slightly different results (out at the 10th+ decimal place) on subsequent runs, but I have not been able to reproduce this reliably yet.