Insertion in Spark SQL

normal table: src(a int, b int)

the results shown below are produced by running select a, b from src
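
The gist doesn't include the DDL used to create the two flavours of src; a rough sketch of the assumed setup is below. The spark.sql calls are 2.0-style, the storage formats are assumptions, and the two tables would be created in separate runs.

// assumed Hive table setup (requires Hive support)
spark.sql("CREATE TABLE src (a INT, b INT) STORED AS TEXTFILE")
// assumed datasource table setup (any built-in format such as parquet)
spark.sql("CREATE TABLE src (a INT, b INT) USING parquet")
// results are read back as described above
spark.sql("SELECT a, b FROM src").show()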

SQL:

INSERT INTO TABLE src SELECT 1,2;

1.6:

  • hive table: success, the result is 1, 2
  • datasource table: success, the result is 1, 2

2.0:

  • hive table: success, the result is 1, 2
  • datasource table: success, the result is 1, 2

-- try to match the column names of the target table
INSERT INTO TABLE src SELECT 1 AS b, 2 AS a;

1.6:

  • hive table: success, but the column names don't matter; the result is 1, 2
  • datasource table: success, but the column names don't matter; the result is 1, 2

2.0:

  • hive table: success, but the column names don't matter; the result is 1, 2
  • datasource table: success, but the column names don't matter; the result is 1, 2

insertInto:

Seq(1 -> 2).toDF().write.insertInto("src")

1.6:

  • hive table: success, the result is 1, 2
  • datasource table: success, the result is 1, 2

2.0:

  • hive table: success, the result is 1, 2
  • datasource table: success, the result is 1, 2

// try to match the column names of the target table
Seq(1 -> 2).toDF("b", "a").write.insertInto("src")

1.6:

  • hive table: success, but the names are ignored, same as the SQL version
  • datasource table: success, but the names are ignored, same as the SQL version

2.0:

  • hive table: success, but the names are ignored, same as the SQL version
  • datasource table: success, but the names are ignored, same as the SQL version
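
In other words, both the SQL INSERT and insertInto resolve columns by position, not by name. A sketch of a pattern that avoids depending on names is to put the DataFrame columns into the table's declared order before writing (the a, b order below mirrors the table definition; import spark.implicits._ is assumed, as in the snippets above):

// reorder the columns so that positional resolution lines up with the table schema
Seq(1 -> 2).toDF("b", "a").select("a", "b").write.insertInto("src")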

saveAsTable:

saveAsTable with append mode doesn't support Hive tables in 2.0, so the following tests only run on the datasource table.

Seq(1 -> 2).toDF().write.mode("append").saveAsTable("src")

1.6: success, the result is 1, 2
2.0: fail, AnalysisException: cannot resolve 'a' given input columns: [_1, _2]

// try to match the column names of target table
Seq(1 -> 2).toDF("b", "a").write.mode("append").saveAsTable("src")

1.6: success, the result is 1, 2
2.0: success, the result is 2, 1
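
The 2, 1 result above suggests that in 2.0 saveAsTable with append mode resolves columns by name rather than by position, unlike 1.6. A hypothetical variant, not among the cases run here, would therefore name the DataFrame columns after the table's:

// not one of the tested cases; column names assumed to match the table definition
Seq(1 -> 2).toDF("a", "b").write.mode("append").saveAsTable("src")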

partitioned table: src(a int) partitioned by (b int, c int)

the results shown below are produced by running select a, b, c from src
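
As before, the DDL isn't shown in the gist; a sketch of the assumed setup is below. The formats and the 2.0-style syntax for the partitioned datasource table are assumptions; in 1.6 the datasource table would presumably have been created via DataFrameWriter.partitionBy(...).saveAsTable instead.

// assumed Hive table setup
spark.sql("CREATE TABLE src (a INT) PARTITIONED BY (b INT, c INT) STORED AS TEXTFILE")
// assumed datasource table setup (2.0 syntax)
spark.sql("CREATE TABLE src (a INT, b INT, c INT) USING parquet PARTITIONED BY (b, c)")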

SQL:

INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 1;

1.6:

  • hive table: success, the result is 1, 2, 3
  • datasource table: fail, RuntimeException: requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: fail, RuntimeException: requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.

INSERT INTO TABLE src PARTITION(c=2, b=3) SELECT 1;

1.6:

  • hive table: success, the result is 1, 3, 2
  • datasource table: fail, RuntimeException: requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.

2.0:

  • hive table: success, the result is 1, 3, 2
  • datasource table: fail, RuntimeException: requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.

INSERT INTO TABLE src PARTITION(b=2, c) SELECT 1, 3;

1.6:

  • hive table: success, the result is 1, 2, 3
  • datasource table: fail, RuntimeException: requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: fail, RuntimeException: requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema.

INSERT INTO TABLE src PARTITION(b=2, c=3) SELECT 4, 5, 6;

1.6:

  • hive table: success, the result is 4, 2, 3 (confusing behaviour)
  • datasource table: success, the result is 4, 5, 6 (confusing behaviour)

2.0:

  • hive table: fail, AnalysisException: unresolved operator 'InsertIntoTable (weird error message)
  • datasource table: success, the result is 4, 5, 6 (confusing behaviour)

INSERT INTO TABLE src PARTITION(b, c) SELECT 1, 2, 3;

1.6:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3
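
For the Hive table, a fully dynamic insert like this one normally needs Hive's dynamic-partition settings relaxed; the successful runs above presumably had something like the following in effect (an assumption, the session configuration isn't shown in the gist):

// assumed session configuration for dynamic partition inserts into the Hive table
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")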

INSERT INTO TABLE src PARTITION(c) SELECT 1, 2, 3;

1.6:

  • hive table: fail, java.util.NoSuchElementException: key not found: c
  • datasource table: fail, AnalysisException: Specified partition columns (b) do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: fail, AnalysisException: Requested partitioning does not match the src table
  • datasource table: success, the result is 1, 2, 3

INSERT INTO TABLE src PARTITION(b, c) SELECT 1, 2 AS c, 3 AS b;

1.6:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3

INSERT INTO TABLE src PARTITION(c, b) SELECT 1, 2, 3;

1.6:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3

INSERT INTO TABLE src SELECT 1, 2, 3;

1.6:

  • hive table: success, but actually no data is inserted
  • datasource table: fail, AnalysisException: Specified partition columns () do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: fail, AnalysisException: Cannot find partition column b
  • datasource table: success, the result is 1, 2, 3

INSERT INTO TABLE src SELECT 1, 2 AS c, 3 AS b;

1.6:

  • hive table: success, but actually no data is inserted
  • datasource table: fail, AnalysisException: Specified partition columns () do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: success, the result is 1, 3, 2
  • datasource table: success, the result is 1, 2, 3

insertInto:

Seq((1, 2, 3)).toDF("x", "y", "z").write.partitionBy("y", "z").insertInto("src")

1.6:

  • hive table: fail, java.util.NoSuchElementException: key not found: b
  • datasource table: fail, AnalysisException: Specified partition columns (y, z) do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: fail, SparkException: Requested partitioning does not match the src table
  • datasource table: success, the result is 1, 2, 3
Seq((1, 2, 3)).toDF("x", "y", "z").write.partitionBy("y", "x").insertInto("src")

1.6:

  • hive table: fail, java.util.NoSuchElementException: key not found: b
  • datasource table: fail, AnalysisException: Specified partition columns (y, x) do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: fail, SparkException: Requested partitioning does not match the src table
  • datasource table: success, the result is 3, 1, 2
Seq((1, 2, 3)).toDF("x", "y", "z").write.insertInto("src")

1.6:

  • hive table: success, but actually no data is inserted
  • datasource table: fail, AnalysisException: Specified partition columns () do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: fail, AnalysisException: Cannot find partition column b
  • datasource table: success, the result is 1, 2, 3
Seq((1, 2, 3)).toDF("a", "b", "c").write.partitionBy("b", "c").insertInto("src")

1.6:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3
Seq((1, 2, 3)).toDF("a", "b", "c").write.partitionBy("c", "b").insertInto("src")

1.6:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3
Seq((1, 2, 3)).toDF("a", "b", "c").write.partitionBy("b").insertInto("src")

1.6:

  • hive table: fail, java.util.NoSuchElementException: key not found: c
  • datasource table: fail, AnalysisException: Specified partition columns (b) do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: fail, AnalysisException: Requested partitioning does not match the src table
  • datasource table: success, the result is 1, 3, 2
Seq((1, 2, 3)).toDF("a", "b", "c").write.insertInto("src")

1.6:

  • hive table: success, but actually no data is inserted
  • datasource table: fail, AnalysisException: Specified partition columns () do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: success, the result is 1, 2, 3
  • datasource table: success, the result is 1, 2, 3
Seq((1, 2, 3)).toDF("c", "b", "a").write.insertInto("src")

1.6:

  • hive table: success, but actually no data is inserted
  • datasource table: fail, AnalysisException: Specified partition columns () do not match the partition columns of the table. Please use (b, c) as the partition columns

2.0:

  • hive table: success, the result is 3, 2, 1
  • datasource table: success, the result is 1, 2, 3
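
Across the insertInto variants above, the only forms that behave the same way in 1.6 and 2.0 on both table types are the ones where the DataFrame column names match the table's columns and partitionBy names the table's partition columns. A slightly more defensive variant of that pattern, not itself one of the cases run above, also forces the columns into the table's order before writing:

// name the columns after the table and put them in the table's order before inserting
val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.select("a", "b", "c").write.partitionBy("b", "c").insertInto("src")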

saveAsTable:

As above, saveAsTable with append mode doesn't support Hive tables in 2.0, so the following tests only run on the datasource table.

Seq((1, 2, 3)).toDF("x", "y", "z").write.partitionBy("y", "z").mode("append").saveAsTable("src")

1.6: fail, java.util.NoSuchElementException: key not found: b
2.0: fail, AnalysisException: cannot resolve 'a' given input columns: [x, y, z]

val df = Seq((1, 2, 3)).toDF("x", "y", "z")
df.write.partitionBy("z", "y").mode("append").saveAsTable("src")

1.6: fail, java.util.NoSuchElementException: key not found: b
2.0: fail, AnalysisException: cannot resolve 'a' given input columns: [x, y, z]

val df = Seq((1, 2, 3)).toDF("x", "y", "z")
df.write.mode("append").saveAsTable("src")

1.6: success, but actually no data is inserted
2.0: fail, AnalysisException: cannot resolve 'a' given input columns: [x, y, z]

val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.write.partitionBy("b", "c").mode("append").saveAsTable("src")

1.6: success, result is 1, 2, 3
2.0: success, result is 1, 2, 3

val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.write.partitionBy("c", "b").mode("append").saveAsTable("src")

1.6: success, result is 1, 2, 3
2.0: success, result is 1, 3, 2 (but the dir is c=3/b=2, which is wrong)

val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.write.partitionBy("b").mode("append").saveAsTable("src")

1.6: fail, java.util.NoSuchElementException: key not found: c
2.0: the write succeeds, but the dir is b=2, so the read will fail

val df = Seq((1, 2, 3)).toDF("a", "b", "c")
df.write.mode("append").saveAsTable("src")

1.6: success, but actually no data is inserted
2.0: success, but actually no data is inserted (after checking the data folder, a data file does exist, but there is no partition dir)

val df = Seq((1, 2, 3)).toDF("c", "b", "a")
df.write.mode("append").saveAsTable("src")

1.6: success, but actually no data is inserted
2.0: success, but actually no data is inserted (after checking the data folder, a data file does exist, but there is no partition dir)

@ToniYang
So, what is a datasource table and what is a Hive table? When I use saveAsTable("test"), is test a datasource table or a Hive table?