Skip to content

Instantly share code, notes, and snippets.

@ebuildy
Last active January 23, 2020 02:21
Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 4 You must be signed in to fork a gist
  • Save ebuildy/3de0e2855498e5358e4eed1a4f72ea48 to your computer and use it in GitHub Desktop.
Save ebuildy/3de0e2855498e5358e4eed1a4f72ea48 to your computer and use it in GitHub Desktop.
Flatten Spark data frame fields structure, via SQL in Java
class Toto
{
public void Main()
{
final DataFrame source = GetDataFrame();
final String querySelectSQL = flattenSchema(source.schema(), null);
source.registerTempTable("source");
final DataFrame flattenData = sqlContext.sql("SELECT " + querySelectSQL + " FROM source")
}
/**
* Generate SQL to select columns as flat.
*/
public String flattenSchema(StructType schema, String prefix)
{
final StringBuilder selectSQLQuery = new StringBuilder();
for (StructField field : schema.fields())
{
final String fieldName = field.name();
if (fieldName.startsWith("@"))
{
continue;
}
String colName = prefix == null ? fieldName : (prefix + "." + fieldName);
String colNameTarget = colName.replace(".", "_");
if (field.dataType().getClass().equals(StructType.class))
{
selectSQLQuery.append(flattenSchema((StructType) field.dataType(), colName));
}
else
{
selectSQLQuery.append(colName);
selectSQLQuery.append(" as ");
selectSQLQuery.append(colNameTarget);
}
selectSQLQuery.append(",");
}
if (selectSQLQuery.length() > 0)
{
selectSQLQuery.deleteCharAt(selectSQLQuery.length() - 1);
}
return selectSQLQuery.toString();
}
}
@dharadhruve
Copy link

If I am having array field, like bellow than above code doesn't expand fields, so what should i need to change??
this is my schema:
root
|-- personalInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- studentId: integer (nullable = true)
| | |-- studentName: string (nullable = true)
| | |-- studentAddress: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- streetName: string (nullable = true)
| | | | |-- city: string (nullable = true)
| | |-- studentContactNo: integer (nullable = true)
| | |-- studentBranch: string (nullable = true)
|-- gradeInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- studentId: integer (nullable = true)
| | |-- studentSemester: string (nullable = true)
| | |-- studentCPI: double (nullable = true)

output of above code is :
personalInfo as personalInfo,gradeInfo as gradeInfo
+--------------------+---------------+
| personalInfo| gradeInfo|
+--------------------+---------------+
|[[1, dhara, [[TP-...|[[1, 3rd, 8.0]]|
|[[1, dhara, [[TP-...|[[1, 3rd, 8.7]]|
|[[2, Ankita, [[TP...|[[2, 3rd, 7.0]]|
|[[2, Ankita, [[TP...|[[2, 3rd, 7.7]]|
|[[3, Shreya, [[TP...|[[3, 3rd, 6.0]]|
|[[3, Shreya, [[TP...|[[3, 3rd, 6.7]]|
+--------------------+---------------+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment