[SPARK-16847][SQL] Prevent potentially reading corrupt statistics on binary in Parquet vectorized reader

## What changes were proposed in this pull request?

This problem was first reported in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251), and at that time we disabled filter pushdown on binary columns in Spark. We re-enabled it after upgrading Parquet, but there appears to be a potential incompatibility with Parquet files written by older Spark versions.

Currently, this does not happen in the normal Parquet reader. In Spark, however, we implemented a vectorized reader separately from Parquet's standard API. The normal Parquet reader already handles this case, but the vectorized reader does not.

It is enough to pass `FileMetaData` to the `ParquetFileReader` constructor. parquet-mr already handles this case when the file metadata is available (see apache/parquet-java@e3b9502), so this prevents corrupt statistics from being loaded for each page in Parquet.

This PR replaces the use of the deprecated `ParquetFileReader` constructor.
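
To illustrate why the reader needs the file metadata, here is a minimal, self-contained sketch of a writer-version check. It is not parquet-mr's actual code: the class `BinaryStatsCheck`, the method `shouldIgnoreBinaryStats`, and the 1.8.0 cutoff are assumptions made for illustration. The idea is that the writer's `created_by` string lives in the file metadata, so a reader that never receives it cannot tell whether binary min/max statistics can be trusted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Spark or parquet-mr.
final class BinaryStatsCheck {
    // Matches writer strings such as "parquet-mr version 1.6.0 (build abc123)".
    private static final Pattern CREATED_BY =
        Pattern.compile("^(\\S+) version (\\d+)\\.(\\d+)\\.(\\d+)");

    /** Returns true when min/max statistics on a binary column should be discarded. */
    static boolean shouldIgnoreBinaryStats(String createdBy) {
        if (createdBy == null || createdBy.isEmpty()) {
            // Writer unknown: this sketch takes the conservative path and drops the stats.
            return true;
        }
        Matcher m = CREATED_BY.matcher(createdBy);
        if (!m.find()) {
            // Unparseable writer string: also treated as untrusted.
            return true;
        }
        if (!"parquet-mr".equals(m.group(1))) {
            // Assumption: only files written by parquet-mr are affected.
            return false;
        }
        int major = Integer.parseInt(m.group(2));
        int minor = Integer.parseInt(m.group(3));
        // Assumption: writers older than 1.8.0 may have produced corrupt binary stats.
        return major < 1 || (major == 1 && minor < 8);
    }

    public static void main(String[] args) {
        System.out.println(shouldIgnoreBinaryStats("parquet-mr version 1.6.0 (build abc123)"));
        System.out.println(shouldIgnoreBinaryStats("parquet-mr version 1.8.1 (build abc123)"));
        System.out.println(shouldIgnoreBinaryStats(null));
    }
}
```

In this sketch a missing `created_by` value is treated as untrusted, which trades some filter-pushdown opportunities for safety. A real reader may make a different choice.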

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#14450 from HyukjinKwon/SPARK-16847.
HyukjinKwon authored and srowen committed Aug 6, 2016
1 parent e679bc3 commit 55d6dad
Showing 1 changed file with 4 additions and 2 deletions.
@@ -140,7 +140,8 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont
     String sparkRequestedSchemaString =
         configuration.get(ParquetReadSupport$.MODULE$.SPARK_ROW_REQUESTED_SCHEMA());
     this.sparkSchema = StructType$.MODULE$.fromString(sparkRequestedSchemaString);
-    this.reader = new ParquetFileReader(configuration, file, blocks, requestedSchema.getColumns());
+    this.reader = new ParquetFileReader(
+        configuration, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
     for (BlockMetaData block : blocks) {
       this.totalRowCount += block.getRowCount();
     }
@@ -204,7 +205,8 @@ protected void initialize(String path, List<String> columns) throws IOException
       }
     }
     this.sparkSchema = new ParquetSchemaConverter(config).convert(requestedSchema);
-    this.reader = new ParquetFileReader(config, file, blocks, requestedSchema.getColumns());
+    this.reader = new ParquetFileReader(
+        config, footer.getFileMetaData(), file, blocks, requestedSchema.getColumns());
     for (BlockMetaData block : blocks) {
       this.totalRowCount += block.getRowCount();
     }