diff --git a/PROTOCOL.md b/PROTOCOL.md
index e6bbe20c5e3..64435fcc11a 100644
--- a/PROTOCOL.md
+++ b/PROTOCOL.md
@@ -1708,7 +1708,7 @@ numRecords | The number of records in this data file.
 tightBounds | Whether per-column statistics are currently **tight** or **wide** (see below).
 
 For any logical file where `deletionVector` is not `null`, the `numRecords` statistic *must* be present and accurate. That is, it must equal the number of records in the data file, not the valid records in the logical file.
 
-In the presence of [Deletion Vectors](#Deletion-Vectors) the statistics may be somewhat outdated, i.e. not reflecting deleted rows yet. The flag `stats.tightBounds` indicates whether we have **tight bounds** (i.e. the min/maxValue exists[^1] in the valid state of the file) or **wide bounds** (i.e. the minValue is <= all valid values in the file, and the maxValue >= all valid values in the file). These upper/lower bounds are sufficient information for data skipping.
+In the presence of [Deletion Vectors](#Deletion-Vectors) the statistics may be somewhat outdated, i.e. not reflecting deleted rows yet. The flag `stats.tightBounds` indicates whether we have **tight bounds** (i.e. the min/maxValue exists[^1] in the valid state of the file) or **wide bounds** (i.e. the minValue is <= all valid values in the file, and the maxValue >= all valid values in the file). These upper/lower bounds are sufficient information for data skipping. Note, `stats.tightBounds` is evaluated to `true` when it is not present in the statistics.
 
 Per-column statistics record information for each column in the file and they are encoded, mirroring the schema of the actual data. For example, given the following data schema: