Skip to content

Encoding Definitions


Plain

No compression is applied to the data. Depending on the data type, values are stored in the following formats:

  • Boolean: Least-significant bit ordering within bytes
  • Integer: Little-Endian byte order
  • Long: Little-Endian byte order
  • Float: IEEE754 floating-point numbers in Little-Endian byte order
  • Double: IEEE754 floating-point numbers in Little-Endian byte order
  • String: Length and data streams

Boolean-RLE

This encoding compresses boolean columns using least-significant bit numbering (bit-endianness). Refer to the ORC specification for implementation details.

Byte-RLE

This encoding compresses byte streams, such as the GeometryType stream in Geometry columns. Refer to the ORC specification for implementation details.

Dictionary Encoding

Dictionary encoding compactly represents repeated values and can be applied to String and Geometry columns. Distinct values are stored in a dictionary stream, while a separate data stream stores indices into the dictionary. We allow the following encodings of this concept:

String Dictionary Encoding

The dictionary stream contains distinct UTF-8 encoded string values. The data stream contains dictionary indices encoded as UInt32. A dictionary-encoded nullable string column consists of the following streams in order: Present, Length, Dictionary, Data

Note

TODO: Is "can" (=> MAY) here the right terminoligy? This implies that streams might be encoded by one of the options below, but also other options.

If yes, what is the alternative options? If no, clarify this misunderstanding.

All streams can be further compressed using lightweight encodings:

  • Present Stream: Boolean-RLE
  • Length and Data Stream: See Integer encodings
  • Dictionary: See FSST Dictionary

FSST Dictionary Encoding

Dictionary encoding requires fully repeating strings to reduce size. However, geospatial attributes often contain strings with common prefixes that are not identical (e.g., localized country names). FSST replaces frequently occurring substrings while supporting efficient scans and random lookups. It further compresses UTF-8 encoded dictionary values in MLT. An FSST dictionary-encoded nullable string column consists of the following streams in order: Present, SymbolLength, SymbolTable, String Length, Dictionary (Compressed Corpus) For implementation details, refer to this paper.

Available implementations:

Shared Dictionary Encoding

Shared dictionary encoding allows multiple columns to share a common dictionary. Nested fields can also use a shared dictionary (e.g., for localized values like name:* columns in an OSM dataset). If used for nested fields, all fields sharing the dictionary must be grouped together in the file and prefixed with the dictionary.

A shared dictionary-encoded nullable string column consists of the following streams in order: Length, Dictionary, Present1, Data1, Present2, Data2

A shared FSST dictionary-encoded nullable string column consists of the following streams in order: SymbolLength, SymbolTable, String Length, Dictionary (Compressed Corpus), Present1, Data1, Present2, Data2

Vertex Dictionary Encoding

Uses an additional VertexOffsets stream to store indices for vertex coordinates in the VertexBuffer stream. Vertices in the VertexBuffer are sorted using a Hilbert curve and delta-encoded with null suppression.

Morton Vertex Dictionary Encoding

VertexBuffer coordinates are transformed into a 1D integer using Morton code. The data is sorted by Morton code and further compressed using integer compression techniques.

Integer Encodings

Most data in MLT is stored as integer arrays, making efficient integer compression crucial.

Logical-Level Techniques

Integers are encoded using delta encoding, run-length encoding (RLE), or a combination of both (delta-RLE).

Delta Encoding

Computes differences between consecutive elements (e.g., x₂ - x₁, x₃ - x₂, ...). Used with physical-level techniques to reduce the number of bits required for storing delta values.

Run-Length Encoding (RLE)

Refer to Wikipedia for a basic explanation. Runs and values are stored in separate buffers. For unsigned integers, ZigZag encoding is applied to the values buffer. Both buffers are further compressed using null suppression.

Delta-RLE

Applies delta encoding followed by RLE. Efficient for ascending sequences like id fields.

Physical-Level Techniques

Null suppression techniques are used to compress integer arrays by reducing the number of bits required to store each integer.

ZigZag Encoding

Used for encoding unsigned integers in null suppression techniques. ZigZag encoding uses the least significant bit to represent the sign.

Varint Encoding

A byte-aligned null suppression technique that compresses integers using a minimal number of bytes. For implementation details, refer to Protobuf.

SIMD-FastPFOR

A bit-aligned null suppression technique that compresses integers using a minimal number of bits. Uses a patched approach to store exceptions (outliers) separately, keeping the overall bit width small.

Refer to: - https://arxiv.org/pdf/1209.2137.pdf - https://ayende.com/blog/199524-C/integer-compression-the-fastpfor-code

Available implementations:

Float Encodings

Adaptive Lossless Floating-Point Compression (ALP)

A lossless floating-point compression technique used in MLT. For a detailed explanation, refer to this paper.

Available implementations: