【Paper Reading】Lance: Efficient Random Access in Columnar Storage through Adaptive Structural Encodings
Mar 14, 2026 - #Research
One-Line Takeaway
What this paper is really arguing is not "Lance is faster than Parquet." The real claim is: the key to random access performance isn't just the compression algorithm — it's how structural information is laid out on disk. Arrow looks too much like an in-memory format. Parquet is too optimized for sequential scans. Lance treats "structural encoding" as a first-class design dimension, with the explicit goal of handling both point lookups and full scans on NVMe simultaneously.
The Core Problem
Traditional columnar formats are designed around full scans by default. But AI and retrieval workloads need both: bulk data sweeps and high-volume, fine-grained random access.
On NVMe, the bottleneck is no longer just bandwidth. It's:
- How many I/Os does it take to retrieve a single value?
- How much irrelevant data does each I/O pull in?
- Can the CPU decode fast enough to keep the disk queue saturated?
- How much metadata needs to stay resident in memory for fast lookup?
The paper unifies all of these under a single concept: structural encoding, that is, how nesting, nullability, variable-length values, and page boundaries are actually encoded.
Lance, Arrow, and Parquet: Where They Stand
What They Share
- All three are fundamentally columnar formats that shred nested data into leaf columns and apply compression on top.
- All three make tradeoffs between compression ratio, scan throughput, and random access latency.
- All three must preserve structural metadata when handling nested types — they just do so differently.
The Critical Differences
| Format | Design Center | Structural Organization | Random Access Cost | Main Weakness |
|---|---|---|---|---|
| Arrow | Stay close to in-memory representation, minimize copies | Multiple independent buffers: validity, offsets, values | A single value may span multiple buffers requiring staged fetches | Nested/variable-length types generate too many IOPS on disk |
| Parquet | Compression ratio and full scans | rep/def + sparse values + pages | Usually one page per row lookup | Page/read amplification, large page index cache, painful parameter tuning |
| Lance | Treats structural encoding as an independent design dimension | Adaptively selects full-zip or miniblock based on data width | Fixed-width aims for 1 I/O; variable-length aims for 2 | Higher implementation complexity; some small types carry extra decode cost |
Arrow, Parquet, Lance: Getting to the Point
1. Why Arrow's Strengths Don't Transfer to Disk
Arrow's power is in "organizing data like memory." That gives you:
- Near-zero page-level read amplification for simple, uncompressed types
- No need for a page index or dedicated search cache
- Very lightweight CPU decode
But the problems are equally direct:
- A nested or variable-length value is typically spread across multiple buffers
- These buffers have sequential dependencies — you can't jump straight to a target position
- A point lookup therefore isn't "1 random read"; it's "multiple, staged, hard-to-merge random reads"
The issue with Arrow isn't weak compression. It's that its structure is too scattered from a disk perspective. Multiple buffers are fine in memory; on NVMe or S3, multiple buffers become extra IOPS.
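The staged-read problem can be made concrete with a toy model (not the real Arrow API): an Arrow-style variable-length column keeps validity, offsets, and values in separate buffers, so one point lookup becomes several dependent reads. The buffer names and the read counter below are illustrative assumptions.

```python
class CountingDisk:
    """Wraps raw buffers and counts every read as one simulated I/O."""
    def __init__(self, buffers):
        self.buffers = buffers
        self.reads = 0

    def read(self, name, start, length):
        self.reads += 1
        return self.buffers[name][start:start + length]

def lookup_string(disk, i):
    # 1) validity bitmap: is value i null?
    byte = disk.read("validity", i // 8, 1)[0]
    if not (byte >> (i % 8)) & 1:
        return None
    # 2) offsets buffer: need offsets[i] and offsets[i+1] (one 8-byte read)
    off = disk.read("offsets", 4 * i, 8)
    start = int.from_bytes(off[:4], "little")
    end = int.from_bytes(off[4:], "little")
    # 3) values buffer: only now can we fetch the bytes themselves
    return disk.read("values", start, end - start).decode()

offsets = b"".join(n.to_bytes(4, "little") for n in (0, 3, 6, 9, 12))
disk = CountingDisk({"validity": b"\x0f", "offsets": offsets, "values": b"foobarbazqux"})
print(lookup_string(disk, 2), disk.reads)  # "baz" after 3 dependent reads
```

Each read depends on the previous one's result, so they cannot be issued in parallel or merged, which is exactly the "multiple, staged, hard-to-merge random reads" problem.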
2. Why Parquet's Random Access Isn't That Bad — But Still Not Good Enough
Parquet is more "disk-aware" than Arrow:
- rep/def and values are co-located within a page
- Which page contains a given row can be found quickly via the page offset index
- For many types, a point lookup reads exactly one page
This explains one of the paper's most important observations: with the right configuration, Parquet's random access isn't bad — it can actually be quite strong.
But that strength comes with conditions:
- Small pages → better random access (less read amplification)
- Large pages → easier to benefit from coalesced I/O for scans, but point lookups pull in too much garbage
- Wide fields with small pages → page count explodes, and the page index eats huge amounts of memory
- Row group size also significantly affects scan throughput, and different types have different preferences
Parquet's core tradeoff is: You buy random access capability with page-level chunking, but you also import read amplification, a large search cache, and deeply painful row-group/page-size tuning complexity.
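The tradeoff can be sketched with back-of-envelope arithmetic. The constants below (column size, bytes per index entry) are made-up assumptions for illustration, not measurements from the paper.

```python
def page_tradeoff(column_bytes, page_bytes, index_entry_bytes=16):
    """Toy model: one page read per point lookup; one resident index
    entry per page. Shrinking pages shrinks read amplification but
    grows the in-memory page index proportionally."""
    pages = column_bytes // page_bytes
    return {
        "read_amp_per_lookup": page_bytes,
        "index_bytes_in_memory": pages * index_entry_bytes,
    }

wide_column = 100 * 2**30                       # a 100 GiB column (assumed)
small = page_tradeoff(wide_column, 8 * 2**10)   # 8 KiB pages
large = page_tradeoff(wide_column, 1 * 2**20)   # 1 MiB pages

# Small pages: tiny read amplification, but a ~200 MiB resident index.
# Large pages: tiny index, but every lookup drags in a full MiB.
print(small["index_bytes_in_memory"] // 2**20, "MiB index with 8 KiB pages")
print(large["read_amp_per_lookup"] // 2**10, "KiB pulled per lookup with 1 MiB pages")
```

Whatever page size you pick, one of the two costs grows, which is why the tuning is described as painful.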
3. Where Lance's Innovation Actually Lives
Lance doesn't pick a side between Arrow and Parquet. It disassembles both and recombines them:
- For wide values: use full-zip encoding
- For narrow values: use miniblock encoding
This is the core innovation: not one fixed structural encoding for the entire file, but adaptive selection based on data width.
Abstractly:
- full-zip aims to bundle all the structural information and data needed for a single value access as tightly as possible
- miniblock is closer to "small-page Parquet": it accepts a small amount of read amplification in exchange for lighter CPU overhead during scans
Lance therefore doesn't have one fixed tradeoff. It splits the tradeoff into two strategies applied locally to different data types.
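A minimal sketch of the adaptive-selection idea: pick a structural encoding per column from an estimated value width. The threshold and names are illustrative assumptions; Lance's actual heuristics differ.

```python
FULL_ZIP, MINIBLOCK = "full-zip", "miniblock"
WIDTH_THRESHOLD = 128  # bytes; an assumed cutoff, not Lance's real value

def choose_structural_encoding(sample_values):
    """Wide values -> full-zip (targets 1-2 I/Os per point lookup);
    narrow values -> miniblock (cheap scans, block-granular lookups)."""
    avg_width = sum(len(v) for v in sample_values) / max(len(sample_values), 1)
    return FULL_ZIP if avg_width >= WIDTH_THRESHOLD else MINIBLOCK

print(choose_structural_encoding([b"x" * 4096, b"y" * 8192]))  # full-zip
print(choose_structural_encoding([b"ab", b"cd", b"ef"]))       # miniblock
```

The decision is local to each column (or even each page of data), which is what lets one file hold both tradeoff points at once.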
What These Innovations Actually Change
What Gets Better
Random access on nested/variable-length data is more stable. Arrow-style nesting keeps adding I/Os with each level; Lance tries to flatten that growth.
Random access on wide fields is friendlier. Parquet needs a large page index in memory when pages are kept small for wide fields; Lance's full-zip doesn't depend on a page-level search cache.
Full scan throughput is competitive with Parquet — often better. Not because Lance has magical compression, but because it's better at keeping the disk queue full continuously.
One configuration works across more types. No need to repeatedly tune row group size and page size the way you do with Parquet.
What Gets Worse
Point lookups on simple narrow types may not beat Arrow-style access. Miniblock allows opaque/dense compression, but the cost is that retrieving a single value requires decoding the entire block first.
Some full-zip layouts waste space. Fixed-width types need to store a slot for null values to support direct positional access — they can't fully sparse-encode nulls the way Parquet does.
Implementation complexity is significantly higher. You now have two structural encoding schemes, different index structures, and different chunking rules all cooperating.
Some scenarios may become CPU-bound. The paper notes that the LZ4 path still has headroom for compression/decompression optimization, and small accesses carry some syscall and alignment overhead.
Why These Tradeoffs Exist
- Arrow is fast because it decodes lightly; slow because it generates many I/Os
- Parquet is fast because it generates fewer I/Os; slow because of page amplification + page index + CPU decode + tuning overhead
- Lance separates "the I/O problem for wide data" from "the CPU problem for narrow data" through adaptive structural encoding
In other words, Lance doesn't eliminate the tradeoff. It changes the tradeoff from "one compromise point for the entire file" to "local optimums per data type."
Unpacking the Five Mechanisms Behind the Conclusion
The paper's conclusion can be decomposed into five concrete mechanisms.
1. Smaller Search Cache
The reason isn't that Lance has no index at all. It's that indexes appear only where necessary, at a finer granularity.
- For wide values in full-zip, random access relies on a repetition index, not Parquet's approach of keeping a per-page offset index resident in memory for every page
- For narrow values in miniblock, only compact chunk metadata is stored; each chunk covers dozens of values, so the metadata amortizes well
- Auxiliary structures like dictionaries and symbol tables only appear when the relevant encoding is active
Lance's advantage isn't "zero metadata." It's avoiding the explosion of page indexes that Parquet suffers when combining wide fields with small page sizes.
2. No Need to Configure Row Group Sizes
Because Lance's random access capability comes from:
- Per-value positional access via full-zip
- Small-block positional access via miniblock
- Local navigation via repetition index / chunk metadata
...rather than from "getting the row group and page sizes right" as in Parquet. Parallel scan also doesn't depend on the row group concept, so there's no need to repeatedly compromise between point lookup and full scan on a single row group size parameter.
3. Struct Packing
Lance separates "structural encoding" from "column compression":
- First compress each struct field independently in columnar fashion
- Then pack multiple fields together as a single logical column
This preserves the benefits of columnar compression while reducing the number of I/Os needed for a random access to an entire struct. The cost is losing fine-grained projection — scanning a single field forces reading the whole struct.
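A toy illustration of the packing step: each field is encoded on its own, then the encoded fields for each row are laid out adjacently so one contiguous read fetches the whole struct. The encoding here is just fixed-width bytes; real Lance applies per-field compression first.

```python
import struct

def pack_struct_column(xs, ys):
    """xs, ys: two int32 struct fields. Returns one packed byte column
    plus the fixed row width."""
    rows = [struct.pack("<ii", x, y) for x, y in zip(xs, ys)]
    return b"".join(rows), struct.calcsize("<ii")

def read_struct(packed, row_width, i):
    # One contiguous read returns every field of row i ...
    blob = packed[i * row_width:(i + 1) * row_width]
    return struct.unpack("<ii", blob)

packed, width = pack_struct_column([1, 2, 3], [10, 20, 30])
print(read_struct(packed, width, 1))  # (2, 20): the whole struct in one read
# ... but projecting only the first field still pays for the second
# field's bytes, which is the projection cost noted above.
```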
4. Bridging Lists Across Pages
Lance's miniblock does not require a chunk to start at a top-level record boundary — rows can straddle chunk boundaries. This means lists don't need to be hard-cut at page boundaries the way many traditional page organizations require; they can span across chunks.
This is possible because the repetition index records completion counts both inside and outside chunks, making it possible to recover "where this list started, where it continues, and where it ends."
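The mechanism can be sketched as follows: elements are chunked without regard to list boundaries, and per chunk we record the cumulative count of completed lists, which is enough to later find every chunk a given list touches. The structure names are illustrative, not Lance's on-disk format.

```python
from bisect import bisect_left
from itertools import accumulate

def build_rep_index(list_lengths, chunk_elems):
    ends = list(accumulate(list_lengths))  # element position where each list ends
    chunk_ends = list(range(chunk_elems, ends[-1] + chunk_elems, chunk_elems))
    # per chunk: how many lists have fully completed by the chunk's end
    completed = [bisect_left(ends, c + 1) for c in chunk_ends]
    return ends, completed

def chunks_for_list(ends, completed, k, chunk_elems):
    """Which chunks must be fetched to reassemble list k."""
    end_chunk = bisect_left(completed, k + 1)   # chunk holding the list's end
    start_elem = ends[k - 1] if k > 0 else 0
    start_chunk = start_elem // chunk_elems     # chunk holding the list's start
    return list(range(start_chunk, end_chunk + 1))

ends, completed = build_rep_index([3, 5, 2, 4], chunk_elems=4)
# list 1 covers elements 3..7, straddling the chunk 0 / chunk 1 boundary
print(chunks_for_list(ends, completed, 1, 4))  # [0, 1]
```

Because no chunk has to begin at a list boundary, chunk sizes stay uniform even when individual lists are huge.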
5. Nested Repetition Indices
Parquet's rep/def primarily solves "how to reconstruct nested structure." Lance goes further and encodes "how to perform nested random navigation" into the index as well.
The repetition index records not just top-level row boundaries, but navigation across nesting depths:
- How many lists have been completed at the top level
- How many lists have been completed at the next level down
- How many elements have accumulated at the bottom level
So Lance doesn't just support locating row x; in principle it supports navigating deeper structures like array[x][y][z]. This is the substance of what "nested repetition indices" means.
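The navigation idea can be sketched with per-level cumulative counts; each level's index translates a position at that level into a range at the level below. This is illustrative of the concept, not Lance's actual index layout.

```python
from itertools import accumulate

# data[x][y] -> list of ints; two nesting levels above the leaf values
data = [[[1, 2], [3]], [[4], [5, 6, 7]]]

# level-1 index: inner lists completed after each top-level row
lvl1 = [0] + list(accumulate(len(row) for row in data))
# level-2 index: leaf elements completed after each inner list
inner = [lst for row in data for lst in row]
lvl2 = [0] + list(accumulate(len(lst) for lst in inner))
flat = [v for lst in inner for v in lst]

def navigate(x, y):
    """Locate data[x][y] in the flat leaf stream using only the indices."""
    inner_idx = lvl1[x] + y                 # which inner list, globally
    start, end = lvl2[inner_idx], lvl2[inner_idx + 1]
    return flat[start:end]

print(navigate(1, 1))  # [5, 6, 7], without decoding any other list
```

Each lookup is a couple of index reads plus one slice of the leaf stream, regardless of how much data sits in the other rows.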
Compression in Lance: What's Done vs. What's Left
Structural Layer
Control word (in full-zip, before each value): packs repetition and definition level bits into 1–4 bytes. This is structural compression, not value compression — its purpose is to minimize the on-disk footprint of nesting boundaries and null state while keeping each value directly addressable.
Offsets rewritten as lengths (in full-zip variable-length value area): Arrow-style offsets are converted to per-value lengths, then bit-packed. This avoids chasing two offsets to locate a value.
Repetition index: primarily a navigation index, but internally bit-packed. It maps rows to byte offsets and chunk positions, and records nesting boundaries.
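Toy versions of two of the full-zip tricks above: packing repetition/definition levels into a small control word, and rewriting Arrow offsets as per-value lengths. The bit widths are illustrative assumptions, not Lance's on-disk layout.

```python
def pack_control(rep, dlev, def_bits=2):
    """One control word carrying both levels; real ones span 1-4 bytes
    depending on the schema's nesting depth."""
    return (rep << def_bits) | dlev

def unpack_control(word, def_bits=2):
    return word >> def_bits, word & ((1 << def_bits) - 1)

def offsets_to_lengths(offsets):
    """With lengths stored inline, locating value i no longer requires
    fetching both offsets[i] and offsets[i+1]."""
    lengths = [b - a for a, b in zip(offsets, offsets[1:])]
    bit_width = max(lengths).bit_length()  # pack each length this tightly
    return lengths, bit_width

print(unpack_control(pack_control(1, 2)))    # (1, 2)
print(offsets_to_lengths([0, 3, 6, 9, 12]))  # ([3, 3, 3, 3], 2)
```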
Value Layer
FSST: mainly for string value buffers; keeps single values independently decodable. Examples from the paper: prompts, reviews, names (sometimes combined with dictionary).
Dictionary encoding: effective for low-cardinality strings. The paper uses Dictionary + FSST for names. Adds a table lookup step on the random access path.
Bitpacking: for integer and date value buffers, or for length fields. Example: dates in the paper's benchmarks.
LZ4: for value buffers of large objects; per-value LZ4 framing preserves single-value independent decompression. Examples: code, images, websites.
Opaque chunk compression (inside miniblock chunks): since miniblock doesn't fully interleave all buffers into per-value layout, it can allow opaque compression. Better scan vectorization, more compression options — but point lookups must decode the whole block first.
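The point-lookup cost of opaque compression can be modeled directly; zlib stands in here for whatever opaque codec a chunk uses, and the chunk size is an assumption.

```python
import struct
import zlib

CHUNK = 64  # values per chunk (assumed)

def encode_miniblocks(values):
    """Compress int32 values in opaque chunks of CHUNK values each."""
    chunks = []
    for i in range(0, len(values), CHUNK):
        raw = b"".join(struct.pack("<i", v) for v in values[i:i + CHUNK])
        chunks.append(zlib.compress(raw))
    return chunks

def point_lookup(chunks, i):
    # must inflate the entire chunk even though we want one 4-byte value
    raw = zlib.decompress(chunks[i // CHUNK])
    return struct.unpack_from("<i", raw, 4 * (i % CHUNK))[0]

chunks = encode_miniblocks(list(range(1000)))
print(point_lookup(chunks, 777))  # 777, after decoding its whole chunk
```

Scans amortize that decompression over every value in the chunk, which is why the same design that hurts point lookups helps throughput.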
Metadata/Cache Layer
Miniblock metadata: buffer counts, per-buffer sizes, chunk dimensions. Determines whether the search cache is actually small.
Symbol tables / dictionaries: auxiliary structures for FSST and dictionary encodings. The paper treats these as the kind of data that may need to be cached for random access.
Already Done vs. Still Has Room
Already implemented: full-zip, miniblock, control word bitpacking, length bitpacking, repetition index, FSST, dictionary, bitpacking, LZ4, struct packing.
Explicitly noted but not yet complete:
- Run-length encoding on full-zip: the design is discussed in the paper but noted as not implemented in Lance 2.1
- Search cache compression: the paper believes further compression would be straightforward
- More aggressive transparent compression combos: particularly replacing some opaque miniblock encodings with transparent ones to reduce point-lookup decode cost
- Further LZ4 path optimization: some scenarios are already approaching CPU-bound
- Better chunk sizing and disk alignment: current chunk size and alignment rules are still heuristic
Three Things Worth Remembering
- The most important contribution of this paper is the analytical framework of "structural encoding." It cleanly separates problems that used to all get lumped together under "file format design."
- The difference between Arrow, Parquet, and Lance isn't about who compresses more aggressively. It's about where each puts its priorities: I/O count, read amplification, CPU decode cost, and metadata memory.
- Lance's genuine novelty isn't any single encoding technique. It's "adaptively switching structural encoding based on data width." This frees it from committing the entire file to a single tradeoff point.
#AI Infrastructure, #Columnar Storage, #File Format, #NVMe, #Random Access