【Paper Reading】Lance: Efficient Random Access in Columnar Storage through Adaptive Structural Encodings
Mar 14, 2026 - #Research
One-Line Takeaway
What this paper is really arguing is not "Lance is faster than Parquet." The real claim is: the key to random access performance isn't just the compression algorithm — it's how structural information is laid out on disk. Arrow looks too much like an in-memory format. Parquet is too optimized for sequential scans. Lance treats "structural encoding" as a first-class design dimension, with the explicit goal of handling both point lookups and full scans on NVMe simultaneously.
The Core Problem
Traditional columnar formats are designed around full scans by default. But AI and retrieval workloads need both: bulk data sweeps and high-volume, fine-grained random access.
On NVMe, the bottleneck is no longer just bandwidth. It's:
- How many I/Os does it take to retrieve a single value?
- How much irrelevant data does each I/O pull in?
- Can the CPU decode fast enough to keep the disk queue saturated?
- How much metadata needs to stay resident in memory for fast lookup?
The paper unifies all of these under a single concept: structural encoding, that is, how nesting, nullability, variable-length values, and page boundaries are actually encoded.
Lance, Arrow, and Parquet: Where They Stand
What They Share
- All three are fundamentally columnar formats that shred nested data into leaf columns and apply compression on top.
- All three make tradeoffs between compression ratio, scan throughput, and random access latency.
- All three must preserve structural metadata when handling nested types — they just do so differently.
The Critical Differences
| Format | Design Center | Structural Organization | Random Access Cost | Main Weakness |
|---|---|---|---|---|
| Arrow | Stay close to in-memory representation, minimize copies | Multiple independent buffers: validity, offsets, values | A single value may span multiple buffers requiring staged fetches | Nested/variable-length types generate too many IOPS on disk |
| Parquet | Compression ratio and full scans | rep/def + sparse values + pages | Usually one page per row lookup | Page/read amplification, large page index cache, painful parameter tuning |
| Lance | Treats structural encoding as an independent design dimension | Adaptively selects full-zip or miniblock based on data width | Fixed-width aims for 1 I/O; variable-length aims for 2 | Higher implementation complexity; some small types carry extra decode cost |
Arrow, Parquet, Lance: Getting to the Point
1. Why Arrow's Strengths Don't Transfer to Disk
Arrow's power is in "organizing data like memory." That gives you:
- Near-zero page-level read amplification for simple, uncompressed types
- No need for a page index or dedicated search cache
- Very lightweight CPU decode
But the problems are equally direct:
- A nested or variable-length value is typically spread across multiple buffers
- These buffers have sequential dependencies — you can't jump straight to a target position
- A point lookup therefore isn't "1 random read"; it's "multiple, staged, hard-to-merge random reads"
The issue with Arrow isn't weak compression. It's that its structure is too scattered from a disk perspective. Multiple buffers are fine in memory; on NVMe or S3, multiple buffers become extra IOPS.
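The staged-read problem can be made concrete with a toy model (not the real Arrow API): an Arrow-style variable-length column keeps validity, offsets, and values in separate buffers, so one point lookup becomes several dependent reads. The buffer names and the read counter below are illustrative assumptions.

```python
class CountingDisk:
    """Wraps raw buffers and counts every read as one simulated I/O."""
    def __init__(self, buffers):
        self.buffers = buffers
        self.reads = 0

    def read(self, name, start, length):
        self.reads += 1
        return self.buffers[name][start:start + length]

def lookup_string(disk, i):
    # 1) validity bitmap: is value i null?
    byte = disk.read("validity", i // 8, 1)[0]
    if not (byte >> (i % 8)) & 1:
        return None
    # 2) offsets buffer: need offsets[i] and offsets[i+1] (one 8-byte read)
    off = disk.read("offsets", 4 * i, 8)
    start = int.from_bytes(off[:4], "little")
    end = int.from_bytes(off[4:], "little")
    # 3) values buffer: only now can we fetch the bytes themselves
    return disk.read("values", start, end - start).decode()

offsets = b"".join(n.to_bytes(4, "little") for n in (0, 3, 6, 9, 12))
disk = CountingDisk({"validity": b"\x0f", "offsets": offsets, "values": b"foobarbazqux"})
print(lookup_string(disk, 2), disk.reads)  # "baz" after 3 dependent reads
```

Each read depends on the previous one's result, so they cannot be issued in parallel or merged, which is exactly the "multiple, staged, hard-to-merge random reads" problem.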
2. Why Parquet's Random Access Isn't That Bad — But Still Not Good Enough
Parquet is more "disk-aware" than Arrow:
- rep/def and values are co-located within a page
- Which page contains a given row can be found quickly via the page offset index
- For many types, a point lookup reads exactly one page
This explains one of the paper's most important observations: with the right configuration, Parquet's random access isn't bad — it can actually be quite strong.
But that strength comes with conditions:
- Small pages → better random access (less read amplification)
- Large pages → easier to benefit from coalesced I/O for scans, but point lookups pull in too much garbage
- Wide fields with small pages → page count explodes, and the page index eats huge amounts of memory
- Row group size also significantly affects scan throughput, and different types have different preferences
Parquet's core tradeoff is: You buy random access capability with page-level chunking, but you also import read amplification, a large search cache, and deeply painful row-group/page-size tuning complexity.
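The tradeoff can be sketched with back-of-envelope arithmetic. The constants below (column size, bytes per index entry) are made-up assumptions for illustration, not measurements from the paper.

```python
def page_tradeoff(column_bytes, page_bytes, index_entry_bytes=16):
    """Toy model: one page read per point lookup; one resident index
    entry per page. Shrinking pages shrinks read amplification but
    grows the in-memory page index proportionally."""
    pages = column_bytes // page_bytes
    return {
        "read_amp_per_lookup": page_bytes,
        "index_bytes_in_memory": pages * index_entry_bytes,
    }

wide_column = 100 * 2**30                       # a 100 GiB column (assumed)
small = page_tradeoff(wide_column, 8 * 2**10)   # 8 KiB pages
large = page_tradeoff(wide_column, 1 * 2**20)   # 1 MiB pages

# Small pages: tiny read amplification, but a ~200 MiB resident index.
# Large pages: tiny index, but every lookup drags in a full MiB.
print(small["index_bytes_in_memory"] // 2**20, "MiB index with 8 KiB pages")
print(large["read_amp_per_lookup"] // 2**10, "KiB pulled per lookup with 1 MiB pages")
```

Whatever page size you pick, one of the two costs grows, which is why the tuning is described as painful.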
3. Where Lance's Innovation Actually Lives
Lance doesn't pick a side between Arrow and Parquet. It disassembles both and recombines them:
- For wide values: use full-zip encoding
- For narrow values: use miniblock encoding
This is the core innovation: not one fixed structural encoding for the entire file, but adaptive selection based on data width.
Abstractly:
- full-zip aims to bundle all the structural information and data needed for a single value access as tightly as possible
- miniblock is closer to "small-page Parquet": it accepts a small amount of read amplification in exchange for lighter CPU overhead during scans
Lance therefore doesn't have one fixed tradeoff. It splits the tradeoff into two strategies applied locally to different data types.
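A minimal sketch of the adaptive-selection idea: pick a structural encoding per column from an estimated value width. The threshold and names are illustrative assumptions; Lance's actual heuristics differ.

```python
FULL_ZIP, MINIBLOCK = "full-zip", "miniblock"
WIDTH_THRESHOLD = 128  # bytes; an assumed cutoff, not Lance's real value

def choose_structural_encoding(sample_values):
    """Wide values -> full-zip (targets 1-2 I/Os per point lookup);
    narrow values -> miniblock (cheap scans, block-granular lookups)."""
    avg_width = sum(len(v) for v in sample_values) / max(len(sample_values), 1)
    return FULL_ZIP if avg_width >= WIDTH_THRESHOLD else MINIBLOCK

print(choose_structural_encoding([b"x" * 4096, b"y" * 8192]))  # full-zip
print(choose_structural_encoding([b"ab", b"cd", b"ef"]))       # miniblock
```

The decision is local to each column (or even each page of data), which is what lets one file hold both tradeoff points at once.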
What These Innovations Actually Change
What Gets Better
Random access on nested/variable-length data is more stable. Arrow-style nesting keeps adding I/Os with each level; Lance tries to flatten that growth.
Random access on wide fields is friendlier. Parquet needs a large page index in memory when pages are kept small for wide fields; Lance's full-zip doesn't depend on a page-level search cache.
Full scan throughput is competitive with Parquet — often better. Not because Lance has magical compression, but because it's better at keeping the disk queue full continuously.
One configuration works across more types. No need to repeatedly tune row group size and page size the way you do with Parquet.
What Gets Worse
Point lookups on simple narrow types may not beat Arrow-style access. Miniblock allows opaque/dense compression, but the cost is that retrieving a single value requires decoding the entire block first.
Some full-zip layouts waste space. Fixed-width types need to store a slot for null values to support direct positional access — they can't fully sparse-encode nulls the way Parquet does.
Implementation complexity is significantly higher. You now have two structural encoding schemes, different index structures, and different chunking rules all cooperating.
Some scenarios may become CPU-bound. The paper notes that the LZ4 path still has headroom for compression/decompression optimization, and small accesses carry some syscall and alignment overhead.
Why These Tradeoffs Exist
- Arrow is fast because it decodes lightly; slow because it generates many I/Os
- Parquet is fast because it generates fewer I/Os; slow because of page amplification + page index + CPU decode + tuning overhead
- Lance separates "the I/O problem for wide data" from "the CPU problem for narrow data" through adaptive structural encoding
In other words, Lance doesn't eliminate the tradeoff. It changes the tradeoff from "one compromise point for the entire file" to "local optimums per data type."
Unpacking the Five Mechanisms Behind the Conclusion
The paper's conclusion can be decomposed into five concrete mechanisms.
1. Smaller Search Cache
The reason isn't that Lance has no index at all. It's that indexes appear only where necessary, at a finer granularity.
- For wide values in full-zip, random access relies on a repetition index, not Parquet's approach of keeping a per-page offset index resident in memory for every page
- For narrow values in miniblock, only compact chunk metadata is stored; each chunk covers dozens of values, so the metadata amortizes well
- Auxiliary structures like dictionaries and symbol tables only appear when the relevant encoding is active
Lance's advantage isn't "zero metadata." It's avoiding the explosion of page indexes that Parquet suffers when combining wide fields with small page sizes.
2. No Need to Configure Row Group Sizes
Because Lance's random access capability comes from:
- Per-value positional access via full-zip
- Small-block positional access via miniblock
- Local navigation via repetition index / chunk metadata
...rather than from "getting the row group and page sizes right" as in Parquet. Parallel scan also doesn't depend on the row group concept, so there's no need to repeatedly compromise between point lookup and full scan on a single row group size parameter.
3. Struct Packing
Lance separates "structural encoding" from "column compression":
- First compress each struct field independently in columnar fashion
- Then pack multiple fields together as a single logical column
This preserves the benefits of columnar compression while reducing the number of I/Os needed for a random access to an entire struct. The cost is losing fine-grained projection — scanning a single field forces reading the whole struct.
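A toy illustration of the packing step: each field is encoded on its own, then the encoded fields for each row are laid out adjacently so one contiguous read fetches the whole struct. The encoding here is just fixed-width bytes; real Lance applies per-field compression first.

```python
import struct

def pack_struct_column(xs, ys):
    """xs, ys: two int32 struct fields. Returns one packed byte column
    plus the fixed row width."""
    rows = [struct.pack("<ii", x, y) for x, y in zip(xs, ys)]
    return b"".join(rows), struct.calcsize("<ii")

def read_struct(packed, row_width, i):
    # One contiguous read returns every field of row i ...
    blob = packed[i * row_width:(i + 1) * row_width]
    return struct.unpack("<ii", blob)

packed, width = pack_struct_column([1, 2, 3], [10, 20, 30])
print(read_struct(packed, width, 1))  # (2, 20): the whole struct in one read
# ... but projecting only the first field still pays for the second
# field's bytes, which is the projection cost noted above.
```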
4. Bridging Lists Across Pages
Lance's miniblock does not require a chunk to start at a top-level record boundary — rows can straddle chunk boundaries. This means lists don't need to be hard-cut at page boundaries the way many traditional page organizations require; they can span across chunks.
This is possible because the repetition index records completion counts both inside and outside chunks, making it possible to recover "where this list started, where it continues, and where it ends."
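The mechanism can be sketched as follows: elements are chunked without regard to list boundaries, and per chunk we record the cumulative count of completed lists, which is enough to later find every chunk a given list touches. The structure names are illustrative, not Lance's on-disk format.

```python
from bisect import bisect_left
from itertools import accumulate

def build_rep_index(list_lengths, chunk_elems):
    ends = list(accumulate(list_lengths))  # element position where each list ends
    chunk_ends = list(range(chunk_elems, ends[-1] + chunk_elems, chunk_elems))
    # per chunk: how many lists have fully completed by the chunk's end
    completed = [bisect_left(ends, c + 1) for c in chunk_ends]
    return ends, completed

def chunks_for_list(ends, completed, k, chunk_elems):
    """Which chunks must be fetched to reassemble list k."""
    end_chunk = bisect_left(completed, k + 1)   # chunk holding the list's end
    start_elem = ends[k - 1] if k > 0 else 0
    start_chunk = start_elem // chunk_elems     # chunk holding the list's start
    return list(range(start_chunk, end_chunk + 1))

ends, completed = build_rep_index([3, 5, 2, 4], chunk_elems=4)
# list 1 covers elements 3..7, straddling the chunk 0 / chunk 1 boundary
print(chunks_for_list(ends, completed, 1, 4))  # [0, 1]
```

Because no chunk has to begin at a list boundary, chunk sizes stay uniform even when individual lists are huge.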
5. Nested Repetition Indices
Parquet's rep/def primarily solves "how to reconstruct nested structure." Lance goes further and encodes "how to perform nested random navigation" into the index as well.
The repetition index records not just top-level row boundaries, but navigation across nesting depths:
- How many lists have been completed at the top level
- How many lists have been completed at the next level down
- How many elements have accumulated at the bottom level
So Lance doesn't just support locating row x; in principle it supports navigating deeper structures like array[x][y][z]. This is the substance of what "nested repetition indices" means.
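The navigation idea can be sketched with per-level cumulative counts; each level's index translates a position at that level into a range at the level below. This is illustrative of the concept, not Lance's actual index layout.

```python
from itertools import accumulate

# data[x][y] -> list of ints; two nesting levels above the leaf values
data = [[[1, 2], [3]], [[4], [5, 6, 7]]]

# level-1 index: inner lists completed after each top-level row
lvl1 = [0] + list(accumulate(len(row) for row in data))
# level-2 index: leaf elements completed after each inner list
inner = [lst for row in data for lst in row]
lvl2 = [0] + list(accumulate(len(lst) for lst in inner))
flat = [v for lst in inner for v in lst]

def navigate(x, y):
    """Locate data[x][y] in the flat leaf stream using only the indices."""
    inner_idx = lvl1[x] + y                 # which inner list, globally
    start, end = lvl2[inner_idx], lvl2[inner_idx + 1]
    return flat[start:end]

print(navigate(1, 1))  # [5, 6, 7], without decoding any other list
```

Each lookup is a couple of index reads plus one slice of the leaf stream, regardless of how much data sits in the other rows.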
Compression in Lance: What's Done vs. What's Left
Structural Layer
Control word (in full-zip, before each value): packs repetition and definition level bits into 1–4 bytes. This is structural compression, not value compression — its purpose is to minimize the on-disk footprint of nesting boundaries and null state while keeping each value directly addressable.
Offsets rewritten as lengths (in full-zip variable-length value area): Arrow-style offsets are converted to per-value lengths, then bit-packed. This avoids chasing two offsets to locate a value.
Repetition index: primarily a navigation index, but internally bit-packed. It maps rows to byte offsets and chunk positions, and records nesting boundaries.
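Toy versions of two of the full-zip tricks above: packing repetition/definition levels into a small control word, and rewriting Arrow offsets as per-value lengths. The bit widths are illustrative assumptions, not Lance's on-disk layout.

```python
def pack_control(rep, dlev, def_bits=2):
    """One control word carrying both levels; real ones span 1-4 bytes
    depending on the schema's nesting depth."""
    return (rep << def_bits) | dlev

def unpack_control(word, def_bits=2):
    return word >> def_bits, word & ((1 << def_bits) - 1)

def offsets_to_lengths(offsets):
    """With lengths stored inline, locating value i no longer requires
    fetching both offsets[i] and offsets[i+1]."""
    lengths = [b - a for a, b in zip(offsets, offsets[1:])]
    bit_width = max(lengths).bit_length()  # pack each length this tightly
    return lengths, bit_width

print(unpack_control(pack_control(1, 2)))    # (1, 2)
print(offsets_to_lengths([0, 3, 6, 9, 12]))  # ([3, 3, 3, 3], 2)
```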
Value Layer
FSST: mainly for string value buffers; keeps single values independently decodable. Examples from the paper: prompts, reviews, names (sometimes combined with dictionary).
Dictionary encoding: effective for low-cardinality strings. The paper uses Dictionary + FSST for names. Adds a table lookup step on the random access path.
Bitpacking: for integer and date value buffers, or for length fields. Example: dates in the paper's benchmarks.
LZ4: for value buffers of large objects; per-value LZ4 framing preserves single-value independent decompression. Examples: code, images, websites.
Opaque chunk compression (inside miniblock chunks): since miniblock doesn't fully interleave all buffers into per-value layout, it can allow opaque compression. Better scan vectorization, more compression options — but point lookups must decode the whole block first.
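The point-lookup cost of opaque compression can be modeled directly; zlib stands in here for whatever opaque codec a chunk uses, and the chunk size is an assumption.

```python
import struct
import zlib

CHUNK = 64  # values per chunk (assumed)

def encode_miniblocks(values):
    """Compress int32 values in opaque chunks of CHUNK values each."""
    chunks = []
    for i in range(0, len(values), CHUNK):
        raw = b"".join(struct.pack("<i", v) for v in values[i:i + CHUNK])
        chunks.append(zlib.compress(raw))
    return chunks

def point_lookup(chunks, i):
    # must inflate the entire chunk even though we want one 4-byte value
    raw = zlib.decompress(chunks[i // CHUNK])
    return struct.unpack_from("<i", raw, 4 * (i % CHUNK))[0]

chunks = encode_miniblocks(list(range(1000)))
print(point_lookup(chunks, 777))  # 777, after decoding its whole chunk
```

Scans amortize that decompression over every value in the chunk, which is why the same design that hurts point lookups helps throughput.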
Metadata/Cache Layer
Miniblock metadata: buffer counts, per-buffer sizes, chunk dimensions. Determines whether the search cache is actually small.
Symbol tables / dictionaries: auxiliary structures for FSST and dictionary encodings. The paper treats these as the kind of data that may need to be cached for random access.
Already Done vs. Still Has Room
Already implemented: full-zip, miniblock, control word bitpacking, length bitpacking, repetition index, FSST, dictionary, bitpacking, LZ4, struct packing.
Explicitly noted but not yet complete:
- Run-length encoding on full-zip: the design is discussed in the paper but noted as not implemented in Lance 2.1
- Search cache compression: the paper believes further compression would be straightforward
- More aggressive transparent compression combos: particularly replacing some opaque miniblock encodings with transparent ones to reduce point-lookup decode cost
- Further LZ4 path optimization: some scenarios are already approaching CPU-bound
- Better chunk sizing and disk alignment: current chunk size and alignment rules are still heuristic
Three Things Worth Remembering
- The most important contribution of this paper is the analytical framework of "structural encoding." It cleanly separates problems that used to all get lumped together under "file format design."
- The difference between Arrow, Parquet, and Lance isn't about who compresses more aggressively. It's about where each puts its priorities: I/O count, read amplification, CPU decode cost, and metadata memory.
- Lance's genuine novelty isn't any single encoding technique. It's "adaptively switching structural encoding based on data width." This frees it from committing the entire file to a single tradeoff point.
#AI Infrastructure, #Columnar Storage, #File Format, #NVMe, #Random Access