Skip to content
Concept

How a verifiable query actually works

A 10-minute primer on the cryptographic mechanics of a zero-knowledge SQL query — commitment, circuit construction, proving, and verification.

Published 2026-05-18· 6 min read

The phrase "the database proves the query" sounds like magic. It is, in fact, a precise pipeline of well-understood cryptography. This primer walks through the mechanism end-to-end, in the level of detail a CTO needs to make architectural decisions — without the proof-system formalism a paper would require.

The setting

Three parties:

  • The owner — holds the underlying data, executes the query, and produces the proof.
  • The verifier — wants to learn the answer to a query and an assurance the answer is correct, without ever seeing the data.
  • The bulletin board — a write-once-read-many public log of dataset commitments. This can be a private append-only log, an internal git repository signed by the firm's HSM, or a public chain — the cryptography is indifferent.

Step 1 — Commit to the dataset

Before any query is ever asked, the owner commits to the dataset. The commitment is a short cryptographic fingerprint — typically a few hundred bytes — produced from the entire database using a polynomial commitment scheme (IPA in transparent-setup systems, KZG in trusted-setup systems).

The commitment has two properties:

  1. It binds the owner to one specific database state. The owner cannot later substitute different rows and still have proofs verify.
  2. It hides the rows. The verifier learns nothing about content from the commitment alone.

When data changes, the owner publishes a new commitment. The verifier can always check which commitment a given proof refers to.

The commitment fingerprints the dataset without revealing it.

Step 2 — Compile the query into a circuit

When a query arrives, the engine compiles it into an arithmetic circuit — a system of polynomial constraints over a finite field. Every SQL operator becomes a sub-circuit:

  • WHERE x BETWEEN a AND b becomes a range-check gate (efficient via lookup tables in modern PLONK-ish systems).
  • GROUP BY becomes a sort + boundary-marker sub-circuit. Rows are sorted by the group key; the boundary between groups is marked; aggregates are accumulated within each segment.
  • JOIN becomes a permutation argument: the engine asserts that for every row claimed in the output, there exist matching rows in each input table — and that the output is a permutation of the matched pairs (with dummy tuples padding the cardinality).
  • SUM, COUNT, AVG, MIN, MAX become accumulator sub-circuits with carry-out semantics.

Each gate has been hand-crafted by the proof-system designer and benchmarked. The circuit compiler stitches them together according to the query plan — exactly the way a traditional query optimizer stitches together physical operators.

Step 3 — Generate proving and verification keys

The first time a particular query circuit is seen, a setup phase runs over it. For PLONK-ish systems with transparent setup (like Halo2/IPA), this is fast and produces no toxic-waste. For KZG-based systems, it consumes the existing trusted-setup public parameters (commonly the Ethereum-grade ones from the KZG ceremony).

The output is two artifacts:

  • A proving key, kept by the owner.
  • A verification key, distributed to verifiers.

Both are reusable for every future evaluation of the same query.

Step 4 — Run the query and produce the proof

The owner now does two things in parallel:

  1. Executes the query against the witness (the actual rows). This produces the answer.
  2. Witnesses every intermediate value of the circuit and runs the proof-generation algorithm.

Proof generation is the expensive step. For a non-trivial TPC-H query (Q1, Q5, Q9) on a 1 M-row table, expect seconds to minutes of CPU on commodity hardware today. GPU and FPGA provers (Cysic, Ulvetanna, Ingonyama; also Polygon's Plonky3 on AVX-512) have brought this down by 1–2 orders of magnitude in the last 18 months.

The output is a proof artifact of a few tens of kilobytes — independent of the size of the underlying database.

Step 5 — Deliver

The owner returns to the verifier:

  • The claimed answer (AVG(salary) = 142,837).
  • The proof (proof.zkp, 38 KB).
  • A reference to the active commitment (commit_t2 = 0x91b3...).

Step 6 — Verify

The verifier, holding only the verification key and the proof, runs the verification algorithm. This is sub-second on a laptop, milliseconds on a server for every modern PLONK-ish system. The check confirms three things at once:

  1. The proof references the claimed commitment.
  2. The circuit was the exact query the verifier expected.
  3. There exists some witness consistent with the commitment such that executing the circuit produces the claimed answer.

Crucially, the verifier learns nothing about which witness — that is, nothing about which rows.

If any of these conditions fail, the verifier rejects the proof. The owner cannot produce a fraudulent proof except by breaking the underlying cryptographic assumptions (discrete logarithm or related problems) — for which we have no efficient attack and which the world's cryptographers spend significant effort to either break or strengthen.

What this gives you

A non-interactive proof has properties that are easy to underrate until you architect a system around them:

  • Transferable. The verifier can hand the proof to its counterparties. The auditor can pass it to the regulator. The proof stands alone.
  • Re-verifiable indefinitely. Today's proof verifies the same tomorrow, and ten years from now.
  • Publicly checkable. Anyone with the verification key can audit — including journalists, watchdog groups, and the firm's own board.
  • Cheap to verify, no matter how large the data. A proof over a 100-million-row dataset verifies in the same milliseconds as one over a thousand rows. This is the succinctness in zk-SNARK.

These four properties are why the compliance, audit, and inter-organizational data exchange use cases compose so naturally — and why no other privacy technology achieves them simultaneously.

What it does not give you

  • It does not hide the answer. If the answer itself is sensitive, layer differential privacy or k-anonymity on top.
  • It does not solve mutability cheaply. Each update changes the commitment; incremental and append-only schemes are mature, full mutability is still research-grade.
  • It does not provide confidentiality during compute. The owner sees the rows while computing — by definition. If that is the threat model, you need FHE or a TEE in addition.

Further reading