The phrase "the database proves the query" sounds like magic. It is, in fact, a precise pipeline of well-understood cryptography. This primer walks through the mechanism end-to-end, in the level of detail a CTO needs to make architectural decisions — without the proof-system formalism a paper would require.
The setting
Three parties:
- The owner — holds the underlying data, executes the query, and produces the proof.
- The verifier — wants to learn the answer to a query and an assurance the answer is correct, without ever seeing the data.
- The bulletin board — a write-once-read-many public log of dataset commitments. This can be a private append-only log, an internal git repository signed by the firm's HSM, or a public chain — the cryptography is indifferent.
Step 1 — Commit to the dataset
Before any query is ever asked, the owner commits to the dataset. The commitment is a short cryptographic fingerprint — typically a few hundred bytes — produced from the entire database using a polynomial commitment scheme (IPA in transparent-setup systems, KZG in trusted-setup systems).
The commitment has two properties:
- It binds the owner to one specific database state. The owner cannot later substitute different rows and still have proofs verify.
- It hides the rows. The verifier learns nothing about content from the commitment alone.
When data changes, the owner publishes a new commitment. The verifier can always check which commitment a given proof refers to.
[Private rows] [Public bulletin board] ─────────────── ──────────────────────── row #1 ... commit_t0 = 0x4a1f... row #2 ... ──── hash ───▶ commit_t1 = 0xe09c... row #3 ... (polynomial commit_t2 = 0x91b3... ◀ active ... commitment) ...
Step 2 — Compile the query into a circuit
When a query arrives, the engine compiles it into an arithmetic circuit — a system of polynomial constraints over a finite field. Every SQL operator becomes a sub-circuit:
WHERE x BETWEEN a AND bbecomes a range-check gate (efficient via lookup tables in modern PLONK-ish systems).GROUP BYbecomes a sort + boundary-marker sub-circuit. Rows are sorted by the group key; the boundary between groups is marked; aggregates are accumulated within each segment.JOINbecomes a permutation argument: the engine asserts that for every row claimed in the output, there exist matching rows in each input table — and that the output is a permutation of the matched pairs (with dummy tuples padding the cardinality).SUM,COUNT,AVG,MIN,MAXbecome accumulator sub-circuits with carry-out semantics.
Each gate has been hand-crafted by the proof-system designer and benchmarked. The circuit compiler stitches them together according to the query plan — exactly the way a traditional query optimizer stitches together physical operators.
Step 3 — Generate proving and verification keys
The first time a particular query circuit is seen, a setup phase runs over it. For PLONK-ish systems with transparent setup (like Halo2/IPA), this is fast and produces no toxic-waste. For KZG-based systems, it consumes the existing trusted-setup public parameters (commonly the Ethereum-grade ones from the KZG ceremony).
The output is two artifacts:
- A proving key, kept by the owner.
- A verification key, distributed to verifiers.
Both are reusable for every future evaluation of the same query.
Step 4 — Run the query and produce the proof
The owner now does two things in parallel:
- Executes the query against the witness (the actual rows). This produces the answer.
- Witnesses every intermediate value of the circuit and runs the proof-generation algorithm.
Proof generation is the expensive step. For a non-trivial TPC-H query (Q1, Q5, Q9) on a 1 M-row table, expect seconds to minutes of CPU on commodity hardware today. GPU and FPGA provers (Cysic, Ulvetanna, Ingonyama; also Polygon's Plonky3 on AVX-512) have brought this down by 1–2 orders of magnitude in the last 18 months.
The output is a proof artifact of a few tens of kilobytes — independent of the size of the underlying database.
Step 5 — Deliver
The owner returns to the verifier:
- The claimed answer (
AVG(salary) = 142,837). - The proof (
proof.zkp, 38 KB). - A reference to the active commitment (
commit_t2 = 0x91b3...).
Step 6 — Verify
The verifier, holding only the verification key and the proof, runs the verification algorithm. This is sub-second on a laptop, milliseconds on a server for every modern PLONK-ish system. The check confirms three things at once:
- The proof references the claimed commitment.
- The circuit was the exact query the verifier expected.
- There exists some witness consistent with the commitment such that executing the circuit produces the claimed answer.
Crucially, the verifier learns nothing about which witness — that is, nothing about which rows.
If any of these conditions fail, the verifier rejects the proof. The owner cannot produce a fraudulent proof except by breaking the underlying cryptographic assumptions (discrete logarithm or related problems) — for which we have no efficient attack and which the world's cryptographers spend significant effort to either break or strengthen.
What this gives you
A non-interactive proof has properties that are easy to underrate until you architect a system around them:
- Transferable. The verifier can hand the proof to its counterparties. The auditor can pass it to the regulator. The proof stands alone.
- Re-verifiable indefinitely. Today's proof verifies the same tomorrow, and ten years from now.
- Publicly checkable. Anyone with the verification key can audit — including journalists, watchdog groups, and the firm's own board.
- Cheap to verify, no matter how large the data. A proof over a 100-million-row dataset verifies in the same milliseconds as one over a thousand rows. This is the succinctness in zk-SNARK.
These four properties are why the compliance, audit, and inter-organizational data exchange use cases compose so naturally — and why no other privacy technology achieves them simultaneously.
What it does not give you
- It does not hide the answer. If the answer itself is sensitive, layer differential privacy or k-anonymity on top.
- It does not solve mutability cheaply. Each update changes the commitment; incremental and append-only schemes are mature, full mutability is still research-grade.
- It does not provide confidentiality during compute. The owner sees the rows while computing — by definition. If that is the threat model, you need FHE or a TEE in addition.
Further reading
- zkDB vs FHE vs TEE — decision tree.
- The trust topology of a zkDB — who trusts whom, and why this changes.
- Research bibliography — the foundational papers.