Node Module - Data Server (Airport + FlightSQL)¶

The Node module exposes DuckDB Airport on flight_server (Arrow Flight with PATH-style access) and optionally Apache Arrow FlightSQL on flightsql_server (Grafana InfluxDB / FlightSQL). Both share the same DuckDB engine and DuckLake-backed data.

Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                        Homer Node                                │
│                                                                  │
│  DuckDB Airport (gRPC :50051)  ──┐                              │
│  Arrow FlightSQL (gRPC :50055) ──┼──▶  DuckDB + DuckLake        │
│  HTTP /query (port+1)          ──┘                              │
│                                                                  │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
                       ┌────────────────┐
                       │ Parquet volumes │
                       └────────────────┘

Key Features¶

DuckDB Airport (flight_server) — Arrow Flight for DuckDB ATTACH 'arrow_flight://…' and the coordinator’s HTTP /query path
Apache FlightSQL (flightsql_server, optional) — same SQL stack as HTTP queries, for Grafana FlightSQL / ADBC
Multi-volume support - read from multiple storage backends (hot/cold)
Automatic UNION ALL - transparent data merging from all volumes
Partition pruning - DuckDB automatically skips partitions without matching data
HTTP API - REST endpoints for queries and management

Configuration¶

Single Volume (Simple Case)¶

{
  "node": {
    "enable": true,
    "flight_server": {
      "host": "0.0.0.0",
      "port": 50051,
      "auth_token": "your-secret-token",
      "max_message_size": 16777216
    },
    "flightsql_server": {
      "enable": false,
      "host": "0.0.0.0",
      "port": 50055,
      "auth_token": "",
      "catalog_refresh_interval_sec": 30
    },
    "ducklake": {
      "lake_name": "homer_lake",
      "volumes": [
        {
          "name": "default",
          "type": "local",
          "catalog_type": "sqlite",
          "catalog_path": "/data/homer/homer_catalog.sqlite",
          "path": "/data/homer/parquet"
        }
      ]
    }
  }
}

Multiple Volumes (Tiered Storage)¶

{
  "node": {
    "enable": true,
    "flight_server": {
      "host": "0.0.0.0",
      "port": 50051,
      "auth_token": "your-secret-token"
    },
    "ducklake": {
      "lake_name": "homer_lake",
      "volumes": [
        {
          "name": "hot",
          "type": "local",
          "catalog_type": "sqlite",
          "catalog_path": "/data/homer/homer_catalog_hot.sqlite",
          "path": "/data/homer/parquet"
        },
        {
          "name": "cold",
          "type": "s3",
          "catalog_type": "sqlite",
          "catalog_path": "/data/homer/homer_catalog_cold.sqlite",
          "path": "s3://homer-bucket/cold/",
          "s3_region": "us-east-1",
          "s3_access_key_id": "your-key",
          "s3_secret_access_key": "your-secret",
          "s3_endpoint": "https://s3.amazonaws.com",
          "s3_use_ssl": true
        }
      ]
    }
  }
}

Configuration Parameters¶

flight_server¶

Parameter	Type	Default	Description
`host`	string	"0.0.0.0"	Listen address
`port`	int	50051	gRPC server port
`auth_token`	string	""	Bearer token for authentication
`max_message_size`	int	16777216	Maximum message size (16MB)

flightsql_server (optional, Grafana FlightSQL)¶

Parameter	Type	Default	Description
`enable`	bool	false	Listen for Apache Arrow FlightSQL (separate from Airport)
`host`	string	"0.0.0.0"	Listen address
`port`	int	50055	gRPC FlightSQL port
`auth_token`	string	""	If set, require `Authorization: Bearer <token>`
`catalog_refresh_interval_sec`	int	30	Periodic `DETACH`/`ATTACH` refresh when not using a shared writer DB

See FLIGHTSQL.md for Grafana setup and coordinator proxy (coordinator.flightsql_server, default port 32010).

ducklake¶

Parameter	Type	Description
`lake_name`	string	Base DuckLake name (used as prefix)
`volumes`	array	Array of storage volumes

volume¶

Parameter	Type	Default	Description
`name`	string	-	Volume label. Use `"default"` for the primary single-volume catalog (see below). Other labels (`hot`, `cold`, …) form the DuckDB catalog name together with `lake_name`.
`type`	string	"local"	Storage type: "local" or "s3"
`catalog_type`	string	"sqlite"	DuckLake catalog — sqlite
`catalog_path`	string	-	Path to catalog file
`path`	string	-	Data path (local or s3://)
`s3_region`	string	-	AWS region (for S3)
`s3_access_key_id`	string	-	AWS Access Key ID
`s3_secret_access_key`	string	-	AWS Secret Access Key
`s3_endpoint`	string	-	Custom S3 endpoint (MinIO, R2)
`s3_use_ssl`	bool	true	Use SSL for S3 connections
`override_data_path`	bool	false	If true, DuckLake `ATTACH` uses `OVERRIDE_DATA_PATH TRUE` so `path` may differ from the `DATA_PATH` already stored in the catalog (e.g. bucket rename). Prefer keeping `path` identical to the writer volume that created the catalog.

Volume `name` and the DuckDB catalog¶

When Node attaches DuckLake, the volume’s name determines the DuckDB catalog identifier used in SQL (e.g. homer_lake.main.hep_proto_1_call):

If name is "default", the catalog is attached as node.ducklake.lake_name only (e.g. homer_lake). This matches the default single-catalog writer (storage.ducklake.lake_name).
For any other name, the catalog is {lake_name}_{name} — the same pattern as in the Multiple Volumes example (homer_lake_hot, homer_lake_cold).

API rows may include a storage_volume column (your volume label). Qualified FROM clauses must use the catalog name, not the label alone; the coordinator typically emits lake_name.main…, and Node may rewrite it per volume.

For tiered storage, keep node.ducklake.volumes in sync with storage.ducklake.storage_policy.volumes (same name values and matching catalog_path / path per volume). See STORAGE_POLICIES.md.

How Multi-Volume Works¶

UNION ALL Mechanism¶

When multiple volumes are configured, Node automatically rewrites incoming queries:

Incoming query:

SELECT * FROM homer_lake.main.sip_transactions 
WHERE timestamp BETWEEN '2025-01-01 10:00:00' AND '2025-01-01 12:00:00'

Rewritten query:

SELECT * FROM (
  SELECT * FROM homer_lake_hot.main.sip_transactions
  UNION ALL
  SELECT * FROM homer_lake_cold.main.sip_transactions
) 
WHERE timestamp BETWEEN '2025-01-01 10:00:00' AND '2025-01-01 12:00:00'

Partition Pruning¶

DuckDB automatically optimizes UNION ALL queries through:

Predicate pushdown - WHERE conditions are applied to each subquery
Partition pruning - partitions not containing data in the requested timerange are skipped
File statistics - DuckLake stores min/max statistics for each file

This means if the data for the requested period exists only in the hot volume, the cold volume won't be scanned at all.

HTTP API¶

GET /api/v4/node/stats¶

Returns node statistics.

{
  "running": true,
  "address": "0.0.0.0:50051",
  "volumes_count": 2,
  "volumes": ["hot", "cold"]
}

POST /api/v4/node/query¶

Executes an SQL query.

Request:

{
  "sql": "SELECT * FROM homer_lake.main.sip_transactions LIMIT 10"
}

Response:

{
  "columns": ["timestamp", "call_id", "method", ...],
  "data": [...]
}

POST /api/v4/node/vacuum¶

Runs vacuum on all volumes.

Request:

{
  "expire_older_than": "7 days"
}

FlightSQL Clients¶

Python (pyarrow)¶

from pyarrow.flight import FlightClient, FlightCallOptions

client = FlightClient("grpc://localhost:50051")
options = FlightCallOptions(headers=[("authorization", "Bearer your-token")])

info = client.get_flight_info(
    FlightDescriptor.for_command(b"SELECT * FROM homer_lake.main.sip_transactions LIMIT 10"),
    options
)

reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.to_pandas())

DuckDB¶

INSTALL arrow;
LOAD arrow;

-- Connect to FlightSQL server
ATTACH 'arrow_flight://localhost:50051?auth_token=your-token' AS homer;

-- Query data
SELECT * FROM homer.homer_lake.main.sip_transactions LIMIT 10;

Ingest, Storage and Node Integration¶

Ingest, Storage and Node modules work together:

┌────────────────────────────────────────────────────────────┐
│                       Homer Server                          │
│                                                             │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐       │
│  │   Ingest    │──▶│   Storage   │──▶│    Node     │       │
│  │ (HEP recv)  │   │ (DuckLake)  │   │ (FlightSQL) │       │
│  └─────────────┘   └─────────────┘   └─────────────┘       │
│   UDP/TCP/HTTP       Parquet write      gRPC:50051         │
│                            │                  │            │
│                            ▼                  ▼            │
│  ┌─────────────────────────────────────────────────┐       │
│  │              DuckLake Catalog                    │       │
│  │       (shared catalog, same volumes)             │       │
│  └─────────────────────────────────────────────────┘       │
│                            │                               │
│              ┌─────────────┴─────────────┐                 │
│              ▼                           ▼                 │
│       ┌─────────────┐             ┌─────────────┐          │
│       │ Hot Volume  │             │ Cold Volume │          │
│       │ (local SSD) │             │ (S3/R2)     │          │
│       └─────────────┘             └─────────────┘          │
└────────────────────────────────────────────────────────────┘

Important: Storage and Node must use the same volumes with identical catalog_path (and data path) for each volume. For a single logical DuckLake that the writer owns, the node volume must point at the same catalog and Parquet tree as storage.ducklake.

Writer + Node (all-in-one process)¶

When Ingest, Storage, and Node are enabled together, Node uses the writer’s DuckDB connection for queries so in-memory buffer tables and freshly flushed Parquet stay visible without a second attach.

The SQL engine then only sees catalogs the writer actually attached — for the default (non–storage-policy) writer that is normally storage.ducklake.lake_name (e.g. homer_lake).
Set node.ducklake.lake_name to the same value as storage.ducklake.lake_name, and align catalog_path / path on the single volume with storage.
Using "name": "default" for that single volume keeps Node’s own DuckDB attach (used for catalog refresh when not sharing the writer DB) consistent with the writer catalog name.

Current code also rewrites coordinator SQL to use node.ducklake.lake_name when executing on the shared writer connection with one volume (duckLakeCatalogForQuery in src/node/node.go), which avoids Binder errors if name is not default. Path and lake_name alignment with storage remains mandatory.

With multi-volume node.ducklake.volumes and a writer storage_policy, the Node’s HTTP /query path (coordinator → local node) merges results from the writer DuckDB (homer_lake + memory buffer) and the writer’s TieredStorageManager DuckDB (homer_lake_hot / homer_lake_cold UNION), so cold-tier data is visible in search. Apache Arrow FlightSQL still uses the writer connection only until extended similarly.

Troubleshooting¶

Symptom Likely cause

Binder Error: Catalog "homer_lake_…" does not exist on search / API queries Rewritten SQL referenced a suffixed catalog (from volumes[].name) that is not attached on the connection running the query — e.g. all-in-one with shared writer DB while the node volume used a non-default label. Fix: same lake_name and paths as storage, mirror tiered volume names with STORAGE_POLICIES.md, use "name": "default" for the single primary volume, or use a build that includes shared-DB single-volume catalog normalization (see duckLakeCatalogForQuery in src/node/node.go).

DATA_PATH … does not match existing data path in the catalog on Node attach The S3/local path for a volume must match the path recorded when the catalog was first created (usually by the writer). Fix: align path with storage.ducklake.storage_policy.volumes for that volume, or set override_data_path": true on that volume (DuckLake OVERRIDE_DATA_PATH) if you intentionally moved data or renamed the bucket prefix.

Example Configurations¶

See examples in the examples/ directory:

homer-node.json - Node only (read-only)
homer-writer.json - Ingest + Storage + Node (combined)
homer-writer-rustfs.json - Ingest + Storage + Node with S3 (RustFS/MinIO)
homer.json - All-in-one (Ingest + Storage + Node + Coordinator)