Node Module - Data Server (Airport + FlightSQL)¶
The Node module exposes DuckDB Airport on flight_server (Arrow Flight with PATH-style access) and optionally Apache Arrow FlightSQL on flightsql_server (Grafana InfluxDB / FlightSQL). Both share the same DuckDB engine and DuckLake-backed data.
Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Homer Node │
│ │
│ DuckDB Airport (gRPC :50051) ──┐ │
│ Arrow FlightSQL (gRPC :50055) ──┼──▶ DuckDB + DuckLake │
│ HTTP /query (port+1) ──┘ │
│ │
└───────────────────────────────┬────────────────────────────────┘
▼
┌────────────────┐
│ Parquet volumes │
└────────────────┘
Key Features¶
- DuckDB Airport (
flight_server) — Arrow Flight for DuckDBATTACH 'arrow_flight://…'and the coordinator’s HTTP/querypath - Apache FlightSQL (
flightsql_server, optional) — same SQL stack as HTTP queries, for Grafana FlightSQL / ADBC - Multi-volume support - read from multiple storage backends (hot/cold)
- Automatic UNION ALL - transparent data merging from all volumes
- Partition pruning - DuckDB automatically skips partitions without matching data
- HTTP API - REST endpoints for queries and management
Configuration¶
Single Volume (Simple Case)¶
{
"node": {
"enable": true,
"flight_server": {
"host": "0.0.0.0",
"port": 50051,
"auth_token": "your-secret-token",
"max_message_size": 16777216
},
"flightsql_server": {
"enable": false,
"host": "0.0.0.0",
"port": 50055,
"auth_token": "",
"catalog_refresh_interval_sec": 30
},
"ducklake": {
"lake_name": "homer_lake",
"volumes": [
{
"name": "default",
"type": "local",
"catalog_type": "sqlite",
"catalog_path": "/data/homer/homer_catalog.sqlite",
"path": "/data/homer/parquet"
}
]
}
}
}
Multiple Volumes (Tiered Storage)¶
{
"node": {
"enable": true,
"flight_server": {
"host": "0.0.0.0",
"port": 50051,
"auth_token": "your-secret-token"
},
"ducklake": {
"lake_name": "homer_lake",
"volumes": [
{
"name": "hot",
"type": "local",
"catalog_type": "sqlite",
"catalog_path": "/data/homer/homer_catalog_hot.sqlite",
"path": "/data/homer/parquet"
},
{
"name": "cold",
"type": "s3",
"catalog_type": "sqlite",
"catalog_path": "/data/homer/homer_catalog_cold.sqlite",
"path": "s3://homer-bucket/cold/",
"s3_region": "us-east-1",
"s3_access_key_id": "your-key",
"s3_secret_access_key": "your-secret",
"s3_endpoint": "https://s3.amazonaws.com",
"s3_use_ssl": true
}
]
}
}
}
Configuration Parameters¶
flight_server¶
| Parameter | Type | Default | Description |
|---|---|---|---|
host |
string | "0.0.0.0" | Listen address |
port |
int | 50051 | gRPC server port |
auth_token |
string | "" | Bearer token for authentication |
max_message_size |
int | 16777216 | Maximum message size (16MB) |
flightsql_server (optional, Grafana FlightSQL)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
enable |
bool | false | Listen for Apache Arrow FlightSQL (separate from Airport) |
host |
string | "0.0.0.0" | Listen address |
port |
int | 50055 | gRPC FlightSQL port |
auth_token |
string | "" | If set, require Authorization: Bearer <token> |
catalog_refresh_interval_sec |
int | 30 | Periodic DETACH/ATTACH refresh when not using a shared writer DB |
See FLIGHTSQL.md for Grafana setup and coordinator proxy (coordinator.flightsql_server, default port 32010).
ducklake¶
| Parameter | Type | Description |
|---|---|---|
lake_name |
string | Base DuckLake name (used as prefix) |
volumes |
array | Array of storage volumes |
volume¶
| Parameter | Type | Default | Description |
|---|---|---|---|
name |
string | - | Volume label. Use "default" for the primary single-volume catalog (see below). Other labels (hot, cold, …) form the DuckDB catalog name together with lake_name. |
type |
string | "local" | Storage type: "local" or "s3" |
catalog_type |
string | "sqlite" | DuckLake catalog — sqlite |
catalog_path |
string | - | Path to catalog file |
path |
string | - | Data path (local or s3://) |
s3_region |
string | - | AWS region (for S3) |
s3_access_key_id |
string | - | AWS Access Key ID |
s3_secret_access_key |
string | - | AWS Secret Access Key |
s3_endpoint |
string | - | Custom S3 endpoint (MinIO, R2) |
s3_use_ssl |
bool | true | Use SSL for S3 connections |
override_data_path |
bool | false | If true, DuckLake ATTACH uses OVERRIDE_DATA_PATH TRUE so path may differ from the DATA_PATH already stored in the catalog (e.g. bucket rename). Prefer keeping path identical to the writer volume that created the catalog. |
Volume name and the DuckDB catalog¶
When Node attaches DuckLake, the volume’s name determines the DuckDB catalog identifier used in SQL (e.g. homer_lake.main.hep_proto_1_call):
- If
nameis"default", the catalog is attached asnode.ducklake.lake_nameonly (e.g.homer_lake). This matches the default single-catalog writer (storage.ducklake.lake_name). - For any other
name, the catalog is{lake_name}_{name}— the same pattern as in the Multiple Volumes example (homer_lake_hot,homer_lake_cold).
API rows may include a storage_volume column (your volume label). Qualified FROM clauses must use the catalog name, not the label alone; the coordinator typically emits lake_name.main…, and Node may rewrite it per volume.
For tiered storage, keep node.ducklake.volumes in sync with storage.ducklake.storage_policy.volumes (same name values and matching catalog_path / path per volume). See STORAGE_POLICIES.md.
How Multi-Volume Works¶
UNION ALL Mechanism¶
When multiple volumes are configured, Node automatically rewrites incoming queries:
Incoming query:
SELECT * FROM homer_lake.main.sip_transactions
WHERE timestamp BETWEEN '2025-01-01 10:00:00' AND '2025-01-01 12:00:00'
Rewritten query:
SELECT * FROM (
SELECT * FROM homer_lake_hot.main.sip_transactions
UNION ALL
SELECT * FROM homer_lake_cold.main.sip_transactions
)
WHERE timestamp BETWEEN '2025-01-01 10:00:00' AND '2025-01-01 12:00:00'
Partition Pruning¶
DuckDB automatically optimizes UNION ALL queries through:
- Predicate pushdown - WHERE conditions are applied to each subquery
- Partition pruning - partitions not containing data in the requested timerange are skipped
- File statistics - DuckLake stores min/max statistics for each file
This means if the data for the requested period exists only in the hot volume, the cold volume won't be scanned at all.
HTTP API¶
GET /api/v4/node/stats¶
Returns node statistics.
{
"running": true,
"address": "0.0.0.0:50051",
"volumes_count": 2,
"volumes": ["hot", "cold"]
}
POST /api/v4/node/query¶
Executes an SQL query.
Request:
{
"sql": "SELECT * FROM homer_lake.main.sip_transactions LIMIT 10"
}
Response:
{
"columns": ["timestamp", "call_id", "method", ...],
"data": [...]
}
POST /api/v4/node/vacuum¶
Runs vacuum on all volumes.
Request:
{
"expire_older_than": "7 days"
}
FlightSQL Clients¶
Python (pyarrow)¶
from pyarrow.flight import FlightClient, FlightCallOptions
client = FlightClient("grpc://localhost:50051")
options = FlightCallOptions(headers=[("authorization", "Bearer your-token")])
info = client.get_flight_info(
FlightDescriptor.for_command(b"SELECT * FROM homer_lake.main.sip_transactions LIMIT 10"),
options
)
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.to_pandas())
DuckDB¶
INSTALL arrow;
LOAD arrow;
-- Connect to FlightSQL server
ATTACH 'arrow_flight://localhost:50051?auth_token=your-token' AS homer;
-- Query data
SELECT * FROM homer.homer_lake.main.sip_transactions LIMIT 10;
Ingest, Storage and Node Integration¶
Ingest, Storage and Node modules work together:
┌────────────────────────────────────────────────────────────┐
│ Homer Server │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Ingest │──▶│ Storage │──▶│ Node │ │
│ │ (HEP recv) │ │ (DuckLake) │ │ (FlightSQL) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ UDP/TCP/HTTP Parquet write gRPC:50051 │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ DuckLake Catalog │ │
│ │ (shared catalog, same volumes) │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────┴─────────────┐ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Hot Volume │ │ Cold Volume │ │
│ │ (local SSD) │ │ (S3/R2) │ │
│ └─────────────┘ └─────────────┘ │
└────────────────────────────────────────────────────────────┘
Important: Storage and Node must use the same volumes with identical catalog_path (and data path) for each volume. For a single logical DuckLake that the writer owns, the node volume must point at the same catalog and Parquet tree as storage.ducklake.
Writer + Node (all-in-one process)¶
When Ingest, Storage, and Node are enabled together, Node uses the writer’s DuckDB connection for queries so in-memory buffer tables and freshly flushed Parquet stay visible without a second attach.
- The SQL engine then only sees catalogs the writer actually attached — for the default (non–storage-policy) writer that is normally
storage.ducklake.lake_name(e.g.homer_lake). - Set
node.ducklake.lake_nameto the same value asstorage.ducklake.lake_name, and aligncatalog_path/pathon the single volume with storage. - Using
"name": "default"for that single volume keeps Node’s own DuckDB attach (used for catalog refresh when not sharing the writer DB) consistent with the writer catalog name.
Current code also rewrites coordinator SQL to use node.ducklake.lake_name when executing on the shared writer connection with one volume (duckLakeCatalogForQuery in src/node/node.go), which avoids Binder errors if name is not default. Path and lake_name alignment with storage remains mandatory.
With multi-volume node.ducklake.volumes and a writer storage_policy, the Node’s HTTP /query path (coordinator → local node) merges results from the writer DuckDB (homer_lake + memory buffer) and the writer’s TieredStorageManager DuckDB (homer_lake_hot / homer_lake_cold UNION), so cold-tier data is visible in search. Apache Arrow FlightSQL still uses the writer connection only until extended similarly.
Troubleshooting¶
| Symptom | Likely cause |
|---|---|
Binder Error: Catalog "homer_lake_…" does not exist on search / API queries |
Rewritten SQL referenced a suffixed catalog (from volumes[].name) that is not attached on the connection running the query — e.g. all-in-one with shared writer DB while the node volume used a non-default label. Fix: same lake_name and paths as storage, mirror tiered volume names with STORAGE_POLICIES.md, use "name": "default" for the single primary volume, or use a build that includes shared-DB single-volume catalog normalization (see duckLakeCatalogForQuery in src/node/node.go). |
DATA_PATH … does not match existing data path in the catalog on Node attach |
The S3/local path for a volume must match the path recorded when the catalog was first created (usually by the writer). Fix: align path with storage.ducklake.storage_policy.volumes for that volume, or set override_data_path": true on that volume (DuckLake OVERRIDE_DATA_PATH) if you intentionally moved data or renamed the bucket prefix. |
Example Configurations¶
See examples in the examples/ directory:
homer-node.json- Node only (read-only)homer-writer.json- Ingest + Storage + Node (combined)homer-writer-rustfs.json- Ingest + Storage + Node with S3 (RustFS/MinIO)homer.json- All-in-one (Ingest + Storage + Node + Coordinator)