Storage Policies - Tiered Storage for Homer Server
Storage policies allow you to configure tiered storage, automatically moving old data from fast local storage (hot) to cheaper object storage like S3 or Cloudflare R2 (cold).
Overview
┌─────────────────────────────────────────────────────────┐
│                  Homer Storage                          │
│                        │                                │
│                        ▼                                │
│              ┌─────────────────────┐                    │
│              │     Hot Volume      │ ◄── New data       │
│              │     (Local SSD)     │     written here   │
│              │    /data/homer/     │                    │
│              │   max_age: 7 days   │                    │
│              └─────────┬───────────┘                    │
│                        │                                │
│                        │ TieringService                 │
│                        │ (automatic, daily)             │
│                        ▼                                │
│              ┌─────────────────────┐                    │
│              │     Cold Volume     │ ◄── Old data       │
│              │   (S3/R2 bucket)    │     moved here     │
│              │  s3://bucket/cold/  │                    │
│              │ max_age: unlimited  │                    │
│              └─────────────────────┘                    │
└─────────────────────────────────────────────────────────┘
Configuration
Add the storage_policy section to your storage.ducklake configuration:
{
  "storage": {
    "enable": true,
    "ducklake": {
      "storage_policy": {
        "enable": true,
        "ttl_move_interval_sec": 3600,
        "move_factor": 0.8,
        "concurrent_moves": 2,
        "move_on_startup": false,
        "volumes": [
          {
            "name": "hot",
            "type": "local",
            "path": "/data/homer/parquet",
            "priority": 0,
            "max_data_age_days": 7,
            "max_size_gb": 100
          },
          {
            "name": "cold",
            "type": "s3",
            "path": "s3://your-bucket/homer/cold/",
            "priority": 1,
            "max_data_age_days": 0,
            "s3_region": "us-east-1",
            "s3_access_key_id": "YOUR_ACCESS_KEY",
            "s3_secret_access_key": "YOUR_SECRET_KEY",
            "s3_endpoint": "",
            "s3_use_ssl": true
          }
        ]
      }
    }
  }
}
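A quick sanity check before restarting can catch common mistakes in the volumes list. The sketch below is not part of Homer: check_volumes is a hypothetical helper, and the rules it enforces (unique priority values, s3:// paths for S3 volumes, move_factor within 0.0-1.0) are assumptions drawn from the examples on this page rather than documented validation behavior.

```python
import json


def check_volumes(policy: dict) -> list:
    """Return a list of human-readable problems found in a storage_policy dict."""
    problems = []
    volumes = policy.get("volumes", [])
    if not volumes:
        problems.append("no volumes defined")

    # Assumption: each volume should have its own priority value.
    priorities = [v.get("priority", 0) for v in volumes]
    if len(set(priorities)) != len(priorities):
        problems.append("duplicate priority values")

    for v in volumes:
        # name and path are listed as required in the Volume Settings table.
        for key in ("name", "path"):
            if not v.get(key):
                problems.append(f"volume is missing required option {key!r}")
        is_s3 = v.get("type", "local") == "s3"
        if is_s3 and not str(v.get("path", "")).startswith("s3://"):
            problems.append(f"volume {v.get('name')!r}: s3 path should start with s3://")

    if not 0.0 <= policy.get("move_factor", 0.8) <= 1.0:
        problems.append("move_factor must be between 0.0 and 1.0")
    return problems


config = json.loads("""
{"storage": {"ducklake": {"storage_policy": {
  "enable": true, "move_factor": 0.8,
  "volumes": [
    {"name": "hot", "type": "local", "path": "/data/homer/parquet", "priority": 0},
    {"name": "cold", "type": "s3", "path": "s3://your-bucket/homer/cold/", "priority": 1}
  ]}}}}
""")
policy = config["storage"]["ducklake"]["storage_policy"]
print(check_volumes(policy) or "ok")
```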
Configuration Options
Storage Policy Settings
| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable tiered storage |
| ttl_move_interval_sec | int | 3600 | How often to check for data to move (seconds) |
| move_factor | float | 0.8 | Move data when volume fill ratio exceeds this value (0.0-1.0) |
| concurrent_moves | int | 2 | Maximum concurrent partition moves |
| move_on_startup | bool | false | Run tiering check on server startup |
move_factor Explained
The move_factor parameter works similarly to ClickHouse storage policies. It controls when data starts moving off a volume based on disk usage:
- Value range: 0.0 to 1.0 (a fraction of max_size_gb)
- Default: 0.8 (80%)
- Behavior: When volume usage exceeds move_factor * max_size_gb, the oldest partitions are moved to the next volume
Example scenarios:
| move_factor | max_size_gb | Trigger Point |
|---|---|---|
| 0.8 | 100 GB | Move starts when volume has 80 GB of data |
| 0.9 | 500 GB | Move starts when volume has 450 GB of data |
| 0.5 | 200 GB | Move starts when volume has 100 GB of data |
| 1.0 | any | Only TTL-based moves (age), no size-based moves |
Note: If max_size_gb is 0 (unlimited), only TTL-based moves (max_data_age_days) will trigger data movement.
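The trigger-point arithmetic in the table can be written out as a small sketch. The function names size_trigger_gb and should_move_by_size are hypothetical and only mirror the rules stated above; they are not Homer's internal API.

```python
from typing import Optional


def size_trigger_gb(move_factor: float, max_size_gb: float) -> Optional[float]:
    """Fill level in GB above which size-based moves start.

    Returns None when only TTL-based moves apply: max_size_gb of 0
    means unlimited, and a move_factor of 1.0 disables size-based moves.
    """
    if max_size_gb <= 0 or move_factor >= 1.0:
        return None
    return move_factor * max_size_gb


def should_move_by_size(used_gb: float, move_factor: float, max_size_gb: float) -> bool:
    """True when current usage exceeds the size trigger."""
    trigger = size_trigger_gb(move_factor, max_size_gb)
    return trigger is not None and used_gb > trigger


# The rows of the table above, plus the unlimited-size case from the note.
for mf, size in [(0.8, 100), (0.9, 500), (0.5, 200), (1.0, 100), (0.8, 0)]:
    print(f"move_factor={mf} max_size_gb={size} trigger={size_trigger_gb(mf, size)}")
```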
Volume Settings
| Option | Type | Default | Description |
|---|---|---|---|
| name | string | required | Volume name (e.g., "hot", "cold") |
| type | string | "local" | Storage type: "local" or "s3" |
| path | string | required | Local path or S3 URL |
| priority | int | 0 | Lower number = higher priority. New writes go to the volume with the lowest priority number |
| max_data_age_days | int | 0 | Partitions whose DuckLake date is on or before today − N days (calendar dates, inclusive) are moved to the next volume. Example: with N=1 on 2026-05-12, partition date=2026-05-11 is included. 0 disables TTL-based moves. |
| max_size_gb | int | 0 | Max volume size in GB (0 = no limit) |
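The inclusive cutoff rule for max_data_age_days can be illustrated with plain date arithmetic. This is a sketch of the rule as described in the table, not Homer's implementation; eligible_partitions is a hypothetical name.

```python
from datetime import date, timedelta


def eligible_partitions(partition_dates, today, max_data_age_days):
    """Return the partition dates that fall on or before today - N days."""
    if max_data_age_days <= 0:
        # 0 disables TTL-based moves.
        return []
    cutoff = today - timedelta(days=max_data_age_days)
    return sorted(d for d in partition_dates if d <= cutoff)


partitions = [date(2026, 5, 10), date(2026, 5, 11), date(2026, 5, 12)]
today = date(2026, 5, 12)

# With N=1 the cutoff is 2026-05-11, so that partition is included.
print(eligible_partitions(partitions, today, 1))
```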
S3-specific Settings (for type: "s3")
| Option | Type | Default | Description |
|---|---|---|---|
| s3_region | string | "" | AWS region |
| s3_access_key_id | string | "" | Access key |
| s3_secret_access_key | string | "" | Secret key |
| s3_endpoint | string | "" | Custom endpoint for S3-compatible services (R2, MinIO, RustFS) |
| s3_use_ssl | bool | true | Use HTTPS for S3 connections |
Examples
Local + S3 (AWS)
{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 7
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-archive/data/",
      "priority": 1,
      "s3_region": "us-east-1",
      "s3_access_key_id": "AKIAIOSFODNN7EXAMPLE",
      "s3_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    }
  ]
}
Local + Cloudflare R2
{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 30
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-bucket/cold/",
      "priority": 1,
      "s3_region": "auto",
      "s3_access_key_id": "YOUR_R2_ACCESS_KEY",
      "s3_secret_access_key": "YOUR_R2_SECRET_KEY",
      "s3_endpoint": "https://ACCOUNT_ID.r2.cloudflarestorage.com"
    }
  ]
}
Local + MinIO
{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 7
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer/archive/",
      "priority": 1,
      "s3_region": "us-east-1",
      "s3_access_key_id": "minioadmin",
      "s3_secret_access_key": "minioadmin",
      "s3_endpoint": "http://minio:9000",
      "s3_use_ssl": false
    }
  ]
}
Local + RustFS
RustFS is a high-performance S3-compatible object store written in Rust.
{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 7
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-cold/data/",
      "priority": 1,
      "s3_region": "us-east-1",
      "s3_access_key_id": "rustfsadmin",
      "s3_secret_access_key": "rustfsadmin",
      "s3_endpoint": "http://rustfs:9000",
      "s3_use_ssl": false
    }
  ]
}
Three-tier Storage
{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/ssd",
      "priority": 0,
      "max_data_age_days": 3
    },
    {
      "name": "warm",
      "type": "local",
      "path": "/data/homer/hdd",
      "priority": 1,
      "max_data_age_days": 30
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-archive/data/",
      "priority": 2,
      "s3_region": "us-east-1",
      "s3_access_key_id": "...",
      "s3_secret_access_key": "..."
    }
  ]
}
How It Works
Data Flow
- Write: All new data is written to the primary (hot) volume (lowest priority number)
- Tiering: The TieringService periodically checks for old partitions
- Copy: Data older than max_data_age_days is copied to cold storage (new parquet files created)
- Delete: After successful copy, data is deleted from hot storage
- Cleanup: Empty partition directories are automatically removed
- Query: Queries automatically search across all volumes using UNION ALL
Partition Movement Process
Data is partitioned by date (date column). The tiering service copies entire date partitions to cold storage:
-- Step 1: Copy data to cold storage (creates new parquet files in S3)
INSERT INTO cold_lake.main.hep_proto_1_call
SELECT * FROM hot_lake.main.hep_proto_1_call
WHERE date = '2026-01-15';
-- Step 2: Delete from hot storage (marks records as deleted in DuckLake catalog)
DELETE FROM hot_lake.main.hep_proto_1_call
WHERE date = '2026-01-15';
-- Step 3: Cleanup empty partition directories (automatic)
-- /data/homer/parquet/main/hep_proto_1_call/date=2026-01-15/ removed if empty
Important notes:
- This is a copy + delete operation, not physical file movement
- New parquet files are created in cold storage (S3/R2)
- Original parquet files in hot storage are marked for deletion (GC removes them later)
- If copy succeeds but delete fails, data exists in both places (safe, no data loss)
- Tables in cold storage are created with PARTITION BY (date) for efficient queries
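The copy-then-delete ordering can be tried out locally. The sketch below uses Python's built-in sqlite3 module with two attached in-memory databases as a stand-in for the hot and cold DuckLake catalogs; this is an assumption made purely for illustration (SQLite addresses tables as catalog.table rather than catalog.main.table), and move_partition is a hypothetical helper, not Homer's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two separate in-memory databases stand in for the hot and cold lakes.
conn.execute("ATTACH DATABASE ':memory:' AS hot_lake")
conn.execute("ATTACH DATABASE ':memory:' AS cold_lake")
for lake in ("hot_lake", "cold_lake"):
    conn.execute(f"CREATE TABLE {lake}.hep_proto_1_call (date TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO hot_lake.hep_proto_1_call VALUES (?, ?)",
    [("2026-01-15", "a"), ("2026-01-15", "b"), ("2026-01-16", "c")],
)
conn.commit()


def move_partition(conn, day):
    """Copy one date partition to cold, then delete it from hot."""
    # Step 1: copy. If this fails, nothing has been removed from hot.
    with conn:
        conn.execute(
            "INSERT INTO cold_lake.hep_proto_1_call "
            "SELECT * FROM hot_lake.hep_proto_1_call WHERE date = ?",
            (day,),
        )
    # Step 2: delete, in its own transaction. If this fails, the rows
    # exist in both places, which is safe from a data-loss standpoint.
    with conn:
        cur = conn.execute(
            "DELETE FROM hot_lake.hep_proto_1_call WHERE date = ?", (day,)
        )
    return cur.rowcount


moved = move_partition(conn, "2026-01-15")
hot_left = conn.execute("SELECT COUNT(*) FROM hot_lake.hep_proto_1_call").fetchone()[0]
cold_rows = conn.execute("SELECT COUNT(*) FROM cold_lake.hep_proto_1_call").fetchone()[0]
print(moved, hot_left, cold_rows)
```

Running the copy and the delete as separate transactions is what makes the failure mode described above possible: the only partial state is duplication, never loss.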
Querying Across Volumes
When storage policy is enabled, queries automatically span all volumes:
-- Executed internally as:
(SELECT * FROM hot_lake.main.hep_proto_1_call WHERE ...)
UNION ALL
(SELECT * FROM cold_lake.main.hep_proto_1_call WHERE ...)
ORDER BY timestamp DESC
LIMIT 1000
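The same query shape extends to any number of volumes. The sketch below builds the statement as a string; build_union_query is a hypothetical helper, and the warm_lake catalog name is an assumption for a three-tier setup (only hot_lake and cold_lake appear in this document).

```python
def build_union_query(table, catalogs, where, order_by="timestamp DESC", limit=1000):
    """Build one SELECT per catalog and combine them with UNION ALL.

    The where clause is interpolated as-is, so this sketch is only
    suitable for trusted input.
    """
    if not catalogs:
        raise ValueError("at least one catalog is required")
    parts = [f"(SELECT * FROM {c}.main.{table} WHERE {where})" for c in catalogs]
    return "\nUNION ALL\n".join(parts) + f"\nORDER BY {order_by}\nLIMIT {limit}"


query = build_union_query(
    "hep_proto_1_call",
    ["hot_lake", "warm_lake", "cold_lake"],
    "date = '2026-01-15'",
)
print(query)
```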
Monitoring
Monitor tiered storage via logs:
level=INFO msg="TieringService: Starting tiering cycle"
level=INFO msg="TieringService: Found old partitions" table=hep_proto_1_call count=3 dates=[2026-01-10 2026-01-11 2026-01-12]
level=INFO msg="TieredStorageManager: Partition moved" table=hep_proto_1_call date=2026-01-10 rows=150000
level=INFO msg="TieringService: Tiering cycle completed" duration=45.2s partitions_moved=3
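These key=value lines are easy to parse for dashboards or alerts. The following is a minimal sketch using only the standard library; parse_logfmt is a hypothetical helper, and it assumes the duration is always printed as plain seconds with an "s" suffix, as in the sample above.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


line = (
    'level=INFO msg="TieringService: Tiering cycle completed" '
    "duration=45.2s partitions_moved=3"
)
fields = parse_logfmt(line)
moved = int(fields["partitions_moved"])
# Assumption: duration is plain seconds, e.g. "45.2s".
seconds = float(fields["duration"].rstrip("s"))
print(fields["msg"], moved, seconds)
```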
Migration from Non-Tiered Setup
If you have existing data without tiered storage and want to enable it, the system automatically handles migration:
Automatic Migration
When tiered storage is enabled, the system checks for an existing legacy catalog:
| Scenario | Hot Catalog | Cold Catalog |
|---|---|---|
| New installation | homer_catalog_hot.sqlite | homer_catalog_cold.sqlite |
| Migration from legacy | homer_catalog.sqlite (existing) | homer_catalog_cold.sqlite |
What happens:
1. If homer_catalog.sqlite exists, it's used as the hot volume catalog
2. A new homer_catalog_cold.sqlite is created for cold storage
3. Existing Parquet files in /data/homer/parquet/ continue to work
4. Old data will gradually move to cold storage based on max_data_age_days
Log output during migration:
level=INFO msg="TieredStorageManager: Using legacy catalog for hot volume (migration mode)" path=/data/homer/homer_catalog.sqlite
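The catalog selection described in the table can be summarized as a short sketch. pick_catalogs is a hypothetical function that only mirrors the documented file names; it assumes all catalog files live in the same directory, which this page does not state explicitly.

```python
import tempfile
from pathlib import Path


def pick_catalogs(data_dir: Path):
    """Return (hot_catalog, cold_catalog) paths following the migration table."""
    legacy = data_dir / "homer_catalog.sqlite"
    # An existing legacy catalog is reused for the hot volume (migration mode).
    hot = legacy if legacy.exists() else data_dir / "homer_catalog_hot.sqlite"
    cold = data_dir / "homer_catalog_cold.sqlite"
    return hot, cold


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    fresh = [p.name for p in pick_catalogs(data_dir)]
    # Simulate an installation that already has a legacy catalog.
    (data_dir / "homer_catalog.sqlite").touch()
    migrated = [p.name for p in pick_catalogs(data_dir)]

print(fresh)
print(migrated)
```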
No Manual Steps Required
Simply enable storage_policy in your config and restart. The system handles the rest.
Best Practices
- Start with longer retention on hot storage: Begin with 30 days and reduce as needed. Configure TTL via retention_days, not mapping schema alone.
- Use compaction before tiering: Ensure compaction runs before tiering to minimize small files in cold storage
- Monitor S3 costs: Object storage egress can be expensive for frequently queried data
- Test restore procedures: Periodically verify you can query data from cold storage
- Use lifecycle policies: Configure S3 lifecycle rules for further cost optimization (e.g., Glacier after 1 year)
Limitations
- Currently supports moving by date partition only (not by size)
- No automatic data recall from cold to hot
- S3 query performance may be slower than local storage
- Each volume requires a separate DuckLake catalog file