Storage Policies - Tiered Storage for Homer Server

Storage policies allow you to configure tiered storage, automatically moving old data from fast local storage (hot) to cheaper object storage like S3 or Cloudflare R2 (cold).

Overview

┌─────────────────────────────────────────────────────────┐
│                    Homer Storage                         │
│                         │                                │
│                         ▼                                │
│              ┌─────────────────────┐                    │
│              │    Hot Volume       │ ◄── New data       │
│              │   (Local SSD)       │     written here   │
│              │  /data/homer/       │                    │
│              │  max_age: 7 days    │                    │
│              └─────────┬───────────┘                    │
│                        │                                │
│                        │ TieringService                 │
│                        │ (automatic, periodic)          │
│                        ▼                                │
│              ┌─────────────────────┐                    │
│              │    Cold Volume      │ ◄── Old data       │
│              │   (S3/R2 bucket)    │     moved here     │
│              │  s3://bucket/cold/  │                    │
│              │  max_age: unlimited │                    │
│              └─────────────────────┘                    │
└─────────────────────────────────────────────────────────┘

Configuration

Add the storage_policy section to your storage.ducklake configuration:

{
  "storage": {
    "enable": true,
    "ducklake": {
      "storage_policy": {
        "enable": true,
        "ttl_move_interval_sec": 3600,
        "move_factor": 0.8,
        "concurrent_moves": 2,
        "move_on_startup": false,
        "volumes": [
          {
            "name": "hot",
            "type": "local",
            "path": "/data/homer/parquet",
            "priority": 0,
            "max_data_age_days": 7,
            "max_size_gb": 100
          },
          {
            "name": "cold",
            "type": "s3",
            "path": "s3://your-bucket/homer/cold/",
            "priority": 1,
            "max_data_age_days": 0,
            "s3_region": "us-east-1",
            "s3_access_key_id": "YOUR_ACCESS_KEY",
            "s3_secret_access_key": "YOUR_SECRET_KEY",
            "s3_endpoint": "",
            "s3_use_ssl": true
          }
        ]
      }
    }
  }
}

Configuration Options

Storage Policy Settings

Option	Type	Default	Description
`enable`	bool	false	Enable tiered storage
`ttl_move_interval_sec`	int	3600	How often to check for data to move (seconds)
`move_factor`	float	0.8	Move data when volume fill ratio exceeds this value (0.0-1.0)
`concurrent_moves`	int	2	Maximum concurrent partition moves
`move_on_startup`	bool	false	Run tiering check on server startup

move_factor Explained

The move_factor parameter works similarly to its counterpart in ClickHouse storage policies. It controls when data starts moving off a volume, based on disk usage:

Value range: 0.0 to 1.0 (a fraction of max_size_gb, so 0.8 means 80%)
Default: 0.8 (80%)
Behavior: When volume usage exceeds move_factor * max_size_gb, oldest partitions are moved to the next volume

Example scenarios:

move_factor	max_size_gb	Trigger Point
0.8	100 GB	Move starts when volume has 80 GB of data
0.9	500 GB	Move starts when volume has 450 GB of data
0.5	200 GB	Move starts when volume has 100 GB of data
1.0	any	Only TTL-based moves (age), no size-based moves

Note: If max_size_gb is 0 (unlimited), only TTL-based moves (max_data_age_days) will trigger data movement.
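
The trigger point in the table above is plain arithmetic. As a minimal sketch, assuming the rules stated in this section and using illustrative function names that are not part of Homer, the threshold and the size check could be written like this:

```python
from typing import Optional


def size_trigger_gb(move_factor: float, max_size_gb: float) -> Optional[float]:
    """Return the usage (in GB) above which size-based moves start.

    Returns None when size-based moves are effectively disabled.
    """
    if not 0.0 <= move_factor <= 1.0:
        raise ValueError("move_factor must be between 0.0 and 1.0")
    # max_size_gb == 0 means "no limit"; move_factor == 1.0 leaves only TTL moves.
    if max_size_gb == 0 or move_factor == 1.0:
        return None
    return move_factor * max_size_gb


def should_move_by_size(used_gb: float, move_factor: float, max_size_gb: float) -> bool:
    """True when current usage exceeds the size trigger."""
    trigger = size_trigger_gb(move_factor, max_size_gb)
    return trigger is not None and used_gb > trigger
```

With move_factor 0.8 and max_size_gb 100, a volume holding 85 GB would qualify for a size-based move, while one holding exactly 80 GB would not, because the documented condition is "exceeds".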

Volume Settings

Option	Type	Default	Description
`name`	string	required	Volume name (e.g., "hot", "cold")
`type`	string	"local"	Storage type: "local" or "s3"
`path`	string	required	Local path or S3 URL
`priority`	int	0	Lower number = higher priority. New writes go to the volume with the lowest priority number
`max_data_age_days`	int	0	Tiering moves rows in partitions whose DuckLake `date` is on or before today's calendar date minus `N` days (inclusive), where `N` is this value. Example: with `N=1` on 2026-05-12, partition `date=2026-05-11` is eligible. `0` disables TTL-based moves.
`max_size_gb`	int	0	Max volume size in GB (0 = no limit)
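
The inclusive cutoff for max_data_age_days is easy to get wrong by one day, so a small worked example may help. This is a sketch of the rule as described in the table, with hypothetical helper names; it is not Homer's implementation:

```python
from datetime import date, timedelta
from typing import Optional


def ttl_cutoff(today: date, max_data_age_days: int) -> Optional[date]:
    """Latest partition date eligible for a TTL move, or None if TTL moves are off."""
    if max_data_age_days <= 0:
        return None  # 0 disables TTL-based moves
    return today - timedelta(days=max_data_age_days)


def is_eligible(partition_date: date, today: date, max_data_age_days: int) -> bool:
    """True when the partition date is on or before the cutoff (inclusive)."""
    cutoff = ttl_cutoff(today, max_data_age_days)
    return cutoff is not None and partition_date <= cutoff
```

For example, `is_eligible(date(2026, 5, 11), date(2026, 5, 12), 1)` is true, while the same call for the partition dated 2026-05-12 is false.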

S3-specific Settings (for `type: "s3"`)

Option	Type	Default	Description
`s3_region`	string	""	AWS region
`s3_access_key_id`	string	""	Access key
`s3_secret_access_key`	string	""	Secret key
`s3_endpoint`	string	""	Custom endpoint for S3-compatible services (R2, MinIO, RustFS)
`s3_use_ssl`	bool	true	Use HTTPS for S3 connections
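
Before restarting the server with a new policy, it can be useful to sanity-check the volumes list. The sketch below derives its checks from the option tables above; the function is hypothetical, and two of its rules (unique priorities, and an `s3://` prefix for S3 paths) are assumptions rather than documented Homer validation:

```python
from typing import Any, Dict, List


def check_volumes(volumes: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    seen_priorities = set()
    for vol in volumes:
        name = vol.get("name") or "<unnamed>"
        if not vol.get("name"):
            problems.append("volume is missing required 'name'")
        if not vol.get("path"):
            problems.append(f"{name}: missing required 'path'")
        vtype = vol.get("type", "local")  # documented default
        if vtype not in ("local", "s3"):
            problems.append(f"{name}: unknown type {vtype!r}")
        # Assumption: S3 volumes use an s3:// URL, as in every example here.
        if vtype == "s3" and not str(vol.get("path", "")).startswith("s3://"):
            problems.append(f"{name}: s3 volumes need an s3:// path")
        # Assumption: two volumes should not share a priority.
        priority = vol.get("priority", 0)
        if priority in seen_priorities:
            problems.append(f"{name}: duplicate priority {priority}")
        seen_priorities.add(priority)
    return problems
```

Passing the `volumes` array from the configuration above should return an empty list.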

Examples

Local + S3 (AWS)

{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 7
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-archive/data/",
      "priority": 1,
      "s3_region": "us-east-1",
      "s3_access_key_id": "AKIAIOSFODNN7EXAMPLE",
      "s3_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    }
  ]
}

Local + Cloudflare R2

{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 30
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-bucket/cold/",
      "priority": 1,
      "s3_region": "auto",
      "s3_access_key_id": "YOUR_R2_ACCESS_KEY",
      "s3_secret_access_key": "YOUR_R2_SECRET_KEY",
      "s3_endpoint": "https://ACCOUNT_ID.r2.cloudflarestorage.com"
    }
  ]
}

Local + MinIO

{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 7
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer/archive/",
      "priority": 1,
      "s3_region": "us-east-1",
      "s3_access_key_id": "minioadmin",
      "s3_secret_access_key": "minioadmin",
      "s3_endpoint": "http://minio:9000",
      "s3_use_ssl": false
    }
  ]
}

Local + RustFS

RustFS is a high-performance, S3-compatible object store written in Rust.

{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/parquet",
      "priority": 0,
      "max_data_age_days": 7
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-cold/data/",
      "priority": 1,
      "s3_region": "us-east-1",
      "s3_access_key_id": "rustfsadmin",
      "s3_secret_access_key": "rustfsadmin",
      "s3_endpoint": "http://rustfs:9000",
      "s3_use_ssl": false
    }
  ]
}

Three-tier Storage

{
  "volumes": [
    {
      "name": "hot",
      "type": "local",
      "path": "/data/homer/ssd",
      "priority": 0,
      "max_data_age_days": 3
    },
    {
      "name": "warm",
      "type": "local",
      "path": "/data/homer/hdd",
      "priority": 1,
      "max_data_age_days": 30
    },
    {
      "name": "cold",
      "type": "s3",
      "path": "s3://homer-archive/data/",
      "priority": 2,
      "s3_region": "us-east-1",
      "s3_access_key_id": "...",
      "s3_secret_access_key": "..."
    }
  ]
}
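
To reason about where a partition should end up in a multi-tier setup like the one above, the following sketch walks the volumes in priority order. It assumes each volume's max_data_age_days is measured against the partition's age in days since its `date`, and that the last volume keeps everything; the function name is illustrative only:

```python
from typing import Any, Dict, List


def tier_for_age(volumes: List[Dict[str, Any]], age_days: int) -> str:
    """Return the name of the volume where a partition of this age should rest."""
    ordered = sorted(volumes, key=lambda v: v["priority"])
    for vol in ordered[:-1]:
        limit = vol.get("max_data_age_days", 0)
        # A limit of 0 disables TTL moves, so data stays on this volume.
        # The cutoff is inclusive: a partition exactly `limit` days old moves on.
        if limit == 0 or age_days < limit:
            return vol["name"]
    return ordered[-1]["name"]


three_tier = [
    {"name": "hot", "priority": 0, "max_data_age_days": 3},
    {"name": "warm", "priority": 1, "max_data_age_days": 30},
    {"name": "cold", "priority": 2},
]
```

Under these assumptions, a 2-day-old partition stays on `hot`, a 10-day-old one belongs on `warm`, and a 40-day-old one belongs on `cold`.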

How It Works

Data Flow

Write: All new data is written to the primary (hot) volume, which is the one with the lowest priority number
Tiering: The TieringService periodically checks for old partitions
Copy: Data older than max_data_age_days is copied to cold storage (new parquet files created)
Delete: After successful copy, data is deleted from hot storage
Cleanup: Empty partition directories are automatically removed
Query: Queries automatically search across all volumes using UNION ALL

Partition Movement Process

Data is partitioned by date (date column). The tiering service copies entire date partitions to cold storage:

-- Step 1: Copy data to cold storage (creates new parquet files in S3)
INSERT INTO cold_lake.main.hep_proto_1_call 
SELECT * FROM hot_lake.main.hep_proto_1_call 
WHERE date = '2026-01-15';

-- Step 2: Delete from hot storage (marks records as deleted in DuckLake catalog)
DELETE FROM hot_lake.main.hep_proto_1_call 
WHERE date = '2026-01-15';

-- Step 3: Cleanup empty partition directories (automatic)
-- /data/homer/parquet/main/hep_proto_1_call/date=2026-01-15/ removed if empty

Important notes:
- This is a copy + delete operation, not physical file movement
- New parquet files are created in cold storage (S3/R2)
- Original parquet files in hot storage are marked for deletion (GC removes them later)
- If copy succeeds but delete fails, data exists in both places (safe, no data loss)
- Tables in cold storage are created with PARTITION BY (date) for efficient queries
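
The copy-then-delete ordering is what makes a failed move safe. The sketch below illustrates that ordering with Python's built-in sqlite3 module, using two plain tables named `hot` and `cold` as stand-ins for the DuckLake catalogs; it is an illustration of the pattern, not the TieringService code:

```python
import sqlite3


def move_partition(conn: sqlite3.Connection, partition_date: str) -> int:
    """Copy one date partition from hot to cold, then delete it from hot.

    Returns the number of rows present in cold for that date after the copy.
    """
    # Step 1: copy. The `with` block commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO cold SELECT * FROM hot WHERE date = ?", (partition_date,)
        )
    in_cold = conn.execute(
        "SELECT COUNT(*) FROM cold WHERE date = ?", (partition_date,)
    ).fetchone()[0]
    # Step 2: delete only after the copy is committed. If this step fails,
    # the rows exist in both tables, which loses nothing.
    with conn:
        conn.execute("DELETE FROM hot WHERE date = ?", (partition_date,))
    return in_cold


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hot (date TEXT, payload TEXT)")
conn.execute("CREATE TABLE cold (date TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO hot VALUES (?, ?)",
    [("2026-01-15", "a"), ("2026-01-15", "b"), ("2026-01-16", "c")],
)
conn.commit()
moved = move_partition(conn, "2026-01-15")
```

Committing the copy separately from the delete means an interruption between the two steps leaves duplicated rows rather than missing ones.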

Querying Across Volumes

When storage policy is enabled, queries automatically span all volumes:

-- Executed internally as:
(SELECT * FROM hot_lake.main.hep_proto_1_call WHERE ...)
UNION ALL
(SELECT * FROM cold_lake.main.hep_proto_1_call WHERE ...)
ORDER BY timestamp DESC
LIMIT 1000
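
The shape of that internal query generalizes to any number of volumes. As a rough sketch, assuming catalog names such as `hot_lake` and `cold_lake` and a hypothetical helper that is not part of Homer, the query text could be assembled like this:

```python
from typing import List


def build_union_query(catalogs: List[str], table: str, where: str, limit: int) -> str:
    """Build one SELECT per volume catalog and combine them with UNION ALL."""
    if not catalogs:
        raise ValueError("at least one catalog is required")
    # Plain string formatting is used for illustration only; the inputs here
    # are trusted identifiers, not user-supplied values.
    parts = [f"(SELECT * FROM {c}.main.{table} WHERE {where})" for c in catalogs]
    body = "\nUNION ALL\n".join(parts)
    return f"{body}\nORDER BY timestamp DESC\nLIMIT {int(limit)}"


query = build_union_query(
    ["hot_lake", "cold_lake"], "hep_proto_1_call", "date = '2026-01-15'", 1000
)
```

The ORDER BY and LIMIT are applied once, after the union, so the newest rows win regardless of which volume holds them.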

Monitoring

Monitor tiered storage via logs:

level=INFO msg="TieringService: Starting tiering cycle"
level=INFO msg="TieringService: Found old partitions" table=hep_proto_1_call count=3 dates=[2026-01-10 2026-01-11 2026-01-12]
level=INFO msg="TieredStorageManager: Partition moved" table=hep_proto_1_call date=2026-01-10 rows=150000
level=INFO msg="TieringService: Tiering cycle completed" duration=45.2s partitions_moved=3
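
These lines follow a key=value layout, so cycle statistics can be extracted with a few lines of scripting. The parser below is a minimal sketch that assumes the format shown above; `parse_logfmt` is an illustrative name, not a Homer tool:

```python
import shlex
from typing import Dict


def parse_logfmt(line: str) -> Dict[str, str]:
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields: Dict[str, str] = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


line = (
    'level=INFO msg="TieringService: Tiering cycle completed" '
    "duration=45.2s partitions_moved=3"
)
fields = parse_logfmt(line)
partitions_moved = int(fields["partitions_moved"])
```

Bracketed lists such as the `dates=[...]` field contain unquoted spaces, so this simple parser would keep only the first element of that field.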

Migration from Non-Tiered Setup

If you have existing data from a non-tiered setup and enable tiered storage, the system handles the migration automatically.

Automatic Migration

When tiered storage is enabled, the system checks for an existing legacy catalog:

Scenario	Hot Catalog	Cold Catalog
New installation	`homer_catalog_hot.sqlite`	`homer_catalog_cold.sqlite`
Migration from legacy	`homer_catalog.sqlite` (existing)	`homer_catalog_cold.sqlite`

What happens:
1. If homer_catalog.sqlite exists, it's used as the hot volume catalog
2. A new homer_catalog_cold.sqlite is created for cold storage
3. Existing Parquet files in /data/homer/parquet/ continue to work
4. Old data will gradually move to cold storage based on max_data_age_days
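
The catalog selection in the table reduces to one existence check. The sketch below mirrors that rule using the file names from the table; the function and the temporary data directory are hypothetical and only illustrate the decision:

```python
import tempfile
from pathlib import Path
from typing import Tuple


def pick_catalogs(data_dir: Path) -> Tuple[Path, Path]:
    """Return (hot_catalog, cold_catalog) paths for the given data directory."""
    legacy = data_dir / "homer_catalog.sqlite"
    # Migration mode: reuse the legacy catalog as the hot volume catalog.
    hot = legacy if legacy.exists() else data_dir / "homer_catalog_hot.sqlite"
    cold = data_dir / "homer_catalog_cold.sqlite"
    return hot, cold


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    fresh_hot, fresh_cold = pick_catalogs(data_dir)      # new installation
    (data_dir / "homer_catalog.sqlite").touch()          # simulate legacy data
    migrated_hot, migrated_cold = pick_catalogs(data_dir)
```

In the new-installation case the hot catalog is `homer_catalog_hot.sqlite`; once a legacy file is present, the same call returns `homer_catalog.sqlite` instead.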

Log output during migration:

level=INFO msg="TieredStorageManager: Using legacy catalog for hot volume (migration mode)" path=/data/homer/homer_catalog.sqlite

No Manual Steps Required

Simply enable storage_policy in your config and restart. The system handles the rest.

Best Practices

Start with longer retention on hot storage: Begin with 30 days and reduce as needed. Configure TTL via retention_days, not the mapping schema alone.
Use compaction before tiering: Ensure compaction runs before tiering to minimize small files in cold storage
Monitor S3 costs: Object storage egress can be expensive for frequently queried data
Test restore procedures: Periodically verify you can query data from cold storage
Use lifecycle policies: Configure S3 lifecycle rules for further cost optimization (e.g., Glacier after 1 year)

Limitations

Data currently moves in whole date partitions only; size limits decide when moves start, but a partition is never split by size
No automatic data recall from cold to hot
S3 query performance may be slower than local storage
Each volume requires a separate DuckLake catalog file