HEP UDP/TCP ingest: Mpps benchmark and bottleneck analysis
Test bench: Intel Core Ultra 9 185H (22 logical cores), Linux 6.8, loopback; after
`sudo ./scripts/tune-udp-sysctl.sh`: `net.core.rmem_max = 33554432` (32 MiB). Benchmarks live in `src/server/mpps_benchmark_test.go`. See Reproduce for exact commands.
The numbers below answer two questions:
- What end-to-end Mpps can a single Homer ingest process sustain for HEP3 UDP and TCP?
- Which line of code is the actual bottleneck?
The short version (after pool + metrics batching fixes, 2026-05-19):
- TCP peak: ~1.46 Mpps (≈8.6 Gbps) at 16 workers / 16 connections.
- UDP peak (sysctl tuned): ~0.83 Mpps at 16w/16s; best drop rate on 1 sender: ~0.53 Mpps, ~18% drop (w2_s1). Multi-sender loopback still drops ~80%+ (senders faster than decode, not rmem-limited).
- Theoretical decode ceiling: ≈ 3.9 Mpps (`BenchmarkHEPDecodeOnly`, 259 ns/op after pooling; was 502 ns / 8 allocs before).
- Before fixes (broken per-server `packetPool` + per-packet Prometheus): TCP ~0.41 Mpps, UDP ~0.35 Mpps on the same host.
Remaining limits:
- Kernel UDP `net.core.rmem_max` (208 KiB default): still causes drops under burst load.
- Decoder / DuckLake writer, once the receive path is no longer allocating per packet.
Headline numbers (after fixes)
UDP HEP (pool fix + tune-udp-sysctl.sh, loopback)
| workers | senders | hep Mpps | drop % | bandwidth |
|---|---|---|---|---|
| 2 | 1 | 0.529 | 17.8% | 3121 Mbps |
| 4 | 1 | 0.469 | 26.8% | 2771 Mbps |
| 4 | 4 | 0.679 | 73.7% | 4008 Mbps |
| 8 | 8 | 0.809 | 82.5% | 4774 Mbps |
| 16 | 16 | 0.828 | 85.8% | 4886 Mbps |
With default rmem_max=212992 (before sysctl), the same matrix showed
~86–94% drop and ~0.12–0.35 Mpps — raising rmem fixes single-sender
paths first.
TCP HEP (single process, loopback connections)
| workers | senders | hep Mpps | bandwidth |
|---|---|---|---|
| 2 | 1 | 0.393 | 2320 Mbps |
| 4 | 1 | 0.406 | 2399 Mbps |
| 4 | 4 | 1.256 | 7414 Mbps |
| 8 | 8 | 1.326 | 7831 Mbps |
| 16 | 16 | 1.458 | 8607 Mbps |
TCP has zero packet loss (the kernel back-pressures the sender). Above 4 workers / 4 connections, scaling flattens: going to 8/8 adds only ~6 % and 16/16 a further ~10 %, because event-loop and worker scheduling contention starts to dominate.
Pure decode ceiling (no networking, no channels)
| benchmark | ns/op | throughput |
|---|---|---|
| BenchmarkHEPDecodeSerial | 1675 | ≈ 597 k pps / core |
| BenchmarkHEPDecodeOnly | 502 | ≈ 1.99 Mpps total (22 cores) |
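These are the pre-pooling figures: 8 allocations per HEP3 packet, mostly from the SIP zero-copy parser. After the decode pooling described under P2, BenchmarkHEPDecodeOnly drops to 259 ns/op and 6 allocations (≈ 3.9 Mpps). This is the upper bound the rest of the pipeline competes against.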
Where the CPU goes
CPU profile from BenchmarkUDPMpps/w8_s8 (5-second window, 84.8 s of
combined CPU time):
| component | cum % / CPU time | notes |
|---|---|---|
| runMppsUDP.func1 (test senders) | 55.7 % | loopback net.conn.Write + syscalls |
| gnet event loops (receive side) | 30.3 % | poll + accept + readUDP |
| (*udpServer).OnTraffic | 25.2 % | of which: |
| ↳ packetPool.Get → mallocgc | 18.85 s | fresh 64 KiB allocation per packet |
| ↳ Prometheus metric updates (×3) | 1.18 s | hot-path counter / histogram updates |
| ↳ inputCh <- pkt | 0.79 s | channel send |
| ↳ time.Now() | 0.15 s | |
| (*HEPInput).worker | 7.6 % | |
| ↳ decoder.DecodeHEP | 5.08 s | dominated by SIP parse |
Two things stand out: workers are not the bottleneck (they spend
~16 % of their on-CPU time waiting in select), and the receive
path spends 88 % of its time on packetPool.Get falling through to
runtime.mallocgc.
Pool mismatch (fixed 2026-05-19)
Status: fixed. All gnet receivers now use HEPInput.getPacketBuf()
/ putPacketBuf() (shared h.buffer pool). Per-server packetPool
was removed from UDP/TCP/TLS.
Historical note — before the fix, src/server/udp.go,
src/server/tcp.go, src/server/tls.go each created a per-server
packetPool and Get() from it on every OnTraffic event:
```go
// src/server/udp.go, lines 118-122
// Copy packet data into pooled buffer (Next buffer is reused by gnet)
buf := us.packetPool.Get().([]byte)
buf = buf[:len(packet)]
copy(buf, packet)
```
The decoder worker, however, returns the buffer to a **different**
pool, `h.buffer`, which is created in `NewHEPInput`:
```go
// src/server/server.go, lines 326-367
if cap(msg.data) >= maxPktLen {
	h.buffer.Put(msg.data[:maxPktLen])
}
```
h.buffer is never read by anyone on the hot path, so every UDP/TCP
packet effectively allocates a fresh 64 KiB buffer
(maxPktLen = 65535) and the GC reclaims it later. At ~300 k pps
that's ~18 GiB/s of byte-buffer allocations going through mallocgc,
which matches the profile exactly.
Applied fix: single shared buffer pool (getPacketBuf /
putPacketBuf on HEPInput). Measured speed-up on the same hardware:
≈ 2.6× UDP and 3.2× TCP Mpps (loopback, before sysctl tuning).
Receive-path Prometheus updates are batched via ingestReceiveMetrics
(ingest_metrics.go, flush every 128 packets, labels resolved once).
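
The shared-pool fix described above can be sketched roughly as follows. This is a minimal illustration rather than Homer's actual code: the `bufPool` type and its `get` / `put` methods are hypothetical stand-ins for `getPacketBuf` / `putPacketBuf`, and storing `*[]byte` in the pool is an assumption made here to avoid boxing the slice header on every `Put`.

```go
package main

import (
	"fmt"
	"sync"
)

const maxPktLen = 65535

// bufPool is one pool shared by receivers (get) and workers (put).
// Hypothetical type: the point is that Get and Put hit the SAME pool.
type bufPool struct{ p sync.Pool }

func newBufPool() *bufPool {
	return &bufPool{p: sync.Pool{New: func() any {
		b := make([]byte, maxPktLen)
		return &b // pointer avoids an allocation when stored as interface
	}}}
}

// get copies packet into a pooled buffer; called on the receive path.
// Oversized input is truncated to the buffer capacity.
func (bp *bufPool) get(packet []byte) *[]byte {
	bufp := bp.p.Get().(*[]byte)
	n := copy((*bufp)[:cap(*bufp)], packet)
	*bufp = (*bufp)[:n]
	return bufp
}

// put hands the buffer back to the same pool; called by the worker
// once decoding (and any write) is finished.
func (bp *bufPool) put(bufp *[]byte) {
	if cap(*bufp) < maxPktLen {
		return // not one of ours; let the GC have it
	}
	*bufp = (*bufp)[:maxPktLen]
	bp.p.Put(bufp)
}

func main() {
	pool := newBufPool()
	buf := pool.get([]byte("hello"))
	fmt.Println(len(*buf), cap(*buf)) // 5 65535
	pool.put(buf)
}
```

With both sides on one pool, steady-state traffic recycles buffers instead of allocating a fresh 64 KiB slice per packet; `sync.Pool` may still drop idle buffers at GC, so occasional allocations remain.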
P2: SIP / HEP decode pooling (2026-05-19)
Changes: HEP + SipMsg sync.Pool, ReleaseHEP() after worker/write,
ipv4BytesToString (no net.IP on hot path).
| metric | before | after |
|---|---|---|
| BenchmarkHEPDecodeOnly | 502 ns, 8 allocs, 1872 B | 259 ns, 6 allocs, 722 B |
| theoretical decode Mpps | ~2.0 | ~3.9 |
Mandatory pairing: every DecodeHEP / Decoder.Decode on the hot path must
call decoder.ReleaseHEP(hep) when done (ingest worker, writer, benchmarks).
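
The decode/release pairing can be illustrated with a small stand-alone sketch. The `msg` type and the `decode`, `release`, and `handle` functions below are hypothetical and only mimic the shape of `DecodeHEP` / `ReleaseHEP`; the real decoder's fields and signatures are not shown in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// msg stands in for a decoded HEP message (hypothetical fields).
type msg struct {
	payload []byte
	srcIP   string
}

var msgPool = sync.Pool{New: func() any { return new(msg) }}

// decode takes a message from the pool and fills it from raw.
// The payload slice keeps its capacity across reuse.
func decode(raw []byte) *msg {
	m := msgPool.Get().(*msg)
	m.payload = append(m.payload[:0], raw...)
	return m
}

// release clears the message and returns it to the pool. The caller
// must not touch m afterwards.
func release(m *msg) {
	m.payload = m.payload[:0]
	m.srcIP = ""
	msgPool.Put(m)
}

// handle shows the pairing: every decode is matched by exactly one
// release, here via defer so early returns are covered too.
func handle(raw []byte) int {
	m := decode(raw)
	defer release(m)
	return len(m.payload)
}

func main() {
	fmt.Println(handle([]byte("INVITE sip:bob@example.com SIP/2.0")))
}
```

Forgetting the release does not break correctness in this sketch, but it silently turns the pool back into a per-packet allocator, which is the failure mode the pairing rule guards against.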
P1: DuckLake WriteHEP profile (2026-05-19)
Benchmarks in src/storage/ducklake/writehep_bench_test.go (warmup excludes
first-table CREATE / Exec).
| benchmark | ns/op | allocs | ~pps |
|---|---|---|---|
| BenchmarkWriteHEP_SIP (decode once) | ~7.1 µs | 31 | ~140k |
| BenchmarkDecodeAndWriteHEP_SIP | ~8.3 µs | 37 | ~120k |
CPU (steady state): decode+SIP parse ~45–50%; DuckLake append/batch
(TableWriter.Write / Appender) ~35–40%; remainder mutex/scheduling.
Allocations per WriteHEP: buildExtraJSON → string(b) copy (~required
for row lifetime), Payload/CID string copies from HEP parse, fastUUID,
pooled []interface{} row. flushBatch dominates alloc profile only when
the batch fills (every 10k rows by default).
Implication: on a writer node with DuckLake enabled, ~120k SIP pps per
core end-to-end is a realistic loopback ceiling before batch flush / catalog
IO; multi-core scales with worker_count and shard_count.
go test -vet=off -tags='!vet' -run='^$' -bench=BenchmarkDecodeAndWriteHEP_SIP \
-benchmem -benchtime=3s ./storage/ducklake/
P3: deferred data_extra + HTTP ingest (2026-05-19)
- `buildExtraJSONCell`: SIP `data_extra` stored as pooled `*[]byte` on the WriteHEP path; `string()` conversion moved to `flushBatch` via `cellToDriverValue` (one alloc per row at flush, not per enqueue).
- HTTP/HTTPS: batched receive metrics (`ingestReceiveMetrics`), shared buffer pool (`copyPacketToPool`), common `handleIngestPOST`.
Secondary bottlenecks
After pool + metrics + decode pooling + deferred JSON, the next ceilings are:
- Kernel UDP socket buffer. With `net.core.rmem_max = 212992`, gnet's `SO_RCVBUF` request is silently capped. Drops happen as soon as a single event-loop pause exceeds ~1 ms. Recommended sysctl for high-rate captures:
sudo ./scripts/tune-udp-sysctl.sh
# persistent:
sudo cp examples/sysctl/99-homer-udp-buffers.conf /etc/sysctl.d/
sudo sysctl --system
Homer config already defaults socket_recv_buffer to 8 MiB; after
sysctl, gnet will get the full value instead of being capped at
rmem_max.
- Prometheus hot-path metrics. `RecordHEPPacketReceived`, `RecordHEPPacketSize`, and `RecordBytesReceived` together cost ~1.2 s of the 5 s window, i.e. ~18 % of OnTraffic time even before the pool fix. Two cheap wins:
  - Batch the size histogram (`HEPPacketSize.Observe`) into a per-worker accumulator, flushed at the same cadence as `serverWorkerMetrics`.
  - Skip `WithLabelValues("udp")` resolution per packet by caching the curried child once per `udpServer`.
- `buildExtraJSON` copy and batch flush on the DuckLake path (see P1 above). Tune `batch_size` / `flush_interval`.
Reproduce
# Pure decode ceiling
cd src
go test -vet=off -tags='!vet' -run='^$' \
-bench='BenchmarkHEPDecode(Serial|Only)' -benchtime=3s \
-timeout=120s ./server/
# UDP Mpps matrix
go test -vet=off -tags='!vet' -run='^$' \
-bench=BenchmarkUDPMpps -benchtime=1x -timeout=180s ./server/
# TCP Mpps matrix
go test -vet=off -tags='!vet' -run='^$' \
-bench=BenchmarkTCPMpps -benchtime=1x -timeout=180s ./server/
# CPU profile for the worst-loaded UDP case
go test -vet=off -tags='!vet' -run='^$' \
-bench=BenchmarkUDPMpps/w8_s8 -benchtime=1x \
-cpuprofile=cpu_udp.prof ./server/
go tool pprof -top -cum cpu_udp.prof | head -40
go tool pprof -list 'OnTraffic' cpu_udp.prof
The benchmark uses a real HEP3-encapsulated SIP INVITE (738 B
total, including SDP) so the decoder and SIP parser execute the same
paths they would in production. Each sub-bench warms up for 1 s, then
samples HEPCount over a 5 s window for steady-state Mpps.