Gossipper profiler analysis (pprof)¶
Date: 2026-03-16
Profiles: CPU, Heap (alloc_space)
Workload: UAC 500 calls, ~44 cps, UAS at 127.0.0.1:35060
Update (2026-03-16): chan *sip.Message + use-after-free fix¶
Transport channel refactoring¶
Completed migration from chan sip.Message to chan *sip.Message with pool usage:
mailboxRegistry.mailboxes:map[string]chan *sip.Messagedispatch/dispatchMessages/dispatchMessagePtr— operate on*sip.MessageserverUDPSession.inbox,udpServerInbound.msg—chan *sip.MessagereceiveinexecuteCall— returns(*sip.Message, error)runClientSharedTCP/runClientSharedTLS— "inbox-first" receive
Use-after-free fix (waitForMatch)¶
Bug: with optional recv (<recv response="180" optional="true"/>) on mismatch:
1. stash(msg, fromPending=false) added msg to pending
2. Then sip.PutMessage(msg) returned it to the pool
3. pending held a "dead" pointer → next recv read garbage
Fix (engine.go, waitForMatch):
if cmd.Optional {
// Free only if stash did NOT take ownership
if stash == nil || fromPending {
sip.PutMessage(msg)
}
return nil, errOptionalRecvMismatch
}
This fixed three failing tests: TestEngineRunsTCPScenario, TestEngineRunsTLSScenario, TestEngineOptionalRecvShortCircuitsOnUnexpected.
Benchmarks (Intel Core Ultra 9 185H, 22 threads)¶
File: internal/sip/message_test.go
Sequential run¶
| Benchmark | ns/op | B/op | allocs/op |
|---|---|---|---|
OldDispatchPath (Parse + Copy) |
1328 | 1696 | 21 |
NewUDPDispatchPath (ParseInto + pool) |
1045 | 1585 | 14 |
NewDispatchPath (Parse + CopyInto, TCP) |
1355 | 1697 | 21 |
CopyInto standalone |
301 | 112 | 7 |
UDP path: −21% latency, −7 allocs/op, −111 B/op (intermediate Copy() removed)
Parallel run (GOMAXPROCS=22)¶
| Benchmark | ns/op | B/op | allocs/op |
|---|---|---|---|
OldDispatchParallel |
580 | 1696 | 21 |
NewUDPDispatchParallel |
530 | 1585 | 14 |
−8.6% latency under load (bottleneck is string operations in parser, not struct allocation)
Update (2026-03-16): sip.Parse unification + RenderMessage fast path + pprof¶
sip.Parse — unification with byte scanner¶
Change: Parse now uses the same parseBytes helper as ParseInto. Shared code, shared performance.
Result: Parse sped up from 1004 ns → 864 ns (−14%), 1584 B → 1176 B (−26%).
RenderMessage / RenderMessageStrict — fast path for bodiless messages¶
Problem: RenderMessage always ran two renders — first to compute BodyLength, second final — even when the template lacked [len] (typical INVITE/ACK/BYE without body).
Fix: added hasLenToken(raw string) bool (O(n) byte scan), which skips the second render when [len] is absent.
Benchmarks:
| Scenario | ns/op before | ns/op after | B/op before | B/op after |
|---|---|---|---|---|
RenderMessageStrict without [len] (fast path) |
~4162 | 1761 (−58%) | ~3398 | 1298 (−62%) |
RenderMessageStrict with [len] (unchanged) |
4162 | 4162 | 3398 | 3398 |
Most SIP messages (INVITE without SDP, ACK, BYE, registrations) have no [len] → −58% latency, −62% B/op for typical traffic.
New pprof profile (2026-03-16 — after all optimizations)¶
Benchmark: BenchmarkEngineUACUDPThroughput — full UAC cycle (INVITE→200→ACK→pause→BYE→200) over UDP, 1 call per iteration, mock UAS in same process.
Result: ~51 ms/op (including network round-trips), 34 KB/op, 338 allocs/op.
CPU (top flat, after optimizations)¶
| Function | flat% | cum% | Description |
|---|---|---|---|
syscall.Syscall6 |
17.95% | — | UDP sockets (I/O bound — expected) |
runtime.stealWork |
5.13% | 15.38% | Goroutine scheduler |
net.UDPConn.readFrom |
2.56% | — | UDP read |
| other | <3% | — | Various runtime costs |
CPU conclusion: application is fully I/O-bound. SIP parsing and rendering do not appear in CPU top. All optimizations met their goals.
Heap (alloc_space, top — after optimizations)¶
| Function | flat | cum | Change vs before |
|---|---|---|---|
runtime.mallocgc |
12.82% | — | — (baseline) |
sip.parseBytes |
11.18% | — | ↓ was 19.7% (sip.Parse) = −43% |
transport.NewSharedUDP |
9.63% | — | test overhead (new conn each iteration) |
executeCall |
6.39% | 24.02% | — |
pprof.StartCPUProfile |
5.50% | — | profiling overhead |
wrapSIPSend |
4.79% | — | SIP payload → trace/HEP |
MakeNoZero |
1.60% | — | ↓ was 17.8% = −91% |
template.* |
not in top | — | ↓ was 28-31% = fully eliminated |
regexp.* |
not in top | — | ↓ was 6.3% = fully eliminated |
Summary of key metrics¶
| Metric | Before | After | Δ |
|---|---|---|---|
sip.Parse alloc |
19.7% | 11.18% | −43% |
MakeNoZero (strings.Builder) |
17.8% | 1.60% | −91% |
regexp in render |
6.3% | 0% | −100% |
template.RenderStrict (cum) |
31.6% | ~0% | −100% |
| CPU hot-path | executeCall/render | syscall I/O | Application I/O-bound |
Conclusions¶
- TCP dispatch (
dispatchMessages): marginal gain —CopyIntois equivalent toCopyin allocs since the packet template is already parsed by SharedTCP. Benefit is semantic clarity (explicit object lifetime). - UDP dispatch (
dispatch): −33% allocs (21→14) — removedMessage.Copy(), parsing goes directly into pool object. - Pool (
sync.Pool) in isolation: no visible gain in micro-benchmark sinceParseIntostill doesmsg.Headers = make(...). Real benefit — reduced GC pressure at high CPS. - Next step: removing
msg.Headers = make(...)fromParseInto(restoring delete-loop) can save 1 more alloc; safe only with guaranteed no concurrent reads — requires verification.
Update (2026-03-16): ParseInto byte-scanner + trace race fix¶
ParseInto: ReplaceAll+Split replaced with direct byte scanner¶
Change (internal/sip/message.go):
- strings.ReplaceAll(string(raw), "\r\n", "\n") + strings.Split(...) replaced with direct byte scanner over raw (no intermediate strings).
- msg.Headers = make(...) replaced with delete-loop to reuse existing map.
Benchmark results (Intel Core Ultra 9 185H):
| Benchmark | ns/op before | ns/op after | B/op before | B/op after | allocs before | allocs after |
|---|---|---|---|---|---|---|
ParseIntoPool |
1051 | 777 (−26%) | 1585 | 776 (−51%) | 14 | 17* |
NewUDPDispatchPath |
1013 | 761 (−25%) | 1585 | 776 (−51%) | 14 | 17* |
NewUDPDispatchParallel |
530 | 289 (−45%) | 1585 | 776 (−51%) | 14 | 17* |
*Alloc count slightly increased due to per-string conversion, but total memory halved → lower GC pressure.
File: internal/sip/message.go
trace.go: data race fix in loop goroutines¶
Problem: goroutines startStatsLoop, startCountsLoop, startScreenLoop accessed fields t.statsStop, t.countsStop, t.screenStop via select <-t.fieldName. Meanwhile Close() nulled these fields (t.statsStop = nil) — data race detected by race detector.
Fix (internal/engine/trace.go): channels captured in local variables before starting goroutine:
stop := t.statsStop
done := t.statsDone
go func() {
defer close(done)
...
case <-stop: // no longer reads t.statsStop
...
}()
Applied to all three loop functions. All tests with -race now pass cleanly.
Current bottlenecks (2026-03-16, after ParseInto)¶
CPU (top flat)¶
| Function | flat% | cum% | Description |
|---|---|---|---|
| indexbytebody | 11.5 | - | String search (IndexByte) |
| Syscall6 / futex | 11.5 + 7.7 | - | Syscall, locks |
| executeCall | 3.9 | 38.5 | Scenario execution |
| renderSIPMessage | - | 15.4 | SIP template render |
| template.RenderStrict / expandTokensStrict | - | 15.4 | Variable substitution |
| waitForMatch | - | 11.5 | SIP response wait |
| mailboxRegistry.dispatch | - | 11.5 | Call-ID routing |
| MakeNoZero | - | 11.5 | String allocations |
Allocations (alloc_space, top)¶
| Function | flat | cum | Description |
|---|---|---|---|
| MakeNoZero / strings.Builder.grow | 29.8% | - | String buffers in render |
| mallocgc | 19.9 | - | General allocation |
| sip.ParseInto | 8.3 | 14.9 | SIP parsing (pool+Copy) |
| sip.Message.Copy | 3.3 | - | Copy on channel send |
| executeCall | 5.0 | 43.1 | Scenario |
| renderSIPMessage | - | 33.1 | Template render |
| template.RenderStrict / expandTokensStrict | - | 28.1 | [$var] variables |
| dispatch (mailboxRegistry) | - | 16.6 | Routing |
| ensureMessageTerminator | 5.0 | - | SIP CRLF |
| strings.genSplit | 5.0 | - | split on parse |
Recommendations¶
- template.RenderStrict / expandTokensStrict (~28% alloc) — replace regexp with
strings.Replaceor map lookup for[$var], reuse buffers. - MakeNoZero / strings.Builder (30% alloc) — pool of string buffers for rendering.
- sip.Message.Copy — still required when passing between goroutines; pool already helps with ParseInto.
- ensureMessageTerminator / genSplit — simplify without extra split/allocations.
Update (2026-03-16): sip.Parse → ParseInto (pool)¶
Introduced GetMessage / ParseInto / PutMessage (SIP struct pool) to reduce allocations:
- engine_test.go — all mock servers (~25 sites)
- main_test.go — mock servers (6 sites);
readSIPMessagekept onsip.Parse(returns message) - engine.go — local INVITE check (~line 1335)
Limitation: in hot-path (engine.go:417, 758, 876, 1717) messages are sent over channels to other goroutines — pool cannot be used without:
- switching to chan *sip.Message and PutMessage in receivers, or
- introducing sip.Message.Copy() and copying before send.
Update (2026-03-16): after reuse buffer¶
After optimizing varStore.Snapshot (scratch buffer) profile was re-taken:
- Workload: UAC, 3000 calls, rate ~91 cps
- Result: varStore.Snapshot and makemap no longer in top of allocs/CPU
Current bottlenecks (alloc_space)¶
| Function | flat | cum | Description |
|---|---|---|---|
sip.Parse |
19.7% | 32.0% | SIP message parsing (partially replaced with ParseInto) |
MakeNoZero |
17.8% | 17.8% | String allocations |
executeCall |
7.5% | 62.0% | Scenario execution |
regexp.ReplaceAllStringFunc |
6.3% | 19.7% | Variable substitution in template |
regexp.replaceAll |
6.3% | 13.4% | Regex in render |
strings.genSplit |
4.7% | 4.7% | split in string processing |
ensureMessageTerminator |
3.5% | 3.5% | SIP CRLF |
template.Context.RenderStrict |
2.0% | 31.6% | SIP template render |
Further optimization recommendations¶
- sip.Parse (~40 MB cum) — lightweight zero-copy parser or object pool for SIP structs
- regexp (ReplaceAllStringFunc in template) — replace with simple strings.Replace or map lookup for
[$var] - template.RenderStrict — reuse string buffers, avoid extra split
- strings.genSplit — consider strings.Cut or manual parsing without allocations
CPU profile (historical)¶
General stats¶
- Duration: 35.73 s
- Samples captured: 20 ms (0.056%) — profile taken under light load
- Mode: UAC (runClientShared), baseline scenario
Call chains (traces)¶
10ms runtime.netpoll
→ runtime.findRunnable → runtime.schedule → runtime.park_m → runtime.mcall
10ms runtime.newobject
→ runtime.makemap
→ github.com/sipcapture/gossipper/internal/engine.(*varStore).Snapshot
→ github.com/sipcapture/gossipper/internal/engine.(*Engine).executeCall
→ github.com/sipcapture/gossipper/internal/engine.(*Engine).runClientShared.func1
CPU conclusions¶
| Source | Share | Interpretation |
|---|---|---|
runtime.netpoll |
50% | Network wait (recv/send) — normal for I/O-bound |
runtime.newobject + makemap |
50% | Allocations in varStore.Snapshot during executeCall |
Main application hotspot: varStore.Snapshot — on each scenario step a copy of the variable map (makemap) was created. At high CPS this adds noticeable load on GC and CPU.
Heap profile¶
Top allocations (inuse_space)¶
| Function | Size | % |
|---|---|---|
runtime.mallocgc |
3084 kB | 57.55% |
runtime/pprof.StartCPUProfile |
1763 kB | 32.90% |
os.runtime_args |
512 kB | 9.56% |
Note: StartCPUProfile uses ~1.7 MB — profiling buffer. In production without -cpuprofile this is not allocated.
Recommendations¶
- ~~varStore.Snapshot:~~ Done: scratch buffer in
vars.go— map reuse instead ofmake(map)per Snapshot. - Network:
netpolldominates — bottleneck is network I/O, not CPU. Switching to gnet unlikely to help much without clear connection limits. - Next candidates: sip.Parse, regexp in template, template.RenderStrict — see table above.
Reproduction commands¶
# CPU + heap to files
go run ./cmd/gossip -sn uac -rsa 127.0.0.1:5060 -m 2000 -r 50 -cpuprofile cpu.prof -memprofile mem.prof
# Interactive view
go tool pprof -http=:8080 cpu.prof
go tool pprof -http=:8080 mem.prof