Internals

Architecture

One sentence: perf record → agent → TCP + zstd → server → parser → source mapper → SSE → browser.

The agent collects on the target. The server parses and broadcasts. The browser does no perf work.

On the target device

The agent is the only thing running on the target. Two flavors ship: a ~600-line Python 3.5+ script for hosts with Python, and a single static C binary with vendored zstd for bare-metal targets. Both speak the same wire protocol — the server cannot tell which agent connected.

Capability probing

Before collecting anything, the agent inspects the kernel:

Reads /proc/sys/kernel/perf_event_paranoid and warns if > 1.
Enumerates candidate events (cycles, instructions, cache-misses, cache-references, branch-misses, branch-instructions, page-faults, context-switches, cpu-migrations) and keeps only the ones perf record/perf stat actually accepts.
Tries call-graph modes in order (fp → dwarf → lbr) and picks the first that produces non-empty stacks.
Probes whether perf script -F is supported (perf ≥ ~3.12) and falls back to the default output format on older kernels.

This costs roughly 6–12 seconds on first connection and is a one-time hit.
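The probing steps above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the function names are hypothetical, and the probe assumes a non-zero exit from `perf stat` means the event was rejected. A real agent would probably also need to look for `<not supported>` in the output.

```python
import subprocess

def read_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Return the paranoid level as an int, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def perf_accepts(event):
    """Hypothetical probe: count the event over a no-op command."""
    try:
        result = subprocess.run(
            ["perf", "stat", "-e", event, "--", "true"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

def usable_events(candidates, probe=perf_accepts):
    """Keep only the events the probe accepts, preserving order."""
    return [e for e in candidates if probe(e)]

def first_working(modes, produces_stacks):
    """Return the first call-graph mode whose probe yields stacks, else None."""
    for mode in modes:
        if produces_stacks(mode):
            return mode
    return None
```

Passing the probe in as a parameter keeps the selection logic testable on a machine without perf installed.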

Collection rounds

Each round runs perf record and perf stat in parallel for N seconds (default 8), then perf script flattens the trace. The result — perf script text optionally followed by a ### PERF_STAT ### section — is compressed with zstd -1 (system or vendored) and pushed over TCP with a 5-byte header. Typical compression: 20–40×.
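As a rough sketch of one round, assuming the agent attaches to a single PID and uses `sleep` to bound the window: the helper names, the `perf.data` file name, and the exact argument order are illustrative assumptions rather than the agent's real command lines.

```python
import subprocess

def round_commands(pid, seconds, events, callgraph="fp", data_file="perf.data"):
    """Build argv lists for one round of perf record and perf stat."""
    ev = ",".join(events)
    record = ["perf", "record", "-e", ev, "--call-graph", callgraph,
              "-p", str(pid), "-o", data_file, "--", "sleep", str(seconds)]
    stat = ["perf", "stat", "-e", ev, "-p", str(pid), "--", "sleep", str(seconds)]
    return record, stat

def run_round(pid, seconds=8, events=("cycles",), data_file="perf.data"):
    record, stat = round_commands(pid, seconds, events, data_file=data_file)
    # Start both collectors before waiting on either so their windows overlap.
    rec_proc = subprocess.Popen(record, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    stat_proc = subprocess.Popen(stat, stdout=subprocess.DEVNULL,
                                 stderr=subprocess.PIPE)
    _, stat_text = stat_proc.communicate()  # perf stat reports on stderr by default
    rec_proc.wait()
    # Flatten the binary trace to text once recording has finished.
    script = subprocess.run(["perf", "script", "-i", data_file],
                            stdout=subprocess.PIPE).stdout
    return script, stat_text
```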

Health metrics

Independent of the perf pipeline, the agent collects device health every 2 seconds — CPU per-core, memory, load, temperature, process stats, network bytes — and streams them as JSON frames with flag 4. The browser renders sparklines, gauges, and per-core CPU bars without affecting the perf collection.
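A health sample might be assembled along these lines. The sketch assumes the metrics come from `/proc/meminfo` and `/proc/loadavg`; the JSON field names are up to the caller and the ones used in any example are hypothetical. Only the flag value 4 and the 5-byte header come from this document.

```python
import json
import struct

FLAG_HEALTH = 4  # health metrics flag, per the flag table

def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    out = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            out[name.strip()] = int(parts[0])
    return out

def parse_loadavg(text):
    """The first three fields of /proc/loadavg are the 1, 5 and 15 minute loads."""
    one, five, fifteen = text.split()[:3]
    return [float(one), float(five), float(fifteen)]

def health_frame(metrics):
    """Wrap a metrics dict as a flag-4 frame: 5-byte header plus JSON payload."""
    payload = json.dumps(metrics).encode("utf-8")
    return struct.pack("!IB", len(payload), FLAG_HEALTH) + payload
```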

On the local machine

One Python process owns everything on the local side. A ThreadingHTTPServer serves the UI and the JSON API; a separate thread owns the single TCP listener that accepts agent connections.
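The process layout could look something like the sketch below: one `ThreadingHTTPServer` plus one thread that owns the agent listener. The handler body, the `start` function, and the callback signature are placeholders invented for illustration; the real server's routes and startup code are not shown in this document.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class UIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"ok": true}'  # placeholder response, not a real API route
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

def agent_listener(sock, on_connect):
    """Accept agent connections on the single TCP listener."""
    while True:
        try:
            conn, addr = sock.accept()
        except OSError:
            return  # listener was closed
        on_connect(conn, addr)

def start(http_port=0, agent_port=0, on_connect=lambda conn, addr: conn.close()):
    httpd = ThreadingHTTPServer(("127.0.0.1", http_port), UIHandler)
    agent_sock = socket.create_server(("127.0.0.1", agent_port))
    threading.Thread(target=agent_listener, args=(agent_sock, on_connect),
                     daemon=True).start()
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd, agent_sock
```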

Parser

parser.py turns perf script output into:

Per-event sample lists — one bucket per event type.
Function summaries with self %, total %, sample counts, module column.
Flame graph trees — collapsed call stacks aggregated into a value-weighted tree.
Thread index with per-tid sample counts and top functions, extracted from the pid/tid and comm fields.
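To make the flame graph and summary outputs concrete, here is a small sketch of both aggregations. It assumes each sample has already been reduced to a list of function names ordered root to leaf (perf script prints frames leaf first, so a real parser would reverse them). The function names and dictionary layout are illustrative, not parser.py's actual structures.

```python
from collections import Counter

def build_flame_tree(stacks):
    """Aggregate root-to-leaf stacks into a value-weighted tree."""
    root = {"name": "root", "value": 0, "children": {}}
    for stack in stacks:
        node = root
        node["value"] += 1
        for frame in stack:
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}})
            child["value"] += 1
            node = child
    return root

def function_summary(stacks):
    """Self % counts leaf frames; total % counts a function once per sample."""
    self_counts, total_counts = Counter(), Counter()
    n = 0
    for stack in stacks:
        if not stack:
            continue
        n += 1
        self_counts[stack[-1]] += 1
        total_counts.update(set(stack))  # set() avoids double-counting recursion
    return {f: {"self_pct": 100.0 * self_counts[f] / n,
                "total_pct": 100.0 * total_counts[f] / n}
            for f in total_counts}
```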

The parser is defensive on purpose. The perf script format drifts across kernel versions; the optional [cpu], pid/tid, and flags fields appear in different combinations across 2.6, 3.x, 4.x, 5.x, and 6.x kernels. The parser handles all of them and silently tolerates lines it doesn't recognize.
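One way to tolerate the optional fields is a single regular expression in which each of them may be absent. The pattern below is an illustrative guess at such a matcher, not the one parser.py uses; it also assumes that a lone number before the timestamp is the tid, and it does not handle the flags field.

```python
import re

HEADER_RE = re.compile(r"""
    ^\s*(?P<comm>\S.*?)\s+               # command name, may contain spaces
    (?P<first>\d+)(?:/(?P<second>\d+))?  # tid, or pid/tid
    \s+(?:\[(?P<cpu>\d+)\]\s+)?          # optional [cpu]
    (?P<time>\d+\.\d+):\s+               # timestamp
    (?:(?P<period>\d+)\s+)?              # optional sample period
    (?P<event>\S+?):(?:\s|$)             # event name, may itself contain ':'
    """, re.VERBOSE)

def parse_header(line):
    """Return a dict for a sample header line, or None for anything else."""
    m = HEADER_RE.match(line)
    if m is None:
        return None  # unrecognized lines are skipped, never fatal
    if m.group("second") is not None:
        pid, tid = int(m.group("first")), int(m.group("second"))
    else:
        pid, tid = None, int(m.group("first"))  # assumption: lone number is the tid
    cpu = m.group("cpu")
    return {"comm": m.group("comm"), "pid": pid, "tid": tid,
            "cpu": int(cpu) if cpu is not None else None,
            "time": float(m.group("time")), "event": m.group("event")}
```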

Source mapper

source_mapper.py pipelines sample addresses through addr2line -f (or -fi when --inline is on) in batches of 500. A single mapper is created at server startup and shared across requests — no per-request forking.
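The batching and output parsing can be sketched as below. For simplicity this version starts one `addr2line` process per batch and passes addresses as arguments, which is not the long-lived shared mapper described above. The function names are hypothetical, and the parser only covers `-f` output (two lines per address), not the variable-length output of `-fi`.

```python
import subprocess

def batches(items, size=500):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def parse_addr2line(output, addrs):
    """Pair each address with (function, file, line) from `addr2line -f` output."""
    lines = output.splitlines()
    result = {}
    for i, addr in enumerate(addrs):
        func = lines[2 * i]
        path, _, rest = lines[2 * i + 1].rpartition(":")
        fields = rest.split()  # drops a trailing "(discriminator N)" if present
        lineno = fields[0] if fields else ""
        result[addr] = (func, path, int(lineno) if lineno.isdigit() else None)
    return result

def resolve(binary, addrs, tool="addr2line"):
    """Resolve addresses in batches; unknown ones come back as '??'."""
    out = {}
    for chunk in batches(addrs):
        text = subprocess.run([tool, "-f", "-e", binary] + list(chunk),
                              stdout=subprocess.PIPE,
                              universal_newlines=True).stdout
        out.update(parse_addr2line(text, chunk))
    return out
```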

Lookups feed a heat map: each source line gets a sample count, and the UI colors it red → amber → green by share of the file's total. A single --toolchain-prefix flag derives the right addr2line and readelf for a cross-compiled target in one step. --sysroot resolves shared-library module paths and source files under a target tree, similar to perf --symfs.
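A sketch of the heat computation and the tool-name derivation follows. The color thresholds are made-up values for illustration, and the prefix handling assumes the usual GNU convention in which the prefix includes its trailing hyphen (for example `aarch64-linux-gnu-`); the tool's own cutoffs and prefix rules may differ.

```python
from collections import Counter

def line_heat(samples):
    """samples: iterable of (file, line). Returns each line's share of its file."""
    per_file = {}
    for path, line in samples:
        per_file.setdefault(path, Counter())[line] += 1
    return {path: {ln: n / sum(counts.values()) for ln, n in counts.items()}
            for path, counts in per_file.items()}

def heat_color(share, hot=0.10, warm=0.02):
    """Map a share to a color; the thresholds are illustrative assumptions."""
    if share >= hot:
        return "red"
    if share >= warm:
        return "amber"
    return "green"

def cross_tools(prefix=""):
    """Derive binutils names from a toolchain prefix."""
    return {"addr2line": prefix + "addr2line", "readelf": prefix + "readelf"}
```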

Sessions

Every agent connection becomes a session. Raw chunks are written to disk under sessions/<timestamp>_<agent>/ as the agent streams them; metadata (events, sample totals, perf stat values, platform) is written when the connection closes.

Replay is lazy: the UI hits /api/sessions/<id> only when you click a session, and the server re-parses the raw chunks on demand. A standalone perf.data file can also be imported at startup with --import — the server runs perf script against it once and exposes the result as a session.
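The session lifecycle might be organized roughly like this. Only the `sessions/<timestamp>_<agent>/` directory pattern comes from the text; the timestamp format and the `chunk_NNNNN.bin` and `meta.json` file names are hypothetical.

```python
import json
import os
import time

def open_session(root, agent, now=None):
    """Create <root>/<timestamp>_<agent>/ and return its path."""
    stamp = time.strftime("%Y%m%d_%H%M%S", time.localtime(now))
    path = os.path.join(root, "%s_%s" % (stamp, agent))
    os.makedirs(path, exist_ok=True)
    return path

def append_chunk(session_dir, index, payload):
    """Write one raw chunk as it arrives from the agent."""
    with open(os.path.join(session_dir, "chunk_%05d.bin" % index), "wb") as f:
        f.write(payload)

def close_session(session_dir, metadata):
    """Write metadata once, when the connection closes."""
    with open(os.path.join(session_dir, "meta.json"), "w") as f:
        json.dump(metadata, f)

def replay(session_dir, parse):
    """Lazily re-parse stored chunks in order; nothing runs until iterated."""
    names = sorted(n for n in os.listdir(session_dir) if n.startswith("chunk_"))
    for name in names:
        with open(os.path.join(session_dir, name), "rb") as f:
            yield parse(f.read())
```

Because `replay` is a generator, a session that is never clicked costs nothing beyond its files on disk.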

Wire protocol

Every message is a 5-byte header followed by a payload of exactly LEN bytes.

import struct
# 5-byte header: LEN (uint32, big-endian) followed by FLAG (uint8)
header = struct.pack('!IB', len(payload), compression_flag)
sock.sendall(header + payload)

Field	Size	Meaning
`LEN`	4 bytes (uint32, big-endian)	Payload length in bytes
`FLAG`	1 byte (uint8)	See flag table below
`PAYLOAD`	`LEN` bytes	Frame body (perf data, JSON, etc.)
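On the receiving side, the framing implies reading exactly 5 bytes and then exactly LEN more, since a single `recv` may return less than requested. A minimal sketch, with hypothetical helper names:

```python
import struct

HEADER = struct.Struct("!IB")  # LEN (uint32, big-endian) + FLAG (uint8) = 5 bytes

def encode_frame(flag, payload):
    """Prefix a payload with its 5-byte header."""
    return HEADER.pack(len(payload), flag) + payload

def recv_exact(sock, n):
    """Read exactly n bytes, looping over short reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)

def read_frame(sock):
    """Return (flag, payload) for the next frame on the socket."""
    length, flag = HEADER.unpack(recv_exact(sock, HEADER.size))
    return flag, recv_exact(sock, length)
```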

Flag values

Flag	Direction	Payload
`0`	agent → server	Raw UTF-8 perf script output
`1`	agent → server	Zstd-compressed perf script output
`2`	server → agent	Command request (JSON): `start`, `stop`, `pause`, `resume`, `configure`, `list_processes`, `reprobe`, …
`3`	agent → server	Command response (JSON)
`4`	agent → server	Health metrics (JSON, every 2 s)

Payloads with flag 1 are decompressed with zstd -d -c. Perf data payloads carry plain perf script text, optionally followed by a ### PERF_STAT ### section the parser splits out.
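Handling a perf data frame could then look like the sketch below. The decompressor shells out to the `zstd` CLI as described above; the function names are hypothetical, and replacing undecodable bytes is a choice made here for robustness, not something the text specifies.

```python
import subprocess

STAT_MARKER = "### PERF_STAT ###"

def zstd_decompress(data):
    """Pipe bytes through `zstd -d -c` and return the decompressed output."""
    return subprocess.run(["zstd", "-d", "-c"], input=data,
                          stdout=subprocess.PIPE, check=True).stdout

def split_perf_payload(text):
    """Split perf script text from the optional perf stat section."""
    script, sep, stat = text.partition(STAT_MARKER)
    return script, (stat.strip() if sep else None)

def decode_perf_frame(flag, payload, decompress=zstd_decompress):
    """Decode a flag-0 or flag-1 frame into (script_text, stat_text_or_None)."""
    if flag == 1:
        payload = decompress(payload)
    elif flag != 0:
        raise ValueError("not a perf data frame: flag %d" % flag)
    return split_perf_payload(payload.decode("utf-8", errors="replace"))
```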

Two connection modes

The agent can be the connector or the listener; the wire protocol is the same either way. A third, headless option skips the network and writes to a file.

--server <host> — daemon mode: agent dials the server. Reconnects with exponential backoff if dropped. Useful on devices that sit behind NAT or restart often.
--listen — daemon mode: agent binds a port and waits. The server reaches out via the UI's Live Debug wizard. Useful when you want to discover targets from the UI.
--output FILE — headless: collect once, write to file (- for stdout). Requires --pid.
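The reconnect behaviour of --server can be pictured with a small sketch. The base delay, growth factor, cap, and connection timeout are assumed values chosen for illustration; the agent's real schedule is not given here.

```python
import socket
import time

def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield an endless sequence of delays that grows until it hits the cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

def connect_with_backoff(host, port, attempts=None, sleep=time.sleep):
    """Dial until a connection succeeds, or until `attempts` failures if set."""
    tried = 0
    for delay in backoff_delays():
        try:
            return socket.create_connection((host, port), timeout=10)
        except OSError:
            tried += 1
            if attempts is not None and tried >= attempts:
                raise
            sleep(delay)
```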

Broadcast to the browser

The server pushes parsed state to the UI through a Server-Sent Events stream at /api/stream. Four event types: status, event_types, per_event, perf_stat. The browser holds the connection open and rerenders whenever a new round lands. There is no polling.
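On the wire, each Server-Sent Events message is an `event:` line and a `data:` line terminated by a blank line. A minimal formatter, assuming JSON payloads and using hypothetical helper names:

```python
import json

def sse_message(event, data):
    """Format one SSE message; json.dumps emits no newlines by default."""
    body = json.dumps(data)
    return ("event: %s\ndata: %s\n\n" % (event, body)).encode("utf-8")

def write_sse(wfile, event, data):
    """Write and flush so the browser sees the round immediately."""
    wfile.write(sse_message(event, data))
    wfile.flush()
```

In a handler, this would follow a response sent with `Content-Type: text/event-stream` and left open for the life of the connection.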

Per-thread analysis, source views, and exports are pull-based: the UI hits /api/thread-view, /api/source, or the /api/export/… endpoints on demand — see the Reference for the full list.

Known limits

One agent connection at a time. A new agent replaces the current one.
perf_event_paranoid > 1 may restrict which events the kernel allows. The agent warns at startup.
Some container environments strip the perf capability set, so perf record -p <pid> returns empty. A system-wide perf record -a usually works as a fallback.
Source mapping requires a binary compiled with -g and not stripped.
The source view renders up to ~2000 lines (or hottest line ± 100, whichever is larger).