xtop Profiler Guide

The xtop profiler provides on-demand analysis of .xpti profile data collected by xtop run. It offers timeline visualization, map-level summaries, and task-level inspection.

Analysis commands (xtop task, xtop map, xtop convert, xtop open) do not require root privileges. Only data collection (xtop run) requires sudo.


Prerequisites

Before using the profiler, you need a .xpti profile file:

# Collect a profile
sudo xtop run -- ./your_application [args...]

This generates a .xpti file (SQLite format) containing host events, device events, cache metrics, and hardware topology. See the xtop User Guide for details on data collection.
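
Because the .xpti file is an ordinary SQLite database, it can also be inspected directly with the sqlite3 command-line shell, for example to see which tables were recorded (a side note, not an xtop feature; the table names depend on the xtop version):

# List the tables stored in a profile
sqlite3 profile.xpti ".tables"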

Optional dependencies:

  • matplotlib – required for xtop map --plot (PNG plot generation)
  • flask – required for xtop open (Perfetto UI server)
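
Both are standard Python packages; a typical install (assuming pip targets the same Python environment that runs xtop):

pip install matplotlib flask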

Quick Reference

Command                   Purpose                          Example
xtop convert <file>       Convert to Perfetto format       xtop convert profile.xpti
xtop open <file>          Open in Perfetto UI              xtop open profile.xpti
xtop map <file>           Full execution summary           xtop map profile.xpti
xtop map <file> --plot    Generate analysis plots (PNG)    xtop map profile.xpti --plot
xtop map <file> --json    Export analysis as JSON          xtop map profile.xpti --json
xtop task <file>          List all tasks with timing       xtop task profile.xpti
xtop task <file> <id>     Detailed analysis of one task    xtop task profile.xpti 17

Timeline Visualization

The fastest way to understand your application’s behavior is to visualize the full execution timeline in Perfetto UI.

Convert to Perfetto Format

xtop convert profile.xpti

Options:

Option                 Description                                         Default
-o, --output <path>    Output file path                                    <input>.perfetto-trace
--json                 Generate legacy JSON format instead of protobuf     off
--debug                Print debug information during conversion          off
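
For example, the two options above can be combined to write the legacy JSON trace to a chosen path (assuming the flags compose in the usual way):

xtop convert profile.xpti --json -o profile.trace.json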

Output:

[xtop] Converting profile.xpti -> profile.perfetto-trace
[xtop] Done: 2370 host events, 4352 device events

The generated .perfetto-trace file can be opened at https://ui.perfetto.dev.
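
If a session produced several profiles, a small shell loop converts them all in one pass (a sketch; the glob assumes the .xpti files sit in the current directory):

# Convert every .xpti profile in the current directory
for f in *.xpti; do
    xtop convert "$f"
done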

Reading the Timeline

The screenshot below shows a Perfetto timeline from a sort example (sort_with_ptr):

[Screenshot: Perfetto Timeline]

The timeline is organized into tracks:

  • Host (TaskDispatch / TaskComplete) – Host-side API calls. Each span represents the time from dispatch to completion of a task.
  • Device (Launch / Terminate) – Device-side kernel execution on each Sub/Cluster/MU/Thread.
  • Flow arrows (Host to Device) – Visual connection from the host dispatch to the corresponding device execution.

What to look for:

  • Gaps in host tracks indicate scheduling delays or synchronization waits.
  • Uneven device track lengths suggest workload imbalance across hardware threads.
  • Long flow arrows mean high overhead between host dispatch and device execution.

Open in Perfetto UI

xtop open profile.xpti

Starts a local web server and opens the trace directly in your browser, without manual upload.

Options:

Option               Description                Default
-p, --port <port>    HTTP server port           8888
--debug              Print debug information    off

Examples:
# Open a single profile
xtop open profile.xpti

# Open a directory of traces
xtop open ./xtop_profile/

# Use custom port
xtop open profile.xpti --port 9999

Supported file types: .xpti (auto-converts), .json, .perfetto-trace

Requires flask. Install with: pip install flask
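
Because the trace is served over plain HTTP, a profile that lives on a remote machine can be viewed locally by forwarding the port over SSH (a sketch; user, host, and port are placeholders):

# On your workstation: forward the remote port 8888 to localhost
ssh -N -L 8888:localhost:8888 user@remote-host

# On the remote machine: start the server on that port
xtop open profile.xpti --port 8888

# Then browse to http://localhost:8888 on your workstation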


Map Analysis

Analyze the overall Map execution, covering parallelism, load balance, cache efficiency, and bottleneck detection.

Basic Summary

xtop map profile.xpti

Options:

Option                 Description                                 Default
--plot                 Generate PNG plot with analysis charts      off
--json                 Output analysis results as JSON             off
-o, --output <path>    Output file path (for --plot or --json)     auto

Output:

============================================================
 MAP SUMMARY
============================================================

[Hardware]
  Subs: 2   Clusters/Sub: 2   MUs/Cluster: 4   Threads/MU: 8
  Total threads    : 128
  MU frequency     : 1100 MHz

[Duration - Host]
  Wall time        : 1,745.20 ms
  Task count       : 1,024
  avg              : 7.08 ms
  median           : 6.92 ms
  min              : 3.21 ms
  max              : 24.91 ms
  std_dev          : 2.34 ms

[Duration - Device]
  Task count       : 1,024
  avg              : 6.55 ms
  median           : 6.40 ms
  min              : 2.98 ms
  max              : 23.12 ms
  std_dev          : 2.10 ms

[Parallelism - Host]
  max_concurrent   : 128
  avg_concurrent   : 95.3
  ramp_up          : 45.23 ms
  max_hold         : 1,610.74 ms
  ramp_down        : 89.23 ms

[Parallelism - Device]
  max_concurrent   : 128
  avg_concurrent   : 94.5
  ramp_up          : 48.90 ms
  max_hold         : 1,605.64 ms
  ramp_down        : 90.66 ms

[Skew] (Sub/Cluster distribution)
  Sub0             : ######### 52%
  Sub1             : ######## 48%
  balance_score    : 0.96 (1.0 = perfectly even)

[Overhead] (scheduling + communication)
  avg_overhead     : 0.53 ms
  p95_overhead     : 1.82 ms
  max_overhead     : 6.41 ms

[Cache]
  L1 hit rate      : 92.1% (r:95.0% w:88.0%)
  L2 hit rate      : 74.3%
  L3 hit rate      : 88.9%

[Memory Bandwidth] (estimated from L3)
  read commands    : 50,000
  write commands   : 30,000
  cache line size  : 64 bytes
  time span        : 1,745.20 ms
  estimated BW     : 2.93 MB/s

[Stragglers]
  straggler_ratio  : 1.23x (1.0 = even, >2.0 = imbalanced)
  Top busy:
    sub=0 cluster=0 mu=0 thread=0   120 tasks, avg 7.25 ms
    sub=1 cluster=1 mu=2 thread=3   115 tasks, avg 7.18 ms
  Least busy:
    sub=0 cluster=1 mu=3 thread=7    95 tasks, avg 6.80 ms

[Idle]
  total_idle       : 12.34 ms
  total_active     : 1,732.86 ms
  idle_percentage  : 0.7%

[Utilization]
  active_threads   : 128 / 128 (100.0%)
  total_tasks      : 1,024

============================================================

Sections explained:

  • Hardware – Device topology and clock frequency from the profile metadata.
  • Duration – Task execution time statistics from both the host and the device perspective. A large gap between the host and device averages indicates high scheduling overhead.
  • Parallelism – How many tasks run concurrently over time. Ramp-up is the time to reach peak concurrency; ramp-down is the time from peak to completion. A long ramp-down suggests stragglers.
  • Skew – Task distribution across Subs and Clusters. A balance score close to 1.0 means work is evenly distributed.
  • Overhead – Difference between host-measured and device-measured task duration. Includes scheduling latency, driver communication, and queue wait time.
  • Cache – Aggregate cache hit rates. A low L2 hit rate may indicate poor data locality.
  • Memory Bandwidth – Estimated from L3 cache command counts and the cache line size. This is an approximation, not a precise measurement (see the sketch after this list).
  • Stragglers – Threads with a disproportionately high or low workload. A straggler ratio above 2.0 indicates significant imbalance.
  • Idle – Time gaps between consecutive tasks on each thread. A high idle percentage suggests underutilization.
  • Utilization – Fraction of hardware threads that received at least one task.
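
As a sanity check, the bandwidth figure can be reproduced by hand from the fields in the sample output, assuming the estimate is simply (read commands + write commands) x cache line size / time span (this formula is inferred from the listed fields, not a documented internal of xtop):

# (50,000 + 30,000) commands x 64-byte cache lines over 1.74520 s
echo "scale=2; (50000 + 30000) * 64 / 1.74520 / 1000000" | bc
# -> 2.93 (MB/s)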

Generate Plots

xtop map profile.xpti --plot -o analysis.png

Generates a multi-panel PNG image for visual analysis. The screenshot below shows the plot output from a sort example:

[Screenshot: Map Analysis Plot]

The plot contains 8 panels:

  • Host concurrency over time – Number of concurrent host tasks (Dispatch-Complete) at each point in time. The red dashed line marks peak concurrency.
  • Device concurrency over time – Number of concurrent device tasks (Launch-Terminate). Compare with host concurrency to spot scheduling overhead.
  • Task count per thread – Bar chart showing how many tasks each hardware thread executed. Even bars indicate good load balance.
  • Duration histogram – Distribution of individual task execution times. A long tail indicates straggler tasks.
  • Sub/Cluster task count heatmap – Task count distribution across the Sub x Cluster grid. Darker cells received more tasks.
  • MU/Thread task count heatmap – Task count distribution across the MU x Thread grid.
  • Sub/Cluster duration heatmap – Total execution time per Sub x Cluster combination. Uneven colors indicate workload imbalance.
  • MU/Thread duration heatmap – Total execution time per MU x Thread combination.

Requires matplotlib. Install with: pip install matplotlib

Export as JSON

xtop map profile.xpti --json -o analysis.json

Exports the full analysis result as a JSON file for programmatic consumption or integration with other tools.
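
The exact schema is easiest to discover from the file itself; for example, before wiring the export into CI or a dashboard, you can list its top-level keys (jq and the Python fallback below are generic tools, not part of xtop):

# List the top-level keys of the exported analysis
jq 'keys' analysis.json

# Or, without jq, pretty-print the whole file
python3 -m json.tool analysis.json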


Task Analysis

Inspect individual task execution timing, hardware placement, and per-task cache statistics.

List All Tasks

xtop task profile.xpti

Options:

Option                         Description                               Default
--sort {id,active,overhead}    Sort order                                id
--filter <expr>                Filter tasks (e.g., sub=0, cluster=1)     none
--limit N                      Limit number of rows                      all

Output:

========================================================
 TASK LIST
========================================================

 ID     Kernel           Active       Overhead     Sub    Cluster
------  -------          -------      -------      ---    -------
 0      sort_kernel      7.08 ms      1.52 ms      0      0
 1      sort_kernel      6.92 ms      0.12 ms      0      1
 2      sort_kernel      8.21 ms      3.41 ms      1      0
 ...
------
 Total: 1024 tasks

Columns explained:

  • ID – Task index (assigned at dispatch).
  • Kernel – Kernel function name.
  • Active – Device execution time (Launch to Terminate).
  • Overhead – Host duration minus device active time (scheduling + communication).
  • Sub – Sub where the task executed.
  • Cluster – Cluster where the task executed.

Examples:

# Sort by longest execution time
xtop task profile.xpti --sort active

# Show only tasks on Sub 0
xtop task profile.xpti --filter sub=0

# Top 10 slowest tasks
xtop task profile.xpti --sort active --limit 10
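
To get a detailed report for each of the slowest tasks in one pass, the list output can be fed back into xtop task (a sketch; the awk pattern assumes task IDs appear as the first column of the TASK LIST rows, as in the output above):

# Detailed reports for the 5 slowest tasks
for id in $(xtop task profile.xpti --sort active --limit 5 \
            | awk '$1 ~ /^[0-9]+$/ {print $1}'); do
    xtop task profile.xpti "$id"
done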

Task Detail

xtop task profile.xpti 17

Shows comprehensive timing breakdown, hardware placement, cache statistics, and diagnostic messages for a single task.

Output:

============================================================
 Task 17 (kernel: sort_kernel)
============================================================

[Timing]
  dispatch(host)  : 12384 us
  complete(host)  : 24118 us
  host_duration   : 11.73 ms

  launch(device)  : 500 ticks
  terminate       : 9942 ticks
  active_time     : 9.44 ms

  overhead        : 2.29 ms (19.5%)

[Placement]
  sub / cluster / mu / thread
  0 / 0 / 3 / 2

[Cache] (task time window)
  L1 hit   : 92.1% (512 samples)
  L2 hit   : 74.3% (256 samples)
  L3 hit   : 88.9% (64 samples)

[Diagnosis]
  ! High overhead (19.5% of host duration)
  ! L2 miss rate above threshold

Sections explained:

  • Timing – Host-side timestamps (dispatch/complete) and device-side timestamps (launch/terminate). Overhead = host_duration - active_time (checked by hand in the sketch after this list).
  • Placement – Which Sub/Cluster/MU/Thread executed this task.
  • Cache – L1/L2/L3 hit rates during the task’s time window. The sample count indicates data availability.
  • Diagnosis – Automated warnings: high overhead, low cache hit rates, or other anomalies.
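
For the sample task above, the overhead line can be verified from the timing fields with ordinary shell arithmetic (nothing xtop-specific):

# overhead = host_duration - active_time, as a share of host_duration
echo "scale=3; (11.73 - 9.44) / 11.73 * 100" | bc
# -> 19.500 (%)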

Typical Workflow

A common profiling workflow:

# 1. Collect profile data
sudo xtop run -- ./my_application

# 2. Visualize the full timeline
xtop open profile.xpti

# 3. Quick overview -- how well did the Map execute?
xtop map profile.xpti

# 4. Generate analysis plots
xtop map profile.xpti --plot

# 5. Identify slow tasks
xtop task profile.xpti --sort active --limit 10

# 6. Investigate a specific slow task
xtop task profile.xpti 17
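
For repeated runs (for example in CI), the same steps can be captured in a short script (a sketch; the application path is a placeholder, and the profile.xpti file name follows the examples above rather than a documented default):

#!/usr/bin/env bash
# Collect a profile and archive the analysis artifacts
set -euo pipefail

sudo xtop run -- ./my_application                  # writes profile.xpti
xtop map profile.xpti > map_summary.txt            # text summary
xtop map profile.xpti --json -o analysis.json      # machine-readable analysis
xtop map profile.xpti --plot -o analysis.png       # analysis plots
xtop convert profile.xpti -o trace.perfetto-trace  # timeline for Perfetto UI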

What to Look For

Symptom                 Where to check                   Possible cause
Long wall time          xtop map Duration section        Too few tasks, or stragglers
High overhead           xtop map Overhead section        Scheduling bottleneck
Uneven distribution     xtop map Skew / Stragglers       Data-dependent workload imbalance
Low cache hit rate      xtop map Cache section           Poor data locality
Long ramp-down          xtop map Parallelism section     Straggler tasks holding up completion
Individual slow task    xtop task <file> <id>            Cache misses, large data set