benchmark — Benchmark Suites
Run a manifest-driven suite of curated bag checks.
This is useful when you want public rosbag datasets and internal gold bags to act like a reproducible regression suite, not ad hoc manual checks.
Usage
# Run all cases in a suite
bagx benchmark benchmarks/open_data_suite.json
bagx benchmark benchmarks/non_slam_suite.json
bagx benchmark warehouse_benchmark.json --rules warehouse_bot
# Export a machine-readable report
bagx benchmark benchmarks/open_data_suite.json --json benchmark-report.json
# Run only selected cases
bagx benchmark benchmarks/open_data_suite.json --case nvidia-r2b-robotarm
# Fail if any referenced bag is missing
bagx benchmark benchmarks/open_data_suite.json --fail-on-missing
Manifest format
The manifest is JSON and supports environment-variable expansion in bag_path.
It also supports optional rules_path values to apply custom message rules. rules_path can be either a JSON file path or a plugin name.
{
  "suite_name": "open-data-dogfood",
  "rules_path": "warehouse_bot",
  "cases": [
    {
      "name": "nvidia-r2b-galileo2",
      "bag_path": "${BAGX_REALBAGS}/r2b_galileo2",
      "report_type": "eval",
      "expect": {
        "min_overall_score": 90,
        "required_domains": ["Perception"],
        "required_recommendations": [
          "Perception topics detected",
          "Camera calibration topics are recorded"
        ],
        "forbidden_recommendations": ["No GNSS data", "No IMU data"]
      }
    }
  ]
}
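How bagx performs the expansion internally is not shown on this page. As a rough illustration, assuming the `${VAR}` syntax behaves like shell-style expansion, Python's `os.path.expandvars` resolves the `bag_path` from the example above as follows. The `/data/realbags` directory is a made-up placeholder.

```python
import os

# Hypothetical location of the downloaded bags; point this at your own directory.
os.environ["BAGX_REALBAGS"] = "/data/realbags"


def resolve_bag_path(raw: str) -> str:
    """Expand ${VAR} references in a manifest bag_path.

    Assumption: this mirrors, but is not taken from, bagx's own expansion logic.
    """
    return os.path.expandvars(raw)


resolved = resolve_bag_path("${BAGX_REALBAGS}/r2b_galileo2")
print(resolved)  # /data/realbags/r2b_galileo2
```

`os.path.expandvars` leaves unknown variables untouched, so a leftover `$` in the resolved path is a quick hint that an `export` is missing.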
The repository ships ready-made suites:
benchmarks/open_data_suite.json: public Autoware + NVIDIA bags
benchmarks/non_slam_suite.json: perception/manipulation plus optional local Nav2 / MoveIt dogfood bags
benchmarks/scoreboard.json: 30 public datasets for the Scoreboard page
Regenerate the docs table after scoring local bags:
./scripts/fetch_scoreboard_bags.sh # Autoware S3 + AutoCore (optional)
export BAGX_SCOREBOARD_BAGS=/path/to/bags
export BAGX_DB3_CACHE=/path/with/free/space
python scripts/generate_scoreboard.py --refresh --write-manifest
python scripts/generate_scoreboard.py
For proprietary stacks, pair a benchmark manifest with a custom rules plugin or file and keep your expectations in required_domains, required_recommendations, and min_topic_rates.
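When a proprietary suite covers many bags, generating the manifest can be easier than editing it by hand. The sketch below uses only keys that appear in the example manifest above; the suite name, bag names, and score threshold are hypothetical, and `build_manifest` is an illustrative helper, not part of bagx.

```python
import json


def build_manifest(suite_name, rules_path, bag_names, min_score):
    """Return a manifest dict with one eval case per bag name."""
    return {
        "suite_name": suite_name,
        "rules_path": rules_path,
        "cases": [
            {
                "name": name,
                # Left unexpanded on purpose: the manifest supports
                # environment-variable expansion in bag_path.
                "bag_path": "${BAGX_REALBAGS}/" + name,
                "report_type": "eval",
                "expect": {"min_overall_score": min_score},
            }
            for name in bag_names
        ],
    }


manifest = build_manifest(
    "warehouse-regression",            # hypothetical suite name
    "warehouse_bot",                   # rules plugin name used earlier on this page
    ["aisle_run_01", "dock_run_02"],   # hypothetical bag directories
    80,                                # hypothetical score floor
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON as `warehouse_benchmark.json` gives a file that can be passed to `bagx benchmark` like the other suites. Further expectations can be added to each case's `expect` object.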
Supported expectations
min_overall_score
max_overall_score
min_domain_score
required_domains
required_recommendations
forbidden_recommendations
expected_findings
min_topic_rates
required_topics
expected_findings
expected_findings checks structured readiness findings by stable id instead of matching
human-facing recommendation text. See eval — Structured findings
for the available ids and severity policy.
{
  "expect": {
    "expected_findings": [
      {
        "id": "nav2.detected",
        "severity": "info",
        "domain": "nav2",
        "category": "domain_detection"
      }
    ]
  }
}
Each item can be either a bare string (id only) or an object. When an object is
given, severity, domain, category, and affected_topics are checked
individually and produce separate expected_finding_* sub-checks in the report.
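The two item forms can be pictured with a small normalizer. This is a minimal sketch, not bagx's implementation: it assumes a finding is a dict carrying the same keys, and it compares every field by plain equality, which may differ from how bagx treats `affected_topics`.

```python
CHECKED_FIELDS = ("severity", "domain", "category", "affected_topics")


def normalize_expected(item):
    """Turn a bare id string or an object into (id, constraints)."""
    if isinstance(item, str):
        return item, {}
    constraints = {key: item[key] for key in CHECKED_FIELDS if key in item}
    return item["id"], constraints


def check_fields(finding, item):
    """Return one pass/fail result per constrained field (empty for bare ids)."""
    _, constraints = normalize_expected(item)
    return {field: finding.get(field) == want for field, want in constraints.items()}


finding = {
    "id": "nav2.detected",
    "severity": "info",
    "domain": "nav2",
    "category": "domain_detection",
}
expected = {
    "id": "nav2.detected",
    "severity": "info",
    "domain": "nav2",
    "category": "domain_detection",
}

print(check_fields(finding, expected))
# {'severity': True, 'domain': True, 'category': True}
print(check_fields(finding, "nav2.detected"))
# {}
```

Keeping one result per field is what lets a report say which attribute mismatched instead of only reporting that the finding as a whole failed.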
For temporal findings (those with a time_range), a time_range_overlap
constraint scopes the match to a window of the bag. The check passes when the
finding's time_range overlaps the window:
{
  "id": "anomaly.gnss.fix_lost.gnss",
  "time_range_overlap": {"start_ns": 120000000000, "end_ns": 145000000000}
}
This is useful for "GNSS must be lost only during the parked phase" style
expectations. It requires the eval report to include temporal findings, so run
bagx eval --include-anomaly first.
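The overlap test itself is a standard interval check. The sketch below assumes both ranges are closed intervals in nanoseconds; this page does not say whether bagx treats the boundaries as inclusive, and the finding ranges used here are invented for illustration.

```python
def ranges_overlap(finding_start_ns, finding_end_ns, window_start_ns, window_end_ns):
    """True when two closed intervals [start, end] share at least one instant."""
    return finding_start_ns <= window_end_ns and window_start_ns <= finding_end_ns


# Window taken from the manifest snippet above.
window = {"start_ns": 120_000_000_000, "end_ns": 145_000_000_000}

# Hypothetical finding that starts inside the window and runs past its end.
inside = ranges_overlap(130_000_000_000, 150_000_000_000,
                        window["start_ns"], window["end_ns"])
# Hypothetical finding that ends well before the window opens.
outside = ranges_overlap(10_000_000_000, 60_000_000_000,
                         window["start_ns"], window["end_ns"])

print(inside)   # True
print(outside)  # False
```

A partial overlap is enough for a match; the finding does not have to sit entirely inside the window.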
forbidden_findings
The inverse of expected_findings — fail when listed ids appear. Each item can
be either a bare string (id only) or an object with a severity_min scope:
{
  "expect": {
    "forbidden_findings": [
      "nav2.missing_global_plan",
      {"id": "sync.delay.high", "severity_min": "error"}
    ]
  }
}
With severity_min, the finding is only forbidden when its severity is at or
above the threshold — letting an info-level appearance through while gating on
the worse cases.
forbidden_findings also accepts time_range_overlap (same shape as in
expected_findings). Both qualifiers compose: the rule fires only when a
matching finding has severity ≥ severity_min and overlaps the
constraint window — useful for "no fix-lost during autonomy phase".
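How the two qualifiers combine can be sketched as a single predicate. Several details are assumptions rather than documented behaviour: the severity ordering `info < warning < error`, the `start_ns`/`end_ns` shape of a finding's `time_range`, closed-interval overlap, and the choice that a finding without a `time_range` never matches a windowed rule.

```python
# Assumed ordering, lowest to highest; bagx may define more levels.
SEVERITY_ORDER = ["info", "warning", "error"]


def rule_fires(finding, rule):
    """Return True when a forbidden_findings entry matches the finding."""
    if isinstance(rule, str):          # bare string: id match only
        return finding["id"] == rule
    if finding["id"] != rule["id"]:
        return False
    floor = rule.get("severity_min")
    if floor is not None:
        if SEVERITY_ORDER.index(finding["severity"]) < SEVERITY_ORDER.index(floor):
            return False               # below the threshold, let it through
    window = rule.get("time_range_overlap")
    if window is not None:
        time_range = finding.get("time_range")
        if time_range is None:
            return False               # assumption: untimed findings never match
        if (time_range["end_ns"] < window["start_ns"]
                or window["end_ns"] < time_range["start_ns"]):
            return False               # no overlap with the window
    return True


rule = {"id": "sync.delay.high", "severity_min": "error"}
print(rule_fires({"id": "sync.delay.high", "severity": "info"}, rule))   # False
print(rule_fires({"id": "sync.delay.high", "severity": "error"}, rule))  # True
```

Each qualifier only narrows the match, so adding one to a rule can never make it fire on more findings than the bare id would.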
max_severity
A per-category ceiling. Any finding whose severity exceeds the ceiling fails its case.
--exit-on
bagx benchmark suite.json --exit-on warning returns a non-zero exit code when
any case's worst finding severity reaches the threshold. Use this together with
the case-level passed/failed status to gate CI on both manifest expectations
and structural severity.
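In a CI job, the gate reduces to running the command and checking its exit code. The wrapper below is a sketch: `benchmark_command` and `gate` are illustrative helpers, and combining `--exit-on` with `--json` in one invocation is assumed to work because both flags appear on this page. Which other conditions make bagx exit non-zero is not specified here.

```python
import subprocess


def benchmark_command(manifest, exit_on=None, json_report=None):
    """Build the argv list for a bagx benchmark run."""
    cmd = ["bagx", "benchmark", manifest]
    if exit_on:
        cmd += ["--exit-on", exit_on]
    if json_report:
        cmd += ["--json", json_report]
    return cmd


def gate(manifest, exit_on, json_report=None, run=subprocess.run):
    """Run the suite and return True only when the process exits with code 0."""
    result = run(benchmark_command(manifest, exit_on, json_report))
    return result.returncode == 0


print(" ".join(benchmark_command("suite.json", exit_on="warning",
                                 json_report="benchmark-report.json")))
# bagx benchmark suite.json --exit-on warning --json benchmark-report.json
```

A CI script would then end with something like `sys.exit(0 if gate("suite.json", "warning") else 1)`. The `run` parameter exists only so the wrapper can be exercised without bagx installed.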
JSON contract
Benchmark JSON reports (schema_version 1.3.0+) include:
schema_version
report_type
bagx_version
worst_severity at suite level and on each case
finding_ids for each evaluated case
This makes it practical to gate regressions in CI or compare reports across releases.
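As one way to compare reports across releases, the sketch below diffs `finding_ids` per case between a baseline and a candidate report. The `cases` and `name` keys are assumptions about the report layout (only the fields listed above are documented here), and the sample reports are invented.

```python
def schema_at_least(report, minimum=(1, 3, 0)):
    """Compare the dotted schema_version numerically rather than as a string."""
    parts = tuple(int(p) for p in report["schema_version"].split("."))
    return parts >= minimum


def new_findings(baseline, candidate):
    """Map case name -> sorted finding ids present in candidate but not in baseline."""
    base = {case["name"]: set(case["finding_ids"]) for case in baseline["cases"]}
    diff = {}
    for case in candidate["cases"]:
        added = set(case["finding_ids"]) - base.get(case["name"], set())
        if added:
            diff[case["name"]] = sorted(added)
    return diff


# Invented sample reports; "cases" and "name" are assumed key names.
baseline = {
    "schema_version": "1.3.0",
    "worst_severity": "info",
    "cases": [{"name": "nav2-dogfood", "worst_severity": "info",
               "finding_ids": ["nav2.detected"]}],
}
candidate = {
    "schema_version": "1.3.0",
    "worst_severity": "error",
    "cases": [{"name": "nav2-dogfood", "worst_severity": "error",
               "finding_ids": ["nav2.detected", "sync.delay.high"]}],
}

if schema_at_least(baseline) and schema_at_least(candidate):
    print(new_findings(baseline, candidate))
# {'nav2-dogfood': ['sync.delay.high']}
```

Diffing on stable ids rather than recommendation text keeps the comparison meaningful even when human-facing wording changes between releases.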