Benchmarking And Release Gate
This page describes the recommended benchmark path and the release/readiness gate used for the default permissive workflow.
Recommended Benchmark
The standard benchmark path for this repository is:
bash scripts/download_ntu_viral_tnp01.sh
bash scripts/run_rko_lio_graph_benchmark.sh
That wrapper:
- uses the bundled NTU VIRAL rosbag2
- runs RKO-LIO + graph_based_slam
- saves raw and corrected trajectories
- computes APE against the Leica prism reference
- verifies the Autoware map bundle when present
- writes metrics.json for the reporting pipeline
Typical outputs are written under:
- output/bench_rko_lio_ntu_viral_<name>/traj_raw_prism.tum
- output/bench_rko_lio_ntu_viral_<name>/traj_corrected_prism.tum
- output/bench_rko_lio_ntu_viral_<name>/ape_raw_vs_gt.txt
- output/bench_rko_lio_ntu_viral_<name>/ape_corrected_vs_gt.txt
- output/bench_rko_lio_ntu_viral_<name>/metrics.json
Summaries And HTML Report
To summarize all collected runs:
python3 scripts/benchmark_summary.py \
--root output \
--write-md output/benchmark_summary.md \
--write-csv output/benchmark_summary.csv
To generate the static HTML report:
python3 scripts/generate_html_report.py \
--root output \
--out output/latest_report.html
To generate a short public-beta readiness report from the current local artifacts:
python3 scripts/generate_v2_beta_readiness_report.py
By default this writes:
output/v2_beta_readiness_<YYYYMMDD>.md
To generate a short public-facing map-authoring positioning report from the tracked benchmark, GNSS, dynamic-filter, and classic-path artifacts:
python3 scripts/generate_map_authoring_report.py \
--out output/map_authoring_report_$(date +%Y%m%d).md \
--write-json output/map_authoring_report_$(date +%Y%m%d).json
To stage a reusable submission-style bundle from an existing run directory:
bash scripts/create_map_authoring_submission_bundle.sh \
output/bench_rko_lio_ntu_viral_fresh_20260324 \
output/submission_bundle_ntu_viral_fresh \
--report output/map_authoring_report_$(date +%Y%m%d).md \
--verify-map
That bundle standardizes:
- pointcloud_map/
- map_projector_info.yaml
- metrics.json when present
- trajectories and key logs when present
- focused reports under reports/, with sibling json/svg copied automatically when present
- map_qa_summary.md
- manifest.json
To generate a separate stress-validation report that distinguishes the current default path from older long-loop and hard-dataset evidence:
python3 scripts/generate_stress_validation_report.py
By default this writes:
output/stress_validation_report_<YYYYMMDD>.md
To summarize dynamic-object-filter behavior across the tracked Leo Drive save-time benchmarks:
python3 scripts/generate_dynamic_object_filter_validation_report.py \
--out output/dynamic_object_filter_validation_report_$(date +%Y%m%d).md \
--write-json output/dynamic_object_filter_validation_report_$(date +%Y%m%d).json \
--write-svg output/dynamic_object_filter_validation_report_$(date +%Y%m%d).svg
The default report compares the tracked bag1 and bag6 dynamic-filter
benchmarks, so point reduction and voxel-removal behavior can be discussed as
cross-dataset evidence rather than a single-case anecdote. It also reports
coarse tile-footprint preservation via shared metadata tiles, tile Jaccard,
and filtered-tile overlap ratio.
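The report's exact definitions are internal to the generator script, but both tile metrics can be sketched over sets of occupied tile coordinates. This is an illustrative assumption-based implementation, not the script's actual code; tiles are assumed to be identified by integer grid coordinates:

```python
def tile_jaccard(tiles_a: set, tiles_b: set) -> float:
    """Jaccard index over occupied metadata-tile coordinates."""
    if not tiles_a and not tiles_b:
        return 1.0  # two empty maps trivially agree
    return len(tiles_a & tiles_b) / len(tiles_a | tiles_b)

def filtered_tile_overlap(baseline_tiles: set, filtered_tiles: set) -> float:
    """Fraction of baseline tiles still occupied after filtering.

    Assumption: "filtered-tile overlap ratio" means coverage of the
    unfiltered footprint by the filtered map.
    """
    if not baseline_tiles:
        return 1.0
    return len(baseline_tiles & filtered_tiles) / len(baseline_tiles)
```

High values of both indicate the filter removed points without shrinking the map's coarse footprint.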
To promote an already-recorded aligned cross-validation run such as the MID360
long-loop check into metrics.json so it appears in benchmark_summary.md and
latest_report.html:
python3 scripts/write_aligned_trajectory_metrics.py \
--out-dir output/bench_rko_lio_mid360_v3 \
--bag demo_data/glim_mid360/rosbag2_2024_04_16-14_17_01 \
--reference-tum output/glim_mid360_reference.tum \
--corrected-tum output/bench_rko_lio_mid360_v3/traj_corrected.tum \
--raw-tum output/bench_rko_lio_mid360_v3/traj_raw.tum \
--graph-log output/bench_rko_lio_mid360_v3/graph_slam.log \
--reference-source glim_mid360_reference \
--reference-kind cross_validation \
--reference-label GLIM \
--points-topic /livox/lidar \
--points-frame livox_frame \
--robot-frame livox_frame
The summary/report pipeline now exposes the reference kind, so ground_truth
and cross_validation runs do not appear as if they were the same type of APE.
For a public-facing snapshot built on top of these artifacts, see
docs/comparison.md and docs/releases/v0.2.2.md.
To rerun the current MID360 cross-validation benchmark end-to-end:
bash scripts/run_rko_lio_mid360_crossval_benchmark.sh
This MID360 wrapper defaults to a tuned RKO-LIO + graph_based_slam profile
with voxel_size=0.5, max_range=80.0, search_submap_num=5,
loop_edge_dedup_index_window=20, and loop_edge_info_weight=200.
To benchmark the real open-data Leo Drive driving_30_kmh bag with mixed
RTK/non-RTK GNSS quality:
git clone --depth=1 https://github.com/autowarefoundation/applanix.git /tmp/applanix
bash scripts/run_open_data_applanix_velodyne_gnss_benchmark.sh \
--bag demo_data/autoware_leo_drive_isuzu/driving_30_kmh_2022_06_10-15_47_42_compressed \
--applanix-msg-dir /tmp/applanix/applanix_msgs/msg \
--verify-map
That wrapper writes a local Applanix_GSOF49 reference trajectory,
traj_raw.tum, traj_corrected.tum, and metrics.json so the run appears in
benchmark_summary.md and latest_report.html.
When the main bag already contains native sensor_msgs/msg/NavSatFix or
sensor_msgs/msg/Imu, the same wrapper now prefers those real topics before it
falls back to Applanix sidecar generation.
Current Leo Drive packet-path evidence is:
- driving_30_kmh, GNSS-only classic path: APE RMSE 195.285 m
- bag1_front, no_imu: APE RMSE 0.248 m
- bag1_front, native /sensing/imu/imu_data: APE RMSE 0.251 m
- bag6_front, no_imu: APE RMSE 0.422 m
- bag6_front, native /sensing/imu/imu_data: APE RMSE 0.365 m
The important result is that packet IMU deskew is usable on the native
all-sensors bags, but only when the benchmark is replayed conservatively.
The wrapper now auto-selects rate=1.0 whenever --use-imu=true and --rate
is omitted. The earlier 20m+ regressions were runtime-sensitivity artifacts,
not a proof that the deskew math itself was fundamentally broken. To reproduce
the current experimental IMU result on the driving bag:
git clone --depth=1 https://github.com/autowarefoundation/applanix.git /tmp/applanix
bash scripts/run_open_data_applanix_velodyne_gnss_benchmark.sh \
--bag demo_data/autoware_leo_drive_isuzu/driving_30_kmh_2022_06_10-15_47_42_compressed \
--applanix-msg-dir /tmp/applanix/applanix_msgs/msg \
--use-imu true \
--tf-bag demo_data/autoware_leo_drive_isuzu/all-sensors-bag6_compressed \
--robot-frame-id base_link \
--imu-frame-id base_link \
--verify-map
To compare the same packet path on all-sensors-bag6 while isolating IMU
deskew from GNSS:
git clone --depth=1 https://github.com/autowarefoundation/applanix.git /tmp/applanix
bash scripts/run_open_data_applanix_velodyne_gnss_benchmark.sh \
--bag demo_data/autoware_leo_drive_isuzu/all-sensors-bag6_compressed \
--packet-topic /sensing/lidar/front/velodyne_packets \
--applanix-msg-dir /tmp/applanix/applanix_msgs/msg \
--use-gnss false \
--verify-map
bash scripts/run_open_data_applanix_velodyne_gnss_benchmark.sh \
--bag demo_data/autoware_leo_drive_isuzu/all-sensors-bag6_compressed \
--packet-topic /sensing/lidar/front/velodyne_packets \
--applanix-msg-dir /tmp/applanix/applanix_msgs/msg \
--tf-bag demo_data/autoware_leo_drive_isuzu/all-sensors-bag6_compressed \
--use-gnss false \
--use-imu true \
--verify-map
bash scripts/run_open_data_applanix_velodyne_gnss_benchmark.sh \
--bag demo_data/autoware_leo_drive_isuzu/all-sensors-bag6_compressed \
--packet-topic /sensing/lidar/left/velodyne_packets \
--applanix-msg-dir /tmp/applanix/applanix_msgs/msg \
--tf-bag demo_data/autoware_leo_drive_isuzu/all-sensors-bag6_compressed \
--use-gnss false \
--use-imu true \
--imu-rotation-use-orientation false \
--verify-map
To summarize the current cross-dataset odom-prior validation evidence after the classic-path runs have been recorded:
python3 scripts/generate_odom_prior_validation_report.py \
--out output/odom_prior_validation_report_$(date +%Y%m%d).md \
--write-json output/odom_prior_validation_report_$(date +%Y%m%d).json \
--write-svg output/odom_prior_validation_report_$(date +%Y%m%d).svg
This report intentionally compares driving_30_kmh and bag6_front side by
side, because the current velocity-based prior helps the fallback classic path
on one dataset but hurts, or helps to a different degree, on the other.
To validate packet IMU deskew as a repeatable matrix on real open data, use:
git clone --depth=1 https://github.com/autowarefoundation/applanix.git /tmp/applanix
bash scripts/run_open_data_packet_imu_deskew_validation_matrix.sh \
--applanix-msg-dir /tmp/applanix/applanix_msgs/msg
That matrix compares no_imu and native-IMU runs for the default bag1_front
and bag6_front cases at rate=1.0 and emits:
- packet_imu_deskew_validation.md
- packet_imu_deskew_validation.json
The report is generated by generate_packet_imu_deskew_validation_report.py
and fails if any case violates the configured path-coverage, RMSE-regression,
or matched-pose thresholds.
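The per-case pass/fail decision can be sketched as below. The key names and default threshold values here are illustrative assumptions, not the script's real configuration; the actual thresholds live in the matrix wrapper:

```python
def deskew_case_passes(case: dict,
                       min_path_coverage: float = 0.9,
                       max_rmse_regression_m: float = 0.1,
                       min_matched_poses: int = 50) -> bool:
    """Sketch of the per-case gate: fail on low path coverage, on an IMU-run
    RMSE regression beyond the allowance, or on too few matched poses.
    All keys and defaults are hypothetical."""
    if case["path_coverage"] < min_path_coverage:
        return False
    # Regression is measured against the paired no_imu run on the same bag.
    if case["imu_ape_rmse_m"] - case["no_imu_ape_rmse_m"] > max_rmse_regression_m:
        return False
    return case["matched_poses"] >= min_matched_poses
```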
The same bag also exposes native /gnss/fix. The backend now falls back to
receive time when the NavSatFix header stamp is far from ROS time
(gnss_header_stamp_max_skew_sec, default 30 s), which lets the graph attach
GNSS edges on all-sensors-bag6. In practice that native /gnss/fix still
disagrees with the GSOF49 reference enough to degrade the cross-validation
APE, so all-sensors-bag6 is useful for georeferenced smoke tests but not a
clean GNSS benchmark source.
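The stamp-fallback rule is small enough to show in isolation. A minimal sketch of the documented behavior, assuming times are expressed as seconds; the backend's real implementation and type handling are not shown here:

```python
def gnss_edge_stamp(header_stamp_sec: float, receive_time_sec: float,
                    max_skew_sec: float = 30.0) -> float:
    """Use the NavSatFix header stamp unless it is farther than
    max_skew_sec (gnss_header_stamp_max_skew_sec, default 30 s) from the
    receive time, in which case fall back to receive time."""
    if abs(header_stamp_sec - receive_time_sec) > max_skew_sec:
        return receive_time_sec  # header stamp is untrustworthy
    return header_stamp_sec
```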
To compare place-recognition behavior on MID360, rerun the same benchmark with and without an optional descriptor family and then render the short report:
bash scripts/run_place_recognition_benchmark.sh
To compare the current experimental BEV-assisted distance rerank instead:
bash scripts/run_place_recognition_benchmark.sh --candidate-mode bev_rerank
The report shows:
- runtime use_scan_context
- accepted/attempted loop counts
- accepted loop source counts
- observed ScanContext loop candidate count
- observed BEV rerank hint count
- observed SOLiD rerank candidate count
- APE RMSE delta between the two runs
- optional JSON summary via --write-json
- optional SVG summary via --write-svg
The report is generated by generate_place_recognition_report.py.
Current checked-in evidence is:
- fair current-code baseline rerun: output/bench_rko_lio_mid360_current_default_rerun_20260326/metrics.json (APE RMSE 4.096 m)
- current best checked-in Scan Context candidate with DB/index fix, aggregated descriptor/registration cloud, and scan_context_threshold=0.55: output/bench_rko_lio_mid360_sc055_yawguess_scagg_screg_20260326/metrics.json (APE RMSE 3.568 m)
- current experimental BEV-assisted distance rerank: output/bench_rko_lio_mid360_20260326_202840/metrics.json (APE RMSE 3.607 m)
- best observed BEV-assisted distance rerank: output/bench_rko_lio_mid360_20260326_202119/metrics.json (APE RMSE 3.533 m)
- short comparison report: output/place_recognition_report_20260326.md
That candidate currently beats both the fair rerun baseline and the published
3.641 m default artifact, but the accepted loop still comes from the
distance-based path. Treat use_scan_context=true as an opt-in tuning path
rather than the repository default.
The BEV path is now more useful as a sensor-agnostic distance-candidate rerank than as a standalone loop source. It has shown better-than-baseline runs, but its rerun variance is still too large for a default-on setting.
To summarize the current stop/go decisions for place recognition and the classic fallback path in one short report:
python3 scripts/generate_exploration_closeout_report.py \
--out output/exploration_closeout_report_$(date +%Y%m%d).md \
--write-json output/exploration_closeout_report_$(date +%Y%m%d).json
A local snapshot can be written to:
output/exploration_closeout_report_20260327.md
That report pins down the current repository position in one place:
- public default place recognition remains the distance-based path
- Scan Context stays opt-in
- BEV-assisted rerank stays experimental
- SOLiD stays experimental/off by default
- the classic path remains a fallback workflow rather than the main public path
Dynamic Object Filter Benchmark
The dynamic-object filter is save-time only. It does not change live odometry or loop closure, so the right comparison is the saved map output with the same bag and the same backend settings.
Run the paired comparison on the open-data bag6 smoke path:
bash scripts/run_dynamic_object_filter_benchmark.sh
That wrapper:
- runs run_open_data_gnss_smoke.sh twice on the same bag
- saves no_filter/ and dynamic_filter/ outputs under one root
- renders dynamic_object_filter_report.md, dynamic_object_filter_report.json, and dynamic_object_filter_report.svg
The report is generated by generate_dynamic_object_filter_report.py and
tracks:
- Autoware map verify result for both runs
- projector type
- saved grid cell count
- metadata tile count
- total saved point count
- filter candidate/kept/removed voxel counts
- saved-point reduction ratio
The current checked-in evidence is:
- baseline smoke: output/open_data_gnss_smoke_bag6_autodetect_throttled_20260325
- filtered smoke: output/open_data_gnss_smoke_bag6_dynamic_filter_20260326
- benchmark report bundle: output/dynamic_object_filter_benchmark_bag6_20260326
In that checked run, the saved map went from 138732 to 87861 points while
keeping verify_autoware_map.py at PASS.
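The saved-point reduction ratio the report tracks follows directly from those two counts:

```python
# Point counts taken from the checked-in bag6 runs cited above.
baseline_points = 138_732   # no_filter saved map
filtered_points = 87_861    # dynamic_filter saved map

reduction = 1.0 - filtered_points / baseline_points
print(f"saved-point reduction: {reduction:.1%}")  # prints "saved-point reduction: 36.7%"
```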
Leo Drive Classic Path Benchmark
To compare the current classic scanmatcher + graph_based_slam path on the
mixed-quality Leo Drive driving_30_kmh open-data bag, run:
git clone --depth=1 https://github.com/autowarefoundation/applanix.git /tmp/applanix
bash scripts/run_open_data_classic_path_benchmark_suite.sh \
--applanix-msg-dir /tmp/applanix/applanix_msgs/msg \
--verify-map
This wrapper emits:
- classic_path_report.md
- classic_path_report.json
- classic_path_report.svg
The report is generated by generate_classic_path_report.py.
The checked-in snapshot is:
output/classic_path_report_20260327.md
Current evidence is:
- no GNSS: APE RMSE 313.695 m
- GNSS only: APE RMSE 195.285 m
- GNSS + velocity-based planar odom prior: APE RMSE 175.732 m
- GNSS + IMU: APE RMSE 271.144 m
So the classic path still needs work, but the direction is clearer now:
backend GNSS helps substantially, and a velocity-based planar odom prior helps
further on driving_30_kmh, while the current packet IMU path is still not a
default recommendation.
Release/Readiness Gate
To run the local readiness gate in one command:
bash scripts/run_release_readiness_checks.sh --ape-threshold 0.10
That wrapper can run:
- default build and package tests
- benchmark summary generation
- HTML report generation
- optional Autoware dogfood
With --ape-threshold, the gate is hard:
- it exits non-zero if any selected run is missing APE
- it exits non-zero if any selected run exceeds the threshold
- by default run_release_readiness_checks.sh applies that hard gate only to ground_truth runs; cross_validation runs stay visible in reports without blocking release
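The gate's decision logic can be sketched as follows. Field names ("reference_kind", "ape_rmse_m") are assumptions for illustration; the real wrapper reads equivalent values out of each run's metrics.json:

```python
def readiness_gate(runs: list, ape_threshold: float) -> int:
    """Sketch of the --ape-threshold hard gate.

    Only ground_truth runs can block release; cross_validation runs stay
    visible in reports but never fail the gate.
    """
    for run in runs:
        if run.get("reference_kind") != "ground_truth":
            continue  # cross_validation: report-only, never blocking
        ape = run.get("ape_rmse_m")
        if ape is None or ape > ape_threshold:
            return 1  # missing APE or over-threshold -> non-zero exit
    return 0
```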
CI Coverage
CI exercises the reporting path in two ways:
- a passing synthetic benchmark fixture must generate summary and HTML report
- a failing synthetic benchmark fixture must trip the threshold gate with exit code 2
The fixture generator is:
python3 scripts/generate_sample_benchmark_metrics.py \
--root /tmp/ci_fixture \
--profile passing
Use --profile failing to create a negative-path fixture.
Recommended Artifacts To Publish
If you want benchmark results to be easy to consume, publish:
- metrics.json
- benchmark_summary.md
- benchmark_summary.csv
- latest_report.html
- the exact param file used for the run
- docs/comparison.md when publishing the current positioning of the repo
- docs/releases/v0.2.2.md when publishing the current public beta scope
- v2_beta_readiness_<YYYYMMDD>.md when preparing a public beta snapshot
- stress_validation_report_<YYYYMMDD>.md when discussing long-loop or aggressive-motion evidence
Related Commands
- Autoware quickstart: docs/autoware-quickstart.md
- public Autoware entrypoint: bash scripts/run_autoware_quickstart.sh
- public comparison page: docs/comparison.md
- end-to-end dogfood: bash scripts/run_rko_lio_graph_autoware_dogfood.sh --auto-exit-secs 20