MultiSet AI VPS Gen2: Localization, Leveled Up

Today we're shipping VPS Gen2 — a ground-up upgrade to MultiSet's Visual Positioning System. Gen2 introduces an attention-based mechanism that prioritizes the features in a scene that are genuinely unique and easier to track, and adds production-grade reject-gating — so the system not only finds more correct poses but also knows when to stay quiet.

Rather than treating every pixel equally, Gen2's attention mechanism assigns higher weight to the pixels that are genuinely distinctive and down-weights the generic, repeated ones — which is precisely what helps in repetitive, low-feature areas. By focusing on the small set of distinctive cues in a view rather than its overall appearance, Gen2 can tell apart places that look almost identical — the repeated corridors, train-station platforms, hospital wings, parking levels, and retail aisles that defeat older systems. Two near-identical corridors no longer collapse onto the same pose.

Gen2 was trained and evaluated on a deliberately diverse mix of environments — transit interiors, multi-floor retail, crowded indoor scenes with multi-year temporal drift, and wide-area outdoor spaces. Our large-scale benchmark spans more than 70 distinct datasets and 16,068 test queries, holding every variable fixed except the positioning engine itself. The headline: Gen2 wins on recall in essentially every environment, and the margin grows with difficulty.

Net-useful poses: up to 25% gain
Median error, hard scene: 3.4× lower
False-positive rate: 73% → 13%
Recall lift, hardest scenes: +22 pt

01 Higher recall at every precision gate

On the full large-scale benchmark — 70+ datasets, 16,068 queries — Gen2 improves localized recall across every accuracy threshold, from centimeter-tight (≤ 0.1 m / 1°) all the way out to coarse (≤ 5 m / 10°). It also localizes more queries overall while reducing false positives at every distance band.
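To make the "recall by accuracy gate" metric concrete, here is a minimal sketch of how it could be computed from per-query pose errors. It is an illustration, not MultiSet's evaluation code: the function name `recall_by_gate`, the error tuple layout, and the toy data are assumptions. Only the three gates named in this post are included, and unlocalized queries count as misses because the metric is a share of all queries.

```python
from typing import Dict, Optional, Sequence, Tuple

# (position error in metres, rotation error in degrees); None = no pose returned.
PoseError = Optional[Tuple[float, float]]

# Gates named in the post. Any intermediate gates on the chart are not known here.
GATES = [(0.1, 1.0), (0.5, 5.0), (5.0, 10.0)]


def recall_by_gate(
    errors: Sequence[PoseError],
    gates: Sequence[Tuple[float, float]] = GATES,
) -> Dict[Tuple[float, float], float]:
    """Share of ALL queries whose pose falls inside each (metres, degrees) gate."""
    total = len(errors)
    if total == 0:
        return {gate: 0.0 for gate in gates}
    result = {}
    for max_m, max_deg in gates:
        hits = sum(
            1
            for e in errors
            if e is not None and e[0] <= max_m and e[1] <= max_deg
        )
        result[(max_m, max_deg)] = hits / total
    return result


# Toy data for illustration only, not benchmark results.
toy = [(0.05, 0.5), (0.3, 2.0), (2.0, 8.0), (12.0, 40.0), None]
for gate, recall in recall_by_gate(toy).items():
    print(gate, f"{recall:.0%}")
```

Because each gate is looser than the one before it, recall can only stay flat or rise from the tightest gate to the coarsest.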

Recall by accuracy gate: share of all queries, 70+ datasets (Gen1 vs. Gen2)

Net-useful poses, defined as queries localized correctly to ≤ 0.5 m / 5° minus dangerous false positives that land more than 5 m off, improve across the benchmark, with per-scene gains reaching up to 25% in the hardest environments.
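The net-useful definition above can be sketched in a few lines. This assumes a plain count of correct poses minus a count of dangerous ones; the post does not say whether the reported gains are absolute or relative, so the sketch stops at the raw score. The function name `net_useful` and the toy data are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple

# (position error in metres, rotation error in degrees); None = no pose returned.
PoseError = Optional[Tuple[float, float]]


def net_useful(
    errors: Sequence[PoseError],
    useful_m: float = 0.5,
    useful_deg: float = 5.0,
    danger_m: float = 5.0,
) -> int:
    """Correct poses (<= 0.5 m / 5 deg) minus dangerous false positives (> 5 m)."""
    useful = sum(
        1
        for e in errors
        if e is not None and e[0] <= useful_m and e[1] <= useful_deg
    )
    dangerous = sum(1 for e in errors if e is not None and e[0] > danger_m)
    return useful - dangerous


# Toy data for illustration only, not benchmark results.
engine_a = [(0.2, 1.0), (8.0, 30.0), (9.5, 12.0), None]
engine_b = [(0.2, 1.0), (0.4, 3.0), (9.5, 12.0), None]
print(net_useful(engine_a), net_useful(engine_b))
```

A missing pose contributes nothing to the score, while a pose more than 5 m off subtracts from it, so an engine is penalized more for being confidently wrong than for staying silent.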

02 The harder the scene, the bigger the win

The gains aren't uniform — and that's the point. On easy, feature-rich open floors Gen1 was already strong, so Gen2 adds a few points. But on the large, visually repetitive spaces where deployments actually struggle — long parallel corridors, train-station platforms, hospital wings, repeated signage, basement levels — Gen2's attention mechanism earns its keep, latching onto the small distinguishing details that separate one near-identical view from the next. The result: +15 to +22 points of recall in exactly the places older systems fail.

Recall (≤ 5 m) by environment, Gen1 → Gen2

On the toughest transit scene (~16,500 reference images, 2,620 queries), recall jumps from 49% to 68% while median position error drops from 6.0 m to 1.8 m. It is the single strongest result in the benchmark.

03 Better typical accuracy

Recall measures whether you got a pose; accuracy measures whether it's any good. Gen2 improves median position and rotation error across the board, most dramatically on the hardest, most repetitive scenes, where median position error falls by a factor of up to 3.4 and median rotation error is roughly halved.

Median error on the hardest transit scene — lower is better

On centimeter-scale indoor scenes the wins are subtler in absolute terms but consistent: even under heavy crowding and a three-year gap between the reference scan and the query images, Gen2 nearly doubles recall (27% → 49%) while holding median error at ~9.5 cm and under 1.1°.

04 Far fewer confident-but-wrong poses

The most dangerous failure mode for any positioning system isn't a missing pose — it's a confident, wrong one, because downstream systems act on it. Gen2 introduces reject-gating that turns most would-be false positives into either correct localizations or honest rejections. In our constrained-indoor stress test, the false-positive rate collapsed from 73% to 13%.
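As a rough illustration of how a reject gate changes the outcome distribution, the sketch below classifies each query as correct, rejected, or false positive. The post does not describe how Gen2's gating works internally, so the per-pose confidence score, the `min_conf` threshold, and the 5 m correctness cutoff used here are all assumptions, and the data is invented.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple

# (position error in metres, or None if no pose; hypothetical confidence in [0, 1])
Result = Tuple[Optional[float], float]


def classify(result: Result, min_conf: float, ok_m: float = 5.0) -> str:
    """Label one query outcome under a simple confidence gate (an assumption)."""
    pos_err_m, confidence = result
    if pos_err_m is None or confidence < min_conf:
        return "rejected"
    return "correct" if pos_err_m <= ok_m else "false_positive"


def outcome_shares(results: Sequence[Result], min_conf: float) -> Dict[str, float]:
    """Share of all queries that end up in each outcome bucket."""
    counts = Counter(classify(r, min_conf) for r in results)
    total = len(results)
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ("correct", "rejected", "false_positive")
    }


# Toy data for illustration only, not benchmark results.
toy = [(0.2, 0.9), (9.0, 0.3), (15.0, 0.8), (0.4, 0.7), (None, 0.0)]
print("ungated:", outcome_shares(toy, min_conf=0.0))
print("gated:  ", outcome_shares(toy, min_conf=0.5))
```

In this toy run the gate converts one wrong pose into a rejection. A plain threshold like this can only trade false positives for rejections; it does not model the other effect described above, where would-be false positives become correct localizations.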

Constrained-indoor stress test — outcome distribution

Across the full multi-scene benchmark these effects compound: total recall climbs from 54.6% to 61.8%, and the net-useful gain holds even after accounting for the handful of scenes where Gen1's per-pose precision was marginally higher.

Build with it

Multi-query localization API

Every benchmark above was measured in single-frame query mode — one image in, one pose out. That's the floor, not the ceiling. MultiSet's Multi-query localization endpoint pushes accuracy further still: it fuses a short burst of frames together with the device's local SLAM data into a single, more robust pose. On top of Gen2's gains, multi-frame fusion adds the most where it's needed most — repetitive corridors and crowded scenes. It's a drop-in upgrade: existing integrations get Gen2 with no code changes.

Read the Map Query API docs →

The short version

Broad upgrade. Higher recall in essentially every environment type we tested.
Scales with difficulty. Biggest gains in large, repetitive transit and basement spaces — where it counts.
Better median accuracy. Position and rotation error improve across the board, up to 3.4× on the hardest scenes.
Fewer dangerous failures. Reject-gating drops the false-positive rate from 73% to 13% under stress.
No migration cost. Available now on all new maps created after the Platform 2.0.0 release, across all query APIs.