
On April 19, Bilawal Sidhu, a former Google PM who co-founded and led Google Maps Immersive View and shipped the ARCore Geospatial API, published a video demonstrating something that, even a year ago, would have felt like a research prototype.
He put on a pair of $299 Meta Ray-Ban smart glasses. He walked through a pre-scanned building. And a visual positioning system running on MultiSet AI's SDK localized him to within centimeters in real time, hands-free, with no depth sensors, no LiDAR, and no SLAM running on the device. Just a camera, a pre-built 3D map, and a VPS query returning a 6-DoF pose.
The comment section responded accordingly. Equal parts excitement and privacy alarm. References to Orwell. Calls for legislation. Predictions about who would actually get their hands on the technology (the most-upvoted use case comment: "Who actually needs this: Firefighters. Who will actually get this: Police").
Both reactions are understandable. Both miss the more important signal.
This isn't a concept video. It's a working integration, built on MultiSet's wearable VPS SDK, running on hardware you can buy at a mall. And on April 1, Meta featured MultiSet on their own developer blog as one of the first wave of integrations built on the Wearables Device Access Toolkit — the SDK Meta released in December 2025 to open Ray-Ban camera and audio streams to third-party developers.
To the best of our knowledge, no other third-party VPS provider has publicly demonstrated visual positioning on consumer-grade smart glasses. This post is about what that capability actually unlocks — for developers, for enterprises, and for the people who will eventually wear these things to work every day. It's also about the privacy questions the technology raises, and how we think about them at MultiSet.
Before we talk about use cases, it's worth understanding what the technology is actually doing — without jargon, and without hand-waving.
Step 1: Map the space. Someone scans the environment once, using whatever capture method they already have. iPhone LiDAR. A Matterport camera. A NavVis trolley. A Leica scanner. A Polycam app. An E57 point cloud export from FARO or XGrids. MultiSet is scan-agnostic — we ingest data from any of these sources and process it into a visual feature map. The space owner controls when, how, and by whom this scan happens.
Step 2: The glasses see. When a user wearing Meta Ray-Ban glasses enters the mapped space, the glasses stream camera frames to a companion iPhone app at 720p and 30 frames per second over Bluetooth. This stream is enabled by Meta's Wearables Device Access Toolkit, which gives developers access to the camera feed, the five-microphone array, and the open-ear speakers.
Step 3: MultiSet localizes. The companion app sends the camera frame to MultiSet's VPS API over a TLS-encrypted connection. Our engine extracts visual features from the frame, matches them against the pre-built map, and computes the device's precise position and orientation — six degrees of freedom, with a median accuracy of approximately 5 centimeters. The pose is returned in milliseconds. The camera frame is discarded immediately after processing. It is never written to disk, never cached, never logged, and never shared with any third party.
Step 4: Spatial guidance delivered. The returned pose anchors the user in the mapped space. Today, guidance is delivered as audio through the glasses' speakers (turn-by-turn spatial navigation, contextual callouts) and as a visual overlay on the companion phone. When Meta opens their display APIs to third-party developers — the Ray-Ban Display HUD shipped in late 2025 but third-party display access isn't yet available — the same pose data will drive in-lens visual overlays without any change to the underlying localization pipeline.
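A 6-DoF pose is compact: three numbers for position, four for an orientation quaternion. A minimal sketch of how a client app might represent and use one — the field names, and the convention that the camera looks down its local -Z axis, are our own illustrative assumptions, not MultiSet's actual response schema:

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """A 6-DoF pose: position in map coordinates plus a unit orientation quaternion."""
    x: float
    y: float
    z: float
    qx: float
    qy: float
    qz: float
    qw: float

    def forward(self):
        """Unit vector the wearer is facing in map coordinates.

        Assumes the camera looks down its local -Z axis (a common AR
        convention); this rotates (0, 0, -1) by the quaternion.
        """
        return (
            -2 * (self.qx * self.qz + self.qy * self.qw),
            -2 * (self.qy * self.qz - self.qx * self.qw),
            -(1 - 2 * (self.qx * self.qx + self.qy * self.qy)),
        )

    def distance_to(self, point):
        """Straight-line distance from the wearer to a map-frame point."""
        return math.dist((self.x, self.y, self.z), point)
```

An app consuming the pose never touches pixels: it works entirely in map coordinates, asking "where is the wearer, which way are they facing, and how far is the target?"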
The critical difference between this and phone-based VPS: the user never pulls out a device. They never point a camera. They just walk. The glasses are always on, always oriented toward where the wearer is looking. That shifts the interaction model from "on-demand localization" to "continuous spatial awareness" — and it changes what you can build.
You can watch MultiSet's own wearable VPS demo on Meta Ray-Ban for a walkthrough of the full integration in action.
Bilawal's video title grabbed attention by referencing military-grade situational awareness. But the majority of the video focuses on something more interesting: the civilian, commercial, and industrial applications that become possible when centimeter-accurate positioning runs on hardware that weighs 50 grams and costs under $300.
Here's what we see deploying.
There's a category of work where holding a phone or tablet isn't just inconvenient — it's a safety issue, a compliance issue, or a productivity bottleneck. VPS on glasses eliminates the device from the equation entirely.
A maintenance technician servicing industrial equipment needs both hands on the machine. Today, they reference a paper manual or a tablet propped on a surface, constantly breaking their line of sight and their workflow. With VPS-enabled glasses, step-by-step work instructions can be anchored to the exact location on the equipment they're servicing — delivered as spatial audio cues or, eventually, as visual overlays locked to the physical asset. The positioning layer knows not just that the technician is "in the plant" but that they're standing at valve assembly 4B on the third floor, facing east.
Field service engineers navigating unfamiliar facilities — data centers, hospital campuses, manufacturing floors — can receive audio turn-by-turn guidance without ever looking down at a phone. No Bluetooth beacon infrastructure required. No QR codes to scan at every junction. Just a pre-scanned map and a pair of glasses.
Warehouse workers following pick paths can be guided hands-free to the exact shelf and bin location, with audio confirmation of the item and quantity. This is the same use case DHL has been piloting with specialized enterprise AR headsets for years — except now the hardware is a pair of consumer glasses that doesn't look like a piece of industrial equipment strapped to someone's face.
The most-upvoted positive use case in Bilawal's comment section was firefighters. It's easy to see why.
A firefighter entering a smoke-filled building needs spatial awareness without visual line of sight. If the building has been pre-scanned — which is increasingly common as facilities adopt digital twins for insurance, compliance, and operations — VPS can deliver audio-guided navigation through the structure. "Stairwell is 12 meters ahead and to your left. Victim's reported location is one floor up, northeast corner." No beacon infrastructure to install. No signals to degrade in fire conditions. The map is already built; the glasses just need to localize against it.
The same principle applies to any scenario where a team needs shared spatial awareness without line of sight: search and rescue coordination, hazmat response, tactical law enforcement. The positioning layer provides a common coordinate system. Every team member's location is known relative to the same map. Commands become spatial, not descriptive.
Meta's own developer blog post featuring MultiSet also highlighted OOrion and Aira — two applications built on the same Wearables Device Access Toolkit that serve blind and low-vision users. OOrion uses camera input and spatial audio to help users find objects and navigate environments. Aira connects professional visual interpreters to the wearer's first-person camera feed for live verbal guidance.
VPS adds a critical layer to these applications: persistent, centimeter-accurate positioning. Instead of general scene understanding ("there's a door ahead"), VPS-enabled accessibility tools can provide precise spatial instructions anchored to a known map ("the elevator is 8 meters ahead on your right, past the second hallway"). The difference between approximate and precise is the difference between useful and reliable.
GPS doesn't work indoors. This has been true for decades, and it remains the single biggest gap in consumer location services. Bluetooth beacons work, but they require installation, maintenance, calibration, and ongoing infrastructure cost at every venue. Wi-Fi fingerprinting is approximate at best.
VPS works with a camera and a map. For airports, hospitals, convention centers, corporate campuses, retail environments, and university facilities, that means indoor navigation and spatial experiences can be deployed without installing a single piece of hardware in the environment — just scan the space and publish the map.
On a phone, that's already valuable. On glasses, it becomes seamless. A traveler navigating a connecting flight at Heathrow, a patient finding their way to the right department at a hospital campus, a new employee on their first day navigating a 200,000-square-foot headquarters — none of these people should need to hold a phone in front of their face to get directions. The positioning layer should just work, silently, while they walk.
Every use case above requires the same thing: the device has to know exactly where it is, in real time, without the user doing anything. That's what VPS provides. The glasses are the delivery mechanism. The map is the spatial database. The positioning engine is what makes it useful.
The most-liked comment on Bilawal's video called for laws against private-space scanning. Several commenters invoked Orwell. Others predicted the technology would inevitably be captured for surveillance.
These concerns are legitimate. Any technology that processes camera data in real time should be held to a high standard, and anyone building in this space has an obligation to explain exactly what their system does and doesn't do with visual data. Here's how we think about it at MultiSet.
Most of the privacy reaction we've seen conflates three separate things that are actually distinct data flows with different owners, different retention policies, and different risk profiles:
The 3D map is created once, during a controlled scanning session, by the space owner or their authorized contractor. It's a spatial feature database — not a surveillance feed. The owner of the space decides when to scan, what to scan, who has access to the map, and where it's stored. In MultiSet's architecture, this is Client Content. It belongs to the client.
The live camera frames used for localization are what the device captures in real time and sends to the VPS engine to compute a pose. These frames are transient input to a computation. They are not the output, and they are not stored.
The device manufacturer's own data pipeline is whatever the hardware OEM — in this case, Meta — can independently access through their own on-device systems, their own AI capabilities, and their own data practices. This is the device layer. It's governed by the manufacturer's privacy policy, not the VPS provider's.
MultiSet operates in the second layer. Here's what that means in practice.
A camera frame arrives at MultiSet's API endpoint over a TLS-encrypted connection. Our engine extracts visual features, matches them against the pre-built map, computes the device's 6-DoF pose, and returns it. The frame is discarded immediately after processing. It is never written to disk, never logged, never cached, and never shared with any third party.
The only data MultiSet retains is non-content operational telemetry: usage counts, API latency, error rates, SDK versions. No images. No camera frames. No Client Content.
This isn't a policy aspiration. It's how the system is built.
For organizations where even transient cloud processing of camera frames exceeds their risk tolerance — and there are many such organizations in defense, pharmaceutical, financial services, and critical infrastructure — MultiSet offers deployment flexibility that puts the entire system inside the client's security perimeter. You can explore the full range of deployment options and pricing on our website.
Private cloud runs MultiSet's VPS services inside the client's own VPC. The client controls compute, storage, and networking. Client Content — maps, scans, models, point clouds, and all derived outputs — never leaves their environment.
On-premise deploys the same containerized services on the client's own physical servers, behind their own firewall, on their own network. Air-gapped deployments for sensitive facilities are supported.
On-device runs localization entirely on the endpoint device. No camera frames leave the device at all. No network call is required. This is the strongest possible privacy architecture — and it's available today for mobile and headset platforms, with the wearable path running cloud-based localization in the current release.
In all private deployments, telemetry routing is configurable for data residency compliance. The telemetry itself is limited to non-content operational metrics — usage counts, latency, errors — and explicitly excludes Client Content. These aren't future commitments. They're shipping today under 12-month enterprise licenses.
MultiSet processes camera frames solely for localization and does not share them with any third party. For how device-level data is handled by the glasses themselves — including camera access, on-device AI processing, and data retention — users and enterprises should review the device manufacturer's own data practices and privacy policies.
Our responsibility is the VPS layer. Within that layer, the commitment is unambiguous: your frames are processed and discarded, your maps belong to you, and your deployment architecture is your choice.
It's tempting to look at Bilawal's video as a single impressive demo. But what's actually happening is the convergence of four things that didn't exist together before 2026.
Meta opened the camera. In December 2025, Meta released the Wearables Device Access Toolkit in developer preview, giving third-party developers access to the Ray-Ban camera feed for the first time. Before that, no external developer could programmatically access camera frames from consumer smart glasses in this form factor. The gate opened, and the clock started.
MultiSet shipped within 90 days. Our native iOS SDK v1.11.0 launched on February 2, 2026, with full Ray-Ban Meta support. The wearable VPS sample code (the repo Bilawal uses in his video) shipped with v1.11.1 on March 30. That timeline wasn't accidental — our SDK architecture was already cross-platform, and the wearable integration was an extension of an existing pipeline, not a ground-up rebuild. The AREA's 2025 enterprise VPS report had already ranked MultiSet as the most robust VPS platform in its evaluation — validating the foundation we built on.
Meta validated the integration publicly. On April 1, Meta published a developer blog post titled "Build Apps that See, Hear, and Respond" showcasing the first wave of Wearables Device Access Toolkit integrations. MultiSet AI was featured alongside applications in accessibility, education, and consumer experiences — described as using the glasses' low-latency visual processing to stream frames to our VPS for real-time location tracking.
The sample code is open. The wearable VPS sample repository is public on GitHub at MultiSet-AI/wearable-vps-samples. Developers can clone it, authenticate against their MultiSet account, point it at a mapped space, and have wearable VPS running on their own Ray-Ban Meta glasses. The barrier to entry is a pair of glasses and a MultiSet developer account.
Building VPS for glasses isn't the same engineering problem as building VPS for phones. Three constraints force architectural decisions that don't apply to handheld devices.
Compute budgets collapse. A flagship phone runs on a 5–7 watt power budget during burst processing. Smart glasses run on roughly 1 watt. The Snapdragon AR1 Gen 1 inside Ray-Ban Meta is purpose-built for always-on perception at minimal thermal output — but it doesn't have the headroom for heavy on-device vision processing. That's why the current architecture routes frames to cloud-based localization, and why the efficiency of the VPS engine (MultiSet's median localization time is approximately 52 milliseconds on Snapdragon 8 Gen 3, 38 milliseconds on Apple Silicon) matters more on wearables than on phones.
The interaction model is invisible. When someone uses VPS on a phone, they're holding a device, pointing a camera, waiting for a result. The experience is visible and intentional. On glasses, localization has to happen without the user doing anything — continuously, silently, while they walk, talk, and use their hands for actual work. This means the system has to be reliable enough to run without user intervention and graceful enough to degrade without disruption when conditions aren't ideal.
Form factor determines adoption. Enterprise AR headsets have demonstrated the value of hands-free spatial computing for years. But adoption has been limited by cost ($3,500 for HoloLens 2, $3,299 for Magic Leap 2), weight, battery life, and the social awkwardness of wearing a headset in a workplace. Consumer smart glasses at $299 — glasses that look like glasses — remove most of those barriers. The positioning layer doesn't change. The willingness to wear the device all day does.
The current architecture — camera-in via glasses, cloud localization via MultiSet API, guidance out via audio and companion phone — is the first generation. The trajectory is clear.
On-device localization for wearables is the natural next step. MultiSet already supports on-device localization for mobile and headset platforms. As wearable chipsets gain compute headroom (Qualcomm's Snapdragon AR1+ Gen 1, announced June 2025, is 28% smaller and 7% lower power with a stronger NPU), the same capability will extend to glasses. When it does, the camera frame never leaves the device — the strongest possible privacy architecture applied to the most privacy-sensitive form factor.
In-lens display integration will happen when Meta opens the Ray-Ban Display HUD to third-party developers. Today, MultiSet delivers spatial guidance as audio and phone-based visual overlays. Tomorrow, the same pose data drives world-locked visual content directly in the wearer's field of view. The positioning pipeline doesn't change — only the output modality.
Cross-platform reach is already here. The same MultiSet SDK that runs on Ray-Ban Meta today also supports Meta Quest 2/3, Apple Vision Pro, Magic Leap 2, HoloLens 2, Pico Neo 3/4, Vive Focus 3 and XR Elite, XReal Air 2 Ultra, Unity, native iOS and Android, WebXR, and ROS 2. Developers build against one API and deploy across the full wearable spectrum — from $299 consumer glasses to $3,500 enterprise headsets to autonomous robots. That breadth isn't a feature list. It's an architectural commitment to cross-platform spatial infrastructure that doesn't lock enterprises into a single device vendor's roadmap.
The tools are live. The sample code is open.
For enterprise deployment conversations — including private cloud, on-premise, and custom integrations — reach us at contact@multiset.ai.
The hardware is ready. The SDK is ready. What you build with it is up to you.