Augmented reality glasses generate thousands of frames per hour. A 10 FPS egocentric camera produces 36,000 frames in sixty minutes, roughly 288 MB/hr of raw JPEG data. The device has two options: stream everything to the cloud, burning battery on frames that add no new information, or sample blindly and silently drop irreplaceable context.

Both outcomes are bad. Uniform sampling wastes resources on duplicate scenes; it allocates the same frame rate during a 40-second stationary period as during a rapid room transition. Pixel-differencing fires on noise such as camera shake and misses rooms the wearer revisits, because it only compares consecutive frames. The root cause is the same in both cases: the device has no way to compare a new frame against the recent context it already remembers.

Frame selection is a memory problem. A device that accumulates experience (a searchable index of everything it has seen) can answer the question “is this frame novel?” for every incoming frame, in real time.

Temporal vector memory

Each frame is embedded with CLIP ViT-B/32 into a 512-dimensional vector. The embedding is compared against all vectors currently stored in an on-device temporal vector memory. If the cosine similarity to the nearest stored embedding is below a threshold (0.92), the frame is semantically novel: the device has not seen anything like it. Keep it. If the nearest similarity is at or above the threshold, the frame is redundant. Skip it.

This is novelty-based frame selection. It operates in embedding space, not pixel space: two frames of the same kitchen from different angles match semantically, even though their pixels differ entirely. The memory accumulates over time, so each new frame is compared against the full history of what the device has seen, not just the previous frame.
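The decision rule is small enough to sketch. Below is a minimal, illustrative Rust version: the production engine uses an indexed store rather than a linear scan, and `observe` and the flat `Vec` memory are names assumed here for clarity, not the engine's API.

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Keep a frame only if nothing already in memory is at least `tau`
/// similar. Returns true if the frame was kept (and remembered).
fn observe(memory: &mut Vec<Vec<f32>>, frame: Vec<f32>, tau: f32) -> bool {
    let nearest = memory
        .iter()
        .map(|m| cosine(m, &frame))
        .fold(f32::NEG_INFINITY, f32::max);
    if nearest < tau {
        memory.push(frame); // semantically novel: remember it
        true
    } else {
        false // redundant: skip
    }
}

fn main() {
    let mut memory: Vec<Vec<f32>> = Vec::new();
    assert!(observe(&mut memory, vec![1.0, 0.0, 0.0], 0.92)); // first frame: kept
    assert!(!observe(&mut memory, vec![0.99, 0.1, 0.0], 0.92)); // near-duplicate: skipped
    assert!(observe(&mut memory, vec![0.0, 1.0, 0.0], 0.92)); // new scene: kept
    println!("kept {} of 3 frames", memory.len());
}
```

Note that the comparison runs against everything in `memory`, not only the previous frame, which is what lets a revisited room register as a duplicate.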

The system was built in Rust with a core designed for resource-constrained hardware. Vectors are stored alongside per-record metadata (timestamps, sensor signals) in a structure that supports filtered search, enabling temporal windowing as a metadata query, not a separate data structure. The engine supports edge-to-cloud sync, quantized storage, and encryption at rest.

Results

Data source: 2,668 frames from the Meta Aria Everyday Activities dataset, 4 minutes 26 seconds of egocentric video at 10 FPS, downsampled to 224x224.

The temporal vector memory at τ=0.92 kept 130 of 2,668 frames. Uniform sampling and pixel-differencing were calibrated to the same frame budget, so the comparison is strictly about which frames were captured, not how many.

Same video, same frame budget, three selection methods. Top: uniform. Middle: pixel-diff. Bottom: temporal vector memory.


The comparison measures three things: how many visually distinct scenes each method captures (greedy clustering in CLIP embedding space at cosine < 0.85), the worst-case similarity gap between any skipped frame and its nearest kept frame (Worst Coverage), and what fraction of kept frames are near-duplicates of each other (cosine >= 0.95).

| Method | Frames Kept | Unique Scenes | Worst Coverage | Redundant |
|---|---|---|---|---|
| Uniform (1/20) | 130 | 4 / 14 | 0.82 | 61% |
| Pixel diff (MSE) | 131 | 9 / 14 | 0.85 | 71% |
| Temporal vector memory (τ=0.92) | 130 | 14 / 14 | 0.92 | 0% |
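For concreteness, the two frame-level metrics can be sketched as follows. This is an illustrative reimplementation, not the benchmark harness, and the function names are assumed:

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Worst coverage: for every skipped frame, find its most similar kept
/// frame, then report the minimum of those similarities. Higher is better.
fn worst_coverage(skipped: &[Vec<f32>], kept: &[Vec<f32>]) -> f32 {
    skipped
        .iter()
        .map(|s| {
            kept.iter()
                .map(|k| cosine(s, k))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .fold(f32::INFINITY, f32::min)
}

/// Redundancy: fraction of kept frames that have another kept frame at
/// cosine >= 0.95, i.e. near-duplicates inside the kept set itself.
fn redundancy(kept: &[Vec<f32>]) -> f32 {
    let dup = kept
        .iter()
        .enumerate()
        .filter(|&(i, k)| {
            kept.iter()
                .enumerate()
                .any(|(j, o)| i != j && cosine(k, o) >= 0.95)
        })
        .count();
    dup as f32 / kept.len() as f32
}

fn main() {
    let kept = vec![vec![1.0_f32, 0.0], vec![0.0, 1.0]];
    let skipped = vec![vec![0.6_f32, 0.8]];
    println!("worst coverage {:.2}", worst_coverage(&skipped, &kept));
    println!("redundancy {:.2}", redundancy(&kept));
}
```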

The 14 scenes are not predefined; they are discovered from the data. All 2,668 frames are embedded with CLIP, and greedy clustering at cosine < 0.85 produces 14 natural groups. That is the denominator for all three methods: how many of those 14 clusters have at least one frame in each method’s kept set? A keyframe is any frame novel enough to keep at the finer 0.92 threshold. Multiple keyframes fall inside a single scene because the wearer shifts angles, interacts with objects, or moves within the room.
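A minimal sketch of that greedy clustering step, assuming a single linear pass over the frame embeddings (`greedy_clusters` is an illustrative name, not the benchmark code):

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Greedy scene discovery: a frame joins the first cluster whose
/// representative is at least `tau` similar; otherwise it founds a new
/// cluster. Returns the cluster representatives.
fn greedy_clusters(frames: &[Vec<f32>], tau: f32) -> Vec<Vec<f32>> {
    let mut reps: Vec<Vec<f32>> = Vec::new();
    for f in frames {
        if !reps.iter().any(|r| cosine(r, f) >= tau) {
            reps.push(f.clone());
        }
    }
    reps
}

fn main() {
    // Two near-identical kitchen views plus one hallway view, tau = 0.85.
    let frames = vec![
        vec![1.0_f32, 0.0],
        vec![0.98, 0.2], // same scene, shifted viewpoint
        vec![0.0, 1.0],  // different room
    ];
    println!("{} scenes", greedy_clusters(&frames, 0.85).len());
}
```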

14 scenes discovered by cosine-similarity clustering over the 130 keyframes. Each row is one scene. Some kitchen frames appear in non-kitchen clusters because the viewpoint was similar to another room at that threshold. Tightening the clustering threshold splits these into separate, more precise clusters.


The 14-scene count is threshold-dependent. At 0.80 cosine similarity, everything merges into 3 clusters. At 0.90, it fractures into 41. The 0.85 threshold sits at the transition where viewpoint changes register without over-segmenting.

Uniform sampling captured 4 of 14 unique scenes. 61% of its kept frames were near-duplicates of other kept frames. It spent most of its budget during stationary periods.

Pixel-differencing captured 9 of 14 scenes but with 71% redundancy. It fires on camera shake (head movements produce large pixel deltas that are semantically meaningless) and misses rooms the wearer revisited.

The temporal vector memory captured all 14 scenes with zero redundant frames. Every skipped frame has a nearest kept frame with cosine similarity >= 0.92; by that measure, nothing was lost. The selection rate automatically tracks scene information density: dense clusters of keyframes during room transitions, near-zero during stationary periods.

The per-frame overhead of the memory engine, one insert plus one nearest-neighbor search, is 8 microseconds on Apple Silicon. That is less than 0.01% of the 100 ms frame budget at 10 FPS. The memory engine is not the bottleneck.

The working set for 130 keyframes is 72 KB with i8 quantization (267 KB for f32). Projected over continuous operation: 1.0 MB/hr (i8) or 3.5 MB/hr (f32). A 512 MB device could store roughly 512 hours of compressed timeline in i8, about 21 days.
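The projection is plain arithmetic over the numbers above. The sketch below counts vector payload only, so it lands slightly under the reported figures, which also include per-record metadata:

```rust
fn main() {
    let dims = 512.0_f64;    // CLIP ViT-B/32 embedding size
    let keyframes = 130.0;   // kept at tau = 0.92
    let video_secs = 266.8;  // 2,668 frames at 10 FPS
    let per_hr = keyframes / video_secs * 3600.0; // ~1,754 keyframes/hr

    let mb_i8 = per_hr * dims * 1.0 / 1e6;  // 1 byte/dim  -> ~0.9 MB/hr
    let mb_f32 = per_hr * dims * 4.0 / 1e6; // 4 bytes/dim -> ~3.6 MB/hr
    let hours = 512.0 / 1.0; // 512 MB device at the reported ~1.0 MB/hr (i8 + metadata)
    println!(
        "{per_hr:.0} keyframes/hr, {mb_i8:.1} MB/hr (i8), {mb_f32:.1} MB/hr (f32), {hours:.0} h"
    );
}
```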

Temporal windowing

By default, the engine compares each frame against the full history. A 60-second temporal window, implemented as a metadata filter on the search query, changes the novelty semantics: a room revisited after 60+ seconds is treated as new context, not a duplicate.
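A sketch of windowing as a metadata filter, using illustrative names (`Record`, `is_novel_windowed`) rather than the engine's actual API:

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// One stored entry: the embedding plus the metadata the filter runs over.
struct Record {
    embedding: Vec<f32>,
    timestamp_ms: u64,
}

/// Novelty against only the last `window_ms` of memory. The window is a
/// predicate on record metadata, not a separate data structure.
fn is_novel_windowed(
    store: &[Record],
    query: &[f32],
    now_ms: u64,
    window_ms: u64,
    tau: f32,
) -> bool {
    store
        .iter()
        .filter(|r| now_ms.saturating_sub(r.timestamp_ms) <= window_ms)
        .map(|r| cosine(&r.embedding, query))
        .fold(f32::NEG_INFINITY, f32::max)
        < tau
}

fn main() {
    let kitchen = vec![1.0_f32, 0.0];
    let store = vec![Record { embedding: kitchen.clone(), timestamp_ms: 0 }];
    // 30 s later the kitchen is still inside the 60 s window: duplicate.
    assert!(!is_novel_windowed(&store, &kitchen, 30_000, 60_000, 0.92));
    // 90 s later the window has expired: treated as new context.
    assert!(is_novel_windowed(&store, &kitchen, 90_000, 60_000, 0.92));
    println!("ok");
}
```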

| Mode | Keyframes | Scenes | Redundant |
|---|---|---|---|
| Full memory | 130 | 14/14 | 0% |
| 60-second window | 141 | 14/14 | 4% |

The 11 extra keyframes in windowed mode are room re-entries, such as the kitchen revisited after the window expired. Both modes capture all 14 scenes.

Threshold sweep

The similarity threshold is the single tuning parameter. Higher threshold means more frames kept; lower means fewer.

| Threshold | Frames Kept | Compression | Unique Scenes |
|---|---|---|---|
| 0.88 | 37 | 72.1x | 16 |
| 0.92 | 130 | 20.5x | 14 |
| 0.96 | 582 | 4.6x | 19 |

The scene count scales with the threshold, suggesting the similarity scores carry semantic signal.

Bandwidth and power

What gets synced to the cloud is vectors and metadata, not frames. The cloud receives a searchable semantic index; it can answer “when did the user last see a kitchen?” but it never receives a photo of the kitchen.

| Approach | Data synced per hour | Reduction vs raw video |
|---|---|---|
| Raw video (10 FPS, 224x224 JPEG) | 288 MB/hr | 1x |
| Temporal vector memory (f32) | 3.5 MB/hr | 82x |
| Temporal vector memory (i8) | 1.0 MB/hr | 288x |

The Aria2 system architecture paper (Lee et al., Dec 2025) models the full power budget of a wearable contextual AI device: ~3 Wh battery, ~200 mW average power ceiling for 15-hour all-day operation. Wireless is one of the largest power consumers, and the paper notes it will become a more acute bottleneck over time as digital logic scales but RF does not.

At 1.0 MB/hr (i8), the average bitrate is approximately 2.2 Kbps. That is well below the threshold where Bluetooth Low Energy replaces WiFi, eliminating the WiFi radio from the always-on power budget entirely. The paper’s baseline compression (10:1 via H.265) still requires multi-megabit wireless. The vector memory approach operates three orders of magnitude below that.
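The 2.2 kbps figure is a direct unit conversion:

```rust
fn main() {
    // 1.0 MB/hr of i8 vectors + metadata, expressed as an average bitrate.
    let bits_per_sec = 1.0e6_f64 * 8.0 / 3600.0;
    println!("{bits_per_sec:.0} bps"); // ~2,222 bps, i.e. ~2.2 kbps
}
```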

Each new keyframe syncs as approximately 600 bytes (i8). Edge and cloud merge to identical state without a central server or conflict resolution protocol.

Privacy by architecture

The cloud receives a searchable semantic index, not photos. CLIP embeddings are not designed to be invertible; the original image cannot be recovered from a 512-dimensional vector in any photographic sense. The cloud can determine that the user was in a kitchen at 10:32 AM, but it cannot see the kitchen, identify who was present, or read text in the scene.

This is privacy by architecture, not by policy. The data that would violate privacy never leaves the device. Frames stay on-device and are fetched on demand, if ever, over a separate deferred channel when the device is on WiFi and charging. The always-on sync path carries only vectors and metadata, encrypted at rest.

What this means

Frame selection for power-constrained cameras is not a sampling problem. It is a memory problem. A device that remembers what it has seen can decide what is worth transmitting. The alternative, sampling strategies that are blind to content, either waste power on redundant data or silently lose context.

The approach generalizes to any camera-equipped hardware under a power or bandwidth constraint: AR glasses, drones, mobile robots, security cameras, body cameras. The embedding model is a deployment decision; the memory engine takes vectors of any dimension and any quantization. The contribution is showing that memory, not sampling strategy, is what makes frame selection work.

All benchmarks in this post were measured against the Meta Aria Everyday Activities dataset (sequence loc1_script4_seq2_rec1, cooking breakfast, room transitions, extended stationary periods), which is publicly available. The methodology is reproducible: deterministic seeds, three-way comparison under equal frame budgets, defined metrics for scene coverage and redundancy. Further evaluation is in progress.