Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera

Abstract

We present Unified Spherical Frontend (USF), a distortion-free, lens-agnostic framework that transforms images from any calibrated camera onto the unit sphere via ray-direction correspondences and performs spherical resampling, convolution, and pooling canonically in the spatial domain. USF is modular: projection, location sampling, value interpolation, and resolution control are fully decoupled. Its configurable distance-only convolution kernels provide rotation equivariance while avoiding harmonic transforms entirely. USF scales efficiently to high-resolution spherical imagery, maintains less than 1% performance drop under random test-time rotations without training-time rotational augmentation, and enables zero-shot generalization to unseen (wide-FoV) lenses with minimal degradation.

Motivation

Most vision pipelines apply planar CNNs that assume a pinhole camera and operate on regular 2D grids. Each camera type (pinhole, fisheye, panoramic) has a unique projection geometry, forcing practitioners to train separate, lens-specific models that inevitably overfit to the distortion patterns they see. The distortion itself is mathematically unavoidable: by Gauss’s Theorema Egregium, no flat projection preserves the intrinsic curvature of the sphere, so any 2D representation necessarily distorts geometry.

Diagram showing how pinhole, fisheye, and 360° cameras each require separate planar models with lens-specific weights
Lens-Specific Planar Model
Diagram contrasting rotation equivariance, where rotated inputs produce equivalently rotated outputs, with rotation invariance, where outputs remain identical regardless of input rotation
Equivariance vs Invariance

A model is rotation-equivariant if rotating its input produces an equivalently rotated output. This is a desirable property for any system where camera orientation may vary at test time. USF achieves both lens-agnostic processing and rotation-equivariance while supporting plug-and-play replacement of planar layers in any existing backbone, providing a generic framework for modern vision.
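Why distance-only kernels make equivariance attainable comes down to a simple fact: geodesic distances between points on the sphere are preserved by any rotation. A minimal NumPy sketch of that fact (illustrative only, not the USF implementation):

```python
import numpy as np

def geodesic(u, v):
    # great-circle distance between two unit vectors
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

# build a random proper rotation via QR decomposition
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1  # flip one axis so det(Q) = +1

# two points on the unit sphere
u = np.array([0.0, 0.0, 1.0])
v = rng.normal(size=3)
v /= np.linalg.norm(v)

# rotating both points leaves their geodesic distance unchanged
assert np.isclose(geodesic(u, v), geodesic(Q @ u, Q @ v))
```

A kernel whose weights consume nothing but these distances therefore responds identically to a rotated input, up to the same rotation of its output.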

Methodology


Unified Spherical Frontend pipeline: planar images are projected onto the unit sphere, resampled to near-uniform distribution, then processed by spherical convolution and pooling layers

Unified Spherical Frontend. (i) A planar image and its lens normal map can be combined to form a (ii) spherical image. Cameras with different lenses produce spatially varying densities and distributions of pixels when projected onto the sphere. Thus, it is crucial to perform (iii) resampling before (iv) feeding into the backbone composed of spherical convolution and pooling layers. Optionally, the results can be (v) resampled back into the raw projected spherical image pixel locations, and (vi) unprojected back to the planar image for downstream integration.
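Steps (i)–(ii) can be sketched for the simplest lens: a pinhole camera maps each pixel to a unit ray direction through its inverse intrinsics, giving exactly the per-pixel ray map that gets placed on the sphere. A NumPy sketch (function name and interface are ours, not the USF API; a real fisheye or panoramic lens would use its own calibrated projection model):

```python
import numpy as np

def pinhole_rays(W, H, f):
    """Map each pixel of a W x H pinhole image to a unit ray on the sphere."""
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0            # principal point at the center
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates, (H, W)
    d = np.stack([(u - cx) / f,                      # back-project through K^-1
                  (v - cy) / f,
                  np.ones_like(u, dtype=float)], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)  # (H, W, 3) unit vectors

# f = W/2 gives a 90-degree horizontal FoV, matching the pinhole setup above
rays = pinhole_rays(280, 280, f=140.0)
```

Once every pixel carries a unit ray like this, lenses differ only in where their rays land on the sphere, which is why a single resampling step can unify them.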

Results

We replace standard planar layers with spherical counterparts while keeping all other aspects of the models and training protocol identical. All evaluation is performed in the planar domain for consistency. NR and RR denote non-rotated and randomly rotated settings. Prior spectral-domain spherical CNNs are compared only in the low-resolution MNIST experiment due to their prohibitive cost at higher resolutions.

Rotation Equivariance

Across all tasks, distance-only spherical kernels maintain strong performance under arbitrary rotations without augmentation, while planar models degrade sharply. The distance × direction kernel captures orientation-sensitive cues but sacrifices equivariance. Expressivity depends heavily on kernel parameterization: a low-level MLP underperforms a simple piecewise-constant design, highlighting the importance of appropriate embeddings.
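The piecewise-constant, distance-only parameterization can be illustrated with a toy aggregation: each neighbor's weight is looked up from a learnable table indexed only by its geodesic distance to the kernel center, so rotating the whole neighborhood cannot change the response. A hedged NumPy sketch (names and binning scheme are ours, not the USF implementation):

```python
import numpy as np

def distance_only_kernel(center, neighbors, values, bin_weights, support):
    """Aggregate neighbor values with weights that depend only on geodesic
    distance to the kernel center (piecewise-constant bins)."""
    d = np.arccos(np.clip(neighbors @ center, -1.0, 1.0))        # geodesic distances
    idx = np.minimum((d / support * len(bin_weights)).astype(int),
                     len(bin_weights) - 1)                       # distance-bin index
    return float(np.sum(bin_weights[idx] * values))

rng = np.random.default_rng(1)
nbrs = rng.normal(size=(8, 3))
nbrs /= np.linalg.norm(nbrs, axis=1, keepdims=True)              # 8 sphere points
vals = rng.normal(size=8)
w = np.array([1.0, 0.5, 0.25, 0.1])                              # learnable in practice
c = np.array([0.0, 0.0, 1.0])
out = distance_only_kernel(c, nbrs, vals, w, support=np.pi)

# rotating center and neighbors together leaves the response unchanged
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1
assert np.isclose(out, distance_only_kernel(Q @ c, nbrs @ Q.T, vals, w, support=np.pi))
```

A distance × direction kernel would additionally index on the bearing of each neighbor, which is what buys expressivity at the cost of this rotation check.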

Zero-Shot Lens Generalization

We evaluate cross-lens adaptability by training on one lens type and testing on all three. Pinhole (90° FoV, 280×280), fisheye (180° FoV, 560×560), and panoramic lenses are simulated from equirectangular images with matched pixel-to-FoV ratios.
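The matched pixel-to-FoV ratio can be checked with a line of arithmetic (values taken from the setup above):

```python
# pixel-to-FoV ratio for the two simulated lenses with stated resolutions
lenses = {"pinhole": (280, 90), "fisheye": (560, 180)}
ratios = {name: px / fov for name, (px, fov) in lenses.items()}
# both lenses sample at the same ~3.11 pixels per degree of FoV
assert ratios["pinhole"] == ratios["fisheye"]
```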

| Model | Train | Pinhole mIoU ↑ | Pinhole mAcc ↑ | Fisheye mIoU ↑ | Fisheye mAcc ↑ | Panoramic mIoU ↑ | Panoramic mAcc ↑ |
|---|---|---|---|---|---|---|---|
| Planar DeepLab v3 | Pinhole | 42.53% | 55.75% | 33.53% | 47.36% | 36.07% | 53.61% |
| Planar DeepLab v3 | Fisheye | 39.88% | 53.46% | 40.05% | 55.86% | 33.11% | 56.53% |
| Planar DeepLab v3 | Panoramic | 29.66% | 43.85% | 24.91% | 37.54% | 35.01% | 58.30% |
| Spherical DeepLab v3 | Pinhole | 34.76% | 47.47% | 22.36% | 35.52% | 19.70% | 35.09% |
| Spherical DeepLab v3 | Fisheye | 19.44% | 31.52% | 30.44% | 44.21% | 28.16% | 43.99% |
| Spherical DeepLab v3 | Panoramic | 12.57% | 23.05% | 28.35% | 41.58% | 28.78% | 45.27% |

Zero-Shot Lens Generalization (Full-Dataset). Trained on one lens, tested on all three. Random rotation disabled. Spherical models show more consistent cross-lens transfer, especially between lenses with similar FoV coverage.

Planar models show clear performance drops when evaluated on a different lens from training. Spherical models transfer more consistently across lenses, especially when source and target share similar FoV coverage. Degradation is more noticeable between views with drastically different FoV (e.g., pinhole to panoramic) due to mismatched spatial coverage.

Computation Benchmark

All benchmarks use a 960×480 panorama input with batch size 8, RGB channels, averaged over 10 runs on an NVIDIA H200 GPU (PyTorch 2.8.0, CUDA 12.8, float32).

Geometry caching is the key enabler: once neighborhood structures and interpolation weights are precomputed, sustained runtime drops by orders of magnitude. Without caching, geometric preprocessing would dominate and make spherical pipelines infeasible at scale. With caching, spherical networks train in roughly the same wall-clock time as their planar counterparts, which we consider an acceptable cost for the added geometric consistency, rotation equivariance, and lens-agnostic processing.
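The caching idea can be sketched in a few lines: every geometry-dependent quantity (neighbor indices, interpolation weights) depends only on the sampling pattern, so it is computed once at construction, and each forward pass reduces to a gather plus a weighted sum. A toy NumPy sketch (class and method names are hypothetical, not the USF API; a real implementation would use a spatial index rather than brute-force neighbor search):

```python
import numpy as np

class CachedSphericalConv:
    """Toy illustration of geometry caching for spherical aggregation."""

    def __init__(self, points, k=9):
        # brute-force k-nearest neighbors on the sphere via cosine similarity
        sim = points @ points.T
        self.idx = np.argsort(-sim, axis=1)[:, :k]               # cached neighborhoods
        d = np.arccos(np.clip(np.take_along_axis(sim, self.idx, 1), -1.0, 1.0))
        w = np.exp(-d)                                           # cached interp. weights
        self.w = w / w.sum(axis=1, keepdims=True)

    def forward(self, feats):
        # per-call work is just a gather and a weighted sum over neighbors
        return (feats[self.idx] * self.w[..., None]).sum(axis=1)

# usage: 1,000 sphere points with 16-channel features
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
conv = CachedSphericalConv(pts, k=9)
out = conv.forward(rng.normal(size=(1000, 16)))                  # (1000, 16)
```

Everything inside `__init__` amortizes across the entire training run, which is why sustained throughput approaches that of a planar convolution.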

Citation