We use a regular iPhone, its built-in speaker and microphone, and a motorized turntable to create a 3D model of an object in under 30 seconds.
Read the paper (PDF)

Figure 1. System overview. The phone emits near-ultrasonic FMCW chirps toward an object on a motorized turntable. Echoes are captured by the built-in microphone, processed into range frames, and backprojected into a 3D voxel volume entirely on-device.
A stock iPhone reconstructed this 3D point cloud from sound alone, without any cameras or depth sensors.

Target object · Acoustic reconstruction · 8,861 points · 133 frames
Each acoustic frame is formed, phase-aligned for coherence across the full aperture, then backprojected into a 3D voxel grid.
The phone emits near-ultrasonic FMCW chirps and captures echoes with its built-in microphone. Dechirping mixes the received signal with the transmitted reference, and an FFT extracts per-frame range profiles at approximately 13 Hz.
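To make the dechirp step concrete, here is a minimal NumPy sketch. The sweep band (17-21 kHz), chirp length (75 ms), and sample rate are illustrative assumptions chosen to be consistent with the ~13 Hz frame rate and the ~4.3 cm range resolution quoted below, not the paper's exact parameters.

```python
import numpy as np

# Assumed parameters: a 17-21 kHz sweep gives B = 4 kHz, and a 75 ms
# chirp gives a ~13.3 Hz frame rate.
fs = 48_000                    # phone audio sample rate (Hz)
f0, B, T = 17_000.0, 4_000.0, 0.075
c = 343.0                      # speed of sound in air (m/s)

t = np.arange(int(fs * T)) / fs
k = B / T                      # chirp rate (Hz/s)
# Complex reference chirp; the speaker plays its real part.
ref = np.exp(2j * np.pi * (f0 * t + 0.5 * k * t ** 2))

def range_profile(rx_frame: np.ndarray) -> np.ndarray:
    """Dechirp one frame: mix with the reference, window, FFT.

    A reflector at range R appears at beat frequency f_b = 2*k*R/c,
    i.e. at FFT bin R / (c / (2*B)).
    """
    beat = rx_frame * np.conj(ref)            # dechirp (mix down)
    spec = np.fft.fft(beat * np.hanning(beat.size))
    return spec                               # complex range profile
```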
Frames are phase-aligned across hundreds of captures despite clock drift and platform motion. A reference signal tracks and corrects phase offsets between consecutive frames, ensuring constructive interference during volumetric reconstruction.
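A minimal sketch of the frame-to-frame correction, assuming the reference is a stable reflector whose FFT bin is known in advance; the paper's actual tracking method may differ.

```python
import numpy as np

def phase_align(frames: np.ndarray, ref_bin: int) -> np.ndarray:
    """Align complex range profiles frame-to-frame.

    frames:  (n_frames, n_bins) complex range profiles.
    ref_bin: FFT bin of a stable reference reflector (assumed known).
    """
    aligned = frames.copy()
    for i in range(1, len(aligned)):
        # Residual phase drift between this frame and the previous
        # (already corrected) frame, measured at the reference bin.
        dphi = np.angle(aligned[i, ref_bin] * np.conj(aligned[i - 1, ref_bin]))
        aligned[i] *= np.exp(-1j * dphi)   # remove the drift from the whole frame
    return aligned
```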
Coherent frames are backprojected into a dense voxel volume. For each voxel, the algorithm sums phase-aligned contributions from all aperture positions. GPU-accelerated computation runs entirely on the iPhone, completing in under 30 seconds.
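The voxel sum can be sketched in a few lines of NumPy. This is an illustrative CPU version of time-domain backprojection, not the paper's GPU implementation; `bin_m` (meters per range bin) and `f_c` (the carrier frequency used for phase compensation) are assumed inputs.

```python
import numpy as np

def backproject(frames, aperture_pos, voxels, bin_m, f_c, c=343.0):
    """Sum phase-compensated echoes from every aperture position into voxels.

    frames:       (n_frames, n_bins) complex, phase-aligned range profiles.
    aperture_pos: (n_frames, 3) virtual sensor positions in meters.
    voxels:       (n_vox, 3) voxel centers in meters.
    bin_m:        range extent of one FFT bin in meters, c / (2 * B).
    f_c:          carrier frequency for phase compensation (Hz).
    """
    image = np.zeros(len(voxels), dtype=np.complex128)
    for frame, pos in zip(frames, aperture_pos):
        # One-way range from this aperture position to each voxel.
        r = np.linalg.norm(voxels - pos, axis=1)
        # Nearest-bin lookup (a real implementation would interpolate).
        bins = np.clip(np.round(r / bin_m).astype(int), 0, frame.size - 1)
        # Undo the round-trip carrier phase so contributions add coherently.
        image += frame[bins] * np.exp(1j * 4 * np.pi * f_c * r / c)
    return np.abs(image)
```

The per-frame work is independent across voxels, which is what makes the on-device GPU port straightforward.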
In radar, Inverse Synthetic Aperture Radar (ISAR) exploits target rotation to synthesize a large virtual aperture from a stationary sensor. We apply the same principle acoustically: the phone stays fixed while the turntable rotates the object.
In the object's reference frame, this is equivalent to the phone orbiting around a stationary target. A full turntable revolution yields a complete 360-degree synthetic aperture from a single transceiver position, enabling coherent 3D volumetric imaging without sensor arrays or mechanical translation stages.
A full 360° synthetic aperture from a single transceiver position.
Figure 2. The ISAR equivalence. The phone (right) is stationary. The object rotates on a turntable (center). Green dots show the virtual aperture positions, equivalent to the phone orbiting the object.
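The equivalence is easy to state in coordinates. The hypothetical helper below generates the virtual aperture in the object frame: rotating the object by θ about the vertical axis while the phone stays fixed places the phone at angle -θ in object coordinates. The 0.5 m standoff is an assumption; 133 frames matches the reconstruction above.

```python
import numpy as np

def virtual_aperture(n_frames: int, standoff_m: float) -> np.ndarray:
    """Virtual sensor positions in the object's reference frame."""
    theta = np.linspace(0.0, 2 * np.pi, n_frames, endpoint=False)
    # Object rotates by +theta about z; in object coordinates the fixed
    # phone appears at angle -theta on a circle of radius standoff_m.
    return np.stack([standoff_m * np.cos(-theta),
                     standoff_m * np.sin(-theta),
                     np.zeros(n_frames)], axis=1)

aperture_pos = virtual_aperture(133, 0.5)  # can feed the backprojection sketch above
```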
Prior systems achieve at most two of three desirable properties. This work is the first to combine all three.
| System | Phone-Native Acoustics | 3D Volumetric Output | Arbitrary Objects |
|---|---|---|---|
| FingerIO '16 | ✓ | — | — |
| LLAP '16 | ✓ | — | — |
| AIM '22 | ✓ | — | ✓ |
| SONDAR '24 | ✓ | — | ✓ |
| AirSAS '23 | — | ✓ | ✓ |
| NeuralSAS '24 | — | ✓ | ✓ |
| Ours | ✓ | ✓ | ✓ |
Phone-native = unmodified smartphone speaker + mic only, no external transducers.
Current constraints and the boundaries of this approach.
Acoustic wavelengths at 20 kHz are ~17 mm, orders of magnitude coarser than optical wavelengths. This produces volumetric shape information, not surface-level geometric detail.
Range resolution is constrained to ~4.3 cm by the available acoustic bandwidth of the phone's speaker. Lateral resolution (~8.6 mm) is set by the half-wavelength limit at 20 kHz.
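The quoted figures follow from standard sonar relations, assuming c ≈ 343 m/s in air; the 4.3 cm number implies a usable sweep bandwidth of roughly 4 kHz:

```latex
% Wavelength at the 20 kHz carrier:
\lambda = \frac{c}{f} = \frac{343\ \mathrm{m/s}}{20\,000\ \mathrm{Hz}} \approx 17\ \mathrm{mm}

% Range resolution from sweep bandwidth B (the quoted 4.3 cm implies B ≈ 4 kHz):
\delta_r = \frac{c}{2B} = \frac{343}{2 \times 4000} \approx 4.3\ \mathrm{cm}

% Lateral resolution, diffraction-limited to half a wavelength:
\delta_x \approx \frac{\lambda}{2} \approx 8.6\ \mathrm{mm}
```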
Current results are obtained in a quiet indoor environment with controlled multipath. In-the-wild operation with ambient noise and complex reflections remains future work.
The single-elevation circular aperture leaves a cone of unsampled spatial frequencies, causing anisotropic resolution: elongation along the vertical axis.