We use a regular iPhone, its built-in speaker and microphone, and a motorized turntable to create a 3D model of an object in under 30 seconds. No extra sensors. No special lab equipment. Just sound and the phone itself.
Read the paper (PDF)

Figure 1. System overview. The phone emits near-ultrasonic FMCW chirps toward an object on a motorized turntable. Echoes are captured by the built-in microphone, processed into range frames, and backprojected into a 3D voxel volume entirely on-device.
A stock iPhone reconstructed this 3D point cloud from sound alone — no cameras, no depth sensors.
Target object vs. acoustic reconstruction (8,861 points from 133 frames).
Each acoustic frame is formed, phase-aligned for coherence across the full aperture, then backprojected into a 3D voxel grid.
The phone emits near-ultrasonic FMCW chirps and captures echoes with its built-in microphone. Dechirping mixes the received signal with the transmitted reference, and an FFT extracts per-frame range profiles at approximately 13 Hz.
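The dechirp-and-FFT step can be sketched in a few lines. The parameters below (sample rate, chirp band, chirp duration, target range) are illustrative assumptions, not the system's actual values; the point is how mixing with the transmitted reference turns echo delay into a beat frequency:

```python
import numpy as np

# Illustrative parameters (assumed, not the paper's exact chirp design).
fs = 48_000                  # phone audio sample rate (Hz)
f0, B = 17_000.0, 4_000.0    # chirp start frequency and bandwidth (Hz)
T = 0.075                    # chirp duration (s), ~13 frames per second
c = 343.0                    # speed of sound in air (m/s)

t = np.arange(int(fs * T)) / fs
tx = np.cos(2 * np.pi * (f0 * t + 0.5 * (B / T) * t**2))  # linear FMCW chirp

# Simulate an echo from a target 0.5 m away (1.0 m round trip).
tau = 2 * 0.5 / c
rx = np.roll(tx, int(round(tau * fs)))

# Dechirp: mix the received signal with the transmitted reference.
# A target at delay tau becomes a beat tone at f_beat = (B / T) * tau.
beat = rx * tx
spectrum = np.abs(np.fft.rfft(beat * np.hanning(len(beat))))
freqs = np.fft.rfftfreq(len(beat), 1 / fs)

# Keep only low beat frequencies (the sum-frequency term lies far above),
# then convert the peak back to range: r = c * f_beat * T / (2 * B).
f_beat = freqs[np.argmax(spectrum[: len(freqs) // 4])]
range_est = c * f_beat * T / (2 * B)
print(f"estimated range: {range_est:.2f} m")
```

One FFT like this per chirp yields the per-frame range profiles that feed the rest of the pipeline.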
Frames are phase-aligned across hundreds of captures despite clock drift and platform motion. A reference signal tracks and corrects phase offsets between consecutive frames, ensuring constructive interference during volumetric reconstruction.
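The idea behind the correction can be shown on synthetic data. Here, a stable return in a known range bin stands in for the reference signal; the per-frame random phase models clock drift. All names and shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 133, 256
ref_bin = 10  # range bin holding the reference return (assumed known)

# Simulated frames: one fixed complex range profile, corrupted by a
# random per-frame phase offset (speaker/microphone clock drift).
profile = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)
drift = np.exp(1j * rng.uniform(-np.pi, np.pi, size=n_frames))
frames = drift[:, None] * profile[None, :]

# Correction: read each frame's phase at the reference bin, then derotate
# every frame so it matches the reference phase of frame 0.
phase = np.angle(frames[:, ref_bin])
aligned = frames * np.exp(-1j * (phase - phase[0]))[:, None]

# After alignment the frames add coherently instead of cancelling.
coherent_gain = np.abs(aligned.sum(0)).mean() / np.abs(frames.sum(0)).mean()
print(f"coherent gain from alignment: {coherent_gain:.1f}x")
```

Without this step, the random phases average the echoes toward zero; with it, every aperture position contributes constructively to the voxel sums.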
Coherent frames are backprojected into a dense voxel volume. For each voxel, the algorithm sums phase-aligned contributions from all aperture positions. GPU-accelerated computation runs entirely on the iPhone, completing in under 30 seconds.
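A toy end-to-end backprojection makes the per-voxel sum concrete. The geometry, carrier frequency, and the simulated single-point scatterer are assumptions for illustration; the real pipeline runs GPU-accelerated over a dense 3D grid, while this sketch uses a 2D slice:

```python
import numpy as np

c, fc = 343.0, 19_000.0           # speed of sound (m/s), carrier (Hz), assumed
k = 2 * np.pi * fc / c            # wavenumber at the carrier
R, n_frames = 0.40, 133           # phone-to-axis distance (m), frames/revolution

# In the object's frame, the fixed phone sweeps a circle of virtual positions.
thetas = np.linspace(0, 2 * np.pi, n_frames, endpoint=False)
sensors = R * np.stack([np.cos(thetas), np.sin(thetas)], axis=1)

# Simulate complex range profiles for one point scatterer at p: a smooth
# envelope at the scatterer's range times the round-trip carrier phase.
p = np.array([0.05, 0.02])
bins = np.linspace(0.2, 0.6, 256)
r_s = np.linalg.norm(sensors - p, axis=1)
profiles = (np.exp(-2j * k * r_s)[:, None]
            * np.exp(-((bins[None, :] - r_s[:, None]) / 0.02) ** 2))

# Backproject: for each voxel, sample every profile at the sensor-voxel
# range, undo the round-trip phase, and sum coherently over the aperture.
xs = np.linspace(-0.1, 0.1, 21)
vox = np.stack(np.meshgrid(xs, xs, indexing="ij"), axis=-1).reshape(-1, 2)
image = np.zeros(len(vox), dtype=complex)
for pos, prof in zip(sensors, profiles):
    r = np.linalg.norm(vox - pos, axis=1)
    sample = np.interp(r, bins, prof.real) + 1j * np.interp(r, bins, prof.imag)
    image += sample * np.exp(2j * k * r)   # phase compensation per voxel
peak = vox[np.argmax(np.abs(image))]
print(f"scatterer recovered at x={peak[0]:.2f} m, y={peak[1]:.2f} m")
```

The phase-compensation term is what makes the contributions interfere constructively only at voxels containing a real reflector, which is why the alignment step above is a prerequisite.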
In radar, Inverse Synthetic Aperture Radar (ISAR) exploits target rotation to synthesize a large virtual aperture from a stationary sensor. We apply the same principle acoustically: the phone stays fixed while the turntable rotates the object.
In the object's reference frame, this is equivalent to the phone orbiting around a stationary target. A full turntable revolution yields a complete 360-degree synthetic aperture from a single transceiver position, enabling coherent 3D volumetric imaging without sensor arrays or mechanical translation stages.
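The equivalence is a two-line identity in coordinates: rotating the object by θ while the sensor stays put gives the same echo delay as keeping the object fixed and counter-rotating the sensor. The positions below are made up for the check:

```python
import numpy as np

def rot(theta):
    """2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

sensor = np.array([0.4, 0.0])    # phone position (m), assumed
point = np.array([0.05, 0.02])   # a point on the object, assumed
theta = np.deg2rad(30)           # turntable angle

# Lab frame: the object point rotates past the fixed sensor.
d_lab = np.linalg.norm(rot(theta) @ point - sensor)
# Object frame: the point stays fixed; the sensor counter-rotates.
d_obj = np.linalg.norm(point - rot(-theta) @ sensor)
print(d_lab, d_obj)  # identical distances, hence identical echo delays
```

Since every echo depends only on this distance, sweeping θ through a full revolution is indistinguishable from physically orbiting the phone around the object.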
A full 360° synthetic aperture from a single transceiver position.
Figure 2. The ISAR equivalence. The phone (right) is stationary. The object rotates on a turntable (center). Green dots show the virtual aperture positions — equivalent to the phone orbiting the object.
Prior systems achieve at most two of three desirable properties. This work is the first to combine all three.
| System | Phone-Native Acoustics | 3D Volumetric Output | Arbitrary Objects |
|---|---|---|---|
| FingerIO '16 | ✓ | — | — |
| LLAP '16 | ✓ | — | — |
| AIM '22 | ✓ | — | ✓ |
| SONDAR '24 | ✓ | — | ✓ |
| AirSAS '23 | — | ✓ | ✓ |
| NeuralSAS '24 | — | ✓ | ✓ |
| Ours | ✓ | ✓ | ✓ |
Phone-native = unmodified smartphone speaker + mic only, no external transducers.
Honest framing of current constraints and the boundaries of this approach.
Acoustic wavelengths at 20 kHz are ~17 mm, orders of magnitude coarser than optical wavelengths. This produces volumetric shape information, not fine surface-level geometric detail.
Range resolution is constrained to ~4.3 cm by the available acoustic bandwidth of the phone's speaker. Lateral resolution (~8.6 mm) is set by the wavelength at 20 kHz.
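These figures follow from the standard sonar resolution formulas: range resolution c/(2B) and, for a full synthetic aperture, lateral resolution λ/2. The bandwidth B ≈ 4 kHz below is an assumption inferred from the stated 4.3 cm figure:

```python
c = 343.0      # speed of sound in air (m/s)
B = 4_000.0    # usable acoustic bandwidth (Hz), assumed from the 4.3 cm figure
f = 20_000.0   # operating frequency (Hz)

range_res = c / (2 * B)        # FMCW range resolution: c / (2B)
wavelength = c / f             # acoustic wavelength at 20 kHz
lateral_res = wavelength / 2   # lateral limit for a full synthetic aperture
print(f"range {range_res * 100:.1f} cm, "
      f"wavelength {wavelength * 1000:.1f} mm, "
      f"lateral {lateral_res * 1000:.1f} mm")
```

Widening the usable band (if the speaker allowed it) would improve range resolution linearly, while lateral resolution is pinned by the wavelength itself.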
Current results are obtained in a quiet indoor environment with controlled multipath. In-the-wild operation with ambient noise and complex reflections remains future work.
The single-elevation circular aperture leaves a cone of unsampled spatial frequencies, causing anisotropic resolution — elongation along the vertical axis.