Using an unmodified iPhone's built-in speaker and microphone together with a motorized turntable, we reconstruct a 3D acoustic volume of a nearby object in under 30 seconds.
Read the paper (PDF)
Figure 1. System overview. The phone emits near-ultrasonic FMCW chirps toward an object on a motorized turntable. Echoes are captured by the built-in microphone, processed into range frames, and backprojected into a 3D voxel volume entirely on-device.
A stock iPhone reconstructed this 3D acoustic point cloud using only its built-in speaker and microphone, without cameras or depth sensors.
Target object
Acoustic reconstruction · 8,861 points · 133 frames
Each acoustic frame is first formed from the received chirp, then phase-aligned across the synthetic aperture, and finally backprojected into a 3D voxel grid.
The phone emits near-ultrasonic FMCW chirps and records echoes with its built-in microphone. The received signal is dechirped against the transmitted reference, and an FFT produces a per-frame range profile at approximately 13 Hz.
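The dechirp-and-FFT step above can be sketched as follows. This is a minimal illustration assuming a linear FMCW chirp; the sample rate, start frequency, bandwidth, and chirp duration below are illustrative choices, not measured device parameters.

```python
import numpy as np

FS = 48_000    # audio sample rate (Hz), typical for phone hardware (assumed)
F0 = 17_000    # chirp start frequency (Hz), near-ultrasonic band (assumed)
B = 4_000      # sweep bandwidth (Hz); gives c/(2B) ~ 4.3 cm range bins
T = 0.075      # chirp duration (s), ~13 frames per second
C = 343.0      # speed of sound (m/s)

t = np.arange(int(FS * T)) / FS
# transmitted reference: complex linear chirp sweeping F0 to F0 + B
tx = np.exp(1j * 2 * np.pi * (F0 * t + 0.5 * (B / T) * t ** 2))

def range_profile(rx):
    """Dechirp one received frame against the reference and FFT it.
    A reflector at delay tau appears as a beat tone at (B/T) * tau."""
    beat = np.conj(rx) * tx * np.hanning(len(t))
    return np.fft.fft(beat)

def bin_to_range(k):
    """Map an FFT bin index to one-way range in meters."""
    fb = k * FS / len(t)          # beat frequency of bin k
    return C * fb * T / (2 * B)   # tau = fb * T / B, range = c * tau / 2
```

With these parameters the FFT bin spacing corresponds to c/(2B) ≈ 4.3 cm of range, consistent with the resolution figure stated later.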
Coherent reconstruction requires phase consistency across hundreds of frames. A reference signal is used to estimate and correct inter-frame phase offsets, reducing the effects of clock drift and other phase instability before volumetric reconstruction.
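One simple form of reference-based correction looks like the sketch below: each frame is rotated so a static reference return (for example, the direct speaker-to-mic path, assumed here to occupy a known bin `ref_bin`) has zero phase. The exact reference signal used on-device is not detailed here.

```python
import numpy as np

def phase_align(frames, ref_bin):
    """frames: (n_frames, n_bins) complex range profiles.
    Rotates each frame so the reference return has zero phase,
    cancelling frame-to-frame drift from clock offsets and jitter."""
    drift = np.angle(frames[:, ref_bin])           # per-frame phase error
    return frames * np.exp(-1j * drift)[:, None]   # undo it on every bin
```

Because the correction is a unit phasor per frame, magnitudes are untouched; only the inter-frame phase relationship changes.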
The phase-aligned frames are backprojected into a dense voxel volume. For each voxel, the algorithm sums the expected complex contributions from all aperture positions. GPU-accelerated computation runs entirely on the iPhone and completes in under 30 seconds.
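A minimal CPU sketch of that per-voxel sum is given below: each voxel accumulates the complex range-profile sample from every aperture position at the voxel's expected range, with the round-trip carrier phase undone. The grid, carrier frequency, and phase sign convention are illustrative assumptions; the on-device version runs an equivalent sum on the GPU.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def backproject(frames, sensor_pos, voxels, bin_size, f0):
    """frames: (n_ap, n_bins) phase-aligned range profiles.
    sensor_pos: (n_ap, 3) virtual aperture positions (object frame).
    voxels: (n_vox, 3) voxel centers. Returns (n_vox,) magnitudes."""
    image = np.zeros(len(voxels), dtype=complex)
    n_bins = frames.shape[1]
    for profile, pos in zip(frames, sensor_pos):
        r = np.linalg.norm(voxels - pos, axis=1)              # voxel range
        k = np.clip(np.round(r / bin_size).astype(int), 0, n_bins - 1)
        # undo the carrier phase expected for a scatterer at range r
        image += profile[k] * np.exp(1j * 4 * np.pi * f0 * r / C)
    return np.abs(image)
```

Only contributions whose corrected phases agree across the aperture add coherently, which is why a true scatterer's voxel stands out against the incoherent background.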
Inverse Synthetic Aperture Radar (ISAR) uses target motion to synthesize a larger aperture from a stationary sensor. Here, we apply the same principle acoustically: the phone remains fixed while the turntable rotates the object.
In the object's reference frame, this is equivalent to the phone orbiting a stationary target. One full revolution yields a 360-degree synthetic aperture from a single transceiver position, enabling coherent 3D volumetric imaging without sensor arrays or mechanical translation of the phone.
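The coordinate change is small enough to state directly: rotating the object by +θ is the same, in the object's frame, as the phone orbiting by −θ. A sketch, assuming the phone sits at a fixed standoff in the turntable's horizontal plane (the 0.5 m standoff is an illustrative value):

```python
import numpy as np

STANDOFF = 0.5  # phone-to-rotation-axis distance (m), assumed

def virtual_positions(angles):
    """Map turntable angles to the equivalent aperture positions of a
    phone orbiting a stationary object: object rotated by +theta is
    phone orbited by -theta in the object's frame."""
    x = STANDOFF * np.cos(-angles)
    y = STANDOFF * np.sin(-angles)
    return np.stack([x, y, np.zeros_like(x)], axis=1)
```

One full revolution of turntable angles traces out the circular ring of virtual aperture positions.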
A full 360° synthetic aperture from a single transceiver position.
Figure 2. The ISAR equivalence. The phone (right) is stationary. The object rotates on a turntable (center). Green dots show the virtual aperture positions, equivalent to the phone orbiting the object.
Prior systems have demonstrated subsets of the properties desired here. This system combines phone-native acoustics, 3D volumetric output, and support for arbitrary objects in a single prototype.
| System | Phone-Native Acoustics | 3D Volumetric Output | Arbitrary Objects |
|---|---|---|---|
| FingerIO '16 | ✓ | — | — |
| LLAP '16 | ✓ | — | — |
| AIM '22 | ✓ | — | ✓ |
| SONDAR '24 | ✓ | — | ✓ |
| AirSAS '23 | — | ✓ | ✓ |
| NeuralSAS '24 | — | ✓ | ✓ |
| Ours | ✓ | ✓ | ✓ |
Phone-native acoustics denotes use of an unmodified smartphone speaker and microphone without external transducers.
At 20 kHz, acoustic wavelengths are on the order of 17 mm, far coarser than optical wavelengths. The system therefore recovers volumetric shape information rather than fine surface detail.
Range resolution is limited to approximately 4.3 cm by the available acoustic bandwidth of the phone speaker. Lateral resolution is set by the acoustic wavelength and aperture geometry.
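These two numbers follow from standard relations; as a worked check (the ~4 kHz usable bandwidth below is inferred from the stated 4.3 cm figure, not a measured value):

```python
C = 343.0                  # speed of sound in air (m/s), near 20 C

wavelength = C / 20_000    # lambda = c/f  ->  ~17 mm at 20 kHz
delta_r = C / (2 * 4_000)  # range resolution c/(2B)  ->  ~4.3 cm
```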
Current results are obtained indoors under controlled multipath and noise conditions. Operation in cluttered or noisy real-world settings remains future work.
A single-elevation circular aperture leaves part of the 3D spatial-frequency domain unsampled, producing anisotropic resolution and elongation along the vertical axis.