March 2026

Coherent 3D Acoustic Imaging on a Smartphone

We use a regular iPhone, its built-in speaker and microphone, and a motorized turntable to create a 3D model of an object in under 30 seconds.

Read the paper (PDF)
< 30 s on-device processing · 5 mm voxel spacing · 8.6 mm lateral resolution · 0 external sensors required

Figure 1. System overview. The phone emits near-ultrasonic FMCW chirps toward an object on a motorized turntable. Echoes are captured by the built-in microphone, processed into range frames, and backprojected into a 3D voxel volume entirely on-device.


Result

Sound in, 3D out

A stock iPhone reconstructed this 3D point cloud from sound alone, with no cameras or depth sensors.

Target object

Acoustic reconstruction · 8,861 points · 133 frames


Method

Three stages from chirp to volume

Each acoustic frame is formed, phase-aligned for coherence across the full aperture, then backprojected into a 3D voxel grid.

01

Frame Formation

The phone emits near-ultrasonic FMCW chirps and captures echoes with its built-in microphone. Dechirping mixes the received signal with the transmitted reference, and an FFT extracts per-frame range profiles at approximately 13 Hz.

FMCW sweep → mix with reference → FFT → range profile
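The dechirp pipeline above can be sketched in a few lines of NumPy. The start frequency, bandwidth, and sweep duration below are illustrative assumptions, not the paper's exact waveform parameters; the target sits at the prototype's 0.30 m standoff:

```python
import numpy as np

c = 343.0                      # speed of sound in air, m/s
f0, B, T = 18e3, 4e3, 0.075    # start freq, sweep bandwidth, sweep duration (assumed)
fs = 48e3                      # sample rate
t = np.arange(0, T, 1/fs)
R = 0.30                       # target range, m
tau = 2*R/c                    # round-trip delay

tx = np.cos(2*np.pi*(f0*t + 0.5*(B/T)*t**2))              # transmitted chirp
rx = np.cos(2*np.pi*(f0*(t-tau) + 0.5*(B/T)*(t-tau)**2))  # delayed echo
mixed = tx * rx                # dechirp: difference term beats at B*tau/T

spec = np.abs(np.fft.rfft(mixed * np.hanning(t.size)))
freqs = np.fft.rfftfreq(t.size, 1/fs)
band = (freqs > 10) & (freqs < 2000)                      # keep only the beat band
f_beat = freqs[band][np.argmax(spec[band])]
R_est = f_beat * T * c / (2*B)                            # beat frequency -> range
print(f"estimated range: {R_est:.3f} m")                  # close to 0.30 m
```

Note the FFT bin spacing (1/T = 13.3 Hz) maps to a range step of about 4.3 cm under these assumed parameters, consistent with the range resolution quoted below.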
02

Coherence Maintenance

Phase must stay aligned across hundreds of frames despite clock drift and platform motion. A reference signal tracks and corrects the phase offset between consecutive frames, ensuring constructive interference during volumetric reconstruction.

Phase referencing → drift estimation → correction → coherent frame stack
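A minimal sketch of the idea, assuming a simple reference-bin correction scheme (the paper's actual drift estimator may differ): a fixed reflector occupies the same range bin in every frame, so its phase measures the accumulated drift, and subtracting it restores coherence.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins, ref_bin = 200, 64, 10

true = np.exp(1j * rng.uniform(0, 2*np.pi, n_bins))      # scene range profile
drift = np.cumsum(rng.normal(0, 0.1, n_frames))          # random-walk clock drift (rad)
frames = true[None, :] * np.exp(1j * drift[:, None])     # each frame picks up drift

ref_phase = np.angle(frames[:, ref_bin])                 # phase of the reference bin
corrected = frames * np.exp(-1j * ref_phase[:, None])    # remove per-frame offset

# After correction all frames share a common phase origin, so averaging across
# frames no longer cancels: the coherent mean has magnitude ~1 in every bin.
spread = np.abs(np.mean(corrected / corrected[0], axis=0))
print(spread.min())                                      # ~1.0: frames are coherent
```

Without the correction, the random-walk drift would decorrelate the stack and the coherent sum in the backprojection step would wash out.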
03

3D Backprojection

Coherent frames are backprojected into a dense voxel volume. For each voxel, the algorithm sums phase-aligned contributions from all aperture positions. GPU-accelerated computation runs entirely on the iPhone, completing in under 30 seconds.

For each voxel v: I(v) = Σ_θ s(r_v(θ), θ) · exp(j φ_v(θ))
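The voxel sum can be illustrated on a toy 2D slice with a single point scatterer. The geometry, wavenumber, and echo model below are illustrative assumptions, not the paper's implementation; the idea is that phase-aligned contributions add constructively only at the scatterer's true location.

```python
import numpy as np

c, f = 343.0, 20e3
k = 2*np.pi*f/c                                   # acoustic wavenumber
angles = np.linspace(0, 2*np.pi, 133, endpoint=False)
sensor = 0.30 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # virtual aperture

target = np.array([0.05, 0.00])                   # point scatterer, 2D for brevity
d_tgt = np.linalg.norm(sensor - target, axis=1)
echo = np.exp(-2j*k*d_tgt)                        # round-trip phase per frame

xs = np.linspace(-0.1, 0.1, 41)                   # 5 mm voxel spacing
img = np.zeros((41, 41))
for i, x in enumerate(xs):
    for j, y in enumerate(xs):
        d = np.linalg.norm(sensor - np.array([x, y]), axis=1)
        img[i, j] = np.abs(np.sum(echo * np.exp(2j*k*d)))   # phase-aligned sum

peak = np.unravel_index(np.argmax(img), img.shape)
print(xs[peak[0]], xs[peak[1]])                   # ~(0.05, 0.00): scatterer recovered
```

The real system performs this sum over a dense 3D grid on the phone's GPU; the double loop here is only for readability.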

Core Insight

The ISAR principle, applied to acoustics

Inverse synthetic aperture radar (ISAR) exploits target rotation to synthesize a large virtual aperture from a stationary sensor. We apply the same principle acoustically: the phone stays fixed while the turntable rotates the object.

Stationary phone + Rotating object
= Virtual circular synthetic aperture

In the object's reference frame, this is equivalent to the phone orbiting around a stationary target. A full turntable revolution yields a complete 360-degree synthetic aperture from a single transceiver position, enabling coherent 3D volumetric imaging without sensor arrays or mechanical translation stages.
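The frame change can be made concrete with the prototype's turntable period and standoff. The per-revolution frame count below follows from the ~13 Hz frame rate and is an illustration, not a figure stated in the text:

```python
import numpy as np

standoff, period, frame_rate = 0.30, 15.0, 13.0   # m, s/rev, Hz (prototype values)
n = int(period * frame_rate)                      # frames captured in one revolution
theta = 2*np.pi*np.arange(n)/n                    # turntable angle at each frame

# Lab frame: phone fixed at (standoff, 0) while the object rotates by theta.
# Object frame: undo the rotation, and the phone appears to orbit the object.
virtual = standoff * np.stack([np.cos(-theta), np.sin(-theta)], axis=1)

print(virtual.shape)   # roughly 195 virtual aperture positions per revolution
```

Each row of `virtual` is one of the green dots in Figure 2: a synthetic transceiver position on a circle of radius equal to the standoff distance.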


Figure 2. The ISAR equivalence. The phone (right) is stationary. The object rotates on a turntable (center). Green dots show the virtual aperture positions, equivalent to the phone orbiting the object.


Prior Art

The three-property boundary

Prior systems achieve at most two of three desirable properties. This work is the first to combine all three.

System Phone-Native Acoustics 3D Volumetric Output Arbitrary Objects
FingerIO '16
LLAP '16
AIM '22
SONDAR '24
AirSAS '23
NeuralSAS '24
Ours

Phone-native = unmodified smartphone speaker + mic only, no external transducers.


Prototype

System specifications

Device iPhone 15 Pro
Transducer Built-in speaker + mic
Waveform Near-ultrasonic FMCW
Frame rate ~13 Hz
Frames per scan 200–600
Turntable period 15 s / revolution
Standoff distance 0.30 m
Reconstruction volume 0.30 m³
Voxel spacing 5 mm
Range resolution ~4.3 cm
Lateral resolution ~8.6 mm (λ/2 at 20 kHz)
Processing time < 30 s
External hardware Motorized turntable only

Limitations

Current constraints and the boundaries of this approach.

Not a LiDAR replacement

Acoustic wavelengths at 20 kHz are ~17 mm, orders of magnitude coarser than the optical wavelengths used by cameras and LiDAR. The system produces volumetric shape information, not surface-level geometric detail.

Bandwidth-limited resolution

Range resolution is constrained to ~4.3 cm by the available acoustic bandwidth of the phone's speaker. Lateral resolution (~8.6 mm) is set by the wavelength at 20 kHz.
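Both headline figures follow from standard formulas, so they can be checked directly. The ~4 kHz usable bandwidth below is back-solved from the stated range resolution and is an inference, not a number given in the text:

```python
c = 343.0      # speed of sound in air, m/s

# Lateral resolution: half the wavelength at the 20 kHz operating frequency.
wavelength = c / 20e3
lateral = wavelength / 2
print(f"lateral: {lateral*1e3:.1f} mm")          # 8.6 mm, matching the spec table

# Range resolution: c / (2B). Back-solving from the stated ~4.3 cm gives the
# implied usable bandwidth of the phone's speaker.
range_res = 0.043
bandwidth = c / (2 * range_res)
print(f"implied bandwidth: {bandwidth:.0f} Hz")  # roughly 4 kHz
```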

Controlled setting required

Current results are obtained in a quiet indoor environment with controlled multipath. In-the-wild operation with ambient noise and complex reflections remains future work.

Missing-cone anisotropy

The single-elevation circular aperture leaves a cone of unsampled spatial frequencies, causing anisotropic resolution: elongation along the vertical axis.