March 2026

Coherent 3D Acoustic Imaging on a Smartphone

Using an unmodified iPhone's built-in speaker and microphone together with a motorized turntable, we reconstruct a 3D acoustic volume of a nearby object in under 30 seconds.

Read the paper (PDF)
< 30 s on-device processing
5 mm voxel spacing
8.6 mm lateral resolution
No external sensing hardware beyond the turntable
[Figure 1 diagram: iPhone 15 Pro speaker + mic · FMCW chirps · turntable at 15 s/revolution · 200–600 frames · on-device GPU backprojection in < 30 s · 0.30 m³ 3D voxel volume]

Figure 1. System overview. The phone emits near-ultrasonic FMCW chirps toward an object on a motorized turntable. Echoes are captured by the built-in microphone, processed into range frames, and backprojected into a 3D voxel volume entirely on-device.


Result

Example reconstruction

A stock iPhone reconstructed this 3D acoustic point cloud using only its built-in speaker and microphone, without cameras or depth sensors.

Target object

Acoustic reconstruction · 8,861 points · 133 frames


Method

Pipeline overview

Each acoustic frame is first formed from the received chirp, then phase-aligned across the synthetic aperture, and finally backprojected into a 3D voxel grid.

01

Frame formation

The phone emits near-ultrasonic FMCW chirps and records echoes with its built-in microphone. The received signal is dechirped against the transmitted reference, and an FFT produces a per-frame range profile at approximately 13 Hz.

FMCW sweep → mix with reference → FFT → range profile
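The dechirp-and-FFT step can be sketched in a few lines of NumPy. The sweep band, chirp duration, and sample rate below are illustrative assumptions (chosen so the frame rate lands near 13 Hz and the range resolution near 4.3 cm), not the paper's exact parameters:

```python
import numpy as np

fs = 48_000                  # sample rate (Hz) — assumed
f0, B = 17_000.0, 4_000.0    # sweep start (Hz) and bandwidth (Hz) — assumed
T = 0.075                    # chirp duration (s), ~13 Hz frame rate
c = 343.0                    # speed of sound (m/s)

t = np.arange(int(fs * T)) / fs
tx = np.cos(2 * np.pi * (f0 * t + 0.5 * (B / T) * t**2))  # transmitted FMCW chirp

# Simulate an echo from a target 0.30 m away (0.60 m round trip).
delay = 2 * 0.30 / c
d = int(round(delay * fs))
rx = np.zeros_like(tx)
rx[d:] = tx[:-d]

# Dechirp: mixing rx with the reference concentrates each echo at a beat
# frequency f_b = B * tau / T, i.e. range r = c * T * f_b / (2 * B).
beat = rx * tx
spec = np.abs(np.fft.rfft(beat * np.hanning(len(beat))))
freqs = np.fft.rfftfreq(len(beat), 1 / fs)
mask = freqs < 2_000         # keep the low-frequency beat region only
r_est = c * T * freqs[mask][np.argmax(spec[mask])] / (2 * B)
print(f"estimated range: {r_est:.3f} m")   # ≈ 0.30 m
```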
02

Coherence maintenance

Coherent reconstruction requires phase consistency across hundreds of frames. A reference signal is used to estimate and correct inter-frame phase offsets, reducing the effects of clock drift and other phase instability before volumetric reconstruction.

Phase referencing → drift estimation → correction → coherent frame stack
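The phase-referencing idea can be illustrated on a toy frame stack. The bin indices, drift model, and amplitudes below are invented for the demonstration; the only point is that a static reference echo exposes the per-frame phase error, which is then subtracted from every range bin:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stack: 200 complex range profiles with a static reference reflector
# in bin 10 and a target echo in bin 40, both corrupted by the same
# random-walk phase drift (clock drift proxy).
n_frames, n_bins = 200, 256
drift = np.cumsum(rng.normal(0, 0.05, n_frames))    # drift in radians
frames = np.zeros((n_frames, n_bins), complex)
frames[:, 10] = 1.0 * np.exp(1j * drift)            # reference (should be constant)
frames[:, 40] = 0.5 * np.exp(1j * (0.3 + drift))    # target, same drift

# Estimate each frame's phase offset from the reference bin and remove it
# from the whole profile.
phase_err = np.angle(frames[:, 10])
coherent = frames * np.exp(-1j * phase_err)[:, None]

# After correction the target's phase is constant across the frame stack.
print(np.ptp(np.angle(coherent[:, 40])))   # ~0
```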
03

3D backprojection

The phase-aligned frames are backprojected into a dense voxel volume. For each voxel, the algorithm sums the expected complex contributions from all aperture positions. GPU-accelerated computation runs entirely on the iPhone and completes in under 30 seconds.

For each voxel v: I(v) = Σ_θ s(r_v(θ), θ) · exp(jφ_v(θ)), summing over all aperture angles θ, where r_v(θ) is the voxel-to-sensor range at that angle and φ_v(θ) the matched round-trip phase.
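A minimal single-frequency sketch of the backprojection sum, with a simulated point scatterer standing in for the measured frames (geometry, carrier, and frame count are illustrative; the real system integrates over the full range profile, not one tone):

```python
import numpy as np

c, f0 = 343.0, 20_000.0          # sound speed (m/s), carrier (Hz)
k = 2 * np.pi * f0 / c           # wavenumber
R = 0.30                          # standoff radius (m)
thetas = np.linspace(0, 2 * np.pi, 133, endpoint=False)   # aperture angles
sensors = R * np.stack([np.cos(thetas), np.sin(thetas),
                        np.zeros_like(thetas)], axis=1)

target = np.array([0.02, -0.01, 0.0])   # a single point scatterer

def backproject(voxels):
    """Coherently sum per-angle echoes at each voxel's expected phase."""
    img = np.zeros(len(voxels), complex)
    for s in sensors:
        r_t = np.linalg.norm(target - s)          # scatterer range
        echo = np.exp(-2j * k * r_t)              # measured round-trip phase
        r_v = np.linalg.norm(voxels - s, axis=1)  # voxel-to-sensor ranges
        img += echo * np.exp(2j * k * r_v)        # rephase to each voxel
    return np.abs(img) / len(sensors)

vox = np.array([[0.02, -0.01, 0.0], [0.10, 0.10, 0.0]])
vals = backproject(vox)
print(vals)   # focused voxel → 1.0; off-target voxel much smaller
```

Only the voxel coincident with the scatterer adds the 133 contributions in phase; everywhere else the phasors decorrelate and largely cancel.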

ISAR Formulation

Applying ISAR to acoustics

Inverse Synthetic Aperture Radar (ISAR) uses target motion to synthesize a larger aperture from a stationary sensor. Here, we apply the same principle acoustically: the phone remains fixed while the turntable rotates the object.

Stationary phone + Rotating object
= Virtual circular synthetic aperture

In the object's reference frame, this is equivalent to the phone orbiting a stationary target. One full revolution yields a 360-degree synthetic aperture from a single transceiver position, enabling coherent 3D volumetric imaging without sensor arrays or mechanical translation of the phone.


Figure 2. The ISAR equivalence. The phone (right) is stationary. The object rotates on a turntable (center). Green dots show the virtual aperture positions, equivalent to the phone orbiting the object.
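The change of reference frame is a one-line rotation: if the phone sits at position p and the turntable has rotated by θ(t), the object-frame (ISAR) view places a virtual sensor at R(−θ) · p. A sketch, using the stated 15 s period and ~13 Hz frame rate:

```python
import numpy as np

T_rot, R_standoff = 15.0, 0.30     # turntable period (s), standoff (m)

def virtual_sensor(t, p=np.array([R_standoff, 0.0])):
    """Virtual aperture position in the object's frame at time t."""
    theta = 2 * np.pi * t / T_rot             # turntable angle
    rot = np.array([[np.cos(-theta), -np.sin(-theta)],
                    [np.sin(-theta),  np.cos(-theta)]])
    return rot @ p                             # counter-rotate the sensor

# ~13 Hz frames over one revolution sweep a full virtual circle:
n_frames = int(T_rot * 13)
ts = np.arange(n_frames) / 13
aperture = np.array([virtual_sensor(t) for t in ts])
print(aperture.shape)   # 195 positions on a 0.30 m circle
```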


Prior Work

Comparison to prior work

Prior systems have demonstrated subsets of the properties desired here. This system combines phone-native acoustics, 3D volumetric output, and support for arbitrary objects in a single prototype.

System · Phone-Native Acoustics · 3D Volumetric Output · Arbitrary Objects
FingerIO '16
LLAP '16
AIM '22
SONDAR '24
AirSAS '23
NeuralSAS '24
Ours (all three)

Phone-native acoustics denotes use of an unmodified smartphone speaker and microphone without external transducers.


Specifications

System specifications

Device iPhone 15 Pro
Transducer Built-in speaker + mic
Waveform Near-ultrasonic FMCW
Frame rate ~13 Hz
Frames per scan 200–600
Turntable period 15 s / revolution
Standoff distance 0.30 m
Reconstruction volume 0.30 m³
Voxel spacing 5 mm
Range resolution ~4.3 cm
Lateral resolution ~8.6 mm (λ/2 at 20 kHz)
Processing time < 30 s
External hardware Motorized turntable only

Limitations

Current limitations

Not a replacement for LiDAR

At 20 kHz, acoustic wavelengths are on the order of 17 mm, far coarser than optical wavelengths. The system therefore recovers volumetric shape information rather than fine surface detail.

Bandwidth-limited range resolution

Range resolution is limited to approximately 4.3 cm by the available acoustic bandwidth of the phone speaker. Lateral resolution is set by the acoustic wavelength and aperture geometry.
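Both figures follow from standard relations: range resolution is c/2B, and the lateral limit scales with λ/2. The ~4 kHz bandwidth below is inferred from the stated ~4.3 cm figure rather than quoted from the paper:

```python
c = 343.0          # speed of sound (m/s)
B = 4_000.0        # usable acoustic bandwidth (Hz) — inferred, not quoted
f = 20_000.0       # near-ultrasonic carrier (Hz)

delta_r = c / (2 * B)      # range resolution: c / 2B
delta_x = (c / f) / 2      # lateral limit: lambda / 2
print(delta_r, delta_x)    # ≈ 0.043 m and ≈ 0.0086 m
```

Improving range resolution thus requires more speaker bandwidth, while lateral resolution could only improve by moving to higher (inaudible but harder-to-reproduce) frequencies.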

Controlled environment

Current results are obtained indoors under controlled multipath and noise conditions. Operation in cluttered or noisy real-world settings remains future work.

Missing-cone anisotropy

A single-elevation circular aperture leaves part of the 3D spatial-frequency domain unsampled, producing anisotropic resolution and elongation along the vertical axis.