Sparse Multi-baseline SAR Cross-modal 3D Reconstruction of Vehicle Targets (2024)

Da Li (da_li@bit.edu.cn), Guoqiang Zhao (zhaoguoqiang@bit.edu.cn), Houjun Sun (sunhoujun@bit.edu.cn), Jiacheng Bao (baojiacheng@bit.edu.cn)

Abstract

Multi-baseline SAR 3D imaging faces significant challenges due to data sparsity. In recent years, deep learning techniques have achieved notable success in enhancing the quality of sparse SAR 3D imaging. However, previous works typically rely on full-aperture high-resolution radar images to supervise the training of deep neural networks (DNNs), utilizing only single-modal information from radar data. Consequently, imaging performance is limited, and acquiring full-aperture data for multi-baseline SAR is costly and sometimes impractical in real-world applications. In this paper, we propose a Cross-Modal Reconstruction Network (CMR-Net), which integrates differentiable rendering and cross-modal supervision with optical images to reconstruct highly sparse multi-baseline SAR 3D images of vehicle targets into visually structured and high-resolution images. We meticulously designed the network architecture and training strategies to enhance network generalization capability. Remarkably, CMR-Net, trained solely on simulated data, demonstrates high-resolution reconstruction capabilities on both publicly available simulation datasets and real measured datasets, outperforming traditional sparse reconstruction algorithms based on compressed sensing as well as other learning-based methods. Additionally, using optical images as supervision provides a cost-effective way to build training datasets, lowering the barrier to adopting the method. Our work showcases the broad prospects of deep learning in multi-baseline SAR 3D imaging and offers a novel path for radar imaging research based on cross-modal learning theory.

Keywords: Multi-baseline SAR, Sparse imaging, 3D reconstruction, Cross-modal learning

Affiliation: Beijing Key Laboratory of Millimeter Wave and Terahertz Techniques, School of Integrated Circuits and Electronics, Beijing Institute of Technology, No.5 Zhongguancun South Street, Haidian District, Beijing 100081, China

1 Introduction

Synthetic aperture radar (SAR) offers all-weather, all-day, high-resolution imaging, making it a widely used remote sensing technology in terrain mapping and military reconnaissance [1]. Traditional SAR systems, constrained by their two-dimensional (2D) imaging mechanisms, produce 2D projection images of three-dimensional (3D) targets in the slant-range-azimuth plane. These images often suffer from distortions like layover and foreshortening, resulting in poor recognizability and interpretability. SAR 3D imaging technology addresses these limitations by mapping scatter centers in 3D space within the observed scene, overcoming the constraints of 2D imaging. Among these techniques, multi-baseline SAR 3D tomography stands out for its ability to provide comprehensive 3D spatial resolution [2]. By conducting multiple flights at different altitudes, it forms a synthetic aperture in the elevation direction, achieving height resolution. This technology enables 3D imaging and has significant application value in high-precision geographic remote sensing, urban 3D mapping, and detailed target interpretation [3].

To achieve refined 3D imaging results, multi-baseline SAR typically requires multi-aspect observations of the target to enhance spatial resolution, often necessitating circular trajectory flights around the target area for omni-directional resolution[4, 5, 6]. However, this approach faces practical challenges. Terrain and flight path constraints often limit the ability to acquire dense observation data. Additionally, the large volume of data required imposes significant computational and storage burdens[7]. Consequently, SAR researchers are focused on developing sparse imaging algorithms that can reconstruct detailed 3D target images using limited and incomplete measurement data. Existing methods fall into two main categories: those based on compressive sensing (CS) and those based on deep learning.

CS technology has been a mainstream approach for reconstructing signals from incomplete sparse measurements and was applied relatively early to multi-baseline SAR 3D imaging[8, 9, 10]. These methods cast the sparse SAR 3D imaging problem as a sparse signal recovery model. By introducing various penalties and optimization techniques, they can reconstruct high-resolution imaging results from incomplete measurements, and representative works significantly reduce the dependence of multi-baseline SAR 3D imaging on measurement completeness. However, the iterative optimization process for 3D data is time-consuming, and the sensitivity of imaging quality to optimization parameter settings poses challenges for the further development and application of CS-based algorithms in SAR 3D imaging[11, 12].

In recent years, deep learning techniques have been widely applied to sparse SAR 3D imaging. Researchers have explored the use of deep neural networks to learn image priors from training data and apply them to sparse imaging[13, 14, 15, 16, 17]. Thanks to their parallel structure and iteration-free inference, deep learning methods can deliver more efficient and stable imaging results than CS-based methods, making them a popular research focus[18]. Based on their implementation strategies, the learning-based methods can be categorized into two classes.

The first category of methods integrates neural models into CS algorithms by transforming the traditional iterative solving process into cascaded deep neural network modules, replacing the nonlinear components of optimization with neural layers[19, 20, 21, 13]. Through extensive training, these networks learn data priors and optimization parameters, eliminating the need for manual settings. Compared to traditional optimization-based methods, this approach bypasses iterative processes, requiring only a single inference to achieve reconstruction accuracy similar to CS algorithms, thereby significantly reducing computational complexity. However, this method is constrained by the CS model and assumptions, making it unsuitable for anisotropic target imaging problems, especially for artificial structural targets like vehicles and aircraft[22].

The second category of methods combines traditional imaging algorithms with deep neural networks. Traditional techniques first perform pre-imaging on sparse data, followed by deep neural networks to enhance the pre-imaging results. In one study[23], a 3D UNet was used to improve the rough back projection (BP) imaging results of sparse data, achieving high-resolution reconstruction on simulated datasets. Another study [22] proposed a Sparse Aspect Completion Network (SACNet) based on a Generative Adversarial Network (GAN) structure to enhance the CS pre-imaging results of sparse data. These results show that a network trained solely on simulated data can achieve good target reconstruction performance on real measured data. Leveraging the powerful data representation capabilities and thorough training of deep neural networks, these methods establish a direct mapping from low-resolution to high-resolution target images. They can rapidly and stably reconstruct high-resolution 3D images from sparse observation data through single-pass inference, making them the state-of-the-art (SOTA) method for SAR 3D reconstruction.

In summary, deep learning-based methods for sparse multi-baseline SAR three-dimensional imaging have shown significant potential to replace traditional CS algorithms and become the next generation of sparse imaging algorithms[18]. However, these methods still face several practical challenges that limit further improvements in imaging quality and hinder their practical application.

  1. Image Enhancement Limitation: Existing deep learning-based methods treat the enhancement of sparse SAR 3D image resolution as an image enhancement task. These methods train deep neural networks on low-resolution to high-resolution SAR 3D image data pairs, endowing the networks with denoising, artifact removal, and completion capabilities to improve image clarity. However, due to the constraints of electromagnetic imaging mechanisms, training with high-resolution images of the same modality caps the achievable resolution, hindering further improvement.

  2. Data Quality Constraint: The performance of deep learning-based imaging methods is constrained by the quality of the training data. The high-resolution SAR 3D supervision images used by existing algorithms require the acquisition and processing of full-aperture data, which is often inefficient, costly, and sometimes infeasible in practical applications. This limitation hinders the widespread adoption of deep learning-based algorithms in real-world scenarios.

  3. Observation Sensitivity and Noise Interference: SAR imaging results are highly sensitive to observation geometry and noise interference. The SAR images fed into neural networks therefore exhibit unstable feature content, making it difficult to extract useful information. Consequently, the generalization ability of deep learning-based imaging methods remains a significant hurdle for their practical application.

To address the aforementioned challenges, this paper proposes a sparse multi-baseline SAR 3D reconstruction method based on cross-modal supervision. We designed a cross-modal reconstruction network (CMR-Net) to achieve high-resolution reconstruction of rough imaging results from very sparse multi-baseline SAR data. The main contributions of this paper are summarized as follows:

  1. We integrate cross-modal supervision into SAR 3D reconstruction using differentiable rendering techniques[24]. By employing 2D optical images to supervise the 3D reconstruction process, we guide the network to produce high-resolution 3D images with coherent structures and prominent features, overcoming the resolution limitations of electromagnetic images.

  2. The optical image data used for supervision offer a more accessible and cost-effective means of supervising high-resolution SAR 3D reconstruction than processing full-aperture electromagnetic data. This approach reduces the difficulty of constructing high-quality datasets, paving the way for broader application of deep learning-based SAR 3D imaging methods.

  3. We devised a unique data augmentation scheme and integrated a Projection-Reprojection module within the network to enhance its robustness and generalization capability.

  4. Given the limited availability of data, the network was trained solely on simulated data and then validated on real measured data without any fine-tuning. Extensive experiments show that our method achieves outstanding 3D reconstruction performance under low signal-to-noise ratios and very sparse measurements compared to existing methods. Additionally, necessary ablation experiments confirm the effectiveness of our network design.

2 Methodology

2.1 Method framework

The proposed sparse multi-baseline SAR cross-modal 3D reconstruction method framework is illustrated in Figure 1. The method consists of two main modules: the pre-imaging module and the cross-modal reconstruction module. The pre-imaging module first individually processes each acquired sub-aperture data for imaging, then non-coherently combines the sub-imaging results to obtain incomplete and rough volumetric imaging results of the target. The cross-modal reconstruction module comprises a deep neural network trained with optical image supervision, encoding the electromagnetic-optical cross-modal information. It takes the pre-imaging results as input and outputs visually structured, sharply contoured, and more realistic target 3D reconstruction images. Below, we will elaborate on the details of each module.

[Figure 1: Framework of the proposed sparse multi-baseline SAR cross-modal 3D reconstruction method]

2.2 Sparse Multi-baseline SAR 3D Pre-imaging

2.2.1 Imaging Model

The sparse aspects multi-baseline SAR observation geometry is illustrated on the left side of Figure 1. The platform carrying the radar performs multi-aspect measurements around the target, represented by clusters of line segments in different colors. At each sub-aspect, the platform conducts multiple straight-line flight observations of the target at different heights, forming a height-extended synthetic aperture on top of the track-extended synthetic aperture. This is depicted by clusters of uniformly spaced points in the figure. Under far-field conditions, the frequency-domain echo data collected by the radar at different azimuth angles and heights for each sub-aspect collectively form the 3D annular k-space observation data of the imaging target. Through interpolation and coordinate transformation, the spatial position of the target scattering center can be directly obtained using a 3D inverse Fourier transform[8].

In this paper, this imaging procedure is used to obtain target 3D images at each sub-aspect. Initially, phase errors caused by track errors in the measurement data are corrected to prevent image defocusing. Subsequently, based on the imaging scene configuration, the frequency-domain data in polar coordinate format are interpolated onto a Cartesian spatial-frequency grid. Then, a 3D inverse Fourier transform is applied to the interpolated data to reconstruct the target 3D image at the respective sub-aspect. Finally, to integrate the target scattering structure information from the various sub-aspects, a direct non-coherent summation of the sub-aspect images is performed to obtain the pre-imaging result of the target, which forgoes the maximum a posteriori estimation[25]. This result is then fed into the cross-modal reconstruction module as the input image.
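
As a concrete illustration of this pipeline, the following NumPy sketch inverts each (already phase-corrected and Cartesian-interpolated) sub-aspect k-space block with a 3D inverse FFT and accumulates the magnitudes non-coherently; the data layout and normalization step are assumptions, not the paper's exact implementation.

```python
import numpy as np

def preimage_from_subapertures(subaperture_kspace, out_shape=(64, 64, 128)):
    """Sketch of the pre-imaging step (Sec. 2.2): each sub-aspect's k-space block,
    assumed to be phase-corrected and interpolated onto a Cartesian grid, is
    inverted with a 3D inverse FFT; sub-aspect magnitudes are summed non-coherently."""
    volume = np.zeros(out_shape, dtype=np.float64)
    for kspace in subaperture_kspace:        # one Cartesian k-space block per sub-aspect
        image = np.fft.fftshift(np.fft.ifftn(np.fft.ifftshift(kspace), s=out_shape))
        volume += np.abs(image)              # non-coherent (magnitude) summation
    return volume / volume.max()             # scale to [0, 1] before entering the network
```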

2.3 Cross-modal reconstruction

The cross-modal reconstruction module receives the target 3D pre-imaging results as input and employs a cross-modal reconstruction network (CMR-Net) to generate high-precision 3D images. This network is encoded with cross-modal information, enabling it to produce reconstructions that closely resemble the target’s true physical model. The structure of the CMR-Net is depicted in Figure 2. The CMR-Net features an encoder-decoder architecture with skip connection layers. The encoder extracts feature representations of target structures from rough 3D pre-imaging data, while the decoder reconstructs the vehicle’s 3D image from low-dimensional latent representations. Near the bottleneck layer of the network, we designed a Projection-Reprojection (PRP) module to enhance the feature representation capability of the network. At the network’s output end, a differentiable volume rendering module is introduced to convert the reconstructed 3D volume into multi-view 2D images. The corresponding 2D rendered image of the actual vehicle’s true digital 3D model is used as the ground truth to evaluate the quality of the 3D reconstruction. This supervision strategy effectively integrates cross-modal information into the network. In the following sections, we will provide a detailed exposition of the design of each module.

2.3.1 Network architecture

Figure 2 illustrates the detailed structure and training process of the cross-modal reconstruction network. The main body of the network consists of a contracting path (left) and an expansive path (right), with a PRP module located near the bottleneck layer. The contracting path consists of four downsampling layers, each composed of a 3D convolutional layer with LeakyReLU activation followed by a max-pooling layer. In the contracting path, the convolutional layers increase the number of feature channels, while the pooling layers reduce the data dimensionality. The endpoint of the contracting path is connected to the designed PRP layer (detailed in Section 2.3.2). This layer internally projects the data into a low-dimensional representation, but the output after reprojection retains the same size as the input feature map and seamlessly integrates with the processing of the expansive path.

The expansive path is the counterpart to the contracting path, comprising four upsampling layers and one single convolutional layer. Each upsampling layer incorporates a 3D transposed convolutional layer with ReLU activation. Within the expansive path, skip connection layers connect feature maps of the same resolution in the contracting path to the corresponding feature maps, which are then passed into the transposed convolutional layers to decrease the number of feature channels and increase the data dimensionality. The final upsampling layer yields a 64-channel output map. To consolidate information from all channels and achieve a smoother reconstruction result, we introduce a single-kernel convolutional layer.
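
Figure 2 gives the exact layer parameters; purely as an illustration, the PyTorch sketch below reflects the layout described above. The channel widths, kernel sizes, and the exact placement of the skip concatenations and fusion convolutions are assumptions not specified in the text, and the PRP module of Section 2.3.2 is omitted here.

```python
import torch
import torch.nn as nn

class CMRNetBackbone(nn.Module):
    """Illustrative sketch of the Sec. 2.3.1 backbone: four conv+LeakyReLU+max-pool
    downsampling steps, four transposed-conv+ReLU upsampling steps with skip
    connections, a 64-channel last feature map, and a single-kernel output conv.
    Channel widths and the fusion convolutions are assumptions."""

    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        enc_in = (1,) + widths[:-1]
        self.enc = nn.ModuleList(
            nn.Sequential(nn.Conv3d(ci, co, 3, padding=1), nn.LeakyReLU(0.2))
            for ci, co in zip(enc_in, widths))
        self.pool = nn.MaxPool3d(2)
        dec_out = (128, 64, 64, 64)          # last upsampling step outputs 64 channels
        dec_in = (widths[-1],) + dec_out[:-1]
        self.up = nn.ModuleList(
            nn.Sequential(nn.ConvTranspose3d(ci, co, 2, stride=2), nn.ReLU())
            for ci, co in zip(dec_in, dec_out))
        # After each upsampling step, the same-resolution encoder feature is
        # concatenated and fused back to `co` channels (fusion conv is an assumption).
        self.fuse = nn.ModuleList(
            nn.Sequential(nn.Conv3d(co + skip, co, 3, padding=1), nn.ReLU())
            for co, skip in zip(dec_out, reversed(widths)))
        self.head = nn.Conv3d(dec_out[-1], 1, kernel_size=1)   # single-kernel output conv

    def forward(self, x):                    # x: (B, 1, 64, 64, 128) pre-imaging volume
        skips = []
        for enc in self.enc:
            x = enc(x)
            skips.append(x)                  # feature map kept for the skip connection
            x = self.pool(x)
        # (The PRP module of Sec. 2.3.2 would transform the bottleneck x here.)
        for up, fuse, skip in zip(self.up, self.fuse, reversed(skips)):
            x = fuse(torch.cat([up(x), skip], dim=1))
        return torch.sigmoid(self.head(x))   # density/occupancy volume in [0, 1]
```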

[Figure 2: Detailed structure and training process of the cross-modal reconstruction network (CMR-Net)]

2.3.2 Projection-Reprojection module

In 3D images acquired from sparse data, the scattering structure information of anisotropic targets is often sensitive to factors like the number of observations, viewing angles, and noise levels. To ensure that the network can capture precise feature representations from structurally incomplete and variable 3D images, we introduce a PRP module between the encoder and decoder. This module is specifically designed to enhance the network’s representation capability and improve its generalization performance.

The structure details of the PRP module are depicted in Figure 3. The projection module comprises two fully connected layers. The first layer employs LeakyReLU as the activation function and is connected to the input feature map, compressing the features into a low-dimensional vector. The second fully connected layer removes the activation function and further reduces the dimensionality of the feature vector to obtain the latent representation vector z𝑧zitalic_z. The reprojection layer is the inverse operation of the projection layer, symmetrically enlarging the size of the feature vector and followed by a reshape operation to restore the feature vector to its original feature map size. The only distinction is that the last fully connected layer utilizes the ReLU function as the activation to ensure consistency in data dynamics within the network.
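
As a rough illustration, the sketch below implements this projection-reprojection scheme. The bottleneck feature shape (chosen to match the backbone sketch above) and the hidden and latent dimensions are assumptions not given in the text.

```python
import math
import torch
import torch.nn as nn

class PRPModule(nn.Module):
    """Sketch of the Projection-Reprojection module (Sec. 2.3.2).  Feature-map
    shape, hidden width, and latent size are illustrative assumptions."""

    def __init__(self, feat_shape=(128, 4, 4, 8), latent_dim=256, hidden_dim=1024):
        super().__init__()
        self.feat_shape = feat_shape
        n = math.prod(feat_shape)
        # Projection: two fully connected layers; only the first keeps an activation.
        self.project = nn.Sequential(
            nn.Linear(n, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, latent_dim))                 # latent vector z
        # Reprojection: symmetric enlargement; the last layer uses ReLU to keep
        # the data dynamics consistent with the rest of the network.
        self.reproject = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, n), nn.ReLU())

    def forward(self, x):                    # x: (B, C, D, H, W) bottleneck feature map
        b = x.shape[0]
        z = self.project(x.flatten(1))       # compress to the latent representation z
        out = self.reproject(z)              # expand back to the original size
        return out.view(b, *self.feat_shape), z
```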

[Figure 3: Structure of the Projection-Reprojection (PRP) module]

2.4 Differentiable volume rendering

The optical images of targets contain rich and intuitive structural information, and owing to their passive imaging mechanism, high-resolution optical images are often easier to obtain in practical applications than radar electromagnetic images. However, bringing the advantages of 2D optical images to SAR 3D imaging requires bridging the gap in data dimensionality. Differentiable volume rendering techniques offer a solution to this challenge. They can render 3D volume data into 2D images, and the differentiable nature of the rendering process allows the gradients of a 2D loss function to be computed with respect to the 3D structure. In this study, we introduce a differentiable rendering module at the end of CMR-Net to render the reconstructed 3D image into 2D images from different views. We use optical images to supervise the reconstruction quality, thus leveraging the cross-modal advantages.

Figure 4 illustrates the process of differentiable volume rendering. Given 3D volumetric imaging data V ∈ ℝ^{W×H×D}, a camera position o, and a viewing direction d, volume rendering obtains the pixel value C(r) along any camera ray r(t) = o + t·d using the formula:

C(\mathbf{r}) = \int_{t_1}^{t_2} T(t)\,\sigma(\mathbf{r}(t))\,dt \qquad (1)

where σ(r(t)) represents the volume density at the point r(t) along the viewing direction d, and dt denotes the step distance of the ray in each integration step. T(t) represents the cumulative transmittance, i.e., the probability that the ray propagates from t_1 to t without being intercepted, given by the following equation:

T(t) = \exp\left(-\int_{t_1}^{t} \sigma(\mathbf{r}(u))\,du\right) \qquad (2)

[Figure 4: Differentiable volume rendering process]

To compute this continuous integral, we discretely sample the ray at equidistant depths and use the quadrature rule discussed in [26] to estimate the pixel value C(r):

\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \Delta\delta)\right) \qquad (3)

where

T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \Delta\delta\right). \qquad (4)

The value Δδ denotes the distance between adjacent sampling points. The final rendered image is composed of the estimated pixel values of all camera rays passing through the pixels of the rendering canvas, expressed as:

I = \begin{bmatrix} \hat{C}(\mathbf{r}_{0,0}) & \cdots & \hat{C}(\mathbf{r}_{w-1,0}) \\ \vdots & \ddots & \vdots \\ \hat{C}(\mathbf{r}_{0,h-1}) & \cdots & \hat{C}(\mathbf{r}_{w-1,h-1}) \end{bmatrix} \qquad (5)

Here, r_{i,j} denotes the camera ray passing through the pixel (i, j), with i ∈ [0, w) and j ∈ [0, h); h and w denote the height and width of the rendered image, respectively.
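
The discrete sum in Eqs. (3) and (4) maps directly onto a few tensor operations. The sketch below is a simplified stand-in for the renderer, assuming the rays are already expressed in the volume's normalized [-1, 1] coordinate frame and using trilinear grid sampling; it is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def render_silhouette(volume, ray_origins, ray_dirs, t_near, t_far, n_samples=64):
    """Minimal sketch of Eqs. (3)-(5): alpha-composite the density grid along rays.
    `volume` is a (D, H, W) density tensor; ray origins/directions are assumed to be
    given in the volume's normalized [-1, 1] coordinate frame."""
    n_rays = ray_origins.shape[0]                                 # typically h * w rays
    t = torch.linspace(t_near, t_far, n_samples, device=ray_origins.device)
    delta = (t_far - t_near) / (n_samples - 1)                    # step distance Δδ
    # Sample points r(t) = o + t d for every ray.
    pts = ray_origins[:, None, :] + t[None, :, None] * ray_dirs[:, None, :]
    grid = pts.view(1, n_rays, n_samples, 1, 3)
    sigma = F.grid_sample(volume[None, None], grid, align_corners=True)
    sigma = sigma.view(n_rays, n_samples).clamp(min=0.0)          # volume density σ_i
    alpha = 1.0 - torch.exp(-sigma * delta)                       # 1 - exp(-σ_i Δδ)
    # T_i = exp(-sum_{j<i} σ_j Δδ), Eq. (4): shift densities by one sample.
    sig_shift = torch.cat([torch.zeros_like(sigma[:, :1]), sigma[:, :-1]], dim=1)
    trans = torch.exp(-torch.cumsum(sig_shift * delta, dim=1))
    pixels = (trans * alpha).sum(dim=1)                           # Eq. (3): Ĉ(r) per ray
    return pixels                                                 # reshape to (h, w) to form I
```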

3 Experiments

3.1 Experimental Settings

We created a dataset of simulated multi-baseline sparse aspects SAR 3D images of civilian vehicles and multi-view optical images to train CMR-Net. During validation, we conducted extensive comparative imaging experiments using a test set of simulated data and explored latent space interpolation. Additionally, we performed comparative imaging experiments on a real-world dataset, along with ablation experiments on the PRP module and data augmentation strategies. Notably, our network was trained exclusively on simulated data and then directly applied to infer real-world data. This section details our experimental settings and implementation methods.

3.1.1 Dataset and Augmentation

The multi-baseline sparse aspects SAR 3D images in our dataset were derived from the Civilian Vehicle Radar Data Domes dataset, publicly released by the United States Air Force Research Laboratory[27]. This dataset includes fully polarized, far-field X-band simulated electromagnetic scattering data for ten civilian vehicles, covering a 360° azimuth angle and a 30° to 60° elevation angle range. The distribution of viewpoints and the parameter settings of the simulated scene are illustrated in Figure 5 and detailed in Table 1.

We extracted omnidirectional data from eight elevation angles for five vehicles (two sedans, two SUVs, and one pickup) within an elevation angle range of 44.25° to 46°, with a sampling interval of 0.1875°. The 360° azimuth data was divided into 72 sub-apertures, each covering 5° with a sampling interval of 0.0625°. Nine sub-aperture images were then randomly selected and incoherently summed to produce the 3D pre-imaging result, which served as the input data for our dataset.

Table 1
Parameter                 | Value
Radar center frequency    | 9.6 GHz
Unambiguous range         | ≈ 15 m
Extrapolation extent      | ≤ 0.25°
Azimuth extent            | 360°
Elevation extent          | 30° to 60°

The multi-view optical images of vehicles used for supervision in the dataset were created by rendering digital 3D models of the vehicles. We collected 3D digital models identical to those in the CVDomes dataset and generated binary optical images of these vehicles from various viewpoints using computer rendering techniques. The process of vehicle modeling and dataset construction is illustrated in Figure 5.

In addition, we developed a data augmentation strategy to enhance the dataset by incorporating translation T𝑇Titalic_T, rotation R𝑅Ritalic_R, and scaling S𝑆Sitalic_S operations. As shown in the diagram, each geometric transformation applied to the original 3D imaging data is mirrored by an equivalent transformation applied to the digital 3D model, which is then reflected in the rendered images. This strategy expands the data space, improves the network’s generalization capability, and reduces alignment constraints in real-world data.
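
One way to realize such paired transformations, sketched below under assumed parameter ranges, is to resample the pre-imaging volume with a random yaw rotation, translation, and isotropic scaling, and to reuse the same parameters when rendering the supervision images; the specific ranges and the grid-resampling implementation are assumptions.

```python
import math
import random
import torch
import torch.nn.functional as F

def random_trs_augment(volume, max_shift=0.1, max_angle=10.0, scale_range=(0.9, 1.1)):
    """Sketch of the T/R/S augmentation in Sec. 3.1.1 applied to a (D, H, W) volume.
    Parameter ranges are illustrative assumptions."""
    angle = math.radians(random.uniform(-max_angle, max_angle))   # rotation R (yaw)
    tx = random.uniform(-max_shift, max_shift)                    # translation T (x)
    ty = random.uniform(-max_shift, max_shift)                    # translation T (y)
    s = random.uniform(*scale_range)                              # isotropic scaling S
    c, si = math.cos(angle), math.sin(angle)
    # Affine sampling grid in normalized [-1, 1] volume coordinates.
    theta = torch.tensor([[c / s, -si / s, 0.0, tx],
                          [si / s,  c / s, 0.0, ty],
                          [0.0,     0.0, 1.0 / s, 0.0]], dtype=volume.dtype)
    grid = F.affine_grid(theta[None], volume[None, None].shape, align_corners=False)
    out = F.grid_sample(volume[None, None], grid, align_corners=False)[0, 0]
    # The same (angle, tx, ty, s) must be applied to the digital 3D model so that
    # the rendered supervision images stay aligned with the augmented volume.
    return out, (angle, tx, ty, s)
```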

[Figure 5: Vehicle modeling, dataset construction, and data augmentation]

3.1.2 Implementation

In the pre-imaging stage, we set the imaging scene size to 3.2 m × 3.2 m × 6.4 m, with a spatial resolution of 0.05 m, resulting in data dimensions of 64 × 64 × 128. Before feeding the data into the network, we normalized the pre-imaging data by scaling the dynamic range to [0, 1].

The parameter details for each layer of the cross-modal reconstruction network are provided in Figure 2. For the differentiable rendering module, we set the size of the rendered images to 256 × 256 pixels. The camera is positioned 7 m away from the center of the scene and directed towards the center, with rendering performed from 8 fixed viewpoints.
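
For illustration, such fixed viewpoints can be generated as sketched below; only the 7 m distance and the look-at-center constraint come from the text, while the uniform azimuth spacing and the elevation angle are assumptions.

```python
import math
import torch

def fixed_viewpoints(n_views=8, distance=7.0, elevation_deg=30.0):
    """Hypothetical placement of the 8 fixed rendering cameras: `distance` metres
    from the scene centre, all looking at the centre (elevation angle assumed)."""
    origins, dirs = [], []
    el = math.radians(elevation_deg)
    for k in range(n_views):
        az = 2.0 * math.pi * k / n_views                  # evenly spaced azimuths
        o = torch.tensor([distance * math.cos(el) * math.cos(az),
                          distance * math.cos(el) * math.sin(az),
                          distance * math.sin(el)])
        origins.append(o)
        dirs.append(-o / o.norm())                        # viewing direction towards centre
    return torch.stack(origins), torch.stack(dirs)
```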

During the training stage, we set the hyperparameters as follows: a loss function scaling factor of L_s = 0.7, a batch size of 1, and the Adam optimizer with momentum parameters β1 = 0.5 and β2 = 0.9. The initial learning rate was set to 1×10⁻⁴ and gradually reduced to 5×10⁻⁵. The algorithm was implemented using the PyTorch framework, with both network training and inference conducted on a computer equipped with an NVIDIA RTX A6000 GPU. The network, trained solely on simulated data, was then evaluated on the test set and real-world data without any fine-tuning.
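
For illustration, these settings translate into the following PyTorch sketch. The loss terms that L_s = 0.7 scales are not fully specified in the text, so a mean-squared rendering loss is used purely as a placeholder, and the linear decay schedule and epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def make_optimizer(model, total_epochs=200):
    """Adam with the momentum parameters from Sec. 3.1.2; the learning rate decays
    from 1e-4 towards 5e-5 (decay shape and epoch count are assumptions)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.9))
    sched = torch.optim.lr_scheduler.LinearLR(opt, start_factor=1.0,
                                              end_factor=0.5, total_iters=total_epochs)
    return opt, sched

def training_step(model, renderer, volume_in, gt_images, opt, ls=0.7):
    """One cross-modally supervised step: reconstruct, render from the 8 fixed
    viewpoints, and compare with the binary optical images (MSE is a placeholder)."""
    opt.zero_grad()
    rendered = renderer(model(volume_in))         # e.g. (B, 8, 256, 256) silhouettes
    loss = ls * F.mse_loss(rendered, gt_images)   # placeholder rendering loss term
    loss.backward()                               # gradients flow through the renderer
    opt.step()
    return loss.item()
```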

3.1.3 Baseline

To demonstrate the enhancement effect of integrating optical cross-modal information on imaging quality, we compared our approach with traditional imaging methods and state-of-the-art (SOTA) deep learning techniques trained with same-modality supervision using full-aperture data as baseline controls. The traditional imaging methods included back-projection (BP)[25] and compressed sensing (CS) techniques[28]. The deep learning methods trained with same-modality supervision comprised SACNet[22] and UNet3D[23], both designed to enhance the quality of sparse multi-baseline SAR 3D imaging.

Additionally, to highlight the design advantages of the CMR-Net architecture, we used the backbones of SACNet and UNet3D as comparative networks. After incorporating the differentiable rendering (DR) module, we trained them using the same implementation as our proposed approach. These served as baseline controls for the cross-modal supervised methods.

3.1.4 Metrics

3.2 Simulated data results

Based on the experimental setup, we conducted imaging experiments using simulated data with varying accumulated aperture numbers (ranging from 4 to 12) and different signal-to-noise ratios (SNR) (ranging from 5 dB to 30 dB). These experiments aimed to demonstrate the superiority of the proposed method in imaging accuracy and its ability to handle highly sparse and low SNR data.

3.2.1 Imaging results of CMR-Net

Figure 6 illustrates the pre-imaging and reconstruction results of our proposed method for various civilian vehicle types with 8 accumulated aspects under an SNR of 30 dB. The ground truth images were generated by directly voxelizing the 3D models. From the figure, it is evident that the pre-imaging results show vehicle outlines as discrete point clusters. Although an increase in the number of accumulated aspects enhances the imaging of strong scattering structures on the vehicle body, significant structural deficiencies persist, and the features of different vehicle types remain indistinct.

In contrast, after cross-modal reconstruction, despite only introducing 2D optical image information, our method produces structurally complete and realistic 3D vehicle images. It effectively restores fine-grained features of different vehicle types.

[Figure 6: Pre-imaging and CMR-Net reconstruction results on simulated data]
[Figure 7: Imaging results of different methods on simulated data]

Figure 7 illustrates the imaging results of different imaging methods for various vehicle types with 8 accumulated apertures under an SNR of 30 dB. It can be observed that traditional imaging methods, by optimizing sub-aperture imaging and utilizing aperture synthesis techniques, achieve a certain degree of image enhancement compared to the directly incoherently summed pre-imaging results. However, their imaging resolution is limited by the number of observed apertures, resulting in imaging results characterized by distributed discrete point clusters and incomplete vehicle structures. The deep learning methods SACNet and UNet3D, trained with supervision using full-aperture data, encode prior information about vehicle structures. These methods can reconstruct relatively complete vehicle contour images using sparse aperture data. However, due to limitations imposed by electromagnetic properties, the imaging results lack planar structures on the vehicle body, restricting further improvement in imaging accuracy. In comparison, the backbone networks of SACNet and UNet3D, combined with the differentiable rendering module and supervised training using 2D optical images, effectively enhance imaging quality. The imaging results generated by such imaging frameworks demonstrate more complete vehicle body structures, surpassing the resolution limitations of electromagnetic image supervision. However, as the number of accumulated apertures decreases, these networks may encounter the problem of disappearing vehicle body structures. In contrast, the CMR-Net proposed in this paper achieves stable reconstruction results for all accumulated aperture numbers, indicating that our network design can more effectively handle highly sparse data and possesses stronger generalization capabilities.

3.2.2 Imaging at lower SNRs and fewer aspect numbers

The imaging capability under low signal-to-noise ratio is an important criterion for evaluating the performance of sparse imaging algorithms. To comprehensively assess the sparse imaging performance of our proposed method, we gradually decreased the signal-to-noise ratio from 25dB to 5dB while simultaneously reducing the accumulated aperture number from 12 to 4, conducting comprehensive imaging experiments. Figures 8 to 12 illustrate the imaging results of all tested vehicles.

In terms of quality, traditional imaging methods exhibit increasing noise and decreasing scattering structures on the vehicle body as the signal-to-noise ratio and accumulated aperture number decrease. The imaging results of the two deep learning methods supervised with full-aperture (full-resolution) images are also affected by noise. The inherently sparse vehicle contour features are eroded by noise sidelobes, leading to reduced feature discernibility.

[Figures 8–12: Imaging results of all tested vehicles under decreasing SNR and aspect numbers]

However, after cross-modal supervision training, the imaging results of both networks at low SNR still retain most of the vehicle exterior features, thanks to the introduction of optical information. Nevertheless, due to network constraints, some structural disappearance, distortion, and blurring phenomena are observed. In contrast, our method achieves stable imaging results even under the most stringent conditions (5dB, 4 accumulated aspects), demonstrating the imaging capability of highly sparse data under low SNR conditions.

3.2.3 Quantitative comparison

We computed quantitative metrics for the imaging results under all the aforementioned experimental conditions; the results are listed in Tables 2 and 3. As expected, deep learning methods trained with cross-modal supervision produce results closer to the true vehicle models than traditional methods and deep learning methods supervised with full-aperture electromagnetic images.

Table 2
Aspects number  |        12         |         8         |         4
SNR             | 30dB  15dB  5dB   | 30dB  15dB  5dB   | 30dB  15dB  5dB
BP              | 0.051 0.048 0.044 | 0.069 0.058 0.043 | 0.056 0.043 0.046
CS              | 0.025 0.030 0.022 | 0.042 0.020 0.025 | 0.038 0.022 0.028
SACNet[22]      | 0.075 0.075 0.063 | 0.078 0.068 0.066 | 0.071 0.074 0.072
UNet3D[23]      | 0.027 0.027 0.036 | 0.026 0.034 0.026 | 0.033 0.030 0.039
SACNet+DR       | 0.140 0.146 0.155 | 0.143 0.144 0.145 | 0.114 0.124 0.127
UNet3D+DR       | 0.506 0.507 0.506 | 0.518 0.485 0.508 | 0.492 0.498 0.465
CMR-Net (Ours)  | 0.593 0.593 0.562 | 0.591 0.575 0.548 | 0.551 0.534 0.504

Table 3
Aspects number  |        12         |         8         |         4
SNR             | 30dB  15dB  5dB   | 30dB  15dB  5dB   | 30dB  15dB  5dB
BP              | 1.088 1.166 1.095 | 1.134 1.138 1.111 | 1.148 1.096 1.137
CS              | 1.089 1.106 1.098 | 1.133 1.074 1.084 | 1.112 1.083 1.101
SACNet[22]      | 0.872 0.872 0.727 | 0.900 0.799 0.859 | 0.855 0.894 0.662
UNet3D[23]      | 0.329 0.334 0.390 | 0.338 0.363 0.363 | 0.349 0.358 0.404
SACNet+DR       | 0.385 0.409 0.403 | 0.502 0.382 0.400 | 0.777 0.639 0.661
UNet3D+DR       | 0.109 0.116 0.111 | 0.109 0.118 0.110 | 0.112 0.117 0.132
CMR-Net (Ours)  | 0.078 0.074 0.061 | 0.072 0.086 0.082 | 0.117 0.114 0.141

3.2.4 Latent interpolation

Figure 13 shows the results of the latent space interpolation experiment. We performed interpolation between latent space representations and generated the corresponding outputs. The figure illustrates that interpolation between a pair of vectors in the latent space maps to meaningful and smooth nonlinear interpolations in the image space through the network[29]. For example, the image outputs between each pair of green-framed images exhibit variations in vehicle body height and rear features that differ from any simulated vehicle type. This phenomenon confirms that the interpolation path of latent space features does not collapse into an "average" representation. We believe this is a valuable property of CMR-Net, as it indicates a broader imaging capability for vehicle targets.

[Figure 13: Latent space interpolation results]

3.3 Measured data results

To further validate the effectiveness of the method, we utilized a publicly available real-world dataset to demonstrate the generalization ability and practicality of our network in real-world scenarios.

3.3.1 Measured dataset

The real-world data we utilized are sourced from the GOTCHA Circular SAR dataset collected and released by the Air Force Research Laboratory[30]. This dataset was gathered in a scene containing numerous civilian vehicles. The radar operates at a center frequency of 9.6 GHz with a bandwidth of 640 MHz, functioning in circular SAR mode. It completed 8 circular passes at different altitudes, with average elevation angles of [45.66°, 44.01°, 43.92°, 44.18°, 44.14°, 43.53°, 43.01°, 43.06°]. The diversity in circular observation apertures and elevations enables three-dimensional imaging of scene targets. In line with the training conditions, we extracted data for validation experiments involving two SUV and two sedan vehicle models from the GOTCHA dataset. The digital reference images of the data collection scene, radar flight paths, and selected test vehicle models are depicted in Figure 14.

[Figure 14: GOTCHA data collection scene, radar flight paths, and selected test vehicles]

3.3.2 Imaging results

Figures 15 to 19 show the imaging results of the proposed method compared to other methods on the measured dataset with varying aperture accumulations. Due to the high noise level of the measured data, the traditional methods' imaging results in the first column exhibit an uneven distribution of scattering points, significant interference, and poor readability. With cross-modal supervised training, the networks roughly restore the vehicle shape. However, the reconstructions using SACNet and UNet3D as backbones still show serious image distortion and structural loss, and are sensitive to the amount of aperture accumulation. In contrast, our approach recovers a more regular and complete vehicle shape, maintaining stable performance even in experiments with extremely sparse data.

[Figures 15–19: Imaging results on the measured GOTCHA data with varying aperture accumulations]

4 Conclusion

In this paper, we propose a cross-modal reconstruction network to enhance sparse multi-baseline SAR 3D imaging of vehicle targets. Our network, combined with differentiable rendering technology, uses rendered visual images of vehicles as supervisory signals to improve reconstruction accuracy. Additionally, we design a projection-reprojection component and a data augmentation strategy to enhance the network’s generalization ability. Experimental results on both simulated and real-world datasets show that our cross-modal reconstruction network achieves superior imaging quality compared to traditional imaging methods and other network-based cross-modal techniques. Furthermore, the dataset used for training our network is generated using computer simulation technology, making the method easy to generalize and apply. Our method holds significant promise for sparse multi-baseline SAR 3D reconstruction and provides a novel approach to radar 3D reconstruction using deep learning technology.

References

  • Richards et al. [2010] M. A. Richards, J. Scheer, W. A. Holm, W. L. Melvin, Principles of Modern Radar, volume 1, Citeseer, 2010.
  • Zhu and Bamler [2014] X. X. Zhu, R. Bamler, Superresolving SAR tomography for multidimensional imaging of urban areas: Compressive sensing-based TomoSAR inversion, IEEE Signal Processing Magazine 31 (2014) 51–58. doi:10.1109/MSP.2014.2312098.
  • Zhu et al. [2018] X. X. Zhu, Y. Wang, S. Montazeri, N. Ge, A review of ten-year advances of multi-baseline SAR interferometry using TerraSAR-X data, Remote Sensing 10 (2018) 1374. doi:10.3390/rs10091374.
  • Yue et al. [2023] X. Yue, F. Teng, Y. Lin, W. Hong, Target anisotropic scattering deduction model using multi-aspect SAR data, ISPRS Journal of Photogrammetry and Remote Sensing 195 (2023) 153–168. doi:10.1016/j.isprsjprs.2022.11.007.
  • Chen et al. [2017] L. Chen, D. An, X. Huang, Z. Zhou, A 3D reconstruction strategy of vehicle outline based on single-pass single-polarization CSAR data, IEEE Transactions on Image Processing 26 (2017) 5545–5554. doi:10.1109/TIP.2017.2738566.
  • Li et al. [2018] Y. Li, Y. Lin, W. Hong, R. Xu, Z. Zhuo, Q. Yin, Anisotropic scattering detection for characterizing polarimetric circular SAR multi-aspect signatures, in: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, 2018, pp. 4543–4546. doi:10.1109/IGARSS.2018.8519260.
  • Rambour et al. [2020] C. Rambour, A. Budillon, A. C. Johnsy, L. Denis, F. Tupin, G. Schirinzi, From interferometric to tomographic SAR: A review of synthetic aperture radar tomography-processing techniques for scatterer unmixing in urban areas, IEEE Geoscience and Remote Sensing Magazine 8 (2020) 6–29. doi:10.1109/MGRS.2019.2957215.
  • Austin et al. [2009] C. Austin, E. Ertin, R. Moses, Sparse multipass 3D SAR imaging: Applications to the GOTCHA data set, Proceedings of SPIE - The International Society for Optical Engineering 7337 (2009). doi:10.1117/12.820323.
  • Zhu and Bamler [2010] X. X. Zhu, R. Bamler, Tomographic SAR inversion by L1-norm regularization—the compressive sensing approach, IEEE Transactions on Geoscience and Remote Sensing 48 (2010) 3839–3846. doi:10.1109/TGRS.2010.2048117.
  • Budillon et al. [2009] A. Budillon, A. Evangelista, G. Schirinzi, SAR tomography from sparse samples, in: 2009 IEEE International Geoscience and Remote Sensing Symposium, volume 4, 2009, pp. IV-865–IV-868. doi:10.1109/IGARSS.2009.5417514.
  • Yang et al. [2019] J. Yang, T. Jin, C. Xiao, X. Huang, Compressed sensing radar imaging: Fundamentals, challenges, and advances, Sensors 19 (2019) 3100. doi:10.3390/s19143100.
  • Potter et al. [2010] L. C. Potter, E. Ertin, J. T. Parker, M. Cetin, Sparsity and compressed sensing in radar imaging, Proceedings of the IEEE 98 (2010) 1006–1020. doi:10.1109/JPROC.2009.2037526.
  • Liu et al. [2022] C. Liu, Y. Wang, Z. Ding, Y. Wei, J. Huang, Y. Cai, Analysis of deep learning 3-D imaging methods based on UAV SAR, in: IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium, 2022, pp. 2951–2954. doi:10.1109/IGARSS46834.2022.9883292.
  • Wang et al. [2022] M. Wang, S. Wei, Z. Zhou, J. Shi, X. Zhang, Y. Guo, 3-D SAR data-driven imaging via learned low-rank and sparse priors, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–17. doi:10.1109/TGRS.2022.3175486.
  • Sun et al. [2021] Y. Sun, L. Mou, Y. Wang, S. Montazeri, X. X. Zhu, Large-scale building height retrieval from single SAR imagery based on bounding box regression networks, 2021. doi:10.48550/arXiv.2111.09460. arXiv:2111.09460.
  • Wang et al. [2021] S. Wang, J. Guo, Y. Zhang, Y. Hu, C. Ding, Y. Wu, TomoSAR 3D reconstruction for buildings using very few tracks of observation: A conditional generative adversarial network approach, Remote Sensing 13 (2021) 5055. doi:10.3390/rs13245055.
  • Sun et al. [2021] Y. Sun, Z. Huang, H. Zhang, Z. Cao, D. Xu, 3DRIMR: 3D reconstruction and imaging via mmWave radar based on deep learning, 2021. doi:10.48550/arXiv.2108.02858. arXiv:2108.02858.
  • Xu et al. [2022] G. Xu, B. Zhang, H. Yu, J. Chen, M. Xing, W. Hong, Sparse synthetic aperture radar imaging from compressed sensing and machine learning: Theories, applications, and trends, IEEE Geoscience and Remote Sensing Magazine 10 (2022) 32–69. doi:10.1109/MGRS.2022.3218801.
  • Wang et al. [2022] M. Wang, S. Wei, Z. Zhou, J. Shi, X. Zhang, Y. Guo, 3-D SAR autofocusing with learned sparsity, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–18. doi:10.1109/TGRS.2022.3210547.
  • Wang et al. [2021] M. Wang, S. Wei, J. Liang, Z. Zhou, Q. Qu, J. Shi, X. Zhang, TPSSI-Net: Fast and enhanced two-path iterative network for 3D SAR sparse imaging, IEEE Transactions on Image Processing 30 (2021) 7317–7332. doi:10.1109/TIP.2021.3104168.
  • Zhou et al. [2022] Z. Zhou, S. Wei, H. Zhang, R. Shen, M. Wang, J. Shi, X. Zhang, SAF-3DNet: Unsupervised AMP-inspired network for 3-D mmW SAR imaging and autofocusing, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15. doi:10.1109/TGRS.2022.3205628.
  • Wang et al. [2023] S. Wang, J. Guo, Y. Zhang, Y. Wu, Multi-baseline SAR 3D reconstruction of vehicle from very sparse aspects: A generative adversarial network based approach, ISPRS Journal of Photogrammetry and Remote Sensing 197 (2023) 36–55. doi:10.1016/j.isprsjprs.2023.01.022.
  • Wang et al. [2021] S. Wang, J. Guo, Y. Zhang, Y. Hu, C. Ding, Y. Wu, Single target SAR 3D reconstruction based on deep learning, Sensors 21 (2021) 964. doi:10.3390/s21030964.
  • Han et al. [2020] Z. Han, C. Chen, Y.-S. Liu, M. Zwicker, DRWR: A differentiable renderer without rendering for unsupervised 3D structure learning from silhouette images, arXiv preprint arXiv:2007.06127 (2020).
  • Dungan et al. [2010] K. E. Dungan, C. Austin, J. Nehrbass, L. C. Potter, Civilian vehicle radar data domes, in: E. G. Zelnio, F. D. Garber (Eds.), SPIE Defense, Security, and Sensing, Orlando, Florida, 2010, p. 76990P. doi:10.1117/12.850151.
  • Max [1995] N. Max, Optical models for direct volume rendering, IEEE Transactions on Visualization and Computer Graphics 1 (1995) 99–108. doi:10.1109/2945.468400.
  • Dungan et al. [2010] K. E. Dungan, C. Austin, J. Nehrbass, L. C. Potter, Civilian vehicle radar data domes, in: Algorithms for Synthetic Aperture Radar Imagery XVII, volume 7699, SPIE, 2010, pp. 242–253.
  • Ertin et al. [2007] E. Ertin, C. Austin, S. Sharma, R. Moses, L. Potter, GOTCHA experience report: Three-dimensional SAR imaging with complete circular apertures, Proc. SPIE (2007). doi:10.1117/12.723245.
  • Bojanowski et al. [2017] P. Bojanowski, A. Joulin, D. Lopez-Paz, A. Szlam, Optimizing the latent space of generative networks, arXiv preprint arXiv:1707.05776 (2017).
  • Casteel Jr. et al. [2007] C. H. Casteel Jr., L. A. Gorham, M. J. Minardi, S. M. Scarborough, K. D. Naidu, U. K. Majumder, A challenge problem for 2D/3D imaging of targets from a volumetric data set in an urban environment, in: Algorithms for Synthetic Aperture Radar Imagery XIV, volume 6568, SPIE, 2007, pp. 97–103.