Sparse Multi-baseline SAR Cross-modal 3D Reconstruction of Vehicle Targets (2024)

Da Li (da_li@bit.edu.cn), Guoqiang Zhao (zhaoguoqiang@bit.edu.cn), Houjun Sun (sunhoujun@bit.edu.cn), Jiacheng Bao (baojiacheng@bit.edu.cn)

Abstract

Multi-baseline SAR 3D imaging faces significant challenges due to data sparsity. In recent years, deep learning techniques have achieved notable success in enhancing the quality of sparse SAR 3D imaging. However, previous works typically rely on full-aperture high-resolution radar images to supervise the training of deep neural networks (DNNs), utilizing only single-modal information from radar data. Consequently, imaging performance is limited, and acquiring full-aperture data for multi-baseline SAR is costly and sometimes impractical in real-world applications. In this paper, we propose a Cross-Modal Reconstruction Network (CMR-Net), which integrates differentiable rendering and cross-modal supervision with optical images to reconstruct highly sparse multi-baseline SAR 3D images of vehicle targets into visually structured and high-resolution images. We meticulously designed the network architecture and training strategies to enhance network generalization capability. Remarkably, CMR-Net, trained solely on simulated data, demonstrates high-resolution reconstruction capabilities on both publicly available simulation datasets and real measured datasets, outperforming traditional sparse reconstruction algorithms based on compressed sensing as well as other learning-based methods. Additionally, using optical images as supervision provides a cost-effective way to build training datasets, lowering the barrier to adopting the method. Our work showcases the broad prospects of deep learning in multi-baseline SAR 3D imaging and offers a novel path for radar imaging research based on cross-modal learning theory.

Keywords: Multi-baseline SAR, Sparse imaging, 3D reconstruction, Cross-modal learning

Affiliation: Beijing Key Laboratory of Millimeter Wave and Terahertz Techniques, School of Integrated Circuits and Electronics, Beijing Institute of Technology, No.5 Zhongguancun South Street, Haidian District, Beijing 100081, China

1 Introduction

Synthetic aperture radar (SAR) offers all-weather, all-day, high-resolution imaging, making it a widely used remote sensing technology in terrain mapping and military reconnaissance [1]. Traditional SAR systems, constrained by their two-dimensional (2D) imaging mechanisms, produce 2D projection images of three-dimensional (3D) targets in the slant-range-azimuth plane. These images often suffer from distortions like layover and foreshortening, resulting in poor recognizability and interpretability. SAR 3D imaging technology addresses these limitations by mapping scatter centers in 3D space within the observed scene, overcoming the constraints of 2D imaging. Among these techniques, multi-baseline SAR 3D tomography stands out for its ability to provide comprehensive 3D spatial resolution [2]. By conducting multiple flights at different altitudes, it forms a synthetic aperture in the elevation direction, achieving height resolution. This technology enables 3D imaging and has significant application value in high-precision geographic remote sensing, urban 3D mapping, and detailed target interpretation [3].

To achieve refined 3D imaging results, multi-baseline SAR typically requires multi-aspect observations of the target to enhance spatial resolution, often necessitating circular trajectory flights around the target area for omni-directional resolution[4, 5, 6]. However, this approach faces practical challenges. Terrain and flight path constraints often limit the ability to acquire dense observation data. Additionally, the large volume of data required imposes significant computational and storage burdens[7]. Consequently, SAR researchers are focused on developing sparse imaging algorithms that can reconstruct detailed 3D target images using limited and incomplete measurement data. Existing methods fall into two main categories: those based on compressive sensing (CS) and those based on deep learning.

CS technology has been a mainstream approach for reconstructing signals from incomplete sparse measurements and was applied relatively early to multi-baseline SAR 3D imaging[8, 9, 10]. These methods cast the sparse SAR 3D imaging problem as a sparse signal recovery model. By introducing various penalties and optimization techniques, they can reconstruct high-resolution imaging results from incomplete measurements, and representative works significantly reduce the dependence of multi-baseline SAR 3D imaging on measurement completeness. However, the iterative optimization process for 3D data is time-consuming, and the sensitivity of imaging quality to optimization parameter settings poses challenges for the further development and application of CS-based algorithms in SAR 3D imaging[11, 12].

In recent years, deep learning techniques have been widely applied to sparse SAR 3D imaging. Researchers have explored the use of deep neural networks to learn image priors from training data and apply them to sparse imaging[13, 14, 15, 16, 17]. Thanks to their parallel structure and iteration-free inference, deep learning methods can deliver more efficient and stable imaging results than CS-based methods, making them a popular research focus[18]. Based on their implementation strategies, the learning-based methods can be categorized into two classes.

The first category of methods integrates neural models into CS algorithms by transforming the traditional iterative solving process into cascaded deep neural network modules, replacing the nonlinear components of optimization with neural layers[19, 20, 21, 13]. Through extensive training, these networks learn data priors and optimization parameters, eliminating the need for manual settings. Compared to traditional optimization-based methods, this approach bypasses iterative processes, requiring only a single inference to achieve reconstruction accuracy similar to CS algorithms, thereby significantly reducing computational complexity. However, this method is constrained by the CS model and assumptions, making it unsuitable for anisotropic target imaging problems, especially for artificial structural targets like vehicles and aircraft[22].

The second category of methods combines traditional imaging algorithms with deep neural networks. Traditional techniques first perform pre-imaging on sparse data, followed by deep neural networks to enhance the pre-imaging results. In one study[23], a 3D UNet was used to improve the rough back projection (BP) imaging results of sparse data, achieving high-resolution reconstruction on simulated datasets. Another study [22] proposed a Sparse Aspect Completion Network (SACNet) based on a Generative Adversarial Network (GAN) structure to enhance the CS pre-imaging results of sparse data. These results show that a network trained solely on simulated data can achieve good target reconstruction performance on real measured data. Leveraging the powerful data representation capabilities and thorough training of deep neural networks, these methods establish a direct mapping from low-resolution to high-resolution target images. They can rapidly and stably reconstruct high-resolution 3D images from sparse observation data through single-pass inference, making them the state-of-the-art (SOTA) method for SAR 3D reconstruction.

In summary, deep learning-based methods for sparse multi-baseline SAR three-dimensional imaging have shown significant potential to replace traditional CS algorithms and become the next generation of sparse imaging algorithms[18]. However, these methods still face several practical challenges that limit further improvements in imaging quality and hinder their practical application.

  1. Image Enhancement Limitation: Existing deep learning-based methods treat the enhancement of sparse SAR 3D image resolution as an image enhancement task. These methods train deep neural networks on low-resolution to high-resolution SAR 3D image data pairs, endowing the networks with denoising, artifact removal, and completion capabilities to improve image clarity. However, due to the constraints of electromagnetic imaging mechanisms, training with high-resolution images of the same modality caps the achievable resolution, hindering further improvement.

  2. Data Quality Constraint: The performance of deep learning-based imaging methods is constrained by the quality of the training data. The high-resolution SAR 3D supervision images used by existing algorithms require the acquisition and processing of full-aperture data, which is often inefficient, costly, and sometimes infeasible in practical applications. This limitation hinders the widespread adoption of deep learning-based algorithms in real-world scenarios.

  3. Observation Sensitivity and Noise Interference: SAR imaging results are highly sensitive to observation geometry and noise interference. The SAR images fed into neural networks therefore exhibit unstable feature content, making it difficult to extract useful information. Consequently, the generalization ability of deep learning-based imaging methods remains a significant hurdle for their practical application.

To address the aforementioned challenges, this paper proposes a sparse multi-baseline SAR 3D reconstruction method based on cross-modal supervision. We designed a cross-modal reconstruction network (CMR-Net) to achieve high-resolution reconstruction of rough imaging results from very sparse multi-baseline SAR data. The main contributions of this paper are summarized as follows:

  1. We integrate cross-modal supervision into SAR 3D reconstruction using differentiable rendering techniques[24]. By employing 2D optical images to supervise the 3D reconstruction process, we guide the network to produce high-resolution 3D images with coherent structures and prominent features, overcoming the resolution limitations of electromagnetic images.

  2. The optical image data used for supervision offer a more accessible and cost-effective means of supervising high-resolution SAR 3D reconstruction than processing full-aperture electromagnetic data. This approach reduces the difficulty of constructing high-quality datasets, paving the way for broader application of deep learning-based SAR 3D imaging methods.

  3. We devised a unique data augmentation scheme and integrated a Projection-Reprojection module within the network to enhance its robustness and generalization capability.

  4. Given the limited availability of data, the network was trained solely on simulated data and then validated on real measured data without any fine-tuning. Extensive experiments show that our method achieves outstanding 3D reconstruction performance under low signal-to-noise ratios and very sparse measurements compared to existing methods. Additionally, necessary ablation experiments confirm the effectiveness of our network design.

2 Methodology

2.1 Method framework

The proposed sparse multi-baseline SAR cross-modal 3D reconstruction method framework is illustrated in Figure 1. The method consists of two main modules: the pre-imaging module and the cross-modal reconstruction module. The pre-imaging module first individually processes each acquired sub-aperture data for imaging, then non-coherently combines the sub-imaging results to obtain incomplete and rough volumetric imaging results of the target. The cross-modal reconstruction module comprises a deep neural network trained with optical image supervision, encoding the electromagnetic-optical cross-modal information. It takes the pre-imaging results as input and outputs visually structured, sharply contoured, and more realistic target 3D reconstruction images. Below, we will elaborate on the details of each module.

[Figure 1: Framework of the proposed sparse multi-baseline SAR cross-modal 3D reconstruction method]

2.2 Sparse Multi-baseline SAR 3D Pre-imaging

2.2.1 Imaging Model

The sparse aspects multi-baseline SAR observation geometry is illustrated on the left side of Figure 1. The platform carrying the radar performs multi-aspect measurements around the target, represented by clusters of line segments in different colors. At each sub-aspect, the platform conducts multiple straight-line flight observations of the target at different heights, forming a height-extended synthetic aperture on top of the track-extended synthetic aperture. This is depicted by clusters of uniformly spaced points in the figure. Under far-field conditions, the frequency-domain echo data collected by the radar at different azimuth angles and heights for each sub-aspect collectively form the 3D annular k-space observation data of the imaging target. Through interpolation and coordinate transformation, the spatial position of the target scattering center can be directly obtained using a 3D inverse Fourier transform[8].

In this paper, this imaging procedure is used to obtain target 3D images at each sub-aspect. Initially, phase errors caused by track errors in the measurement data are corrected to prevent image defocusing. Subsequently, based on the imaging scene configuration, the frequency-domain data in polar coordinate format are interpolated onto a Cartesian spatial-frequency grid. Then, a 3D inverse Fourier transform is applied to the interpolated data to reconstruct the target 3D image at the respective sub-aspect. Finally, to integrate the target scattering structure information from the various sub-aspects, a direct non-coherent summation of the sub-aspect images is performed to obtain the pre-imaging result of the target, which forgoes the maximum a posteriori estimation[25]. This result is then fed into the cross-modal reconstruction module as the input image.
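
As a concrete illustration of this pipeline, the following NumPy sketch inverts each (already phase-corrected and Cartesian-interpolated) sub-aspect k-space block with a 3D inverse FFT and accumulates the magnitudes non-coherently; the data layout and normalization step are assumptions, not the paper's exact implementation.

```python
import numpy as np

def preimage_from_subapertures(subaperture_kspace, out_shape=(64, 64, 128)):
    """Sketch of the pre-imaging step (Sec. 2.2): each sub-aspect's k-space block,
    assumed to be phase-corrected and interpolated onto a Cartesian grid, is
    inverted with a 3D inverse FFT; sub-aspect magnitudes are summed non-coherently."""
    volume = np.zeros(out_shape, dtype=np.float64)
    for kspace in subaperture_kspace:        # one Cartesian k-space block per sub-aspect
        image = np.fft.fftshift(np.fft.ifftn(np.fft.ifftshift(kspace), s=out_shape))
        volume += np.abs(image)              # non-coherent (magnitude) summation
    return volume / volume.max()             # scale to [0, 1] before entering the network
```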

2.3 Cross-modal reconstruction

The cross-modal reconstruction module receives the target 3D pre-imaging results as input and employs a cross-modal reconstruction network (CMR-Net) to generate high-precision 3D images. This network is encoded with cross-modal information, enabling it to produce reconstructions that closely resemble the target’s true physical model. The structure of the CMR-Net is depicted in Figure 2. The CMR-Net features an encoder-decoder architecture with skip connection layers. The encoder extracts feature representations of target structures from rough 3D pre-imaging data, while the decoder reconstructs the vehicle’s 3D image from low-dimensional latent representations. Near the bottleneck layer of the network, we designed a Projection-Reprojection (PRP) module to enhance the feature representation capability of the network. At the network’s output end, a differentiable volume rendering module is introduced to convert the reconstructed 3D volume into multi-view 2D images. The corresponding 2D rendered image of the actual vehicle’s true digital 3D model is used as the ground truth to evaluate the quality of the 3D reconstruction. This supervision strategy effectively integrates cross-modal information into the network. In the following sections, we will provide a detailed exposition of the design of each module.

2.3.1 Network architecture

Figure 2 illustrates the detailed structure and training process of the cross-modal reconstruction network. The main body of the network consists of a contracting path (left) and an expansive path (right), with a PRP module located near the bottleneck layer. The contracting path consists of four downsampling layers, each composed of a 3D convolutional layer with LeakyReLU activation followed by a max-pooling layer. In the contracting path, the convolutional layers increase the number of feature channels, while the pooling layers reduce the data dimensionality. The endpoint of the contracting path is connected to the designed PRP layer (detailed in Section 2.3.2). This layer internally projects the data into a low-dimensional representation, but the output after reprojection retains the same size as the input feature map and seamlessly integrates with the processing of the expansive path.

The expansive path is the counterpart to the contracting path, comprising four upsampling layers and one single convolutional layer. Each upsampling layer incorporates a 3D transposed convolutional layer with ReLU activation. Within the expansive path, skip connection layers connect feature maps of the same resolution in the contracting path to the corresponding feature maps, which are then passed into the transposed convolutional layers to decrease the number of feature channels and increase the data dimensionality. The final upsampling layer yields a 64-channel output map. To consolidate information from all channels and achieve a smoother reconstruction result, we introduce a single-kernel convolutional layer.
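
Figure 2 gives the exact layer parameters; purely as an illustration, the PyTorch sketch below reflects the layout described above. The channel widths, kernel sizes, and the exact placement of the skip concatenations and fusion convolutions are assumptions not specified in the text, and the PRP module of Section 2.3.2 is omitted here.

```python
import torch
import torch.nn as nn

class CMRNetBackbone(nn.Module):
    """Illustrative sketch of the Sec. 2.3.1 backbone: four conv+LeakyReLU+max-pool
    downsampling steps, four transposed-conv+ReLU upsampling steps with skip
    connections, a 64-channel last feature map, and a single-kernel output conv.
    Channel widths and the fusion convolutions are assumptions."""

    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        enc_in = (1,) + widths[:-1]
        self.enc = nn.ModuleList(
            nn.Sequential(nn.Conv3d(ci, co, 3, padding=1), nn.LeakyReLU(0.2))
            for ci, co in zip(enc_in, widths))
        self.pool = nn.MaxPool3d(2)
        dec_out = (128, 64, 64, 64)          # last upsampling step outputs 64 channels
        dec_in = (widths[-1],) + dec_out[:-1]
        self.up = nn.ModuleList(
            nn.Sequential(nn.ConvTranspose3d(ci, co, 2, stride=2), nn.ReLU())
            for ci, co in zip(dec_in, dec_out))
        # After each upsampling step, the same-resolution encoder feature is
        # concatenated and fused back to `co` channels (fusion conv is an assumption).
        self.fuse = nn.ModuleList(
            nn.Sequential(nn.Conv3d(co + skip, co, 3, padding=1), nn.ReLU())
            for co, skip in zip(dec_out, reversed(widths)))
        self.head = nn.Conv3d(dec_out[-1], 1, kernel_size=1)   # single-kernel output conv

    def forward(self, x):                    # x: (B, 1, 64, 64, 128) pre-imaging volume
        skips = []
        for enc in self.enc:
            x = enc(x)
            skips.append(x)                  # feature map kept for the skip connection
            x = self.pool(x)
        # (The PRP module of Sec. 2.3.2 would transform the bottleneck x here.)
        for up, fuse, skip in zip(self.up, self.fuse, reversed(skips)):
            x = fuse(torch.cat([up(x), skip], dim=1))
        return torch.sigmoid(self.head(x))   # density/occupancy volume in [0, 1]
```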

[Figure 2: Detailed structure and training process of the cross-modal reconstruction network (CMR-Net)]

2.3.2 Projection-Reprojection module

In 3D images acquired from sparse data, the scattering structure information of anisotropic targets is often sensitive to factors like the number of observations, viewing angles, and noise levels. To ensure that the network can capture precise feature representations from structurally incomplete and variable 3D images, we introduce a PRP module between the encoder and decoder. This module is specifically designed to enhance the network’s representation capability and improve its generalization performance.

The structure details of the PRP module are depicted in Figure 3. The projection module comprises two fully connected layers. The first layer employs LeakyReLU as the activation function and is connected to the input feature map, compressing the features into a low-dimensional vector. The second fully connected layer removes the activation function and further reduces the dimensionality of the feature vector to obtain the latent representation vector z𝑧zitalic_z. The reprojection layer is the inverse operation of the projection layer, symmetrically enlarging the size of the feature vector and followed by a reshape operation to restore the feature vector to its original feature map size. The only distinction is that the last fully connected layer utilizes the ReLU function as the activation to ensure consistency in data dynamics within the network.
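
As a rough illustration, the sketch below implements this projection-reprojection scheme. The bottleneck feature shape (chosen to match the backbone sketch above) and the hidden and latent dimensions are assumptions not given in the text.

```python
import math
import torch
import torch.nn as nn

class PRPModule(nn.Module):
    """Sketch of the Projection-Reprojection module (Sec. 2.3.2).  Feature-map
    shape, hidden width, and latent size are illustrative assumptions."""

    def __init__(self, feat_shape=(128, 4, 4, 8), latent_dim=256, hidden_dim=1024):
        super().__init__()
        self.feat_shape = feat_shape
        n = math.prod(feat_shape)
        # Projection: two fully connected layers; only the first keeps an activation.
        self.project = nn.Sequential(
            nn.Linear(n, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, latent_dim))                 # latent vector z
        # Reprojection: symmetric enlargement; the last layer uses ReLU to keep
        # the data dynamics consistent with the rest of the network.
        self.reproject = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, n), nn.ReLU())

    def forward(self, x):                    # x: (B, C, D, H, W) bottleneck feature map
        b = x.shape[0]
        z = self.project(x.flatten(1))       # compress to the latent representation z
        out = self.reproject(z)              # expand back to the original size
        return out.view(b, *self.feat_shape), z
```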

[Figure 3: Structure of the Projection-Reprojection (PRP) module]

2.4 Differentiable volume rendering

The optical images of targets contain rich and intuitive structural information, and owing to their passive imaging mechanism, high-resolution optical images are often easier to obtain in practical applications than radar electromagnetic images. However, bringing the advantages of 2D optical images to SAR 3D imaging requires bridging the gap in data dimensionality. Differentiable volume rendering techniques offer a solution to this challenge. They can render 3D volume data into 2D images, and the differentiable nature of the rendering process allows the gradients of a 2D loss function to be computed with respect to the 3D structure. In this study, we introduce a differentiable rendering module at the end of CMR-Net to render the reconstructed 3D image into 2D images from different views. We use optical images to supervise the reconstruction quality, thus leveraging the cross-modal advantages.

Figure 4 illustrates the process of differentiable volume rendering. Given 3D volumetric imaging data V ∈ ℝ^{W×H×D}, a camera position o, and a viewing direction d, volume rendering obtains the pixel value C(r) along any camera ray r(t) = o + t·d using the formula:

C(\mathbf{r}) = \int_{t_1}^{t_2} T(t)\,\sigma(\mathbf{r}(t))\,dt \qquad (1)

where σ(r(t)) represents the volume density at the point r(t) along the viewing direction d, and dt denotes the step distance of the ray in each integration step. T(t) represents the cumulative transmittance, i.e., the probability that the ray propagates from t_1 to t without being intercepted, given by the following equation:

T(t) = \exp\left(-\int_{t_1}^{t} \sigma(\mathbf{r}(u))\,du\right) \qquad (2)

[Figure 4: Differentiable volume rendering process]

To compute this continuous integral, we discretely sample the ray at equidistant depths and use the quadrature rule discussed in [26] to estimate the pixel value C(r):

\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \Delta\delta)\right) \qquad (3)

where

T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \Delta\delta\right). \qquad (4)

The value Δδ denotes the distance between adjacent sampling points. The final rendered image is composed of the estimated pixel values of all camera rays passing through the pixels of the rendering canvas, expressed as:

I = \begin{bmatrix} \hat{C}(\mathbf{r}_{0,0}) & \cdots & \hat{C}(\mathbf{r}_{w-1,0}) \\ \vdots & \ddots & \vdots \\ \hat{C}(\mathbf{r}_{0,h-1}) & \cdots & \hat{C}(\mathbf{r}_{w-1,h-1}) \end{bmatrix} \qquad (5)

Here, r_{i,j} denotes the camera ray passing through the pixel (i, j), with i ∈ [0, w) and j ∈ [0, h); h and w denote the height and width of the rendered image, respectively.
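
The discrete sum in Eqs. (3) and (4) maps directly onto a few tensor operations. The sketch below is a simplified stand-in for the renderer, assuming the rays are already expressed in the volume's normalized [-1, 1] coordinate frame and using trilinear grid sampling; it is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def render_silhouette(volume, ray_origins, ray_dirs, t_near, t_far, n_samples=64):
    """Minimal sketch of Eqs. (3)-(5): alpha-composite the density grid along rays.
    `volume` is a (D, H, W) density tensor; ray origins/directions are assumed to be
    given in the volume's normalized [-1, 1] coordinate frame."""
    n_rays = ray_origins.shape[0]                                 # typically h * w rays
    t = torch.linspace(t_near, t_far, n_samples, device=ray_origins.device)
    delta = (t_far - t_near) / (n_samples - 1)                    # step distance Δδ
    # Sample points r(t) = o + t d for every ray.
    pts = ray_origins[:, None, :] + t[None, :, None] * ray_dirs[:, None, :]
    grid = pts.view(1, n_rays, n_samples, 1, 3)
    sigma = F.grid_sample(volume[None, None], grid, align_corners=True)
    sigma = sigma.view(n_rays, n_samples).clamp(min=0.0)          # volume density σ_i
    alpha = 1.0 - torch.exp(-sigma * delta)                       # 1 - exp(-σ_i Δδ)
    # T_i = exp(-sum_{j<i} σ_j Δδ), Eq. (4): shift densities by one sample.
    sig_shift = torch.cat([torch.zeros_like(sigma[:, :1]), sigma[:, :-1]], dim=1)
    trans = torch.exp(-torch.cumsum(sig_shift * delta, dim=1))
    pixels = (trans * alpha).sum(dim=1)                           # Eq. (3): Ĉ(r) per ray
    return pixels                                                 # reshape to (h, w) to form I
```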

3 Experiments

3.1 Experimental Settings

We created a dataset of simulated multi-baseline sparse aspects SAR 3D images of civilian vehicles and multi-view optical images to train CMR-Net. During validation, we conducted extensive comparative imaging experiments using a test set of simulated data and explored latent space interpolation. Additionally, we performed comparative imaging experiments on a real-world dataset, along with ablation experiments on the PRP module and data augmentation strategies. Notably, our network was trained exclusively on simulated data and then directly applied to infer real-world data. This section details our experimental settings and implementation methods.

3.1.1 Dataset and Augmentation

The multi-baseline sparse aspects SAR 3D images in our dataset were derived from the Civilian Vehicle Radar Data Domes dataset, publicly released by the United States Air Force Research Laboratory[27]. This dataset includes fully polarized, far-field X-band simulated electromagnetic scattering data for ten civilian vehicles, covering a 360° azimuth angle and a 30° to 60° elevation angle range. The distribution of viewpoints and the parameter settings of the simulated scene are illustrated in Figure 5 and detailed in Table 1.

We extracted omnidirectional data from eight elevation angles for five vehicles (two sedans, two SUVs, and one pickup) within an elevation angle range of 44.25° to 46°, with a sampling interval of 0.1875°. The 360° azimuth data was divided into 72 sub-apertures, each covering 5° with a sampling interval of 0.0625°. Nine sub-aperture images were then randomly selected and incoherently summed to produce the 3D pre-imaging result, which served as the input data for our dataset.

Table 1
Parameter                 | Value
Radar center frequency    | 9.6 GHz
Unambiguous range         | ≈ 15 m
Extrapolation extent      | ≤ 0.25°
Azimuth extent            | 360°
Elevation extent          | 30° to 60°

The multi-view optical images of vehicles used for supervision in the dataset were created by rendering digital 3D models of the vehicles. We collected 3D digital models identical to those in the CVDomes dataset and generated binary optical images of these vehicles from various viewpoints using computer rendering techniques. The process of vehicle modeling and dataset construction is illustrated in Figure 5.

In addition, we developed a data augmentation strategy to enhance the dataset by incorporating translation T𝑇Titalic_T, rotation R𝑅Ritalic_R, and scaling S𝑆Sitalic_S operations. As shown in the diagram, each geometric transformation applied to the original 3D imaging data is mirrored by an equivalent transformation applied to the digital 3D model, which is then reflected in the rendered images. This strategy expands the data space, improves the network’s generalization capability, and reduces alignment constraints in real-world data.
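
One way to realize such paired transformations, sketched below under assumed parameter ranges, is to resample the pre-imaging volume with a random yaw rotation, translation, and isotropic scaling, and to reuse the same parameters when rendering the supervision images; the specific ranges and the grid-resampling implementation are assumptions.

```python
import math
import random
import torch
import torch.nn.functional as F

def random_trs_augment(volume, max_shift=0.1, max_angle=10.0, scale_range=(0.9, 1.1)):
    """Sketch of the T/R/S augmentation in Sec. 3.1.1 applied to a (D, H, W) volume.
    Parameter ranges are illustrative assumptions."""
    angle = math.radians(random.uniform(-max_angle, max_angle))   # rotation R (yaw)
    tx = random.uniform(-max_shift, max_shift)                    # translation T (x)
    ty = random.uniform(-max_shift, max_shift)                    # translation T (y)
    s = random.uniform(*scale_range)                              # isotropic scaling S
    c, si = math.cos(angle), math.sin(angle)
    # Affine sampling grid in normalized [-1, 1] volume coordinates.
    theta = torch.tensor([[c / s, -si / s, 0.0, tx],
                          [si / s,  c / s, 0.0, ty],
                          [0.0,     0.0, 1.0 / s, 0.0]], dtype=volume.dtype)
    grid = F.affine_grid(theta[None], volume[None, None].shape, align_corners=False)
    out = F.grid_sample(volume[None, None], grid, align_corners=False)[0, 0]
    # The same (angle, tx, ty, s) must be applied to the digital 3D model so that
    # the rendered supervision images stay aligned with the augmented volume.
    return out, (angle, tx, ty, s)
```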

[Figure 5: Vehicle modeling, dataset construction, and data augmentation]

3.1.2 Implementation

In the pre-imaging stage, we set the imaging scene size to 3.2 m × 3.2 m × 6.4 m, with a spatial resolution of 0.05 m, resulting in data dimensions of 64 × 64 × 128. Before feeding the data into the network, we normalized the pre-imaging data by scaling the dynamic range to [0, 1].

The parameter details for each layer of the cross-modal reconstruction network are provided in Figure 2. For the differentiable rendering module, we set the size of the rendered images to 256 × 256 pixels. The camera is positioned 7 m away from the center of the scene and directed towards the center, with rendering performed from 8 fixed viewpoints.
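
For illustration, such fixed viewpoints can be generated as sketched below; only the 7 m distance and the look-at-center constraint come from the text, while the uniform azimuth spacing and the elevation angle are assumptions.

```python
import math
import torch

def fixed_viewpoints(n_views=8, distance=7.0, elevation_deg=30.0):
    """Hypothetical placement of the 8 fixed rendering cameras: `distance` metres
    from the scene centre, all looking at the centre (elevation angle assumed)."""
    origins, dirs = [], []
    el = math.radians(elevation_deg)
    for k in range(n_views):
        az = 2.0 * math.pi * k / n_views                  # evenly spaced azimuths
        o = torch.tensor([distance * math.cos(el) * math.cos(az),
                          distance * math.cos(el) * math.sin(az),
                          distance * math.sin(el)])
        origins.append(o)
        dirs.append(-o / o.norm())                        # viewing direction towards centre
    return torch.stack(origins), torch.stack(dirs)
```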

During the training stage, we set the hyperparameters as follows: a loss function scaling factor of L_s = 0.7, a batch size of 1, and the Adam optimizer with momentum parameters β1 = 0.5 and β2 = 0.9. The initial learning rate was set to 1×10⁻⁴ and gradually reduced to 5×10⁻⁵. The algorithm was implemented using the PyTorch framework, with both network training and inference conducted on a computer equipped with an NVIDIA RTX A6000 GPU. The network, trained solely on simulated data, was then evaluated on the test set and real-world data without any fine-tuning.
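
For illustration, these settings translate into the following PyTorch sketch. The loss terms that L_s = 0.7 scales are not fully specified in the text, so a mean-squared rendering loss is used purely as a placeholder, and the linear decay schedule and epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def make_optimizer(model, total_epochs=200):
    """Adam with the momentum parameters from Sec. 3.1.2; the learning rate decays
    from 1e-4 towards 5e-5 (decay shape and epoch count are assumptions)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.9))
    sched = torch.optim.lr_scheduler.LinearLR(opt, start_factor=1.0,
                                              end_factor=0.5, total_iters=total_epochs)
    return opt, sched

def training_step(model, renderer, volume_in, gt_images, opt, ls=0.7):
    """One cross-modally supervised step: reconstruct, render from the 8 fixed
    viewpoints, and compare with the binary optical images (MSE is a placeholder)."""
    opt.zero_grad()
    rendered = renderer(model(volume_in))         # e.g. (B, 8, 256, 256) silhouettes
    loss = ls * F.mse_loss(rendered, gt_images)   # placeholder rendering loss term
    loss.backward()                               # gradients flow through the renderer
    opt.step()
    return loss.item()
```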

3.1.3 Baseline

To demonstrate the enhancement effect of integrating optical cross-modal information on imaging quality, we compared our approach with traditional imaging methods and state-of-the-art (SOTA) deep learning techniques trained with same-modality supervision using full-aperture data as baseline controls. The traditional imaging methods included back-projection (BP)[25] and compressed sensing (CS) techniques[28]. The deep learning methods trained with same-modality supervision comprised SACNet[22] and UNet3D[23], both designed to enhance the quality of sparse multi-baseline SAR 3D imaging.

Additionally, to highlight the design advantages of the CMR-Net architecture, we used the backbones of SACNet and UNet3D as comparative networks. After incorporating the differentiable rendering (DR) module, we trained them using the same implementation as our proposed approach. These served as baseline controls for the cross-modal supervised methods.

3.1.4 Metrics

3.2 Simulated data results

Based on the experimental setup, we conducted imaging experiments using simulated data with varying accumulated aperture numbers (ranging from 4 to 12) and different signal-to-noise ratios (SNR) (ranging from 5 dB to 30 dB). These experiments aimed to demonstrate the superiority of the proposed method in imaging accuracy and its ability to handle highly sparse and low SNR data.

3.2.1 Imaging results of CMR-Net

Figure 6 illustrates the pre-imaging and reconstruction results of our proposed method for various civilian vehicle types with 8 accumulated aspects under an SNR of 30 dB. The ground truth images were generated by directly voxelizing the 3D models. From the figure, it is evident that the pre-imaging results show vehicle outlines as discrete point clusters. Although an increase in the number of accumulated aspects enhances the imaging of strong scattering structures on the vehicle body, significant structural deficiencies persist, and the features of different vehicle types remain indistinct.

In contrast, after cross-modal reconstruction, despite only introducing 2D optical image information, our method produces structurally complete and realistic 3D vehicle images. It effectively restores fine-grained features of different vehicle types.

[Figure 6: Pre-imaging and CMR-Net reconstruction results on simulated data]
[Figure 7: Imaging results of different methods on simulated data]

Figure 7 illustrates the imaging results of different imaging methods for various vehicle types with 8 accumulated apertures under an SNR of 30 dB. It can be observed that traditional imaging methods, by optimizing sub-aperture imaging and utilizing aperture synthesis techniques, achieve a certain degree of image enhancement compared to the directly incoherently summed pre-imaging results. However, their imaging resolution is limited by the number of observed apertures, resulting in imaging results characterized by distributed discrete point clusters and incomplete vehicle structures. The deep learning methods SACNet and UNet3D, trained with supervision using full-aperture data, encode prior information about vehicle structures. These methods can reconstruct relatively complete vehicle contour images using sparse aperture data. However, due to limitations imposed by electromagnetic properties, the imaging results lack planar structures on the vehicle body, restricting further improvement in imaging accuracy. In comparison, the backbone networks of SACNet and UNet3D, combined with the differentiable rendering module and supervised training using 2D optical images, effectively enhance imaging quality. The imaging results generated by such imaging frameworks demonstrate more complete vehicle body structures, surpassing the resolution limitations of electromagnetic image supervision. However, as the number of accumulated apertures decreases, these networks may encounter the problem of disappearing vehicle body structures. In contrast, the CMR-Net proposed in this paper achieves stable reconstruction results for all accumulated aperture numbers, indicating that our network design can more effectively handle highly sparse data and possesses stronger generalization capabilities.

3.2.2 Imaging at lower SNRs and fewer aspect numbers

The imaging capability under low signal-to-noise ratio is an important criterion for evaluating the performance of sparse imaging algorithms. To comprehensively assess the sparse imaging performance of our proposed method, we gradually decreased the signal-to-noise ratio from 25dB to 5dB while simultaneously reducing the accumulated aperture number from 12 to 4, conducting comprehensive imaging experiments. Figures 8 to 12 illustrate the imaging results of all tested vehicles.

In terms of quality, traditional imaging methods exhibit increasing noise and decreasing scattering structures on the vehicle body as the signal-to-noise ratio and accumulated aperture number decrease. The imaging results of the two deep learning methods supervised with full-aperture (full-resolution) images are also affected by noise. The inherently sparse vehicle contour features are eroded by noise sidelobes, leading to reduced feature discernibility.

[Figures 8–12: Imaging results of all tested vehicles under decreasing SNR and aspect numbers]

However, after cross-modal supervision training, the imaging results of both networks at low SNR still retain most of the vehicle exterior features, thanks to the introduction of optical information. Nevertheless, due to network constraints, some structural disappearance, distortion, and blurring phenomena are observed. In contrast, our method achieves stable imaging results even under the most stringent conditions (5dB, 4 accumulated aspects), demonstrating the imaging capability of highly sparse data under low SNR conditions.

3.2.3 Quantitative comparison

We computed quantitative metrics for the imaging results under all the aforementioned experimental conditions; the results are listed in Tables 2 and 3. As expected, deep learning methods trained with cross-modal supervision produce results closer to the true vehicle models than traditional methods and deep learning methods supervised with full-aperture electromagnetic images.

Table 2
Aspects number  |        12         |         8         |         4
SNR             | 30dB  15dB  5dB   | 30dB  15dB  5dB   | 30dB  15dB  5dB
BP              | 0.051 0.048 0.044 | 0.069 0.058 0.043 | 0.056 0.043 0.046
CS              | 0.025 0.030 0.022 | 0.042 0.020 0.025 | 0.038 0.022 0.028
SACNet[22]      | 0.075 0.075 0.063 | 0.078 0.068 0.066 | 0.071 0.074 0.072
UNet3D[23]      | 0.027 0.027 0.036 | 0.026 0.034 0.026 | 0.033 0.030 0.039
SACNet+DR       | 0.140 0.146 0.155 | 0.143 0.144 0.145 | 0.114 0.124 0.127
UNet3D+DR       | 0.506 0.507 0.506 | 0.518 0.485 0.508 | 0.492 0.498 0.465
CMR-Net (Ours)  | 0.593 0.593 0.562 | 0.591 0.575 0.548 | 0.551 0.534 0.504

Table 3
Aspects number  |        12         |         8         |         4
SNR             | 30dB  15dB  5dB   | 30dB  15dB  5dB   | 30dB  15dB  5dB
BP              | 1.088 1.166 1.095 | 1.134 1.138 1.111 | 1.148 1.096 1.137
CS              | 1.089 1.106 1.098 | 1.133 1.074 1.084 | 1.112 1.083 1.101
SACNet[22]      | 0.872 0.872 0.727 | 0.900 0.799 0.859 | 0.855 0.894 0.662
UNet3D[23]      | 0.329 0.334 0.390 | 0.338 0.363 0.363 | 0.349 0.358 0.404
SACNet+DR       | 0.385 0.409 0.403 | 0.502 0.382 0.400 | 0.777 0.639 0.661
UNet3D+DR       | 0.109 0.116 0.111 | 0.109 0.118 0.110 | 0.112 0.117 0.132
CMR-Net (Ours)  | 0.078 0.074 0.061 | 0.072 0.086 0.082 | 0.117 0.114 0.141

3.2.4 Latent interpolation

Figure 13 shows the results of the latent space interpolation experiment. We performed interpolation between latent space representations and generated the corresponding outputs. The figure illustrates that interpolation between a pair of vectors in the latent space maps to meaningful and smooth nonlinear interpolations in the image space through the network[29]. For example, the image outputs between each pair of green-framed images exhibit variations in vehicle body height and rear features that differ from any simulated vehicle type. This phenomenon confirms that the interpolation path of latent space features does not collapse into an "average" representation. We believe this is a valuable property of CMR-Net, as it indicates a broader imaging capability for vehicle targets.

[Figure 13: Latent space interpolation results]

3.3 Measured data results

To further validate the effectiveness of the method, we utilized a publicly available real-world dataset to demonstrate the generalization ability and practicality of our network in real-world scenarios.

3.3.1 Measured dataset

The real-world data we utilized are sourced from the GOTCHA Circular SAR dataset collected and released by the Air Force Research Laboratory[30]. This dataset was gathered in a scene containing numerous civilian vehicles. The radar operates at a center frequency of 9.6 GHz with a bandwidth of 640 MHz, functioning in circular SAR mode. It completed 8 circular passes at different altitudes, with average elevation angles of [45.66°, 44.01°, 43.92°, 44.18°, 44.14°, 43.53°, 43.01°, 43.06°]. The diversity in circular observation apertures and elevations enables three-dimensional imaging of scene targets. In line with the training conditions, we extracted data for validation experiments involving two SUV and two sedan vehicle models from the GOTCHA dataset. The digital reference images of the data collection scene, radar flight paths, and selected test vehicle models are depicted in Figure 14.

[Figure 14: GOTCHA data collection scene, radar flight paths, and selected test vehicles]

3.3.2 Imaging results

Figures 15 to 19 show the imaging results of the proposed method compared to other methods on the measured dataset with varying aperture accumulations. Due to the high noise level of the measured data, the traditional methods' imaging results in the first column exhibit an uneven distribution of scattering points, significant interference, and poor readability. With cross-modal supervised training, the networks roughly restore the vehicle shape. However, the reconstructions using SACNet and UNet3D as backbones still show serious image distortion and structural loss, and are sensitive to the amount of aperture accumulation. In contrast, our approach recovers a more regular and complete vehicle shape, maintaining stable performance even in experiments with extremely sparse data.

[Figures 15–19: Imaging results on the measured GOTCHA data with varying aperture accumulations]

4 Conclusion

In this paper, we propose a cross-modal reconstruction network to enhance sparse multi-baseline SAR 3D imaging of vehicle targets. Our network, combined with differentiable rendering technology, uses rendered visual images of vehicles as supervisory signals to improve reconstruction accuracy. Additionally, we design a projection-reprojection component and a data augmentation strategy to enhance the network’s generalization ability. Experimental results on both simulated and real-world datasets show that our cross-modal reconstruction network achieves superior imaging quality compared to traditional imaging methods and other network-based cross-modal techniques. Furthermore, the dataset used for training our network is generated using computer simulation technology, making the method easy to generalize and apply. Our method holds significant promise for sparse multi-baseline SAR 3D reconstruction and provides a novel approach to radar 3D reconstruction using deep learning technology.

References

  • Richards et al. [2010] M. A. Richards, J. Scheer, W. A. Holm, W. L. Melvin, Principles of Modern Radar, volume 1, Citeseer, 2010.
  • Zhu and Bamler [2014] X. X. Zhu, R. Bamler, Superresolving SAR tomography for multidimensional imaging of urban areas: Compressive sensing-based TomoSAR inversion, IEEE Signal Processing Magazine 31 (2014) 51–58. doi:10.1109/MSP.2014.2312098.
  • Zhu et al. [2018] X. X. Zhu, Y. Wang, S. Montazeri, N. Ge, A review of ten-year advances of multi-baseline SAR interferometry using TerraSAR-X data, Remote Sensing 10 (2018) 1374. doi:10.3390/rs10091374.
  • Yue et al. [2023] X. Yue, F. Teng, Y. Lin, W. Hong, Target anisotropic scattering deduction model using multi-aspect SAR data, ISPRS Journal of Photogrammetry and Remote Sensing 195 (2023) 153–168. doi:10.1016/j.isprsjprs.2022.11.007.
  • Chen et al. [2017] L. Chen, D. An, X. Huang, Z. Zhou, A 3D reconstruction strategy of vehicle outline based on single-pass single-polarization CSAR data, IEEE Transactions on Image Processing 26 (2017) 5545–5554. doi:10.1109/TIP.2017.2738566.
  • Li et al. [2018] Y. Li, Y. Lin, W. Hong, R. Xu, Z. Zhuo, Q. Yin, Anisotropic scattering detection for characterizing polarimetric circular SAR multi-aspect signatures, in: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, 2018, pp. 4543–4546. doi:10.1109/IGARSS.2018.8519260.
  • Rambour et al. [2020] C. Rambour, A. Budillon, A. C. Johnsy, L. Denis, F. Tupin, G. Schirinzi, From interferometric to tomographic SAR: A review of synthetic aperture radar tomography-processing techniques for scatterer unmixing in urban areas, IEEE Geoscience and Remote Sensing Magazine 8 (2020) 6–29. doi:10.1109/MGRS.2019.2957215.
  • Austin et al. [2009] C. Austin, E. Ertin, R. Moses, Sparse multipass 3D SAR imaging: Applications to the GOTCHA data set, Proceedings of SPIE - The International Society for Optical Engineering 7337 (2009). doi:10.1117/12.820323.
  • Zhu and Bamler [2010] X. X. Zhu, R. Bamler, Tomographic SAR inversion by L1-norm regularization—the compressive sensing approach, IEEE Transactions on Geoscience and Remote Sensing 48 (2010) 3839–3846. doi:10.1109/TGRS.2010.2048117.
  • Budillon et al. [2009] A. Budillon, A. Evangelista, G. Schirinzi, SAR tomography from sparse samples, in: 2009 IEEE International Geoscience and Remote Sensing Symposium, volume 4, 2009, pp. IV-865–IV-868. doi:10.1109/IGARSS.2009.5417514.
  • Yang et al. [2019] J. Yang, T. Jin, C. Xiao, X. Huang, Compressed sensing radar imaging: Fundamentals, challenges, and advances, Sensors 19 (2019) 3100. doi:10.3390/s19143100.
  • Potter et al. [2010] L. C. Potter, E. Ertin, J. T. Parker, M. Cetin, Sparsity and compressed sensing in radar imaging, Proceedings of the IEEE 98 (2010) 1006–1020. doi:10.1109/JPROC.2009.2037526.
  • Liu et al. [2022] C. Liu, Y. Wang, Z. Ding, Y. Wei, J. Huang, Y. Cai, Analysis of deep learning 3-D imaging methods based on UAV SAR, in: IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium, 2022, pp. 2951–2954. doi:10.1109/IGARSS46834.2022.9883292.
  • Wang et al. [2022] M. Wang, S. Wei, Z. Zhou, J. Shi, X. Zhang, Y. Guo, 3-D SAR data-driven imaging via learned low-rank and sparse priors, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–17. doi:10.1109/TGRS.2022.3175486.
  • Sun et al. [2021] Y. Sun, L. Mou, Y. Wang, S. Montazeri, X. X. Zhu, Large-scale building height retrieval from single SAR imagery based on bounding box regression networks, 2021. doi:10.48550/arXiv.2111.09460. arXiv:2111.09460.
  • Wang et al. [2021] S. Wang, J. Guo, Y. Zhang, Y. Hu, C. Ding, Y. Wu, TomoSAR 3D reconstruction for buildings using very few tracks of observation: A conditional generative adversarial network approach, Remote Sensing 13 (2021) 5055. doi:10.3390/rs13245055.
  • Sun et al. [2021] Y. Sun, Z. Huang, H. Zhang, Z. Cao, D. Xu, 3DRIMR: 3D reconstruction and imaging via mmWave radar based on deep learning, 2021. doi:10.48550/arXiv.2108.02858. arXiv:2108.02858.
  • Xu et al. [2022] G. Xu, B. Zhang, H. Yu, J. Chen, M. Xing, W. Hong, Sparse synthetic aperture radar imaging from compressed sensing and machine learning: Theories, applications, and trends, IEEE Geoscience and Remote Sensing Magazine 10 (2022) 32–69. doi:10.1109/MGRS.2022.3218801.
  • Wang et al. [2022] M. Wang, S. Wei, Z. Zhou, J. Shi, X. Zhang, Y. Guo, 3-D SAR autofocusing with learned sparsity, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–18. doi:10.1109/TGRS.2022.3210547.
  • Wang et al. [2021] M. Wang, S. Wei, J. Liang, Z. Zhou, Q. Qu, J. Shi, X. Zhang, TPSSI-Net: Fast and enhanced two-path iterative network for 3D SAR sparse imaging, IEEE Transactions on Image Processing 30 (2021) 7317–7332. doi:10.1109/TIP.2021.3104168.
  • Zhou et al. [2022] Z. Zhou, S. Wei, H. Zhang, R. Shen, M. Wang, J. Shi, X. Zhang, SAF-3DNet: Unsupervised AMP-inspired network for 3-D mmW SAR imaging and autofocusing, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15. doi:10.1109/TGRS.2022.3205628.
  • Wang et al. [2023] S. Wang, J. Guo, Y. Zhang, Y. Wu, Multi-baseline SAR 3D reconstruction of vehicle from very sparse aspects: A generative adversarial network based approach, ISPRS Journal of Photogrammetry and Remote Sensing 197 (2023) 36–55. doi:10.1016/j.isprsjprs.2023.01.022.
  • Wang et al. [2021] S. Wang, J. Guo, Y. Zhang, Y. Hu, C. Ding, Y. Wu, Single target SAR 3D reconstruction based on deep learning, Sensors 21 (2021) 964. doi:10.3390/s21030964.
  • Han et al. [2020] Z. Han, C. Chen, Y.-S. Liu, M. Zwicker, DRWR: A differentiable renderer without rendering for unsupervised 3D structure learning from silhouette images, arXiv preprint arXiv:2007.06127 (2020).
  • Dungan et al. [2010] K. E. Dungan, C. Austin, J. Nehrbass, L. C. Potter, Civilian vehicle radar data domes, in: E. G. Zelnio, F. D. Garber (Eds.), SPIE Defense, Security, and Sensing, Orlando, Florida, 2010, p. 76990P. doi:10.1117/12.850151.
  • Max [1995] N. Max, Optical models for direct volume rendering, IEEE Transactions on Visualization and Computer Graphics 1 (1995) 99–108. doi:10.1109/2945.468400.
  • Dungan et al. [2010] K. E. Dungan, C. Austin, J. Nehrbass, L. C. Potter, Civilian vehicle radar data domes, in: Algorithms for Synthetic Aperture Radar Imagery XVII, volume 7699, SPIE, 2010, pp. 242–253.
  • Ertin et al. [2007] E. Ertin, C. Austin, S. Sharma, R. Moses, L. Potter, GOTCHA experience report: Three-dimensional SAR imaging with complete circular apertures, Proc. SPIE (2007). doi:10.1117/12.723245.
  • Bojanowski et al. [2017] P. Bojanowski, A. Joulin, D. Lopez-Paz, A. Szlam, Optimizing the latent space of generative networks, arXiv preprint arXiv:1707.05776 (2017).
  • Casteel Jr. et al. [2007] C. H. Casteel Jr., L. A. Gorham, M. J. Minardi, S. M. Scarborough, K. D. Naidu, U. K. Majumder, A challenge problem for 2D/3D imaging of targets from a volumetric data set in an urban environment, in: Algorithms for Synthetic Aperture Radar Imagery XIV, volume 6568, SPIE, 2007, pp. 97–103.