Abstract
Compressed ultrafast photography (CUP) is a computational optical imaging technique that can capture transient dynamics at an unprecedented speed. Currently, the image reconstruction of CUP relies on iterative algorithms, which are time-consuming and often yield nonoptimal image quality. To solve this problem, we develop a deep-learning-based method for CUP reconstruction that substantially improves the image quality and reconstruction speed. A key innovation toward efficient deep learning reconstruction of a large three-dimensional (3D) event datacube (x, y, t) (x, y, spatial coordinate; t, time) is that we decompose the original datacube into massively parallel two-dimensional (2D) imaging subproblems, which are much simpler to solve by a deep neural network. We validated our approach on simulated and experimental data.
Compressed ultrafast photography (CUP) can capture transient events at 100 billion frames per second in a single shot with a sequence depth of up to hundreds of frames [1–3]. The operating principle is detailed in [1,2]. In brief, CUP uses a digital micromirror device (DMD) to spatially encode the dynamic scene and then records the resultant image with a streak camera whose entrance slit is fully opened. Given the spatiotemporal sparsity of the event, a compressed sensing (CS)-based algorithm can decode the spatiotemporal mixing along one spatial axis of the streak camera and reconstruct the 3D (x, y, t) event datacube.
Currently, CUP relies on the two-step iterative shrinkage/thresholding (TwIST) algorithm [4] to reconstruct the event datacube. The recovered image resolution is degraded by the temporal shearing operation of the streak camera [5]. Several improved reconstruction algorithms have been reported [5–7]. However, the resultant image quality is still nonoptimal, and these optimization-based methods typically need tens to hundreds of iterations to converge and often require fine-tuning of the regularization hyperparameter to obtain high-fidelity results, both of which are time-consuming. Their memory requirement is also high because of complex computations such as matrix inversion. Inspired by recent advances in applying deep learning (DL) [8] to computational imaging systems for faster and more accurate reconstruction [9–13], we present a DL-based method for CUP image reconstruction that improves the image quality and accelerates the reconstruction toward real-time display applications.
In CUP, as illustrated in Fig. 1(a), the dynamic scene is first imaged by a lens onto an intermediate image plane. A beam splitter then divides the light into two paths. The reflected light is directly imaged by an external CMOS camera (Thorlabs, CS2100M-USB). The transmitted light is relayed to a DMD (Texas Instruments, LightCrafter 6500) by a 4f imaging system consisting of a tube lens (Thorlabs, AC508-100-A) and a stereo objective (Olympus, MV PLAPO 2XC). A static pseudorandom binary pattern displayed on the DMD spatially encodes the dynamic scene. Each encoding pixel is turned either on (tilted −12° with respect to the DMD surface normal) or off (tilted +12° with respect to the DMD surface normal) and reflects the incident light in one of two directions. The light masked with the pattern is collected by the same stereo objective and relayed to the wide-open entrance slit of a streak camera (Hamamatsu, C13410-01A). Inside the streak camera, the incident light is temporally sheared along the vertical axis by a sweeping voltage according to its time of flight and recorded by an internal CMOS camera (Hamamatsu, ORCA-Flash 4.0) as a single 2D image. As discussed previously [1,2], the forward model of CUP can be expressed as
\[ E' = TSC\,I, \qquad E'' = T\,I, \tag{1} \]
where I is the intensity distribution of the dynamic scene, T is the spatiotemporal integration operator, S is the temporal shearing operator, C is the encoding operator that comes from the DMD, E' is the streak camera measurement, and E" is the external CMOS camera measurement. Equation (1) can be further concatenated as
\[ E = A\,I, \tag{2} \]
where E = [E', E"] and A = [TSC; T]. Given the known operator A and the spatiotemporal sparsity of the dynamic scene, the image reconstruction can be accomplished by solving an optimization problem,
\[ \hat{I} = \arg\min_{I} \left\{ \tfrac{1}{2}\,\lVert E - A I \rVert_2^2 + \beta \lVert I \rVert_{\mathrm{TV}} \right\}, \tag{3} \]
where ‖·‖₂ is the ℓ2 norm, ‖·‖TV is the total variation (TV) norm that encourages sparsity in the spatial gradient domain, and β is the regularization parameter. Currently, CUP uses the iterative TwIST algorithm for image reconstruction, which takes approximately 55 s (with 50 iterations) to reconstruct a (256, 256, 32) datacube.
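To make the operators in Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of a toy CUP forward model, in which C multiplies every temporal frame by the static DMD mask, S shifts each frame along y by its frame index, and T sums over time. The function names, array sizes, and noiseless setting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cup_forward(I, mask):
    """Toy streak-camera measurement E' = TSC I for a datacube I of shape (nx, ny, nt).
    mask is the static binary DMD pattern of shape (nx, ny)."""
    nx, ny, nt = I.shape
    encoded = I * mask[:, :, None]                 # C: spatial encoding of every frame
    E_streak = np.zeros((nx, ny + nt - 1))
    for t in range(nt):
        E_streak[:, t:t + ny] += encoded[:, :, t]  # S then T: shear along y, sum over time
    return E_streak

def cup_reference(I):
    """Toy external-CMOS measurement E'' = T I (unsheared time integration)."""
    return I.sum(axis=2)

# toy example: a random (x, y, t) scene and a pseudorandom binary mask
rng = np.random.default_rng(0)
I = rng.random((256, 256, 32))
mask = (rng.random((256, 256)) > 0.5).astype(float)
E_streak, E_cmos = cup_forward(I, mask), cup_reference(I)
```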
Fig. 1.
Schematic of the CUP system and CUP data acquisitions. (a) Schematic of the CUP system. DMD, digital micromirror device. (b) Schematic of CUP data acquisition. t, time; x, y, spatial coordinates of the dynamic scene; x’, y’, spatial coordinates at the streak camera; x”, y”, spatial coordinates at the external CMOS camera; C, spatial encoding operator; S, temporal shearing operator; T, spatiotemporal integration operator. The 3D image reconstruction can be decomposed into massively parallel 2D image reconstruction.
The reconstruction in CUP is inherently a large 3D problem: CUP captures 3D (x, y, t) data with a single 2D (x, y) measurement. A key to reducing the complexity of applying DL to the reconstruction is to recognize that, in the measurement operation, each (y, t) image slice in the 3D datacube is independent of the others along the x axis: the operators T, S, and C act on each column of every instantaneous (x, y) frame independently. As a result, each (y, t) image slice corresponds to a 1D compressed line image along y in the CUP measurement data. Therefore, the 3D image reconstruction can be decomposed into massively parallel 2D image reconstructions, as illustrated in Figs. 1(b) and 2. The measurements and the mask are first decomposed into independent line images (y, xᵢ) and line masks Cᵢ, where i is the column index in the 2D measurement (x, y) and the 2D mask C. The network input is then initialized to Aᵀy, where Aᵀ, the adjoint of A, functions as an approximate inverse operator to reduce the learning burden. Such a setting is reliable for computational imaging problems [9] and has been widely used in recent works [14,15]. The network output is the 2D image slice (y, t). Network groups are constructed for each specific line mask, and the image slices from the network groups are finally concatenated into the 3D datacube (x, y, t). Compared with a 3D mapping network that directly reconstructs the 3D datacube, this segmented reconstruction approach benefits from a smaller network, enabling faster training and requiring fewer training samples.
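Continuing the toy model above, the sketch below illustrates the column-wise decomposition and the adjoint-based initialization Aᵀy described in this paragraph: each column xᵢ of the measurements and of the mask defines an independent subproblem, and a crude back-projection of the sheared line image serves as the network input. All names are illustrative, not the authors' code.

```python
import numpy as np

def decompose_columns(E_streak, E_cmos, mask):
    """Split the 2D measurements and mask into independent per-column subproblems."""
    return [(E_streak[i], E_cmos[i], mask[i]) for i in range(mask.shape[0])]

def adjoint_init(streak_line, mask_line, nt):
    """Crude adjoint (back-projection) of one compressed line: un-shear the
    1D measurement (length ny + nt - 1) into a (ny, nt) slice and re-apply the mask.
    The reference CMOS line could be back-projected and appended similarly."""
    ny = mask_line.shape[0]
    slice_init = np.zeros((ny, nt))
    for t in range(nt):
        slice_init[:, t] = streak_line[t:t + ny] * mask_line
    return slice_init

# usage, continuing the toy example above: one network input per column x_i
nt = 32
subproblems = decompose_columns(E_streak, E_cmos, mask)
inputs = np.stack([adjoint_init(s, m, nt) for s, _, m in subproblems])   # (nx, ny, nt)
```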
Fig. 2.
Deep learning workflow and network architecture for CUP. The measurements and the mask are first decomposed into independent line images (y, xᵢ) and line masks Cᵢ, where i is the column index in the 2D measurement (y, x) and the 2D mask C. The input to the network is then set to Aᵀy, and the network output is the 2D image slice (y, t). Network groups are constructed for each specific line mask, and image slices from the network groups are then concatenated into the 3D datacube (x, y, t). The deep learning network uses a U-net structure. Notations: N, number of kernels; K, kernel size; S, stride; L, number of layers in the dense block; G, growth rate of the dense block. N#K#S# denotes the number of kernels, kernel size, and stride of the convolution layer, respectively. L#G# denotes the number of layers and growth rate inside the dense block, respectively.
The deep learning network (Fig. 2 inset) uses an encoder-decoder "U-net" architecture in which each convolution layer is replaced with a dense block (DB) to improve the training efficiency [16,17]. The encoder gradually condenses the spatiotemporal information into feature maps of increasing depth; the decoder recombines the information from the feature maps into the final image. Specifically, the input first goes through the "encoder" path, which consists of four DBs connected by max-pooling layers for downsampling. Each DB consists of multiple layers, and each layer contains batch normalization (BN), the rectified linear unit (ReLU) nonlinear activation, and convolution [17]. The intermediate output of the encoder packs rich information along the depth (activation maps) with small lateral dimensions. Next, the low-resolution activation maps go through the "decoder" path, which consists of four additional DBs connected by upsampling convolutional (up-convolution) layers. Four skip connections across different spatiotemporal scales along the encoder-decoder path preserve high-frequency information. After the decoder path, an additional convolutional layer and the last layer produce the network output. For the loss function, we measure the ℓ2 distance between the network's prediction and the ground truth image. To strongly constrain the solution with the forward model, we add a second term that enforces consistency with the forward model, i.e., with the encoded streak camera image and the image captured by the reference camera. We denote xᵢ as the ground truth image, x̂ᵢ as the network's prediction, with i being the image index in the training batch, and N as the training batch size. The resultant loss function is
\[ \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left( \lVert \hat{x}_i - x_i \rVert_2^2 + \lambda\, \lVert A\hat{x}_i - A x_i \rVert_2^2 \right), \tag{4} \]
where λ is the parameter that controls the relative weight of the two loss terms; λ is set to 1 in the training. After training, the reconstructed (x, y, t) datacube is predicted from the streak camera measurement, the CMOS camera measurement, and the mask.
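A hedged TensorFlow/Keras sketch of a loss with the structure of Eq. (4) is given below: an ℓ2 data term plus a forward-model consistency term, assuming the operator A is available as a differentiable function (here a per-column version of the toy model above, expressed with tensor ops). The function names and factory pattern are assumptions, not the authors' code.

```python
import tensorflow as tf

def make_cup_loss(forward_op, lam=1.0):
    """Loss with the structure of Eq. (4): l2 image error plus forward-model consistency."""
    def loss(y_true, y_pred):
        data_term = tf.reduce_mean(tf.square(y_pred - y_true))
        # re-project prediction and ground truth through A and compare
        model_term = tf.reduce_mean(tf.square(forward_op(y_pred) - forward_op(y_true)))
        return data_term + lam * model_term
    return loss

def make_forward_op(mask_line, nt):
    """Differentiable per-column operator A: mask, shear, and integrate a (y, t) slice."""
    def forward_op(slices):                       # slices: (batch, ny, nt, 1)
        encoded = slices * tf.reshape(tf.constant(mask_line, tf.float32), (1, -1, 1, 1))
        sheared = [tf.pad(encoded[:, :, t, 0], [[0, 0], [t, nt - 1 - t]])
                   for t in range(nt)]            # shift frame t by t pixels along y
        return tf.add_n(sheared)                  # time integration -> (batch, ny + nt - 1)
    return forward_op
```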
To generate the training dataset, we adopted two strategies. First, we assembled a collection of 3D image cubes x by applying various dynamics (such as image shifting and reshuffling of the 2D images in the 3D cube) to the MNIST database [18] and to 1000 in-house experimental images of different objects. To obtain the corresponding measurement dataset y, we applied the CUP forward operator A to the target image set x. To emulate experimental measurements, we encoded the dynamics with the mask captured in the real experiment and added shot noise to the measurement dataset y in the synthetic dataset. Second, we collected a small experimental dataset that contains the ground truth 3D image cube x. We obtained the ground truth datacube x by a line-scanning operation in the CUP system: we employed the DMD as a line scanner by turning on the DMD's (binned) mirror rows sequentially and recording the temporally sheared (x, t) image. By scanning the sample along the y direction, which is perpendicular to the entrance slit of the streak camera, we stacked all the (x, t) images to form the ground truth 3D datacube (x, y, t). Furthermore, to improve the network reconstruction accuracy, we performed image augmentation, including crops, shifts, flips, and affine transformations, on the training images to increase the training sample size. In total, 10,000 samples were used for training, consisting of MNIST (~5%), the in-house experimental objects (~70%), and experimental dynamics (~25%).
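The synthetic-data strategy described above can be sketched as follows, reusing the toy forward model and adjoint from the earlier blocks; the photon budget for the shot noise and the augmentation ranges are assumed values for illustration only, not the authors' settings.

```python
import numpy as np

def make_training_pair(datacube, mask, rng, photons=1000.0):
    """Build one (input, target) pair: apply the toy forward model, add shot noise,
    then form the per-column adjoint inputs (assumed photon budget for the noise)."""
    E_streak = cup_forward(datacube, mask)
    E_streak = rng.poisson(E_streak / E_streak.max() * photons) / photons   # shot noise
    nt = datacube.shape[2]
    inputs = np.stack([adjoint_init(E_streak[i], mask[i], nt)
                       for i in range(mask.shape[0])])   # (nx, ny, nt)
    targets = datacube                                   # column i -> ground-truth (y, t) slice
    return inputs, targets

def augment(datacube, rng):
    """Simple augmentation: random flip and circular shift along y (assumed ranges)."""
    if rng.random() < 0.5:
        datacube = datacube[:, ::-1, :]
    return np.roll(datacube, rng.integers(-8, 9), axis=1)
```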
We trained the network using the Adam optimizer [19] for 50 epochs. The learning rate was initialized to 10⁻³ and scaled down by a factor of 0.5 every 5 epochs. For each epoch, we used measurements of the same dynamics with different added noise realizations to improve the robustness of the network; a total of 50 different noise realizations at the same noise level were generated for the synthetic datasets over the 50 epochs of one training. The training was performed on a campus cluster with two GPUs (NVIDIA Tesla M2090) using Keras/TensorFlow. Once the network is trained, image reconstruction can be achieved in real time. To further quantify the speed improvement of the DL method, we compared the reconstruction time of DL, TwIST, and other algorithms (SALSA, FISTA, and GAP) [20–22]. The DL reconstruction of a (256, 256, 32) datacube is at least 60 times faster than that of the fastest of these algorithms, GAP.
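A minimal Keras training-loop sketch matching the stated schedule (Adam, 50 epochs, initial rate of 10⁻³ halved every 5 epochs, and a fresh noise realization each epoch) is shown below; build_unet and make_noisy_epoch_data are hypothetical placeholders for the network builder and the per-epoch data generator, and forward_op refers to the sketch above.

```python
import tensorflow as tf

def schedule(epoch, lr):
    """Learning rate: 1e-3 scaled down by 0.5 every 5 epochs."""
    return 1e-3 * 0.5 ** (epoch // 5)

model = build_unet()                              # hypothetical dense-block U-net builder
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=make_cup_loss(forward_op, lam=1.0))   # forward_op from the earlier sketch

for epoch in range(50):
    # regenerate the synthetic measurements with a fresh noise realization each epoch
    x_train, y_train = make_noisy_epoch_data(seed=epoch)  # hypothetical data generator
    model.fit(x_train, y_train, batch_size=16,
              initial_epoch=epoch, epochs=epoch + 1,
              callbacks=[tf.keras.callbacks.LearningRateScheduler(schedule)], verbose=1)
```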
We first validated the DL method on simulated data and benchmarked it against the TwIST algorithm. A 256 × 256 Shepp–Logan (S–L) phantom was used as the base image. The simulated dynamic scene contained nine frames, with the S–L phantom decaying exponentially. The intensity decay obeys I[n] = exp(−0.25n), where I is the intensity trace, n is the frame index, and 0.25 is the decay rate. The streak camera measurement was generated according to the forward model, and both shot noise and 1% Gaussian white noise were added. The dynamic scene was then reconstructed using the TwIST-based constrained reconstruction method [5] and our DL reconstruction method. Figures 3(a)–3(c) show temporally integrated images of the ground truth, the TwIST reconstruction, and the DL reconstruction, respectively. For the large bright patch in Region 1 and the small bright spot in Region 2, the DL reconstruction shows better contrast and resolution than the TwIST reconstruction. The boundary between the dark and bright patches in Region 3 is also more prominent in the DL result than in the TwIST result. To compare the reconstruction quality of the two methods across frames, Fig. 3(d) plots the normalized intensity against the frame index at the circled pixels indicated in Figs. 3(a)–3(c). For the TwIST reconstruction, the root mean square error (RMSE) of the reconstructed intensity trace against the ground truth intensity trace is 0.06, and the reconstructed decay rate is 0.26 (95% confidence bounds, nonlinear least squares fitting with a single exponential model). For the DL reconstruction, the RMSE of the intensity trace is 0.07, and the reconstructed decay rate is 0.28 (95% confidence bounds). The DL reconstruction thus provides a temporal reconstruction accuracy comparable to that of the TwIST algorithm.
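For reference, the decay-rate fitting used in this comparison can be reproduced along the following lines with a nonlinear least-squares fit of a single exponential; the reconstructed trace below is a noisy placeholder, not measured data.

```python
import numpy as np
from scipy.optimize import curve_fit

def single_exp(n, a, k):
    """Single-exponential model I[n] = a * exp(-k * n)."""
    return a * np.exp(-k * n)

frames = np.arange(9)
ground_truth = np.exp(-0.25 * frames)
recon_trace = ground_truth + 0.02 * np.random.default_rng(1).normal(size=9)  # placeholder

rmse = np.sqrt(np.mean((recon_trace - ground_truth) ** 2))
popt, pcov = curve_fit(single_exp, frames, recon_trace, p0=(1.0, 0.2))
half_width = 1.96 * np.sqrt(np.diag(pcov))     # approximate 95% confidence half-widths
print(f"RMSE = {rmse:.3f}, decay rate = {popt[1]:.3f} +/- {half_width[1]:.3f}")
```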
Fig. 3.
Results of the numerical simulation. (a)–(c) Temporally projected images of the ground truth, the TwIST reconstructed result, and the DL reconstructed datacubes. (d) Intensity trace against the frame index at the circled pixels in (a)–(c).
We then benchmarked the DL method against the TwIST algorithm on experimental data. We imaged the fluorescence decay of a fluorescent tissue paper upon pulsed laser excitation. The 515 nm picosecond pulsed laser (NKT Photonics, Genki-XPC, 7 ps pulse duration) first passed through an engineered diffuser and excited the fluorescent tissue paper. We separated the fluorescence from the excitation using a combination of a 532 nm dichroic mirror (ZT532rdc, Chroma) and a 590/50 nm bandpass emission filter (ET590/50m, Chroma). We then used a 10× objective (Olympus, UPLFLN 10X2) and a tube lens to relay the fluorescence to the intermediate image plane (shown in Fig. 1), where the CUP system collected the photons.
We reconstructed the dynamic scene using both the constrained TwIST method and the DL reconstruction method. Figure 4(a) presents the reference image captured by the external reference camera. Figures 4(b) and 4(c) show temporally integrated images of the TwIST and DL reconstructed datacubes, respectively. In the spatial domain, the DL results exhibit sharper boundaries and higher spatial resolution. Figure 4(d) shows the normalized intensity change over time at the circled pixels indicated in Figs. 4(b) and 4(c). For the TwIST reconstruction, the reconstructed fluorescence lifetime (reciprocal of the decay rate) is 6.13 ns (95% confidence bounds, nonlinear least squares fitting with a single exponential model). The DL reconstructed fluorescence lifetime is 6.29 ns (95% confidence bounds). Figures 4(e) and 4(f) show the frames at t = 0 ns, 1.3 ns, and 2.5 ns reconstructed by the TwIST and DL methods, respectively. These results indicate that, in the temporal domain, the reconstruction accuracy of DL and TwIST is similar.
Fig. 4.
Experimental results. (a) Reference image captured by the external CMOS camera. (b) and (c) Temporally integrated images of TwIST and DL reconstructed datacubes. (d) Time-lapse intensity change at the same circled pixels in (b) and (c). (e) and (f) Reconstructed frames at t = 0 ns, 1.3 ns, 2.5 ns by the TwIST and DL methods, respectively. Scale bar, 100 μm.
Although the DL method produces a high-quality image with a much reduced reconstruction time, the segmentation strategy used in the DL reconstruction may introduce boundary artefacts (stripes in each x–y temporal frame). These boundary artefacts can be removed in postprocessing [23]. The running time of this additional step is approximately 0.17 s, bringing the total DL reconstruction time (DL image reconstruction plus image postprocessing) to 0.27 s. This is still ~60 times faster than the fastest iterative algorithm, GAP. In addition, the DL reconstruction is less flexible than the iterative TwIST algorithm: the pretrained network works only with the specific mask pattern it was trained on. A potential solution is transfer learning, which bypasses the need to train the network from scratch [24]: the network pretrained on the old mask can be fine-tuned for a new mask, reducing the training time and the required training sample size. Moreover, for larger data sizes in the reconstruction, a deeper DL network and more training samples may be required for optimal performance.
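As one possible realization of the transfer-learning idea (a sketch under assumptions, not the authors' procedure), the network pretrained on the old mask could be reloaded and briefly fine-tuned on data simulated with the new mask; the file name, the frozen-encoder choice, and the helper functions are hypothetical.

```python
import tensorflow as tf

# load the network pretrained on the old mask (hypothetical file name)
model = tf.keras.models.load_model("cup_net_old_mask.h5", compile=False)

# optionally freeze the encoder half so only the decoder adapts to the new mask
for layer in model.layers[: len(model.layers) // 2]:
    layer.trainable = False

# recompile with a forward operator built from the new mask (helpers from earlier sketches)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=make_cup_loss(make_forward_op(new_mask_line, nt), lam=1.0))

# brief fine-tuning on a small dataset simulated with the new mask (hypothetical helper)
x_new, y_new = make_noisy_epoch_data(seed=0, mask=new_mask)
model.fit(x_new, y_new, batch_size=16, epochs=5, verbose=1)
```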
In conclusion, we developed a DL reconstruction method for CUP. Compared with the conventional TwIST algorithm, the DL method can recover the dynamic scene with sharper boundaries, higher feature contrast, and fewer artefacts while maintaining a similar temporal reconstruction accuracy. Moreover, the DL method increases the reconstruction speed by a factor of over 60, thereby enabling real-time reconstruction of large-sized event datacubes.
Acknowledgments
Funding. National Science Foundation (1652150); National Institutes of Health (R01EY029397, R35GM128761).
Footnotes
Disclosures. The authors declare no conflicts of interest.
REFERENCES
1. Gao L, Liang J, Li C, and Wang LV, Nature 516, 74 (2014).
2. Liang J, Ma C, Zhu L, Chen Y, Gao L, and Wang LV, Sci. Adv. 3, e1601814 (2017).
3. Liang J, Gao L, Hai P, Li C, and Wang LV, Sci. Rep. 5, 15504 (2015).
4. Bioucas-Dias JM and Figueiredo MAT, IEEE Trans. Image Process. 16, 2992 (2007).
5. Zhu L, Chen Y, Liang J, Xu Q, Gao L, Ma C, and Wang LV, Optica 3, 694 (2016).
6. Yang C, Qi D, Wang X, Cao F, He Y, Wen W, Jia T, Tian J, Sun Z, Gao L, and Zhang S, Optica 5, 147 (2018).
7. Yang C, Qi D, Liang J, Wang X, Cao F, He Y, Ouyang X, Zhu B, Wen W, Jia T, and Tian J, Laser Phys. Lett. 15, 116202 (2018).
8. Goodfellow I, Bengio Y, and Courville A, Deep Learning (MIT Press, 2016).
9. Barbastathis G, Ozcan A, and Situ G, Optica 6, 921 (2019).
10. Qiao M, Meng Z, Ma J, and Yuan X, APL Photon. 5, 030801 (2020).
11. Yuan X, Liu Y, Suo J, and Dai Q, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020).
12. Nehme E, Weiss LE, Michaeli T, and Shechtman Y, Optica 5, 458 (2018).
13. Ronneberger O, Fischer P, and Brox T, in Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS (Springer, 2015), Vol. 9351, p. 234.
14. Goy A, Arthur K, Li S, and Barbastathis G, Phys. Rev. Lett. 121, 243902 (2018).
15. Lyu M, Wang W, Wang H, Wang H, Li G, Chen N, and Situ G, Sci. Rep. 7, 17865 (2017).
16. Huang G, Liu Z, Maaten LVD, and Weinberger KQ, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), p. 2261.
17. Li Y, Xue Y, and Tian L, Optica 5, 1181 (2018).
18. http://yann.lecun.com/exdb/mnist/
19. Kingma DP and Ba J, in International Conference on Learning Representations (ICLR) (2015).
20. Nguyen T, Xue Y, Li Y, Tian L, and Nehmetallah G, Opt. Express 26, 26470 (2018).
21. Figueiredo MA, Bioucas-Dias JM, and Afonso MV, in IEEE/SP 15th Workshop on Statistical Signal Processing (IEEE, 2009).
22. Beck A and Teboulle M, SIAM J. Imaging Sci. 2, 183 (2009).
23. Yuan X, in IEEE International Conference on Image Processing (ICIP) (2016), p. 2539.
24. Guan J, Lai R, and Xiong A, IEEE Access 7, 44544 (2019).




