ISCTE-IUL

Abstract: Deep learning has shown promising results in several computer vision applications, such as style transfer. Style transfer aims at generating a new image by combining the content of one image with the style and color palette of another image. Applying style transfer to a 4D Light Field (LF), which represents the same scene from different angular perspectives, introduces new challenges and requirements. While the visual appeal of the stylized image is an important criterion for 2D images, cross-view consistency is also essential for 4D LFs. Moreover, the need for large datasets to train new robust models is another challenge, since few LF datasets are currently available. In this paper, a neural style transfer approach is used, along with a robust propagation based on over-segmentation, to stylize 4D LFs. Experimental results show that the proposed solution outperforms the state-of-the-art without any need to train new models or fine-tune existing ones, while maintaining consistency across LF views.


I. INTRODUCTION
Appealing paintings and artwork have attracted people for thousands of years. In the past, a skilled artist was always required to create a painting with a specific style, brush strokes and color palette, which typically took a long time. With the recent advances in learning-based techniques and the advent of style transfer, such creations can now be produced by computers. Style transfer is an image editing application in which a new image is generated by combining the content of one image with the style of another one (e.g., a famous painting). Style transfer is a long-standing research topic within the broader area of texture synthesis [1], [2]. Recently, with the rapid development of deep learning, neural networks have been used to solve the style transfer task. Gatys et al. [3] were the first to apply Convolutional Neural Networks (CNNs) to stylize an image. In their work, CNNs are used to extract the feature maps of the content image (i.e., the image from which the content will be transferred) and the style image (i.e., the image from which the style will be transferred). Afterwards, a target image (i.e., the stylized image that combines the content image with the style image) is iteratively optimized by minimizing a loss function. Johnson et al. [4] improved the performance of [3] by training a feed-forward network for each style image and generating a stylized image with only one forward pass in the testing stage. Although it is 3 times faster than [3], the solution in [4] is not flexible in terms of the number of styles that can be used, since it requires training for each style. Additionally, other neural networks have also been exploited to achieve style transfer, such as generative adversarial networks, which require paired training data to learn a specific style; such data is not always available, which may limit their applications [1]. Moreover, Neural Style Transfer (NST) has been extended to videos [5] and to different imaging modalities, such as stereo imaging [6] and 4D Light Fields (LFs) [7], [8]; interested readers are referred to the recent comprehensive reviews of existing NST solutions in [1], [2].
4D LFs carry rich information, since not only the light intensity but also the ray directions are captured [9]. LFs capture the same scene from different perspectives, thus enabling interesting applications such as depth or disparity estimation (disparity being the displacement of a point between different views, which is inversely proportional to the depth), view synthesis and post-capture refocusing [9], [10]. A 4D LF can be represented as an array of views $L(x, y, u, v)$, where $(x, y)$ are the spatial coordinates and $(u, v)$ are the angular coordinates of each view. When fixing one angular and one spatial coordinate, an Epipolar Plane Image (EPI) (i.e., the unique 2D spatio-angular LF slice, typically containing a regular structure with several oriented lines [11]) can be obtained, as illustrated in Figure 1.
While generating visually pleasant stylized images is an important criterion for 2D images, maintaining cross-view consistency is also essential for 4D LFs. More precisely, directly applying 2D image or video style transfer methods to all 4D LF views, without considering the correlation between them, may result in inconsistent stylized LFs with highly unnatural artifacts. Only a few solutions in the literature consider 4D LF cues in the style transfer application. Hart et al. [7] proposed an extension to the work of Johnson et al. [4] by adding a disparity loss term to the loss function. The disparity loss is computed as the difference between each stylized LF view and the stylized central view warped into that view. The disparity loss is then backpropagated through the network. This repeats for each LF view until convergence is reached. While their work considers cross-view consistency, it requires optimizing each LF view iteratively (assuming dense LFs). Moreover, although the feed-forward approach is fast, it needs to be trained for each style, which limits style selection flexibility. Egan et al. [8] addressed these drawbacks and proposed a novel NST method that considers local angular consistency. Their work extended the work of Gatys et al. [3] by adding a local angular consistency loss to the total loss function. Although their work ensures local angular consistency for LFs with larger disparity ranges, applying this optimization to each view is very time-consuming.
The contribution of this paper is a novel 4D LF NST method that overcomes the limitations of the existing methods by:
• Enabling NST flexibility (in terms of the number of styles that can be used) with less computational complexity: to achieve this, the optimization-based NST method [3] (which does not require training a model for each style image) is used. To significantly reduce the complexity of the optimization-based NST, only a limited set of views (i.e., the four corner views) is initially stylized using the method in [3] (unlike [7] and [8], which require optimizing each LF view).
• Improving 4D LF view-consistency: by exploiting LF over-segmentation (which adheres to object boundaries and maintains LF view-consistency), the edits from all corner views are propagated into each LF view using per-pixel disparity in an occlusion-aware manner.
The proposed method outperforms the existing solutions without training new NST models or fine-tuning existing ones.
The remainder of this paper is organized as follows: Section II describes the proposed method in detail, and Section III evaluates its performance through a series of experimental results. Finally, Section IV concludes the paper with some final remarks and proposes directions for future work.

II. PROPOSED METHOD
The proposed method contains four main steps, as presented in Figure 2. Given a style image and a 4D LF, the four corner views are initially stylized using optimization-based NST [3]. After that, disparity maps for all input LF views are estimated using [12] to ensure spatio-angular consistency during the propagation. Next, the 4D LF is over-segmented into spatio-angular coherent regions (a.k.a. superpixels), as in [13], to facilitate the propagation and to respect object boundaries and occlusions. Afterwards, the stylization is propagated into all LF views through occlusion-aware back-projection from each view into all corner views. Finally, the remaining isolated non-stylized pixels, which emerge after back-projection due to occlusions, are filled robustly. Each step is detailed in the following subsections.

A. Corner Views Stylization
Initially, only the four extreme corner views are stylized using the approach in [3]. The corner views are selected since they typically contain the maximum scene information, including dis-occlusions. The approach in [3] aims at minimizing the distances between the feature representations of the content/style image and those of the target image in one or more layers of the CNN. The target image is initially generated as a white noise image and iteratively optimized using the loss function, $\mathcal{L}_{total}$, defined by (1), where $\mathcal{L}_{content}$ is the content loss and $\mathcal{L}_{style}$ is the style loss. To ensure view-consistent stylization, the initial white noise is set to be the same for all corner views. Finally, to control the output, two weighting factors (i.e., the content weight, $\alpha$, and the style weight, $\beta$) are included:
$$\mathcal{L}_{total} = \alpha \, \mathcal{L}_{content} + \beta \, \mathcal{L}_{style} \qquad (1)$$
Notice that the proposed method is independent of the 2D NST method that is used. However, the approach in [3] is adopted because it can transfer any style and allows the target images to be controlled by adjusting the weights in (1). Moreover, any number of views or angular positions can be used, but the results may be influenced accordingly.
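As an illustration of the weighted loss in (1) and of the shared white-noise initialization, a minimal Python sketch is given below. It is not the implementation used in this paper: the function names, the view labels and the seed are hypothetical, the loss values are plain numbers rather than CNN feature distances, and the default weights are simply the values reported in Section III.

```python
import random


def total_loss(content_loss, style_loss, alpha=50.0, beta=1e3):
    """Weighted sum of a content loss and a style loss, in the form of (1).

    alpha (content weight) and beta (style weight) default to the values
    reported in Section III; the parameter names are assumptions.
    """
    return alpha * content_loss + beta * style_loss


def shared_noise(height, width, seed=0):
    """White-noise image with values in [0, 1), reproducible for a given seed."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(width)] for _ in range(height)]


# Hypothetical labels for the four corner views; every corner view starts
# from exactly the same noise image because the same seed is reused.
corner_views = ["top_left", "top_right", "bottom_left", "bottom_right"]
init = {name: shared_noise(4, 4, seed=0) for name in corner_views}
```

Reusing one seed is only one possible way to obtain identical initial noise; generating the noise once and copying it to every corner view would be equivalent.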

B. Disparity Maps Estimation
LF imaging provides rich information, which makes it possible to estimate a disparity map for each LF view. In this paper, the method proposed in [12] (which estimates disparities from each view to its right adjacent view) is used to estimate disparity maps for all LF views. Per-pixel disparity is used here to ensure consistent pixel projection during the propagation step.

C. Light Field Superpixel Creation
LF over-segmentation is capable of adhering to object boundaries and creating a unique label for each homogeneous region to facilitate subsequent editing tasks. In this paper, the recently proposed Adaptive LF Over-segmentation (ALFO) method [13] is used to guide the propagation in an occlusion-aware manner. The ALFO method exploits color, disparity and position features to apply adaptive K-means clustering. Additionally, it can robustly balance accuracy, shape regularity and view-consistency. In our experiments, the superpixel size is set to 20, as suggested in [13], as a reasonable size for robust adherence to object boundaries.

D. Occlusion-aware Propagation
Given the LF disparity maps, LF superpixels and stylized corner views, the stylization can now be propagated into all other 4D LF views. Initially, each LF view is back-projected into all corner views using its disparity map:
$$x^{r} = x^{(u,v)} + d_{h}^{(u,v) \to r}, \qquad y^{r} = y^{(u,v)} + d_{v}^{(u,v) \to r} \qquad (2)$$
where $x^{(u,v)}$, $y^{(u,v)}$ are the spatial coordinates of a pixel, $p$, located in the view with angular coordinates $(u, v)$; $x^{r}$ and $y^{r}$ are the spatial coordinates in a reference view (i.e., $r$ in this paper represents a single corner view, hence the same equation is applied to all corner views independently); and $d_{h}^{(u,v) \to r}$, $d_{v}^{(u,v) \to r}$ are the horizontal and vertical disparities from view $(u, v)$ to the reference view. The disparity estimation method used here [12] estimates disparity between adjacent views; therefore, for regularly sampled 4D LF views, back-projection is applied by multiplying the disparity value by $(u_r - u)$ and $(v_r - v)$ when computing $x^{r}$ and $y^{r}$, respectively [13].
These equations hold under the assumption of LF capturing with parallel optical axes, as in [13]-[16]. Otherwise, intrinsic and extrinsic camera parameters should be considered.
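The back-projection step can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name and argument names are hypothetical, a single adjacent-view disparity value is used for both the horizontal and vertical directions (regular sampling), the first angular index is assumed to pair with the horizontal spatial coordinate, and the sign convention of the disparity is assumed to match that of the estimator.

```python
def back_project(x, y, u, v, u_ref, v_ref, disparity):
    """Project pixel (x, y) of view (u, v) into the reference view (u_ref, v_ref).

    disparity is the adjacent-view disparity of the pixel (in pixels per view
    step); it is scaled by the angular offsets to reach the reference view.
    The returned coordinates are generally real-valued, not integers.
    """
    x_ref = x + (u_ref - u) * disparity
    y_ref = y + (v_ref - v) * disparity
    return x_ref, y_ref


# Example: the central view (4, 4) of a 9 x 9 LF projected into each corner view.
corners = [(0, 0), (8, 0), (0, 8), (8, 8)]
projections = [back_project(10.0, 20.0, 4, 4, ur, vr, 0.5) for ur, vr in corners]
```

Because the projected coordinates are real-valued, the color cannot be read directly from the reference view, which is why an interpolation over neighboring integer positions is needed.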
Since the projected pixel coordinates may belong to $\mathbb{R}^2$, and to ensure integer indexing (since the visual information is only available at integer indices), the four neighboring pixels of the back-projected pixel with integer positions ($\in \mathbb{Z}^2$), $\mathcal{N}_p = \{p_{tl}, p_{tr}, p_{bl}, p_{br}\}$, are considered, as presented in Figure 3. However, consistency is checked by comparing the label and disparity of the pixel at $(x^{(u,v)}, y^{(u,v)})$ with those of all pixels in $\mathcal{N}_p$, in order to choose which ones are used for the interpolation. To overcome possible projection errors, due to disparity errors or discontinuities in superpixels, two conditions are checked before interpolation:
• At least one pixel in $\mathcal{N}_p$ has the same label as the pixel at its original location $(x^{(u,v)}, y^{(u,v)})$.
• The absolute difference between the disparity of the pixel in view $(u, v)$ and the disparity of at least one pixel in $\mathcal{N}_p$ is less than a threshold value, $\tau$. We empirically set $\tau = 0.1$, since a superpixel of the size used here (i.e., 20) is typically observed to have similar disparity values.
If any of the above conditions holds for all or some of the pixels in $\mathcal{N}_p$, then only these pixels are valid for interpolation. Interpolation is applied by computing the bilinear interpolation of the valid pixels in $\mathcal{N}_p$; otherwise, no interpolation is computed.
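The consistency check followed by interpolation over the valid neighbors can be sketched as follows. This is a hedged illustration rather than the exact procedure of the paper: the function and argument names are hypothetical, a neighbor is treated as valid only when both its label matches and its disparity is close (a stricter reading of the two conditions), and the bilinear weights of the valid neighbors are renormalized so that they sum to one, which is an assumption about how partially valid neighborhoods are handled.

```python
import math


def interpolate_valid(x_ref, y_ref, label, disp, ref_color, ref_label, ref_disp, tau=0.1):
    """Interpolate a color at the real-valued position (x_ref, y_ref) of a reference view.

    ref_color, ref_label and ref_disp are row-major 2D lists (indexed [y][x])
    holding the stylized colors, superpixel labels and disparities of the
    reference view. Returns None when no neighbor passes the checks.
    """
    height, width = len(ref_color), len(ref_color[0])
    x0, y0 = math.floor(x_ref), math.floor(y_ref)
    fx, fy = x_ref - x0, y_ref - y0
    channels = len(ref_color[0][0])
    acc = [0.0] * channels
    weight_sum = 0.0
    for dy, wy in ((0, 1.0 - fy), (1, fy)):
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            xi, yi = x0 + dx, y0 + dy
            if not (0 <= xi < width and 0 <= yi < height):
                continue  # neighbor falls outside the reference view
            if ref_label[yi][xi] != label:
                continue  # different superpixel: likely another object
            if abs(ref_disp[yi][xi] - disp) >= tau:
                continue  # disparity mismatch: likely an occlusion
            w = wx * wy  # standard bilinear weight of this neighbor
            weight_sum += w
            for c in range(channels):
                acc[c] += w * ref_color[yi][xi][c]
    if weight_sum <= 0.0:
        return None
    return tuple(a / weight_sum for a in acc)
```

Returning `None` lets the caller skip a corner view in which the pixel is occluded, so that only valid back-projections contribute to the final color.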
After computing the interpolated value from each corner view, the pixel at its original angular location $(u, v)$ is set to the mean color value of all valid back-projections from the four corner views. The mean was selected after extensive experiments, since it shows the best visual and numerical results when compared to the median or a weighted sum, and it maintains consistency across LF views. By doing this, only very few sparse and isolated pixels remain unstylized, namely those that have no projection or that lie in regions invisible due to the angle of view. To fully stylize all LF views, these remaining isolated pixels are filled by inward interpolation using the widely used region-filling technique based on the Laplace equation, as in [17].
¹ Software implementation of all the used metrics can be found at: https://github.com/doegan32/Light-Field-Style-Transfer
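The fusion of the corner-view back-projections and the filling of the remaining holes can be sketched as below. The sketch is illustrative only: the function names are hypothetical, the hole filling is a simple Jacobi-style relaxation of the discrete Laplace equation on a single channel (each unknown pixel converges to the average of its 4-neighbors), and it is not claimed to reproduce the region-filling implementation of [17]. It assumes that the image contains at least one known pixel.

```python
def fuse_mean(candidates):
    """Mean color of the valid back-projections; None entries are skipped."""
    valid = [c for c in candidates if c is not None]
    if not valid:
        return None  # the pixel stays unstylized and is filled later
    n = len(valid)
    return tuple(sum(c[k] for c in valid) / n for k in range(len(valid[0])))


def fill_holes(image, iterations=500):
    """Fill None pixels of a single-channel image by inward interpolation.

    Known pixels are kept fixed; every hole pixel is repeatedly replaced by
    the average of its in-bounds 4-neighbors, which approximates a solution
    of the Laplace equation inside the hole.
    """
    height, width = len(image), len(image[0])
    holes = [(y, x) for y in range(height) for x in range(width) if image[y][x] is None]
    known = [v for row in image for v in row if v is not None]
    start = sum(known) / len(known)  # initial guess for every hole pixel
    out = [[start if v is None else v for v in row] for row in image]
    for _ in range(iterations):
        nxt = [row[:] for row in out]
        for y, x in holes:
            nb = [out[ny][nx]
                  for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                  if 0 <= ny < height and 0 <= nx < width]
            nxt[y][x] = sum(nb) / len(nb)
        out = nxt
    return out
```

For a color image, `fill_holes` would be applied to each channel separately; since the remaining holes are sparse and isolated, a small number of iterations is typically sufficient in this sketch.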

III. EXPERIMENTAL RESULTS
In this section, several methods are used as benchmarks to evaluate the performance of the proposed method. Firstly, two different baseline methods are considered, as in [8]: i) Independent View Stylization (IVS), in which an existing 2D NST method [3] is applied to all LF views independently; and ii) Pseudo Video Stylization (PVS), in which the method proposed for videos in [5] is applied to a pseudo video sequence of the 4D LF views. To the best of the authors' knowledge, only two recently proposed methods specifically tackle 4D LF challenges. The first one focuses on Global Angular Consistency Stylization (GACS) [7], and the second one focuses on Local Angular Consistency Stylization (LACS) [8]. Moreover, different synthetic and real-world LF datasets and style images are used, as shown in TABLE I. For quantitative evaluation, two different metrics are used to evaluate the view-consistency, namely: i) the LF Epipolar Consistency (LFEC) metric defined in [18]; and ii) the LF Angular Consistency (LFAC) metric¹ defined in [8]. The LFEC and LFAC metrics evaluate the angular consistency by back-warping LF views into a reference view and computing the color variance. Unlike the LFEC metric, which back-warps all LF views into the central view, the LFAC metric back-warps into the center view of a local window of views, in order to robustly handle large occluded regions. Both metrics require disparity to apply back-warping; therefore, we estimated per-pixel disparity maps for our results and for all benchmark methods using [12]. We noticed that, by using [12], the metric results of the benchmark methods are improved. Moreover, the disparity loss (i.e., the amount of disparity change) is evaluated using the disparity Mean Square Error (MSE) metric defined in [7]. This metric computes the MSE × 100 between the central view disparity maps estimated from the original LFs and from the stylized ones. As in [8], the disparity estimation method in [19] is used. Results of all metrics are presented in TABLE II. Due to the page limit, only the central view with horizontal EPIs is presented in TABLE III. However, we encourage the reader to see our dynamic results² for all LF views for a clear view-consistency evaluation. For the NST implementation, we used a standard GPU-based MATLAB implementation [20] and we set $\alpha = 50$ and $\beta = 10^3$, the same values as used in the benchmark methods.
The proposed method achieves the best angular consistency in both the LFEC and LFAC metrics, as can be seen in TABLE II. For the MSE metric, the GACS method achieves the best average results and better preserves object boundaries; hence, it generates central disparity maps that are similar to those of the original LFs. However, it requires a pre-trained NST model as input for each style image. In this paper, corner views are used to minimize the occlusions; hence, no large holes are left after propagation in densely sampled LFs. However, our method can be extended to sparse LFs that may have largely occluded regions by simply adding more reference views to cover all objects in the LF views. The technique used for filling the holes in dis-occluded regions after propagation may generate some artifacts (which also occur in the benchmark methods) and thus requires further investigation.
Regarding time complexity, the proposed method significantly reduces the time needed to stylize the entire LF: for an LF with 81 views, instead of taking $81 \times T_s$, where $T_s$ is the average time needed to stylize a single view, it takes less than $10 \times T_s$, including LF disparity estimation and superpixel generation. Finally, it can be observed that neither applying 2D methods to each view independently nor applying existing video methods is an adequate solution for 4D LFs.
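To make the disparity MSE metric concrete, a minimal Python sketch is given below. It only illustrates the description above (mean squared difference between two disparity maps, scaled by 100); the function name is hypothetical and any normalization or masking used in the reference implementation of [7] is not reproduced.

```python
def disparity_mse_x100(disp_original, disp_stylized):
    """Mean squared error between two equally sized disparity maps, times 100.

    Both inputs are row-major 2D lists of per-pixel disparities for the
    central view: one estimated from the original LF and one from the
    stylized LF.
    """
    squared = [(a - b) ** 2
               for row_a, row_b in zip(disp_original, disp_stylized)
               for a, b in zip(row_a, row_b)]
    return 100.0 * sum(squared) / len(squared)


# Usage example with two tiny 2 x 2 disparity maps.
original = [[0.0, 1.0], [2.0, 3.0]]
stylized = [[0.0, 1.0], [2.0, 2.0]]
score = disparity_mse_x100(original, stylized)
```

A lower value indicates that stylization changed the estimated scene geometry less, which is why this metric complements the LFEC and LFAC consistency metrics.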

IV. FINAL REMARKS
In this paper, a novel view-consistent 4D LF NST method is proposed. Without training new deep learning models or fine-tuning existing ones, we exploited an existing optimization-based NST method to initially stylize only the four corner views. Afterwards, the stylization is propagated from these views into all other LF views in an occlusion-aware manner by using LF superpixels. Experimental results show that the proposed method outperforms the considered benchmark methods and produces visually appealing results that are consistent across all LF views.
For future work, we will extend style transfer to sparse LFs that include wide occlusions. Additionally, we will study other applications of the proposed propagation technique, such as semantic segmentation and object removal, where the edits are applied in reference views and propagated into other LF views.

ACKNOWLEDGMENT
The authors would like to thank Mr. Dónal Egan for publishing the software of the method in [8], including the evaluation metrics and results, which facilitated the comparison.

Fig. 1. Example of light field representations: a) 4D light field represented as an array of views; b) horizontal and vertical EPIs.

Fig. 2. Overview of the proposed method for view-consistent 4D LF neural style transfer. By combining the style of a 2D image with the content of a 4D LF and applying an occlusion-aware propagation, a consistent stylized 4D LF is generated.

Fig. 3. Example of back-projection: a pixel in the $(u, v)$ view that needs to be stylized is back-projected into each corner view (in blue squares).

TABLE I. TEST IMAGES USED IN OUR EXPERIMENTS