Light Field Image Coding Using High-Order Intrablock Prediction

This paper proposes a two-stage high-order intrablock prediction method for light field image coding. This method exploits the spatial redundancy in lenslet light field images by predicting each image block, through a geometric transformation applied to a region of the causal encoded area. Light field images comprise an array of microimages that are related by complex geometric transformations that cannot be efficiently compensated by state-of-the-art image coding techniques, which are usually based on low-order translational prediction models. The two-stage nature of the proposed method allows us to choose the order of the prediction model most suitable for each block, ranging from pure translations to projective or bilinear transformations, optimized according to an appropriate rate-distortion criterion. The proposed higher order intrablock prediction approach was integrated into a high efficiency video coding (HEVC) codec and evaluated for both unfocused and focused light field camera models, using different resolutions and microlens arrays. Experimental results show consistent bitrate savings, which can go up to 12.62%, when compared to a lower order intrablock prediction solution and 49.82% when compared to HEVC still picture coding.


I. INTRODUCTION
Light Field (LF) imaging technology available in lenslet LF cameras allows radiance data and angular information from the light rays hitting the camera's sensor to be captured jointly, by multiplexing the LF data in a conventional 2D sensor. This is achieved through an array of microlenses, placed between the main lens and the camera sensor. Each microlens creates a micro-image (MI) on the sensor, which is the microlens perspective of the scene being captured through the main lens. Therefore, a lenslet light field image resembles the output of an array of very small cameras.
The additional knowledge of the scene angular information makes it possible to perform various a posteriori image processing tasks that are not straightforward with traditional cameras. Refocusing and change of perspective after the picture has been taken are the most common examples [1]. These functionalities, derived from the ability to capture the "whole observable" (LF) scene [1], may be advantageous for several applications, such as 3D television [2] (since 2D, 3D and multiview signals can be created by rendering several views from different perspectives), image recognition and medical imaging [3].
Depending on the position of the camera sensor and the microlens array relative to the main lens, different samplings of the light field can be performed, which essentially define the lenslet LF camera model [4]. Two main types of lenslet LF camera models exist: the unfocused [5] and the focused model [4]. In the classic unfocused camera model, the sensor is one focal distance away from the microlens array. Thus, the microlens array is focused at infinity, i.e., the light rays that reach the microlens array are parallel [5]. Consequently, the microlens array is completely defocused from the main lens image plane. Therefore, each microlens only captures angular information, meaning that each pixel within the MI corresponds to a different angle, or viewpoint [5]. In the focused lenslet LF camera model, the sensor is not placed at the microlens array focal distance and the microlens array is focused on the main lens image plane, allowing each microlens to generate a focused MI. This feature allows a higher spatial resolution for rendering, since more than one pixel can be extracted from each MI in the rendering process [4]. These models have been the basis for the deployment of this technology, enabling an increasing number of applications and users.
The growing interest in LF technology led the JPEG Committee to launch a new activity, known as JPEG Pleno, to address coding and representation of content generated by emerging imaging technologies such as LF, point-cloud and holographic technologies [6].
The large amount of data required to adequately represent an LF scene, when compared to typical 2D pictures, calls for efficient techniques for both transmission and storage of this type of content. In this context, several authors have proposed specific LF coding techniques that can be applied directly to the lenslet LF images, in order to exploit the redundancy between MIs. Alternatively, other techniques are applied to a different representation of the same LF, which comprises the viewpoint images, also known as sub-aperture images (SAIs). The SAIs are generated by extracting at least one pixel, in a fixed position, from each MI and organizing them into a matrix. Each SAI represents a rendered image, from a different perspective, extracted from the LF image.
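The SAI construction described above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not taken from the paper: the MIs are square, aligned with the pixel grid and arranged in a rectangular layout, and the function name `extract_sai` and its parameters are hypothetical.

```python
def extract_sai(lenslet, mi_size, row_in_mi, col_in_mi):
    """Build one sub-aperture image by taking the pixel at a fixed
    (row_in_mi, col_in_mi) offset from every micro-image."""
    rows = len(lenslet) // mi_size      # number of MIs vertically
    cols = len(lenslet[0]) // mi_size   # number of MIs horizontally
    return [
        [lenslet[r * mi_size + row_in_mi][c * mi_size + col_in_mi]
         for c in range(cols)]
        for r in range(rows)
    ]

# Toy 4x4 lenslet image made of 2x2 micro-images; each pixel stores its
# own (row, col) position so the sampling pattern is easy to inspect.
lenslet = [[(r, c) for c in range(4)] for r in range(4)]
sai = extract_sai(lenslet, mi_size=2, row_in_mi=0, col_in_mi=1)
print(sai)
```

Each choice of offset inside the MI gives a different SAI, so an MI of size n×n yields up to n×n views in this simplified setting.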
State-of-the-art LF coding schemes rely on block matching techniques to exploit the inherent spatial redundancy in lenslet LF images. However, these low order prediction (LOP) models use only two degrees of freedom (DoF), as only translations are used to describe the inherent LF image spatial redundancy.
Due to the small baseline between MIs in lenslet LF images, the different MIs can be approximately related by changes in perspective, which require eight DoF to be described. To the best of the authors' knowledge, this characteristic has not been exploited by previous LF coding approaches described in the literature. This additional matching accuracy is important to develop a coding method able to cope with important features of the LF content, such as: 1) the LF camera model, i.e., both focused and unfocused models should be handled; 2) the type of microlens array structure, e.g., rectangular or hexagonal microlens layouts, creating rectangular, hexagonal or circular MIs; 3) the MI size, i.e., a parameter that depends on the camera and has a strong influence on the number of possible rendered points of view and their spatial resolution.
High order prediction (HOP) models, e.g., using geometric transformations with more DoF, have been studied during the last two decades in traditional 2D and 3D image coding scenarios. Several geometric models, like translation, rotation, scale, shear and perspective changes, have been used to improve the coding efficiency by exploiting spatial [7], temporal [8]-[13] and inter-view [14]-[17] redundancy. In most proposals, these models have been applied image-wise (instead of block-wise), for two main reasons: (i) high computational complexity in block-wise model parameter estimation, and (ii) significant additional bit rate required for parameter transmission. Despite these drawbacks, this paper demonstrates that block-wise HOP models can increase block matching accuracy and, thus, coding efficiency for lenslet LF images.
The method proposed in this paper for encoding lenslet LF images relies on a two-stage block-wise HOP model, where each image block is intra predicted from a reference in the causal area of the image, i.e., containing pixels that were already encoded. Since this approach is applied block-wise, it is possible to optimize the HOP model (number of DoF) for each block to be encoded. Taking advantage of the extra DoF available in HOP models, it is possible to outperform state-of-the-art coding techniques based on LOP models.
The remainder of this paper is organized as follows: Section II presents a review of several relevant state-of-the-art solutions, regarding LF image coding; Section III describes the geometric transformations used in the proposed prediction method; Section IV presents the proposed HOP model; Section V presents the test conditions and experimental results; and, finally, Section VI concludes the paper.

II. RELATED WORK ON LIGHT FIELD IMAGE CODING
Several schemes to encode lenslet LF images are described in the literature, aiming to exploit the intra-LF image redundancy. These schemes rely on different LF image representations and coding techniques, which may be categorized according to the fundamental adopted approach as: transform-based coding, pseudo-video sequence coding, disparity-based coding and non-local spatial prediction coding.

A. Transform-based coding
Some LF coding schemes rely, essentially, on the use of a transform, mainly the discrete cosine transform (DCT) [18], [19] or the discrete wavelet transform (DWT) [20]. In [18], a 3D-DCT is applied to a stack of MIs, to exploit the existing spatial redundancy within an MI, as well as the redundancy between adjacent MIs. In [20], an LF image is decomposed into SAIs, and a 3D-DWT is applied to a stack of these SAIs. The lower frequency bands are transformed using a two-dimensional discrete wavelet transform (2D-DWT), while the remaining higher frequency coefficients are simply quantized and arithmetic encoded. These coding schemes are reportedly more efficient than JPEG, but not as efficient as HEVC still picture coding.

B. Pseudo-video sequence coding
This type of LF coding scheme represents the LF image as a set of MIs or SAIs, and re-organizes them into a low resolution pseudo-video sequence (PVS), which is then compressed using a standard video encoder. Various scanning strategies to order the PVS are considered to better exploit the redundancy between MIs or SAIs. Dai et al. [21] propose to scan the SAIs using either a raster or a spiral scan and then encode the generated video sequence with H.264/AVC. Vieira et al. [22] used a similar scanning strategy combined with several prediction structures supported by HEVC. In both cases it is possible to conclude that the spiral scan is more efficient than the raster scan. More recently, in the ICME light field compression challenge [23], Liu et al. [24] used a PVS scheme to organize the SAIs into layers, depending on the proximity to the central view, starting with the central SAI and moving on to the outer views. The more distant the SAI is from the center, the higher the value of the quantization parameter (QP) used. This scheme was implemented using both the HEVC test model (HM) and the JEM [25] software. Because the rate allocation is not uniform along the LF image, this method is prone to reconstruct views with different objective qualities.

C. Disparity-based coding
In this type of LF coding scheme, the LF image is considered as a set of views captured by different cameras (either in the form of MIs or SAIs), which may be encoded exploiting inter-view disparity. In [26], the authors propose a coding method that uses some SAIs to calculate a set of disparity maps prior to coding, which are then used to predict the remaining SAIs. The authors concluded that this approach is suitable to encode synthetic images, where disparity compensation alone can be enough to predict an SAI. A compression scheme that incorporates disparity compensation into 4D wavelet coding using disparity compensated lifting is proposed in [27]. The disparity information derived from an approximated model of the scene is applied to modify the update and prediction filters of the lifting procedure. In [28], the authors propose a scalable (two-layer) LF coding approach for the focused LF camera model, using an LF representation that consists of a sparse set of MIs and associated disparity maps. Based on the sparse set of MIs and the associated disparity maps (first layer), a reference prediction LF image is obtained through a reconstruction method that relies on disparity-based interpolation and inpainting. This reconstructed LF image is then used to encode the original LF image (second layer), by encoding the prediction residue. This approach was later extended [29] with a third layer of scalability and the use of lossy encoded disparity maps, in contrast with the lossless transmission of the disparity maps used in the first approach. Both versions of the work are able to outperform HEVC still picture coding.

D. Non-local spatial prediction coding
Several methods to exploit the non-local spatial redundancy were proposed as additional coding tools for existing video coding standards, like HEVC. In [30], a self-similarity compensated prediction is proposed to take advantage of the flexible partition patterns used by this video codec. In [31] this method was extended with a bi-directional mode to increase its coding efficiency. Additionally, in [32], an alternative non-local spatial prediction method has been investigated, relying on a prediction mode based on locally linear embedding integrated in HEVC. Unlike the other schemes that exploit non-local spatial redundancy, this method distributes the computational complexity between the encoder and the decoder, i.e., the locally linear embedding procedure must be replicated in the decoder. In [33] the authors developed a multi-hypothesis coding method specifically for focused LF image and video. This method uses up to two hypotheses for prediction in both spatial and time domains, which outperforms single-hypothesis based prediction. For the unfocused camera model, the authors concluded that the rate-distortion efficiency is still much higher than that of JPEG or HEVC; however, the gains relative to HEVC are smaller in this case than for the focused model [34].
The main advantage of this category is that, in most approaches, the lenslet LF images are encoded without the need of any pre-processing steps or any prior knowledge about the capturing device, e.g., the LF camera model, the microlens array structure and the MI size.

III. GEOMETRIC TRANSFORMATIONS FOR PREDICTION
In most state-of-the-art encoders, prediction between blocks of pixels is performed using very simple transformations, like translations. However, a lenslet LF image is comprised of MIs that are related by more complex transformations, resulting from the fact that each MI represents the scene being captured from slightly different perspectives. In such cases, it is advantageous to use geometric transformations that better exploit the features of the LF image and its MIs.
A geometric transformation (GT) is able to map perspective changes from one view (generically associated to a quadrilateral) into another view, requiring up to eight DoF.
Considering two different blocks, $A$ and $A'$, each one with its own coordinate system, $(u, v)$ and $(x, y)$, respectively, it is possible to define a generic relationship:
$(x, y) = \big(X(u, v),\ Y(u, v)\big)$    (1)
where $X$ and $Y$ are mapping functions for each coordinate. These functions create a point-to-point correspondence between images. Depending on the number of DoF used by the mapping functions in (1), a different number of independent point-to-point correspondences is possible. To describe these mapping functions, some GTs may be used, namely the Projective, the Bilinear or the simpler Affine GT, as illustrated in Fig. 1.

A. Projective geometric transformation
In order to simplify the mathematics used in this kind of GT, homogeneous coordinates are commonly used [35]. Thus, the Projective GT can be defined by a 3×3 matrix $H$ verifying (2):
$[x_h,\ y_h,\ w] = [u,\ v,\ 1]\, H$, with $x = x_h / w$ and $y = y_h / w$    (2)
The Projective matrix $H$ can be decomposed into three different submatrices, $H_T$, $H_L$ and $H_P$:
$H = \begin{bmatrix} H_L & H_P \\ H_T & 1 \end{bmatrix}$    (3)
Each submatrix is responsible for a different elementary type of GT: $H_T$ is responsible for the description of translations, $H_L$ is able to define linear transformations such as rotation, scaling, and shearing, and $H_P$ describes perspective transformations.
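To fully exploit the capabilities of the projective matrix $H$, a four-point correspondence is necessary between blocks $A$ and $A'$. In this case, the full transformation matrix corresponds to the following system of equations:
$x = \dfrac{h_{11} u + h_{21} v + h_{31}}{h_{13} u + h_{23} v + 1}, \quad y = \dfrac{h_{12} u + h_{22} v + h_{32}}{h_{13} u + h_{23} v + 1}$    (4)
where $h_{ij}$ denotes the element in row $i$ and column $j$ of $H$. The system of equations (4) defines the necessary calculations for mapping the coordinates of every pixel of block $A$ into the transformed block $A'$.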
The number of available DoF is directly related to the number of known points of correspondence that exist between both images. For fewer than four points of correspondence, simpler transformations can be represented by the perspective model. For example, if one point is known, the only component that can possibly be described is a translation, i.e., $H_T = [t_x, t_y]$, $H_P = [0, 0]^T$ and $H_L = I$. This case is defined by (5):
$[x_h,\ y_h,\ w] = [u,\ v,\ 1] \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ t_x & t_y & 1 \end{bmatrix}$    (5)
which can be translated into the system of equations (6):
$x = u + t_x, \quad y = v + t_y$    (6)
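As a minimal sketch of how a Projective GT acts on a pixel position, the snippet below applies a 3×3 matrix in homogeneous coordinates. It assumes the row-vector convention $[u, v, 1]\,H$, so that the translation sits in the last row and the perspective terms in the last column; the function name `project` and the numeric values are illustrative only.

```python
def project(H, u, v):
    """Map (u, v) through a 3x3 projective matrix H, row-vector convention:
    [xh, yh, w] = [u, v, 1] * H, followed by division by w."""
    xh = u * H[0][0] + v * H[1][0] + H[2][0]
    yh = u * H[0][1] + v * H[1][1] + H[2][1]
    w = u * H[0][2] + v * H[1][2] + H[2][2]
    return xh / w, yh / w

# One known point: identity linear part, zero perspective part, so only
# the translation (t_x, t_y) = (5, -3) remains (two DoF).
H_translation = [[1, 0, 0],
                 [0, 1, 0],
                 [5, -3, 1]]

# Same translation plus one small perspective term in the last column.
H_perspective = [[1, 0, 0.01],
                 [0, 1, 0],
                 [5, -3, 1]]

print(project(H_translation, 2, 4))
print(project(H_perspective, 10, 0))
```

With the perspective term set, the displacement of a pixel depends on its position inside the block, which a single translational vector cannot express.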

B. Bilinear geometric transformation
The Bilinear GT is an alternative to the Projective GT, defined by a 4×2 matrix $B$ verifying (7):
$[x,\ y] = [uv,\ u,\ v,\ 1]\, B$    (7)
where:
$B = \begin{bmatrix} B_P \\ B_L \\ B_T \end{bmatrix}$    (8)
The Bilinear GT matrix $B$ can represent similar GTs as the Projective GT, with the same number of DoF. However, it performs a non-planar transformation, which makes it more flexible. Thus, only horizontal and vertical lines, as well as equispaced points along these directions, are preserved [36]. Diagonal lines, on the other hand, are not mapped as lines but as quadratic curves. This feature is illustrated in Fig. 1, where, in the case of the Bilinear GT, points along vertical parallel lines are kept equispaced, while the points along diagonal lines are mapped onto a quadratic curve (bilinear-transformed block). When the Projective GT is used (projective-transformed block), points along the parallel vertical lines do not stay equispaced but points along diagonal lines are also mapped along a line. Another property of this GT, when compared to the Projective GT, is the need for simpler calculations per pixel, given by (9):
$x = b_{11} uv + b_{21} u + b_{31} v + b_{41}, \quad y = b_{12} uv + b_{22} u + b_{32} v + b_{42}$    (9)
where $b_{ij}$ denotes the element in row $i$ and column $j$ of $B$.
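The line-preservation property can be checked numerically. The sketch below expresses the Bilinear GT through the four corners of the target quadrilateral, which is equivalent to a mapping with a product term in the two coordinates. The normalised coordinates `(s, t)`, the corner ordering (clockwise from top-left) and the sample quadrilateral are assumptions made for illustration.

```python
def bilinear_map(corners, s, t):
    """Map normalised block coordinates (s, t) in [0, 1]^2 onto the
    quadrilateral with corners P0..P3 (clockwise from top-left)."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    x = (1 - s) * (1 - t) * x0 + s * (1 - t) * x1 + s * t * x2 + (1 - s) * t * x3
    y = (1 - s) * (1 - t) * y0 + s * (1 - t) * y1 + s * t * y2 + (1 - s) * t * y3
    return x, y

quad = [(0, 0), (8, 0), (12, 10), (0, 6)]  # arbitrary, not a parallelogram

# Points on a horizontal line (t fixed) stay on a line and equispaced:
a, m, b = (bilinear_map(quad, s, 0.5) for s in (0.0, 0.5, 1.0))

# Points on the diagonal (s == t) bend away from the straight chord:
d0, dm, d1 = (bilinear_map(quad, k, k) for k in (0.0, 0.5, 1.0))

print(a, m, b)
print(d0, dm, d1)
```

Here `m` is exactly halfway between `a` and `b`, whereas `dm` does not lie halfway between `d0` and `d1`, reflecting the quadratic behaviour along diagonals.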

C. Affine geometric transformation
When using either the Projective or the Bilinear GT, eight DoF are available. However, a simpler case exists, known as the Affine GT, which is able to describe GTs with up to six DoF. The Affine GT can be described as a particular case of the Projective or Bilinear GTs, by using matrices $H$ and $B$ with $H_P^T = [0\ 0]$ and $B_P = [0\ 0]$, respectively. This GT only requires three points of correspondence between images, as defined by (10).

IV. PROPOSED HIGH ORDER PREDICTION MODE
This section proposes an LF image coding method, based on a high order prediction model, which is implemented as a block-wise prediction mode in HEVC. This HOP mode is added to the set of HEVC Intra prediction modes, i.e., Planar mode, DC mode and the 33 intra Directional modes.
The proposed HOP mode predicts each block by applying a GT between two quadrilaterals: the current block and a block in the reference region, i.e., the causal area of pixels already encoded. The algorithm for the proposed prediction mode can be described through the following steps:
1) Selection of the next set of correspondence points to be evaluated: Selection of a quadrilateral in the causal area of pixels (from a set of pre-defined cases), with corners $\{P_i'\}$, that is mapped into the block being predicted, with corners $\{P_i\}$ (see left side of Fig. 2);
2) Calculation of the GT parameters: Calculation of the transformation parameters that map the quadrilateral defined by $\{P_i'\}$ into the one defined by $\{P_i\}$;
3) Inverse GT mapping: Mapping of the causal quadrilateral defined by $\{P_i'\}$ to the one defined by $\{P_i\}$, using an inverse mapping procedure with the parameters calculated in the previous step, in order to compute the block prediction error;
4) Estimation of the GT RD cost: Estimation of the rate-distortion (RD) cost, $J$, associated with the GT that is being evaluated, considering the computed block prediction error and the estimated number of bits to transmit the GT parameters;
5) Repeat the above steps to find the GT with minimum RD cost: Evaluate iteratively all the pre-defined combinations of correspondence points and choose the one that has the minimum RD cost.

A. Selection of the correspondence points
The major challenges faced by the proposed algorithm are the computational complexity required to estimate the optimal set of GT parameters and the necessary bit rate for transmitting this data. To tackle both problems, a rate-distortion-complexity tradeoff is defined. From Fig. 2 (left side) it can be inferred that, if all possible four-point correspondences between the prediction block and the current block to be encoded were evaluated, the number of tested transformations per block would be larger than $(2W^2)^4$, i.e., for a search window ($W = 128$) more than $1.15 \times 10^{18}$ correspondence possibilities per block exist. To reduce the number of tests to a practicable number, a two-stage minimization problem is proposed, aiming to determine a good approximation to the optimal HOP model, as illustrated in Fig. 2 (right side).

1) LOP Model Estimation
In the first stage, a pure translational LOP model (two DoF) is used. The result of this stage is the bidimensional vector, $\mathbf{v}$, with the lowest RD cost, pointing into the search window of the causal area (see the blue vector on the right side of Fig. 2). The search to determine $\mathbf{v}$ is performed using a full search algorithm, as described in [30]. The LOP estimation stage of the proposed HOP mode is based on the Self-Similarity (SS) prediction method. The prediction cost is minimized by testing all the possible positions inside the search window for a single vector that relates the current block to the prediction block. The $\mathbf{v}$ vectors generated by the first stage can be encoded either explicitly, similarly to motion vectors in HEVC, or using the SS-Skip mode, which creates a list of candidates that includes the $\mathbf{v}$ vectors used to encode neighboring blocks. If one candidate from this list is selected to encode the current block, it is only necessary to encode its index, as in the HEVC merge mode. Additionally, some predetermined vectors are added to the candidate list, referred to as MI-based candidates [30]. These candidates correspond to vectors that are very likely to be selected by the SS prediction mode, such as vectors pointing to the same spatial position of the current MI within the left, above and above-left MIs.
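A minimal sketch of such a first-stage full search is given below. It is not the HM or HEVC-SS implementation: the causal area is approximated by blocks lying fully above or fully to the left of the current block, a single picture stands in for both the original and the reconstructed samples, and the function name `lop_full_search` and the rate proxy are hypothetical.

```python
def lop_full_search(img, cx, cy, w, h, window, lam, rate):
    """Return (cost, (dx, dy)) of the best causal translational predictor
    for the w x h block whose top-left corner is (cx, cy)."""
    best = None
    for dy in range(-window, 1):               # never look below the block
        for dx in range(-window, window + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + w > len(img[0]):
                continue                       # outside the picture
            fully_above = ry + h <= cy
            fully_left = rx + w <= cx
            if not (fully_above or fully_left):
                continue                       # overlaps non-causal pixels
            # Distortion: sum of absolute differences in the pixel domain.
            dist = sum(abs(img[cy + j][cx + i] - img[ry + j][rx + i])
                       for j in range(h) for i in range(w))
            cost = dist + lam * rate(dx, dy)   # Lagrangian cost D + lambda * R
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best

# Toy picture that repeats every 4 pixels, mimicking 4x4 micro-images.
img = [[(c % 4) * 10 + (r % 4) for c in range(16)] for r in range(16)]
best_cost, best_vec = lop_full_search(
    img, cx=8, cy=8, w=4, h=4, window=8, lam=1.0,
    rate=lambda dx, dy: abs(dx) + abs(dy))     # hypothetical rate proxy
print(best_cost, best_vec)
```

On this periodic toy picture the selected vector points exactly one MI away, which is the behaviour the MI-based candidates are meant to anticipate.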

2) HOP Model Estimation
In the second stage, a HOP model (up to eight DoF) is used, employing as a starting point the result of the first stage (see, respectively, the red and blue quadrilaterals on the right side of Fig. 2). For this, a set of four vectors, $\{\vec{v}_{HOP}^{\,i}\}$, is computed, each of them defining the position of one corner of the reference quadrilateral, thus defining the 2D GT.
To further reduce the computational complexity of the second stage of this minimization problem, a 2D logarithmic fast search method has been adopted, which is applied to each corner of the prediction block (blue rectangle). In this case, the maximum number of search steps has been set to $\log_2(\min(b_x, b_y)) - 1$, depending on the size of the prediction block, i.e., $b_x$ (width) and $b_y$ (height). In each step, the searching points are defined according to a five-point small diamond-shaped basis pattern with an initial search step size equal to $\min(b_x, b_y)/4$ [37], [38]. This 2D logarithmic fast search method using the five-point small diamond-shaped basis pattern is graphically represented in Fig. 3 across three search steps, represented, respectively, by black circles, green pentagons and yellow triangles. After each search step, the point that minimizes the RD cost function is set as the center of the next step and the search step size is halved until a unitary step value is reached. In the example of Fig. 3, in the first corner ($P_0$), the five points associated with the first step, represented by the black circles, are tested. The point that minimizes the RD cost function for the first search step is the black circle on the top. For the second and third search steps, the points on the left, respectively, the green pentagon and the yellow triangle, are the points that yield the lowest RD cost. The final point is selected to define the red arrow that describes the corner displacement of the first corner of the block.
Considering that the search procedure must be applied to all the corners of the prediction block over several search steps, there are two ways of implementing this second stage search: by jointly optimizing each step of the search procedure for the four corners or by independently optimizing each step of the search procedure. By considering five points for each of the $S$ search steps of the 2D logarithmic search, the required number of search points for each option is given by $(5 \times S)^N$ or $5^N \times S$, respectively, where $N$ is the number of corners. In order to reduce the computational complexity, the second option was used, where each step is optimized individually.
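The orders of magnitude quoted in this section can be reproduced with a few lines of arithmetic. The value chosen for the number of search steps is a hypothetical example, since it depends on the block size; the variable names are illustrative.

```python
W = 128   # search window size quoted in the text
N = 4     # number of corners of the quadrilateral
S = 4     # hypothetical number of search steps (depends on block size)

exhaustive = (2 * W ** 2) ** 4   # all four-point correspondences
joint = (5 * S) ** N             # all search steps optimized jointly
stepwise = (5 ** N) * S          # each search step optimized on its own

print(f"exhaustive: {exhaustive:.3e}")
print(f"joint:      {joint}")
print(f"stepwise:   {stepwise}")
```

For these values the step-wise option needs 64 times fewer evaluations than the joint one, and both are many orders of magnitude below the exhaustive count.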
The stop condition for this search method is met when the corner step size reaches unity. Therefore, the example shown in Fig. 3 represents the unitary steps as the yellow triangles. Since the underlying codec uses variable block sizes, $S$ will depend on the block size. The search window for each corner is limited to $\min(b_x, b_y) - 1$, as illustrated in Fig. 3 (see the dashed red block).
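The logarithmic search for a single corner can be sketched as follows. This is a simplified illustration that moves one corner while the others are held fixed, with the RD cost supplied as a callable; the function name `log_diamond_search`, the toy cost and the choice of the smaller block dimension for the initial step are assumptions.

```python
def log_diamond_search(cost, start, initial_step):
    """2D logarithmic search with a five-point diamond pattern.
    cost: callable mapping a displacement (dx, dy) to an RD cost."""
    best, best_cost = start, cost(start)       # centre point of the diamond
    step = initial_step
    while step >= 1:                           # stop after the unitary step
        cx, cy = best                          # centre for this search step
        for dx, dy in ((0, -step), (-step, 0), (step, 0), (0, step)):
            cand = (cx + dx, cy + dy)
            c = cost(cand)
            if c < best_cost:
                best, best_cost = cand, c
        step //= 2                             # halve the step size
    return best, best_cost

bx, by = 16, 16                                # hypothetical block size
initial_step = min(bx, by) // 4                # 4, then 2, then 1

# Toy cost with its minimum at displacement (3, -2).
toy_cost = lambda d: abs(d[0] - 3) + abs(d[1] + 2)
corner_vec, corner_cost = log_diamond_search(toy_cost, (0, 0), initial_step)
print(corner_vec, corner_cost)
```

For a 16×16 block the step sizes 4, 2 and 1 give three search steps, and the largest reachable displacement along one axis is 4 + 2 + 1 = 7.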
The quadrilateral used by the HOP model estimation may be scaled to increase pixel precision. Fig. 2 and Fig. 3 illustrate the second stage applied to a blue rectangle with the same size as the block being predicted (in black), so as not to overload the figures. However, in our implementation, a rectangle twice the size of the original block is used to determine the HOP model. This modification means that an integer pixel displacement in one of the corners of the large quadrilateral corresponds to a sub-pixel displacement in the area of the original rectangle. For a block twice the size of the original block, one extra search step is performed by the 2D logarithmic search algorithm that is used at the HOP stage. This occurs because the stopping condition for the search algorithm is the unitary step size. The pixel precision can be further extended by using a rectangle with sides four or eight times the size of the original blue rectangle, which increases the number of search steps by one or two, respectively. After extensive testing, the best solution in an RD sense was adopted, namely increasing the blue rectangle to twice the original size, despite requiring one extra step.
As the second stage of the HOP search can be biased by the first stage result, the global result of the search method also tests the $\mathbf{v}$ vectors used in the previously encoded neighboring blocks (vector predictors), instead of considering only the best $\mathbf{v}$ vector from the first stage. Additionally, other $\mathbf{v}$ vectors can be tested in conjunction with the HOP model estimation, e.g., the top ten candidates from the first stage. However, it was experimentally verified by the authors that the vectors that are more RD cost efficient are the $\mathbf{v}$ vector predictors.
The proposed approach can be implemented using either the Projective GT defined in (3), which yields (11), or the Bilinear GT defined in (8), which yields (12), where $\mathbf{v}$ is the vector estimated during the LOP stage and $T'$, $L'$ and $P'$ are the GT parameters that describe the HOP stage.

B. Calculation of the GT parameters
After obtaining vector $\mathbf{v}$ (see the right side of Fig. 2), it is possible to determine submatrices $T'$, $L'$ and $P'$ in equations (11) and (12), by using the block width and height, $b_x$ and $b_y$, respectively, and the small vectors associated with the corner position change of the blue rectangle, as given by (13). Note that in the proposed two-stage approach, vectors $\vec{v}_i$, represented in the left of Fig. 2, correspond to the sum of vector $\mathbf{v}$ from the first stage with the four smaller vectors from the second stage, i.e., $\vec{v}_i = \mathbf{v} + \vec{v}_{HOP}^{\,i}$, represented in the right of Fig. 2. If the Projective GT is used, some auxiliary variables are defined in (14). The Affine GT can be defined by any three of the four vectors ($\vec{v}_{HOP}$). In this paper, the first three vectors, $\vec{v}_{HOP}^{\,0}$, $\vec{v}_{HOP}^{\,1}$ and $\vec{v}_{HOP}^{\,2}$, are generated using the second stage of the proposed approach, whereas the remaining vector, $\vec{v}_{HOP}^{\,3}$, is calculated assuming that both elements of $P'$ are zero, thus resulting in $\vec{v}_{HOP}^{\,3} = (x_0 - x_1 + x_2,\ y_0 - y_1 + y_2 - (b_y - 1))$, where $(x_i, y_i)$ are the coordinates of the displaced corner $i$. Using (14), the individual parameters in the submatrices can then be calculated by (15) for the Projective GT. Similarly, for the Bilinear GT, the corresponding submatrices are calculated by (16). For both cases, the remaining parameters are given by (17).
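For the Affine case, deriving the fourth corner vector amounts to completing a parallelogram from the other three corners. The sketch below illustrates this under assumptions that are not stated in the text: corners are ordered top-left, top-right, bottom-right, bottom-left, and coordinates are measured relative to the top-left corner of the undisplaced rectangle. The function name `fourth_corner_vector` is hypothetical.

```python
def fourth_corner_vector(v0, v1, v2, bx, by):
    """Displacement of the bottom-left corner implied by an affine
    (parallelogram) constraint on the other three corner displacements."""
    # Nominal corner positions: top-left, top-right, bottom-right.
    nominal = [(0, 0), (bx - 1, 0), (bx - 1, by - 1)]
    # Displaced corner coordinates (x_i, y_i).
    (x0, y0), (x1, y1), (x2, y2) = [
        (n[0] + v[0], n[1] + v[1]) for n, v in zip(nominal, (v0, v1, v2))
    ]
    # Parallelogram completion, then subtract the nominal position
    # (0, by - 1) of the bottom-left corner to obtain a displacement.
    return (x0 - x1 + x2, y0 - y1 + y2 - (by - 1))

# No displacement on three corners implies none on the fourth.
print(fourth_corner_vector((0, 0), (0, 0), (0, 0), bx=8, by=8))
# Moving only the bottom-right corner by 2 pixels (a horizontal shear)
# moves the bottom-left corner by the same amount.
print(fourth_corner_vector((0, 0), (0, 0), (2, 0), bx=8, by=8))
```

Only three vectors then need to be searched and transmitted for the Affine model, consistent with its six DoF.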

C. Inverse GT mapping
As previously mentioned, a GT between two blocks corresponds to a mapping of every pixel within one block into the other block, e.g., the mapping functions (4) and (9) correspond to the Projective and Bilinear GT, respectively. When the mapping is performed from the rectangular block to be encoded to an arbitrary reference quadrilateral it is called a direct mapping, otherwise it is called an inverse mapping. An example of both mapping procedures for a simple scaling GT can be found in Fig. 4. As can be observed in Fig. 4, when direct mapping is used the final quadrilateral shape (red block) does not match the desired reference block pixel grid, which requires pixel interpolation before the distortion between the transformed block and the reference block can be calculated. For the sake of simplicity, an inverse mapping has been adopted, as it generates a rectangular prediction block with the same dimensions as the block to be encoded.
Thus, regardless of the size of the quadrilateral used for estimation, (4) and (9) take as input the coordinates of the block to be encoded, i.e., $u \in [0, b_x - 1]$ and $v \in [0, b_y - 1]$, and generate as output the coordinates $(x, y)$ in the causal area, from which the reference pixel value is extracted. Since $x$ and $y$ are typically fractional values, a bilinear interpolation filter is used to compute the actual pixel value.
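The mapping and interpolation steps can be sketched together as follows. This is only an illustration: the Bilinear GT is expressed directly through the four corners of the reference quadrilateral rather than through the matrix parameters, the corners are assumed to lie inside the picture, blocks are assumed to be at least 2×2, and the names `bilinear_sample` and `predict_block` are hypothetical.

```python
import math

def bilinear_sample(img, x, y):
    """Bilinear interpolation of img at the fractional position (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    x1 = min(x0 + 1, len(img[0]) - 1)
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom

def predict_block(img, ref_corners, bx, by):
    """For every pixel (u, v) of the bx x by block, compute its position
    (x, y) inside the reference quadrilateral and interpolate the value."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = ref_corners  # TL, TR, BR, BL
    pred = []
    for v in range(by):
        t = v / (by - 1)
        row = []
        for u in range(bx):
            s = u / (bx - 1)
            x = (1 - s) * (1 - t) * x0 + s * (1 - t) * x1 + s * t * x2 + (1 - s) * t * x3
            y = (1 - s) * (1 - t) * y0 + s * (1 - t) * y1 + s * t * y2 + (1 - s) * t * y3
            row.append(bilinear_sample(img, x, y))
        pred.append(row)
    return pred

# Linear ramp picture: the interpolated value at (x, y) should be x + 10 * y.
img = [[c + 10 * r for c in range(8)] for r in range(8)]
# Reference rectangle shifted by half a pixel horizontally.
corners = [(2.5, 1), (5.5, 1), (5.5, 4), (2.5, 4)]
pred = predict_block(img, corners, bx=4, by=4)
print(pred[0])
```

The output always has the rectangular shape of the block to be encoded, whatever the shape of the reference quadrilateral.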

D. Estimation of the GT RD cost
The optimal HOP model for each block is determined through RD optimization, minimizing the associated Lagrangian cost, $J = D + \lambda R$, over the entire set of pre-defined GTs. $D$ refers to the distortion between the prediction block and the current block, $R$ is the estimated number of bits used to encode the block using the GT under evaluation, and $\lambda$ is the Lagrange multiplier, computed as in HM version 15.0 for Intra-coded frames. The parameter $\lambda$ is the same for all prediction modes, including the intra modes, so no biases in terms of prediction mode selection are introduced. In this paper, $D$ is computed as the sum of absolute differences (SAD) in the pixel domain, in the first stage, and SAD in the Hadamard domain, in the second stage, as suggested in [39].
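The two distortion measures can be contrasted on a 4×4 block. The sketch below uses an unnormalised 4×4 Hadamard transform and does not reproduce the scaling or block sizes used in HM; the function names are illustrative.

```python
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]

def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def sad(cur, pred):
    """Sum of absolute differences in the pixel domain."""
    return sum(abs(c - p) for rc, rp in zip(cur, pred) for c, p in zip(rc, rp))

def hadamard_sad(cur, pred):
    """Sum of absolute Hadamard-transformed differences (unnormalised)."""
    resid = [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(cur, pred)]
    coeffs = matmul(matmul(H4, resid), H4)     # H4 is symmetric
    return sum(abs(x) for row in coeffs for x in row)

pred = [[4] * 4 for _ in range(4)]
flat = [[5] * 4 for _ in range(4)]             # constant residual of 1
spike = [row[:] for row in pred]
spike[0][0] = 5                                # single-pixel residual of 1

print(sad(flat, pred), hadamard_sad(flat, pred))
print(sad(spike, pred), hadamard_sad(spike, pred))
```

A constant residual concentrates in one transform coefficient, while an isolated error spreads over all of them, so the Hadamard-domain measure tracks the cost of transform coding the residue more closely than the pixel-domain SAD.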
By using a two-stage method it is possible to evaluate whether it is more advantageous to use LOP or HOP for each block, by comparing the associated costs, given by:
$J_{LOP} = D_{LOP} + \lambda R_{LOP}$,
and
$J_{HOP} = D_{HOP} + \lambda R_{HOP}$,
where $R_{LOP}$ and $R_{HOP}$ are the estimated number of bits for the corresponding coding mode. The usage of LOP or HOP is conveyed to the decoder through a binary flag, $F_{HOP}$. When LOP is considered more efficient in an RD sense, $F_{HOP} = 0$, and only $\mathbf{v}$ is transmitted in the bitstream. On the contrary, if HOP is used, all elements that describe $\mathbf{v}$ are transmitted, followed by $F_{HOP} = 1$ and the four additional $\vec{v}_{HOP}^{\,i}$ vectors.
The number of bits required to signal the HOP mode, $R_{HOP}$, is the sum of $R_{LOP}$ and the estimated bits for encoding the four vectors, $\vec{v}_{HOP}^{\,i}$, that define the used HOP model. The rate of these small amplitude vectors is estimated using the same procedure as for vector $\mathbf{v}$.
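The per-block decision between LOP and HOP can be sketched as a comparison of two Lagrangian costs. The numeric values are made up for illustration, the function name `choose_model` is hypothetical, and resolving ties in favour of LOP is an assumption.

```python
def choose_model(d_lop, r_lop, d_hop, r_hop_vectors, lam):
    """Return (flag, j_lop, j_hop); flag is 1 when HOP is selected.
    r_hop_vectors: estimated bits for the four corner vectors."""
    j_lop = d_lop + lam * r_lop
    r_hop = r_lop + r_hop_vectors      # HOP also carries the LOP vector
    j_hop = d_hop + lam * r_hop
    flag = 1 if j_hop < j_lop else 0   # ties keep the cheaper-to-signal LOP
    return flag, j_lop, j_hop

# HOP wins when the distortion drop outweighs the extra vector bits.
print(choose_model(d_lop=1000, r_lop=12, d_hop=700, r_hop_vectors=20, lam=10.0))
# With a small distortion gain the extra rate is not worthwhile.
print(choose_model(d_lop=1000, r_lop=12, d_hop=950, r_hop_vectors=20, lam=10.0))
```

Because the multiplier is shared by both costs, a larger multiplier (lower bit rates) makes the extra corner vectors harder to justify.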

E. Encode the HOP mode information
After finding the optimal HOP model, the cost of the HOP mode, $J_{HOP}$, is compared against the cost of the other intra prediction modes, i.e., DC, Planar and the 33 Directional modes, and the mode with the lowest RD cost is encoded. For this, the context adaptive binary arithmetic coding (CABAC) entropy coding method of HEVC is used to encode the HOP mode information. The CABAC entropy coder is based on three steps: (i) binarization of syntax elements, (ii) context modeling, and (iii) binary arithmetic coding. In this implementation, these three steps have been maintained, although new contexts are used. Vectors $\mathbf{v}$ and $\vec{v}_{HOP}^{\,i}$, and flag $F_{HOP}$, are transmitted to the decoder using the HM approach for motion vectors and merge flags [40].
To encode $\mathbf{v}$, the same syntax elements of HEVC for motion data are used, i.e., motion vector differences, MVP index, reference picture list (RPL) and RPL index.
The way the HOP model information is conveyed to the decoder can highly influence the coding efficiency. One possible approach is to send the GT parameters, i.e., those in the $H$ or $B$ matrix, which need to be represented with high precision. Alternatively, as proposed in this paper, the encoder just sends the four vectors, $\vec{v}_{HOP}^{\,i}$, which can be represented with just a few bits. The major advantage of encoding the GT parameter matrix is that the parameters do not need to be recalculated at the decoder side through equations (13)-(17). However, they need to be encoded with a very high precision because these values are not very robust to quantization [11]. Consequently, encoding the vectors, $\vec{v}_{HOP}^{\,i}$, leads to higher compression efficiency.

V. EXPERIMENTAL RESULTS
In this section the performance of the proposed lenslet LF coding solution, incorporating the HOP mode, is evaluated in comparison with state-of-the-art coding solutions based on LOP approaches. First, this section describes the test conditions, including the used lenslet LF test images, the benchmark solutions and the relevant test parameters. Afterwards, experimental results comparing the RD performance of different types of prediction models are presented and discussed. These results are complemented with some statistical information about prediction mode usage and an evaluation of the quality of the rendered views, as proposed in [23], using the coded LF images.

A. Test conditions
In order to evaluate the RD performance of the proposed LF coding solution, two types of LF images were selected for the experimental test setup. The first type of images was acquired using LF cameras with a focused (FOC) optical setup [41], [42]. The second type of images was acquired using a Lytro Illum camera, which is commercially available and uses an unfocused (UNF) optical setup. This second set of images constitutes the dataset used for the 2016 ICME Grand Challenge on LF image compression, extracted from the EPFL dataset [23]. The central rendered views of all the test images are shown in Fig. 5, where the first row corresponds to the first type of images and the second and third rows correspond to the second one. This selection includes LF images with different resolutions, MI resolutions and types of microlens arrays, with different MI shapes. Plane and Toy images have a resolution of 1920×1088 (MIs 28×28); Demichelis images have a resolution of 2880×1620 (MIs 38×38); Laura and Seagull have a resolution of 7240×5432 (MIs 75×75); EPFL images have a resolution of 7728×5368 pixels (MIs 15×15).
The proposed HOP mode was implemented in the HEVC test model version 15.0 (HM 15.0) as an additional intra prediction mode. This LF codec, corresponding to the proposed solution, will be referred to as HEVC-HOP, whereas HEVC using only the standard Intra modes is simply referred to as HEVC. Additionally, the work in [30] is used as a benchmark for RD performance and is referred to as HEVC-SS.
The common HM test conditions were adopted, using QP values of 22, 27, 32 and 37. The causal window size is 128 for both HEVC-HOP and HEVC-SS, for every encoded image. The number of available SS or LOP vector predictors, used for coding, is 2. These vector predictors are used as additional LOP vector candidates for the LOP model estimation stage. As mentioned in the previous section, these alternative vectors are tested in order to have a less biased result when estimating the HOP model. The number of candidates available for SS-Skip is 5 in both HEVC-SS and HEVC-HOP.

B. Experimental results
All the LF images in Fig. 5 are encoded and decoded using the HEVC, HEVC-HOP and HEVC-SS codecs, and the RD performance is evaluated using the Bjøntegaard Delta metric. Additionally, several variants of HEVC-HOP are tested, referred to as HEVC-HOP-A, HEVC-HOP-P and HEVC-HOP-B, respectively for the Affine (six DoF), Projective (eight DoF) and Bilinear (eight DoF) GTs.
Table I shows the RD performance comparison between HEVC and HEVC-SS, and between HEVC-SS and the various HEVC-HOP variants.
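For readers unfamiliar with the Bjøntegaard Delta rate, the sketch below shows one common way of computing it from four RD points per codec: log-bitrate is modeled as a cubic function of PSNR, both curves are integrated over the overlapping PSNR range, and the average difference is converted into a percentage. This is an illustrative sketch rather than the exact tool used to produce Table I. The function names and the sample RD points are hypothetical. With exactly four points the cubic fit is an interpolation, and Simpson's rule integrates a cubic exactly.

```python
import math


def _cubic_through(xs, ys, x):
    """Evaluate the Lagrange polynomial through the four points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def _integrate(xs, ys, lo, hi):
    """Simpson's rule on [lo, hi]; exact for polynomials of degree <= 3."""
    mid = 0.5 * (lo + hi)
    return (hi - lo) / 6.0 * (_cubic_through(xs, ys, lo)
                              + 4.0 * _cubic_through(xs, ys, mid)
                              + _cubic_through(xs, ys, hi))


def bd_rate(anchor, test):
    """BD-rate in percent; inputs are lists of four (bitrate, psnr) pairs."""
    psnr_a = [p for _, p in anchor]
    psnr_t = [p for _, p in test]
    log_a = [math.log10(r) for r, _ in anchor]
    log_t = [math.log10(r) for r, _ in test]
    lo = max(min(psnr_a), min(psnr_t))  # overlapping PSNR interval
    hi = min(max(psnr_a), max(psnr_t))
    avg_diff = (_integrate(psnr_t, log_t, lo, hi)
                - _integrate(psnr_a, log_a, lo, hi)) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0


# Hypothetical RD points: the test codec needs 10% fewer bits at equal PSNR.
anchor_points = [(1000.0, 30.0), (2000.0, 33.0), (4000.0, 36.0), (8000.0, 39.0)]
test_points = [(0.9 * r, p) for r, p in anchor_points]
print(round(bd_rate(anchor_points, test_points), 4))
```

A negative value indicates bitrate savings of the tested codec relative to the anchor at the same objective quality.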

1) Comparison between LOP and HOP
Table I shows that HEVC-SS outperforms HEVC for all tests, with bitrate savings up to 45.35%. Nevertheless, all versions of the proposed HEVC-HOP method are even more efficient than HEVC-SS at encoding LF images. This increased performance, with bitrate savings up to 12.62% for certain images relative to HEVC-SS (49.82% relative to HEVC), comes from the use of a higher order prediction model. Since HEVC-SS is limited to two DoF, it is not able to accurately describe block transformations more complex than a simple translation. When comparing the effectiveness of adding prediction tools with more than two DoF, it is possible to notice that, for the encoded LF images, the best case is when eight DoF are used. If eight DoF are available, i.e., when HEVC-HOP-P is being used, four points of correspondence are transmitted, which allows the description of not only translations, but also rotations, scaling, shearing and perspective changes. In this case, although extra information needs to be encoded relative to the HEVC-SS case, the bitrate savings increase to 5.86% (28.81% relative to HEVC), on average, for all tested LF images.

2) Comparison between the proposed GTs
The proposed prediction mode HEVC-HOP-B, using a Bilinear GT, can achieve similar results to HEVC-HOP-P for most images, both in terms of average PSNR (BD-PSNR) and bitrate savings. However, comparing the average performance of each method regarding the type of camera model (AVG.FOC and AVG.UNF), it is possible to observe that HEVC-HOP-P is slightly more efficient for the unfocused model images and HEVC-HOP-B is slightly more efficient for the focused model images. In the case of HEVC-HOP-A only six DoF are available because only three points of correspondence are transmitted. When HEVC-HOP-A is compared to HEVC-HOP-P, the bitrate savings relative to HEVC-SS are reduced to 4.84% (28.12% relative to HEVC) on average considering all tested LF images, which may be due to the fact that HEVC-HOP-A is not able to compensate for perspective changes. However, in terms of computational complexity HEVC-HOP-A is approximately 4.5 times faster than HEVC-HOP-P. Note that none of the implementations is optimized in terms of computational complexity; therefore, the reported values for comparison may vary. It is worth mentioning that in some cases (e.g., the STONE test image) HEVC-HOP-A can outperform both eight DoF GTs. In HEVC-HOP-P four correspondence points are always encoded, even if only three are necessary. Since in such cases more information may be transmitted to describe the same GT, HEVC-HOP-P is, for this particular test image, less efficient than HEVC-HOP-A.
Regarding the computational complexity, a study was performed using the image VESPA, from the EPFL dataset of LF images. This image was encoded and decoded using the codecs HEVC, HEVC-SS, HEVC-HOP-A, HEVC-HOP-P and HEVC-HOP-B, with QP=32. These tests were performed using a PC equipped with an Intel Xeon CPU E3-1240 V2 @ 3.4 GHz and 24 GB of RAM, running Ubuntu 16.04. The obtained running times to encode and decode the image are shown in Table IV. The computational complexity of the proposed schemes must be compared to HEVC-SS, as it is used as reference. The HEVC-SS complexity is equivalent to encoding a P-Slice in HEVC [30]. As can be seen from Table IV, the proposed algorithm increases the computational burden at the encoder side, where HEVC-HOP-A, HEVC-HOP-P and HEVC-HOP-B are 1.51, 6.84 and 5.23 times more complex than HEVC-SS, respectively. However, at the decoder the running time is reduced in relation to HEVC-SS, because the proposed method uses a more efficient prediction and the LF image encoder therefore creates a lower number of partitions than HEVC-SS.
One of the most important advantages of the proposed prediction method is the ability to choose between the LOP stage and the HOP stage for each image block. This decision is taken based on RDO criteria, which allows the proposed HEVC-HOP to outperform HEVC-SS in all cases. In Table II it is possible to observe that, despite the HOP stage of the proposed prediction method being used more frequently than the LOP stage, there is always a considerable part of the image that is encoded using only the LOP prediction mode. However, the fact that the HOP stage is used more often than the LOP stage alone indicates that, although the proposed method HEVC-HOP-P requires additional overhead for transmitting the prediction information, i.e., one HOP flag and four HOP vectors, it is very efficient in reducing the distortion between the current block and the prediction block, therefore reducing the RD cost. An example of this can be seen in Fig. 6, where a comparison between the prediction block generated using either the LOP method or the proposed HOP method is shown. As can be seen, the prediction block generated using the proposed HOP stage (HOP: 5080) has a lower distortion with respect to the reference block than the prediction block using LOP (LOP: 7851). The parameters that describe the GT and that are transmitted to the decoder are the four HOP vectors, which in this specific case are {(3; 0), (4; 0), (2; −1), (2; 1)}.

3) Experimental results for rendered SAIs
To further evaluate the compression efficiency of the proposed HEVC-HOP-P, the objective quality of the SAIs extracted from the encoded LF images was tested. The experimental methodology adopted in [23] for the Lytro Illum unfocused camera setup was used. In this case, the PSNR-Y is calculated as an average of the PSNR-Y of 13×13 SAIs. This average PSNR-Y compares reference and reconstructed (i.e., encoded and decoded) SAIs. The processing chain designed to generate the SAIs consists of converting the hexagonal lenslet LF image to a square lenslet LF image (with 15×15 pixels per MI) and then extracting one pixel, in a fixed position from each square MI, to render each SAI. For this case only the images from [23] have been used. The only difference to the methodology in [23] is that, instead of using fixed compression ratios, the results are calculated using the reconstructed images attained with fixed QPs (22, 27, 32 and 37), i.e., the number of bits is the same as in the previously used methodology.
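The SAI rendering and averaging steps described above can be sketched as follows, assuming the lenslet image has already been converted to a square MI grid and is stored as a list of rows of luma samples. The function names are hypothetical, and the set of MI positions that forms the 13×13 SAIs is left as a parameter because it is not specified here.

```python
import math


def extract_sai(lenslet, mi_size, u, v):
    """Render one SAI by taking pixel (u, v) from every mi_size x mi_size MI."""
    rows = len(lenslet) // mi_size
    cols = len(lenslet[0]) // mi_size
    return [[lenslet[r * mi_size + v][c * mi_size + u] for c in range(cols)]
            for r in range(rows)]


def psnr(ref, rec, peak=255.0):
    """PSNR between two equally sized images given as lists of rows."""
    count = sum(len(row) for row in ref)
    sse = sum((a - b) ** 2
              for ref_row, rec_row in zip(ref, rec)
              for a, b in zip(ref_row, rec_row))
    if sse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak * count / sse)


def average_sai_psnr(ref_lenslet, rec_lenslet, mi_size, positions):
    """Average PSNR over the SAIs rendered at the given (u, v) MI positions."""
    values = [psnr(extract_sai(ref_lenslet, mi_size, u, v),
                   extract_sai(rec_lenslet, mi_size, u, v))
              for u, v in positions]
    return sum(values) / len(values)


# Toy 4x4 lenslet image with 2x2 MIs; the "decoded" image is off by one everywhere.
reference = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
decoded = [[p + 1 for p in row] for row in reference]
print(extract_sai(reference, 2, 1, 1))
print(average_sai_psnr(reference, decoded, 2, [(0, 0), (1, 0), (0, 1), (1, 1)]))
```

Because the distortion is measured on the rendered views rather than on the lenslet image that the encoder optimizes, the averaged value need not follow the lenslet-domain PSNR exactly.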
From the results presented in Table III it is possible to observe that there is a coherence between the results for the encoded LF images and the rendered SAIs. The average bitrate savings achieved by HEVC-HOP-P relative to HEVC-SS are very similar for both cases. Nevertheless, as the RD cost was not optimized on each SAI, the results are not exactly the same. As previously explained, when using the HEVC-HOP-P method the LF image is encoded "as is", without the need to know any information about the used lenslet based LF camera. This may explain the bitrate increase for image STONE, as the proposed method calculates the RD cost based on the lenslet LF image, instead of the generated view or SAI.
Additionally, a comparison between the proposed HEVC-HOP-P and the state-of-the-art method [24] was performed. In [24], a comparison in relation to JPEG, using the EPFL dataset, reports a gain of 4.54 dB in the BD-PSNR. Similarly, this image set was encoded by the proposed HEVC-HOP-P, with QPs {22, 27, 32, 37}, achieving a BD-PSNR gain of 4.83 dB in relation to JPEG. For a wider QP range {17, 22, 27, 32, 37}, the gain of HEVC-HOP-P decreases to 4.34 dB. Thus, it is fair to assume that the proposed HEVC-HOP-P has a very similar performance to the state-of-the-art method [24].

4) Results for different lenslet LF camera models
In general, the bitrate savings across the different codecs when compared to HEVC are higher when encoding LF images captured with cameras using a focused LF camera model. This can be seen when comparing the average bitrate savings for the LF images captured with an unfocused camera model or the focused camera model in Table II. This occurs because HEVC-SS and HEVC-HOP are based on matching prediction tools. In the focused images, the MIs are focused and therefore sharper than in the unfocused images. In sharper MIs, more prominent features exist and therefore the block matching is more reliable [34]. Additionally, since the incident light in the camera's sensor in the unfocused case is focused at infinity, the disparity between MIs tends to be zero, which means that theoretically no perspective compensation can be matched. This is supported by the noticeably lower relative prediction mode usage, shown in Table II, for the proposed prediction method as well as SS-Skip for most LF images captured with unfocused camera models.
The proposed HOP model is also better suited to adapt to nonrectangular MI shapes, e.g., hexagonal and circular, when compared to LOP model based methods. This happens because the corners of the prediction blocks, when using the proposed HOP model, can move independently and thus adapt to different block shapes.

VI. CONCLUSIONS
In this paper, a HOP mode for LF image coding was proposed, using geometric transformations of up to eight DoF. The proposed HOP mode is a two-stage block-wise approach that is able to achieve RD efficiency gains relative to a LOP state-of-the-art solution for LF image coding and to HEVC. These gains occur regardless of the LF camera model, MI and LF image resolution and microlens array type. Experimental results show average bitrate savings of 5.86% and 28.81%, when compared to a LOP state-of-the-art solution and HEVC, respectively, across different types of LF images, when using the Projective GT. It is also possible to conclude that the GTs with eight DoF, namely Projective and Bilinear, are generally more efficient than the Affine GT in an RD sense. An additional testing methodology, based on the SAI objective quality, was also used to confirm the bitrate savings when comparing HOP with LOP tools. In this case the average bitrate savings achieved are 4.81% and 31.77% when comparing the HOP mode with the state-of-the-art LOP solution and HEVC, respectively. Future work will include the investigation of GT parameter prediction techniques and optimal HOP model selection, aiming to combine them in the same codec. Additionally, entropy encoding improvements will also be considered, namely, the binarization of the HOP model vectors.

Fig. 2 - Block prediction using a HOP model: generic single-stage HOP model mapping (left side), and proposed two-stage HOP model mapping (right side).

Fig. 3 - Fast search method adopted for each corner of the prediction block (blue rectangle) used to estimate the HOP model (red quadrilateral).

Fig. 4 - Example of Direct Mapping and Inverse Mapping when a scale GT is applied.

Fig. 5 - LF test images part of the experimental test setup. First row (from left to right): Plane and Toy (frame 0 and 150), Demichelis Spark (frame 0), Demichelis Cut (frame 0), Laura and Seagull. Second and third rows: sub-set of the LF EPFL dataset.

Fig. 6 - Comparison between the prediction block generated by LOP and HOP stages.

TABLE I - BD-PSNR-Y AND BD-RATE RESULTS COMPARING HEVC, HEVC-SS (TWO DOF) AND HEVC-HOP, USING SIX DOF AND EIGHT DOF AND TWO DIFFERENT KINDS

TABLE II - AVERAGE PREDICTION MODE USAGE ACROSS THE FOUR QPS, IN PERCENTAGE OF PIXELS, FOR THE HEVC-HOP-P CASE