Improved Tests for Forecast Comparisons in the Presence of Instabilities

Of interest is comparing the out-of-sample forecasting performance of two competing models in the presence of possible instabilities. To that effect, we suggest using simple structural change tests, the sup-Wald and UDmax tests for changes in the mean of the loss differences. It is shown that the tests of Giacomini and Rossi (2010) have undesirable power properties: power can be low and non-increasing as the alternative moves further from the null hypothesis. In contrast, our statistics are shown to have higher, monotonic power, especially the UDmax version. We use their empirical examples to show the practical relevance of the issues raised.


INTRODUCTION
Testing for the relative forecasting performance of two or more competing models has been the subject of substantial research. Important contributions include Diebold and Mariano (1995), West (1996), Clark and West (2006) and Giacomini and White (2006). These tests are based on assessing whether the out-of-sample loss differentials are significantly different from zero. They differ with respect to the exact specification of the null hypothesis (loss functions evaluated at the population values of the parameters or at the in-sample estimates), whether the models are nested or non-nested, and whether an unconditional perspective is adopted or one that conditions on some covariates. Being based on averages of the loss differentials, these tests may have little power when the relative forecasting performance changes over time.
Of interest is comparing the out-of-sample forecasting performance of two competing models in the presence of possible instabilities. To that effect, we suggest using simple structural change tests, the sup-Wald and UDmax tests as proposed by Andrews (1993) and Bai and Perron (1998), for changes in the mean of the loss differences. The tests effectively look at the entire time path of the models' relative performance, which may contain useful information not available when using tests that focus on the average relative performance. Giacomini and Rossi (2010), henceforth GR, proposed a fluctuations test and a one-time reversal (OTR) test, also applied to the loss differences. When properly constructed to account for potential serial correlation under the null hypothesis, so as to have a pivotal limit distribution, the tests proposed by GR are shown to have undesirable power properties: power can be low and non-increasing as the alternative moves further from the null hypothesis. In the terminology of Perron (2006), these tests belong to the class of so-called partial-sums type tests, which have repeatedly been shown to be inadequate for structural change problems. The good power properties reported in GR are simply an artefact of imposing a priori that the loss differentials are serially uncorrelated and using the simple sample variance to scale the tests.
We replicate the power properties of their tests with the appropriate heteroskedasticity and autocorrelation consistent (HAC) correction, using exactly the same design they used. In the case of a one-time change in the relative forecasting performance of two models, the power functions of the tests are substantially lower than what they report. More importantly, the power functions are non-monotonic. The power does not tend to 1 as the magnitude of the difference between the models' relative forecasting performance increases and may even decline. These are clearly undesirable features of test statistics, which make their use in practice unreliable. In contrast, the test statistics we propose are shown to have higher, monotonic power, especially the UDmax version.
We also revisit their empirical results, which assess the forecasting performance of the uncovered interest rate parity (UIRP) model relative to a simple random walk model for the UK Pound and German Deutsche Mark exchange rates against the US dollar. We show that their tests have little power to discriminate between the models they considered, while the sup-Wald and UDmax tests provide a strong rejection in the case of the UK Pound. However, there is no evidence that the UIRP model performed significantly better than a simple random walk model in any part of the sample. This illustrates the practical relevance of the power problems of the tests proposed by GR and the fact that the sup-Wald and UDmax tests for changes in the mean of the loss differences yield more powerful procedures.
This note is structured as follows. Section 2 reviews the framework considered by GR, our suggested tests and those proposed by GR. Section 3 re-evaluates the power functions of the tests when an HAC correction is applied. Section 4 does the same for the empirical applications. Section 5 provides brief concluding remarks.

THE FRAMEWORK AND THE TESTS
The interest is in comparing h-step-ahead forecasts from two competing models characterized by parameters $\theta$ and $\gamma$, respectively. There is a sample of $T$ observations available, which is divided into an in-sample portion of size $R$ and an out-of-sample portion of size $P$. The two models yield two competing sequences of h-step-ahead out-of-sample forecasts and, for a given loss function $L$, these yield a sequence of $P$ out-of-sample forecast loss differences $\Delta L_t(\hat\theta_{t-h,R}, \hat\gamma_{t-h,R})$, where $\hat\theta$ and $\hat\gamma$ are the in-sample parameter estimates. A rolling scheme of estimation is used whereby the parameters are re-estimated at each $t = R+h, \ldots, T$ over a window of length $R$ including data indexed $t-h-R+1, \ldots, t-h$. The local relative loss for the two models is the sequence of out-of-sample loss differences computed over centred rolling windows of size $m$, given by (for $m$ even) $m^{-1}\sum_{j=t-m/2}^{t+m/2-1} \Delta L_j(\hat\theta_{j-h,R}, \hat\gamma_{j-h,R})$ for $t = R+h+m/2, \ldots, T-m/2+1$. The simulations and applications are restricted to the case of a quadratic loss function $L_t = (y_t - f_t)^2$, where $f_t$ is the forecast, and to one-step-ahead forecasts.
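To fix ideas, the following is a minimal sketch (in Python/NumPy, not the authors' code) of how the loss differences and the local relative loss can be computed from two given sequences of out-of-sample forecasts, assuming a quadratic loss and $h = 1$. The function names `loss_differences` and `local_relative_loss` are illustrative.

```python
import numpy as np

def loss_differences(y, f1, f2):
    """Quadratic-loss differences Delta L_t = (y_t - f1_t)^2 - (y_t - f2_t)^2
    for two sequences of out-of-sample forecasts of the same P target values."""
    y, f1, f2 = map(np.asarray, (y, f1, f2))
    return (y - f1) ** 2 - (y - f2) ** 2

def local_relative_loss(dL, m):
    """Centred rolling-window means of the loss differences (m even), i.e. the
    local relative loss m^{-1} sum_{j=t-m/2}^{t+m/2-1} Delta L_j."""
    assert m % 2 == 0, "window size m is assumed even"
    return np.array([dL[t - m // 2: t + m // 2].mean()
                     for t in range(m // 2, len(dL) - m // 2 + 1)])
```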
The null hypothesis is that of constant relative forecast accuracy, $H_0: E[\Delta L_t(\hat\theta_{t-h,R}, \hat\gamma_{t-h,R})] = c$ for all $t$ and some constant $c$, versus the alternative hypothesis of changing relative forecast accuracy. The tests considered are (1) the simple sup-Wald test for a single change (e.g. Andrews, 1993; denoted $\sup W$) and (2) the UDmax test of Bai and Perron (1998), which allows up to five breaks. These are applied to test for changes in the mean of the sequence of loss differences. Let $SSR$ be the sum of the squared demeaned loss differences over the full sample and $SSR(i,j)$ be the sum of the squared demeaned loss differences over the sample consisting of observations $i$ to $j$.
The Wald statistics compare these restricted and unrestricted sums of squared residuals, scaled by $\hat\sigma^2$, the HAC estimator of the long-run variance of the demeaned loss differences under the alternative hypothesis. See Bai and Perron (1998) for further details. It is straightforward to show that the tests have the same limit distributions as in Andrews (1993) and Bai and Perron (1998) under the same assumptions used in GR. As we shall show, these tests have much higher power and, in particular, the UDmax version always has a monotonically increasing power function.
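The following is a hedged sketch of a sup-Wald test for a single change in the mean of the loss differences, with the Wald statistic scaled by a Bartlett-kernel HAC estimate computed from the residuals under the alternative. It is only meant to illustrate the construction; the trimming fraction, the bandwidth argument `q` and the function names are illustrative choices, and critical values must be taken from Andrews (1993). The UDmax test extends the same idea by maximizing over partitions with up to five breaks, as in Bai and Perron (1998); that step is omitted here.

```python
import numpy as np

def bartlett_lrv(u, q):
    """Bartlett-kernel long-run variance of a (demeaned) series u with bandwidth q."""
    u = np.asarray(u) - np.mean(u)
    P = len(u)
    lrv = np.dot(u, u) / P
    for i in range(1, q):
        w = 1.0 - i / q
        lrv += 2.0 * w * np.dot(u[i:], u[:-i]) / P
    return lrv

def sup_wald_mean_change(dL, q, trim=0.15):
    """Sup-Wald statistic for a one-time change in the mean of dL."""
    dL = np.asarray(dL)
    P = len(dL)
    lo, hi = int(np.floor(trim * P)), int(np.ceil((1 - trim) * P))
    stats = []
    for b in range(lo, hi):                      # candidate break after observation b
        m1, m2 = dL[:b].mean(), dL[b:].mean()
        resid = np.concatenate([dL[:b] - m1, dL[b:] - m2])
        lrv = bartlett_lrv(resid, q)             # HAC variance under the alternative
        w = (m1 - m2) ** 2 / (lrv * (1.0 / b + 1.0 / (P - b)))
        stats.append(w)
    b_star = lo + int(np.argmax(stats))
    return max(stats), b_star                    # statistic and estimated break date
```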
Our tests will not have power against alternatives with unequal but constant forecast accuracy (since we do not set $c = 0$ under the null hypothesis), but in such cases, the original test of Giacomini and White (2006) or that of Clark and West (2006) will have higher power than the tests proposed by GR. The two types of tests can be used together as follows. First, apply the sup-Wald or UDmax test that we propose. If there is a rejection, conclude that there is a change in relative forecast accuracy between the models. If there is no rejection, apply the statistic of Giacomini and White (2006) or that of Clark and West (2006) to test whether there is unequal but constant relative forecasting performance. When 5% level tests are used, under the null of equal forecast accuracy this strategy will have a nominal size slightly less than 5% ($0.95 \times 0.05$), so there is no size problem related to the use of multiple tests. Second, the power of the sup-Wald or UDmax test will be the same as reported since it is used first. The power of the Giacomini and White (2006) or Clark and West (2006) test will also be nearly the same as when used individually for the alternative hypothesis it is intended to detect, although in 5% of the cases a constant unequal relative forecasting performance will be classified as a time-varying one.
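A sketch of this two-step strategy is given below, reusing `sup_wald_mean_change` and `bartlett_lrv` from the previous sketch. The helper `gw_test` stands in for any unconditional test of equal (constant) forecast accuracy of the Diebold-Mariano/Giacomini-White type; all names and the critical-value arguments are illustrative.

```python
import numpy as np

def gw_test(dL, q, crit=1.96):
    """Unconditional t-test of a zero mean loss difference with an HAC variance."""
    dL = np.asarray(dL)
    P = len(dL)
    tstat = np.sqrt(P) * dL.mean() / np.sqrt(bartlett_lrv(dL, q))
    return abs(tstat) > crit, tstat

def sequential_strategy(dL, q, sup_wald_crit):
    """First test for a change in the mean; if no rejection, test for a non-zero mean."""
    stat, b_star = sup_wald_mean_change(dL, q)
    if stat > sup_wald_crit:                      # critical value from Andrews (1993)
        return "time-varying relative forecast accuracy (break near %d)" % b_star
    reject, t = gw_test(dL, q)
    return "unequal but constant accuracy" if reject else "equal forecast accuracy"
```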
The null hypothesis adopted by GR is that of equal forecast accuracy, $E[\Delta L_t(\hat\theta_{t-h,R}, \hat\gamma_{t-h,R})] = 0$ for all $t$, versus the alternative hypothesis that one model provides better forecasts. Tests for this null hypothesis were provided by Diebold and Mariano (1995) and by the unconditional version of the statistics proposed by Giacomini and White (2006). The first test proposed by GR is the out-of-sample fluctuations test defined by $\max_t |F^{OOS}_{t,m}|$ where, for $t = R+h+m/2, \ldots, T-m/2+1$,
$$F^{OOS}_{t,m} = \hat\sigma^{-1} m^{-1/2} \sum_{j=t-m/2}^{t+m/2-1} \Delta L_j(\hat\theta_{j-h,R}, \hat\gamma_{j-h,R}), \qquad (2)$$
with $\hat\sigma^2$ an HAC estimate of the long-run variance $\sigma^2$ of the loss differences. They suggest the use of a kernel-based estimate with the Bartlett window, that is,
$$\hat\sigma^2 = \sum_{i=-q(P)+1}^{q(P)-1} \left(1 - \frac{|i|}{q(P)}\right) P^{-1} \sum_{j} \widehat{\Delta L}_j \widehat{\Delta L}_{j-i}, \qquad (3)$$
where $q(P)$ is a bandwidth that grows with $P$. GR make no recommendation about how to select $q(P)$. Following state-of-the-art good practice, in the simulations and applications we use a data-dependent method, specifically the one advocated by Andrews (1991) based on an AR(1) approximation. Also, correcting for an omission in GR, the demeaned loss differences $\widehat{\Delta L}_j = \Delta L_j(\hat\theta_{j-h,R}, \hat\gamma_{j-h,R}) - P^{-1}\sum_{k=R+h}^{T} \Delta L_k(\hat\theta_{k-h,R}, \hat\gamma_{k-h,R})$ are used in (3). This statistic is referred to as the GW-fluctuations test since it is based on the maximum (over some range) of the sequence of statistics $F^{OOS}_{t,m}$, which are equivalent to the test of Diebold and Mariano (1995) and the unconditional version of the Giacomini and White (2006) test. The second test they propose is the OTR test defined by $QLR_P = \sup_t \Phi_P(t)$, $t \in \{[0.15P], \ldots, [0.85P]\}$, where $\Phi_P(t)$ is a Wald-type statistic for a one-time reversal in relative performance at date $t$ and $\hat\sigma^2$ is again defined by (3). For all tests, the framework can be adapted to a different null hypothesis in which the concern is about the forecast losses evaluated at the population parameters, as considered in Clark and West (2006). In this case, one simply applies an adjustment to the forecast losses. For example, when one model specifies $y_t$ to be a martingale difference sequence and the other is a linear regression model of the form $y_{t+1} = \beta X_{t+1} + e_{t+1}$, the squared forecast error of the regression model is adjusted by subtracting $f_t^2$, where $f_t$ is the forecast from the regression model. GR refer to the fluctuations test applied to such corrected loss functions as the CW-fluctuations test.
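As an illustration of the construction described above, the following sketch computes a GW-fluctuations type statistic: the maximum absolute value of the standardized centred rolling-window sums of the loss differences, with the long-run variance estimated by the Bartlett-kernel estimator over the full out-of-sample period (here the `bartlett_lrv` helper from the earlier sketch). This is not GR's code; names and arguments are illustrative.

```python
import numpy as np

def fluctuations_test(dL, m, q):
    """max_t |F_{t,m}| with F_{t,m} = sigma^{-1} m^{-1/2} * sum of dL over a centred window."""
    dL = np.asarray(dL)
    sigma2 = bartlett_lrv(dL, q)                  # full-sample HAC long-run variance
    stats = []
    for t in range(m // 2, len(dL) - m // 2 + 1):
        s = dL[t - m // 2: t + m // 2].sum()      # centred window of size m
        stats.append(s / np.sqrt(m * sigma2))
    return np.max(np.abs(stats))
```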
For both tests, the use of an HAC estimator for the long-run variance is essential. To illustrate, we generated loss differences as an AR(1) process with coefficient 0.75. This type of serial correlation can arise as a result of serial correlation in the second-order moments of the residuals and/or the regressors, including, but not restricted to, GARCH processes. In this case, the size of all tests with a fixed number of lags $q(P) = 2$ is near 70%. When using Andrews's data-dependent method to select $q(P)$, the size of the GR tests (OTR and fluctuations) is between 5% and 10%. Hence, it is important to appropriately correct for potential serial correlation in the loss differences. Also, if instabilities are present under the alternative hypothesis, a situation that indeed motivates the tests proposed, the loss differentials will exhibit features akin to serial correlation, in the sense that a test for serial correlation would tend to reject the absence of correlation. This is simply a consequence of the results in Perron (1989, 1990) that a change in the mean (or slope) of a time series biases the sum of the autoregressive coefficients upwards when such changes are not explicitly modelled. Yet, GR impose a priori that the loss differentials are serially uncorrelated and use the simple sample variance as the estimate of $\sigma^2$, namely $\hat\sigma^2 = P^{-1}\sum_{j=R+h}^{T} \Delta L_j(\hat\theta_{j-h,R}, \hat\gamma_{j-h,R})^2$. They do so for both the simulations reported and the applications. As we document in the next sections, the properties of their tests are very different when the test is properly constructed with an HAC estimate, and the conclusions of their empirical applications are also different.
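For concreteness, the following is a sketch of a Bartlett-kernel long-run variance with a data-dependent bandwidth chosen via an AR(1) plug-in in the spirit of Andrews (1991). This is not the authors' implementation; the helper names are illustrative.

```python
import numpy as np

def andrews_bandwidth_ar1(u):
    """Andrews (1991) optimal Bartlett bandwidth using an AR(1) approximation."""
    u = np.asarray(u) - np.mean(u)
    rho = np.dot(u[1:], u[:-1]) / np.dot(u[:-1], u[:-1])      # AR(1) coefficient
    alpha1 = 4.0 * rho ** 2 / ((1.0 - rho) ** 2 * (1.0 + rho) ** 2)
    return 1.1447 * (alpha1 * len(u)) ** (1.0 / 3.0)

def hac_variance(u):
    """Bartlett-kernel HAC long-run variance with the Andrews AR(1) bandwidth."""
    u = np.asarray(u) - np.mean(u)
    P, bw = len(u), andrews_bandwidth_ar1(u)
    lrv = np.dot(u, u) / P
    for j in range(1, int(np.floor(bw)) + 1):
        lrv += 2.0 * (1.0 - j / bw) * np.dot(u[j:], u[:-j]) / P
    return lrv
```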

THE SIMULATIONS
We adopt the same simulation setup as in GR in order to avoid any potential biases due to the selection of particular DGPs. We also used their code, available at the Journal of Applied Econometrics website, in order to correct some inaccurate reporting or typos in their paper. The results obtained, in particular the documented power reversal of the tests, could be much more severe using other DGPs. Two forecasting models are considered. For the first, there is a covariate $X_t$ that potentially helps to forecast $Y_t$, so that $f^{(1)}_{t,R} = \hat\beta_{t,R} X_{t+1}$ (assuming that $X_{t+1}$ is known when constructing the forecast), where $\hat\beta_{t,R}$ is the in-sample parameter estimate from a regression of $Y_t$ on $X_t$ based on a rolling window of size $R$. For the second model, $Y_t$ is assumed to be a zero-mean white noise process, so that $f^{(2)}_{t,R} = 0$. Hence, under the GW framework, the loss differentials are $\Delta L_{t+1} = (Y_{t+1} - \hat\beta_{t,R} X_{t+1})^2 - Y_{t+1}^2$, while under the Clark and West framework they are further adjusted by subtracting $(\hat\beta_{t,R} X_{t+1})^2$ from the squared forecast error of the first model. We consider simulations assessing the performance of the tests when the forecasting performance of the models is time varying, such that there is a one-time break in the relative performance during the out-of-sample period induced by a break in the DGP. Under the GW framework, this is achieved by introducing a one-time change of magnitude $\delta$ in the coefficient relating $Y_t$ to $X_t$ (with a proper correction for an error in GR), where $X_t = 0.5X_{t-1} + v_t$ with $v_t$ i.i.d. $N(0,1)$ and $\varepsilon_t$ i.i.d. $N(0,1)$ uncorrelated with $v_t$. Hence, the relative performance changes at $t = R + \tau P$, with break fraction $\tau = 1/3$ or $2/3$, and we set $m/P = 0.3$ or 0.7. The results with an HAC correction are presented in Figures 1 ($\tau = 1/3$) and 2 ($\tau = 2/3$). The left panel considers the same values of $\delta$ as in GR (0 to 1), while the right panel shows the power functions for values of $\delta$ up to 10. In all cases, we consider 5% two-sided tests.
Consider first the case with $\tau = 1/3$. When $m/P = 0.3$, the GW fluctuations test has more power than the OTR test, as in GR, but the power is much lower than they reported. More importantly, both tests suffer from non-monotonic power; neither reaches power of 100% no matter how large $\delta$ is. The power of the fluctuations test reaches a maximum value of about 0.90 when $\delta$ is near 1, while the OTR test reaches a maximum power of about 0.6 when $\delta$ becomes large. The $\sup W$ test does not have monotonic power either, with a power function in between those of the GW fluctuations and OTR tests. The UDmax, on the other hand, has monotonic power that approaches 1 quickly and is the most powerful overall. When $m/P = 0.7$, the OTR test has more power than the fluctuations test, as in GR. However, with the HAC correction, the power decrease is even more pronounced. The power of the fluctuations test reaches a maximum value of about 0.37, while the OTR test reaches a maximum power of about 0.55. The $\sup W$ test does not have monotonic power either, but its power function is now higher than those of the GW fluctuations and OTR tests. The UDmax test again has monotonic power that approaches 1 quickly and is the most powerful overall.
Consider now the case with $\tau = 2/3$, presented in Figure 2. For both $m/P = 0.3$ and 0.7, the $\sup W$ test has the highest power, followed closely by the UDmax, both having monotonic power functions. As in GR, the OTR test has high power whether $m/P = 0.3$ or 0.7. In all cases, the OTR and GW fluctuations tests suffer from non-monotonic power that does not reach 1 even for very large values of $\delta$.
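To make the mechanics concrete, the sketch below simulates a design in the spirit of the one just described: $X_t$ follows an AR(1) with coefficient 0.5, the coefficient linking $Y_t$ to $X_t$ changes by $\delta$ at date $R + \tau P$, and the GW loss differences are computed from rolling-window regression forecasts against a zero forecast. The exact break specification in GR's code may differ; every detail below should be read as an assumption made for illustration only.

```python
import numpy as np

def simulate_loss_differences(R=50, P=100, delta=1.0, tau=1/3, seed=0):
    """Generate GW loss differences under an assumed one-time coefficient break."""
    rng = np.random.default_rng(seed)
    T = R + P
    v, eps = rng.standard_normal(T + 1), rng.standard_normal(T + 1)
    x = np.zeros(T + 1)
    for t in range(1, T + 1):
        x[t] = 0.5 * x[t - 1] + v[t]                          # AR(1) covariate
    beta_t = np.where(np.arange(T + 1) > R + tau * P, delta, 0.0)  # assumed break
    y = beta_t * x + eps
    dL = []
    for t in range(R, T):                                     # rolling window of size R
        xw, yw = x[t - R + 1:t + 1], y[t - R + 1:t + 1]
        b = np.dot(xw, yw) / np.dot(xw, xw)                   # OLS without intercept
        f1 = b * x[t + 1]                                     # model 1: regression forecast
        dL.append((y[t + 1] - f1) ** 2 - y[t + 1] ** 2)       # model 2: zero forecast
    return np.asarray(dL)
```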
When $m/P = 0.3$, the power of the OTR test achieves a maximum near but below 1 when $\delta$ is near 0.8, and the power remains at that level as $\delta$ increases. The power of the GW fluctuations test reaches a maximum near 0.85 when $\delta$ is near 1, but it decreases to about 0.70 as $\delta$ increases further. When $m/P = 0.7$, the power function of the OTR test is similar, but that of the GW fluctuations test is considerably reduced when an HAC correction is applied, reaching a maximal value near 0.15.
Also worth noting is that in all cases, the $\sup W$ and UDmax tests have nearly identical monotonic power functions that approach 1 quickly. On the other hand, the power of the CW fluctuations test never increases to 1 no matter how large the change is. The maximal power achieved depends strongly on the exact specifications. When $m/P = 0.3$, it is between 0.85 and 0.90 for the three values of $\tau$ considered. However, when $m/P = 0.7$, it is near 1 when $\tau = 2/3$ but not above 0.25 when $\tau = 1/2$ and essentially zero when $\tau = 1/3$.
In summary, the simulations show important problems of non-monotonic power for the GW or CW fluctuations and the OTR tests. The UDmax test always has power functions approaching 1 quickly. In most cases, the power of the $\sup W$ is comparable with that of the UDmax, although it can also be subject to power functions flattening below 1 as the alternative becomes large. Hence, in the presence of unequal time-varying forecast accuracy, the UDmax test for changes in the mean of the loss differences is clearly the preferred test.
A comment about the bandwidth selection is in order. It is well known that the non-monotonic power arises because a relatively large bandwidth is selected via Andrews' method under the alternative (e.g. Kim and Perron, 2009). It may be argued that with large breaks the selected bandwidth is 'too high'. This is not the case. The average (over all replications) value of the bandwidth $q(P)$ selected by Andrews' method ranges between 4 and 6 when $\delta$ varies between 0.5 and 1, the range for which the power reversal is present. These values are near, for example, the default value in STATA (Newey-West option), which is 5 when $T = 100$.
While a data-dependent method is highly preferable to a fixed rule for selecting $q(P)$ to ensure the proper size (asymptotically and in finite samples), one may have a strong prior that the loss differentials are weakly correlated under the null hypothesis and therefore use a fixed rule to select $q(P)$. Figure S4(a) and (b) in Martins and Perron (2015) presents the results corresponding to Figures 1 and 2 when the popular rule of thumb $q(P) = 5$ is used to construct the statistics. The results show that some of the power functions of the tests are no longer non-monotonic, but that overall the sup-Wald and, especially, the UDmax tests have higher power, sometimes by a wide margin. Hence, the superiority of the proposed tests holds under both a fixed and a data-dependent rule to select $q(P)$. Of course, the power problems are smaller with a fixed value $q(P) = 3$, but they are also much worse with a fixed value $q(P) = 9$. This is to be expected, since if $q(P)$ is very small the estimate becomes similar to the standard sample variance.
It has by now become standard (good) practice to use a data-dependent method to select the bandwidth. It has the advantage of providing a selection method that is not ad hoc or arbitrary and that, in general, delivers tests with good finite sample size for a wide range of possible DGPs. As stated earlier, using a low fixed value would invariably lead to tests with size distortions for a wide variety of DGPs.

THE APPLICATIONS REVISITED
Giacomini and Rossi applied the tests they proposed to assess the forecasting performance of the UIRP model relative to a simple random walk model for the UK Pound and German Deutsche Mark exchange rates against the US dollar. Large positive values of the fluctuations test provide evidence that the UIRP model is superior to the random walk model. Again, their tests were constructed without an HAC correction, assuming a priori uncorrelated forecast losses. They also departed from the fluctuations test they proposed: instead of (2), they reported results for a modified version of the test. In what follows, we consider the original statistic (2), with the long-run variance estimated using the full sample. We consider two-sided 5% tests.
Consider first the results for the German Deutsche Mark, presented in Figure 3. Here, none of the tests is significant, including the OTR and UDmax tests (not reported). This contrasts with the results of GR, who reported a significant rejection using the CW-fluctuations test without an HAC correction.
Consider now the results for the UK Pound, presented in Figure 4. Here also, the OTR test is not significant, nor are the $\sup W$ and UDmax tests based on the GW loss differences. On the other hand, the fluctuations-based tests offer a contrasting picture. The GW fluctuations test is barely significant, but in favour of the random walk model, contrary to what was reported in GR. On the other hand, the CW fluctuations test is barely significant in favour of the UIRP, consistent with the result in GR. Based on the CW loss differences, the $\sup W$ and UDmax tests are both highly significant, at less than the 1% significance level, which illustrates the higher power of these tests. The estimate of the break date (the date that maximizes the sequence of Wald tests for a single change) is 1990:09. To assess the nature of the change in forecasting performance, we estimated the mean of the loss differences pre-1990:09 and post-1990:09; these are 0.0002 and -0.00004, respectively. Hence, this points to a better forecasting performance for the UIRP pre-1990:09 and vice versa post-1990:09. However, a standard CW test applied to the pre-1990:09 sample yields a t-statistic of 0.33. Hence, there is no evidence that the UIRP performed significantly better than the random walk in any part of the sample.
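For completeness, the following sketch shows the kind of post-rejection diagnostics used above: take the break date that maximizes the Wald sequence, compare the pre- and post-break means of the (CW-adjusted) loss differences, and apply a simple HAC t-test to the pre-break subsample. It reuses `sup_wald_mean_change` and `hac_variance` from the earlier sketches; names are illustrative and this is not the authors' code.

```python
import numpy as np

def break_diagnostics(dL_cw, q):
    """Estimated break date, pre/post-break means, and a pre-break HAC t-statistic."""
    stat, b = sup_wald_mean_change(dL_cw, q)
    pre, post = dL_cw[:b], dL_cw[b:]
    t_pre = np.sqrt(len(pre)) * pre.mean() / np.sqrt(hac_variance(pre))
    return {"sup_wald": stat, "break_index": b,
            "pre_mean": pre.mean(), "post_mean": post.mean(),
            "pre_break_t_stat": t_pre}
```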

CONCLUSIONS
When properly constructed, the tests proposed by GR are shown to have undesirable power properties: power can be low and non-increasing as the alternative moves further from the null hypothesis. In the terminology of Perron (2006), these tests belong to the class of so-called partial-sums type tests, which have repeatedly been shown to be inadequate for structural change problems. Tests based on standard Wald statistics are much less prone to such problems, and this is again the case here. We have shown that, to detect changing relative forecasting accuracy, the $\sup W$ and, in particular, the UDmax tests applied to changes in the mean of the loss differences have much higher power. Of course, these are not appropriate for testing unequal but constant relative forecast accuracy. In such cases, the original tests of Giacomini and White (2006) and Clark and West (2006) are to be used. The fluctuations versions of these tests and the OTR test offer no power gains in this case.

SUPPORTING INFORMATION
Additional supporting information may be found in the online version of this article at the publisher's website.