An Overview of Gradient-Enhanced Metamodels with Applications

Metamodeling, the science of modeling functions observed at a finite number of points, benefits from all auxiliary information it can account for. Function gradients are a common auxiliary information and are useful for predicting functions with locally changing behaviors. This article is a review of the main metamodels that use function gradients in addition to function values. The goal of the article is to give the reader both an overview of the principles involved in gradient-enhanced metamodels while also providing insightful formulations. The following metamodels have gradient-enhanced versions in the literature and are reviewed here: classical, weighted and moving least squares, Shepard weighting functions, and the kernel-based methods that are radial basis functions, kriging and support vector machines. The methods are set in a common framework of linear combinations between a priori chosen functions and coefficients that depend on the observations. The characteristics common to all kernel-based approaches are underlined. A new ν\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\nu $$\end{document}-GSVR metamodel which uses gradients is given. Numerical comparisons of the metamodels are carried out for approximating analytical test functions. The experiments are replicable, as they are performed with an opensource available toolbox. The results indicate that there is a trade-off between the better computing time of least squares methods and the larger versatility of kernel-based approaches.

underlined.A new -GSVR metamodel which uses gradients is given.Numerical comparisons of the metamodels are carried out for approximating analytical test functions.The experiments are replicable, as they are performed with an opensource available toolbox.The results indicate that there is a trade-off between the better computing time of least squares methods and the larger versatility of kernelbased approaches.(Matlab/Octave's toolbox [12])

Abbreviations
Abstract Metamodeling, the science of modeling functions observed at a finite number of points, benefits from all auxiliary information it can account for.Function gradients are a common auxiliary information and are useful for predicting functions with locally changing behaviors.This article is a review of the main metamodels that use function gradients in addition to function values.The goal of the article is to give the reader both an overview of the principles involved in gradient-enhanced metamodels while also providing insightful formulations.The following metamodels have gradient-enhanced versions in the literature and are reviewed here: classical, weighted and moving least squares, Shepard weighting functions, and the kernel-based methods that are radial basis functions, kriging and support vector machines.The methods are set in a common framework of linear combinations between a priori chosen functions and coefficients that depend on the observations.The characteristics common to all kernel-based approaches are

Introduction
Despite continuous progress in the accuracy of experimental measurements and numerical simulations of the physics of a considered system, the need for metamodels keeps increasing.Metamodels are statistical or functional models of input-ouput data that are obtained either from experimental measures or from the numerical simulation of the associated physical phenomena.Metamodels are sometimes called surrogates, proxies, regression functions, approximating functions, supervised machine learners or are referred to with specific names such as the ones described later in this article.Although not directly linked to the physics, metamodels have proven to be necessary for creating simple, computationally efficient associations between the input and output of the considered phenomena.For example, in materials sciences input may be material properties or boundary conditions and outputs are displacements, forces, temperatures, concentrations or other quantities at specific locations; in design, inputs may be the parameters describing a shape or a material and outputs specific measures of performance such as mass, strength, stiffness; in geophysics inputs may be parameterized descriptions of the underground (permeabilities, faults, reservoir shapes) and the outputs quantities observed at the surface (flow rates, displacements, accelerations, gravity).Typically, actual or numerical experiments are costly in terms of time or other resources, in which case metamodels are a key technology to perform optimization, parameter identification and uncertainty propagation.
Infering nonlinear relationships requires large amount of data particularly when the number of input parameters grows (the "curse of dimensionality" [1]) so that it is important to use all available additional information.Gradients, i.e., derivatives of the outputs with respect to the inputs, are one of the most common and most useful side knowledge to be accounted for when building the metamodels: many finite elements codes have implemented adjoint methods to calculate gradients [2][3][4]; automatic differentiation is another solution for computing gradients [5][6][7]; there are responses such as volumes for which analytical gradient calculation is accessible.
Accounting for gradients when building metamodels often allows to decrease the necessary number of data points to achieve a given metamodel accuracy, or equivalently, it allows to increase the metamodel accuracy at a given number of data points.When guessing a function with a locally changing behavior (a non stationary process in the probability terminology) from a sparse set of observations, the traditional regression techniques relying only on the function values will tend to damp out local fluctuations.This is because useful regression methods comprise regularization strategies that make them robust to small perturbations in the data.Accounting for gradients is a way to recover some of the meaningful local fluctuations.The need for gradient information has been acknowledged in geophysics for reconstructing a gravity field from stations measurements when the underground is subject to local changes [8,9].Further illustrations of the interest of gradients will be given in Sect.10.
The purpose of this article is to review the various approaches that have been proposed to create metamodels An Overview of Gradient-Enhanced Metamodels with Applications 1 3 with zero order and gradient information.A global view is first developed.Section 2.1 is a general introduction to metamodeling which may be skipped by readers familiar with the concept.After a Section presenting the main notations, Sect.2.3 synthetizes into a unique framework the different techniques which will be covered in the review.A generic idea, which can be applied to any surrogate, for indirectly using gradients is summarized in Sect.3.
The article then details, in turn, each gradient-enhanced method: the large family of least squares approaches are the focus of Sect.4; Shepard weighting functions are summarized in Sect. 5.All the methods covered later are based on kernels.After summarizing the main concepts behind kernels in Sect.6, we provide details about gradient-enhanced radial basis functions (Sect.7), kriging (Sect.8) and support vector regression (Sect.9).Note that the formulation of the gradient-enhanced support vector regression ( -GSVR) proposed in part 9.4 is a new contribution.Multivariate cubic Hermite splines [10,11] are not discussed in this review as they seem to date to remain a topic of mathematical research.
Finally, the different methods are applied and compared on analytical test functions in Sect.10.The ensuing analysis of results and presentation of related softwares should help in choosing specific gradient-enhanced techniques.All methods described in this article have been implemented and tested with the opensource matlab/octave GRENAT Toolbox [12].

Surrogates and Their Building in a Nutshell
In many contexts, the observation of the response of a parametrized system can be done only for a few parameters instances, also designated as points in the design space.A solution to getting an approximate response at non-sampled parameter instances is to use a metamodel (or surrogate).A metamodel is a doubly parameterized function, one set of parameters being the same as that of the studied system (i.e., the coordinates of the points), the other set of parameters allowing further control of the metamodel response to give it general representation abilities.For simplicity's sake, parameters of the second set will be designated as internal parameters.The building of the metamodel involves tuning its internal parameters in order to match, in a sense to be defined, the observations at the points.
The simplest metamodels are polynomials tuned by regression, which are part of the response surface methodology (RSM [13]) for analyzing the results of experiments.For dealing with an increase in nonlinearity of the function, rising the degree of the polynomial could seem to be a solution.However, oscillations appear and the number of polynomial coefficients, n t , to be set grows rapidly, as where n p is the number of parameters and d • is the degree of the polynomial.This is why other techniques for approximating functions such as parametric kernel-based metamodels have received much attention.
The literature is already rich in contributions presenting and detailing surrogates models [14][15][16][17][18][19][20][21][22][23].Hereafter, the basic steps in building and using surrogate models are summarized (see also Fig. 1): -Initial data generation Sampling strategies generate points in the design space (using, for instance, Latin Hypercube Sampling [24]).The responses of the actual function are calculated at each instance of the parameters.In many cases, this step is computationally intensive because the actual function involves a call to, typically, solvers of partial differential equations.Details on sampling techniques can be found in [25][26][27].-Build the metamodel Because data is sparse, parametric surrogate models (which are reviewed in the rest of this paper) are used.This step mainly means determining the model internal parameters.-In many situations and in particular for optimization, enrichment (or infill) strategies are used for adding points to the initial set of sample points.Enrichment strategies post-process the current surrogate.An example of infill method for optimization is the Expected Improvement [28,29].-Finally the quality of the surrogate model is measured using dedicated criteria (such as R 2 or 3 , cf.Sect. 10).
At the end the building process and during the infill steps the surrogate can provide inexpensive approximate responses and gradients of the actual function.For a large number of sample points and/or a large number of parameters, the building of a surrogate can be (computer) time consuming but it is typically less expensive than a nonlinear finite elements analysis.
In the context of optimization, metamodels are often used for approximating objective or constraint functions and the approximation contributes to localizing the potential areas of the optimum.For efficiency in optimization, metamodels are not made accurate in the whole design space but only in potentially good regions.Such family of approaches is designated as surrogate-based analysis and optimization (SBAO) [18].It is composed of optimization algorithms that rely on a metamodel, a classical optimization algorithm and an infill strategy.SBAO presents (1) some similarities with Trust-Region methods [30] in the use of metamodels.However Trust-Region methods focus on proved rapid local convergence whereas SBAO targets globally optimal points.In SBAO the infill strategy looks for global optima by sequentially optimizing a criterion that is calculated directly with the metamodel (saving calls to the true functions) and that is a compromise between exploration and search intensification: exploration means adding points at badly known areas of the design space, intensification (also referred to as exploitation) means adding points in regions where one expects good performance.Among the many existing criteria [31,32], Expected Improvement and related criteria [28,29] are the most common and have led to the efficient global optimization algorithm (EGO) [33].Thus, the use of surrogates in optimization is iterative: each step of SBAO algorithms includes (i) building a surrogate followed by (ii) optimizing an infill criterion based on the surrogate and then (iii) calling the actual simulation at the point output by the infill subproblem.We now turn to the focus of this review that is gradient-enhanced metamodels also designated as gradientassisted or gradient-based metamodels.The next sections present our notations and a global framework for gradientenhanced metamodels.

Main Notations
Let us consider an experiment parameterized by n p continu- ous values grouped in the vector x (i) .n p if often known as the dimension of the (approximation or optimization) problem.The scalar output, or response, of the experiment is the function y(⋅).The notation x (i) designates both sample points (i ∈ [[1, n s ]]) and any non sampled point (i = 0).The vectors of responses (also sometimes called observations) and gradients calculated at all sample points are denoted y g and are assembled according to Eqs. ( 2)- (5).with where More generally, a function y and its derivatives is writ- ten using the following index notations: y ,i , y ,ij ... where i and j take values in [[0, n p ]] such as Finally, the notation • designates the approximation of the quantity of interest • provided by the metamodel.Bold (2) Fig.

Introduction to Gradient-Enhanced Metamodels
This review article focuses on metamodels that, in addition to using and describing the responses, also use and model the gradient of the response with respect to the x parameters.Henceforth, for each sample point the value of the function and the gradients are supposed to have been observed.The following approaches will be covered: indirect approaches, least squares techniques (LS), weighted least squares (WLS), moving least squares (MLS), the Shepard weighting function (IDW), radial basis functions (GRBF), cokriging (GKRG) and support vector regression (GSVR).In these last acronyms, G stands for gradientenhanced.A condensed view of the main references on which the next sections are based is given in Table 1.
Before precisely introducing each gradient-based surrogate, we give a common description of all the techniques (that can also be used for describing non-gradient-based models).It is noteworthy that all the surrogates discussed in this paper are obtained by linear combination of "coefficients" and "functions" that we will define soon.The approximation ỹ of an actual function y can be calculated as follows: designates the internal parameters of the surrogate model.The terms A(), B(), and C() are specific to each kind of surrogate model but share common defining properties.A() is the trend term whose goal is to represent the main (average, large scale) features of the function y().The B()'s are "functions" and the C()'s are "weighting coefficients".The B() functions are chosen a priori in the sense that, assuming is fixed, they do not depend on the observed responses, y x (i) and their derivatives, y x j x (i) .However, ( 8) the B() functions can depend on the locations of the sample points, x (i) , i = 1, … , n s .The B() functions are typically user inputs to the methods.In contrast, the C() coefficients are calculated from the observations y x (i) and y x j x (i) , so that their linear combi- nation with the B() functions, eventually added to the trend, makes an approximation to y(), as stated in Eq. ( 8).The coefficients are the weights in the linear combination of the B() functions.For example, if one expects that the response (for n p = 1) is proportional to 1/x plus a quadratic term one could a priori choose B 1 (x) = 1∕x and B 2 (x) = x 2 and cre- ate a simple approximation with constant coefficients ) ).The C i 's are then calculated from the observations, which in our context include both the response function and its derivatives at the sampled points, for example so that the approximation fits the observations in a least squares sense.When there is no a priori on the B() functions, a generic choice is made: basis functions (e.g., polynomials) for LS, arctan for neural networks, kernels evaluated at a given x (i) in kernel methods (GRBF, GKRG, GSVR here).
More generally, surrogates can be created by looking, at each x (0) , for the "best" (in a certain sense) linear combi- nation of the B()'s, in which case the coefficients depend on x (0) .The simplest template of such a surrogate would be y where B i x (0) is a measure of similarity between x (0) and x (i) (not detailed here) and the coefficients C i () are y x (i) .IDW and the ker- nel methods (GRBF, GKRG and GSVR) are refined examples of such surrogates.
Although mathematically equivalent to a single summation, the double summation in Eq. ( 8) emphasizes the specific structure of gradient-enhanced surrogates: in all kernel-based surrogates, the index i describes the sample point considered (therefore N = n s ) while j represents the variable with respect to which the derivatives are taken, j = 0 standing for the response without differentiation, (and Table 2 summarizes the expressions of the trend, the functions and the coefficients such as they will appear later in the text.Note that all metamodels but LS have internal parameters, , that, as with non-gradient-enhanced meta- models, are computed from the known points (x (i) , y(x (i) )), and y x j x (i) here, i = 1, n s , in a manner which is specific to each surrogate.For the sake of clarity, the difference between the functions B() and the coefficients C() is made assuming that is fixed, otherwise there is no clear general mathematical difference between them.
The methods that will be presented are organized in two groups, the kernel-based methods from Sect. 6 onward, and the rest (before).They can be distinguished in the same way as the two above examples.Kernel-based methods are built from the specification of a kernel, i.e., a function with two inputs that quantifies the similarity between what happens at these two inputs.The other approximations, which in this review are mainly variants of least squares, are made from a priori chosen single input functions that are linearly combined.Despite fundamental differences in the way they are constructed, many equivalences can be found between the methods: generalized least squares also have a kernel which is given at the end of Sect.4.2; vice versa, the kernels of the GSVR are implicitely products of functions and the GSVR approximation is a linear combination of them like that of least squares; the approximate responses of GRBF and GKRG have the same form [Eqs. (72) and (89) without trend are equivalent].These connections are further detailed in the paper.As a last common feature of the methods presented, it is striking that all the approaches but GSVR approximate the response by a linear combination of the observations y g , provided the internal parameters are fixed.The calculation of the approximate gradients will be achieved by deriving Eq. (8), i.e., calculating ỹ,k ( ), which is possible if A, B ij and C ij are differentiable functions.One could think of other ways to build ỹ,i ( ), like learning them directly from the gradient data y gs independently from the response y s , but such techniques are instances of the usual metamodel building (just applied to the gradients) and are out of the scope of this review.
In practice, it is common to only have access to some of the components of the gradient of the response.ỹ,k ( ) would be known while ỹ,l ( ), k ≠ l, would not.All the techniques reviewed in this article apply only to the components where the derivatives are known.However, to keep notations simple, the derivations will always be carried out for all of the variables, as if all components of the gradient were accessible.Further remarks about missing data and higher order derivatives are given in Sect.10.5.

Indirect Gradient-Based Metamodels
For taking into account the derivative information of the response in the making of a metamodel, the most basic idea is to use a first order Taylor's series at sample point x (j) to gener- ate additional data points.For each sample point, for each of the n p parameters, a neighboring point is created, where e k is an orthonormal basis vector of the design space.Under the assumption that Δx k remains small (|Δx k | ≪ 1), the Taylor's serie provides the extrapolated value of the function y at the neighboring point: 0 n s Neighborhood radius set from sample points positions Interpolating, cf.Eq. ( 41) for W j and Eq. ( 44) for Q j 5 GRBF 0 i,j x (0) ; w ij ( ) n p n s Cross-validation Interpolating, i,j are kernels and their derivatives, w ij solution of Eq. ( 62) Finally the non-gradient based metamodel can be built with n s × (n p + 1) sample points and associated responses.This approach has been used with kriging approximation [43,48,81] and has been called "Indirect Cokriging".The main drawback of this method is that it requires a good choice of the Δx k parameters: if the value is too small, the kriging correlation matrix can be ill-conditioned and too large a value leads to a degraded extrapolation by Taylor's expansion.In both cases, the metamodel provides an incorrect approximation.In previous works, Liu [81] used maximum likelihood estimation for estimating the value of each parameter Δx k and in [48], the Δx k are chosen equal to 10 −4 .
The indirect gradient-based approach does not scale well with dimension as the number of sample points is multiplied by n p + 1 when compared to a direct approach.More- over, because the n p new sample points are very close to the initial sample point, numerical issues (such as the bad conditioning of covariance matrices in KRG) occur that complicate the determination of the internal parameters.Regularization or filtering techniques should be brought in.Therefore, indirect gradient-based approaches should only be used in low dimension and for problems where dedicated techniques for determining Δx k and the inter- nal parameters exist.In other cases, it is better not to use gradients or to opt for a direct gradient-based approach.Examples of indirect gradient-based Kriging and RBF are proposed in Figs. 6 and 9.In these figures, the derivatives of RBF and KRG are determined analytically by deriving their predictors.In such low dimension, the indirect gradient-based approaches, InRBF and InOK, seem to perform as well as the direct gradient-based approaches, GRBF and OCK, in terms of approximating the true response derivative, dy/dx(x).However, as will be seen in Sect. 10 (Figs. 23 and 25), such indirect strategies are not competitive in higher dimensions.

Non Weighted Least Squares (LS and GradLS)
Least squares regression is the most common technique for approximating functions.Mainly applied in the context of response surface methodology (RSM [13]), the classical regression [82] can be extended for taking into account gradient information [75].In this text, the acronym LS designates least squares regression without the use of gradients and Gra-dLS is the gradient-enhanced version of it.Notice that this acronym differs from GLS that will designate generalized least squares.
The linear model used for gradient-based formulations remains the same as for non-gradient-based versions: but this time the vector y g contains n s × (n p + 1) terms (responses and gradients), the matrix F contains evalua- tions of the n t a priori chosen functions f j and their deriva- tives at each sample points x (i) , the vector contains n t pol- ynomial regression coefficients j , and the vector is made of the n s × (n p + 1) errors of the model.
For gradient-based least squares models, at each point x (i) , y x (i) , dy dx (x (i) ) , n p + 1 errors can be written: The matrices and vectors of Eq. ( 11) are now further defined: where The sizes of the previous matrices F s and F gs are n s × n t and n p n s × n t , respectively.The metamodel is built by determining the vector ̂ which minimizes the following mean squares error: Minimizing MSE over yields the function approximation, (11) where This metamodel leads to the familiar expression of regression approximation.Notice however that here, the gradients affect the coefficients ̂ .
The derivation of the GradLS model by minimization of the quadratic norm MSE can be interpreted as making the orthogonal projection of the vector of observed responses and gradients y g onto the space spanned by the columns of F. The result of the orthogonal projection is F ̂ .The pro- jection is itself defined by the inner product it relies on.In least squares without derivatives, the inner product is the usual dot product between vectors in an Euclidean space.Accounting for derivatives extends this inner product to an inner product in a Sobolev space [83]: where the g, s and gs subscripts have the same meaning as above with F. While both inner products account for the euclidean distance between the response and the metamodel, the Sobolev inner product further accounts for the difference in response and metamodel regularities through their gradients.In other words, the usual geometrical interpretations of least squares generalize to the least squares with derivatives formulation of Eq. ( 18) by moving from the Euclidean inner product to a product with additional derivative terms.
The derivatives of the GradLS approximation are directly obtained by deriving Eq. ( 19), As required for building the gradient-enhanced least squares model, the functions f j must be differentiable at least once.
Although the empirical mean square error Eq. ( 18) can be reduced by increasing the degree of the polynomial basis, ỹ() will increasingly oscillate between the n s data points, which degrades the prediction quality.This oscillatory phenomenon, known as Runge's phenomenon [84], is illustrated in Figs.2a and 3d in 1 and 2 dimensions, respectively.Runge's oscillations are mitigated when the actual function is polynomial, the number of sample points n s increases, when gradients are accounted for like here, or when a regularization strategy is added to the MSE minimization.For example, when approximating a 4th degree polynomial function using sufficiently many sample points in a dimension low enough so that Eq. ( 19) can be computed, a 4th degree least squares approach is exact.Regarding the effect of gradients, observe in Figs.2b and 3g how gradient-enhanced least squares have a more stable response than LS which only uses function values.

Generalized Least Squares (GLS)
Initially introduced for addressing the uncertainties and correlations in measured responses, generalized least squares (GLS) follow the same logic as the previous LS and GradLS models except that weights are introduced in the error, MSE.The generalized least squares error which incorporates gradient information now reads [75,[78][79][80], where W s and W gs are positive definite weight matri- ces.The minimization of the error leads to the regression coefficients, where F and y g are the same as in the GradLS approach (see above) and W g = diag W s W gs .The weighted least squares (WLS) [82] approach is a special case of the generalized least squares (GLS) where W g is a diagonal matrix.Note that Eq. ( 26) encompasses traditional GLS without gradients by setting W gs = 0.
In traditional GLS (models without gradients), the definition of the weight matrix W s depends on the context of the study: -If no a priori information on the covariance structure is available, the weights can come from a chosen weighting function R(): be such that W s is positive definite, a condition shared with kernels and further discussed in Sect.6. -If a covariance structure is known: and GLS degenerates into WLS. ( The geometrical interpretation of gradient-enhanced GLS is similar to that of GradLS made in the previous Section, the only difference being that the projection of the vector of observations onto the space spanned by the regression functions is no longer orthogonal but oblique, following the null space of the projection matrix F ̂ , ̂ given by Eq. ( 26).
In [78], normalization methods are proposed for calculating the weight matrices of gradient-enhanced GLS: • A standard normalization of responses and gradients where The coefficients i and i are meant to balance the influence of the derivatives and responses at each sample point, respectively.i are normalization coefficients calculated as In this case, W s contains n s non-null terms and the diagonal of W gs contains n s blocks of n p terms, i i i .
• A normalization using logarithmic derivatives where W s is like that of the standard normalization above and Fig. 2 Response-only and gradient-enhanced least squares (LS and GradLS) with polynomials of degrees (d  The k coefficients, which are further described in [78], are based on the logarithmic derivatives introduced in [85].i and i have the same expressions as in the standard normalization.To close the presentation on gradient-enhanced generalized least squares, following [86], we show how the approach can be looked at as a kernel-based method.This comment uses explanations given in Sect.8 so that readers not familiar with kernels as covariances of Gaussian processes may wish to read that Section first.The kernel is the covariance between two responses at different locations when the responses are considered as random processes, This relation is the expression of the kernel associated to the gradient-enhanced GLS.It is obtained assuming that the responses are centered (i.e., [Y(x)] = 0) and the weight matrix is the inverse covariance of the responses and their (31 (i) GradLS second derivative Fig. 3 Rosenbrock's function, response-only and gradient-enhanced least squares approximations (LS and GradLS) from polynomials of degree 9 built using n s = 25 sample points derivatives, W −1 g = g T g .It can then be checked that by using this kernel in the general prediction equation of kriging [with a null trend, see Table 6, or equivalently the GRBF prediction formula Eq. ( 72)], one gets back the GLS prediction formula, ỹ x (0) = f x (0) ̂ with ̂ given by Eq. ( 26).

Moving Least Squares (MLS)
Classical response surface methods like LS, GradLS or GLS approximate functions by combining once and for all a priori functions, f i () , i = 1, … , n t , that are globally defined throughout the design space.When it is not possible to decide beforehand which functions to combine, as it is the case when the function is expected to have local variations, it can be useful to proceed with local approximations.For example, it was proposed in [87] to apply the classical RSM in neighborhoods of the points of interest.moving least squares (MLS) [88] is a generalization of GLS that builds a different metamodel at each x (0) : The difference with previous approximations lies in the non constant regression coefficients ̂ x (0) [compare Eqs. ( 19) and (33)].Like other least squares techniques, the gradientbased MLS (also designated as Hermite version of MLS) [76,77] is built by minimizing an error function, which here is The weights w ij x (0) and w ijlk x (0) depend of the location of x (0) .These coefficients have the following properties: where h() and h kl () are weight functions.Although different weight functions could be chosen for the responses and gradients, the simplest solution is to take [76]).is a coefficient for managing the influence of the derivatives.= 1 leads to a MLS approximation without gradients.The matrix , where W Ms x (0) and W Mgs x (0) are n s × n s and n s n p × n s n p matrices, respectively: The weight functions are non-negative piecewise functions chosen among the non-exhaustive list provided in Table 3.
Finally, the MLS surrogate value at a non-sampled point x (0) is given by Eq. ( 33) where the coefficients ̂ x (0) are obtained by minimizing the weighted mean squares error of Eq. (35).Because the computation of these coefficients has to be done at each requested new point, MLS are computationally more expensive than other least squares techniques.

Shepard Weighting Function (IDW)
Also designated as inverse distance weighting method (IDW), the Shepard weighting method was introduced in [91].The gradient-enhanced version of [46] is based on the modified Shepard Weighting method of [74].The IDW approximation to the function is written as local linear combinations of local approximations to the true function around point x (i) , Q i ().Initially chosen as a quadratic function in [74], Q i () are taken here as the first order Taylor approximation at the sampled point x (i) for the gradient-enhanced version of IDW [43,46].
The IDW metamodel is formulated as, The relative weights, are made of the inverse distance functions, , where ∀d ∈ ℝ, (d) + = max(0, d), and R w is a radius of influence around x (j) .The weight functions W j are such that Q j (x) has an influence on the approximation only in a (hyper)sphere of center x (j) and radius R w .R w is set so that the hypershere includes N w sample points.A discussion on R w and N w can be found in [74].
The weight functions of Eqs. ( 41) and ( 42) have the following properties: The function Q j (x) is a first order Taylor approximation of y at x (j) , The IDW approximation interpolates responses and gradients of the actual function at the sample points.To prove it, the IDW prediction and its derivatives are now calculated at the sample points: because, The IDW metamodel bears similarities to the kernel methods of Sects.6, 7, 8, 9: W j (x) is a double input function that grows with proximity between x and x (j) ; In IDW, W j (x) is multiplied with response estimates (the Q j ()'s) in a way that is reminiscent of kriging, cf.GKRG in Table 2.Note also (45) ∀i x l = 0.
that, when compared to the other metamodels reviewed in this paper, IDW is the only approach that neither requires the inversion of large (n s (n p + 1) by n s (n p + 1)) systems of linear equations nor the resolution of optimization problems as GSVR will.For this reason, IDW is computationally inexpensive.We now turn to the already mentioned kernel methods.

Kernel Functions for Gradient-Enhanced Kernel-Based Metamodels
Most kernel-based metamodels have been developped in the field of machine learning.While support vector machines are arguably the most well-known, other approximation techniques belong to kernel-based techniques.In this article, we will focus on radial basis functions (see Sect. 7), Kriging (see Sect. 8) and support vector regression (see Sect. 9).These three surrogate models, like all kernel-based metamodels, require choosing a kernel function or kernel, , which measures a similarity, x (i) , x (j) , between any two points x (i) and x (j) , and is therefore a dou- ble input function.Kernel functions are examples of the functions B() of the general metamodel framework, Eq. ( 8).
As will be done in Sect.8 about kriging, one can look at the responses at each point x as a random process, Y(x).With this point of view, since a kernel is a similar- ity measure, it is natural to define a kernel as the correlation between the responses at different locations, Kernels must satisfy Mercer's conditions [92] which means that they must be continuous, symmetric and positive definite, a necessary condition for correlation functions.This is most easily done by taking the kernel function in a list of known Mercer's kernels [86,92,93].
In the case of gradient-enhanced approximations, a great simplification comes from the fact that the kernels involving gradients are deduced from the kernel involving only the responses: the correlation functions between a response and a gradient is the derivative of the kernel and the correlation between two gradients is the second derivative of the correlation, cf.Eq. ( 92).
An additional condition on the kernel functions has then to be satisfied: the kernels used in gradient-enhanced metamodels must be twice differentiable.
Multidimensional kernel functions are usually built from unidimensional kernels h by taking the product, where is the vector of the kernel internal parameters.In the above formula, we have introduced the stationarity assumption that the similarity between two points depends only on the vector separating them and not on where they are located, h(x (i)  k , x The sign of r is kept to simplify the calculations of the kernel derivatives.For gradient-enhanced metamodels, common twice (48) differentiable kernel functions are summarized in Table 4 (see for example [62,63,94,95]).Introduced by Stein [96] in the context of approximation, the Matérn class [97] of kernels have parameters that make them highly adjustable.Matérn kernels use a modified Bessel function of the second kind K normalized by a Gamma func- tion ( ).Thanks to the parameter , the smoothness of the kernel function can be accurately controlled.Matérn functions and their derivatives for 3 values of and = 0.8 are plotted in Fig. 4. In practice, two specific values of leads to the most often used Matérn 3/2 and Matérn 5/2 ( = 3∕2 or 5/2) functions.Figures 5b, c show these functions and their derivatives for 3 values of the parameter .They can be compared with the squared exponential kernel presented on Fig. 5a.The Matérn function is ⌈ ⌉ times differentiable [96]  (where ⌈•⌉ denotes the ceiling function).A stronger result is that the second derivative is continuous in 0 and its asymptotic value is [98] because the k-th derivative of the metamodel exists if the k + 1-th derivative of the kernel at 0 exists and is finite ( [86] for Gaussian Processes), the Matérn function with  > 1 can be used for building gradient-based (k = 1) meta- models.This assessment confirms the validity of the choice ≥ 3∕2 proposed in [62, 94, 95, 99].The squared expo- nential kernel has a very simple expression and is often encountered in practice.It should be noted that it yields extremely smooth metamodels: it is infinitely differentiable at r = 0 and so are the associated surrogates.Such smooth- ness is often not representative of the true function and, worse, it causes ill-conditioning of matrices in radial basis functions and kriging (cf.Sects.7 and 8).This is the reason why Matérn kernels should generally be preferred.
The implementation of multidimensionnal kernel functions and their first and second derivatives can lead to a (49) complicated and time consuming code.In order to improve both aspects [99] has proposed the following formulation: The derivatives can then be computed as shown below where only the derivatives of the unidimensional correlation function are needed,

Gradient-Enhanced Radial Basis Function (GRBF)
Gradient-enhanced radial basis function (GRBF) has also been designated as Hermite-Birkhoff or Hermite interpolation [64].This method was introduced in the more global context of artificial neural networks [65,66] and it was used for dealing with optimization problems involving expensive Fig. 4 Matérn function for = 0.8 and 3 values of ν ν ν An Overview of Gradient-Enhanced Metamodels with Applications 1 solvers in the context of computational fluid dynamics [46,67] and assembly design [56][57][58][59].

Building Process
The principle of GRBF is similar to that of classical RBF approach [100][101][102] with an extended basis of functions.The added functions are chosen as the derivatives of the radial basis functions .Thus, the GRBF approximation reads, (55) Fig. 5 Examples of several kernel functions recommended for gradient-based metamodels where Only one half of the first derivatives needs to be calculated because they are odd functions: The second derivatives of the radial basis functions will be denoted The GRBF building process consists in the determination of the w ij 's coefficients by ensuring that the GRBF approxi- mation interpolates the responses and gradients of the actual function at the sample points: (56) Equations ( 60) and ( 61) lead to the following matrix formulation: The vectors w g and y g contain the RBF coefficients and the responses and gradients of the actual function, respectively.The matrix g is built from the classical RBF matrix and the first and second derivatives of the radial basis functions matrices, denoted d and dd , respectively: The sizes of the , d and dd matrices are n s × n s , n s n p × n s and n s n p × n s n p , respectively.So, matrix g con- tains n s (1 + n p ) × n s (1 + n p ) terms.The other terms in Eq. ( 62) are The determination of the w ij 's finally consists in the inversion of the g matrix.This square symmetrical matrix is larger than the of the classical RBF approach.
In order to reduce the computation time, LU or Cholesky factorisation of the g matrix can be used.
Finally, the derivatives of the GRBF can be easily calculated by deriving Eq. ( 55).Fig. 6 RBF, Indirect RBF (inRBF) and GRBF approximations of the unidimensional analytical function y(x) = exp(−x∕10) cos(x) + x∕10.n s = 6 sample points, squared exponential function Figures 6,7 and 8 provide illustrations on analytical functions.In the figures, the points have been generated by an Improved Hypercube Sampling technique, IHS [103].An indirect version of the gradient-enhanced RBF is proposed in 1D.This approach works in unidimensional problems but it becomes unstable as the number of sample points and the number of parameters of the problem increase.

RBF Kernels and Conditioning
Many radial basis functions have been proposed in the literature (see for example [104]) and they can be completed by the kernels presented in Sect.6.In the case of gradient-based RBF, the kernels must be at least twice differentiable to comply with the expressions of Eqs.(60) and (61).Thus, Matérn or squared exponential kernels can be used in GRBF.The matrix made of 0, 1 and 2nd order derivatives g is guaranteed to be positive definitite as will be explained in Sect.8.3 about the C c matrix which has the same form.However, the conditioning of the matrix may be bad.As already discussed in Sect.6, the squared exponential kernel is likely to yield an ill-conditioned g matrix, an issue that can be addressed through any of the following techniques: use more distant sampled points or equivalently decrease the value of the internal parameters ; replace the squared exponential kernel with a Matérn kernel.Another solution can be to add a very small value (with an order fo magnitude of the machine accuracy) to the diagonal of the cos(x 1 ) + 1, IHS sampling with n s = 20, Matérn 3/2 kernel function g matrix.In this case the GRBF approximation will not interpolate the responses and gradients at the sample point.

Estimation of Parameters
The internal parameters of the RBF metamodel can be determined by minimizing the leave-one-out error (LOO) with respect to = ( i ) 1≤i≤n p (and in the case of the Matérn kernel).Based on the principle of Cross-validation [105,106], the classical LOO error is detailed hereafter where ỹ−i (x (i) ) is the RBF approximation at point x (i) without taking into account the response and the gradient of the actual function at that sample point x (i) : Bompard et al. [107] propose an extended LOO criterion by adding the derivatives information:  where is the approximation of the derivative provided by the metamodel built without taking into account the true k-th component of the gradient at point x (i) .The approxima- tions ỹ−i and ỹ −i,k x k are therefore obtained through Eq. ( 60) and partial use of (61) when a gradient is omitted from the LOO error.In order to avoid the building of numerous metamodels associated to each value of the internal parameters, an efficient way for computing the LOO was proposed in [108] and extended to GRBF in [107].It implies estimating the LOO criterion by inverting the kernel matrix once and for all and calculating a vector product instead of completely building the metamodel each time a data is removed.Finally, due to the multimodality of the LOO, a global optimizer has to be used such as a stochastic optimizer (e.g., an evolution strategy or a particle swarm algorithm).

Variance of a Stochastic Process Obtained with GRBF
As an extension to an idea given in [29], Bompard [54] proposes to look at the deterministic response y as an instance of a stationary Gaussian stochastic process Y whose corre- lation is given by the GRBF kernel and whose constant variance is . This allows to describe the mean and variance of the GRBF prediction.Let x (0) be the vector containing the evaluations and first derivatives of the RBF kernels 0i,j [from Eq. (55)] at the new point x (0) .By solving Eq. ( 62) for the weights, the GRBF estimation is expressed as a linear combination of the true responses and their derivatives, A mean and variance expressions are then calculated in a manner similar to kriging: The expression for the mean makes use of the further specification of the observations at the sample points, Y g = y g , i.e., one considers the conditional process Y knowing the observations at the sample points.
This variance could be used in infill criteria such as the expected improvement [33].Unfortunately, as was said earlier, this variance calculation will often fail due to loss of positiveness of the GRBF matrix g unless specific measures are undertaken.

Gradient-Enhanced Cokriging (GKRG)
Kriging, an alternative name for conditional Gaussian Processes, is today one of the main techniques for approximating functions and optimizing expensive to calculate functions.Cokriging is an extension of kriging for dealing with several correlated functions.Initially introduced for geostatistics [34,35], many works focus on the assumptions, principles and formulations of cokriging [36][37][38]42].Gradientbased cokriging was introduced by Morris et al. [39] as a way to account for gradient information in kriging, and has since then been applied to many fields.Table 5 summarizes the references and the kind of applications that concern gradient-enhanced cokriging.Because the concepts underlying gradient-enhanced cokriging have received various names, the last column of the Table lists the original keywords employed by the cited authors.It is seen that gradient-enhanced cokriging has been largely used in the context of fluid problems.The efforts made to calculate gradients in fluid simulations explain this observation.

Formulation of Gradient-Enhanced Cokriging
Gradient-enhanced cokriging is very similar to the classical kriging approach.Random processes associated with the deterministic objective function and its gradients are first defined through the primary response, Y, and the n p auxil- iary responses, W i [35]: In the particular case of gradient-enhanced cokriging, the auxiliary responses W i correspond to the components of the gradient: As in regular kriging, i and Z i represent, for each response, the deterministic trends and the fluctuations around the trends: All Z i 's are centered stationary Gaussian Processes.As in usual kriging, the covariance of Z 0 is a function of a (78) 2 generalized distance among the sample points.Some other cross-covariance relations have to be introduced for the auxiliary variables.These covariances and cross-covariances are defined in Sects.8.3 and 8.5.
The trend models i can be chosen independently of one another [111] and this choice leads to different kinds of (co)kriging (simple when i is a known constant, ordinary when i is an unknown constant and universal in the general case that it is both unknown and a function of x).In this paper, the universal cokriging model where the trend is built using polynomial regression will be detailed, see Eq. (81).In order to limit the amount of required inputs, Initial developments for taking into account gradients, application to water flow through a borehole Bayesian prediction using derivatives, Gaussian process [40] Approximation of functions using gradients Kriging to model gradients [41] Theoretical developments and application to analytical functions First order kriging [43,44] Gradient-based metamodel for minimizing the drag of an aerofoil (CFD) Direct and indirect cokriging [45] Gradient-based cokriging for optimization, infill strategy and application to structural optimization Kriging model including derivative information [46] Comparison with other gradient-enhanced metamodels and application to fluids Kriging, Gaussian process including derivative [21] Developments for using gradient and hessian information (code available) Gradient-/Hessian-enhanced kriging [47] Application to aerodynamic optimization Cokriging, gradient-enhanced kriging [48] Comparison of kriging with and without gradient, infill strategy and application to aerodynamic optimization Cokriging [

49] Uncertainty quantification and application to analytical and CFD examples
Gradient-enhanced kriging (GEK) [50] Application to aircraft aerodynamic shape optimization Gradient-based kriging (GBK) [51] Uncertainty quantification, approximation quality of analytical functions and application to design of nuclear plants Gradient-enhanced universal kriging (GEUK) [52] Application to structural and aerodynamic optimization with multi-fidelity approach Cokriging [53] Taking into account gradient and hessian information, application to analytical functions and aerodynamic optimization problems Gradient/Hessian-enhanced firect/indirect kriging (GEK) [55] Approximation quality of analytical functions, uncertainty quantification of a transonic aerofoil Cokriging, gradient and Hessian enhanced kriging [54,107] Comparison of gradient-based metamodels, application to analytical functions and CFD problems for shape optimization Direct/indirect co-kriging [60] Multi-fidelity approach to aerofoil design Direct gradient-enhanced kriging (GEK) [61] Comparison between direct and indirect gradient-based kriging using an analytical function and airfoil model.Study of the internal parameters estimation by likelihood maximization.
Direct/indirect gradient-enhanced kriging [56-59, 95, 109] Application to assembly design (nonlinear structural problems), comparison with gradient-based RBF, comparison with multifidelity method Gradient-enhanced/-based cokriging [62,63,110] Gradient-enhanced kriging with and without multi-fidelity models Gradient-enhanced kriging the trend models of the auxiliary responses, i , i ∈ [[1, n p ]], will be obtained by differentiation of the primary response trend, 0 [see [39] and Eq. ( 82)]. where The best linear unbiased predictor (BLUP) of the response using both primary and auxiliary responses makes the cokriging model [35].This predictor is a linear combination of deterministic functions called ()'s and the evalu- ations of primary and auxiliary responses at the sample points: The functions () are evaluated by minimizing the vari- ance of the estimation error  x (0) = Ŷ(x (0) ) − Y x (0) while accounting for the unbiasedness condition.Finally, the expressions of the cokriging prediction and variance are obtained.These steps are further explained in the next sections.

No Bias Condition
The condition for the cokriging estimator to be unbiased is ( 81) (84) Inserting the expression of the trend [Eqs.( 81) and ( 82)] leads to with Equation ( 85) can be further condensed after a simplification with respect to : It should be remembered that c and f 0 depend on the non-sampled point x (0) .For simplicity's sake the functions () have and will been written without specifying that they are defined only at the non-sampled point x (0) .

Formulation of the Variance
The variance of the cokriging error estimate is ( 85) The following notations are introduced for simplifying the covariances: Now the variance of the cokriging error estimation can be written in matrix notation, where is the cokriging covariance/crosscovariance matrix.It is composed of the classical kriging covariance matrix C, the cross-covariance matrix C WY made of covariances between primary and auxiliary responses and the cross-covariance matrix C WW between the auxiliary responses.Using the notations introduced in Eq. ( 87), these matrices are defined as: The global cokriging covariance matrix C c obtained is symmetric and contains n s (n p + 1) rows and columns.c 0c is the vector of covariances and cross-covariances between the sampled and any non-sampled points and it is expressed since a variance is always positive.The matrix g of GRBF [see Eq. ( 63)] is also positive definite because it has the same structure and is made of the same kernels.Above, positive definitness is not strict so that bad conditionning and even non invertibility may happen (e.g., when two sample points are identical).

Constrained Optimization Problem for Cokriging Building
Using the notations introduced in Eqs. ( 86) and ( 88), a cokriging model is built by solving the following constrained optimization problem.
Problem 1 (Universal cokriging) Find c ∈ ℝ n s (n p +1) that minimizes Universal kriging and universal cokriging lead to the same constrained optimization problem.In the case of cokriging however, additional cross-covariances are taken into account.This constrained optimization problem is solved by the lagrangian technique which yields the following expressions for cokriging prediction and variance: Like usual kriging, cokriging interpolates the responses at the data points by having the prediction equal to the response and the variance null there.The proof of this property is based on c 0c being equal to the ith column of C c at x (0) = x (i) .
Simple and ordinary cokriging can be easily deduced from the previous equations by considering 0 (x) = m where m is a known real or 0 (x) = where is an unknow real.In both cases and according to Eq. ( 82), where 1 n s and 0 n s ×n p are matrices containing n s 1's and n s n p 0's, respectively.

Covariance Structure
The most critical choice when creating a cokriging model is that of the covariance functions.In applications such as geostatistics (see for instance [35]), this choice can be governed by expert information.In the more general context of computer experiments, there is a wide range of covariance functions to choose from.However, noting that covariance functions are kernel functions such as introduced in Sect.6, multidimensional kernels can be formed by multiplying unidimensional kernels.Continuing this strategy for gradient-enhanced cokriging, Morris et al. [39] have proposed a general form for the cross-covariance relations: where ) is a unidimensional correlation function depend- ing on the real r and the correlation length and h (k) is its k-th derivative.Readers can note that kernels are even functions but their first derivative are odd, cf. for example Fig. 5. Therefore, referring to the covariance notation of Eq. ( 87), the following relation is found: More generally, in the case of gradient-enhanced cokriging, the covariances satisfy, where k , and � x (i) , x (j) ; In the literature, mainly squared exponential functions have been used for building kriging and cokriging approximations.Recently, many works [62,86,95,96] have focused on Matérn [97] covariances, in particular Matérn 2 [62,95].Similarly to RBF and GRBF approximations, Matérn kernels improve the condition number of the covariance matrix, therefore improving the stability of the method.
With the product covariances introduced [see Eq. ( 92)], the process variance can be factored out of the different covariances in Eq. ( 89):

Summary of Cokriging Formulations and First Illustrations
Table 6 summarizes the different cokriging formulations which look similar to kriging formulations with an extended definition of the correlation matrix and vector.If we only consider the metamodel predictions and not its variance or internal parameter learning, the functional forms of simple cokriging without trend and gradient-enhanced radial basis functions are identical [compare Eq. ( 71) with SCK where m = 0 in Table 6].
(   The Figs. 9, 10 and 11 illustrate how kriging, indirect kriging (the principle of indirect gradient-enhanced metamodels is presented in Sect.3) and ordinary cokriging approximate one and two-dimensional functions.The indirect version of the gradient-enhanced cokriging is only proposed in 1D, in Fig. 9, where it can be seen that it yields very accurate results (the line cannot be visually separated from the true function on the plot).The Figs. 9, 10 and 11 show that, like for RBF approximation, the use of the gradient information improves the approximation of the analytical function, in particular for multimodal functions such as the Six-hump Camel Back in Fig. 11.
Figure 12 shows confidence intervals calculated with the predictions and the variances of ordinary kriging and cokriging.Remark that the use of the gradients reduces the approximation uncertainty.
When compared to GRBF, GKRG has the same covariance structure: C c is the same matrix as g .Without trend and when the kernels are the same, the GRBF approximation of Eq. ( 72) is the same as that of GKRG (cf.Table 6).Differences arise because of the trend and the way the internal parameters are tuned.As a result, as will be observed in Sect.10, gradient-enhanced cokriging and GRBF have very similar performances with a slight advantage for the cokriging.

Derivatives of the Cokriging Approximation
Derivatives of the cokriging approximation to the response,  Ŷ x i x (0) , i = 1, … , n p , can be obtained in two equivalent ways.Firstly, Eq. ( 83) can be differentiated with respect to x i which means taking the derivatives of the functions ().Substituting the expression of the ()'s amounts to differ- entiating the correlation vectors r 0c (and the trend func- tions f 0 for universal cokriging) in the expressions for the approximation ỹ() given in Table 6.To do so, the second derivatives of the kernel functions, which appear in the derivatives of r 0c , are needed.The choice of the kernel must be adapted to this goal: squared exponential or Matérn (with  > 1) kernels are appropriate.It is remark- able that the second derivatives of the kernel functions were already required in the making of the cross-covariance matrix C WW , so approximating the derivative does not add requirements to the kernels.
Secondly and in an equivalent manner, the cokriging equations for predicting the response derivatives,  Ŷ x i x (0) , can be obtained following the same path as that followed for the response: the cokriging estimate to the derivative is written as a linear combination of both the responses and their derivatives at sample points like in Eq. ( 83); the no bias condition of Eq. ( 84) is replaced by a no bias on the derivatives, and results in a relation like Eq. ( 86) with similarly, the variance minimized is that of the error between derivatives, Var  Ŷ x i x (0) − Y x i x (0) , leading to an equation identical to Eq. ( 88) but c 0c is replaced by the vector . Therefore, the cokriging models summarized in Table 6 provide models for the derivatives by just differentiating the trend and the correlation vectors.As a result in these (differentiated) models, the kriging interpolation property also applies to the derivatives.A notable feature of such gradient-enhanced cokriging is that the uncertainty of the estimated response derivative is also calculated [99].This property was not used in previous works but it should turn out to be useful in the context of uncertainty quantification or reliability-based optimization.

Estimation of the Cokriging Parameters
As in the case of kriging, the estimation of the cokriging parameters, i , Y and (and for the general Matérn kernel), can be achieved using Leave-One-Out or Maximum likelihood techniques.Leave-one-out (LOO) was already introduced for GRBF in Sect.7.3 and has also been applied to Gradient-Based cokriging (see for example [54]).The Maximum likelihood approach [112] is made possible by the probabilistic interpretation of cokriging and more common than LOO.Maximum likelihood estimation operates by maximizing the following likelihood function (or minimizing the opposite of its log): At a given , L() can be analytically maximized over and 2 Y which yields the expression of their estimates: The correlation lengths i are obtained by a numerically minimizing the following expression which is the relevant part of minus the log-likelihood where ̂ and σ2 Y have been input: Because () is multimodal, it is essential to perform the minimization with a global optimization algorithm [113]: for example, a stochastic optimizer such as the Particle Swarm Optimizer can be employed [114].In order to reduce the number of optimization iterations, the gradient of the likelihood function is sometimes calculated and accounted for in the optimization [62,115].During the numerical optimization for finding the correlation lengths, , the correlation matrix K c has to be rebuilt, factorized and inverted at each iteration, which goes along with a noticeable computational cost.However, in most practical situations where metamodels are called in, the objective function relies on numerical simulation such as nonlinear finite elements and remains much more costly than the metamodel.Furthermore, the gain in accuracy of the gradientbased approximations allows in many cases to contain the computational time by reducing the necessary number of sample points [94,95,116].

Gradient-Enhanced Support Vector Regression (GSVR)
Support vector regression (SVR) is a nonlinear regression method that appeared within the framework of statistical learning theory.It is an extension of the support vector machines (SVMs) originally designed for nonlinear classification [117] and pattern recognition [118].
The literature on SVR is already rich and general introductions may be found in [119][120][121].Initially built for learning from function responses at sample points, many extensions of SVR to additionally account for derivatives have been proposed.In compliance with the rest of the text, we shall call them gradient-enhanced SVR or GSVR.Initially introduced in [68] with an iteratively re-weighted least squares procedure, GSVR has then been revisited, again with a least squares approach in [70], with regularized least squares in [71], and by the Twin SVR technique in [69,73].A general framework for incorporating prior knowledge in SVR which has been applied to function derivatives was also put forward in [72].More recently, GSVR has been applied to shape optimization in CFD problems [54].

Building Procedure
We now present the method introduced by [68] and applied in [54].The approximation is built from a linear combination of the basis functions i () and their deriva- tives (all of which are independent of the observations) ( 98) added to a constant trend term .The 's are the coeffi- cients of the combination, and will be adjusted using the observed responses y g : where and g x (0) contain the n s × (n p + 1) coefficients and evaluations of the basis function and its derivatives at the sample points, respectively.At this point, the expression of the approximation ỹ x (0) is the same as that in any least squares approach, cf.Eq. 19 for example with and ̂ , and g x (0) and f x (0) playing the same roles, respectively.However, the coefficients will be calculated through a different approach and the a priori functions i,j (x) will never be used as such but will always occur in products and hence they will be indirectly specified through a kernel and its derivatives, cf.Sect.9.2.
Support vector regression seeks to approximate the function responses, y x (i) , within a 0 accuracy and, addi- tionally, GSVR requires the derivatives, y x k x (i) , to be approximated within k accuracy.The SVR approximation is made more stable to changes in data by minimizing the vector norm ‖ ‖ 2 (cf.[117,120] for explanations on how reducing ‖ ‖ 2 makes the approximation less flexible, there- fore more stable).These considerations lead the constrained convex quadratic optimization Problem 2 where + , − , + and − are slack variables on the accuracies for avoiding problems with no feasible solution: Problem 2 (GSVR as a minimization problem) Find , , + , − , + , − 1 that minimize subject to ( 99) 1 Recall that bold notations designate vectors.For example, The hyper-parameters of the method, k , k = 0, … , n p , are user-defined penalty parameters that control the tradeoff between approximation regularity (low ‖ ‖ 2 ) and approximation accuracy in response and derivatives of the response.Geometrically, the constraints on accuracy are tubes of half-width i in the space of responses and derivatives outside of which the GSVR criterion is subject to a linear loss at a rate determined by the hyper-parameters k .
Problem 2 can be rewritten as a saddle-point problem involving a Lagrangian and positive Lagrange multipliers
Problem 4 (GSVR as a Convex Constrained Quadratic Optimization) Find + , − , + , − that minimize subject to The vectors y s and y gs contain the responses and the deriv- atives of the actual function at the sample points, respectively.is made of the k values (∀k ∈ [[1, n p ]]).The matrices r , rd and dd consist of the evaluations and derivatives of the kernel function (see Sect. 9.2), and designates the vector of k ∕n s .
Responses or derivatives that are inside their k tube, that is, responses and derivatives for which the accuracy constraints are satisfied, do not impact the solution of any of .
the above problems and could be removed altogether from the metamodel building without changing the result.This is because the Lagrange multipliers associated to these points are both (upper and lower limit) null.On the opposite, responses or derivatives at points that have one non-zero Lagrange multiplier influence the metamodel and are called support vectors.The dual variables ± and ± are determined by solving the Constrained Quadratic Optimization Problem 4. Classical Quadratic Programming algorithms [120] such as the Interior Point algorithm can be applied.For dealing with large number of sample points and parameters, dedicated algorithms, such as sequential minimal optimization [122], are preferable.In order to reduce the computational cost of GSVR, a few works introduce new formulations:

Kernel Functions
Problem 4 has the variables x (i) involved only through products of () and their derivatives.In SVR also, a ker- nel is defined as the inner product  x, x � = (x) ⊤ x � .
The "kernel trick" consists in not explicitely giving () but directly working with the kernel (, ).As was already stated in Sect.6, any function with two inputs cannot be a kernel, it has to satisfy the Mercer's conditions (see [120]) in order to be continuous, symmetric and positive definite.In the case of GSVR, the basis functions intervene in the following products: Therefore, in addition to the Mercer's conditions, a kernel used in GSVR must be twice differentiable.Again, like in GRBF and GKRG, squared exponential or Matérn ( > 1) functions can be used as kernels for GSVR (a list of kernels has been given in Sect.6).With the notations introduced in Eqs. ( 108)-( 111), the matrices present in Problem 4 can now be detailed: The sizes of matrices r , rd and dd are n s × n s , n s × n s n p and n s n p × n s n p , respectively.The full matrix of kernel functions and their derivatives at the sample points in Problem 4 is square and contains 2n s (1 + n p ) rows.

Evaluating the GSVR Metamodel
Solving the Convex Constrained Quadratic problem for ± and ± allows to calculate the 's from Eq. ( 101).The GSVR response estimate at a new point x (0) is then given by: where x (0) and d x (0) are the vectors of kernels functions x (i) , x (0) and their derivatives, respectively.The derivative of the approximation given by the GSVR metamodel is obtained by differentiating Eq. (115).To be able to do this, the kernel function must be at least twice differentiable.The trend term, , has not been calculated yet.Its value stems from the Karush-Kuhn-Tucker conditions for the Convex Constrained Problem 4: at a solution, the products between the dual variables and the associated constraints vanish:  116)-( 117) and ( 120)-( 121) are the same as those for the classical (non-gradient based) SVR for which the following conclusions hold: -from Eqs. ( 120) and (121), either 0 ∕n s − ±(i) = 0 and  ±(i) > 0, or ±(i) = 0 and  0 ∕n s >  ±(i) .-from Eq. ( 101) for , Eqs. ( 116) and (117), and because ỹ x (i) cannot be below and above y x (i) at the same time so that, in terms of the dual variables +(i) −(i) = 0, can be calculated: The above bounds on ±(i) are enforced as constraints of the quadratic optimization problem.

Gradient-Enhanced -SVR
The classical GSVR discussed so far requires choosing k and k (k ∈ [[0, n p ]]), and is sometimes called k -GSVR.k is typically taken as the standard deviation of the noise of the response data and its derivatives.Often though, ( 116) or there is no prior knowledge on an eventual noise on the response.Furthermore, if the k 's are taken small, there will be many non-zero Lagrange multipliers (i.e., among + , − , + , − ), in other terms there will be many sup- port vectors, and the SVR model will lose some of its "sparsity" in the sense that the ability to drop some of the terms when evaluating the metamodel [Eq.( 115)] will decrease.
-GSVR is an alternative support vector regression model where the k 's are no longer given but calculated.
-GSVR uses new scalars, k ∈ [0, 1] n p +1 , which act as upperbounds on the proportion of points that will be support vectors.This approach is inherited from -SVM (support vector machines [123,124]), and has been compared with classical -SVR in [125].The larger the k 's, the more the approximation is required to approach the data points and made flexible (or "complex" in an information theory sense).A -GSVR model solves the following optimization problem (compare with Problem 2): Problem 5 (GSVR as optimization problem) Find ( , , + , − , + , − ,2 ) that minimize subject to This problem is solved with a Lagrangian approach in a manner similar to that of k -GSVR [54].The n p + 1 new constraints on the positivity of the k 's induce n p + 1 new Lagrange multipliers.The resulting quadratic dual optimization problem is similar to that given as Problem 4 with the additional Lagrange multipliers.The -GSVR developped here has been implemented in the GRENAT Toolbox [12].
Figures 13 shows approximations of an analytical unidimensional function using -SVR, -GSVR and their derivatives [obtained by differentiating Eq. ( 115)].The filled areas correspond to the -tube of both approximations.Just as for the previous GRBF and GKRG metamodels, the gradientenhanced -GSVR exhibits more accurate approximations than -SVR does.

Tuning GSVR Parameters
The GSVR model involves the same parameters are the version without gradients, that is the i and i internal parameters, plus the parameters of the kernels .Several works, summarized in [126], discuss how to tune these parameters for non-gradient SVR, either in the form of empirical choices or of methodologies.Both -SVR and -GSVR (see Sect. 9.4) help in choosing the i by replacing them by i , a targeted proportion of points that are support vectors.
Algorithms have been proposed for determining the values of the aforementioned internal parameters using leave-oneout bounds [Eq.( 69)] for support vector regression.Introduced in [127], these bounds have been completed with the Span concept [128] that currently stands as the most accurate bound.A method for the minimization of this bound is described and studied in [129].The gradient of the leave-oneout bound for SVR with respect to the internal parameters has been calculated in [130] and used for tuning the parameters.
Recently, an extension of the Span bound to gradientbased SVR has been proposed in [54]: because the evaluation of the Span bound is computationally expensive, the authors have proposed to estimate the internal parameters of the kernel function as those of a gradient-enhanced RBF.Comparisons of response-only and gradient-enhanced metamodels will be carried out for modeling the 5 and 10 dimensional (n p =5 or 10) Rosenbrock and Schwefel functions which are respectively defined as, Rosenbrock's function has only one basin of attraction but it is located in a long curved valley, that is the variables significantly interact with each other.Schwelfel's function is highly multimodal and, worse, the function is challenging for many surrogates in that the frequency and the amplitude of the sin() function that composes it changes through the design space, making the function non stationary.Schwefel's difficulty is nevertheless limited because it is an additively decomposable and smooth function.
Between n s = 5 and n s = 140 points are generated by Improved Hypercube Sampling [103].Each sampling and metamodel building is repeated 50 times.The global approximation quality of the metamodel is measured by ∀x ∈ [−500, 500] n p , y(x) = 418.9829 computing the mean value and the standard deviation of the R 2 and 3 criteria for the 50 metamodels at n v = 1000 vali- dation points which are different from the sample points.These metamodel quality criteria are now defined.
As usual, the symbol denotes the average.R 2 Pearson's correlation coefficient, measures how well the surrogate predictions are correlated to the true responses.The closer R 2 is to 1, the better.3 is a normalized leave-one-out crite- rion [cf.Eq. ( 69)]: The closer 3 is to 0, the better the prediction accuracy of the surrogate.The results presented next have been obtained on a computer equipped with an Intel ® Xeon ® E5-2680 v2 processor (20 cores at 2.8 GHz) and 128 Gb of volatile memory (DDR3-1866).The execution times given are CPU times which correspond to the time required for running the computer code on a single processor loaded at 100%.

Comparison of LS and GradLS Models
The response-only and gradient-enhanced least squares metamodels, LS and GradLS, are compared in details when approximating the 3 and 5 dimensional Rosenbrock's functions.The least squares fit are carried out with polynomials of degrees d • = 1 to 10. Figures 15 and 16 show the results in terms of R 2 and 3 (mean and standard deviation) for the 3 dimensional function and Figs. 17 and 18 show the results for the 5 dimensional function.The approximation quality improves as the mean of R 2 increases and its standard deviation simultaneously decreases or, as the mean of 3 decreases and its standard deviation simultaneously decreases.In order to help understanding the outcome of the experiments, Fig. 14 summarizes both, (i) the number of terms in the polynomials, n t , which is a func- tion of their degree as seen in Eq. (1) and, (ii) the number of equations in the least squares approximations [Eq.(11)] which is equal to n s and n s (n p + 1) for the LS and GradLS models, respectively.The number of polynomial terms are plotted, for each n p separately, with the continuous lines.The number of equations are plotted as marks, blue marks (128) for the LS, and black marks with a dependency on n p for GradLS: the dotted lines give the number of equations in GradLS as a function of n p for each different n s .The Gra- dLS formulation uses more equations than LS does thanks to the gradients.As long as there are more independent equations than terms in the polynomial, the solution (21) to the Mean Squares Error exists and is unique.In this case in our implementation a QR factorization of the F ⊤ F is performed.On the contrary, if the degree of the polynomial is such that there are more polynomial terms than equations, the problem is ill-posed and the matrix F ⊤ F in Eq. ( 21) is no longer invertible.In our implementation of least squares, solution unicity is then recovered by using the Moore-Penrose pseudo-inverse3 of F, written F + , i.e., by solving ̂ = F + y g .The portions of the solid lines that are below the marks associated to each n s indicate the poly- nomial degrees for which there are sufficiently many equations to solve the original least squares problem, in other terms, the polynomials which are fully defined by the data points.
General trends are visible in Figs.  on the mean of 3 , and they are confirmed by R 2 .Not sur- prisingly, the quality of the approximations increases (i.e., the mean of 3 diminishes) with the number of sample points n s ; at a given n s (larger than 5), the approximations improve from polynomial degree d • = 1 up to 4, and then degrade as the degrees of the polynomials go to 10.The explanation is that Rosenbrock's function is a polynomial of degree 4. Below d • = 4, the true function cannot be represented by the approximations.Beyond d • = 4 it can, but higher order terms of the polynomial need to be cancelled, which requires sufficiently many sample points to be accurately done.An estimate of the limit on the number of sample points is n s such that there are as many equations as polynomial terms n t .This limit is seen in Fig. 14: for LS in dimension n p = 3, the lower bound on n s is 35, 84 and 120 for d • = 4, 6, 7; a ridge where 3 suddenly degrades crosses the upper right plot of Fig. 15 and this ridge closely follows these limits on n s .The same estimate (number of polynomial terms equal to number of equations) applied to GradLS yields n s limits n p + 1 smaller than those of LS: in Fig. 16, n   Schwefel's function is not easily approximated with a polynomial.This is noticed in Figs.19 and  Two conclusions can already be drawn from these tests.Firstly, least squares approximations have a high performance domain characterized by a number of equation larger than the number of polynomial terms.This shows that the regularization performed by the pseudo-inverse does not produce as good least squares approximations as additional data points do.Secondly, because the gradientenhanced least squares GradLS require sample sizes n p + 1 smaller than those of classical LS, they have a much larger high performance domain.
In addition, even outside of the high performance domain, it is observed in Figs.15,16,17,18,19 and 20 that, at a given n s and d • , GradLS consistently outperforms LS when approximating Rosenbrock's function.
Figure 21 shows the CPU time needed to calculate the LS and GradLS metamodels as a function of the number of sample points n s for degrees of the polynomial approxi- mation d • ranging from 2 to 10 and in dimensions n p =3, 5 and 10.In 10 dimension, the polynomials are limited to the degrees {2, 3, 4, 5, 6} to keep calculation times reasonable.In both LS and GradLS models, it is observed that the CPU time grows with n s , n p and d • and that the main factors for CPU time increase are n p and d • (a log scale is applied to the CPU axis).The effect of n p and d • is comparable in both metamodels so that the CPU times are close to each other.The CPU time of the gradient-enhanced GradLS grows slightly faster than that of LS with n s and in a man- ner independent of n p and d • , which is mainly noticeable at low CPU times.It therefore turns out that enhancing least squares through the gradients is not as costly as one could fear.The reason is that, for high degree polynomials in high dimensions, the main numerical cost comes from the factorization of the n t × n t matrix F ⊤ F, which has (n 3 t ) complexity, and n t is a function of n p and d • but not of n s [cf.Eq. ( 1)].The additional computation time of GradLS comes from the storage of the (n s (n p + 1) × n t ) matrix F and its product with other matrices.
Note also in Fig. 21 that the geometric evolution of the CPU time sometimes exhibits a discontinuity.For example, for LS, n p = 3 and d • = 6, there is a CPU time step at n s = 84.These discontinuities correspond to the change in least squares algorithms when there are or not sufficient equations for determining the n t polynomial terms: when the problem is well-posed, a QR factorization of F ⊤ F is performed, when the problem is ill-posed, it is the pseudoinverse method that is called.For high degree polynomials the CPU evolution curves are continuous because the problem is always ill-posed and the pseudo-inverse is the only active algorithm.

Comparison of Kernel-Based Models
We now compare kernel-based metamodels that use or do not use gradients.These are variants of the RBF and KRG approaches.The SVR metamodels were not tested because of the large computational cost induced by tuning their internal parameters which does not allow repeated runs.The tested approaches are ordinary kriging (OK), radial basis functions (RBF), both of which do not utilize gradients, Indirect gradient-based ordinary kriging (InOK), Indirect gradient-based RBF (InRBF), gradient-enhanced ordinary cokriging (OCK), and gradient-enhanced RBF (GRBF).All these examples take Matérn 3/2 as kernel function.To sum up, these results illustrate the advantage of direct gradient-enhanced metamodels in approximating non stationary, multimodal functions.They confirm other experiments carried out in the more complete study [57].
Figure 26 shows the CPU time taken for building the kernel-based metamodels for varying number of sample points and in dimensions 3, 5 and 10.SVR and GSVR are omitted in 10 dimensions because they take too much CPU time.
The building process includes the tuning of the metamodels' internal parameters which is performed here with an Particle Swarm Optimizer.
The typical CPU times of the kernel methods in Fig. 26 are larger than those of the least squares methods reported in Fig. 21.Furthermore, kernel methods show a higher sensitivity to the number of sample points n s and a larger CPU penalty for including gradients in the model than least squares do.The main reason for the larger CPU time of kernel methods is the tuning of their internal parameters, which least squares do not do.The higher sensitivity of  GRENAT [12], which stands for gradient enhanced approximation toolbox, is the toolbox that was used for generating all results and plots proposed in this review and in [59,94,95,131,132].GRENAT is written in Matlab/ Octave and follows the object oriented Matlab's syntax.It allows building and exploiting response-only, and indirect and direct gradient-enhanced kriging, radial basis functions and support vector regression.It can be linked to the MultiDOE toolbox [133] which compiles many sampling techniques.
The ooDACE Toolbox [134,135], also developed in object-oriented Matlab, has an implementation of cokriging that could accomodate accounting for gradients.
In addition to mathematical descriptions, Forrester et al. [21] also propose code samples for building gradientenhanced metamodels.

Remarks About Missing Data and Higher Order Derivatives
In keeping with previous works [21,136], the RBF, kriging and SVR metamodels and their gradient-enhanced versions that have been described in the review can readily be adapted for dealing with missing data: hybrid version of these metamodels can be considered by removing responses, components of gradient or full gradient at certain sample points.Components of the vector y g are deleted and the corresponding terms in the linear combinations making the approximations (in generalized least squares, GRBF, GKRG) are removed from the equations.In the case of GSVR, the constraints in the model defining optimization problem for which there is no longer an observation are removed.With IDW, terms of the first order Taylor approximations Q j (x) can be dropped, at the cost of loosing the corresponding gradient interpolation properties.
Deleting observations can even be a choice for minimizing the computational cost needed to build the metamodels and evaluate the responses and/or gradients.For example, when dealing with a function with known flat behavior in a part of the design space and a multimodal behavior in another part, accounting for gradient information is useful only in the latter.On the basis of relations like Eq. ( 26) for least squares, or Eq. ( 62) for radial basis functions, or Eq. ( 89) for cokriging, which all involve the inversion of a n s (n p + 1) by n s (n p + 1) covariance matrix, the computational complexity of each observation is at least cubic: assuming the number of operations required to invert a square matrix is slightly less than cubic, it will be at least cubic when multiplied by the number of repetitions of the inversion required for tuning internal parameters; then, accounting for all the gradients multiplies the complexity of the metamodels by at least (n p + 1) 3 .This metamodel complexity, although non negligible, will typically remain orders of magnitude smaller than the complexity of calculating the true response.
The formulations of gradient-enhanced RBF, cokriging and SVR can be also extended for taking into account higher order derivatives of the objectif function.Examples of formulations of Hessian-enhanced cokriging can be found in [21,53].In the case of SVR, an Hessian formulation may be based on the development proposed in [72]: the "prior knowledge", which is added to the classical SVR formulation (Problem 2 in Sect.9), consists in terms of the Hessian which are accounted for through new constraints.

Conclusions
We have reviewed the main surrogates (or metamodels) for approximating functions observed at a finite number of points that not only use function values but also their gradients.These surrogates are variations around the least squares methods, the Shepard weighting function, radial basis functions, kriging and support vector machines.Indirect methods, where the knowledge of the gradients produces new points to learn from, have also been covered.An effort was made to detail the logic and the formulations that led to these models.To the authors' knowledge, the -SVR formulation with gradients was given here for the first time.Another goal was to compare the metamodels.It was first done theoretically, in particular by casting all metamodels as linear combinations of functions chosen a priori and coefficients that depend on the observations.The comparison between metamodels was then substantiated by simple examples.
These examples, confirming other studies [9,57,132], show that exploiting gradient information is a determining advantage for approximating functions with locally changing behaviors.Including gradients in least squares methods comes at a negligible additional numerical cost.The more versatile kernel-based surrogates pay a numerical cost for also approximating gradients: all methods but Shepard weighting function have a complexity that scales at least with the cube of the number of observations, and each gradient at a point in a space of dimension n p adds n p observations.
The litterature on gradient-enhanced metamodels is recent but already rich.Today, many perspectives should be considered.
From a methodological point of view, there is a need for more robust, numerically less complex approaches that can account for large numbers (say, millions) of data points with their gradients, in higher dimensions (say, thousands).The current kernel methods can approximate a larger family of functions than least squares do, but they would not allow data sets beyond of the order of 1000 points because of bad conditionning issues and because of the rapidly growing number of operations.Beyond 10,000 points, computer memory would be an additional limitation.Recent works on Gaussian Processes have introduced strategies to deal with large number of points [137,138] and high-dimensional problems [139,140].However these approaches remain currently limited to response-only data.
On the applicative side, surrogates that learn and predict gradients should contribute to progress in local and global sensitivity analysis, uncertainty propagation, local trust region and global surrogate-based optimization methods.

Compliance with Ethical Standards
Conflict of interest The authors declare that they have no conflict of interest.

(
a) LS not using gradients

) w g = w 1 …
w n s w 11 w 12 … w 1n p w 21 … w n s n p ⊤ ; (68) y g = y 1 … y n s y 11 y 12 … y 1n p y 21 … y n s n p ⊤ .

140 Fig. 14
Fig.14 Number of terms in the polymonials, n t , as a function of the polynomial degree (solid lines), and number of available equations (blue and black marks for LS and GradLS, respectively) as a function of the number of parameters n p .The number of equations depends on n p only for GradLS: the dotted lines join, for each number of sample points n s , the number of equations for varying n p .(Color figure online)

3 Fig. 15
Fig. 15 Performance of the LS metamodel in approximating the 3 dimensional Rosenbrock's function in terms of the number of sample points (n s ) and the degree of the polynomial (d • )

3 Fig. 16
Fig.16 Performance of the GradLS metamodel in approximating the 3 dimensional Rosenbrock's function in terms of the number of sample points (n s ) and the degree of the polynomial (d • ).Compare with the response-only LS metamodel in Fig.15 20 which gather approximation performance indicators for the 3 dimensional version of the function and where the levels of R 2 and 3 are respectively smaller and larger than those of the 3 dimensional Rosenbrock function.By comparing Figs.19 and 20, it is also seen that the gradient-enhanced GradLS outperforms LS.

3 Fig. 17
Fig. 17 Performance of the LS metamodel in approximating the 5 dimensional Rosenbrock's function in terms of the number of sample points (n s ) and the degree of the polynomial (d • )

Fig. 18
Fig.18 Performance of the GradLS metamodel in approximating the 5 dimensional Rosenbrock's function in terms of the number of sample points (n s ) and the degree of the polynomial (d • ).Compare with the response-only LS metamodel in Fig.17

3 Figures 22 ,
Figures 22, 23, 24 and 25 show the values of the R 2 and 3 criteria for the Rosenbrock and Schwefel test functions in 5 and 10 dimensions.It can be seen in Figs.22 and23 that all the surrogates tested provide a good approximation to the Rosenbrock function in 5 and 10 dimensions, as measured by R 2 tending to 1 and 3 to 0 with n s , which is likely because the function is smooth and unimodal.Nevertheless, the methods that directly or indirectly utilize gradients have a visible advantage, that is, OK and RBF converge much slower to good values of 3 and R 2 as n s increases.Schwefel's function is the approximation target in Figs.24 and 25.For modeling such a multimodal non stationary function, it is observed that directly accounting for gradients is a determining asset: the surrogates relying only

3 Fig. 19
Fig. 19 Performance of the LS metamodel in approximating the 3 dimensional Schwefel's function in terms of the number of sample points (n s ) and the degree of the polynomial (d • ) Mean of R 2

3 Fig. 20
Fig.20 Performance of the GradLS metamodel in approximating the 3 dimensional Schwefel's function in terms of the number of sample points (n s ) and the degree of the polynomial (d • ).Compare with the response-only LS metamodel in Fig.19

Table 1
Summary of the main references on gradient-based metamodels

Table 2
Global framework for gradient-enhanced surrogates: definition of trends, a priori functions and coefficients as in Eq. (8) LS least squares, WLS weighted least squares, MLS moving least squares, IDW Shepard weighting function, GRBF gradient-enhanced radial basis function, GKRG cokriging, GSVR gradient-enhanced support vector machine

Table 5
Summary of works using gradient-enhanced cokriging