Formulation and properties of a divergence used to compare probability measures without absolute continuity

This paper develops a new divergence that generalizes relative entropy and can be used to compare probability measures without a requirement of absolute continuity. We establish properties of the divergence, and in particular derive and exploit a representation as an infimum convolution of optimal transport cost and relative entropy. Also included are examples of computation and approximation of the divergence, and the demonstration of properties that are useful when one quantifies model uncertainty.


Introduction
To compare different probabilistic models for a given application, one needs a notion of "distance" between the distributions. The specification of this distance is a subtle issue. Probability models are typically large or infinite dimensional, and the usefulness of the distance will depend on its mathematical properties. Is it convenient for analysis and optimization? Does it scale well with system size?
For situations that require an analysis of (probabilistic) model form uncertainty, the quantity known as relative entropy (or Kullback-Leibler divergence) is the most widely used such distance. This is because relative entropy has all the attractive properties asked for in the last paragraph, and many more. (Relative entropy is not a true metric since it is not symmetric in its arguments, but owing to its other attributes it is more widely used for these purposes than any legitimate metric.) The definition of relative entropy is as follows. Suppose $S$ is a Polish space with metric $d(\cdot,\cdot)$ and associated Borel $\sigma$-algebra $\mathcal{B}$. Let $\mathcal{P}(S)$ be the space of probability measures over $(S,\mathcal{B})$. If $\mu,\nu \in \mathcal{P}(S)$ and $\mu$ is absolutely continuous with respect to $\nu$ (denoted $\mu \ll \nu$), then
\[ R(\mu\|\nu) \doteq \int_S \log\frac{d\mu}{d\nu}\,d\mu \]
(even though $\log d\mu/d\nu$ can take both positive and negative values, as we discuss in the beginning of Section 2, the definition is never ambiguous). Otherwise, we set $R(\mu\|\nu) = \infty$. While we cannot go into all the reasons why relative entropy is so useful, it is essential that we describe why it is convenient for the analysis of model form uncertainty. This is due to a dual pair of variational formulas which relate $R(\mu\|\nu)$, integrals with respect to $\mu$, and what are called risk-sensitive integrals with respect to $\nu$:
\[ R(\mu\|\nu) = \sup_{g \in C_b(S)} \left\{ \int_S g\,d\mu - \log\int_S e^g\,d\nu \right\} \tag{1.1} \]
and
\[ \log\int_S e^g\,d\nu = \sup_{\mu \in \mathcal{P}(S)} \left\{ \int_S g\,d\mu - R(\mu\|\nu) \right\}. \tag{1.2} \]
It is immediate from either of these that for $\mu,\nu \in \mathcal{P}(S)$ and $g \in C_b(S)$,
\[ \int_S g\,d\mu \le R(\mu\|\nu) + \log\int_S e^g\,d\nu \]
(in fact these expressions hold with $C_b(S)$ replaced by the bounded and measurable functions on $S$). If we interpret $\nu$ as the nominal or design model (chosen perhaps on the basis of data or for computational tractability) and $\mu$ as the true model (or at least a more accurate model), then according to the last display one obtains a bound on an integral with respect to the true model. (In fact by introducing a parameter one can obtain bounds that are in some sense optimal [11].)
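On a finite state space both the defining formula and the bound just discussed can be checked directly. The following sketch (plain NumPy; the particular $\mu$, $\nu$ and observable $g$ are arbitrary choices for illustration, not from the paper) computes $R(\mu\|\nu)$ and verifies $\int_S g\,d\mu \le R(\mu\|\nu) + \log\int_S e^g\,d\nu$.

```python
import numpy as np

def kl(mu, nu):
    """Relative entropy R(mu||nu) for finite distributions.

    Returns infinity unless mu is absolutely continuous w.r.t. nu,
    i.e. unless nu(x) = 0 implies mu(x) = 0.
    """
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((nu == 0) & (mu > 0)):
        return np.inf
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

mu = np.array([0.2, 0.5, 0.3])   # "true" model
nu = np.array([0.4, 0.4, 0.2])   # nominal/design model
g = np.array([1.0, -0.5, 2.0])   # a bounded observable (arbitrary choice)

lhs = float(g @ mu)                                # integral of g under mu
rhs = kl(mu, nu) + np.log(float(np.exp(g) @ nu))   # R(mu||nu) + risk-sensitive term
print(lhs <= rhs)                                  # the bound holds
```

Note that `kl` returns $\infty$ as soon as $\mu$ puts mass where $\nu$ does not, which is exactly the absolute continuity restriction discussed below.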
We typically interpret the integral $\int_S g\,d\mu$ as a performance measure, and so we have a bound on the performance of the system under the true distribution in terms of the relative entropy distance $R(\mu\|\nu)$, plus a risk-sensitive performance measure under the design model. From this elementary but fundamental inequality, and by exploiting the helpful qualitative and quantitative properties of relative entropy, there has emerged a set of tools that can be used to answer many questions where probabilistic model form uncertainty is important, including [3, 7, 8, 10-13, 15, 16, 18, 19]. However, relative entropy has one important shortcoming: for the bound to be meaningful we must have $R(\mu\|\nu) < \infty$, which imposes the requirement of absolute continuity of the true model with respect to the design model. For various uses, such as model building and model simplification, this restriction can be significant.
In the context of model building, it can happen that one attempts to fit distributions to data by comparing an empirical measure constructed using data with the elements of a parameterized family, such as a collection of Gaussian distributions. In this case the two distributions one would compare are singular, and so relative entropy cannot be used. A second example, and one that occurs frequently in the physical sciences, operations research and elsewhere, is that a detailed model (such as the population process of a chemical reaction network, which takes values in a lattice) is approximated by a simpler process that takes values in the continuum (for example a diffusion process). For exactly the same reason as in the previous example, these processes, as well as their corresponding stationary distributions, are not absolutely continuous with respect to each other.
Because relative entropy is not directly applicable to such problems, significant effort has been put into investigating alternatives ([4,5] and references therein). A class that has attracted some attention (e.g., in the machine learning community) consists of the type-1 Wasserstein or, more generally, optimal transport distances [14,20,25]. These distances, which are true metrics, have certain attractive properties but also some shortcomings. One is that the distances do not have an interpretation as the dual of a strictly convex function. To be a little more concrete, it is the strict concavity of the mapping $g \mapsto H[g;\mu,\nu]$ with
\[ H[g;\mu,\nu] \doteq \int_S g\,d\mu - \log\int_S e^g\,d\nu \tag{1.3} \]
in the variational representation for $R(\mu\|\nu)$ that leads to tight bounds when applied to problems of control or optimization of stochastic uncertain systems [10]. As an elementary example, given a fixed bound $M$ on $R(\mu\|\nu)$, it follows from (1.2) that for any $c > 0$
\[ \int_S g\,d\mu \le \frac{1}{c}\left[ M + \log\int_S e^{cg}\,d\nu \right]. \]
Also, in some problems of learning, one encounters optimization problems such as $\inf_\theta M(\mu,\nu_\theta)$, where $M$ is a "distance" and $\nu_\theta$ is a parameterized family. For $M(\mu,\nu_\theta)$ corresponding to relative entropy one obtains a min/max problem of the form
\[ \inf_\theta \sup_{g \in C_b(S)} H[g;\mu,\nu_\theta] \]
that is solved iteratively, with $H$ as in (1.3). Although we would prefer to avoid the restriction $\mu \ll \nu$, the (strong) concavity/convexity properties of the mapping $(g,\nu) \mapsto H[g;\mu,\nu]$ appear preferable to those of the analogous affine mapping $(g,\nu) \mapsto \int_S g\,d\mu - \int_S g\,d\nu$ that corresponds to (1.4).
A second limitation of Wasserstein distances is that, owing to the absence of a chain rule, they do not in general scale well with respect to system dimension. This is an issue in applications to problems from the physical sciences, where large time horizons and large dimensions are common.
Rather than give up entirely the attractive features of the dual pair $(R(\mu\|\nu), \log\int_S e^g\,d\nu)$, an alternative is to be more restrictive regarding the class of costs or performance measures for which bounds are required. Indeed, the requirement of absolute continuity in relative entropy is entirely due to the very large class of functions, $C_b(S)$, appearing in (1.1). For a collection $\Gamma \subset C_b(S)$ one can consider in lieu of $R(\mu\|\nu)$ what we call the $\Gamma$-divergence, which is defined by
\[ G_\Gamma(\mu\|\nu) \doteq \sup_{g \in \Gamma} \left\{ \int_S g\,d\mu - \log\int_S e^g\,d\nu \right\}. \tag{1.5} \]
By imposing regularity conditions on $\Gamma$ (e.g., Lipschitz continuity, additional smoothness) one generates (under mild additional conditions on $\Gamma$) divergences which relax the absolute continuity condition. Thus one is trading restrictions on the class of performance measures or observables for which bounds are valid for an enlargement of the class of distributions to which the bounds apply. These divergences are of course not as nice as relative entropy, but one can prove that they retain versions of its most important properties (in particular, the sense in which a version of the chain rule persists is discussed in Sect. 6.2). In addition, the dual function remains $\log\int_S e^g\,d\nu$. As noted this is useful owing to its convexity properties, and it is also useful when considering problems of optimization or control since the corresponding risk-sensitive optimization and optimal control problems are well studied in the literature.
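As a concrete finite-dimensional illustration (not taken from the paper), one can let $S$ be four points on the line, take $\Gamma$ to be the 1-Lipschitz functions with respect to $|x-y|$, and solve the resulting concave maximization numerically; the sketch below uses SciPy's SLSQP with the Lipschitz inequalities as linear constraints. The point is that the value is finite even though $\mu$ and $\nu$ below are mutually singular, so $R(\mu\|\nu) = \infty$.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

# Four points on the line; mu and nu are mutually singular, so R(mu||nu) = inf.
x = np.array([0.0, 1.0, 2.0, 3.0])
mu = np.array([0.5, 0.0, 0.5, 0.0])   # supported on {0, 2}
nu = np.array([0.0, 0.5, 0.0, 0.5])   # supported on {1, 3}

def neg_objective(g):
    # negative of  g.mu - log(e^g . nu), the quantity maximized over Gamma
    return -(g @ mu - np.log(np.exp(g) @ nu))

# Gamma = 1-Lipschitz functions: |g(x_i) - g(x_j)| <= |x_i - x_j| for all pairs.
n = len(x)
rows, lb, ub = [], [], []
for i in range(n):
    for j in range(i + 1, n):
        r = np.zeros(n)
        r[i], r[j] = 1.0, -1.0
        rows.append(r)
        lb.append(-abs(x[i] - x[j]))
        ub.append(abs(x[i] - x[j]))
con = LinearConstraint(np.array(rows), lb, ub)

res = minimize(neg_objective, np.zeros(n), method="SLSQP", constraints=[con])
G = -res.fun   # finite, even though the relative entropy is infinite
```

For these measures the supremum works out to 1, attained (up to an additive constant) at $g = (1,0,1,0)$; in general $G_\Gamma(\mu\|\nu) \le W_\Gamma(\mu-\nu)$, with equality possible, as here.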
In our formulation of the $\Gamma$-divergence the underlying idea is that to extend the range of probability measures that can be compared, one must restrict the class of integrands that will be considered. However, this leads directly to an interesting connection with the Wasserstein distance mentioned previously, which is that for suitable collections $\Gamma$ we will prove the inf-convolution expression
\[ G_\Gamma(\mu\|\nu) = \inf_{\gamma \in \mathcal{P}(S)} \left\{ W_\Gamma(\mu - \gamma) + R(\gamma\|\nu) \right\}, \]
where $W_\Gamma$ is the Wasserstein metric whose dual (sup) formulation uses the set of functions $\Gamma$. Moreover one recovers relative entropy by taking the limit $b \to \infty$ in $G_{b\Gamma}(\mu\|\nu)$, which may be useful if one wants to allow relatively small violations of the absolute continuity restriction, while at the same time taking advantage of simple approximations for the Wasserstein distance in the high transportation cost limit. The sup formulation (1.5) can also be used as the basis for sampling based computation, by adapting the approach of [17].
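On a two-point space the inf-convolution can be evaluated by a one-dimensional grid search, which makes the interpolation between transport cost and relative entropy visible. The following sketch (illustrative, not from the paper) takes $S = \{0,1\}$ with $d(0,1)=1$, $\mu = \delta_0$ and $\nu$ uniform, and minimizes $W_1(\mu,\gamma_t) + R(\gamma_t\|\nu)$ over $\gamma_t = (t, 1-t)$.

```python
import numpy as np

# Two-point space S = {0, 1} with d(0, 1) = 1; mu = delta_0, nu = (1/2, 1/2).
# Candidate intermediate measures gamma_t = (t, 1 - t), t in (0, 1).
t = np.linspace(1e-9, 1.0 - 1e-9, 100001)

W1 = 1.0 - t                                            # W1(mu, gamma_t)
R = t * np.log(2 * t) + (1 - t) * np.log(2 * (1 - t))   # R(gamma_t || nu)

G = float(np.min(W1 + R))   # grid approximation of the inf-convolution
```

Taking $\gamma = \nu$ gives the bound $W_1(\mu,\nu) = 1/2$ and $\gamma = \mu$ gives $R(\mu\|\nu) = \log 2$, so the divergence improves on both endpoints (the minimum here is $\approx 0.38$).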
The organization of the paper is as follows. In Section 2 we define the $\Gamma$-divergence, and prove the first main result of this paper, which is the inf-convolution formula described above (Thm. 2.4). In Section 3, we show several properties of the $\Gamma$-divergence, and establish a convex duality formula for the $\Gamma$-divergence. Section 4 investigates the $\Gamma$-divergence for special choices of $\Gamma$, which are sets of bounded Lipschitz continuous functions. We establish a relation between the $\Gamma$-divergence and optimal transport cost, prove existence and uniqueness for optimizers of variational representations of the $\Gamma$-divergence (Thm. 4.9), and also derive formulas for directional derivatives of the $\Gamma$-divergence (Thm. 4.16). Section 5 considers limits for the $\Gamma$-divergence, and in Section 6 there is a preliminary discussion on how one can apply the $\Gamma$-divergence to obtain uncertainty quantification bounds.
As final remarks we note that the paper [1] defines a "relaxation" of the Wasserstein distance by adding an entropy term involving the mass-transfer matrix. The divergence so defined is easier to compute than the original Wasserstein distance, but is not the same as the divergences we develop here. Also, [24] makes use of an inf-convolution formula analogous to the one presented above to extend type-1 Wasserstein distances to positive measures.

Definition of the Γ-divergence
Throughout this section, $S$ is a Polish space with metric $d(\cdot,\cdot)$ and associated Borel $\sigma$-algebra $\mathcal{B}$. $C_b(S)$ denotes the space of all bounded continuous functions from $S$ to $\mathbb{R}$. Let $\mathcal{P}(S)$ be the space of probability measures over $(S,\mathcal{B})$, $\mathcal{M}(S)$ be the space of finite signed (Borel) measures over $(S,\mathcal{B})$, and $\mathcal{M}_0(S)$ be the subspace of $\mathcal{M}(S)$ whose elements have total mass $0$. $\bar{\mathbb{R}} \doteq \mathbb{R} \cup \{\infty\}$ denotes the extended real numbers. Throughout this section, we consider $C_b(S)$ equipped with the weak topology induced by $\mathcal{M}(S)$. We recall that if $\mu \ll \nu$ and $f \doteq d\mu/d\nu$, then $R(\mu\|\nu) = \int_S f \log f\,d\nu$, and since $x \log x \ge -e^{-1}$ for $x \ge 0$ the negative part of the integrand is always $\nu$-integrable. Thus $R(\mu\|\nu)$ is always well defined. We recall the Donsker-Varadhan variational representation (1.1) for relative entropy. We will use equation (1.1) as an equivalent characterization of $R(\cdot\|\nu)$ on $\mathcal{P}(S)$, and consider an extension to $\mathcal{M}(S)$. With an abuse of notation, we will also call the extended function $R$. The following lemma states basic properties of the extension. Its proof appears in the Appendix.

Lemma 2.1. (1) $R(\mu\|\nu) \ge 0$, and $R(\mu\|\nu) = 0$ if and only if $\mu = \nu$.

Though relative entropy has very attractive regularity and optimization properties, as noted $R(\mu\|\nu)$ is finite only if $\mu \ll \nu$. As such, it cannot be used to give a meaningful notion of "distance" without this absolute continuity restriction. In order to define a meaningful divergence for a pair of probability measures that are not mutually absolutely continuous, but at the same time not lose the useful properties of the "dual" function $g \mapsto \log\int_S e^g\,d\nu$ appearing in (1.1), a natural approach is to restrict the set of test functions in the variational formula. We define a criterion for the classes of "admissible" test functions we want to use.

Definition 2.2. Let $\Gamma$ be a subset of $C_b(S)$ endowed with the inherited weak topology. We call $\Gamma$ admissible if the following hold.
We next define a new divergence by restricting the class of test functions in the definition of relative entropy. Let $\Gamma^c$ denote the complement of $\Gamma$.

Definition 2.3. Fix $\nu \in \mathcal{P}(S)$. For $\mu \in \mathcal{M}(S)$, we define the $\Gamma$-divergence associated with the admissible set $\Gamma$ by
\[ G_\Gamma(\mu\|\nu) \doteq \sup_{g \in \Gamma} \left\{ \langle g, \mu\rangle - \log\int_S e^g\,d\nu \right\}. \]
We also define the following related quantity. For $\eta \in \mathcal{M}(S)$ let
\[ W_\Gamma(\eta) \doteq \sup_{g \in \Gamma} \langle g, \eta\rangle. \]
When $\Gamma$ is clear based on context, we will drop the subscript from $G_\Gamma$ and $W_\Gamma$. Using a similar argument as in Lemma 2.1, one can show that $G_\Gamma(\mu\|\nu) = \infty$ if $\mu(S) \ne 1$. The next theorem states an important property of the $\Gamma$-divergence, which is that it can be written as an inf-convolution involving relative entropy and $W_\Gamma$.

Theorem 2.4. Assume $\Gamma$ is an admissible set. Then for $\mu \in \mathcal{M}(S)$, $\nu \in \mathcal{P}(S)$,
\[ G_\Gamma(\mu\|\nu) = \inf_{\gamma \in \mathcal{P}(S)} \left\{ W_\Gamma(\mu - \gamma) + R(\gamma\|\nu) \right\}. \]
It will be pointed out in Section 4 that if $\Gamma$ is taken to be the Lipschitz functions with respect to a cost function $c(x,y)$ that satisfies some specified conditions, $W_\Gamma(\mu-\nu)$ will be the corresponding optimal transport cost from $\mu$ to $\nu$. If $\Gamma$ is also admissible then the theorem tells us that by restricting the set of test functions in the variational representation of relative entropy to $\Gamma$, we get a quantity which is an inf-convolution of relative entropy and a metric.
The rest of this section is focused on the proof of Theorem 2.4. In order to do this, we need a few definitions and also will find it convenient to consider a more general setting.

Definition 2.7. A subset $C$ of a topological vector space $Y$ over the field $\mathbb{R}$ is 1. convex if for any $x, y \in C$ and any $t \in [0,1]$, $tx + (1-t)y \in C$, 2. balanced if for all $x \in C$ and any $\lambda \in \mathbb{R}$ with $|\lambda| \le 1$, $\lambda x \in C$, 3. absorbent if for all $y \in Y$, there exist $t > 0$ and $x \in C$ such that $y = tx$. A topological vector space $Y$ is called locally convex if the origin has a local topological basis of convex, balanced and absorbent sets.
Definition 2.8. For a topological vector space $Y$ over the field $\mathbb{R}$, its topological dual space $Y^*$ is defined as the space of all continuous linear functionals $\varphi : Y \to \mathbb{R}$.
The weak* topology on $Y^*$ is the topology induced by $Y$. In other words, it is the coarsest topology such that for each $y \in Y$ the functional $\varphi \mapsto \varphi(y)$ is continuous. For $y \in Y$ and $\varphi \in Y^*$, we also write $\langle y, \varphi\rangle \doteq \varphi(y) = y(\varphi)$.
Now let $Y$ be a Hausdorff locally convex space, with $Y^*$ its topological dual space endowed with the weak* topology. Given proper convex functions $f_1^*, \ldots, f_m^*$ on $Y^*$, their inf-convolution is defined by
\[ f_1^* \square \cdots \square f_m^*(\eta) \doteq \inf\left\{ \sum_{k=1}^m f_k^*(\eta_k) : \eta_1 + \cdots + \eta_m = \eta \right\}. \]
In our use we take $Y = C_b(S)$ equipped with the topology induced by $\mathcal{M}(S)$, i.e., the topological basis around $g \in Y$ is taken to be the sets of the form
\[ \left\{ h \in C_b(S) : \left| \langle h - g, \mu_k \rangle \right| < \varepsilon_k, \ k = 1, \ldots, m \right\}, \]
where $m \in \mathbb{N}$, $\{\mu_k\}_{k=1,\ldots,m} \subset \mathcal{M}(S)$ and $\varepsilon_k > 0$, $k = 1, \ldots, m$, are arbitrary. It can be easily verified that under this topology, $C_b(S)$ is a Hausdorff locally convex space, with $C_b(S)^* = \mathcal{M}(S)$ ([22], Thm. 3.10). For $g \in C_b(S)$ and $\mu \in \mathcal{M}(S)$, we define the bilinear form $\langle g, \mu \rangle \doteq \int_S g\,d\mu$. We are now ready to prove the main theorem.
Proof of Theorem 2.4. Define $H_1, H_2 : C_b(S) \to \bar{\mathbb{R}}$ by
\[ H_1(g) \doteq \log\int_S e^g\,d\nu, \qquad H_2(g) \doteq \begin{cases} 0, & g \in \Gamma, \\ \infty, & g \in \Gamma^c. \end{cases} \]
Then $G_\Gamma(\mu\|\nu) = (H_1 + H_2)^*(\mu)$. Notice that $0 \in \operatorname{dom}(H_1) \cap \operatorname{dom}(H_2)$, so $\operatorname{dom}(H_1) \cap \operatorname{dom}(H_2) \ne \emptyset$, and both $H_1$ and $H_2$ are proper and convex. For lower semicontinuity, under the topology induced by $\mathcal{M}(S)$, $H_1$ is lower semicontinuous because of (1.2) and the fact that a supremum of continuous functions is lower semicontinuous, and $H_2$ is lower semicontinuous since $\Gamma$ is closed. Thus, by Lemma 2.13, $(H_1 + H_2)^* = H_1^* \square H_2^*$ provided the latter is lower semicontinuous. By equation (1.1) and the definition of $W_\Gamma$, we know that $R(\mu\|\nu) = H_1^*(\mu)$ and $W_\Gamma(\eta) = H_2^*(\eta)$.
In the following display, the first equality is due to the definition of the inf-convolution, and the second holds since $R(\gamma\|\nu) < \infty$ only when $\gamma \in \mathcal{P}(S)$:
\[ H_1^* \square H_2^*(\mu) = \inf_{\gamma \in \mathcal{M}(S)} \left\{ W_\Gamma(\mu - \gamma) + R(\gamma\|\nu) \right\} = \inf_{\gamma \in \mathcal{P}(S)} \left\{ W_\Gamma(\mu - \gamma) + R(\gamma\|\nu) \right\}. \]
Thus the last thing we need to prove is that $H_1^* \square H_2^*$ is lower semicontinuous. Note that relative entropy is lower semicontinuous in the first argument in the weak topology ([9], Lem. 1.4.3 (b)), and $W_\Gamma$ is lower semicontinuous in the weak topology since it is the supremum of a collection of linear functionals. Let
\[ F(\mu) \doteq \inf_{\gamma \in \mathcal{P}(S)} \left\{ W_\Gamma(\mu - \gamma) + R(\gamma\|\nu) \right\}. \]
Consider any sequence $\mu_n \Rightarrow \mu$ with $\mu_n, \mu \in \mathcal{M}(S)$. Here "$\Rightarrow$" means convergence in the weak* topology, i.e., for any $f \in C_b(S)$, $\int f\,d\mu_n \to \int f\,d\mu$. Let $\varepsilon > 0$, and for each $\mu_n$ let $\gamma_n$ satisfy
\[ W_\Gamma(\mu_n - \gamma_n) + R(\gamma_n\|\nu) \le F(\mu_n) + \varepsilon. \]
We want to show that
\[ F(\mu) \le \liminf_{n\to\infty} F(\mu_n). \tag{2.1} \]
If $\liminf_{n\to\infty} F(\mu_n) = \infty$, the inequality above holds automatically. Assuming $\liminf_{n\to\infty} F(\mu_n) < \infty$, let $n_k$ be a subsequence such that $\lim_{k\to\infty} F(\mu_{n_k}) = \liminf_{n\to\infty} F(\mu_n)$. Notice that $R(\gamma_{n_k}\|\nu) \le F(\mu_{n_k}) + \varepsilon$, which is bounded in $k$. Then we can take a further subsequence that converges weakly. For simplicity of notation, let $n_k$ denote this subsequence, and let $\gamma_\infty$ denote the weak limit of $\gamma_{n_k}$. Then using the lower semicontinuity of $R(\cdot\|\nu)$ on $\mathcal{P}(S)$ and the lower semicontinuity of $W_\Gamma$,
\[ F(\mu) \le W_\Gamma(\mu - \gamma_\infty) + R(\gamma_\infty\|\nu) \le \liminf_{k\to\infty}\left\{ W_\Gamma(\mu_{n_k} - \gamma_{n_k}) + R(\gamma_{n_k}\|\nu) \right\} \le \liminf_{n\to\infty} F(\mu_n) + \varepsilon. \]
Since $\varepsilon > 0$ is arbitrary this establishes (2.1), and thus $F$ is lower semicontinuous on $\mathcal{M}(S)$. The theorem is proved.

Properties of the Γ-divergence
Theorem 2.4 provides an interesting characterization of the $\Gamma$-divergence. Before we continue to specific choices of $\Gamma$, we first state some general properties associated with the $\Gamma$-divergence. Throughout this section we fix an admissible set $\Gamma$, and thus drop the subscript from $G_\Gamma$ and $W_\Gamma$ in this section. Also, now that we have established the expression for $G$ as an inf-convolution as in Theorem 2.4, we no longer need to consider $G$ as a function on $\mathcal{M}(S) \times \mathcal{P}(S)$, and instead can consider it just on $\mathcal{P}(S) \times \mathcal{P}(S)$, since we want to use $G$ as a measure of how two probability distributions differ.

Lemma 3.1. 1) $G(\mu\|\nu) \ge 0$, and $G(\mu\|\nu) = 0$ if and only if $\mu = \nu$. 2) $G(\mu\|\nu)$ is a convex and lower semicontinuous function of $(\mu,\nu)$. In particular, $G(\mu\|\nu)$ is a convex, lower semicontinuous function of each variable $\mu$ or $\nu$ separately. 3) $G(\mu\|\nu) \le R(\mu\|\nu)$.

Remark 3.2. 1) The first property justifies our calling $G$ a divergence as the term is used in information theory.
2) This is a straightforward corollary of Theorem 2.4, since the supremum of a collection of linear and continuous functionals is both convex and lower semicontinuous.
Lemma 3.3. Let $g \in C_b(S)$ and $\nu \in \mathcal{P}(S)$. Then
\[ \log\int_S e^g\,d\nu = \sup_{\mu \in \mathcal{P}(S)} \left\{ \int_S g\,d\mu - R(\mu\|\nu) \right\}, \]
where the supremum is achieved uniquely at $\mu_0$ satisfying
\[ \frac{d\mu_0}{d\nu} = \frac{e^g}{\int_S e^g\,d\nu}. \]
A similar duality formula holds for the $\Gamma$-divergence when $g \in \Gamma$.

Theorem 3.4. Let $g \in \Gamma$ and $\nu \in \mathcal{P}(S)$. Then
\[ \log\int_S e^g\,d\nu = \sup_{\mu \in \mathcal{P}(S)} \left\{ \int_S g\,d\mu - G(\mu\|\nu) \right\}. \]
Proof. Using the definition of the $\Gamma$-divergence, for any $\mu \in \mathcal{P}(S)$,
\[ \int_S g\,d\mu - G(\mu\|\nu) \le \log\int_S e^g\,d\nu. \]
On the other hand, we know for relative entropy that
\[ \sup_{\mu \in \mathcal{P}(S)} \left\{ \int_S g\,d\mu - R(\mu\|\nu) \right\} = \log\int_S e^g\,d\nu, \]
and $G(\mu\|\nu) \le R(\mu\|\nu)$. The statement of the theorem follows from the two inequalities.
The last theorem has two important implications. The first is related to the fact that Lemma 3.3 implies bounds for $\int_S g\,d\mu$ when $R(\mu\|\nu)$ is bounded, an observation that has served as the basis for the analysis of various aspects of model form uncertainty [8,11]. Using Theorem 3.4, we obtain analogous bounds on $\int_S g\,d\mu$ for $g \in \Gamma$ when $G(\mu\|\nu)$ is bounded. Applications of these bounds will be further developed elsewhere. The second is that for $g \in \Gamma$, if we take $\mu_0$ as defined in Lemma 3.3, then
\[ \log\int_S e^g\,d\nu = \int_S g\,d\mu_0 - R(\mu_0\|\nu) \le \int_S g\,d\mu_0 - G(\mu_0\|\nu) \le \log\int_S e^g\,d\nu, \]
where the first inequality comes from $G(\mu_0\|\nu) \le R(\mu_0\|\nu)$. Since both inequalities above must be equalities, we must have
\[ G(\mu_0\|\nu) = R(\mu_0\|\nu). \]
The next lemma gives a more detailed picture of $G(\mu\|\nu)$ when $\mu \ll \nu$.
Proof. We use the definition to prove this lemma. For any $g \in \Gamma$, we define $\gamma_g \in \mathcal{P}(S)$ by the relation
\[ \frac{d\gamma_g}{d\nu}(x) = \frac{e^{g(x)}}{\int_S e^g\,d\nu} \quad \text{for } x \in \operatorname{supp}(\nu), \]
and $\gamma_g(\operatorname{supp}(\nu)^c) = 0$. A direct computation for $x \in \operatorname{supp}(\nu)$, together with $\mu \ll \nu$, then yields one of the two desired inequalities. On the other hand, for any $\gamma \in A(S)$, by definition we can find a $g_\gamma \in \Gamma$ for which the reverse inequality holds. Combining the two inequalities completes the proof.
Remark 3.6. When $\mu \in A(S)$ we always have $G(\mu\|\nu) = R(\mu\|\nu)$. This is because if $\gamma \in A(S)$ then $\mu \ll \gamma$, and a rearrangement of the resulting identities gives the claim. This statement is not valid when $\mu \ll \nu$ fails, since then $\log(d\gamma/d\nu)$ is not defined on $\operatorname{supp}(\mu)\setminus\operatorname{supp}(\nu)$, and thus $\int_S \log\frac{d\gamma}{d\nu}\,d\mu$ is not well defined.

Connection with optimal transport theory
In the preceding sections, we discussed general properties of the $\Gamma$-divergence for an admissible set $\Gamma \subset C_b(S)$. In this section, we discuss specific choices of $\Gamma$ which relate the $\Gamma$-divergence to optimal transport theory. First we state some well known results from optimal transport theory.

Preliminary results from optimal transport theory
The results in this section are from Chapter 4 of [20]. The general Monge-Kantorovich mass transfer problem with given marginals $\mu, \nu \in \mathcal{P}(S)$ and cost function $c : S \times S \to [0,\infty)$ is
\[ \inf_{\pi \in \Pi(\mu,\nu)} \int_{S\times S} c(x,y)\,\pi(dx,dy), \]
where $\Pi(\mu,\nu)$ denotes the collection of all probability measures on $S \times S$ with first and second marginals $\mu$ and $\nu$, respectively. A natural dual problem with respect to this is
\[ \sup_{g \in Q} \int_S g\,d\rho, \]
where $\rho = \mu - \nu$, $C_b(S)$ denotes the set of bounded continuous functions mapping $S$ to $\mathbb{R}$, and $Q \subset C_b(S)$ is a class of test functions. We want to know when
\[ \inf_{\pi \in \Pi(\mu,\nu)} \int_{S\times S} c(x,y)\,\pi(dx,dy) = \sup_{g \in Q} \int_S g\,d\rho \tag{4.3} \]
holds. The following is a necessary and sufficient condition. As with many results in this section, one can extend in a trivial way to the case where costs are bounded from below, rather than non-negative. Recall that $S$ is a Polish space. On the other hand, Condition 4.1 also allows for a wide range of choices of $c(x,y)$. For example, suppose that $c$ is a continuous metric on $S$, where continuity is with respect to the underlying metric of $S$. Then we can choose
\[ Q = \operatorname{Lip}(c, S; C_b(S)) \doteq \left\{ g \in C_b(S) : g(x) - g(y) \le c(x,y) \text{ for all } x, y \in S \right\}. \tag{4.1} \]
It is easily verified that $Q \subset C_b(S)$, and that with this choice of $Q$ (4.3) holds.
To make the presentation simple, we have assumed that $c$ is non-negative, and further assume it is symmetric, meaning $c(x,y) = c(y,x) \ge 0$ for any $x, y \in S$. To distinguish from $W_\Gamma(\mu - \nu)$ for general $\Gamma$, we denote the transport cost for $\mu, \nu \in \mathcal{P}(S)$ by
\[ W_c(\mu, \nu) \doteq \inf_{\pi \in \Pi(\mu,\nu)} \int_{S\times S} c(x,y)\,\pi(dx,dy). \]
Then by Theorem 4.2, $W_\Gamma(\mu - \nu) = W_c(\mu, \nu)$ for the corresponding choice of $\Gamma$.
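On a finite space the transport cost $W_c(\mu,\nu)$ is a linear program. The sketch below (an illustration, not from the paper; it uses SciPy's `linprog` with the marginal constraints of $\Pi(\mu,\nu)$ written out explicitly) computes it for $c(x,y) = |x-y|$ on three points, and cross-checks against the CDF formula valid for this cost on the real line.

```python
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 2.0])
mu = np.array([0.6, 0.3, 0.1])
nu = np.array([0.1, 0.3, 0.6])
C = np.abs(x[:, None] - x[None, :])   # cost c(x, y) = |x - y|

# Equality constraints: the transport plan pi (flattened n x n matrix)
# must have row sums mu and column sums nu.
n = len(x)
A_eq = []
for i in range(n):
    r = np.zeros((n, n)); r[i, :] = 1.0
    A_eq.append(r.ravel())
for j in range(n):
    r = np.zeros((n, n)); r[:, j] = 1.0
    A_eq.append(r.ravel())
res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.concatenate([mu, nu]),
              bounds=(0, None))
W = res.fun

# Cross-check: on the real line with c = |x - y|, W equals the L1 distance
# between the CDFs integrated over the grid gaps.
F_mu, F_nu = np.cumsum(mu), np.cumsum(nu)
W_cdf = float(np.sum(np.abs(F_mu - F_nu)[:-1] * np.diff(x)))
```

The LP value and the CDF formula agree, illustrating the primal/dual picture on a space where everything is explicit.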
Hence a simple sufficient condition for Condition 4.4 is that for some $\theta > 0$, $\theta d(x,y) \le c(x,y)$ for $x, y \in S$. In fact, it is enough that for each compact set $K \subset S$ there is $\theta > 0$ with $\theta d(x,y) \le c(x,y)$ for $x, y \in K$. To see this, let $f$ come within $\delta > 0$ of the relevant supremum, where we can assume that $0 \le f \le 1$. Since a single probability measure is always tight we can find a compact set $K$ such that $\mu(K^c) \le \delta/8$ and $\nu(K^c) \le \delta/8$. Then under the assumption $f$ is bounded and Lipschitz continuous with respect to $c$ on $K$, and the claim follows. Hence by choosing $\Gamma$ properly, we get that the $\Gamma$-divergence is an infimal convolution of relative entropy, which is a convex function of likelihood ratios, and an optimal transport cost, which depends on a cost structure on the space $S$. Natural questions to raise here are the following. i) Do there exist optimizers $\gamma^*$ and $g^*$ in the variational problem (4.4)? If so, are they unique? ii) How can one characterize $\gamma^*$ and $g^*$? iii) For a fixed $\nu \in \mathcal{P}(S)$ (resp., $\mu \in \mathcal{P}(S)$), what is the effect of a perturbation of $\mu$ (resp., $\nu$) on $G_\Gamma(\mu\|\nu)$?
We will address these questions sequentially in this section. From now on, we drop the subscript $\Gamma$ in this section for simplicity of notation. We consider the case where $G(\mu\|\nu) < \infty$. To impose additional conditions on $\mu$ and $\nu$ under which $G(\mu\|\nu) < \infty$ holds, we make a further assumption on $c$.
We will assume the following mild conditions on the space $S$ and the cost $c$ to make $\operatorname{Lip}(c, S; C_b(S))$ precompact.
Recalling the definition (4.1), we define the unbounded version as
\[ \operatorname{Lip}(c, S) \doteq \left\{ g \in C(S) : g(x) - g(y) \le c(x, y) \text{ for all } x, y \in S \right\}, \]
where $C(S)$ is the set of continuous functions mapping $S$ to $\mathbb{R}$. Before we proceed, we state the following lemma, which will be used repeatedly in this section.
Lemma 4.8. If $g \in \operatorname{Lip}(c, S)$ and $\theta, \nu \in \mathcal{P}(S)$ satisfy $\int_S |g|\,d\theta < \infty$, then
\[ \int_S g\,d\theta - \log\int_S e^g\,d\nu \le G(\theta\|\nu) \le R(\theta\|\nu). \]

Proof. We use a standard truncation argument. Since by Lemma 3.1 we already have $G(\theta\|\nu) \le R(\theta\|\nu)$, we only need to prove the first inequality in the statement of the lemma. If $\int_S e^g\,d\nu = \infty$, then the left-hand side equals $-\infty$ and the inequality is trivial. Hence we only need consider the case $\int_S e^g\,d\nu < \infty$. Let $g_n = \min(\max(g, -n), n) \in \operatorname{Lip}(c, S; C_b(S)) = \Gamma$ for $n \in \mathbb{N}$. We have $|g_n(x)| \le |g(x)|$, and thus by the dominated convergence theorem $\lim_{n\to\infty} \int_S g_n\,d\theta = \int_S g\,d\theta$. Also, since $e^{g_n} \le e^g + 1$ and $\int_S e^g\,d\nu < \infty$, the dominated convergence theorem gives $\int_S e^{g_n}\,d\nu \to \int_S e^g\,d\nu$. Since $g_n \in \Gamma$, $\int_S g_n\,d\theta - \log\int_S e^{g_n}\,d\nu \le G(\theta\|\nu)$, and letting $n \to \infty$ completes the proof.

Theorem 4.9. 1) There exists a unique optimizer $\gamma^* \in \mathcal{P}(S)$ in the expression (4.4). 2) There exists an optimizer $g^* \in \operatorname{Lip}(c, S)$ in the expression (4.4), which is unique up to an additive constant on $\operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)$.
3) $g^*$ and $\gamma^*$ are related by
\[ \frac{d\gamma^*}{d\nu} = \frac{e^{g^*}}{\int_S e^{g^*}\,d\nu}. \]

Remark 4.10. With many analogous expressions related to relative entropy, one can only conclude the uniqueness of $\gamma^*$ and of $g^*$ (up to an additive constant) almost everywhere with respect to either $\mu$ or $\nu$. However, because of the regularity condition $g^* \in \operatorname{Lip}(c, S; C(S))$ and Condition 4.7, the uniqueness of $g^*$ (up to an additive constant) on $\operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)$ follows.
Proof. For $n \in \mathbb{N}$ consider $\gamma_n \in \mathcal{P}(S)$ that satisfies
\[ W_c(\mu, \gamma_n) + R(\gamma_n\|\nu) \le G(\mu\|\nu) + \frac{1}{n}. \]
Then by Lemma 1.4.3(c) in [9], $\{\gamma_n\}_{n\ge 1}$ is precompact in the weak topology, and thus has a convergent subsequence $\{\gamma_{n_k}\}_{k\ge 1}$. Denote $\gamma^* \doteq \lim_{k\to\infty} \gamma_{n_k}$. Then by the lower semicontinuity of both $R(\cdot\|\nu)$ and $W_c(\mu, \cdot)$, we have
\[ W_c(\mu, \gamma^*) + R(\gamma^*\|\nu) \le \liminf_{k\to\infty}\left\{ W_c(\mu, \gamma_{n_k}) + R(\gamma_{n_k}\|\nu) \right\} \le G(\mu\|\nu), \]
which shows that $\gamma^*$ is an optimizer in expression (4.4). If there exist two optimizers $\gamma_1 \ne \gamma_2$, the strict convexity of $R(\cdot\|\nu)$ and convexity of $W_c(\mu,\cdot)$ imply that for $\gamma_3 = \frac{1}{2}(\gamma_1 + \gamma_2)$
\[ W_c(\mu, \gamma_3) + R(\gamma_3\|\nu) < G(\mu\|\nu), \]
a contradiction. Thus the existence and uniqueness of an optimizer $\gamma^*$ of (4.4) is proved, which establishes 1) in the statement of the theorem. Before proceeding, we establish the following lemma, whose proof appears in the Appendix.

Lemma 4.11. Any $g \in \operatorname{Lip}(c, S)$ is integrable with respect to $\mu$ and $\gamma^*$.

Take $g_n \in \operatorname{Lip}(c, S; C_b(S))$ such that
\[ \int_S g_n\,d\mu - \log\int_S e^{g_n}\,d\nu \to G(\mu\|\nu) \quad \text{as } n \to \infty. \]
Without loss of generality, we can assume $g_n(x_0) = 0$ for some fixed $x_0 \in K_0 \subset S$. Since for any $m \in \mathbb{N}$ the set $K_m \subset S$ is compact, we have that $\{g_n\}_{n\in\mathbb{N}}$ is bounded and equicontinuous on $K_m$ by Condition 4.7. By the Arzelà-Ascoli theorem, there exists a subsequence of $\{g_n\}_{n\in\mathbb{N}}$ that converges uniformly on $K_m$. Using a diagonalization argument, taking subsequences sequentially along $\{K_m\}_{m\in\mathbb{N}}$, where each subsequence is a subsequence of the previous one, and taking one element from each, we conclude there exists a subsequence $\{g_{n_j}\}_{j\in\mathbb{N}}$ that converges uniformly on every $K_m$. Since $S = \cup_{m\in\mathbb{N}} K_m$, we conclude that $\{g_{n_j}\}_{j\in\mathbb{N}}$ converges pointwise on $S$. Denote its limit by $g^*$. It can be easily verified that $g^* \in \operatorname{Lip}(c, S)$.
Since $g_{n_j}(x) \le g_{n_j}(x_0) + c(x_0, x) \le a(x_0) + a(x)$ and $\int_S (a(x_0) + a(x))\,d\mu < \infty$, by the dominated convergence theorem $\lim_{j\to\infty} \int_S g_{n_j}\,d\mu = \int_S g^*\,d\mu$. By Fatou's lemma, we have $\liminf_{j\to\infty} \int_S e^{g_{n_j}}\,d\nu \ge \int_S e^{g^*}\,d\nu$. Putting these together, we have
\[ \int_S g^*\,d\mu - \log\int_S e^{g^*}\,d\nu \ge \lim_{j\to\infty}\left\{ \int_S g_{n_j}\,d\mu - \log\int_S e^{g_{n_j}}\,d\nu \right\} = G(\mu\|\nu). \]
We can add and subtract $\int_S g^*\,d\gamma^*$ because we have proved in Lemma 4.11 that $\gamma^*$ integrates functions in $\operatorname{Lip}(c, S)$, and $g^* \in \operatorname{Lip}(c, S)$. By Lemma 4.8 we have
\[ \int_S g^*\,d\gamma^* - \log\int_S e^{g^*}\,d\nu \le R(\gamma^*\|\nu). \]
We also have
\[ \int_S g^*\,d\mu - \int_S g^*\,d\gamma^* \le W_c(\mu, \gamma^*), \]
which is due to
\[ \int_S g_{n_j}\,d\mu - \int_S g_{n_j}\,d\gamma^* \le W_c(\mu, \gamma^*) \quad \text{and} \quad \lim_{j\to\infty}\left\{ \int_S g_{n_j}\,d\mu - \int_S g_{n_j}\,d\gamma^* \right\} = \int_S g^*\,d\mu - \int_S g^*\,d\gamma^*, \]
where the last equality is because of the dominated convergence theorem and integrability of $|g^*|$ with respect to $\mu$ and $\gamma^*$ (Lem. 4.11). We can therefore continue the calculation above as
\[ G(\mu\|\nu) \le \int_S g^*\,d\mu - \log\int_S e^{g^*}\,d\nu \le W_c(\mu, \gamma^*) + R(\gamma^*\|\nu) = G(\mu\|\nu). \]
Since both the upper and lower bounds coincide, all the inequalities must be equalities, and therefore
\[ \int_S g^*\,d\gamma^* - \log\int_S e^{g^*}\,d\nu = R(\gamma^*\|\nu). \]
The last equation gives us the relationship
\[ \frac{d\gamma^*}{d\nu} = \frac{e^{g^*}}{\int_S e^{g^*}\,d\nu}. \]
Thus we have shown the existence of an optimizer $g^* \in \operatorname{Lip}(c, S)$ and its relationship with $\gamma^*$. Lastly, for any other optimizer $\bar{g} \in \operatorname{Lip}(c, S)$ the analogous argument shows
\[ \frac{d\gamma^*}{d\nu} = \frac{e^{\bar{g}}}{\int_S e^{\bar{g}}\,d\nu}. \]
Hence uniqueness of the optimizer $g^*$ on $\operatorname{supp}(\nu)$, $\nu$-a.s. up to an additive constant, is also proved.
(Note that $c$ satisfying Condition 4.1 is lower semicontinuous, and therefore ([2], Thm. 1.5) shows the existence of an optimal transport plan $\pi^*$.) Since $g^*(x) - g^*(y) \le c(x, y),$
\[ \int_S g^*\,d\mu - \int_S g^*\,d\gamma^* = \int_{S\times S} \left( g^*(x) - g^*(y) \right)\pi^*(dx, dy) \le \int_{S\times S} c(x, y)\,\pi^*(dx, dy) = W_c(\mu, \gamma^*). \]
Then the only inequality above must be an equality, which implies that $g^*(x) - g^*(y) = c(x, y)$, $\pi^*$-a.s. This is also true for any other optimizer $\bar{g} \in \operatorname{Lip}(c, S)$ for (4.4). Thus we are able to determine $g^*$ uniquely on $\operatorname{supp}(\mu)$, $\mu$-a.s., with the help of $\pi^*$ and the values of $g^*$ on $\operatorname{supp}(\nu)$. Lastly, since $g^* \in \operatorname{Lip}(c, S)$ and by Condition 4.7, we conclude the uniqueness of $g^*$ on $\operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)$ by the continuity of $g^*$.
Remark 4.12. When $\mu \ll \nu$, Theorem 4.9 implies that for some constant $c_0$,
\[ g^* = \log\frac{d\gamma^*}{d\nu} + c_0 \quad \nu\text{-a.s.}, \]
and so the $\Gamma$-divergence of $\mu$ with respect to $\nu$ looks like a "modified" version of relative entropy.
The next theorem tells us that 3) of Theorem 4.9 is not only a description of the pair of optimizers $(g^*, \gamma^*)$, but also a characterization of it.
The first inequality comes from the fact that $\gamma_1 \in \mathcal{P}(S)$, while the second needs a little more discussion, which is given below. Assuming this, the last display shows that $(g_1, \gamma_1)$ is a pair of optimizers. The second inequality follows from Lemma 4.8 together with the integrability established above. The proof is complete.
The last theorem answers questions i) and ii) raised earlier in this section; now we want to answer iii), which is to characterize the directional derivatives of $G(\mu\|\nu)$ in one variable while fixing the other, e.g., for $\rho \in \mathcal{M}_0(S)$ which satisfies certain conditions. From Theorem 4.9 and the remarks following it we know that any optimizer $g^*$ of expression (4.4) is unique on $\operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)$. However, there is still freedom to choose $g^*$ on $S \setminus \{\operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)\}$, since the variational problem in (4.4) does not take into account the values of $g^*$ outside $\operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)$, other than requiring that $g^*$ belong to $\operatorname{Lip}(c, S)$. We will define a special $g^*$ that is uniquely defined not only on $\operatorname{supp}(\mu)$ and $\operatorname{supp}(\nu)$, but also on $S \setminus \{\operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)\}$, via
\[ g^*_+(x) \doteq \inf_{y \in \operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)} \left\{ g^*(y) + c(x, y) \right\}, \tag{4.5} \]
also known as the "c-transform" in the optimal transport literature. The following lemma confirms that this construction $g^*_+$ still lies in $\operatorname{Lip}(c, S)$. While part 1 is standard, we could not find a reference for part 2, and so the proof appears in the Appendix.
Lemma 4.14. The following two statements hold.
2) $g^*_+$ defined by equation (4.5) is in $\operatorname{Lip}(c, S)$.

Remark 4.15. We also will make use of the function
\[ g^*_-(x) \doteq \sup_{y \in \operatorname{supp}(\mu) \cup \operatorname{supp}(\nu)} \left\{ g^*(y) - c(x, y) \right\}. \tag{4.7} \]
Then based on these constructions, we have the following result. A sufficient condition for the requirement that $\int_S g^*_-\,d\rho$ be well defined and finite, and for the related assumption regarding convergence, is $\int_S e^{c(x, x_0)}\,|\rho|(dx) < \infty$.

Theorem 4.16. Take $\Gamma = \operatorname{Lip}(c, S; C_b(S))$ where $c$ satisfies the conditions of Theorem 4.9 and $\mu, \nu \in L^1(a)$. Take $\rho = \rho_+ - \rho_- \in \mathcal{M}_0(S)$ where $\rho_+, \rho_- \in \mathcal{P}(S)$ are mutually singular probability measures, $\rho_+ \in L^1(a)$, and assume there exists $\varepsilon_0 > 0$ such that $\mu + \varepsilon\rho \in \mathcal{P}(S)$ for $0 < \varepsilon \le \varepsilon_0$. Then
\[ \lim_{\varepsilon \downarrow 0} \frac{G(\mu + \varepsilon\rho\|\nu) - G(\mu\|\nu)}{\varepsilon} = \int_S g^*_+\,d\rho, \]
where $g^*_+$ is given by (4.6). Suppose that $\int_S e^{g^*_-}\,d\rho$ is well defined and finite, where $g^*_-$ is given by (4.7), that if $g_n \in \operatorname{Lip}(c, S)$ converges to $g^*_-$ pointwise then $\int_S e^{g_n}\,d\rho \to \int_S e^{g^*_-}\,d\rho$, and that there is $\varepsilon_0 > 0$ such that $\nu + \varepsilon\rho \in \mathcal{P}(S)$ for $0 < \varepsilon \le \varepsilon_0$. Then
\[ \lim_{\varepsilon \downarrow 0} \frac{G(\mu\|\nu + \varepsilon\rho) - G(\mu\|\nu)}{\varepsilon} = -\frac{\int_S e^{g^*_-}\,d\rho}{\int_S e^{g^*}\,d\nu}. \]

Proof. We use the variational formula (4.4) for $G(\mu + \varepsilon\rho\|\nu)$, where $\mu + \varepsilon\rho \in \mathcal{P}(S)$ and $\rho_+ \in L^1(a)$. Recall that $g^*_+$ is an optimizer for (4.4). Using Lemma 4.8 with $\theta = \mu + \varepsilon\rho$,
\[ G(\mu + \varepsilon\rho\|\nu) \ge \int_S g^*_+\,d(\mu + \varepsilon\rho) - \log\int_S e^{g^*_+}\,d\nu = G(\mu\|\nu) + \varepsilon\int_S g^*_+\,d\rho. \tag{4.8} \]
The other direction is more delicate. Take $f(\varepsilon) = G(\mu + \varepsilon\rho\|\nu)$. From Lemma 3.1 we know that $f$ is convex, lower semicontinuous and finite on $[0, \varepsilon_0]$. Using a property of convex functions in one dimension, we know $f$ is differentiable on $(0, \varepsilon_0)$ except at countably many points. Take $\varepsilon \in (0, \varepsilon_0)$ to be a point where $f$ is differentiable, and $\delta > 0$ small. Take $g^*_\varepsilon \in \operatorname{Lip}(c, S)$ to be the optimizer for $G(\mu + \varepsilon\rho\|\nu)$ satisfying $g^*_\varepsilon(x_0) = 0$ for some $x_0$ in the support of $\nu$. Then using an argument that already appeared in this proof, we obtain a matching upper bound on the difference quotient in terms of $\int_S g^*_\varepsilon\,d\rho$. If we denote by $f'(0+)$ the right derivative at $0$, then by a property of convex functions ([21], Thm. 24.1), for any sequence $\{\varepsilon_n\}_{n\in\mathbb{N}}$ such that $\varepsilon_0 > \varepsilon_n \downarrow 0$ and $f$ is differentiable at each $\varepsilon_n > 0$, we have $f'(\varepsilon_n) \to f'(0+)$. By the same argument used in the proof of Theorem 4.9 (paragraphs following Lem. 4.11), i.e., by applying the Arzelà-Ascoli theorem to $\{g^*_{\varepsilon_n}\}$ on each compact set $K_m \subset S$, and then using a diagonalization argument, there exists a subsequence $\{n_k\}_{k\ge 0} \subset \{n\}_{n\ge 0}$ such that $g^*_{\varepsilon_{n_k}}$ converges pointwise to a function that we denote by $g^*_0 \in \operatorname{Lip}(c, S)$. To simplify the notation, let $n$ denote the convergent subsequence.
Since both sides of the inequality coincide, $g^*_0$ must be an optimizer for the variational expression (4.4). By Theorem 4.9 and equation (4.6), we have $g^*_0(x) \le g^*_+(x)$ for all $x \in S$ and $g^*_0(x) = g^*_+(x)$ for all $x \in \operatorname{supp}(\rho_-) \subset \operatorname{supp}(\mu)$. Thus (4.10) and the other direction of the inequality are proved. Combining (4.10) and (4.8) gives the first statement of the theorem. We next consider the second statement. For the given $\rho \in \mathcal{M}_0(S)$ and $\varepsilon \in (0, \varepsilon_0)$ a lower bound on the difference quotient follows as before, and thus a bound on the liminf. For the reverse direction the line of argument parallels the previous case. With now $f(\varepsilon) = G_\Gamma(\mu\|\nu + \varepsilon\rho)$, we again have a right derivative at $\varepsilon = 0$. Let $g^*_\varepsilon \in \operatorname{Lip}(c, S)$ satisfy $g^*_\varepsilon(x_0) = 0$ for some point $x_0$ in the support of $\nu$. Without loss we can assume $f(\varepsilon)$ is differentiable at $\varepsilon > 0$, and for $\delta > 0$ we obtain the corresponding bounds on the difference quotients.
We can assume that there is a sequence $\varepsilon_n \to 0$ and $g^*_0$ such that $g^*_{\varepsilon_n}$ converges pointwise to $g^*_0$, and so under the assumptions of the theorem the stated limit follows.
Remark 4.18. One can consider $g^*_+$ defined in (4.5) as the unique potential associated with $G_\Gamma(\mu\|\nu)$. This $g^*_+$ is similar to the Kantorovich potential in the optimal transport literature. However, for the optimal transport cost $W_c(\mu,\nu)$ more conditions are needed (e.g., [23], Prop. 7.18) to ensure the uniqueness of the Kantorovich potential. Here under very mild conditions we are able to confirm the uniqueness of the potential, and prove that it is the directional derivative of the corresponding $\Gamma$-divergence, as is the case for the Kantorovich potential for optimal transport cost when its uniqueness is established.

Limits and approximations of the Γ-divergence
In this section we consider limits that are obtained as the admissible set becomes large or small, in which case the Γ-divergence is approximated by relative entropy or a transport distance, respectively. We also consider more informative expansions in special cases. Throughout the section we assume the conditions of Theorem 4.9.
On the other hand, we have by (5.1) that lim sup b→∞ G bΓ0 (µ ν) ≤ R(µ ν), and the statement is proved.
2) R(µ ν) = ∞. For this case we want to prove that lim inf_{b→∞} G_{bΓ_0}(µ ν) = ∞. If not, then there exists a subsequence {b_k}_{k∈N} such that For this subsequence we can apply the argument used in part 1) to conclude that there exists γ*_{b_k} such that Moreover, there exists a further subsequence of this sequence, which for simplicity we also denote by {b_k}_{k∈N}, which satisfies γ*_{b_k} ⇒ µ. Then, by the same argument as in 1), we would conclude This contradiction proves the statement.
On the other hand, if Γ = δΓ_0 for small δ > 0, we can approximate the Γ-divergence in terms of W_{Γ_0}.
Proof. For any δ > 0, Jensen's inequality implies and therefore lim sup For the reverse inequality we consider two cases.
In the rest of this section we investigate the behavior as b → ∞, and in particular how G_{bΓ_0}(µ ν) behaves for fixed µ and ν. We consider only the case where Γ_0 = Lip(c, S; C_b(S)) for a function c that satisfies the conditions of Theorem 4.2 and Assumptions 4.4 and 4.6, and µ, ν ∈ L^1(a) with a as in Assumption 4.6. We separate the cases according to whether µ and ν are discrete or continuous. The results presented here cover only special cases, and further development of these sorts of expansions would be useful.

Finitely supported discrete measures
We will consider the case where supp(ν) has finite cardinality, and µ is also discrete with finite support. The proof of the following appears in the Appendix.
where e(b) ≤ 0 satisfies e(b) → 0 as b → ∞. Furthermore, we can characterize γ̄ as the measure that minimizes R(γ ν) over the collection of γ ∈ P(S) that satisfy the constraint To simplify the statement below, we further assume that

Remark 5.4. In the discrete case it is easily checked that the infimum in (5.3) is achieved. Take a sequence θ_n ≪ ν such that Since each θ_n is supported on the compact set supp(ν) = {x_i}_{1≤i≤N}, the family {θ_n}_{n∈N} is precompact, and hence there exist θ* ≪ ν and a subsequence {θ_{n_k}}_{k∈N} that converges weakly to θ*. By the lower semicontinuity of W_{Γ_0}, and therefore θ* achieves the infimum in (5.3).

An example where ν is continuous
To illustrate an interesting scaling phenomenon, we consider the example with S = R, c(x, y) = |x − y|, ν = Unif([0, 1]), and µ = δ_0. Consider γ*(dx) = c_0 e^{−bx} dx and g*(x) = −bx for 0 ≤ x ≤ 1, where c_0 is the normalizing constant. For this example Γ_0 = Lip(c, S; C_b(S)) is the set of bounded functions on R with Lipschitz constant 1. It is easily checked using Theorem 4.13 that γ* and g* are the optimizers in For comparison we consider the optimal transport cost between µ and ν. We have and one can calculate that W_c(µ, ν) = 1/2. Thus W_{bc}(µ, ν) = b/2, and so G_{bΓ_0}(µ ν) gives a much smaller divergence between the non-absolutely continuous measures µ and ν than the corresponding optimal transport cost as the admissible set Γ = bΓ_0 becomes large.
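The closed-form quantities in this example are easy to verify numerically. The following sketch is a numerical check only, not part of any proof; the grid size and the value b = 10 are arbitrary illustrative choices. It confirms W_c(µ, ν) = 1/2 and that γ* is a probability measure when c_0 = b/(1 − e^{−b}).

```python
import numpy as np

def trapezoid(f, x):
    # Basic trapezoidal quadrature on a grid.
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x) / 2.0))

x = np.linspace(0.0, 1.0, 100_001)

# mu = delta_0 and nu = Unif([0,1]): the only coupling transports each point
# y of [0,1] to 0, so W_c(mu, nu) = integral of |y - 0| dy over [0,1] = 1/2.
w_c = trapezoid(np.abs(x - 0.0), x)

# gamma*(dx) = c0 * exp(-b*x) dx on [0,1] is a probability measure when
# c0 = b / (1 - exp(-b)); b = 10 is an arbitrary illustrative value.
b = 10.0
c0 = b / (1.0 - np.exp(-b))
mass = trapezoid(c0 * np.exp(-b * x), x)

print(w_c)   # 0.5
print(mass)  # approximately 1.0
```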
6. Application to uncertainty bounds

Extension to unbounded functions
The inequality above, with relative entropy in place of G_Γ(µ ν), is the key to the uncertainty bounds in [11]. We would like to extend this inequality to unbounded functions. Define Since g_i ∈ Γ for all i, Taking i → ∞ in the last display gives (6.1). For g ∈ Γ̂_− the reasoning is essentially the same.
In the case when Γ = Lip(c, S; C b (S)), where c satisfies the conditions introduced in Section 4, we can get a stronger version of the result. The proof is essentially the same as in Lemma 4.8, and is omitted.

Decomposition and scaling properties
A property of great importance in applications of relative entropy is the chain rule. When probability measures can be decomposed, such as when Markov measures on a path space are written as repeated integration with respect to transition kernels, the chain rule allows one to decompose the relative entropy of two such measures on path space in terms of the simpler relative entropies of the transition kernels. This decomposition also exhibits important scaling properties of relative entropy, e.g., that for such Markov measures on path space the relative entropy scales in proportion to the number of time steps.
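For finite-state Markov measures the chain rule can be checked by direct computation. The following sketch uses hypothetical two-state kernels P and Q and a common initial law gamma (all chosen only for illustration) and verifies that the relative entropy of the one-step path measures equals the sum over states of the expected relative entropy of the kernels.

```python
import numpy as np

def rel_ent(a, b):
    # R(a || b) for finite discrete distributions with a absolutely continuous w.r.t. b.
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

# Hypothetical transition kernels on {0, 1} and a common initial distribution.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.7, 0.3], [0.4, 0.6]])
gamma = np.array([0.5, 0.5])

# One-step path measures for (X0, X1): mu(i, j) = gamma_i Q_ij, nu(i, j) = gamma_i P_ij.
mu = gamma[:, None] * Q
nu = gamma[:, None] * P

# Chain rule: R(mu || nu) = R(gamma || gamma) + sum_i gamma_i R(Q(i, .) || P(i, .)),
# and here R(gamma || gamma) = 0 since the initial laws coincide.
lhs = rel_ent(mu.ravel(), nu.ravel())
rhs = sum(gamma[i] * rel_ent(Q[i], P[i]) for i in range(2))
print(abs(lhs - rhs))  # ~0 (agreement up to floating point)
```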
Except in special circumstances, optimal transport metrics do not possess a property like the chain rule, and it is therefore not to be expected that Γ-divergence would either. However, if one considers certain classes of functions on path space, then one can show there are analogous decomposition and scaling properties. In this section we will discuss a setting relevant to many applications, though the results have many analogues and possible generalizations.
As usual, we assume that S is a Polish space, and let p : S × B(S) → [0, 1] be a probability transition kernel:
- for every A ∈ B(S), the map x → p(x, A) is Borel measurable, and
- for every x ∈ S, p(x, ·) is in P(S).
The quantities of interest are large and infinite time averages, both with respect to time and the underlying distribution, and we wish to bound in a tight fashion the error in such quantities due to model misspecification.
Thus, if q is some other transition kernel, then we seek useful bounds on differences of the form where E_{γ,p} indicates that the chain uses transition kernel p and initial distribution γ, and similarly for E_{θ,q}. Under suitable conditions, relative entropy can provide useful bounds when q(x, ·) ≪ p(x, ·) for a suitable set of x ∈ S. One question, then, is under what conditions the Γ-divergence allows one to weaken this absolute continuity restriction. It is also worth noting that even when q(x, ·) ≪ p(x, ·), the bounds obtained using the Γ-divergence (when applicable) are tighter, since it is never greater than relative entropy, and in some cases the improvement can be dramatic. These issues will be explored in greater detail elsewhere.
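As a concrete instance of the quantity being bounded, the sketch below computes the gap between stationary averages of a cost f under a nominal two-state kernel p and a perturbed kernel q. The matrices and the cost f are hypothetical choices for illustration only; the bound itself is not computed here.

```python
import numpy as np

def stationary(P):
    # Stationary distribution: the left eigenvector of P for eigenvalue 1,
    # normalized to a probability vector.
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

# Hypothetical nominal kernel p, perturbed kernel q, and cost f on {0, 1}.
p = np.array([[0.9, 0.1], [0.2, 0.8]])
q = np.array([[0.85, 0.15], [0.25, 0.75]])
f = np.array([1.0, 0.0])

# Difference of stationary averages: the quantity the bounds aim to control.
gap = abs(stationary(q) @ f - stationary(p) @ f)
print(gap)  # 1/24, approximately 0.0417
```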
It follows directly from the discussion in earlier sections that, even in the setting of product measures, one must restrict the class of functions f under consideration. When considering Markov measures, the following definition is relevant. Then R(Γ, p) determines the set of costs f for which bounds can be obtained using the Γ-divergence. In particular, we have the following.

Theorem 6.4. Suppose that f ∈ R(Γ, p) for some g and a. Consider any transition kernel q on S and any stationary probability measure π_q of q. Then ∫_S f(x) π_q(dx) ≤ ∫_S G_Γ(q(x, ·) p(x, ·)) π_q(dx) + a.
Remark 6.5. If p is ergodic, then we recognize as the equation that uniquely characterizes the multiplicative cost, with g a type of cost potential. Note that for a given f the function g plays no role in the bound: we need to check that f is in the range of Γ (which of course imposes restrictions on f), but the bound does not depend on knowing the specific form of g.
We next consider two examples to illustrate Definition 6.3.
Whether or not J is of full rank will depend on the structure of P. We have the following lemma.

Proof. Let π denote the stationary distribution of P. Then, interpreting π as a column vector, it is the unique vector in the null space of (P − I)^T. According to the Fredholm alternative, the range of (P − I) is the (n − 1)-dimensional collection of vectors b ∈ R^n such that ⟨b, π⟩ = 0. Now ⟨1, π⟩ > 0, which shows that 1 is not in the range of (P − I). Therefore the range of J is all of R^n.
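The rank statement in the lemma is easy to check numerically. A minimal sketch, assuming J denotes P − I augmented with the constant column 1 (as the Fredholm-alternative argument suggests), for a hypothetical 3-state ergodic chain:

```python
import numpy as np

# Hypothetical ergodic transition matrix on 3 states (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])
n = P.shape[0]

A = P - np.eye(n)                    # rank n - 1: A @ 1 = 0, so 1 spans the kernel
J = np.hstack([A, np.ones((n, 1))])  # appending the column 1 restores full rank

print(np.linalg.matrix_rank(A))  # 2
print(np.linalg.matrix_rank(J))  # 3
```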
To give a simple example of how the Γ-divergence could be used for model simplification, consider the situation where we are given an ergodic chain P with state space S̄, and would like to replace P by a chain Q with state space S = S̄ ∪ M, where the new states are intended to replace a (possibly large) number of states in S̄, with the goal of maintaining a good approximation of certain functionals of the stationary distribution. If π_q denotes the stationary distribution of Q on S and π_p that of P on S̄, then one could not use relative entropy to obtain any bounds. Suppose we were to extend P to S̄ ∪ M (while keeping P as the transition matrix) by making all states in M transient. Then one could use the Γ-divergence as long as the functionals of interest are in R(Γ, p) (with respect to the extended transition probabilities). Note that the location of the new states is relevant to this question, since the costs f depend on these locations. Similarly, one could obtain sensitivity bounds for non-absolutely continuous transitions by using such a device.

Conclusion
In this paper we defined a new divergence by starting with a variational representation for relative entropy and placing additional restrictions on the collection of test functions used in the representation, so as to relax the requirement of absolute continuity. Basic qualitative properties of the divergence were investigated, as well as its relationship with optimal transport metrics. Future work will use the divergence to develop uncertainty quantification bounds, sensitivity bounds, and methods for model approximation and simplification for stochastic models without the absolute continuity requirement. Also needed is further investigation of qualitative and computational aspects of the Γ-divergence.

Appendix A.
In this appendix we collect proofs of some intermediate results.
Proof of Lemma 2.1. If we prove item 3, then items 1 and 2 will follow from the corresponding statements when µ is restricted to P(S) [9]. If m = µ(S) ≠ 1, then taking g(x) ≡ c, a constant, gives ∫_S g dµ − log ∫_S e^g dν = c(m − 1). Letting c → ∞ (or c → −∞ if m < 1) and using (1.1) (or more precisely the analogous statement using bounded measurable functions) shows R(µ ν) = ∞.
Proof of Lemma 4.11. This can be shown by contradiction. Assume there exists h ∈ Lip(c, S) such that ∫_S |h| dγ* = ∞. By symmetry we can consider h to be non-negative, since max(h, 0) ∈ Lip(c, S) and h = max(h, 0) − max(−h, 0). Thus we can assume there exists a non-negative h ∈ Lip(c, S) satisfying ∫_S h dγ* = ∞, and by the fact that µ ∈ L^1(a) together with Assumption 4.6, where the second-to-last equality comes from the dominated and monotone convergence theorems applied to the first and second terms, respectively. However, since γ* is the optimizer, we have W_c(µ, γ*) ≤ W_c(µ, γ*) + R(γ* ν) = G(µ ν) < ∞.
This contradiction shows the integrability of every Lip(c, S) function with respect to γ*.
By combining the two expressions above, we have that for x ∈ supp(µ), (4.5) also holds. In other words, g* is completely characterized by g*|_{supp(ν)} and (4.5).
where the fourth inequality follows from R(γ_∞ ν) ≥ R(γ̄ ν) and the lower semicontinuity of W_{Γ_0}(µ, ·). Since ε > 0 is arbitrary, this establishes (5.2) along the given subsequence. For any other sequence {b_k}_{k∈N} along which lim_{k→∞} (G_{b_kΓ_0}(µ ν) − [R(γ̄ ν) + b_k W_{Γ_0}(µ, γ̄)]) has a limit, we can likewise take a further subsequence according to the discussion above. Thus the statement is proved.
The proof of the claimed form for γ̄ is as follows. Let θ be any probability measure with θ ≪ ν, and assume that for some i Then there exists j ∈ S_i for which some of the mass is sent to a point x_k with c(x_k, y_j) > c(x_i, y_j). By taking this mass from x_k and assigning it to x_i, while keeping all other assignments the same, we obtain a strictly lower cost. Thus (A.3) cannot hold for any i at an optimizer, and therefore equality must hold for all i.