Thresholding gradient methods in Hilbert spaces: support identification and linear convergence

We study the $\ell^1$ regularized least squares optimization problem in a separable Hilbert space. We show that the iterative soft-thresholding algorithm (ISTA) converges linearly, without any assumption on the linear operator involved or on the problem. The result is obtained by combining two key concepts: the notion of extended support, a finite set containing the support, and the notion of conditioning over finite dimensional sets. We prove that ISTA identifies the extended support of the solution after a finite number of iterations, and we derive linear convergence from the conditioning property, which is always satisfied for $\ell^1$ regularized least squares problems. Our analysis extends to the entire class of thresholding gradient algorithms, for which we provide a conceptually new proof of strong convergence, as well as convergence rates.


INTRODUCTION
Recent works show that, for many problems of interest, a favorable geometry can greatly improve theoretical results with respect to a more general, worst-case perspective [1,16,5,20]. In this paper, we follow this perspective to analyze the convergence properties of thresholding gradient methods in separable Hilbert spaces. Our starting point is the now classic iterative soft-thresholding algorithm (ISTA) to solve the problem
(1) $f(x) = \|x\|_1 + \tfrac{1}{2}\|Ax - y\|^2$,
defined by an operator $A$ on $\ell^2(\mathbb{N})$, where $\|\cdot\|_1$ is the $\ell^1$ norm. From the seminal work [11], it is known that ISTA converges strongly in $\ell^2(\mathbb{N})$. This result is generalized in [9] to a wider class of algorithms, the so-called thresholding gradient methods, noting that these are special instances of the Forward-Backward algorithm, in which the proximal step reduces to a thresholding step onto an orthonormal basis (Section 2). Typically, strong convergence in Hilbert spaces is the consequence of a particular structure of the considered problem. Classic examples are even functions, functions for which the set of minimizers has a nonempty interior, or strongly convex functions [30]. Further examples are uniformly convex functions, or functions presenting a favorable geometry around their minimizers, like conditioned functions or Łojasiewicz functions; see e.g. [4,20]. Whether the properties of ISTA, and more generally of thresholding gradient methods, can be explained from this perspective is not apparent from the analysis in [11,9].
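To fix ideas, the ISTA iteration for (1) can be sketched in a few lines. The following finite-dimensional NumPy snippet is our own illustration, not code from the paper; the matrix, regularization weight, and iteration count are arbitrary choices.

```python
import numpy as np

def ista(A, y, lam=1.0, n_iter=500, x0=None):
    """Minimal sketch of ISTA for f(x) = lam*||x||_1 + 0.5*||A x - y||^2.

    The step size is taken below 2/L, where L = ||A||^2 is the Lipschitz
    constant of the gradient of the smooth part.
    """
    m, d = A.shape
    x = np.zeros(d) if x0 is None else x0.astype(float)
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # step < 2/L
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                # forward (gradient) step on h
        z = x - step * grad
        # backward step: componentwise soft thresholding at level step*lam
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x

# tiny synthetic example with a sparse ground truth
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17]] = [2.0, -1.5]
y = A @ x_true
x_hat = ista(A, y, lam=0.1, n_iter=2000)
```

The iterates stay sparse because the soft threshold sets small components exactly to zero, which is the mechanism behind the identification results discussed later.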
Our first contribution is revisiting these results to provide such an explanation: for these algorithms, the whole sequence of iterates is contained in a specific finite-dimensional subspace, which automatically ensures strong convergence. The key argument in our analysis is that, after a finite number of iterations, the iterates identify the so-called extended support of their limit. This set coincides with the active constraints at the solution of the dual problem, and reduces to the support if a qualification condition is satisfied. Going further, we tackle the question of convergence rates, providing a unified treatment of the finite and infinite dimensional settings. In finite dimensions, it is clear that if $A$ is injective, then $f$ is strongly convex, which guarantees a linear convergence rate. In [22], it is shown, still in a finite dimensional setting, that linear rates hold just assuming $A$ to be injective on the extended support of the problem. This result is generalized in [8] to a Hilbert space setting, assuming $A$ to be injective on every finitely supported subspace. Linear convergence is also obtained by assuming the limit solution to satisfy some nondegeneracy condition [8,26]. In fact, it was shown recently in [6] that, in finite dimension, no assumption at all is needed to guarantee linear rates. Using a key result in [25], the function $f$ was shown to be 2-conditioned on its sublevel sets, and 2-conditioning is sufficient for linear rates [2]. Our identification result, mentioned above, allows us to easily bridge the gap between the finite and infinite dimensional settings. Indeed, we show that in any separable Hilbert space, linear rates of convergence always hold for the soft-thresholding gradient algorithm, under no further assumptions. Once again, the key argument to obtain linear rates is the fact that the iterates generated by the algorithm identify, in finite time, a set on which we know the function to have a favorable geometry.
The paper is organized as follows. In Section 2 we describe our setting and introduce the thresholding gradient method. We introduce the notion of extended support in Section 3, in which we show that the thresholding gradient algorithm identifies this extended support after a finite number of iterations (Theorem 3.9). In Section 4 we present some consequences of this result on the convergence of the algorithm. We first derive in Section 4.1 the strong convergence of the iterates, together with a general framework to guarantee rates. We then specialize our analysis to the function (1) in Section 4.2, and show the linear convergence of ISTA (Theorem 4.8). We also consider in Section 4.3 an elastic-net modification of (1), obtained by adding an $\ell^p$ regularization term, and provide rates as well, depending on the value of $p \in \,]1, +\infty[$.

THRESHOLDING GRADIENT METHODS
Notation. We introduce some notation used throughout this paper. $N$ is a subset of $\mathbb{N}$. Throughout the paper, $X$ is a separable Hilbert space endowed with the scalar product $\langle \cdot, \cdot \rangle$, and $(e_k)_{k \in N}$ is an orthonormal basis of $X$. Given $x \in X$, we set $x_k = \langle x, e_k \rangle$. The support of $x$ is $\operatorname{supp}(x) = \{k \in N \mid x_k \neq 0\}$. Analogously, given $C \subset X$, $C_k = \{\langle x, e_k\rangle : x \in C\}$. Given $J \subset N$, the subspace supported by $J$ is denoted by $X_J = \{x \in X \mid \operatorname{supp}(x) \subset J\}$, and the subset of finitely supported vectors is $c_{00} = \{x \in X : \operatorname{supp}(x) \text{ is finite}\}$. Given a collection of intervals $\{I_k\}_{k\in N}$ of the real line, with a slight abuse of notation, we define $\prod_{k\in N} I_k = \{x \in X \mid (\forall k \in N)\ x_k \in I_k\}$. Note that $\prod_{k\in N} I_k$ is a subset of $X$; therefore, the components of each of its elements must be square summable. The closed ball of center $x \in X$ and radius $\delta \in \,]0,+\infty[$ is denoted by $B_X(x,\delta)$. Let $C \subset X$ be a closed convex set. Its indicator and support functions are denoted $\delta_C$ and $\sigma_C$, respectively, and the projection onto $C$ is $\operatorname{proj}_C$. Moreover, $\operatorname{int} C$, $\operatorname{bd} C$, $\operatorname{ri} C$, and $\operatorname{qri} C$ denote respectively the interior, the boundary, the relative interior, and the quasi relative interior of $C$ [4, Section 6.2]. The set of proper convex lower semicontinuous functions from $X$ to $\mathbb{R}\cup\{+\infty\}$ is denoted by $\Gamma_0(X)$. Let $f \in \Gamma_0(X)$ and let $r \in \,]0,+\infty[$.
The proximity operator of $f$ is defined as $\operatorname{prox}_f(x) = \operatorname{argmin}_{z \in X}\, f(z) + \tfrac{1}{2}\|x - z\|^2$. Given a proper closed interval $I \subset \mathbb{R}$, $\operatorname{soft}_I := \operatorname{Id} - \operatorname{proj}_I$ is the soft-thresholder corresponding to $I$.
Problem and main hypotheses. We consider the general optimization problem
(P) $\min_{x \in X} f(x) = g(x) + h(x)$,
where typically $h$ plays the role of a smooth data fidelity term, and $g$ is a nonsmooth sparsity promoting regularizer. More precisely, we make the following assumption:
(H) • $h \in \Gamma_0(X)$ is differentiable with an $L$-Lipschitz continuous gradient;
• for all $k \in N$, $I_k$ is a proper closed interval of $\mathbb{R}$, and there exists $\omega \in \,]0,+\infty[$ such that $[-\omega, \omega] \subset I_k$;
• $g(x) = \|x\|_{1,I} + \psi(x)$, where $\|x\|_{1,I} = \sum_{k\in N} \sigma_{I_k}(x_k)$ and $\psi(x) = \sum_{k\in N} \psi_k(x_k)$, with each $\psi_k \in \Gamma_0(\mathbb{R})$ positive, differentiable at $0$, and minimized at $0$.
As stated in the above assumption, in this paper we focus on a specific class of functions $g$: they are given by the sum of a weighted $\ell^1$ norm and a positive smooth function minimized at the origin. In [9] the following characterization has been proved: the proximity operators of such functions $g$ are the monotone operators $T : X \to X$ such that, for all $x \in X$, $T(x) = (T_k(x_k))_{k\in N}$, for suitable one-dimensional thresholding functions $T_k : \mathbb{R} \to \mathbb{R}$. A few examples of such, so-called, thresholding operators are shown in Figure 1, and a more in-depth analysis can be found in [9]. Observe that here the range of $\operatorname{prox}_g$ is equal to the domain of $\partial\psi$. A well-known approach to approximate solutions of (P) is the Forward-Backward algorithm [4]
(FB) $x^0 \in X$, $\lambda \in \,]0, 2L^{-1}[$, $x^{n+1} = \operatorname{prox}_{\lambda g}(x^n - \lambda \nabla h(x^n))$.
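As a concrete illustration of the componentwise structure $T(x) = (T_k(x_k))_{k\in N}$, here is a short sketch of our own (not the paper's code) of the soft-thresholder $\operatorname{soft}_I = \operatorname{Id} - \operatorname{proj}_I$ for a closed interval $I = [\mathrm{lo}, \mathrm{hi}]$. For the symmetric choice $I = [-w, w]$ it reduces to the classic soft-thresholding formula.

```python
import numpy as np

def soft_interval(t, lo, hi):
    """soft_I = Id - proj_I for the closed interval I = [lo, hi].

    This is the proximity operator of the support function sigma_I; the
    symmetric case I = [-w, w] gives the usual soft threshold at level w.
    """
    return t - np.clip(t, lo, hi)   # proj onto an interval is a clip

# sanity check: symmetric interval recovers the classic formula
t = np.linspace(-3.0, 3.0, 7)
w = 1.0
classic = np.sign(t) * np.maximum(np.abs(t) - w, 0.0)
ours = soft_interval(t, -w, w)
```

A prox of the full regularizer $g$ would then act componentwise, applying such a map to each coordinate $x_k$, which is exactly the thresholding structure used by (FB).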
In our setting, (FB) is well defined and specializes to a thresholding gradient method. The proposition below gathers some basic properties of $g$ and $f$ that follow from assumption (H).
Proposition 2.1. The following hold.
(i) $\|\cdot\|_{1,I}$ is the support function of $B_{\infty,I} = \prod_{k\in N} I_k$;
(ii) $\operatorname{dom} \partial \|\cdot\|_{1,I} = c_{00}$;
(iii) $g \in \Gamma_0(X)$ and it is coercive;
(iv) $f$ is bounded from below and $\operatorname{argmin} f$ is nonempty;
(v) the dual problem (D) admits a unique solution $\bar{u} \in X$, and for all $\bar{x} \in \operatorname{argmin} f$, $\bar{u} = -\nabla h(\bar{x})$;
(vi) for all $x \in X$ and all $\lambda > 0$, the proximal operator of $g$ can be expressed componentwise as $\operatorname{prox}_{\lambda g}(x) = \big(\operatorname{prox}_{\lambda\psi_k} \circ \operatorname{soft}_{\lambda I_k}(x_k)\big)_{k\in N}$.
Proof. (vi): it follows from A.5(iv) together with [9, Proposition 3.6].

Definition and basic properties.
We introduce the notion of extended support of a vector and prove some basic properties of the support of solutions of problem (P).
It is worth noting that the notion of extended support depends on the problem (P), since its definition involves h (see Remark 3.4 for more details). It appears without a name in [22], and also in [14,15,17] for regularized least squares problems. Below we gather some results about the support and the extended support.
Proof. Let $x \in \operatorname{dom} \partial f = \operatorname{dom} \partial g$, and let us start by verifying that $\operatorname{supp}(x)$ is finite. Let $x^* \in \partial g(x)$, and let $y = x + x^*$, so that $x = \operatorname{prox}_g(y)$. Proposition 2.1(vi) implies that for all $k \in \operatorname{supp}(x)$, $\operatorname{prox}_{\psi_k} \circ \operatorname{soft}_{I_k}(y_k) \neq 0$. Lemma A.4 and the definition of $\operatorname{soft}_{I_k}$ imply that $y_k \notin I_k$, and in particular that $|y_k| \geq \omega$ for all $k \in \operatorname{supp}(x)$. Since $y \in X$, its components are square summable, so only finitely many of them can satisfy $|y_k| \geq \omega$; hence $\operatorname{supp}(x)$ is finite. Next, we have to verify that $J$ is finite, where $J = \{k \in N \mid -\nabla h(x)_k \in \operatorname{bd} I_k\}$. If $N$ is finite, this is trivial. Otherwise, we observe that $(\nabla h(x)_k)_{k\in N} \in \ell^2(N)$, which implies that $\nabla h(x)_k$ tends to $0$ as $k \to +\infty$ in $N$. Since $[-\omega, \omega] \subset I_k$, we deduce that $J$ must be finite.
The following proposition clarifies the relationship between the support and the extended support for minimizers.
The direct inclusion comes immediately from the definition of $\operatorname{esupp}(\bar{x})$ and (iv). For the reverse implication, assume that $\operatorname{esupp}(\bar{x}) = \cup\{\operatorname{supp}(x) \mid x \in \operatorname{argmin} f\}$ holds, and use the fact that $\operatorname{esupp}(\bar{x})$ is finite to apply Lemma A.9 and obtain some $x \in \operatorname{argmin} f$ such that $\operatorname{supp}(x) = \operatorname{esupp}(\bar{x})$. We then conclude that $0 \in \operatorname{qri} \partial f(x)$ using (iv) and (ii).

Remark 3.4 (Extended support and active constraints). Assume that $\psi = 0$. Since $g^*$ is the indicator function of $B_{\infty,I}$, in this case the dual problem (D) introduced in Proposition 2.1(v) can be rewritten as
(D') $\min_{u \in X}\ h^*(-u)$ subject to $(\forall k \in N)\ u_k \in I_k$.
This problem admits a unique solution $\bar{u} \in B_{\infty,I}$, and the set of active constraints at $\bar{u}$ is $\{k \in N \mid \bar{u}_k \in \operatorname{bd} I_k\}$. Since $\bar{u} = -\nabla h(\bar{x})$ for any $\bar{x} \in \operatorname{argmin} f$ by Proposition 2.1(v), Proposition 3.3(iii) implies that the extended support for the solutions of (P) is, in that case, nothing but the set of active constraints for the solution of (D').
Remark 3.5 (Maximal support and interior solution). If $\psi = 0$ and the following (weak) qualification condition holds
(w-CQ) $(\exists \bar{x} \in \operatorname{argmin} f)\quad 0 \in \operatorname{qri} \partial f(\bar{x}),$
then, thanks to Lemma A.9, the extended support is the maximal support to be found among the solutions. If for instance $h$ is the least squares loss on a finite dimensional space, it can be shown that the solutions having a maximal support are the ones belonging to the relative interior of the solution set [3, Theorem 2]. However, there are problems for which (w-CQ) does not hold. In such a case, Proposition 3.3 implies that the extended support is strictly larger than the maximal support (see Example 3.7). The gap between the maximal support and the extended support is equivalent to the lack of duality between (P) and (D).
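For the plain $\ell^1$ case ($I_k = [-1, 1]$, up to rescaling by the regularization weight), the identification of the extended support with the active dual constraints can be checked numerically. The snippet below is a hedged illustration on a random instance we construct ourselves: after running ISTA to (near) convergence, the candidate dual point $u = A^\top(y - A\hat{x}) = -\nabla h(\hat{x})$ is feasible ($|u_k| \le 1$), and the support of $\hat{x}$ is contained in the active set $\{k : |u_k| = 1\}$.

```python
import numpy as np

# Our own toy instance of f(x) = ||x||_1 + 0.5*||Ax - y||^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 12))
y = A @ np.concatenate([np.array([3.0, -2.0]), np.zeros(10)])

step = 1.0 / np.linalg.norm(A, 2) ** 2   # step < 2/L
x = np.zeros(12)
for _ in range(20000):                   # enough iterations to converge
    z = x - step * (A.T @ (A @ x - y))
    x = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)

u = A.T @ (y - A @ x)                    # dual candidate  -grad h(x)
esupp = set(np.flatnonzero(np.abs(u) > 1 - 1e-6))   # active constraints
supp = set(np.flatnonzero(x != 0))
```

On this instance `supp` is contained in `esupp`, matching the description of the extended support as the active set of (D').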
Example 3.6. Let $g$ be as in Figure 2. The solutions $\bar{x} \in \,]\bar{x}_1, \bar{x}_2[$ are the ones having the maximal support, since $\operatorname{supp}(\bar{x}) = \{1, 2\}$, and they also satisfy $0 \in \operatorname{ri} \partial f(\bar{x})$. On the relative boundary of $\operatorname{argmin} f$, instead, the support is strictly smaller. This example is one for which the extended support is the maximal support among the solutions.

Finite identification.
A sparse solution $\bar{x}$ of problem (P) is usually approximated by means of an iterative procedure $(x^n)_{n\in\mathbb{N}}$. To obtain an interpretable approximation, a crucial property is that, after a finite number of iterations, the support of $x^n$ stabilizes and is included in the support of $\bar{x}$. In that case, we say that the sequence $(x^n)_{n\in\mathbb{N}}$ identifies $\operatorname{supp}(\bar{x})$. The support identification property has been the subject of active research in recent years [22,15,26,18,17]; roughly speaking, in finite dimension it is known that support identification holds whenever $\bar{x}$ satisfies the qualification condition $0 \in \operatorname{qri} \partial f(\bar{x})$. But this assumption is often not satisfied in practice, in particular for noisy inverse problems (see e.g. [18]). In [22,14], the case $g(x) = \|x\|_1$ is studied in finite dimension, and it is shown that the extended support of $\bar{x}$ is identified even if the qualification condition does not hold. Thus, the qualification condition $0 \in \operatorname{qri} \partial f(\bar{x})$ is only used to ensure that the extended support coincides with the support (see Proposition 3.3).
In this section we extend these ideas to the setting of thresholding gradient methods in separable Hilbert spaces, and we show in Theorem 3.9 that the extended support is indeed always identified after a finite number of iterations. For this, we need to introduce a quantity that measures the stability of the dual problem (D).
Definition 3.8. We define the function $\rho : X \to \mathbb{R}$ as follows: for every $u \in X$,
$\rho(u) = \inf\,\{\operatorname{dist}(u_k, \operatorname{bd} I_k) \mid k \in N,\ u_k \in \operatorname{int} I_k\}.$
Also, given any $\bar{x} \in \operatorname{argmin} f$, we define $\rho_{\mathrm{sol}} = \rho(-\nabla h(\bar{x}))$.
It can be verified that $\rho(u) \in \,]0,+\infty[$ for all $u \in X$ (this is proved in the Annex, see Proposition A.2). Moreover, $\rho_{\mathrm{sol}}$ is uniquely defined, thanks to Proposition 2.1(v).
Theorem 3.9 (Finite identification of the extended support). Let $(x^n)_{n\in\mathbb{N}}$ be generated by the Forward-Backward algorithm (FB), and let $\bar{x}$ be any minimizer of $f$. Then the number of iterations for which the support of $x^n$ is not included in $\operatorname{esupp}(\bar{x})$ is finite, and cannot exceed $\rho_{\mathrm{sol}}^{-2}\lambda^{-2}\|x^0 - \bar{x}\|^2$.
Remark 3.10 (Optimality of the identification result). Theorem 3.9 implies that after some iterations the inclusion $\operatorname{supp}(x^n) \subset \operatorname{esupp}(\bar{x})$ holds. Let us now discuss whether the result can be improved, i.e. whether in general we can identify a set smaller than $\operatorname{esupp}(\bar{x})$. In other words, is it true that
(6) $(\exists x^0 \in X)(\exists \bar{x} \in \operatorname{argmin} f)(\forall n \in \mathbb{N})\quad \operatorname{supp}(x^n) = \operatorname{esupp}(\bar{x})\,?$
If (w-CQ) holds, the answer is yes. Indeed, if there is $\bar{x} \in \operatorname{argmin} f$ such that $0 \in \operatorname{qri} \partial f(\bar{x})$, we derive from Proposition 3.3(i) that $\operatorname{esupp}(\bar{x}) = \operatorname{supp}(\bar{x})$. So, by taking $x^0 = \bar{x}$ and using the fact that it is a fixed point of the Forward-Backward iteration, we conclude that $\operatorname{supp}(x^n) \equiv \operatorname{esupp}(\bar{x})$. If (w-CQ) does not hold, then this argument cannot be used, and it is not clear in general whether there always exists an initialization producing a sequence verifying (6). Consider for instance the function in Example 3.7. Taking $x^0 \in \,]0,+\infty[$ and a stepsize $\lambda \in \,]0,1[$, the iterates are defined by $x^{n+1} = (1-\lambda)x^n$, meaning that for all $n \in \mathbb{N}$, $\operatorname{supp}(x^n) \equiv \{1\}$, which is exactly $\operatorname{esupp}(\bar{x})$. So in that case (6) holds true.
Proof. Let $\bar{x} \in \operatorname{argmin} f$, and let $E = X_{\operatorname{esupp}(\bar{x})}$ be the finite dimensional subspace of $X$ supported by $\operatorname{esupp}(\bar{x})$. First define the "gradient step" operator $T_{\lambda h} := \operatorname{Id} - \lambda\nabla h$, so that the Forward-Backward iteration can be rewritten as $x^{n+1} = \operatorname{prox}_{\lambda g}(T_{\lambda h}(x^n))$. Proposition 2.1(vi) implies that for all $k \in N$ and all $n \in \mathbb{N}^*$, $x^n_k = \operatorname{prox}_{\lambda\psi_k} \circ \operatorname{soft}_{\lambda I_k}(T_{\lambda h}(x^{n-1})_k)$. Since $\bar{x}$ is a fixed point of the Forward-Backward iteration [4, Proposition 26.1(iv)], we also have
(7) $\bar{x}_k = \operatorname{prox}_{\lambda\psi_k} \circ \operatorname{soft}_{\lambda I_k}(T_{\lambda h}(\bar{x})_k)$.
Using the fact that $\operatorname{prox}_{\lambda\psi_k}$ is nonexpansive, and that $\operatorname{soft}_{\lambda I_k}$ is firmly nonexpansive [4, Proposition 12.28], we derive
(8) $|x^n_k - \bar{x}_k|^2 \leq |T_{\lambda h}(x^{n-1})_k - T_{\lambda h}(\bar{x})_k|^2 - \sigma_{n,k}^2$, with $\sigma_{n,k} := \big|\operatorname{proj}_{\lambda I_k}(T_{\lambda h}(x^{n-1})_k) - \operatorname{proj}_{\lambda I_k}(T_{\lambda h}(\bar{x})_k)\big|$.
Moreover, the gradient step operator $T_{\lambda h}$ is nonexpansive since $\lambda \in \,]0, 2L^{-1}[$ (see e.g. [24, Lemma 3.2]), so we end up with
(9) $(\forall n \in \mathbb{N}^*)(\forall k \in N)\quad \|x^n - \bar{x}\|^2 \leq \|x^{n-1} - \bar{x}\|^2 - \sigma_{n,k}^2$.
The key point of the proof is to get a nonnegative lower bound for $\sigma_{n,k}$, independent of $n$, when $x^n \notin E$. Assume that there is some $n \in \mathbb{N}^*$ such that $x^n \notin E$. This means that there exists $k \in N \setminus \operatorname{esupp}(\bar{x})$ such that $x^n_k \neq 0$. Also, since $\operatorname{supp}(\bar{x}) \subset \operatorname{esupp}(\bar{x})$, we must have $\bar{x}_k = 0$, meaning that $T_{\lambda h}(\bar{x})_k = -\lambda\nabla h(\bar{x})_k$. We deduce from (7), (8), and Lemma A.4 that
(10) $T_{\lambda h}(x^{n-1})_k \notin \lambda I_k$ and $T_{\lambda h}(\bar{x})_k \in \operatorname{int} \lambda I_k$.
Since $\operatorname{Id} - \operatorname{soft}_{\lambda I_k}$ is the projection onto $\lambda I_k$, we derive from (10) that $\operatorname{proj}_{\lambda I_k}(T_{\lambda h}(x^{n-1})_k) \in \operatorname{bd} \lambda I_k$. Therefore, by Definition 3.8 and (10), we obtain that $\sigma_{n,k} \geq \operatorname{dist}\big(T_{\lambda h}(\bar{x})_k, \operatorname{bd} \lambda I_k\big) \geq \lambda\rho_{\mathrm{sol}}$. Plugging this into (9), we obtain
(11) $\forall n \in \mathbb{N}^*,\quad x^n \notin E \;\Rightarrow\; \|x^n - \bar{x}\|^2 \leq \|x^{n-1} - \bar{x}\|^2 - \rho_{\mathrm{sol}}^2\lambda^2$.
Next, note that the sequence $(x^n)_{n\in\mathbb{N}}$ is Fejér monotone with respect to the minimizers of $f$ (see e.g. [20, Theorem 2.2]), meaning that $(\|x^n - \bar{x}\|)_{n\in\mathbb{N}}$ is a decreasing sequence. Therefore the inequality (11) cannot hold an infinite number of times. More precisely, $x^n \notin E$ can hold for at most $\lambda^{-2}\rho_{\mathrm{sol}}^{-2}\|x^0 - \bar{x}\|^2$ iterations.
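The finite identification phenomenon of Theorem 3.9 is easy to observe numerically. The following is a toy experiment of our own (random data, arbitrary sizes), not one from the paper: starting from a deliberately dense random point, ISTA prunes the support in finitely many iterations, after which it no longer changes.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 15))
y = A @ np.where(np.arange(15) < 2, 2.0, 0.0)   # sparse ground truth

step = 1.0 / np.linalg.norm(A, 2) ** 2          # step < 2/L
x = rng.standard_normal(15)                     # dense starting point
supports = []
for _ in range(20000):
    z = x - step * (A.T @ (A @ x - y))
    x = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)   # soft threshold, lam = 1
    supports.append(frozenset(np.flatnonzero(x != 0)))
```

After enough iterations the recorded supports become constant, and the limiting support is strictly smaller than the ambient dimension, illustrating the finite-time inclusion $\operatorname{supp}(x^n) \subset \operatorname{esupp}(\bar{x})$.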

STRONG CONVERGENCE AND RATES
4.1. General results for thresholding gradient methods. Strong convergence of the iterates of the thresholding gradient algorithm was first stated in [11, Section 3.2] for $g = \|\cdot\|_1$, and then generalized to general thresholding gradient methods in [9, Theorem 4.5]. We provide a new and simple proof of this result, exploiting the "finite-dimensionality" provided by the identification result in Theorem 3.9.
Corollary 4.1 (Strong convergence). Let $(x^n)_{n\in\mathbb{N}}$ be generated by the Forward-Backward algorithm (FB). Then: (i) there exists a finite set $J \subset N$ such that, for all $n \in \mathbb{N}^*$, $x^n \in X_J$; (ii) $(x^n)_{n\in\mathbb{N}}$ converges strongly to some $\bar{x} \in \operatorname{argmin} f$.
Proof. (i): define $J = \bigcup_{n\in\mathbb{N}^*} \operatorname{supp}(x^n)$, and observe that it is finite, as a finite union of finite sets (see Proposition 3.2 and Theorem 3.9).
(ii): it is well known that $\operatorname{argmin} f \neq \emptyset$ implies that $(x^n)_{n\in\mathbb{N}}$ converges weakly towards some $\bar{x} \in \operatorname{argmin} f$ (see e.g. [20, Theorem 2.2]). In particular, $(x^n)_{n\in\mathbb{N}}$ is a bounded sequence in $X$. Moreover, (i) implies that $(x^n)_{n\in\mathbb{N}^*}$ belongs to $X_J$, which is finite dimensional. These two facts imply that $(x^n)_{n\in\mathbb{N}^*}$ is contained in a compact subset of $X$ with respect to the strong topology, and thus converges strongly.
Next we discuss the rate of convergence for the thresholding gradient methods. Beforehand, we briefly recall how the geometry of a function around its minimizers is related to the rates of convergence of the Forward-Backward algorithm.
A $p$-conditioned function is a function which, on a given set, behaves essentially like $\operatorname{dist}(\cdot, \operatorname{argmin}\varphi)^p$. For instance, strongly convex functions are 2-conditioned on $\Omega = X$, and the constant $\gamma_{\varphi,X}$ is nothing but the constant of strong convexity. But the notion of $p$-conditioning is more general, and also describes the geometry of functions having more than one minimizer. For instance, in finite dimension, any positive quadratic function is 2-conditioned on $\Omega = X$, in which case the constant $\gamma_{\varphi,X}$ is the smallest nonzero eigenvalue of the Hessian. This notion is interesting since it allows one to obtain precise convergence rates for some algorithms (including the Forward-Backward one) [2]:
• sublinear rates if $p > 2$;
• linear rates if $p = 2$.
For more examples, related notions and references, we refer the interested reader to [16,5,20]. Corollary 4.1 highlights the fact that the behavior of the thresholding gradient method essentially depends on the conditioning of $f$ on finitely supported subspaces. It is then natural to introduce the following notion of finite uniform conditioning.
Definition 4.3. Let $p \in [2,+\infty[$. We say that a function $\varphi \in \Gamma_0(X)$ satisfies the finite uniform conditioning property of order $p$ if, for every finite $J \subset N$, every $\bar{x} \in \operatorname{argmin}\varphi$, and every $(\delta, r) \in \,]0,+\infty[^2$, $\varphi$ is $p$-conditioned on $X_J \cap B_X(\bar{x}, \delta) \cap S_\varphi(r)$. This property holds, for instance, for any $(\delta, r)$ and for all $p \in [2,+\infty[$ in the setting of [20, Proposition 3.4].
In the following theorem, we show how finite uniform conditioning guarantees global rates of convergence for thresholding gradient methods: linear rates if $p = 2$, and sublinear rates for $p > 2$. Note that these sublinear rates are better than the $O(n^{-1})$ rate guaranteed in the worst case.
Theorem 4.5 (Convergence rates for thresholding gradient methods). Let $(x^n)_{n\in\mathbb{N}}$ be generated by the Forward-Backward algorithm (FB), and let $\bar{x} \in \operatorname{argmin} f$ be its (weak) limit. Then the following hold.
(i) If $f$ satisfies the finite uniform conditioning property of order 2, then there exist $\varepsilon \in \,]0,1[$ and $C \in \,]0,+\infty[$, depending on $(\lambda, f, x^0)$, such that for all $n \in \mathbb{N}$, $\|x^n - \bar{x}\| \leq C(1-\varepsilon)^n$.
(ii) If $f$ satisfies the finite uniform conditioning property of order $p > 2$, then there exist $(C_1, C_2) \in \,]0,+\infty[^2$, depending on $(\lambda, f, x^0)$, such that for all $n \in \mathbb{N}^*$, $f(x^n) - \inf f \leq C_1 n^{-p/(p-2)}$ and $\|x^n - \bar{x}\| \leq C_2 n^{-1/(p-2)}$.
Proof. According to Corollary 4.1, there exists a finite set $J \subset N$ such that for all $n \geq 1$, $x^n \in X_J$, and $x^n$ converges strongly to $\bar{x} \in \operatorname{argmin} f$. Also, the decreasing and Fejér properties of the Forward-Backward algorithm (see e.g. [20, Theorem 2.2]) tell us that for all $n \in \mathbb{N}$, $x^n \in B_X(\bar{x}, \delta) \cap S_f(r)$, by taking $\delta = \|x^0 - \bar{x}\|$ and $r = f(x^0) - \inf f$. Therefore, thanks to the finite uniform conditioning assumption, we can apply [20, Theorem 4.2] to the sequence $(x^{n+1})_{n\in\mathbb{N}} \subset \Omega = X_J \cap B_X(\bar{x}, \delta) \cap S_f(r)$ and conclude.

4.2. $\ell^1$ regularized least squares. Let $A : X \to Y$ be a bounded linear operator from $X$ to a separable Hilbert space $Y$, and let $y \in Y$. In this section, we discuss the particular case in which $h(x) = \tfrac{1}{2}\|Ax - y\|_Y^2$ and $\psi \equiv 0$. The function in (P) then becomes $f(x) = \|x\|_{1,I} + \tfrac{1}{2}\|Ax - y\|_Y^2$, and the Forward-Backward algorithm specializes to the iterative soft-thresholding algorithm (ISTA). In this special case, linear convergence rates have been studied under additional assumptions on the operator $A$. A common one is injectivity of $A$ or, more generally, the so-called Finite Basis Injectivity (FBI) property [8]. The FBI property requires $A$ to be injective once restricted to $X_J$, for any finite $J \subset N$. It is clear that the FBI property implies that $h$ is a strongly convex function once restricted to each $X_J$, meaning that the finite uniform conditioning of order 2 holds. So the linear rates obtained in [8, Theorem 1] under the FBI assumption can be directly derived from Theorem 4.5. However, as can be seen in Theorem 4.5, strong convexity is not necessary to obtain linear rates: the finite uniform 2-conditioning is a sufficient condition (and it is actually necessary, see [20, Proposition 4.18]). By using Li's theorem on convex piecewise polynomials [25, Corollary 3.6], we show in Proposition 4.7 below that $f$ satisfies a finite uniform conditioning of order 2 on finitely supported subsets, without making any assumption on the problem. First, we need a technical lemma which establishes the link between the conditioning of a function on a finitely supported space and the conditioning of its restriction to this space.
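To illustrate that injectivity (and even the FBI property) is not needed, here is a small experiment of our own construction: $A$ has two identical columns, so FBI fails and the minimizer is not unique, yet ISTA still drives the fixed-point residual $\|x^{n+1} - x^n\|$ to zero monotonically, in line with the linear rates discussed in this section.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((10, 7))
A = np.hstack([B, B[:, :1]])       # last column duplicates the first: A not injective
y = A @ np.concatenate([[3.0], np.zeros(7)])   # data along the duplicated direction

step = 1.0 / np.linalg.norm(A, 2) ** 2         # step = 1/L with L = ||A||^2
x = np.zeros(8)
residuals = []
for _ in range(20000):
    z = x - step * (A.T @ (A @ x - y))
    x_new = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)   # lam = 1
    residuals.append(np.linalg.norm(x_new - x))
    x = x_new
```

Since the Forward-Backward operator is averaged, the residuals are nonincreasing; on this instance they also vanish to numerical precision despite the non-uniqueness of the minimizer.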
Proposition 4.7 (Conditioning of $\ell^1$ regularized least squares). Let $(Y, \|\cdot\|_Y)$ be a separable Hilbert space, let $y \in Y$, and let $A : X \to Y$ be a bounded linear operator. In assumption (H), suppose moreover that every $I_k$ is bounded. Then $f : X \to \mathbb{R}$, $f(x) = \|x\|_{1,I} + \tfrac{1}{2}\|Ax - y\|_Y^2$, satisfies the finite uniform conditioning property of order 2.
Proof. Let $J \subset N$, $J = \{k_1, \ldots, k_m\}$ with $k_1 < \ldots < k_m$, and suppose that $\operatorname{argmin} f \cap X_J \neq \emptyset$. Define, using the same notation as in Lemma 4.6, $\Xi : \mathbb{R}^m \to X_J$, $\Xi u = \sum_{i=1}^m u_i e_{k_i}$, and $f_J = f \circ \Xi$. Define $A_J = A\Xi : \mathbb{R}^m \to Y$, and let $S_J = (A_J^* A_J)^{1/2}$, which verifies $R(S_J^*) = R(A_J^*)$. Thus, there exists $y_J \in \mathbb{R}^m$ such that $A_J^* y = S_J^* y_J$, so that we can rewrite
$f_J(u) = \sum_{i=1}^m \sigma_{I_{k_i}}(u_i) + \tfrac{1}{2}\|S_J u - y_J\|^2 + \tfrac{1}{2}\big(\|y\|_Y^2 - \|y_J\|^2\big).$
Since the intervals $I_k$ are bounded, their support functions are finite valued and piecewise linear, so $f_J$ is a piecewise polynomial of degree two on $\mathbb{R}^m$. We then apply [25, Corollary 3.6] to derive that $f_J$ is 2-conditioned on $S_{f_J}(r)$, for any $r \in \,]0,+\infty[$. We conclude by using Lemma 4.6.
Combining Theorem 4.5 and Proposition 4.7, we can now state our main result concerning the linear rates of ISTA.
Theorem 4.8 (Linear convergence for the iterative soft-thresholding). Under the assumptions of Proposition 4.7, let $(x^n)_{n\in\mathbb{N}}$ be the sequence generated by the Forward-Backward algorithm applied to $f$. Then $(x^n)_{n\in\mathbb{N}}$ converges strongly to some $\bar{x} \in \operatorname{argmin} f$, and there exist two constants $\varepsilon \in \,]0,1[$ and $C \in \,]0,+\infty[$, depending on $(\lambda, L, x^0, I, A, y)$, such that for all $n \in \mathbb{N}$, $\|x^n - \bar{x}\| \leq C(1-\varepsilon)^n$.
4.3. $\ell^p$ regularized least squares. We now consider the modification of (1) obtained by adding an $\ell^p$ regularization term, with $p \in \,]1,+\infty[$. The case $p = 2$ is also known as elastic net regularization, and has been proposed in [34]. The elastic-net penalty has been studied by the statistical machine learning community as an alternative to the $\ell^1$ regularization in variable selection problems in which there are highly correlated features and all the relevant ones have to be identified [13]. See also [10] for the case $p < 2$. Note that the proximal operator of such a $g$ can be computed explicitly when $p \in \{4/3, 3/2, 2, 3, 4\}$ (see [9]).
Proposition A.2. The following hold: (i) for every $x \in X$, $\rho(x) \in \,]0,+\infty[$; (ii) $\operatorname{int}\big(\prod_{k\in N} I_k\big) = \prod_{k\in N} \operatorname{int} I_k$.
Proof. (i): let $J = \{k \in N \mid x_k \in \operatorname{int} I_k\}$, and split it as $J = J_F \cup J_\infty$, where $J_\infty = \{k \in J : |x_k| \leq \omega/2\}$ and $J_F = J \setminus J_\infty$ is finite (since the components of $x$ tend to zero). Since $J_F$ is finite, we deduce that $\inf_{k\in J_F} \operatorname{dist}(x_k, \operatorname{bd} I_k) > 0$. On the other hand, for any $k \in J_\infty$, we have $|x_k| \leq \omega/2$, while $[-\omega, \omega] \subset I_k$; therefore $\operatorname{dist}(x_k, \operatorname{bd} I_k) \geq \omega/2$, and
$\rho(x) = \inf_{k\in J} \operatorname{dist}(x_k, \operatorname{bd} I_k) > 0.$
(ii): let $x \in \operatorname{int} \prod_{k\in N} I_k$. We show that $x_k \in \operatorname{int} I_k$ for all $k \in N$. By assumption, there exists $\delta \in \,]0,+\infty[$ such that $B_X(x,\delta) \subset \prod_{k\in N} I_k$. Let $k \in N$, and let us show that $[x_k - \delta, x_k + \delta] \subset I_k$. Let $y_k \in [x_k - \delta, x_k + \delta]$, and define $\tilde{x} \in X$ such that $\tilde{x}_k = y_k$ and $\tilde{x}_i = x_i$ for every $i \neq k$. Then $\|x - \tilde{x}\| = |x_k - y_k| \leq \delta$, whence $\tilde{x} \in B_X(x,\delta) \subset \prod_{k\in N} I_k$. This implies that $y_k \in I_k$, which proves that $x_k \in \operatorname{int} I_k$. Now let $x \in \prod_{k\in N} \operatorname{int} I_k$, and let us show that $x \in \operatorname{int}\big(\prod_{k\in N} I_k\big)$. By (i), $\rho(x) > 0$, and for every $k \in N$, $x_k \in \operatorname{int} I_k$ by assumption. Let $\eta \in \,]0, \rho(x)[$. Since $\operatorname{dist}(x_k, \operatorname{bd} I_k) \geq \rho(x)$, we derive $[x_k - \eta, x_k + \eta] \subset I_k$. Therefore $B_X(x,\eta) \subset \prod_{k\in N} [x_k - \eta, x_k + \eta] \subset \prod_{k\in N} I_k$, which yields $x \in \operatorname{int} \prod_{k\in N} I_k$.
Lemma A.4. Let $\psi \in \Gamma_0(X)$ be differentiable at $0 \in \operatorname{argmin}\psi$, and let $x \in X$. Then $\operatorname{prox}_\psi(x) = 0$ if and only if $x = 0$.
Proof. $\operatorname{prox}_\psi(x) = 0 \Leftrightarrow (\operatorname{Id} + \partial\psi)^{-1}(x) = 0 \Leftrightarrow x \in 0 + \partial\psi(0) \Leftrightarrow x = \nabla\psi(0) \Leftrightarrow x = 0.$
(iv) For all $x \in X$, $\operatorname{prox}_g(x) = \sum_{k\in N} \operatorname{prox}_{g_k}(x_k)\, e_k$.