PURELY TIME-DEPENDENT OPTIMAL CONTROL OF QUASILINEAR PARABOLIC PDES WITH SPARSITY ENFORCING PENALIZATION

Abstract. We prove first- and second-order optimality conditions for sparse, purely time-dependent optimal control problems governed by a quasilinear parabolic PDE. In particular, we analyze the sparsity patterns of the optimal controls induced by different sparsity enforcing functionals in the purely time-dependent control case and illustrate them by numerical examples. Our findings are based on results obtained by abstraction of well-known techniques from the literature.


Introduction
This paper is devoted to sparse optimal control of quasilinear parabolic partial differential equations (PDEs). We focus on the purely time-dependent control setting, i.e. controls depending on time only, but not on space; cf. [27]. We obtain results that we expect from known results for the linear and semilinear case; see, e.g., [14,32]. More precisely, we derive first-order necessary optimality conditions and associated sparsity patterns as well as second-order necessary and sufficient optimality conditions for problems of type (P_k). Here, α > 0 denotes the L^2(I, R^m)-Tikhonov regularization parameter, and β > 0 weighs the sparsity enforcing penalization/cost term j_k : L^2(I, R^m) → R. For k ∈ {1, . . . , 7}, the latter is given by one of seven functionals that adapt the classical (directional) sparsity enforcing penalizers [32] to the purely time-dependent setting. For the precise setting, including boundary conditions of the state equation (Eq), conditions on the coefficient functions, the fixed actuators b_i, and the control operator B, as well as the definition of the set of admissible controls U_ad, we refer the reader to Section 2.

PDE-constrained optimization has been subject to intensive research for several decades, cf. [35,43,59] for instance, and has many applications. In some of them it may be desirable to determine controls that act only on a small ("sparse") part of the (space-time) domain under consideration. Starting with the pioneering work of Stadler [57] on sparse optimal control of linear elliptic equations, there have been many contributions on this topic in the recent past. For an overview we refer the reader, e.g., to [8]; in the following we focus on literature related to the present paper. Regarding literature following the original idea of Stadler to enforce sparsity by adding an L^1-penalization term to the objective functional we mention, e.g., [13,17,20,22,23,56,60].
These publications refer to different types of PDEs and cover several aspects, including first- and second-order optimality conditions, discretization error estimates, and additional state constraints. When considering parabolic PDEs, it might be favorable to obtain a space-time sparsity profile of the optimal control in which the space and time variables are treated differently. This leads to the concept of directional sparsity introduced in [32] for linear PDEs. First- and second-order optimality conditions for this concept applied to semilinear parabolic PDEs have been obtained in [14,18], for instance. The specific difficulty herein arises from the fact that sparsity enforcing penalizers are convex, but nonsmooth, whereas the remaining part of the objective functional is smooth but, due to the nonlinearity of the state equation, nonconvex. Moreover, the so-called two-norm discrepancy [21,39] occurs: differentiability and coercivity of the second derivative of the smooth part of the objective functional hold only w.r.t. different norms. As alternative approaches to enforce sparsity we finally mention, e.g., control in measure spaces [11,15,16,42], or L^0-penalization [24,40].
To our knowledge, sparse optimal control of quasilinear PDEs has not been addressed in the literature so far. Regarding optimal control of quasilinear PDEs in general, we restrict the introduction to the parabolic setting and recent results, and refer to the introductions of [5,9] for earlier literature on quasilinear parabolic and elliptic optimal control problems. Well-posedness of the state equation and existence of optimal controls for an abstract functional that is convex, continuous, and coercive w.r.t. the control variable have been proven in [49] under rather general regularity assumptions on domain and coefficients. First- and second-order optimality conditions have been derived in [5] for control-constrained problems with the usual L^2-Tikhonov functional. Based on this, convergence of the SQP method for the respective optimization problem has been shown in [36]. Additional state constraints are considered in [37]. Optimality conditions for a similar model problem with slightly more regular coefficients and domain, but in contrast unbounded nonlinearities, have been analyzed in [9]. Optimal control of the so-called thermistor problem, a coupled system consisting of a quasilinear parabolic and a nonlinear elliptic equation, is addressed in [47,48].
Existence and regularity theory for solutions of the underlying PDEs poses the main difficulty in the analysis of such problems, in particular in the second-order analysis. The aforementioned papers have in common that they utilize the functional analytic concept of nonautonomous maximal parabolic regularity [3] to deal with this issue. Finally, we mention that a priori finite element error estimates and a posteriori Reduced Basis error estimates for the state equations from [5,9] have been obtained in [10,38], respectively.
The present paper contributes both to the field of optimal control of quasilinear PDEs and to that of sparse optimal control. We extend the first- and second-order analysis for sparse optimal control of semilinear parabolic PDEs from [14,18] to problems with a quasilinear parabolic state equation. Here, we focus on the purely time-dependent control setting, which may be advantageous in applications. Among the examples given in the introduction of [27] we mention, e.g., optimal cooling of steel profiles by controlling the intensities of a finite number of nozzles that spray water on the profile. With the techniques of this paper it would also be possible to treat directionally sparse, distributed optimal control problems along the lines of [13,14,18,32]; cf. also Example 2.6 of [5]. However, while space-time sparsity patterns for distributed optimal control problems have already been considered in detail in [14], the purely time-dependent control setting has not been addressed systematically in the context of directional sparsity. Moreover, the chosen setup allows us to include control of quasilinear parabolic PDEs by fixed Neumann boundary sources up to dimension 3, whereas distributed Neumann boundary control is only possible up to dimension 2; cf. Example 2.4 of [5]. With respect to the state equation we rely on the low regularity assumptions of [5,49] that include certain discontinuous coefficients, nonsmooth domains, and mixed boundary conditions.
With our work, we combine two challenging aspects, namely sparsity enforcing penalization and a quasilinear state equation. First, in the presence of L^2-Tikhonov regularization we are able to carry out a full first- and second-order analysis, the latter avoiding the two-norm gap. To do so, we pursue an abstract approach in the flavour of [21] and work out the abstract core of existing arguments for second-order conditions from [13,14,18], which may also facilitate the application to other problems. Due to the different nature of our nonlinearity, the second-order analysis in the bang-bang case α = 0, i.e. the case without L^2-Tikhonov regularization, from [14,17] cannot be carried over to our setting; cf. Section 4.4. This illustrates that the transfer of techniques successfully applied to semilinear problems to quasilinear problems is by no means trivial. Second, we provide an extensive analysis of directional sparsity for purely time-dependent controls, which has, to the best of our knowledge, not been carried out so far, and include a numerical illustration of our findings. This is of particular interest due to the practical relevance of purely time-dependent controls. Besides the functionals j_1-j_5, whose structure corresponds to those already discussed in [14], we also propose and analyze the functionals j_6 and j_7, which have, to the best of our knowledge, not been dealt with in the context of PDE-constrained optimization so far. These two functionals are interesting because their sparsity patterns are similar to those of j_4 and j_5, respectively, while they are advantageous compared to j_4 and j_5 from a numerical point of view, because their proximity operator is computable.
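To make the computability remark concrete: for penalizers of the classical types underlying j_1-j_7, prox-based methods hinge on closed-form proximity operators. The following is a minimal sketch, assuming a discretized control stored as an array with rows indexed by actuators and columns by time points; the function names and discretization are our own illustration, not the paper's notation:

```python
import numpy as np

def prox_l1(u, tau):
    """Prox of tau * ||.||_1 (plain L^1-type sparsity): componentwise soft thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def prox_l1_of_l2(u, tau):
    """Prox of tau * sum_i ||u_i||_2 over the rows u_i (directional, j_2-type sparsity):
    group soft thresholding; rows whose l2-norm is below tau are zeroed entirely."""
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    return np.maximum(1.0 - tau / np.maximum(norms, 1e-16), 0.0) * u
```

Both operators cost O(mn) per call, which is what makes proximal methods cheap here; a penalizer whose prox admits no closed form must instead be handled, e.g., by subgradient descent.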
In the next section we state our assumptions and summarize our main results. Moreover, we provide an overview of the remaining part of the paper, in which we prove our results and give a brief numerical illustration. Let us already point out some key features of our arguments for the convenience of the reader. To avoid redundancy in two important steps of our arguments, we introduce abstract settings that allow us to handle seemingly different situations with a common argument.
The first abstraction takes place in Section 3.1: we prove first- and second-order optimality conditions for an abstract optimization problem in Banach spaces whose functional is given by the sum of a smooth and a nonsmooth functional, both satisfying certain assumptions. In the rest of the paper, we show that these abstract results apply to the concrete instances of (P_k), k = 1, . . . , 7. The main work is to check that both the smooth and the nonsmooth part of the (reduced) functional of (P_k) satisfy the required assumptions. For the smooth part, given by the first two summands of J in (P_k), this has already been shown in [5], and we recall the respective results in Section 4.1. The assumptions for the nonsmooth part, i.e. for u ↦ βj_k(u), are verified in Section 3.2. To avoid checking the conditions for all seven cases of j_k separately, we apply the second step of abstraction: it turns out that the seven functionals can be reduced to four generic cases for which we then prove the required results utilizing techniques known from the literature. Hereby, it is important to note that the conditions on the smooth and the nonsmooth part of the functional can be checked independently of each other. Consequently, the presence of a quasilinear parabolic state equation does not cause additional problems when applying results on the nonsmooth part of the functional that have been derived earlier in the context of optimal control of different underlying equations. In particular, our results on the abstract level may also serve as a kind of template for analogous concrete first- and second-order results on other problems, e.g., with a different underlying state equation.

Notation and assumptions
We introduce some notation and conventions, and state our assumptions on the control problem (P_k). Given Banach spaces X, Y we denote by L(X, Y) the space of bounded linear operators from X to Y equipped with the operator norm, by X^* := L(X, R) the topological dual of X, and X ↪ Y indicates that X continuously embeds into Y. The domain of a closed linear operator A : X → Y, equipped with the graph norm, is denoted by Dom_X(A). By B_r^X(x) ⊂ X we denote the closed ball of radius r around x ∈ X; for X = R^d, B_r(x) denotes the open Euclidean ball of radius r > 0 around x. We use the notation |·|_2 and |·|_1 for the ℓ^2- (i.e. the Euclidean) and the ℓ^1-norm on R^d. For a convex function f : X → R on a Banach space X we denote by f′(x; v) its directional derivative at x ∈ X in direction v ∈ X and by ∂f(x) ⊂ X^* its subgradient at x; see, e.g., Chapter I.5 of [29] for the definitions. In this sense, we denote by ‖·‖_{L^1}′(u; v) or |·|_1′(x; y) the directional derivatives of the L^1- or the ℓ^1-norm at u in direction v or at x in direction y, respectively. With R_K(ū) and T_K(ū) we refer to the radial and tangent cone of a convex set K ⊂ X at ū; see, e.g., Definition 2.54 of [6] for the definition. Finally, we use the set-valued sign function sign(z) = {±1} for z ≷ 0 and sign(0) = [−1, 1]. We apply standard notation for Bochner–Lebesgue and Bochner–Sobolev spaces, as well as for real and complex interpolation spaces. The following assumptions on (P_k) are close to [5], but we forego those parts that refer to the improved regularity analysis from [5] on Bessel-potential spaces and stick to the setting of [49]. Moreover, as already explained in the introduction, we restrict ourselves to the purely time-dependent control setting; cf. Example 2.5 of [5].
Assumption 2.1. Let Ω ⊂ R^d be a bounded domain with boundary ∂Ω. Γ_N ⊂ ∂Ω is relatively open and denotes the Neumann boundary part, whereas Γ_D = ∂Ω \ Γ_N denotes the Dirichlet boundary part. We assume that Ω ∪ Γ_N is regular in the sense of Gröger [30] such that every chart map in the definition of regularity in the sense of Gröger can be chosen volume-preserving. The time interval I = (0, T) with T > 0 is fixed.
In the following we denote the space-time cylinder by Q := I × Ω and apply standard notation for Hölder, Lebesgue, and Sobolev spaces on Ω. The conjugate exponent of an integrability exponent p is denoted by p′, and likewise for all other appearing integrability exponents. Since Ω stays fixed, we omit it when referring to function spaces on Ω. Moreover, by the subscript D we indicate that the respective function space carries homogeneous Dirichlet boundary conditions on Γ_D.
Assumption 2.2. We assume that there is p ∈ (d, 4) such that −∇ · µ∇ + 1 : W^{1,p}_D → W^{-1,p}_D is a topological isomorphism. This choice of p is fixed from now on. Assumptions 2.1 and 2.2 certainly impose nontrivial conditions on the geometry of the domain, the elliptic operator −∇ · µ∇ + 1, and the boundary conditions. We refer the reader to Remarks 2.1 and 2.3 of [5] or Example 2.3 of [37] for a discussion and examples.

Assumption 2.3. We fix regularization and cost parameters α, β > 0 and some s > 2 satisfying a further condition. Moreover, we choose an initial condition y_0 ∈ (W^{-1,p}_D, W^{1,p}_D)_{1−1/s,s} and the desired state y_d ∈ L^∞(I, L^p). The control operator B is defined via the fixed actuators b_i.

Note that Assumptions 2.1-2.3 are identical to Assumptions 1-3 of [5], i.e. the suppositions w.r.t. domain, coefficients, and boundary conditions remain unchanged. We only modify the assumptions w.r.t. the initial condition and the regularity of the right-hand side of (Eq), since Assumption 4 in [5] is related to the improved regularity analysis on Bessel-potential spaces. As pointed out in Section 3 of [5], this analysis is not required for the first- and second-order analysis of Sections 3.1 and 4.1-4.3 of [5], except for Proposition 4.7 of [5], a result concerning improved regularity of the adjoint state. We only rely on those results that are obtained completely within the W^{-1,p}_D-W^{1,p}_D-setting, cf. also Theorem 5.3 of [49], and do not include the improved regularity assumptions of [5].

Main results
Let us start by stating the main results of our analysis of (P_k). The corresponding proofs rely on auxiliary material from Section 3 and are postponed to Section 4. Note that solutions to the state equation (Eq) have to be understood in the sense of (4.2); for details we refer the reader to Section 4.1.
Of course, we have to rely on well-posedness of (P_k). Due to convexity and continuity of j_k and boundedness of U_ad, existence of an optimal control for (P_k) is guaranteed by Proposition 6.4 of [49]. Regarding first-order necessary optimality conditions for an L^2(I, R^m)-local solution to (P_k) we will obtain the following result, which also characterizes the different (directional) sparsity patterns resulting from the different functionals j_k, k = 1, . . . , 7.
Theorem 2.4 (First-order necessary optimality conditions). Let ū be a local solution to (P_k) w.r.t. the L^2(I, R^m)-topology. Then there exist a unique, so-called adjoint state p̄ ∈ W^{1,r}(I, L^p) ∩ L^r(I, Dom_{L^p}(−∇ · µ∇)), solving the adjoint equation on Ω, and a unique λ̄ ∈ ∂j_k(ū) (see formulas (4.7)-(4.13)), such that the variational inequality (2.1) is satisfied. In particular, in the respective cases k = 1, . . . , 7 the optimal control ū exhibits the following sparsity patterns, which are described in more detail below.

In an application one usually has to determine a suitable β experimentally in order to ensure that the support of the corresponding solution of (P_k) has roughly the desired size. Heuristically, choosing a larger β decreases the support of the associated ū, and for sufficiently large β it may hold that ū ≡ 0. However, since all the quantities ū, ȳ, p̄, λ̄, γ in the above theorem are coupled, the actual size of the support of ū, i.e. the concrete amount of sparsity of ū, depending on the size of β is difficult to predict a priori. Nevertheless, on a qualitative instead of a quantitative level we can describe the different sparsity patterns as follows:

j_1 - "Sparsity": This approach ensures both sparsity of the number of actuators b_i and of the time intervals at which they are active. However, there is no further structure in this sparsity.

j_2 - "Sparse time-global selection of actuators": This approach selects a subset of the actuators that are allowed to be active. All other actuators are not used. The activity intervals of those actuators used are not sparse, in general.

j_3 - "Sparsity in time of any control action": Any actuator, and then possibly all actuators, can become active only on a subset of I that is sparse.

j_4 - "Sparse activity time for each actuator": An actuator b_i becomes active at some time point t only if its contribution at time point t is above a threshold depending on i.
Therefore, the time of activity of each actuator b_i is sparse, with a sparsity pattern depending on i.

j_5 - "Sparse selection of actuators at each time": At each time point t, an actuator b_i can be active only if its contribution is above a threshold depending on t. Therefore, at each time point t a certain sparse subset of actuators is selected to become active at t.

j_6 and j_7: These functionals result in similar sparsity patterns as j_4 and j_5, but with different thresholds that weight the components differently. More precisely, in the cases k = 4 and k = 6 an actuator b_i becomes active at time point t if and only if |(B^* p̄)_i(t)| > βγ_i holds. The thresholds γ_i, however, are different; it holds γ_i = ‖ū_i‖_{L^1(I)} / |(‖ū_j‖_{L^1(I)})_j|_2 for k = 4, and γ_i = ‖ū_i‖_{L^1(I)} for k = 6.
Similarly, in the cases k = 5 and k = 7 an actuator b_i becomes active at time point t if and only if |(B^* p̄)_i(t)| > βγ(t) holds, with γ(t) = |ū(t)|_1 / ‖ |ū(·)|_1 ‖_{L^2(I)} for k = 5, and γ(t) = |ū(t)|_1 for k = 7.
Note that γ_i in the case k = 4 may be interpreted as the fraction of the cost term |(‖ū_i‖_{L^1(I)})_i|_2 that is due to the i-th actuator. Similarly, in the case k = 5 we can imagine γ(t) as the fraction of the overall cost ‖ |ū(·)|_1 ‖_{L^2(I)} that is consumed at time point t. In this sense, the thresholds for k = 4, 5 are more intuitive than those for k = 6, 7.
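The threshold characterizations above translate directly into a numerical active-set check. A small sketch with fabricated sample values (the arrays Bp and gamma are placeholders, not data from the paper), assuming |(B^* p̄)_i(t)| has been sampled on a time grid:

```python
import numpy as np

# Placeholder samples of |(B* p̄)_i(t)|: rows = actuators i, columns = time points t.
Bp = np.array([[0.2, 1.5, 0.1],
               [0.9, 0.4, 1.1]])
beta = 1.0
gamma = np.array([0.5, 1.0])  # per-actuator thresholds gamma_i (cases k = 4, 6)

# Actuator i is active at time t if and only if |(B* p̄)_i(t)| > beta * gamma_i.
active = Bp > beta * gamma[:, None]
# For k = 5, 7 one would instead broadcast a time-dependent threshold gamma(t)
# along the columns: Bp > beta * gamma_t[None, :].
```

In the coupled optimality system, of course, Bp and gamma are not given data but depend on ū themselves; the check above only illustrates the shape of the resulting sparsity pattern.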
Each of these possibilities may be of interest in certain applications. We point out that the functionals j_6 and j_7 have an advantage compared to j_4 and j_5 from the perspective of fast numerical implementation, while the latter are superior in terms of interpretability. For details on that and a numerical illustration of the sparsity patterns we refer the reader to Section 5. Analogous sparsity patterns are also obtained for α = 0; cf. the results of Section 3.2. Up to this point of our analysis we could indeed allow both α > 0 and α = 0, to which we will refer from now on as the regular and the bang-bang case, respectively. For the formulation of the following second-order conditions, however, we have to restrict the analysis to the regular case, as will be explained in Section 4.4. For the definition of the reduced functional Ĵ we also refer to Section 4. Our second main result is the following.

Theorem 2.5 (Second-order optimality conditions). Let ū be a local solution to (P_k) w.r.t. the L^2(I, R^m)-topology satisfying the quadratic growth condition (2.2) with some c ≥ 0 and ε > 0. Then the second derivative of Ĵ at ū is nonnegative on the critical cone C_ū, with f′, f″, j_k′, and j_k″ given by (4.3), (4.4), (4.14)-(4.20), and (4.21)-(4.27), respectively. Conversely, let ū ∈ U_ad satisfy the first-order necessary optimality condition (2.1) and the positivity condition (2.4). Then there are ε, c > 0 such that the quadratic growth condition (2.2) holds true. In particular, ū is a local solution to (P_k) w.r.t. the L^2(I, R^m)-topology.
The necessary and sufficient optimality conditions in the previous theorem have minimal gap: positivity of the second derivative on the critical cone C_ū is sufficient for local optimality, while nonnegativity on the same cone is necessary. As also observed in [14], the positivity condition (2.4) and the coercivity condition (2.3) are equivalent for the particular objective function. For no-gap second-order conditions for bang-bang problems with L^1-penalization we refer the reader to [60], and for different, non-sparse settings, e.g., to [21,25,26].

Overview of the remainder of the paper
In order to reduce redundancy, we formulate as many of the results and arguments as possible on an abstract, general level, from which the concrete results can be obtained afterwards. In Section 3.1 we discuss optimality conditions for an abstract problem in the flavour of [21] by abstracting the main ideas from [13,14,18]. A more concrete instance hereof is dealt with in Section 3.2. There, we show that the previous results apply to a certain class of optimization problems on Lebesgue spaces with four different sparsity enforcing penalizations, and analyze the resulting sparsity patterns of their solutions. Here, we rely again on [13,14,18]. In Section 4 we finally prove our main results by applying the framework of Section 3.2 to (P_k). A preview of how the concrete quasilinear parabolic setting from Theorems 2.4 and 2.5 fits into the abstract frameworks of Section 3 is already given at the beginning of Section 3.2. At the end of the paper, numerical experiments utilizing proximal methods and subgradient descent are performed in Section 5.
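The structure Ĵ = f + g with smooth f and prox-friendly convex g is exactly what the proximal gradient method exploits. The following sketch uses a quadratic stand-in for the smooth reduced cost and box constraints as U_ad; the model, names, and parameters are our own illustration under these assumptions, not the paper's discretization:

```python
import numpy as np

def prox_l1_box(v, tau, lo, hi):
    # The 1-D objective z -> 0.5*(z - v)^2 + tau*|z| is convex, so its minimizer
    # over [lo, hi] is the clipped soft threshold (prox of tau*|.|_1 + box indicator,
    # valid since 0 lies in [lo, hi]).
    return np.clip(np.sign(v) * np.maximum(np.abs(v) - tau, 0.0), lo, hi)

def prox_gradient(grad_f, prox_g, u0, step, n_iter=500):
    """Plain proximal gradient iteration: u <- prox_g(u - step * grad_f(u))."""
    u = u0.copy()
    for _ in range(n_iter):
        u = prox_g(u - step * grad_f(u))
    return u

# Quadratic stand-in for the smooth part: f(u) = 0.5*||A u - b||^2 + (alpha/2)*||u||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
alpha, beta = 1e-2, 2.0
grad_f = lambda u: A.T @ (A @ u - b) + alpha * u
L = np.linalg.norm(A, 2) ** 2 + alpha          # Lipschitz constant of grad_f
u_star = prox_gradient(grad_f, lambda v: prox_l1_box(v, beta / L, -1.0, 1.0),
                       np.zeros(10), 1.0 / L)
```

With step size 1/L the objective value is nonincreasing along the iteration; a subgradient method, which is the fallback when the prox of g is unavailable, typically converges much more slowly.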

First- and second-order optimality conditions on an abstract level
This section prepares the proofs of our main theorems in Section 4. As a first step, we analyze in Section 3.1 first- and second-order optimality conditions for a purely abstract optimization problem whose functional is given by the sum of a smooth, but nonconvex, and a convex, but nonsmooth term. More precisely, we extend the abstract framework for smooth functionals from [21], where, e.g., semilinear parabolic problems without sparsity have been considered, toward the inclusion of nonsmooth, but convex summands that satisfy certain properties typical for sparsity promoting cost terms such as, e.g., j_1-j_7. The results are obtained utilizing the techniques of [13,14,18], and may therefore also be viewed as a summary of these earlier results on an abstract level. Thereafter, in Section 3.2, we make the problem under consideration a bit more concrete and deal with optimization problems on Lebesgue spaces with directional sparsity enforcing penalization terms. Following [13,14,18] we verify that these problems fit into the framework of Section 3.1, and analyze the corresponding sparsity patterns of the solutions. Hereby, one has to note that the properties of the nonsmooth cost term are independent of the state equation of the respective optimal control problem. This allows us to transfer results on sparsity enforcing penalization terms from literature concerned with problems with a different state equation.

Optimality conditions for an abstract nonsmooth and nonconvex problem
Let us define the following optimization problem:

min_{u ∈ K} Ĵ(u) := f(u) + g(u), (P^1)

where K is a closed convex set in a Banach space U_∞, and f and g are real-valued functionals such that f is smooth, but possibly nonconvex, and g is convex and Lipschitz, but not necessarily smooth. Since we have in mind the concrete situation of Ĵ being the reduced functional of a sparse PDE-constrained optimal control problem, we include a two-norm discrepancy [39]: coercivity of second derivatives can only be expected w.r.t. a weaker norm in a Hilbert space U_2 ⊃ U_∞. In applications we usually expect U_2 to be an L^2-space. The precise setting is described in detail below. In particular, our assumptions on the smooth functional f are identical to those in [21], and cover a broad range of functionals arising from PDE-constrained optimization; see for instance [5,21]. Hence, we generalize the result from [21] for the smooth case Ĵ = f to the inclusion of a nonsmooth summand g in the functional. Our approach differs from the similar abstract approach in [60] in several technical aspects. Our setting includes the presence of two nonequivalent norms, but instead of working with a Banach space and its predual as in [60] we restrict ourselves to formulating optimality conditions w.r.t. the Hilbert space U_2. Moreover, unlike in [60] we do not include the convex constraint "u ∈ K" as an indicator function in the nonsmooth part of the functional. We will show in Section 3.2 that our assumptions on g are typically fulfilled by penalizers promoting directional sparsity, while the applications discussed in [60] are primarily concerned with different, non-uniformly convex integral functionals. The proofs and assumptions of this section are inspired by [21] and well-known techniques employed in particular in [13,14,18].
Assumption 3.1. Let U_2 be a Hilbert space and U_∞ a Banach space such that U_∞ ↪ U_2. With ‖·‖_∞, ‖·‖_2, and ⟨·,·⟩_2 we denote the corresponding norms and the duality pairing on U_2^* × U_2. Let ∅ ≠ K ⊂ U_∞ be convex and A ⊃ K be open in U_∞. We fix ū ∈ K.
1. The functional f : A → R is assumed to be twice continuously Fréchet differentiable w.r.t. ‖·‖_∞ and to fulfill the following properties: 1a. The derivatives of f taken w.r.t. the space U_∞ extend to continuous linear, resp. bilinear, forms on U_2, i.e.
2. The functional g : U_2 → R is assumed to be convex and Lipschitz continuous. We introduce the sets D_ū and C_ū that appear in the assumptions and results below.

We start with a discussion of first-order necessary optimality conditions.

Theorem 3.2. Let Assumption 3.1.1a hold and suppose that ū is a local minimizer of (P^1) w.r.t. the U_2-topology. Then there is λ̄ ∈ ∂g(ū) such that ⟨f′(ū) + λ̄, u − ū⟩_2 ≥ 0 for all u ∈ K.

Proof. The proof works completely analogously to, for instance, the proof of Theorem 2.1 in [14]. For the convenience of the reader we provide the main steps.
The map u ↦ f′(ū)u + g(u), U_2 → R, is convex and continuous, and hence the claim follows from standard convex analysis; see, e.g., Proposition I.5.6 of [29].
Before addressing second-order optimality conditions, some comments on D_ū and C_ū seem appropriate. First, let us note the following fact, which can be proven following, e.g., Proposition 3.4 of [13].
Due to the fact that g′(ū; ·) is not a linear form on U_2, we cannot directly apply the concept of polyhedricity; see, e.g., [61] for the definition. This is different from the smooth case in [21], where D_ū = C_ū holds for polyhedric K. In fact, we do not know whether this equality still holds true in our abstract nonsmooth setting. Since in the following sufficient conditions will be formulated on the cone C_ū, and necessary conditions on the possibly smaller cone D_ū, we do not obtain no-gap second-order conditions in the fully abstract setting. However, for sparse optimization problems on Lebesgue spaces equality holds, cf. Section 3.2, because Assumption 3.1.2a can be verified with D_ū replaced by C_ū in these cases. The following necessary second-order optimality condition is the abstract version of the first part of Theorem 2.5.

Theorem 3.4. Let Assumptions 3.1.1 and 3.1.2a hold, and suppose that there are c ≥ 0 and r > 0 such that the quadratic growth condition holds; then the second-order necessary condition on D_ū follows.

Proof. By assumption and Taylor expansion of f at ū with some u_{θ_k} we obtain (3.2). Taking the inferior limits on both sides of (3.2) and utilizing Assumptions 3.1.1bii and 3.1.2a concludes the proof.
Obviously, the above proof remains valid with slightly modified assumptions. If Assumption 3.1.2a is weakened by replacing the limits by inferior limits, we need to strengthen Assumption 3.1.1 by demanding additional continuity of f″ as a map U_∞ → L(U_2 ⊗ U_2, R).
Next, we state and prove sufficient optimality conditions that correspond, on the abstract level, to the second part of Theorem 2.5.

Theorem 3.5. Let Assumption 3.1 hold, and suppose that there is λ̄ ∈ ∂g(ū) such that the first-order necessary optimality condition in Theorem 3.2 is satisfied. If in addition the positivity condition on C_ū holds, there are c, r > 0 such that Ĵ satisfies a quadratic growth condition around ū. In particular, ū is a U_2-local solution of (P^1).
Proof. We argue by contradiction, following, e.g., the well-known approach in [14,18,21]. If the statement of the theorem is not true, there is a sequence (u_k)_k ⊂ K such that u_k → ū in U_2 and (3.3) holds. We set t_k := ‖u_k − ū‖_2 and v_k := t_k^{-1}(u_k − ū), and assume w.l.o.g. that v_k ⇀ v ∈ U_2 weakly. The contradiction is achieved in three steps I.-III.

I. First, we show that v ∈ C_ū. It clearly holds that v ∈ weak-cl_{U_2}(R_K(ū)). Since R_K(ū) is convex due to convexity of K, the weak and strong closures of R_K(ū) coincide, cf. Theorem 2.23ii of [6], from which we deduce v ∈ T_K(ū). From the first-order necessary optimality condition ⟨f′(ū) + λ̄, u_k − ū⟩_2 ≥ 0 together with f′(ū) + λ̄ ∈ U_2^* and weak convergence of (v_k)_k we immediately conclude ⟨f′(ū) + λ̄, v⟩_2 ≥ 0. The subgradient property therefore implies f′(ū)v_k + g′(ū; v_k) ≥ 0 and f′(ū)v + g′(ū; v) ≥ 0. Applying Taylor expansion to f at ū we obtain an estimate from (3.3). Taking the inferior limits on both sides hereof and using Assumption 3.1.1bi for the first summand on the left-hand side we obtain the asserted bound, where we have used that g′(ū; ·) is convex and continuous as in the proof of Theorem 4.2 of [18]. Hence this shows v ∈ C_ū.

II. Next, we prove v = 0. Again, we apply Taylor expansion to f in (3.3) and obtain, with some ũ_{θ_k} = (1 − θ_k)ū + θ_k u_k, θ_k ∈ [0, 1], an estimate in which we have used the first-order necessary condition in the second inequality. Dividing by t_k^2 and taking the inferior limits on both sides, where we apply Assumptions 3.1.1bii and 3.1.2b in the last step, it follows from v ∈ C_ū and the assumption of the theorem that v = 0.

III. In this final step we arrive at the desired contradiction. From Assumption 3.1.1biii and ‖v_k‖_2 = 1 we infer from the above considerations a lower bound γ > 0 for a limit superior. Since the term inside the limit superior is always nonpositive due to convexity of g, we arrive at the desired contradiction 0 < γ ≤ 0.
The crucial observation in the final step of the proof of Theorem 3.5 is an inequality whose left-hand side Assumption 3.1.2biii shows to be positive, while convexity of g implies non-positivity of its right-hand side. Without Assumption 3.1.2biii, we would only have nonnegativity of the left-hand side, which does not suffice to achieve a contradiction, unless the right-hand side could be shown to be negative. However, we think that the latter can only hold for g being strongly convex at ū w.r.t. the U_2-norm. A strongly convex function g, however, is the sum of a convex function and a U_2-Tikhonov term u ↦ (γ/2)‖u‖_2^2. Such a Tikhonov term is smooth and would ensure Assumption 3.1.2b when shifted to f. This is the reason why the application of Theorem 3.5 will be restricted to the regular case in the subsequent sections. As explained in Section 4.4, we expect that a different type of argument is needed for the bang-bang case.
Finally, we mention that a variant of Theorem 3.5 with a norm gap can also be obtained. Let the strong convergence u_k → ū in Assumption 3.1.1b hold only w.r.t. another Banach space Z such that Z ↪ U_∞, e.g., Z = U_∞. Under this weaker supposition, the quadratic growth condition in Theorem 3.5 holds true in a Z-neighbourhood of ū, and, consequently, ū is a Z-local solution to (P^1).

Optimality conditions for directionally sparse optimization on Lebesgue spaces
In this section we incorporate directionally sparse optimization problems on Lebesgue spaces into the framework established before. We replace (P^1) by the following slightly more concrete model problem, which contains (P_k) as an instance: given a complete, σ-finite measure space (Λ, ρ), we consider

min_{u ∈ U_ad} Ĵ(u), (P^2-X), X ∈ {A, B, C, D},

with four typical (directional) sparsity enforcing functionals j_X defined on L^2(Λ), where in the cases B-D, (Λ, ρ) is given by the product measure space of two complete, σ-finite measure spaces (Λ_1, ρ_1) and (Λ_2, ρ_2). Moreover, let α ≥ 0, β > 0, and f(u) = h(u) + (α/2)‖u‖²_{L^2(Λ)} with a smooth functional h. In the following we will analyze first-order optimality conditions for (P^2-X) together with the resulting sparsity patterns of the minimizers. Moreover, we will verify that the functionals j_X, X ∈ {A, B, C, D}, fit into the framework of Section 3.1, which allows us to formulate second-order necessary and sufficient conditions for (P^2-X) provided that h fulfills the appropriate assumptions.
Before we put the content of this section into the context of the existing literature, a few comments motivating the formulation of (P_2-X) are in order. Note that the functionals j_1-j_7 from Section 2.2 are included in this setting by an appropriate choice of Λ_1 and Λ_2; see Table 1 in Section 4.2 below. Let us explain this briefly using, e.g., the functional j_2 as an example. First, one has to observe that the space L²(I, R^m) can be written equivalently as L²(Λ) with Λ = Λ_1 × Λ_2, where (Λ_1, ρ_1) is the m-element set equipped with the counting measure and (Λ_2, ρ_2) is the interval I equipped with the Lebesgue measure. Next, a short computation shows that j_2 can be expressed as u ↦ ‖u‖_{L¹(Λ_1, L²(Λ_2))} in this setting. In essence, one has to observe that measurable functions on Λ_1 can be identified with vectors in R^m and that the L¹(Λ_1)-norm corresponds to the ℓ¹-norm on R^m under this identification; the variable x_1 ∈ Λ_1 corresponds to the index i enumerating the actuators in Assumption 2.3, and the variable x_2 ∈ Λ_2 corresponds to the time t. Thus, j_2 is a particular instance of the abstract functional j_B introduced above.
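Concretely, after discretization a control with m actuators and n_t time steps is an (m, n_t) array, and the identification above turns j_2 into a group-lasso norm over the rows. A minimal numpy sketch; the array shapes, the time step dt, and the way the quadrature weight enters the norms are illustrative assumptions:

```python
import numpy as np

def j2_group_lasso(u, dt):
    """Discrete analogue of j_2(u) = ||u||_{L^1(Lambda_1, L^2(Lambda_2))}:
    the counting measure on Lambda_1 gives a plain sum over actuators (rows),
    the Lebesgue measure on Lambda_2 = I gives the time-step weight dt."""
    return float(np.sum(np.sqrt(dt * np.sum(u**2, axis=1))))

def j3_group_lasso(u, dt):
    """Interchanging Lambda_1 and Lambda_2 yields the discrete analogue of
    j_3 = ||.||_{L^1(I, l^2)}: integrate the euclidean norm of u(t) over time."""
    return float(dt * np.sum(np.sqrt(np.sum(u**2, axis=0))))

# a control with one inactive actuator: shape (m, n_t) = (2, 3)
u = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])
```

Note how j2_group_lasso sees the zero row as an inactive group, which is exactly the directional sparsity the penalizer is meant to enforce.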
In Section 4, i.e. in the proof of our main results for (P_k), the smooth functional h will be given by the first summand of the reduced functional of (P_k). Nevertheless, the particular choice of h does not matter for the arguments of the present section. This is the reason why we can easily adapt proofs from earlier literature concerned with semilinear elliptic or parabolic problems to the present setting. Again, let us briefly illustrate the case k = 2. The formulas (3.6) and (3.7) for the directional derivatives and the subgradient of j_B that we will obtain in this section can easily be translated back to j_2 in formulas (4.15) and (4.8). Similarly, by translating back Proposition 3.7.1 on first-order conditions for (P_2-B) to the setting of j_2, we prove the case k = 2 in Theorem 2.4. Here, it is important to observe that the choice h(u) = ½‖S(u) − y_d‖²_{L²(Q)} in (P_2-B), where S denotes the solution map of the quasilinear equation (Eq), is possible because it has already been shown in [5] that h exhibits the required properties. Let us also sketch how to obtain second-order optimality conditions. To do so, we apply the abstract Theorems 3.4 and 3.5 with U_2 = L²(I, R^m), U_∞ = L^s(I, R^m), K = U_ad, f(u) = h(u) + (α/2)‖u‖²_{L²(I,R^m)} = ½‖S(u) − y_d‖²_{L²(Q)} + (α/2)‖u‖²_{L²(I,R^m)}, and g(u) = β j_2(u). That under Assumptions 2.1-2.3 the smooth part of the functional, f, satisfies the respective assumptions has again already been proven in [5], as we will recall in Section 4.1. The assumptions on the nonsmooth part, g, will be verified in Proposition 3.7.2 of this section for g = β j_B, and hence, by the above reasoning, for g = β j_2. Let us in particular point out that in all arguments the smooth and the nonsmooth parts of the appearing functionals can be handled completely independently of each other as long as both parts exhibit the required assumptions.
This is the reason why earlier results obtained in the context of optimal control of linear or semilinear equations can easily be transferred to the present case.
At first glance, one may wonder whether arguing in such an abstract way is worth the accompanying effort. In fact, the benefits are the following: first, instead of checking assumptions for all seven j_k we only need to check four generic cases in this section. For instance, once the abstract functional j_B has been analyzed, the results immediately apply to the two concrete functionals j_2 and j_3; in order to deal with j_3 one merely has to interchange the choice of Λ_1 and Λ_2 compared to j_2. Further, replacing the m-element set equipped with the counting measure by Ω equipped with the Lebesgue measure, one can easily address the case of distributed directionally sparse optimal control as in, e.g., [14]. Second, arguing in an abstract way reveals that the actual choice of the state equation (entering only via the functional h) does not matter as long as the smooth part f of the functional still exhibits the typical properties. Therefore, the abstract setting is also intended to facilitate the application toward problems with a different state equation.
Let us now put the content of this section into the context of the literature. Except for case D, all the results have already been obtained in [14, 18, 32, 57], dealing with linear and semilinear problems in a more concrete setting. There, Λ is a domain in R^d or a space-time cylinder, equipped with the Lebesgue measure, and h is a smooth tracking-type functional originating from optimal control of a linear or semilinear PDE. For functional j_3 in the context of optimal control of an ordinary differential equation we refer the reader to [55]. As pointed out above, the proofs also apply to our abstract setting because they are actually independent of the concrete structure of h. Nevertheless, for the convenience of the reader we repeat these results in our notation. We also mention that discrete analogues of functionals A and B are well known in machine learning as the "lasso" [58] and the "group lasso" [62].
To the best of our knowledge, case D has not been analyzed in the context of PDE-constrained optimization so far. It can be motivated by the successful use of analogous functionals in the discrete setting, e.g., the so-called "exclusive lasso" [7] in machine learning, or the sparse regression problem in [41]. In particular, j_D results in sparsity patterns similar to j_C, but, as is well known in the discrete case, j_D, unlike j_C, allows the application of proximal algorithms; cf. Section 5.1. Therefore, we may view case D as an alternative to case C that also deserves a theoretical analysis.
In the following we repeatedly make use of the fact that h'(ū) ∈ L²(Λ)* can be identified with its Riesz representative ∇h(ū) ∈ L²(Λ). Above, we introduced the functionals j_X, X ∈ {A, B, C, D}, to be defined on the Hilbert space U_2 = L²(Λ). By j̃_X we refer to their respective straightforward extensions to L¹(Λ) in case A, to L¹(Λ_1, L²(Λ_2)) in case B, and to L²(Λ_1, L¹(Λ_2)) in cases C and D.

Functional A
Both the directional derivatives and the subdifferential of j_A are well known; see for instance [13]. Note that the proofs, which originally pertain to Lebesgue spaces on open sets of R^d, apply to our slightly more general setting without changes. This yields the directional derivatives of j̃_A : L¹(Λ) → R for u, v ∈ L¹(Λ). Here, recall that L¹(Λ)* = L∞(Λ) for any σ-finite measure space (Λ, ρ); cf. Satz 6.16 of [52]. However, note that we will actually be concerned with j_A = j̃_A ∘ ι, where ι: L²(Λ) → L¹(Λ) denotes the canonical embedding. It is obvious that the formulas for the directional derivatives remain true for j_A. Regarding the subdifferential, recall that by the chain rule, see, e.g., Proposition 5.7 of [29], it holds that ∂j_A(u) = ∂(j̃_A ∘ ι)(u) = ι* ∂j̃_A(ιu) = ι* ∂j̃_A(u), which implies that the above characterization of the subdifferential is also valid for j_A, because ι* acts as the embedding L∞(Λ) → L²(Λ).
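For reference, these standard formulas (cf. [13, 57]) read:

```latex
\[
  \tilde j_A'(u;v) = \int_{\{u\neq 0\}} \operatorname{sign}(u)\,v \,\mathrm{d}\rho
    + \int_{\{u=0\}} |v| \,\mathrm{d}\rho,
\]
\[
  \partial \tilde j_A(u) = \bigl\{\lambda \in L^\infty(\Lambda) :\;
    \lambda = \operatorname{sign}(u) \text{ a.e.\ on } \{u\neq 0\},\;
    |\lambda| \le 1 \text{ a.e.\ on } \{u=0\} \bigr\}.
\]
```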
Proposition 3.6. Let ū ∈ U_ad be a local solution to (P_2-A).
Proof. The first-order conditions and the analysis of the sparsity pattern can be found in Corollary 3.2 of [13]. Regarding 2., Assumption 3.1.2a is verified in the proof of Theorem 3.7 of [13], while Assumption 3.1.2b is an immediate consequence of the convexity of j_A. Note that the properties of the nonsmooth cost term in [13] are actually independent of those of the smooth part of the functional, which is the only part of the functional specifically related to the state equation. Hence, the transfer of the techniques from the semilinear elliptic setting in [13] does not cause problems.

Functional B
This functional has been discussed in [14, 18] for the special case that Λ_1 is a domain in R^d and Λ_2 is an interval, both equipped with the Lebesgue measure. We refer the reader to [55] for the particular case j_3 in the context of an optimal control problem with ODE-constraints. The results and their proofs also apply to our setting. Using the notation Λ_1^0(u) = {x_1 ∈ Λ_1 : ‖u(x_1,·)‖_{L²(Λ_2)} = 0}, we obtain the directional derivatives and subgradients of j̃_B : L¹(Λ_1, L²(Λ_2)) → R; cf. Proposition 2.8 of [14]. Here, note that L¹(Λ_1, L²(Λ_2))* = L∞(Λ_1, L²(Λ_2)); cf. Theorem 8.18.3 of [28]. As for case A, we obtain the representation of the subdifferential of j_B on L²(Λ) by an application of the chain rule.
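In this notation, formulas (3.6) and (3.7) take the usual group-lasso form; the following is a transcription of Proposition 2.8 of [14] to the present measure spaces:

```latex
\[
  \tilde j_B'(u;v) = \int_{\Lambda_1\setminus\Lambda_1^0(u)}
    \frac{\bigl(u(x_1,\cdot),\,v(x_1,\cdot)\bigr)_{L^2(\Lambda_2)}}{\|u(x_1,\cdot)\|_{L^2(\Lambda_2)}}\,\mathrm{d}\rho_1
    + \int_{\Lambda_1^0(u)} \|v(x_1,\cdot)\|_{L^2(\Lambda_2)}\,\mathrm{d}\rho_1,
\]
\[
  \partial \tilde j_B(u) = \Bigl\{\lambda \in L^\infty(\Lambda_1,L^2(\Lambda_2)) :\;
    \lambda(x_1,\cdot) = \tfrac{u(x_1,\cdot)}{\|u(x_1,\cdot)\|_{L^2(\Lambda_2)}}
    \text{ on } \Lambda_1\setminus\Lambda_1^0(u),\;
    \|\lambda(x_1,\cdot)\|_{L^2(\Lambda_2)} \le 1 \text{ on } \Lambda_1^0(u) \Bigr\}.
\]
```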
Proposition 3.7. Let ū ∈ U_ad be a local solution to (P_2-B).
1. If α > 0, it holds ρ_1-a.e. on Λ_1 or ρ-a.e. on Λ, respectively: If α = 0, it holds ρ_1-a.e. on Λ_1: 2. g = β j_B satisfies Assumption 3.1.2 with D_ū replaced by C_ū.
Proof. For the first part, see Corollary 2.9 of [14]. For the second part, Case III in the proof of Theorem 3.3 in [14] and Section 4 of [18] prove Assumptions 3.1.2a and 2b. As already observed in case A, the specific semilinear parabolic state equation under consideration in [14, 18] does not influence the properties of the nonsmooth cost term. Therefore, the adaptation of the cited results is possible.
Proof. As for the cases A and B, the concrete type of the state equation in [14] does not influence the properties of the nonsmooth cost term. Therefore, we can easily adapt the respective arguments: for the first part, see Corollary 2.6 of [14]. For the second part, note that Assumption 3.1.2a is verified in Case II of the proof of Theorem 3.3 in [14], while 2b is obtained as follows: for t_k, v_k as in Assumption 3.1.2b, the corresponding lim inf estimate follows from Lemma 4.7 of [14]. Consequently, by weak lower semicontinuity of the L²(Λ_1)-norm, due to j_C(ū; v_k) → j_C(ū; v) and the formula for j_C stated in the proposition, we conclude lim inf_{k→∞} j_C(ū; v_k²) ≥ j_C(ū; v²), and together with (3.10) the claim follows.
The additional assumption ∇h(ū) ∈ L ∞ (Λ) is only required to verify Assumption 3.1.2a, i.e. for necessary second-order optimality conditions.

Functional D
Even though discrete versions of j_D are well known in the machine learning community, cf., e.g., [7, 41], the analysis of j_D in the present infinite-dimensional setting is, to the best of our knowledge, new. First, a short computation yields a difference-quotient estimate for any u, v ∈ L²(Λ_1, L¹(Λ_2)) and t > 0. Dividing by t and sending t ↘ 0 yields the directional derivative. Consequently, we obtain the subgradient formula (3.13), and as for A-C these formulas remain true for j_D instead of j̃_D. We have:
Proposition 3.9. Let ū ∈ U_ad be a local solution to (P_2-D) and define γ̄(x_1) = ‖ū(x_1,·)‖_{L¹(Λ_2)} if ū(x_1,·) ≠ 0, and γ̄(x_1) = 1 else.
Proof. Although functional j_D has not been under consideration in [14], we will make use of some techniques and intermediate results from [14] in the following. As explained for the cases A-C, they can be applied in our setting because the analysis of the nonsmooth cost terms in [14] is actually independent of the concrete choice of the smooth functional. Part one is verified along the lines of the proof of Corollary 2.6 in [14], utilizing the above formula (3.13) for the subgradient. Regarding part two, we start with the verification of Assumption 3.1.2a with D_ū = C_ū. Let v ∈ C_ū and ū ≠ 0. As in Case II of the proof of Theorem 3.3 in [14] we define v_k such that (3.14) holds ρ_1-a.e. on Λ_1 for 0 < t < k⁻². With similar arguments as in [14] it can be shown that f'(ū)v_k + β j_D'(ū; v_k) = 0, i.e. v_k ∈ C_ū. From (3.14) and (3.11) we conclude the required estimate for those 0 < t < k⁻². Finally, we take 0 < t_k < k⁻² and conclude, due to v_k → v strongly in L²(Λ), that the desired limit holds. Hence, we have verified Assumption 3.1.2a. Next, let v_k, t_k be as in Assumption 3.1.2b. First, recall from the proof of Lemma 4.7 in [14] an estimate which implies (3.15). Along the lines of the proof of Lemma 4.6 in [14] we obtain that v_k ⇀ v, v ∈ C_ū, and j_D(ū; v_k) → j_D(ū; v) imply weak convergence of ‖v_k(x_1,·)‖_{L¹(Λ_2)} in L²(Λ_1).
Thus we conclude from (3.15) the desired lim inf estimate, i.e., we have verified Assumption 3.1.2b.

Proofs of the main results
Finally, we can prove Theorems 2.4 and 2.5 for (P_k) from Section 2.2. They are obtained by applying the abstract results of Sections 3.1 and 3.2 to the specific choices of f and g. Here, S: u ↦ y = y(u) denotes the solution map of the state equation (Eq), to be specified below in Section 4.1.
Before proving Theorems 2.4 and 2.5, already stated in Section 2.2, we summarize the required auxiliary results on f and g in the next subsection.

Auxiliary results regarding the smooth part f of the functionalĴ
First, we recall the following result on well-posedness of the state equation from Proposition 3.5 of [5]; see also Corollary 5.8 of [49].
and a test function ϕ ∈ L^s(I, W^{1,p}_D). The following result holds true: is well defined, twice continuously Fréchet differentiable, and the following properties hold true: Proposition 4.2 allows us to motivate the choice of s in Assumption 2.3. In fact, proving differentiability of f requires differentiability of the control-to-state map S. The latter requires differentiability of the solution map S̃ of (4.1) and hence, in order to ensure differentiability of the nonlinear superposition operator in A, L∞(Q)-regularity for the solutions of (4.1). Now, choosing s as in Assumption 2.3 ensures that there are κ_1, κ_2 > 0 such that the corresponding a priori estimate holds; cf. Proposition 3.3 of [5]. In particular, S̃ is a well-defined map L^s(I, W^{-1,p}_D) → C(Q) and differentiability can be shown with the help of the implicit function theorem; cf. Section 4.1 of [5].
Next, we state an improved regularity result for the adjoint state p which is different from the regularity result ([5], Prop. 4.7) obtained on Bessel-potential spaces because it does not need the additional assumptions from [5]. Note that this improved regularity result is only used to ensure ∇h(ū) = B*p ∈ L∞(I, R^m), which is required for necessary optimality conditions for j_4 and j_5 (i.e. case C); cf. Proposition 3.8.
and satisfies the equation in the sense of distributions. Moreover, the operator −ξ(y)∇ · µ∇ has nonautonomous maximal parabolic regularity on L r (I, L q ), and there is an embedding with some σ > 0 provided that r ∈ (2, ∞).

Auxiliary results for the nonsmooth part g of the functionalĴ
Regarding the nonsmooth part g of the functional, it suffices to observe that the seven possibilities for j k given in the introduction can be reduced to the four generic cases A-D from Section 3.2. The respective choices of Λ 1 , Λ 2 and the identification of the variables x 1 , x 2 (notation of Sect. 3.2) with the indices i and time t are summarized in Table 1.
Therefore, we can translate the results for j X , X ∈ {A, B, C, D} from Section 3.2 back to j k for k = 1, . . . , 7. For reference, we state concrete formulas for the directional derivatives, the subgradients, and the surrogates for the second-order derivative.

Main results: Proofs of Theorems 2.4 and 2.5
First, we prove Theorem 2.4, which states first-order optimality conditions and the resulting sparsity patterns of the optimal control. As already pointed out, we obtain this result by straightforward application of Propositions 3.6, 3.7, and 3.8. Note that, unlike for existence of optimal controls, the box-constraints on the controls are not required for this result.
Proof of Theorem 2.4. Each (P_k), k ∈ {1, …, 7}, can be understood as a realization of (P_2-X) for some X ∈ {A, B, C, D}; cf. Section 4.2. First-order optimality conditions and sparsity patterns for the latter have been obtained in Section 3.2. The formula for the gradient of the smooth part of the functional has been stated in Propositions 4.2 and 4.3.
Next, we prove Theorem 2.5 on second-order optimality conditions. Again, the proof is short, because the main work has already been done in Theorems 3.4 and 3.5, and the verification of the corresponding assumptions in Sections 3.2 and 4.1.
Proof of Theorem 2.5. Apply Theorems 3.4 and 3.5 to U_∞ = L^s(I, R^m), U_2 = L²(I, R^m), f, g, and h specified as in Section 4.1 above, and K = U_ad. Assumptions 3.1.1 and 3.1.2, with D_ū replaced by C_ū, have been verified in Proposition 4.2 and Section 3.2. The additional requirement ∇h(ū) = B*p ∈ L∞(I, R^m) in case C follows from Proposition 4.3.
Here, the presence of box-constraints on the controls is necessary for the verification of Assumption 3.1.1, because all L^s-topologies, s ∈ [1, ∞), are equivalent on the L∞-bounded admissible set. Omitting the control-constraints is possible, but incurs a norm-gap, as explained at the end of Section 3.1.

Limitations of the approach: The bang-bang case
We conclude this section by an outlook to the bang-bang case α = 0 that illustrates the limits of our second-order analysis. In fact, the present approach cannot be extended to the bang-bang case. Regarding necessary optimality conditions, a short computation shows that C_ū = {0} holds for ū satisfying the first-order optimality conditions of (OCP_1). Hence, the first part of the statement of Theorem 3.4 is still true, but trivial for α = 0. For sufficient optimality conditions, Assumption 3.1.1biii is crucial. It is well known that this property of the smooth part of the functional can only be expected in the case α > 0, or under a similar so-called Legendre-Clebsch condition; cf. [21]. As explained at the end of Section 3.1, it seems impossible to avoid this assumption on f by exploiting properties of g. Recall the form of the second derivative of the smooth part of the functional in the case α = 0, with ȳ = S(ū), z_v := S'(ū)v, and p̄ as in Proposition 4.2. Assuming appropriate higher regularity for p̄, this can be transformed further. The approach in [14, 17] for the bang-bang case is based on a second-order sufficient optimality condition involving a constant c > 0 and all v from a certain cone. For such a condition to hold in our setting, we expect that 1 − ξ'(ȳ)∇·µ∇p̄ ∈ L∞(Q), and therefore ∇·µ∇p̄ ∈ L∞(Q), has to hold, which is a very strong assumption. Moreover, to follow the arguments of [14, 17] we would need certain continuity properties. Consequently, ∇·µ∇p̄ ∈ L∞(Q) would be required to depend continuously on u, which is out of reach, even in a highly smooth setup and with controls measured in L∞(I, R^m).
On the other hand, to apply the approach for L¹-penalized semilinear elliptic bang-bang problems from [60], we would have to guarantee that there is a bounded bilinear extension f''(ū): M(I, R^m) × M(I, R^m) → R to the space M(I, R^m) of R^m-valued regular Borel measures on I. Moreover, appropriate higher regularity for the adjoint state would be needed. This is of course more delicate for parabolic problems than for elliptic ones. Moreover, it is not clear whether the results obtained in [60] for L¹-penalization also hold for directional sparsity functionals.
This shows that it is by no means obvious that techniques successfully applied to semilinear parabolic or semilinear elliptic problems can be transferred to the quasilinear parabolic case. We leave this as an interesting open problem.

Numerical illustration
Let us conclude the paper with some numerical computations that illustrate the different sparsity patterns induced by the penalizers j_k, k ∈ {1, …, 7}. Before presenting these numerical examples in Section 5.2, we give a concise overview of the fast proximal method in Section 5.1, which is used to solve (P_k) for k = 1-3, 6, 7. For k = 4, 5 we apply a subgradient method to (P_k).

Proximity operators and fast proximal methods
Proximal algorithms, see, e.g., [50], have been applied successfully in different areas, e.g., image processing and machine learning, but also in PDE-constrained optimization [53, 54]. This class of algorithms specifically applies to problems of type (P_2-X), cf. Section 3.2, consisting of a nonconvex but smooth, and a convex but nonsmooth summand in the functional. In the context of sparse optimal control we are aware of possibly faster methods, e.g., certain Newton-type methods in function space [32, 46, 51, 57], or algorithms on the discrete level, e.g., Section 6 of [11] and Section 4 of [32]. However, proximal methods usually have the advantage that they are relatively easy to implement and, compared to second-order methods, less intrusive. Let us recall from, e.g., Algorithm 2 of [53] the basic concept of the so-called fast proximal method, formulated for (P_2-X). Given a fixed step size L > 0, an initial guess u_0, and t_0 = 1, set v_0 = u_0 and iterate for ℓ = 1, 2, 3, ...
until the current iterate u_ℓ reaches a desired optimality criterion. Here, we denote by Prox_τ the so-called proximity operator; see, e.g., [50] or Chapter 24 of [4] for an overview. It is a crucial condition for the applicability of proximal algorithms to (P_2-X) that the nonsmooth part of the functional, j_X, is "proximable", i.e. we have to know how to compute (5.1) efficiently. We briefly address this issue using the notation of Section 3.2. In case A, it is well known, see, e.g., Lemma 4.3 of [53] or Section 3.3.2 of [51], that j_A is proximable, with a soft-thresholding formula holding for a.a. x ∈ Λ.
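The overall iteration, together with the case-A soft-thresholding prox, can be sketched in a few lines. The toy objective, step size, and iteration count below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def soft_threshold(v, tau):
    # proximity operator of tau * ||.||_1 (case A, no box-constraints)
    return np.sign(v) * np.maximum(0.0, np.abs(v) - tau)

def fista(grad_f, prox_g, u0, step, n_iter=200):
    """Fast proximal (FISTA-type) method: a forward gradient step on the
    smooth part, a prox step on the nonsmooth part, plus extrapolation."""
    u, v, t = u0.copy(), u0.copy(), 1.0
    for _ in range(n_iter):
        u_new = prox_g(v - step * grad_f(v), step)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        v = u_new + (t - 1.0) / t_new * (u_new - u)   # momentum step
        u, t = u_new, t_new
    return u

# toy instance: min_u 0.5*||u - b||^2 + beta*||u||_1, whose unique
# minimizer is the soft-thresholding of b at level beta
b = np.array([2.0, 0.05, -1.0])
beta = 0.5
u_opt = fista(lambda u: u - b, lambda v, s: soft_threshold(v, s * beta),
              np.zeros(3), step=1.0)
```

Note how the middle entry of b falls below the threshold β and the corresponding control component is exactly zero, which is the sparsity mechanism of case A in miniature.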
For case B and U_ad = L²(Λ), a closed formula for the proximity operator is available as well; cf., e.g., Section 3.3.2 of [51]. We refer the reader to Section 6.5.4 of [50] or Theorem 3 of [41] for the same formula in the discrete case. For the proximity operator in case B with U_ad := {u ∈ L²(Λ): u ≥ 0 a.e.}, we refer the reader to Section 3.3.2 of [51]. We are not aware of an explicit formula for the proximity operator in the case of bilateral box-constraints. To the best of our knowledge, the functional of case C is not "proximable". For case D and U_ad = L²(Λ), however, a formula for the proximity operator is well known for the discrete analogue, see, e.g., Theorem 3 of [41], and the adaptation to our setting is not difficult. We obtain:

[Prox_τ(v)](x_1, x_2) = sign(v(x_1, x_2)) max(0, |v(x_1, x_2)| − ρ(x_1)) for a.a. x ∈ Λ,

where ρ(x_1) ∈ R has to satisfy for every x_1 ∈ Λ_1:

‖max(0, |v(x_1, ·)| − ρ(x_1))‖_{L¹(Λ_2)} = ρ(x_1)/(2τ). (5.2)

The efficient computation of proximity operators for case C, as well as for cases B and D in the presence of box-constraints, is certainly of interest, but beyond the scope of the present paper. Also, we do not address the convergence analysis of the fast proximal method in our precise setting.
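Both proximity operators can be sketched in a discrete setting, where the rows of v play the role of the groups v(x_1, ·). The uniform weights, the bisection tolerance, and the normalization j_D(u) = Σ_i ‖v_i‖₁² (cf. the exclusive lasso [41]) are illustrative assumptions:

```python
import numpy as np

def prox_group_lasso(v, tau):
    """Case B (no constraints): shrink each group v[i, :] toward zero in the
    l2-norm by tau; groups with norm <= tau are set exactly to zero."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, np.finfo(float).tiny))
    return scale * v

def prox_exclusive_lasso(v, tau, n_bisect=60):
    """Case D (no constraints): per group, soft-threshold at a level rho
    solving ||max(0, |v_i| - rho)||_1 = rho/(2*tau), cf. (5.2). The
    left-hand side is nonincreasing and the right-hand side increasing in
    rho, so the equation can be solved by bisection on [0, max |v_i|]."""
    out = np.zeros_like(v)
    for i, vi in enumerate(v):
        lo, hi = 0.0, float(np.max(np.abs(vi), initial=0.0))
        for _ in range(n_bisect):
            rho = 0.5 * (lo + hi)
            if np.sum(np.maximum(0.0, np.abs(vi) - rho)) > rho / (2.0 * tau):
                lo = rho
            else:
                hi = rho
        out[i] = np.sign(vi) * np.maximum(0.0, np.abs(vi) - 0.5 * (lo + hi))
    return out
```

The bisection mirrors the solver used for (5.2) in Section 5.2; at the solution the threshold equals 2τ times the l1-norm of the shrunken group, which is exactly the fixed-point relation behind (5.2).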

Results
We consider the following specification of (P_k): we choose Ω = B_1(0) ⊂ R², T = 8, µ ≡ I_2 ∈ R^{2×2}, Γ_D = ∂Ω, α = (π/25)·10⁻², β = 10⁻², y_0 ≡ 0, ξ(s) := 1/2 + 1/(1+exp(−20s)), and y_d(t, x) := −(t/4) sin(πt/2) exp(−36|x − m(t)|₂²) with m(t) = (2/3)(sin(πt/4), cos(πt/4))^T. The eight control actuators are given by ⟨b_i, ϕ⟩ := ∫_Ω 1_{B_{1/5}(m(i−1))}(x) ϕ(x) dx, i = 1, …, 8. Consequently, the state equation reads as follows:

∂_t y(t, x) − ∇·( (1/2 + 1/(1+exp(−20 y(t, x)))) ∇y(t, x) ) = (Bu)(t, x).

We omit control-constraints and set U_ad = L^s(I, R^m). Space is discretized with the help of FEniCS and mshr [1, 44] using piecewise linear finite elements with 3324 DoF and mesh size h_max ≈ 5.0·10⁻². For time discretization we use the implicit Euler scheme with 160 equidistant timesteps; more precisely: state and adjoint state are discretized piecewise linear in space and piecewise constant in time, while the controls, whose "spatial" components are vectors in R^m in our setting and, consequently, do not need to be discretized, are R^m-valued piecewise constant in time. Hence, the chosen discretization coincides with the variational discretization concept [34]. Moreover, discretization and optimization commute ("optimize then discretize = discretize then optimize"); in particular, the discretization of the subgradient of j_k yields the subgradient of the discretization of j_k. Note that introducing constant (w.r.t. time) control bounds u_a, u_b ∈ R^m would not change this situation. However, we point out that dealing with distributed controls (instead of purely time-dependent ones) would be more intricate because then the spatial component of the controls requires discretization (or variational discretization) as well.
For issues related to the handling of the control-variable in the discretization of sparse optimal control problems we refer the reader to the overview in Section 4.5.3 of [51], or to, e.g., [18,19] for semilinear parabolic or [12,13] for semilinear elliptic problems.
For k = 1-3, 6, 7, i.e. cases A, B, and D, we solve the discretized counterpart of (P_k) by the fast proximal method described above with step size L = 10⁻². In case D, equation (5.2) is solved by bisection. Regarding the computation of the proximity operators, note that the commutativity of optimization and discretization pointed out above applies to the minimization problem (5.1) defining the proximity operator as well. Since we are not aware of a proximity operator for case C, we solve the optimal control problem for k = 4, 5 by a classical subgradient descent method, cf. Chapter II.2.1.2 of [45] for instance, with the step size in iteration ℓ given by 10/√ℓ. The initial guess in all cases is u⁰_i ≡ 0.1, and the nonlinear problem at each timestep of the solution of the discretized state equation is solved by the built-in nonlinear solver provided by FEniCS. Figure 1 b)-h) shows the optimal controls of (P_k) for k = 1, …, 7. For comparison we also display the nonsparse optimal control for β = 0 in Figure 1 a). The typical sparsity patterns described in Section 2.2 are clearly visible; cf. the description below Theorem 2.4. In c) the number of control actuators becoming active is sparse. The controls u_2, u_3 and u_5 are identically zero, but the remaining actuators are active over the whole time interval. In f) all actuators are in use at some point in time, but it seems that only two actuators can become active at the same time. In d), finally, any control action is limited to certain subintervals of I, but on them all actuators become active simultaneously. Also the difference between k = 1 and k = 4 becomes visible. In b) an actuator only becomes active if its contribution is above a certain threshold level that is the same for all actuators. This results in a number of actuators never becoming active. In e), however, this threshold level is different for each actuator, cf.
the formula in Theorem 2.4, and therefore each actuator is used, but only at those times where it is needed sufficiently much. Finally, by comparison of e) with g) and f) with h) it can be seen that usage of functionals j 4 and j 6 as well as j 5 and j 7 lead to similar results w.r.t. the sparsity pattern, respectively.
In Figure 2 a) and b) we illustrate the convergence speed of the fast proximal and the subgradient method. We display the L²(I, R^m)-norm of the residuals r_ℓ of the control iterates u_ℓ, i.e. r_ℓ := u_ℓ + α⁻¹(p_ℓ + βλ_ℓ), where p_ℓ and λ_ℓ denote the adjoint state and the subgradient of j_k associated with u_ℓ; cf. the optimality conditions in Theorem 2.4 and Section 3.2. As expected, the fast proximal method converges much faster than the subgradient method. In our opinion, this indicates that replacing the functionals j_4, j_5 by j_6, j_7, respectively, may be worth considering in applications in order to allow the application of fast proximal methods.