Dynamic Narrowing of VAE Bottlenecks Using GECO and $L_0$ Regularization

When designing variational autoencoders (VAEs) or other types of latent space models, the dimensionality of the latent space is typically defined upfront. In this process, it is possible that the number of dimensions is under- or overprovisioned for the application at hand. In case the dimensionality is not predefined, this parameter is usually determined using time- and resource-consuming cross-validation. For these reasons we have developed a technique to shrink the latent space dimensionality of VAEs automatically and on-the-fly during training using Generalized ELBO with Constrained Optimization (GECO) and the $L_0$-Augment-REINFORCE-Merge ($L_0$-ARM) gradient estimator. The GECO optimizer ensures that we are not violating a predefined upper bound on the reconstruction error. This paper presents the algorithmic details of our method along with experimental results on five different datasets. We find that our training procedure is stable and that the latent space can be pruned effectively without violating the GECO constraints.


Introduction
Deep neural networks are constructed and trained such that intermediate, latent representations of the input data are learned [9]. For this reason, deep learning and representation learning are two terms that are very closely related. The goal of representation learning is to project high-dimensional data points such as images, documents, etc. into a vector space that is typically of low dimensionality compared to the original data points. In this space, similar data points ideally have similar representations, i.e. the points lie close to each other in the vector space in terms of some distance metric [3,20]. The low number of dimensions and the similarity property pave the way for applications such as recommender systems (detecting similar items), anomaly detection (detecting contrasting items), data generation, data interpolation, etc. In these applications it is highly preferred that the dimensionality of the vector space is kept to a minimum that is needed for the task at hand. Not only does it allow for more efficient calculations in the data pipeline, it also lowers the required storage capacity.
Common machine learning algorithms that are used for representation learning typically require the user to specify the vector space dimensionality upfront when designing the model. In classical unsupervised methods such as Principal Components Analysis (PCA) and the related Singular Value Decomposition (SVD), both popular dimensionality reduction techniques, the number of retained variables is a hyperparameter of the model. There exists an extensive body of literature that tackles the problem of estimating the optimal number of components to keep in (probabilistic) PCA. Cross-validation or the Bayesian Information Criterion (BIC) are straightforward baseline methods that can already provide satisfying results [14]. More advanced techniques such as Horn's Parallel Analysis [11], Minka's Laplace method [21] or Deng et al.'s penalized probabilistic PCA [8] are able to outperform these baselines in accuracy, efficiency or both. For more in-depth discussions, analyses and comparisons of these particular types of methods we refer the reader to the texts in [2,24,13].
Since the arrival of deep learning onto the machine learning scene, we have seen an explosion of models and techniques that are tuned to the task of representation learning and dimensionality reduction. As opposed to PCA and many other classic algorithms that model linear dependencies, deep neural networks have the powerful ability to learn highly non-linear mappings. One powerful class of such models are the autoencoders, where the input data is first projected ('encoded') onto a low-dimensional subspace, after which this projection is used to reconstruct the original input data itself ('auto') as well as possible. Mathematically, if the input of the autoencoder is a $d$-dimensional vector $x$, the encoder $f_{enc} : \mathbb{R}^d \to \mathbb{R}^n$ produces an $n$-dimensional projection $z$, with $n \ll d$. The decoder $g_{dec} : \mathbb{R}^n \to \mathbb{R}^d$ then takes $z$ as input and produces a $d$-dimensional output $\hat{x}$:
$$z = f_{enc}(x); \quad \hat{x} = g_{dec}(z). \tag{1}$$
From these equations, it becomes clear that autoencoders actually consist of two neural networks, which together create a diabolo architecture: the input and output dimensionality is the same, but the intermediate representation has a lower dimensionality, thereby creating the data 'bottleneck'. The projection $z$ has also been called 'latent vector', 'latent code' or 'hidden representation', and 'latent space' is often used to denote the vector space of all possible codes. To make sure the output $\hat{x}$ resembles the input $x$ as closely as possible, the loss function will usually contain a reconstruction error term reflecting either mean squared error (MSE), negative log-likelihood (NLL), a perceptual or adversarial loss, or any other metric or divergence expression (we will elaborate further on loss functions for autoencoders in Section 2.1). There are of course many (hyper)parameters that determine the ability of the decoder to reconstruct the original input. One of these parameters is the width of the bottleneck, i.e. the dimensionality of the latent vector $z$.
The stronger the bottleneck, the more the input data will be compressed by the encoder, thereby potentially causing information loss, which adheres to the information bottleneck (IB) principle as a "trade-off between compression and prediction" [28,1]. The amount and nature of the compression depends highly on the model architecture and the reconstruction loss. For example, if we decode an entire image by minimizing a pixel-based $L_2$ error, it is known that the reconstructions will become increasingly blurred versions of the original image as the bottleneck is tightened [23,12], causing the latent vectors to focus on low-frequency information.
Similar to PCA, the bottleneck dimensionality of autoencoders is predefined in the model architecture. For a given problem statement or use case, one typically provisions an adequate number of latent dimensions in such a way that the application's requirements are met. Requirements can be quality-centered, e.g. accuracy or reconstruction error; performance-based, such as memory consumption or processing time; or application-driven, for example if the latent vectors in existing subsystems already have a fixed dimensionality. The general conception reads "the more latent dimensions, the better", at least in order to improve the quality-based metrics. This line of thought is indeed backed by information theory: consider a random variable $Z_{1:k}$ that represents a valid latent vector $z$ with $k$ dimensions, and a random variable $X$ that stands for the output of the decoder. If an additional one-dimensional variable $Z_{k+1}$ is added to the latent vector, and if we treat the decoder as an information channel, it is known that (with $H(\cdot \mid \cdot)$ denoting conditional entropy):
$$H(X \mid Z_{1:k}, Z_{k+1}) \le H(X \mid Z_{1:k}). \tag{2}$$
In other words, increasing the latent space with extra dimensions might reduce the uncertainty about the decoder output, and it can never increase it. Indeed, it is in general always possible for the decoder to ignore the extra information in the latent vector if it would lead to deteriorated reconstructions. In this paper we will try to answer the exact opposite question: to what extent can we eliminate dimensions from the latent space with a minimal sacrifice in output quality? Or, more explicitly, given a set of conditions that need to be satisfied (e.g. a maximum classification error), can we effectively prune away latent dimensions without violating these conditions? This could be achieved by hyperparameter optimization techniques, which are often resource- and time-consuming.
The core of this work, however, is to devise a learning scheme that allows us to dynamically prune the latent space on-the-fly during training, without any additional hyperparameter tuning. During training, we will make the information bottleneck trade-off between accuracy and compression very explicit by means of a constraint-based formulation of the problem using GECO (Generalized ELBO with Constrained Optimization) [26], which will be explained in Section 2.1.
We will show that VAE bottlenecks can effectively be pruned with our methodology. Moreover, we will provide experimental evidence that more dimensions can be eliminated if the constraints are relaxed; that is, if we are satisfied with an overall higher reconstruction error, we can potentially achieve an increased prune rate. Conversely, pushing hard on the reconstruction quality will require more latent dimensions to compress the data, which clearly demonstrates the information bottleneck trade-off. The remainder of this paper is structured as follows. First, we will give background information on VAEs and GECO, after which we will provide details on the $L_0$-ARM (Augment-Reinforce-Merge) gradient estimator [17,30]. In Section 3 our training procedure will be explained. Finally, we will conduct a set of experiments to validate our approach on five different datasets in Section 4.

Background Material
In this section we will cover the theoretical building blocks that are used in the algorithm of Section 3. We will first cover variational autoencoders and the GECO constrained optimization procedure. Afterwards we will explain the details of the $L_0$-ARM gradient estimator.

Variational Autoencoders and GECO
In traditional autoencoders the latent vectors are deterministic for a given input data point. Variational autoencoders (VAEs), on the other hand, deal with stochastic latent vectors, i.e. the encoder models a distribution $q_\theta(z \mid x)$ over latent vectors rather than a single fixed vector [15]. This distribution is called the approximate posterior and is learned to approximate the true, but unknown posterior distribution $p(z \mid x)$. This is typically done by modeling $q_\theta$ as a Gaussian distribution with diagonal covariance. In practice, the neural network $f^\theta_{enc} : \mathbb{R}^d \to \mathbb{R}^n \times \mathbb{R}^n$ with parameters $\theta$ produces both the means and the logarithmic standard deviations of this distribution given some input data $x$. A latent vector $z$ is then sampled from the approximate posterior as follows:
$$(\mu, \log\sigma) = f^\theta_{enc}(x); \quad z \sim q_\theta(z \mid x) = \mathcal{N}(z; \mu, \mathrm{diag}(\sigma^2)). \tag{3}$$
The decoder neural network $g^\phi_{dec} : \mathbb{R}^n \to \mathbb{R}^d$ with parameters $\phi$ models the likelihood distribution $p_\phi(x \mid z)$, and produces a reconstructed data point:
$$\hat{x} = g^\phi_{dec}(z). \tag{4}$$
Both the encoder and decoder networks are optimized using the ELBO [15], written here as a loss to be minimized:
$$\mathcal{L}_{\theta,\phi} = E_{z \sim q_\theta(z \mid x)}\left[-\log p_\phi(x \mid z)\right] + KL\left(q_\theta(z \mid x) \,\|\, p(z)\right). \tag{5}$$
The first term in this loss function minimizes the negative log-likelihood of the data given a latent vector sampled from the posterior distribution. Differentiability of this term is ensured by the so-called reparameterization trick, in which sampling $z$ from the posterior is replaced by sampling a noise vector $\epsilon$ from a standard normal distribution and rewriting $z$ as a differentiable function of $\epsilon$:
$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \tag{6}$$
Here, $\odot$ denotes an element-wise vector product. In the second term of Equation (5), $p(z)$ is the prior distribution over latents, which is often fixed to a standard normal with zero mean and identity covariance matrix: $p(z) = \mathcal{N}(z; 0, I)$.
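As a minimal sketch of the sampling step and the KL term described above, the following pure-Python snippet draws a latent vector with the reparameterization trick and evaluates the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. The function names and the example values of `mu` and `log_sigma` are illustrative assumptions and do not come from the original implementation.

```python
import math
import random

def reparameterize(mu, log_sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(0.5 * (math.exp(2.0 * ls) + m * m - 1.0 - 2.0 * ls)
               for m, ls in zip(mu, log_sigma))

rng = random.Random(0)                    # fixed seed, for reproducibility only
mu, log_sigma = [0.5, -1.0], [0.0, -0.5]  # hypothetical encoder outputs
z = reparameterize(mu, log_sigma, rng)
kl = kl_to_standard_normal(mu, log_sigma)
```

The KL term vanishes exactly when the approximate posterior equals the prior (`mu = 0`, `log_sigma = 0`), which is the collapse onto the standard normal that the regularization encourages for uninformative dimensions.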
Minimizing the KL divergence in Equation (5) can be regarded as a form of regularization, as it causes uninformative factors to collapse onto the standard normal distribution. To make the regularization strength explicit and tunable, Higgins et al. proposed β-VAE in which $\beta$ is the regularization coefficient [10]:
$$\mathcal{L}^\beta_{\theta,\phi} = E_{z \sim q_\theta(z \mid x)}\left[-\log p_\phi(x \mid z)\right] + \beta \cdot KL\left(q_\theta(z \mid x) \,\|\, p(z)\right). \tag{7}$$
Increasing the value of $\beta$ puts more weight on the KL term, thereby implicitly tightening the information bottleneck. Indeed, Higgins et al. argue that varying and choosing an appropriate value of $\beta$ represents trading off reconstruction quality vs. latent channel capacity, and that it is generally advised to set $\beta > 1$ in order to arrive at a disentangled latent space [10]. In contrast, Rezende and Viola propose Generalized ELBO with Constrained Optimization (GECO), which rephrases the ELBO loss function in terms of a constrained optimization problem using a Lagrange multiplier $\lambda$ [26,27]:
$$\mathcal{L}^\lambda_{\theta,\phi} = KL\left(q_\theta(z \mid x) \,\|\, p(z)\right) + \lambda \cdot E_z\left[C(x, g_{dec}(z))\right]. \tag{8}$$
This Lagrangian optimizes the KL divergence subject to $E_z[C(x, g_{dec}(z))] \le 0$, for a given constraint function $C$ for which $C(x, g_{dec}(z)) \in \mathbb{R}$ (we have left out the parameters for readability). A GECO constraint can essentially be any condition or metric the VAE needs to satisfy, but the constraint will usually model an upper bound on a predefined reconstruction error, e.g. the $L_2$ error:
$$C(x, g_{dec}(z)) = \|x - g_{dec}(z)\|_2^2 - \tau, \tag{9}$$
or any other useful divergence. The non-negative constant $\tau \ge 0$ represents a desired tolerance or upper bound on the reconstruction error, such that $E_z[C(x, g_{dec}(z))]$ indeed becomes non-positive if the upper bound is satisfied. The value of $\tau$ can be tuned according to the application needs, which makes $\tau$ a hyperparameter of the model. One might argue that tuning $\tau$ is no different or easier than picking an appropriate $\beta$ in β-VAE. However, tuning parameters in latent spaces is an abstract operation, while tuning quality metrics in the data space is much more tangible and practical, and is often directly related to the application at hand.
Optimizing the Lagrangian in Equation (8) is done through min-max optimization: it is minimized w.r.t. parameters θ and φ, while it is maximized w.r.t. λ. The details on how to optimize the multiplier λ are explained in the original work by Rezende and Viola, and will be further clarified in Section 3.
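To make the min-max mechanics concrete, the sketch below evaluates an $L_2$ constraint of the form "squared error minus $\tau$", the resulting Lagrangian, and one ascent step on $\lambda$. The update rule in `ascend_lambda` (a plain projected gradient-ascent step with a hypothetical step size) is an assumption chosen for illustration; it is not the update used by Rezende and Viola or in Section 3.

```python
def l2_constraint(x, x_hat, tau):
    """C(x, x_hat) = ||x - x_hat||^2 - tau; non-positive when the bound holds."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) - tau

def geco_lagrangian(kl, constraint, lam):
    """KL term plus the constraint weighted by the multiplier lambda."""
    return kl + lam * constraint

def ascend_lambda(lam, constraint, step=0.1):
    """Assumed update: one projected gradient-ascent step on lambda (kept >= 0).

    The derivative of the Lagrangian w.r.t. lambda is the constraint value itself.
    """
    return max(0.0, lam + step * constraint)

x, x_hat = [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]   # toy data point and reconstruction
violated = l2_constraint(x, x_hat, tau=0.1)   # squared error 0.25 exceeds tau
satisfied = l2_constraint(x, x_hat, tau=0.5)  # squared error 0.25 is within tau
lam_up = ascend_lambda(1.0, violated)         # multiplier grows
lam_down = ascend_lambda(1.0, satisfied)      # multiplier shrinks
```

A violated constraint raises $\lambda$ and therefore the weight of the reconstruction error, while a satisfied constraint lowers it and lets the KL term dominate.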

$L_0$ Regularization and ARM
There has been considerable effort in the past to sparsify neural networks by pruning redundant or low-magnitude weights [22,16,7]. Louizos et al. introduced a learning method that employs $L_0$ regularization [18]. The $L_0$ norm of a numerical vector represents the number of non-zero components, so that minimizing the $L_0$ norm comes down to setting as many vector components to zero as possible. This is useful for sparsification of neural networks as follows. Consider the vector $w$ of all layer weights in a network, and a binary vector $\nu$ (containing only 0s and 1s) with the same dimensionality as $w$. By performing the element-wise vector product $w \odot \nu$, the vector $\nu$ acts as an on-off gating mechanism: for every 0 in $\nu$ the corresponding weight in the neural network is essentially eliminated from the computational graph. Therefore, $L_0$ regularization on $\nu$ comes down to eliminating weights from the neural network. If we applied traditional $L_2$ or $L_1$ regularization instead, minimization would make the vector components small, but would not encourage them to become exactly zero. Exact zeros are needed for the purpose of pruning, since a tiny but non-zero component is still able to carry useful information, and is therefore not eliminated effectively.
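The gating mechanism and the $L_0$ norm can be illustrated with a few lines of Python. This is a toy sketch with made-up weights; the helper names `gate` and `l0_norm` are hypothetical.

```python
def gate(w, nu):
    """Element-wise product w ⊙ nu: a 0 in nu switches the corresponding weight off."""
    return [wi * ni for wi, ni in zip(w, nu)]

def l0_norm(v):
    """Number of non-zero components."""
    return sum(1 for vi in v if vi != 0)

w = [0.8, -1e-3, 0.0, 2.5]   # toy weights; the second one is tiny but non-zero
nu = [1, 0, 1, 1]            # binary gates
gated = gate(w, nu)

n_nonzero_before = l0_norm(w)     # the tiny weight still counts
n_nonzero_after = l0_norm(gated)  # closing its gate removes it entirely
```

The tiny weight `-1e-3` is exactly the case mentioned above: a magnitude penalty leaves it small but alive, whereas a closed gate removes it from the count.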
Minimizing the $L_0$ norm $\|\nu\|_0$ in a gradient descent context is, however, not straightforward, since it is not differentiable w.r.t. $\nu$. More specifically, because $\|\nu\|_0$ takes discrete values, it has zero gradients everywhere in its domain. Gradient estimators such as Reinforce [29], Straight-Through estimation [4], and Hard Concrete estimation [18] can overcome this issue. More recently, Augment-Reinforce-Merge (ARM) was presented as an unbiased and low-variance gradient estimator that outperforms the previously mentioned estimators on a variety of tasks [30], and it was applied successfully to $L_0$-based network sparsification by Li and Ji shortly after its introduction [17]. It works as follows, given an arbitrary neural network $f$ with $n$ parameters $w$, a gating vector $\nu$ with dimensionality $n$, and a loss function $\mathcal{L}_{w,\nu}$ for which:
$$\mathcal{L}_{w,\nu} = E(f(x; w \odot \nu), y) + \beta \, \|\nu\|_0, \quad \|\nu\|_0 = \sum_{j=1}^{n} \mathbb{1}[\nu_j \neq 0]. \tag{10}$$
In the above expression, $E$ represents any error metric between the output of $f$ and a label $y$, $\mathbb{1}$ is the indicator function, and $\beta$ is a regularization coefficient. Neither term in Equation (10) is differentiable w.r.t. $\nu$. To overcome this issue, we model the components $\nu_j$ as samples from a Bernoulli distribution with parameters $\pi_j$: $\nu_j \sim \mathrm{Ber}(\nu; \pi_j)$. An upper bound of $\min_\nu \mathcal{L}_{w,\nu}$ can then be calculated [5]:
$$\min_\nu \mathcal{L}_{w,\nu} \le E_{\nu \sim \prod_j \mathrm{Ber}(\nu_j; \pi_j)}\left[E(f(x; w \odot \nu), y)\right] + \beta \sum_{j=1}^{n} \pi_j. \tag{11}$$
We denote the right-hand side of the inequality by $\hat{\mathcal{L}}_{w,\gamma}$. The regularizing term is now differentiable w.r.t. $\pi$, but the first term is still problematic. We will use ARM to estimate the gradients of this term, after writing the Bernoulli parameters $\pi$ as a function of logit parameters $\gamma$: $\pi_j = \sigma(\gamma_j)$. The function $\sigma : \mathbb{R} \to [0, 1]$ is ideally antithetic and symmetric around its inflection point. The standard sigmoid function is therefore a good candidate: $\sigma(\gamma) = 1/(1 + \exp(-\gamma))$. With this reparameterization, and the shorthand notation $F(\nu) = E(f(x; w \odot \nu), y)$, Equation (11) can be written as
$$\hat{\mathcal{L}}_{w,\gamma} = E_{u \sim \prod_j U(u_j; 0, 1)}\left[F(\mathbb{1}[u < \sigma(\gamma)])\right] + \beta \sum_{j=1}^{n} \sigma(\gamma_j). \tag{12}$$
We notice that sampling gates from a Bernoulli distribution is replaced by sampling from a uniform distribution between 0 and 1.
According to the theory behind ARM, the gradient of $\hat{\mathcal{L}}_{w,\gamma}$ w.r.t. $\gamma$ can now be estimated, unbiased and with low variance, as
$$\nabla_\gamma \hat{\mathcal{L}}_{w,\gamma} = E_{u \sim \prod_j U(u_j; 0, 1)}\left[\left(F(\mathbb{1}[u > \sigma(-\gamma)]) - F(\mathbb{1}[u < \sigma(\gamma)])\right)\left(u - \tfrac{1}{2}\right)\right] + \beta \, \nabla_\gamma \sum_{j=1}^{n} \sigma(\gamma_j). \tag{13}$$
This remarkably simple estimator has but one disadvantage, which is that two forward passes through the network are needed. In the original paper, the authors introduce the AR estimator to overcome the double forward pass, but it leads to a higher variance; we will therefore stick to standard ARM in this work. Finally, we have to point out that the $\sigma$ function is smooth and not guaranteed to be exactly zero for closed gates. At inference time, gates for which $\sigma(\gamma_j) \le 0.5$ are therefore explicitly closed, i.e. $\nu_j$ is set to 0, such that the corresponding weights are effectively eliminated.
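The unbiasedness of the ARM estimator can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-gate error function `F` (a stand-in for $E(f(x; w \odot \nu), y)$), computes the exact gradient of $E_\nu[F(\nu)]$ by enumerating all gate configurations, and compares it with a Monte Carlo average of the ARM estimate. The regularization term is left out, since its gradient is available in closed form.

```python
import itertools
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def F(nu):
    """Hypothetical error as a function of two binary gates."""
    return (nu[0] + 2.0 * nu[1] - 1.0) ** 2

def exact_grad(gamma):
    """Exact gradient of E_{nu ~ Ber(sigmoid(gamma))}[F(nu)] w.r.t. gamma."""
    pi = [sigmoid(g) for g in gamma]
    grad = [0.0] * len(gamma)
    for nu in itertools.product([0, 1], repeat=len(gamma)):
        prob = math.prod(p if v else 1.0 - p for p, v in zip(pi, nu))
        for j, v in enumerate(nu):
            # d prob / d gamma_j = prob * (nu_j - pi_j) for a Bernoulli with logit gamma_j
            grad[j] += F(nu) * prob * (v - pi[j])
    return grad

def arm_grad(gamma, n_samples, rng):
    """Monte Carlo average of the ARM estimate (two evaluations of F per sample)."""
    grad = [0.0] * len(gamma)
    for _ in range(n_samples):
        u = [rng.random() for _ in gamma]
        nu_a = [1 if ui > sigmoid(-g) else 0 for ui, g in zip(u, gamma)]
        nu_b = [1 if ui < sigmoid(g) else 0 for ui, g in zip(u, gamma)]
        diff = F(nu_a) - F(nu_b)
        for j, ui in enumerate(u):
            grad[j] += diff * (ui - 0.5) / n_samples
    return grad

gamma = [0.3, -0.8]                       # arbitrary example logits
exact = exact_grad(gamma)
estimate = arm_grad(gamma, 50_000, random.Random(0))
```

Both gate vectors are built from the same uniform sample `u`, which is what makes the two evaluations of `F` antithetic; when they coincide, the estimate for that sample is exactly zero.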

Shrinking VAE bottlenecks
In this section we will describe how to combine GECO optimization and the $L_0$-ARM gradient estimator in order to narrow VAE bottlenecks. The result of our methodology will be a novel VAE optimization scheme that effectively learns to project data points in a latent space while at the same time reducing that latent space's dimensionality and ensuring high-quality reconstructions. Before diving into the details, we first take a look at a couple of issues with the original $L_0$-ARM weight pruning technique. First, as Li and Ji state, sparsity strength (i.e. a measure for the number of weights that can be deleted) needs to be tuned explicitly by the regularization parameter $\beta$ [17]. Apart from this regularization parameter, gates are encouraged to close without any explicitly imposed constraints. Moreover, in the original text the authors show that most of the gates have already settled to open or closed after ca. 50 training epochs. That is often well before convergence of the main metric, i.e. the classification or regression error, and this causes a "drift" between pruning and performance optimization. Another concern is that the prune rate highly depends on the value of the regularization parameter, and setting this parameter precisely according to one's needs is not transparent and requires hyperparameter optimization. Finally, since the regularization parameter is our only proxy to tune the prune rate, we have little control over the final performance of the network. For example, if we are satisfied with a slightly lower predictive accuracy, we can decide to increase the prune rate. This operation will indeed come at a performance cost, but it is unclear by how much the performance will deteriorate. Therefore, we have to resort to a complete hyperparameter sweep. With GECO, however, it is possible to set an upper bound on the reconstruction error during training.
So, instead of tuning the prune rate implicitly with a regularization parameter, we will explicitly set a maximum on the reconstruction error and prune as many dimensions as possible without violating this maximum. Our procedure, as shown in Algorithm 1, is simple yet effective at shrinking the bottleneck dimensionality of VAEs. At the core is a basic VAE architecture consisting of an encoder $f^\theta_{enc}$ and decoder $g^\phi_{dec}$ neural network, as described in Section 2.1, and a global gating vector $\nu$ as explained in Section 2.2. The vector $\nu$ has the same number of dimensions as the VAE latent space. For a given data point $x$, we determine its latent vector $z$ by sampling from the output of the encoder using the reparameterization trick: $z \sim f^\theta_{enc}(x)$ (line 11 in Algorithm 1). Next, the sampled vector $z$ is gated through vector $\nu$ using the operation $z \odot \nu$, thereby deactivating some of the components. The resulting vector is used by the decoder to reconstruct the original data point: $\hat{x} = g^\phi_{dec}(z \odot \nu)$. To measure the reconstruction quality, we have chosen to use the squared error between the original and reconstructed data points, as shown in lines 12 and 13. By analogy with Equation (9), we can now define an upper-bound constraint as the difference between the reconstruction error and a pre-defined threshold $\tau$ (line 14). At this point we are able to define the constrained optimization problem that we are solving, in which the Lagrangian incorporates an additional $L_0$ regularization term on the gates:
$$\mathcal{L}^\lambda_{\theta,\phi,\nu} = KL\left(q_\theta(z \mid x) \,\|\, p(z)\right) + \|\nu\|_0 + \lambda \cdot E_z\left[C(x, g^\phi_{dec}(z \odot \nu))\right]. \tag{14}$$
Lines 14 to 26 in Algorithm 1 comprise the GECO optimization details, as implemented by Rezende and Viola [27]. In line 22 we calculate both the constraint term and the KL regularization. For the latter we have introduced the notation $KL_{elem}$ to denote the vector of element-wise KL components.
More specifically, given that $(\mu, \log\sigma) = f^\theta_{enc}(x)$, parameterizing a multidimensional Gaussian distribution with mean $\mu$ and diagonal covariance $\mathrm{diag}(\sigma^2)$ as explained in Equation (3), and $p(z)$ a standard normal distribution, the $j$'th component of the element-wise KL is calculated as:
$$KL_{elem}^{(j)} = \tfrac{1}{2}\left(\mu_j^2 + \sigma_j^2 - 2\log\sigma_j - 1\right). \tag{15}$$
Since the right-hand side is always non-negative, line 22 is just a compact way of writing the average KL divergence, thereby taking into account the pruned components. After all, once a dimension is eliminated, its KL divergence should not weigh on the optimization loss anymore. Regarding the constraint term, we maintain a moving average which approximates the expectation of the constraint w.r.t. $z$, as defined in Equation (14). The parameter $\alpha$ is set to control the rate of the moving average, with a value typically just slightly lower than 1.

Algorithm 1: Optimization procedure to shrink the bottleneck dimensionality of a VAE using GECO and $L_0$-ARM gradient estimation.

For efficiency reasons, we stop the gradients from flowing through this moving average, i.e. only the current constraint (at step $i$) is involved in backpropagation. We also apply a monotonic squared softplus operation to $\lambda$ to make sure that the multiplier is positive and is able to 'move fast', i.e. to decrease quickly once the constraint is met, and vice versa. We also clamp the values of $\lambda$ such that they cannot become arbitrarily small or large. This is needed to prevent $\lambda$ from growing exponentially. In line 24, we add the $L_0$ regularization term to the overall loss when the constraint is satisfied for the current data point, i.e. we only allow gates to close themselves when we have some wiggle room and are not violating constraints.
This is needed to stabilize the optimization procedure: if we are in the process of eliminating a dimension by gradually decreasing the corresponding γ component, the number one priority is maintaining the required reconstruction loss. If this loss would increase too much, we should not proceed, but instead focus on improving the model parameters.
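The bookkeeping around the multiplier and the constraint can be sketched as follows. This is an illustrative reading of the description above rather than the original implementation: the function names are hypothetical, the default values of `lam_min`, `lam_max` and `alpha` are taken from the experimental settings in Section 4, and the gate-weighted sum in `masked_kl` is one assumed way of excluding pruned dimensions from the KL term.

```python
import math

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def effective_lambda(raw, lam_min=1e-5, lam_max=5.0):
    """Squared softplus keeps the multiplier positive; clamping bounds its range."""
    return min(max(softplus(raw) ** 2, lam_min), lam_max)

def update_moving_average(c_ma, c, alpha=0.99):
    """Exponential moving average of the constraint value."""
    return alpha * c_ma + (1.0 - alpha) * c

def masked_kl(kl_elem, gates):
    """Sum the element-wise KL over open gates only (assumed masking scheme)."""
    return sum(k * g for k, g in zip(kl_elem, gates))

lam_small = effective_lambda(-100.0)      # clamped from below
lam_large = effective_lambda(10.0)        # clamped from above
c_ma = update_moving_average(1.0, 0.0)    # moves slowly toward the new value
kl = masked_kl([0.5, 0.2, 0.1], [1, 0, 1])
```

In an automatic differentiation framework the moving average would additionally be wrapped in a stop-gradient operation, so that only the current constraint value contributes to backpropagation.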
The remaining lines of Algorithm 1 constitute the $L_0$-ARM gradient estimation of the gating vector. First, notice the hit variable, which is set to True once the constraint is satisfied. At the start of training the gates are therefore wide open, and only when the constraints are met for the first time do we start the $L_0$-ARM optimization. The parameter $k$ in lines 8 and 9 is introduced in order to scale the sigmoid function; a typical value of $k > 1$ will make the sigmoid shape steeper, thereby allowing faster transitions between open and closed gates. Most operations are self-explanatory and are in line with Equation (13). We do want to point out that we only need to sample a single latent vector from the encoder, but that we need two forward passes through the decoder in lines 12 and 13. The resulting reconstruction errors are then used to update the gradients for $\gamma$ in line 29. In this equation, according to the chain rule, we need to multiply by $k$ and by $\lambda$. Finally, in the last line, all parameters are optimized using Adam, or any other gradient descent flavor.
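The gate-related part of the procedure can be sketched for a single data point as below. This is an assumption-laden illustration, not a reproduction of Algorithm 1: `gate_gradient`, `identity_decoder` and the toy inputs are hypothetical, the decoder is a placeholder function, and the $L_0$ term is taken to be the smoothed gate sum $\sum_j \sigma(k\gamma_j)$, added only when the constraint is satisfied.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def gate_gradient(x, z, decode, gamma, k, lam, constraint_ok, rng):
    """Estimate the gradient of the loss w.r.t. the gate logits gamma for one sample."""
    u = [rng.random() for _ in gamma]
    # Two antithetic gate vectors built from the same uniform noise (ARM).
    nu_a = [1.0 if ui > sigmoid(-k * g) else 0.0 for ui, g in zip(u, gamma)]
    nu_b = [1.0 if ui < sigmoid(k * g) else 0.0 for ui, g in zip(u, gamma)]
    # Two decoder passes on the same latent sample z.
    err_a = squared_error(x, decode([zi * ni for zi, ni in zip(z, nu_a)]))
    err_b = squared_error(x, decode([zi * ni for zi, ni in zip(z, nu_b)]))
    grad = []
    for ui, g in zip(u, gamma):
        s = sigmoid(k * g)
        arm = lam * k * (err_a - err_b) * (ui - 0.5)      # chain rule: factors k and lambda
        l0 = k * s * (1.0 - s) if constraint_ok else 0.0  # gradient of sigmoid(k * gamma_j)
        grad.append(arm + l0)
    return grad

def identity_decoder(v):
    """Hypothetical stand-in for the decoder network."""
    return list(v)

rng = random.Random(0)
x = z = [1.0, 2.0]
g_closing = gate_gradient(x, z, identity_decoder, [0.42, 0.42], 7.0, 0.0, True, rng)
g_frozen = gate_gradient(x, z, identity_decoder, [0.42, 0.42], 7.0, 0.0, False, rng)
```

With `lam` set to zero only the regularization gradient remains: it is positive while the constraint is satisfied, so a descent step lowers the logits and pushes the gates toward closing, and it vanishes when the constraint is violated. With a non-zero `lam`, the ARM term counteracts closing any gate whose removal increases the reconstruction error.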

Experiments
We will evaluate our methodology in a variety of settings on five standard image datasets. The first three are typically used to evaluate VAE performance and latent space disentanglement. The last two datasets contain realistic images, are not artificially constructed, and have been extensively used in the past to benchmark different varieties of machine learning models.
- dSprites (Shapes2D) [19]: 64×64 images of white 2D shapes on a black background. There are five degrees of freedom: shape (square, ellipse or heart), scale, rotation, x-position, y-position.

We use a standard convolutional neural network as architecture for the encoder $f^\theta_{enc}$. For input color images of 64 × 64 × 3 pixels, we use four convolutional layers with stride 2 and kernel sizes of 4 × 4 pixels for the first layer and 3 × 3 pixels for the subsequent layers. Each of these layers is followed by a leaky ReLU activation with leakiness 0.02 and a BatchNorm layer with momentum 0.8. The output of the final convolutional layer is flattened and used as input for a linear layer with 128 output neurons. A final linear layer is applied which calculates both a mean and variance prediction; this layer therefore contains $2 \cdot n$ output neurons, with $n$ the dimensionality of the latent space. For MNIST images of 28 × 28 × 1 pixels, there are only three convolutional layers with 4 × 4, 3 × 3 and 2 × 2 filters, respectively.

The decoder architecture is a mirrored version of the encoder network. Two linear layers transform the latent vector into the input of four consecutive convolutional layers. Each of the convolutions has a 3 × 3 kernel with stride 1, and is preceded by a BatchNorm layer and an upsampling operation which increases the image size by a factor of 2 through nearest neighbor interpolation. Between each layer we apply a leaky ReLU function, and the final activation is a regular sigmoid to obtain pixel values between 0 and 1. For MNIST we again have three convolutional layers, with kernel sizes of 2 × 2, 3 × 3 and 3 × 3, respectively.
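To get a feeling for how the strided convolutions shrink the feature maps, the spatial size after each encoder layer can be computed with the standard convolution output-size formula. The padding of 1 used below is an assumption, since the text does not state the padding; with a different padding the intermediate sizes change.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size, kernels, stride=2, padding=1):
    """Feature-map size after each strided convolution (padding is assumed)."""
    sizes = []
    for kernel in kernels:
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

# 64 x 64 inputs, one 4 x 4 kernel followed by three 3 x 3 kernels, all with stride 2
sizes_64 = encoder_sizes(64, kernels=[4, 3, 3, 3])
```

Under this assumption each layer halves the resolution, so the final feature map that is flattened for the linear layers measures 4 × 4 pixels per channel.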
In all experiments we set $k = 7$, $\alpha = 0.99$, $\lambda_{min} = 10^{-5}$, $\lambda_{max} = 5$, batch size 64, learning rate $10^{-3}$, and we use Adam as optimizer. We initialize $\lambda = 1$ at the beginning of training, and all components of $\gamma$ are initialized to 0.42 such that $\sigma(k \cdot 0.42) \approx 0.95$, which means that there is a probability of 95% that a gate is open. This is done to ensure training stability.
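The initial gate probability follows directly from the scaled sigmoid and can be verified in a few lines. The latent size `n = 10` below is only an example value used to show the expected number of open gates at initialization.

```python
import math

k = 7.0             # sigmoid scale used in all experiments
gamma_init = 0.42   # initial value of every gate logit
n = 10              # example latent dimensionality

p_open = 1.0 / (1.0 + math.exp(-k * gamma_init))  # sigma(k * 0.42)
expected_open = n * p_open                        # expected number of open gates
```

Because every gate starts with the same logit, almost all gates are sampled open in the first batches.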

Visualizing the training procedure
In a first experiment we will track the behavior of the training procedure by observing four different quantities through time: the mean squared reconstruction error (MSE), the number of gates that are open or closed, the Lagrange multiplier $\lambda$, and the fraction of data points within a batch that satisfy the constraint. These graphs are shown in Figure 1, for which we have trained a VAE on dSprites for 50K batches, with $n = 10$, i.e. the original (and therefore maximum) dimensionality of the VAE. At the start of training we see that $\lambda$ first increases until the MSE dives below the threshold $\tau$, after which it drops to a small value. This has the effect that the regularization terms gain more importance, and after ca. 10K batches we notice that the gates are gradually starting to close themselves, which is visible in the descending staircase pattern in the upper right plot. This plot shows the $L_1$ norm of the smoothed gate vector $\sigma(k\gamma)$, which is essentially the sum of the vector components since these are all positive. The y-axis of the graph therefore represents a continuous measure of the number of open gates. Each plateau in the graph corresponds to an integer number of open gates, so we see that after 10K batches there are 9 open gates, which drops to 6 gates at 20K batches, and to 5 gates at 25K batches. Of course, when more gates are closed, the reconstruction errors are expected to increase, which can be observed in the upper left graph, and the GECO tolerance hit/miss ratio will drop. Whenever the MSE is above $\tau$, the multiplier $\lambda$ will increase in order to temporarily attribute more weight to the reconstruction error. We can visually distinguish two such increases around 25K and 35K batches in the lower left plot. As a final remark, we want to bring to attention the oscillatory behavior of the gates after 40K batches.
These oscillations arise because, in the process of closing an additional gate, the constraint is violated severely, which causes $\lambda$ to increase in order to lower the MSE at the expense of the regularization terms, thereby opening the gate again. In other words, the model is constantly trying to close a gate, but it is obstructed by the GECO mechanism, which does not allow any heavy constraint violations.
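The continuous gate count plotted in the staircase graph, i.e. the $L_1$ norm of $\sigma(k\gamma)$, and the hard gate count used at inference time can be computed as in the sketch below. The logit values are made up for illustration and the function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def soft_open_count(gamma, k):
    """L1 norm of the smoothed gate vector sigma(k * gamma): a continuous gate count."""
    return sum(sigmoid(k * g) for g in gamma)

def hard_open_count(gamma, k):
    """Number of gates kept open at inference time (sigma(k * gamma_j) > 0.5)."""
    return sum(1 for g in gamma if sigmoid(k * g) > 0.5)

gamma = [1.0] * 5 + [-1.0] * 5   # five clearly open and five clearly closed gates
soft = soft_open_count(gamma, k=7.0)
hard = hard_open_count(gamma, k=7.0)
```

When the logits are well separated from zero the two counts nearly coincide, which corresponds to the plateaus in the graph; a gate that oscillates around $\gamma_j = 0$ shows up as a non-integer value of the soft count.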

Visualizing the information bottleneck
In a next experiment we will vary the GECO tolerance across a range of values, and observe the final bottleneck dimensionality after training. This will allow us to test the hypothesis that an increased tolerance leads to a tighter bottleneck and vice versa. We perform tests on the MNIST dataset, for which we set the initial dimensionality $n = 32$. The GECO tolerance is set to different MSEs of 3, 4, 6, 8, 10, 14, 18, 22, 26 and 30, and we train for 2000 epochs. The resulting bottleneck Pareto front is shown in Figure 2, and we can clearly observe the inverse relationship between the threshold $\tau$ and the number of retained latent dimensions $\|\nu\|_0$. For $\tau = 3$ the model learns to retain 27 out of 32 dimensions, while for $\tau = 30$ we only need 2 dimensions. For this specific dataset, neural architecture and training procedure, Figure 2 makes the information bottleneck Pareto front very visual and tangible. Next to the plotted bottleneck line, we also show reconstructions of a given image of the number 7 for a few selected points on the graph. We can clearly see that the quality of the reconstructions deteriorates as we increase $\tau$ (for $\tau = 30$ the reconstruction resembles a 1 rather than a 7) and that they also become blurrier. Based on visual samples, it is relatively easy to select the GECO threshold one is comfortable with for the application at hand.

Comparing reconstruction qualities
In the next experiment we train two VAEs on the dSprites dataset with different settings for the GECO threshold: $\tau = 20$ and $\tau = 35$ (in MSE). For $\tau = 20$ this results in 5 latent dimensions, while $\tau = 35$ gives 3 dimensions. As mentioned in the beginning of this section, the dSprites dataset is constructed with 5 degrees of freedom (shape, scale, rotation, x- and y-position). In Figure 3 we can observe that in the middle row all five variables are being used to reconstruct the original image. In the bottom row, however, we only see blurred circles at the positions where the original shapes are located. This means that both shape and rotational information is being discarded in the reconstructions, which indeed corresponds to the 3 latent dimensions that have been retained instead of 5. While this line of reasoning may sound logical, it is far from a rigorous proof that the 3 dimensions indeed each correspond to one of the remaining degrees of freedom (scale, x- and y-position). Since we are not explicitly striving for disentangled latent spaces in this work, we will not go deeper into this matter.

Evaluating compression efficiency
In this final experiment we want to find out how efficient our method is at pruning latent dimensions. For this purpose, we devise an approach in which we first train a baseline regular VAE with a predefined number of latent dimensions $n$ and with a GECO tolerance of 0 MSE, which means that we want to squeeze out every bit of reconstruction quality there is to gain, thereby sacrificing KL regularization. We give the model a training budget of 200K batches, after which we record the moving average (with factor 0.95) of the final reconstruction error. Then, in a second step, we train a new VAE with $L_0$ regularization and GECO that we provision with $2n$ latent dimensions, and we observe to what extent this number can be lowered to $n$, using the reconstruction error obtained in step one as the GECO tolerance. We do this for all five datasets, for values $n = 5, 10, 30$, and for training budgets of 200K, 300K and 400K batches. The resulting figures are presented in Table 1. For most configurations, our approach is able to prune the bottleneck very effectively, coming close to the original value of $n$, sometimes achieving the same value or even surpassing it. This is often true for $n = 30$ (except for CIFAR-100), which shows that the original VAE was probably overprovisioned. We also observe that we are coming close to the optimal number of dimensions already after a training budget of 200K batches, i.e. the same budget with which we have trained the original VAE. This is probably thanks to the higher initial dimensionality of $2n$, which allows the reconstruction error to drop more quickly below the GECO threshold at the start of training, after which we can start pruning the bottleneck. Only in a select number of cases (again, mostly for $n = 30$) are we able to further reduce the dimensionality significantly if we continue training for 400K batches.
In summary, Table 1 shows that our method is widely applicable in different settings and that it can provide near-optimal dimensionality reductions without incurring a large additional training budget.

Conclusion
In this work we have devised an algorithm to reduce the bottleneck dimensionality of VAEs on-the-fly during training. For this purpose we have used $L_0$ regularization and the $L_0$-ARM gradient estimator to train a gating mechanism that either blocks or passes information for each latent factor independently. To decide how many gates should be open or closed, we employ GECO to define constraints as upper bounds on the reconstruction error. In the experiments we show that our algorithm is effective at reducing the latent dimensionality on five different datasets. It is also a useful tool to assess whether a VAE bottleneck is over- or underprovisioned. The only downside of the algorithm is the (small) computational overhead that comes from a second forward pass through the decoder network, as required by $L_0$-ARM. In future work one can look at applying the method to all intermediate representations [7], and not only to the encoder output. It would also be interesting to see if it can be applied in recommender systems or VAE-based state space models for control tasks, e.g. in robotics.