ASAP – A sub-sampling approach for preserving topological structures modeled with geodesic topographic mapping q

Topological data analysis tools enjoy increasing popularity in a wide range of applications, such as Computer graphics, Image analysis, Machine learning, and Astronomy for extracting information. However, due to computational complexity, processing large numbers of samples of higher dimensionality quickly becomes infeasible. This contribution is twofold: We present an efﬁcient novel sub-sampling strategy inspired by Coulomb’s law to decrease the number of data points in d -dimensional point clouds while preserving its homology. The method is not only capable of reducing the memory and computation time needed for the construction of different types of simplicial complexes but also preserves the size of the voids in d -dimensions, which is crucial e.g. for astronomical applications. Furthermore, we propose a technique to construct a probabilistic description of the border of signiﬁcant cycles and cavities inside the point cloud. We demonstrate and empirically compare the strategy in several synthetic scenarios and an astronomical particle simulation of a dwarf galaxy for the detection of superbubbles (supernova signatures). (cid:1) 2021 The Authors. Published by Elsevier B.V. ThisisanopenaccessarticleundertheCCBYlicense(http:// creativecommons.org/licenses/by/4.0/).


Introduction
Topological data analysis (TDA) provides exploration tools for increasingly diverse applications in various domains, ranging from Biology and medical images [1], mapping disease spaces [2], and Astronomy [3].Persistent homology (PH) is a TDA technique for computing the properties of shapes of a finite metric space (also called point cloud dataset) and can capture these features in an extended range of scales.Nonetheless, as the number of points or the dimensions of a dataset increases, the computation of the PH soon becomes impractical.
Numerous methods and toolboxes provide novel approaches to tackle the problem of computational costs.Sparse Rips filtration [4] builds an -net on top of the point set followed by an association of weights to each node, which results in a provably good approximation of the full data Rips filtration.In [5] two new atomic operations for efficient computation of PH are suggested, and SimBa [6] combines these two strategies to reach a higher sparsity in the number of simplices, which increases the efficiency for computation of Rips filtration.The toolbox Ripser [7] decreases the computational costs by avoiding to build the complete coboundary matrix building and storing only the parts needed.This improves the memory consumption and reduces the computational time.These methods are limited to Rips and are not extendible to other types of filtration.
A general concept for scaling down the computation independent of the filtration was reported in [8] proposing to subsample the data randomly repeatedly and construct an average landscape for the point cloud.Although their approach can be applied for constructing all types of filtration, it is sensitive to the distribution of the data on the structures as a consequence of random sampling.MaxMin [9] was introduced as another intuitive sub-sampling approach.By selecting a random data point as the first sample, it continuously picks the next sample point that has the longest distance to the previous samples until the desired number of samples is achieved.Although sampling using this method usually achieves more uniformly spaced distribution of points than random sub-sampling [8], it does not provide any information about the range or distance between samples, and final results may vary a lot dependent on the starting point.
When applying a filtration to a point set, topological features appear and disappear (referred to as birth and death) by increasing the filtration parameter value.Topological features exhibiting a short lifetime are considered as topological noise in some applications, as explained in [10].They introduced the confidence band inspired by the p-value definition from statistics.Using this definition, one can distinguish between properties which belong to the point cloud and do not emerge as artifacts due to the subsampling of the data.Furthermore, the persistence diagram does not provide any information about the location of these features inside the point cloud.This location information is essential in several applications, such as medical image segmentation [1], detecting voids in the cosmic web [3] and supernovae in galaxies.In [1] a technique for positioning persistent 1-cycles was introduced, which is not easily extendable for locating cavities and higher dimensional properties.Moreover, Dionysus [11] can also record the boundaries of a topological feature during the computation of PH.In [3], the authors use this toolbox to locate the voids and filaments in the Cosmic web.The recovered boundary, however, is often not fully located on the border of a hole or cavity and varies with repeated sampling over the point cloud.They furthermore construct the filtration on top of a 3D grid and then compute the distance-to-measure function [12] for every point on the grid.As a consequence the boundary points also fluctuate by changing the grid size.We will discuss the above mentioned problems in detail in Section 2.3.
While simplicial complexes and filtrations are useful for producing clean representations of noise-free data sets, they are not as effective when applied to intrinsically noisy structures.In these cases, a probabilistic description of the low dimensional structures is desirable, as a way to capture the underlying nature of the observed data.Existing techniques are for example nonparametric density estimators, such as Parzen windows [13], its extensions Manifold Parzen Windows [14] and Fast-Parzen Windows [15] or semi-parametric generative models like the Infinite Gaussian Mixture Model [16].However, despite fitting observed points with high accuracy those techniques are blind towards the low-dimensional nature of the structures and often the computational costs for training and evaluation is prohibitive.As an alternative, Generative Topographic Mapping (GTM) [17] models a noisy manifold as a low dimensional, linear, latent space embedded in the ambient space through a non-linear mapping function.The corresponding noise is defined as a Multivariate Gaussian Mixture Model (GMM) [18], with centers constrained to lie on the embedded latent space.However, despite the non-linearity of the mapping function, classical GTM is insufficiently flexible to model cavities and holes, which are non homeomorphic to a linear subpsace.
Physical particle simulations are one way of investigating astronomical phenomena such as galaxies and supernovae.Radiation and winds from massive stars at the end of their life can greatly affect the dynamics of gas in the interstellar medium (ISM) and in turn, change the structure of the galaxy and its ability to create new stars.Dwarf galaxies are very sensitive to the physical processes determining their evolution due to their low mass and are therefore used as probes to characterize, study and isolate them in simulations.Similar to real dwarfs simulated irregular galaxies have a very clumpy ISM and holes due to supernovae visible in the gas density distribution [19,20].The characterization of the distribution of supernova shells in the ISM (so-called superbubbles), and the energies of the expanding shells [21,22], can shed light on the feedback physical processes.Superbubbles are of great astronomical interest but typically measured by eye in available catalogues and automatic tools are highly desirable.
This contribution extends A Sub-sampling Approach for Preserving topological structures ASAP2 [23], that reduces the computational cost suitable for different types of PH filtration, general ddimensional point clouds, and large number of samples.In this paper the structures found by subsequently performed PH are statistically analyzed to determine their robustness.Additionally, we propose a strategy to provide a probabilistic description of the shell of these bubbles, which, in our astronomical application, provides additional information about the supernovae borders and the stars that shape these borders.In order to fully capture the properties of such cavities, taking advantage of their low dimensional nature, we propose a modified version of the GTM: geodesic GTM (gGTM).Through this formulation, the topological features of the modelled structures are accounted for by embedding a closed low dimensional latent space onto the ambient space of the point cloud.Through the new latent space formulation we are finally able to interpret the topological structure of manifolds, embedded in higher dimensional spaces, while still capturing their natural stochasticity.
In the following, the novel sub-sampling strategy, statistical analysis, and probabilistic description is explained in detail.We then compare to state-of-the-art methods in several controlled experiments and finally investigate a snapshot of an astronomical particle simulation by computing the number and size of superbubbles within a jelly-fish like dwarf galaxy.

Methods
This section consists of three main parts.First, we describe the sub-sampling procedure followed by the calculation of the confidence band on the PH plot and its interpretation.Every time the point cloud is sampled, we extract the boundaries of significant features in the PH plot.Finally, as the boundary points fluctuate between samples, we suggest a probabilistic description of the border of cycles and cavities in the PH plot.

The sub-sampling approach ASAP
Computing the PH for the analysis of the evolution of shapes across different resolutions is often prohibitive due to the combinatorial nature of existing algorithms complexity, in both time and space.Therefore, we propose a two-stage strategy based on sub-sampling and Coulomb's law [24].As described before, we first sub-sample points from the point cloud data set N (finite metric space) to reduce the amount of computation time and memory.The subset N r & N aims to contain fewer points s 2 N r for which the persistence diagram D N r ð Þ approximates the persistence diagram D N ð Þ of the full data.Therefore the set N r has to satisfy the following two conditions [4] checked in every step: > r 8s i ; s j 2 N r with ij: We satisfy (1) by selecting a random point s i , insert it to N r and remove all points p j È É from N belonging to an open ball centered around s i with radius r: The process is repeated until the point set N is depleted, implicating that all points are covered by at least one open ball of a sample point in N r .Due to the removal of points in every step, the packing condition is also fulfilled for all remaining points with distance larger than r from s i in N.
The sub-sampling strategy fulfils both necessary conditions, but the result is not completely uniform, and the pairwise distance of any sample point pair is between r and 2r.However, it is more desirable to have sample points equidistant from each other forming a uniform grid.As a result, we expect when all points on its boundary connect to each other it coincides with the birth time of the void.Moreover, in astronomical applications it is crucial to measure the size of the cycles, cavities and streams as accurately as possible, for which N r needs to contain the borders of the data.Therefore we propose an extension to the sampling inspired by the movement of identical electrical particles, such as electrons, on the surface of a conductive sphere [24].The electrons will repel each other based on Coulomb's law and approximate a uniform distribution.To take advantage of this physical repulsion force each sample is repelled by neighbouring samples by where the set N i consists of sample points in 2r radius of s i and c denotes the learning rate.If neighbouring points are far from s i the force will be low, and the learning rate controls the strength of the movement.The appropriate range for the displacement is between 0:1r; r ð Þ , since the effect of smaller movements is negligible and larger movements result in s i intruding positions already covered by other samples.The learning rate is gradually reduced in every step t following such that the samples converge to the new positions.s is a constant which determines the decay rate of the learning rate.Instead of moving the samples itself we take the closest point in the original set ŝi 2 N to the displacement position as substitute for shows the open ball cover after random sampling.The balls of N r after the update using the repulsion force are illustrated in (c) achieving a more uniform grid that covers all boundaries as desired.
Algorithm 1: ASAP a sub-sampling approach preserving topological structures The computational complexity of Algorithm 1 depends on the number of times that repulsion forces are iterated, which depends on the data and the learning rate.With our choice of learning rate we typically observe about 10 iterations in our experiments.Formally, assuming the while loop on repulsion forces iterates k times in the d-dimensional data set that contains jNj points over the maximum number of samples denoted by D s , the worst case complexity can be written as O kdjNjD s ð Þ .Here we discuss the general case, however, in our implementation we employ k-d trees [25] for the neighborhood search in Eq. ( 1), ( 2) and ( 4), which reduces the computational complexity for pairwise distances from squared to log linear.

Confidence bands of significant features
The persistence diagram illustrates the birth and death time of topological features for a unique point cloud.These features in every dimension represent a specific property of the dataset, such as connected components (H 0 ), holes (H 1 ), cavities (H 2 ), etc.As a result, the derived persistence diagram of sampled data D N r ð Þ does not entirely resemble the persistence plot of the point cloud D N ð Þ.The Bottleneck distance is a metric of comparing two persistence diagrams [26], and it is defined as follows Here l is a bijection that maps every feature point The diagonal line where the birth and death time of features are identical is assumed to include an infinite number of points such that if the number of feature points in the persistence diagram of N and N r is not the same, the extra points are paired with the points on the diagonal line.
Since the persistence diagram varies for distinguished sets of samples, a confidence band was introduced to separate significant topological features from noise [10].To this aim, we follow the bootstrap procedure as described by [10].However, we either sam-ple the data based on Random Sub-sampling Method [8] (abbreviated by RSM in the following), MaxMin [9], or our novel method ASAP.Next, for a given significance level a 2 0; 1 Þ confidence set determines the region on a persistence diagram where we detect topological signal in the dataset with 1 À a confidence.According to [10], a confidence band could be included in persistence diagram with a band width of ffiffiffi 2 p c n , which specifies if a point in diagram should be considered as signal or noise.

Locating cavities and cycles
Following distinguishing significant topological signals and measuring their radius size, the other on-demand information is to identify and describe the point set that builds the observed feature.For instance, as explained in the introduction, a decisive step in observing a supernova in a simulated galaxy is to describe its shell or boundary.To this aim, we took advantage of the Dionysus toolbox [11], which also records the location of the topological feature generator during the computation of the persistence diagram.However, this method returns the boundaries of cycles or cavities that even may contain points outside the border of the structure.Fig. 2(a) exemplifies one cycle boundary detected by [11], and indeed the boundary of the cycle misses some border points of the hole and invades the structure, even in this ideal situation.
One way of overcoming this problem is to locate the boundary of the same cycle in every taken sub-samples during the bootstrap procedure.Consequently, all parts of the border of a hole are recorded through sampling and locating the same hole several times.We collect the boundary points retained through such Finally, we sort the points by their vote value and save the ones with the upper quartile v 3 of the votes.Fig. 2(b) shows the result of the voting operation (100 times) on the single cycle inside the data that recovers the border very satisfactory.

Building probabilistic models of cavities
Having identified the cavities, we would now like to capture their shape and location in the form of a probability density model aligned with the individual cavities sampled by points b i .Given the low dimensional nature of such topological features, we will model the probability distribution as the Generative Topographic Mapping (GTM) [17].The main idea behind GTM is to represent inherently low-dimensional structures (manifolds) embedded in a higher dimensional space by constructing a mapping via Radial Basis Functions (RBF) from a linear latent space in R ' (' > 0 being the intrinsic dimension of the manifold) to the embedding space R D .With classical GTM, the resulting embedded manifold is always a ''stretched" and ''bent" version of the linear latent space.While such a model has been shown to be very useful in capturing densities aligned along non-linear embeddings of linear spaces (e.g.deformed ''sheets of paper" embedded in R D with D P 3) [27], it cannot naturally capture closed manifolds such as cycles and spheres.To make GTM applicable to our case we need to modify the latent space definition.The resulting density modelling algorithm is in the following referred to as of ''geodesic GTM (gGTM)".
Let us concentrate on the case of holes with their corresponding spherical latent space.Consider the sphere centered at O ¼ 0; 0; 0 ð Þ2R 3 , having radius r ¼ 1 (unit sphere).Every point x on the surface of the sphere is uniquely determined by a pair of angular coordinates: h and k where, by definition of spherical coordinates, we have: The geodesic distance between any pair of points x i ; x k 2 I ' \ is given by (e.g.[28]) where r is the radius of the unit sphere: r ¼ 1 and DX is the central angle under the segment of great circle connecting x i and x k : where In the spirit of the original GTM [17] we can now define a regular grid of size M on the angular interval I ' \ , placing an RBF on each node c m ; m ¼ 1; . . .; M of the grid.The radial basis function centered on c m is: where r > 0 is a scale parameter.The only difference with [17] is the replacement of the Euclidean distance with the geodesic distance defined in Eq. ( 7) (hence, the name geodesic GTM).Every point on the unit sphere is then mapped to the ambient space by the function: where the elements of / x ð Þ are the M RBFs as defined in Eq. ( 9) and W is the weight matrix of dimension D Â M.
The gGTM model will be trained on the ''robust" set of boundary points, i.e. those that accumulated enough votes in the resampling voting scheme described in the previous section.We collect such points (exemplified in Fig. 2b) in the set Q ¼ t 1 ; . . .; t l f g , where v t i ð Þ P v 3 for all i ¼ 1; 2; . . .; l.To initialise the weight matrix W we first take advantage of a physics-inspired diffusion algorithm (SAF, [29]) that collapses points t i 2 Q towards high density regions in their proximity, thus sampling closer to the ''mean" of the noisy manifold.The resulting data set Q ¼ t1 ; . . .; tN È É is the diffused version of the data Q.
We then estimate the mean radius and the boundary centre as: The weights in matrix W can be initialised by building a refined \ and defining latent points The latent points are then mapped to the embedding space by applying the transformation between spherical and Cartesian coordinates: The weights in matrix W are set through linear regression so that y x i ; W ð Þ%n i for all corresponding points x i 2 I ' \ and n i .For every point x i in the latent space, the orthogonal vector to the spherical surface computed at the embedded point y x i ; W ð Þis: A pair of tangent vectors to M orthogonal to n and spanning the tangent space T M n i ð Þ to M at the point y x i ; W ð Þcan be recovered by differentiating n i (Eq.( 13)) w.r.t.h and k.After normalization to unit length, we obtain We can now define the noise model for our gGTM by constructing covariance matrices C i of multivariate Gaussians centered at images y x i ; W ð Þ of the latent centers.In particular, we construct C i proportional to the matrices having as Eigenvectors u i and v i : Here, 0 < b < 1 is a regularization term, I the identity matrix and g a scaling factor proportional to the distance between neighbouring nodes of the embedded grid.The manifold aligned probabilistic model takes the form of a constrained mixture model: where each component p tjx i ; W; b; g ð Þis defined by: The parameters can be trained via the Estimation Maximization (EM) algorithm [30], as described in [31] for a manifold-aligned noise model.In the case of circles, the gGTM model is constructed analogously.The only difference is that the latent space has circular structure parametrized by onedimensional angular interval I '

Experiments
In this section, we address the comparison between ASAP as a preprocessing step for building a simplicial complex and other state-of-the-art methods that were proposed to decrease the resources needed for computing the TDA.To compute the PH of point sets we mainly use GUDHI [32], which is faster and more memory efficient due to the structure of the simplex tree than many other toolboxes [33].In order to obtain the Rips filtration on datasets with larger number of points, we build the simplicial complex using Ripser [7].We first discuss controlled experiments with known ground truth, followed by the results of ASAP on real-world data from an astronomical galaxy particle simulation.

Synthetic data with ground truth 2circles dataset
We first experiment on a simple twodimensional dataset which was introduced in [8] to demonstrate several sub-sampling methods for PH.500 points are distributed uniformly on two circles with radius 1 and 4. For comparison we subsample 100 times with each method, saving the minimal point set required that recover the known features outside the 95% confidence band in the PH.We record the performance in form of several different evaluation measures, namely the average number of constructed simplices, as well as Median Relative Dominance (MRD) and Median Absolute Dominance (MAD), as summarized in Tables 1 and 2. Fig. 3(d) shows the persistence plot for Alpha filtration of the samples selected based on ASAP, RSM [8], and Max-Min [9].Each point illustrates the birth-death time of a topological feature of the point cloud.The points for features of homology groups H0, H1 are presented in red and blue, respectively.The death time of the features with Betti number 1 shows the correct value for the radii of both circles.The two red dots outside the confidence band manifest the data consists of two separated parts.Furthermore, the two blue dots outside the confidence band imply the existence of two significant holes in the dataset.The figure was denoised by removing points with minimum persistence (death time -birth time for every feature) smaller than threshold 0.5.
Notably, the sub-sampling suggested in [8] can only reduce the number of samples to 175 points that preserve the persistence of all features of the original dataset, while MaxMin and the proposed method ASAP can reduce the original point set to only 30 and 25 points, respectively.However, comparing panel ASAP (a) and Max-Min (c) shows that the former samples are more uniformly spaced than MaxMin on both circles and therefore the PH over repeated runs is typically more robust.Furthermore, due to covering and packing conditions of ASAP, the distance between every two neighboring samples is not smaller than r and bigger than 2r if enough data points lie in the space between two samples.The same conditions does not hold for MaxMin samples.
In an additional experiment, we also compared the three subsampling methods reducing the number of points consecutively while observing the resulting death time of the small circle, with the result shown in Fig. 4. Since we know the data points form a circle, the death time also stands for the radius of the circle.Note, that the circle with radius 4 contains more samples and hence is robust for all sub-sampling methods.We repeat the MaxMin and RSM 10 times as suggested in [8], and the number of points in each sub-sample of ASAP corresponds to hyperparameter r ranging from 0:1; 1:1 ½ .We are more lenient in this experiment, observing the death time without considering if the feature stands out of the 95% confidence band or not.For both Alpha and Rips filtration (Fig. 4 (a) and (b)) ASAP and MaxMin preserve the death time of the smaller circle with much fewer points than RSM, as indicated by the shorter blue line.Moreover, in both plots by decreasing the number of points, the death time is mostly increasing.In panel (a) the changes are not dominant, but in panel (b) the change is more visible with MaxMin diverging more from the baseline (depicting the true death time of the original small circle) for equal sample size.
2Spheres To compare the methods in higher dimensions we distribute points non-uniformly and unevenly on two hyper-spheres in R 5 with radius 1 and 2, in the following referred to as 2Spheres dataset.Even though the data consists only of 1200 points the computation of Rips filtration is very memory consuming due to dimensionality.Note that the code for efficient Rips filtration with SimBa [6] only returns the Betti numbers up to 3 dimensions.We sub-sample the point cloud based on ASAP, RSM [8], and MaxMin [9] and construct the Alpha complex on the resulting sub-sets.We iterate the sampling procedure 100 times and present the persistence diagram and barcode plot that is similar for all three methods in Fig. 5. Since the data is not uniform RSM [8] cannot preserve its homology outside the 95% confidence band if we reduce the number of samples to less than 1000 points.On the other hand, ASAP with radius 0.58 preserves not only the homology of the data in R 5 with an average of 686 sub-sampled points, but also the death times for Betti number 4 corresponds to the radii of the spheres.MaxMin can also recover the same two features with a slightly lower number of samples 680.This is possible since MaxMin takes the number of samples as a parameter, while ASAP controls the number indirectly by the radius parameter.The persistence diagrams and barcode plots of all three methods are mostly identical if RSM is allowed enough samples, except for the small difference in the confidence interval, and therefore only the result of ASAP is displayed.The persistence barcode was furthermore denoised by removing properties with a minimum of birth-death time below 0.2 to make the plot more readable.
Synthetic dwarf galaxy Finally we create a synthetic data-set mimicking some main characteristics of our astronomical applica- tion, in the following referred to as s.dwarf.In total 9656 nonuniformly distributed points in R 3 form a synthetic dwarf galaxy containing 2 cavities with different size, 3 cycles: two with the same radius and one with a slightly bigger radius contained within a half spherical head, as well as a connected and a separated stream as shown in panel (a) of Fig. 6.As demonstrated in panel (b), the persistence diagram of Alpha filtration on 551 subsamples (only 5.7% of the original set) with ASAP (r ¼ 0:15) preserves the main features of the original data and also maintains the radii of cycles and cavities.The RSM [8], on the other hand, can save the same features outside the confidence band only with more than 6200 sub-samples.The striking difference between the number of samples needed using [8] and ASAP, is caused by the aim of the latter to distribute the points evenly and thus keeping the same topological features with much fewer samples.MaxMin can also preserve all known features outside the confidence band with 550 sub-samples.The difference between the persistence diagram of the three methods is small, thus we only present the ASAP result in panel (b).
Both GUDHI and Ripser fail to compute the Rips filtration of the entire dataset due to high memory usage.Nevertheless, if we decrease the number of points by ASAP with r ¼ 0:15, Ripser manages to calculate the Rips filtration and its persistence diagram.All expected features are visible outside the confidence band in panel . However, some new 0 homology features also appear in this plot that emerge since the distance between every two points after sub-sampling is more than r.SimBa is also capable of computing the Rips filtration, as depicted in Fig. 6(e), but the death times do not conform with the exact size of the cycles and cavities within the head.Besides, 0 homology features are represented with three red points that do not correspond to the number of connected components.Note that points on the persistence diagram shape a multiset, and each red dot can illustrate more than one feature.We can also compute the Rips filtration on the MaxMin selected sub-samples.As presented in panel (e), all expected features stand out the confidence band, and similar to ASAP some extra 0 homology features are also included.Nevertheless, the birth time of features is more distorted as the features of the same size (two out three cycles) are presented with two distinct points.The run time of the ASAP sampling for 2sphere dataset r ¼ 0:58 ð Þis about 0.4 s and it occupies about 1 MB of memory.On the Synthetic dwarf galaxy r ¼ 0:15 ð Þ , it takes about 0.7 s to select the samples and it consumes 2.2 MB of memory.All experiments were conducted on a single core of a processor with a maximum clock rate of 4.5 GHz.
To locate the position of notable features with a confidence value higher than 95%, we use Dionysus [11] and then applied the voting procedure as explained in Section 2.3.In Fig. 6(a) points in light and dark green are the detected points on the border of cycles, and the points in dark and light blue build the outline of two cavities inside the head of the synthetic galaxy.The probabilistic models visualized in Fig. 7 built using the detected border points from Fig. 6a clearly depict the intrinsic nature of the structures.The top panels (a-c) depict the likelihood of the datasets given the models, as iso-surfaces over the space containing the respective data points: the manifolds' neighbourhood.Each neighbourhood is discretized in a uniform grid and for each node in the grid we compute its likelihood given the manifold's model.The isosurfaces in Figs.7-8 are obtained by locally interpolating all nodes of the grid having the same likelihood, equal to a specific iso-value.For all figures, the iso-value is chosen to be 1% of the maximum likelihood computed over the whole grid.The coherent structures emerging from these iso-surfaces explain the noisy cycles that characterize the boundaries of the 2-dimensional holes.In the same way, the noisy spherical iso-surfaces in panels (d and e) of Fig. 7, cleanly separate regions populated by the boundary points (spherical shells) from the internal holes.
Table 1 presents the total number of simplices arising in every filtration on all synthetic datasets investigated.SimBa can only compute the Rips filtration and although Ripser computes the Rips filtration on the synthetic dwarf dataset it does not provide any information about the size of the simplicial complex inside the  structure indicated by a ''-" in the respective columns.Additionally, we denote with ''1" whenever the computation of the Rips filtration fails due to the memory complexity.For our proposed method, MaxMin [9] and RSM [8], we report the results for the number of samples preserving the homology of the data after denoising.This information reveals that ASAP decreases the number of sample points significantly, and hence reducing the number of simplices in different filtration while preserving the topological features.
To evaluate the robustness of detecting known toplogical features from point clouds the Median Relative Dominance (MRD) and Median Absolute Dominance (MAD) were introduced in [9].Relative dominance and absolute dominance are defined as , respectively, where R 0 and R 1 stand for the birth and death time of a feature.K 0 is the time when the Betti profile changes permanently to the profile of a single point in d-dimensions, and K 1 targets the time for which a complex becomes a complete simplex between all edges.Finally, we compute the median value over 100 iterations of sampling the data and calculating these metrics.Note that these metrics are calculated for every feature in a dataset separately, hence we add the suffix (s) and (b) in Table 2 to denote the results for the small and big circle or sphere respectively.The higher the value of these two metrics, the more robust the identification of similar features in the sub-sampled dataset is.Note that these metrics are not suitable for the synthetic dwarf galaxy dataset, since if the border points of the features are looser than the real border, the death time and metrics value increase falsely.Besides, the value of K 1 may vary drastically for alpha filtration if the center of an enclosing ball for the final added simplex located outside the simplex, thus we only discuss the MRD.
Table 2 displays the comparison based on these two metrics for ASAP and MaxMin.We did not insert RSM results here as long as RSM needed a larger number of samples to get similar topological features, thus the comparison is biased by the number of samples.As explained before, ASAP and MaxMin reach a similar number of samples for 2Spheres dataset.For 2Circles dataset, we chose the radius of sub-sampling using ASAP equal to 0.8 which results on average 28 samples that is closer to the number of samples selected by MaxMin (30).In both cases ASAP and MaxMin detect the expected features outside the 95% confidence band and the sub-sampling is repeated 100 times.The results disclose that ASAP reaches higher values for the MRD and MAD evaluation measures on both datasets.The lower metrics values for MaxMin stem from the strategy of the method to select samples.Fig. 3 reveals that although MaxMin samples are more evenly spaced than RSM, they are not as well placed to their neighbors as ASAP samples, which lead to a later birth time and lower metrics values.[34] simulation snapshot of a dwarf galaxy falling into a cluster environment with its gas stripped by ram pressure.The point set corresponds to the position of 33 500 gas particles.The distribution of points in this point cloud varies significantly, and points are dispersed on multiple separated parts.Hence, we expect to see several red points linked with Betti number 0 in the persistence diagram of this dataset as the Betti number 0 corresponds to connected components of the dataset.We sub-sample the dataset using ASAP with r ¼ 0:4 reducing the set to % 37:6% of the total amount of points.Then the Alpha simplicial complex was constructed on the subset.We select a radius value for sub-sampling using ASAP to pursue two conditions: first, the expected topological features are outside the 95% confidence band and second, the computation of Alpha filtration on the remaining samples is tractable.Based on the confidence band of 95% the data consists of four distinguished parts (0 homology features in red) and three cavities (blue points) with death time equal to 3.98, 1.66, and 1.48 respectively.We repeat the sub-sampling by ASAP 100 times, and each time, these three features are located inside the sub-sample sets, then using the technique defined in Section 2.3, the points on the border of each hole are detected.Panel (a) also illustrates the border points of the three cavities inside the head part of the galaxy.The largest cavity has a late birth time (approximately 10) shown in panel (b) of Fig. 8.A gap between points on the border of this cavity (points highlighted in light blue) is the reason for this delayed birth time.The first two holes were modelled via the modification to GTM described previously in Section 2.4.The resulting iso-surfaces of the likelihood of the probabilistic models w.r.t. a regular grid for the points enclosing the two holes are shown in panel (d) and (e) of Fig. 8.Given the spherical latent space adopted in our version of GTM, we could not properly model the irregular hole previously mentioned.Instead, we adopted the methodology outlined in [35], where multiple manifolds in a data set are modelled as low-dimensional graphs and embedded through the RBF formulation onto the ambient space.The results for this ''hole" are shown in Fig. 8(e).

Particle simulation of a Jellyfish-like dwarf galaxy
The technique presented can be used to get insights into the physical processes at play in galaxy evolution by post-processing N-body simulations.Firstly, in the simulations, by computing the time evolution of the probabilistic iso-surfaces (like those shown in Fig. 8) it is possible to follow the expansion of the gas around the modeled supernovae explosions, thus verifying the efficiency of the stellar feedback process.Also, distributions of the diameters, expansion velocities, ages of the bubbles can be computed.The standard model describing the evolution of superbubbles is an adiabatic, pressure-driven expanding process with a continuous energy injection [21].Recent simulations suggest that the ambient pressure does affect the expansion of the bubbles [36], which can now be analyzed quantitatively.
With ASAP and subsequent probabilistic modelling identifying the expanding superbubbles automatically, it would be possible to study the effect of the environment on the superbubbles accretion rate, and whether it depends on parameters such as the local ambient pressure and density.Moreover, the superbubble size distribution can be measured in a dynamical scenario like the one that is recreated in the simulations we used i.e. the fall of a galaxy in the hot and high density cluster gas.Observations show that the slope of the distribution of the size of superbubbles is different in galaxies with different morphologies (late-type vs. early-type) [36,37].Our simulation scenario effectively captures the well known galactic morphological transformation due to the galaxy-cluster interaction, thus allowing a comparison of superbubble distribution for different galactic evolutionary stages.Lastly, being able to isolate the particles belonging to the cavity walls can shed light on the physical properties of the shock wave at the border of the bubbles.

Conclusion
In this paper we expand the novel ASAP formulation for subsampling a point cloud that preserves the topological properties and reduces the memory consumption and computational cost  for TDA analysis.The formulation is expandable for d-dimensions, is not limited to a specific type of filtration and its performance is shown for a variety of data sets.The features found are analyzed for their robustness using a statistical approach providing the confidence levels.We separate the signal from noisy features through a statistical test and argue the downside of the suggested technique for detecting the boundary of a cycle.Accordingly, we suggest a voting strategy to solve this problem, and finally, the points on the outline of the located cycles or cavities are modeled by a probabilistic model.Each model is generated by a generalized GTM approach, which allows further investigation and analysis of their properties.As it is disclosed through empirical results on several datasets, the proposed approach preserves the size of topological properties.The accuracy of such information is indispensable in some domains, such as astronomy where it informs about physical phenomena, namely supernovae in a galaxy.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
if it is not contained in an open ball of any other sample point.Algorithm 1 details the complete procedure of the extended sampling strategy and Fig.1shows the result on a simple twodimensional example.Panel (a) depicts the point cloud N consisting of a line and a square with a circular hole in the centre and (b) repeated sampling in a multiset C. The set of distinct points from C form the set C. Multiplicity m b ð Þ of each boundary point b 2 C expresses how many times b was selected in a bootstrap procedure during the repeated sampling by ASAP.To stabilize the selection of boundary points, we created the following voting scheme: First a tolerance ball B b; r ð Þ of radius r is created around every b 2 C. Next, a collection of counter variables

Fig. 1 .
Fig. 1.(a) Points N distributed on a line and holed square, (b) ball cover after random sub-sampling and (c) after repulsive selection.

Fig. 2 .
Fig. 2. (a) Sample points extracted by ASAP (black) and the boundary of the cycle found by [11] (red).Panel (b) corresponds to the border recovered by the voting system (100 runs) with the colorbar depicting the number of votes.

Fig. 5 .
Fig. 5. 2Spheres: Persistence diagram (a) and barcode plot (b) of Alpha filtration.The plots are very similar for samples extracted by ASAP, RSM, and MaxMin and hence we just show one example.

Fig. 4 .
Fig. 4. 2circles: mean and standard deviation of the death time of the smaller circle over 10 repeated samplings of ASAP, RSM, and MaxMin when reducing the number of samples kept with Alpha (a) and Rips filtration (b).

Fig. 8
Fig. 8 panel (a) shows an N-body Smoothed Particle Hydrodynamics[34] simulation snapshot of a dwarf galaxy falling into a cluster environment with its gas stripped by ram pressure.The point set corresponds to the position of 33 500 gas particles.The distribution of points in this point cloud varies significantly, and points are dispersed on multiple separated parts.Hence, we expect to see several red points linked with Betti number 0 in the persistence diagram of this dataset as the Betti number 0 corresponds to connected components of the dataset.We sub-sample the dataset using ASAP with r ¼ 0:4 reducing the set to % 37:6% of the total amount of points.Then the Alpha simplicial complex was constructed on the subset.We select a radius value for sub-sampling using ASAP to pursue two conditions: first, the expected topological features are outside the 95% confidence band and second, the computation of Alpha filtration on the remaining samples is tractable.

Fig. 8
Fig. 8 panel (b)shows the persistence diagram for the reduced set, denoised using a threshold of 1 for the minimum birth-death time.Based on the confidence band of 95% the data consists of four distinguished parts (0 homology features in red) and three cavities (blue points) with death time equal to 3.98, 1.66, and 1.48 respectively.We repeat the sub-sampling by ASAP 100 times, and each time, these three features are located inside the sub-sample sets, then using the technique defined in Section 2.3, the points on the border of each hole are detected.Panel (a) also illustrates the border points of the three cavities inside the head part of the galaxy.The largest cavity has a late birth time (approximately 10) shown in panel (b) of Fig.8.A gap between points on the border of this cavity (points highlighted in light blue) is the reason for this delayed birth time.The first two holes were modelled via the modification to GTM described previously in Section 2.4.The resulting iso-surfaces of the likelihood of the probabilistic models w.r.t. a regular grid for the points enclosing the two holes are shown in panel (d) and (e) of Fig.8.Given the spherical latent space adopted in our version of GTM, we could not properly model the irregular hole previously mentioned.Instead, we adopted the methodology outlined in[35], where multiple manifolds in a data set are modelled as low-dimensional graphs and embedded through the RBF formulation onto the ambient space.The results for this ''hole" are shown in Fig.8(e).The technique presented can be used to get insights into the physical processes at play in galaxy evolution by post-processing N-body simulations.Firstly, in the simulations, by computing the time evolution of the probabilistic iso-surfaces (like those shown in Fig.8) it is possible to follow the expansion of the gas around the modeled supernovae explosions, thus verifying the efficiency

Fig. 7 .
Fig. 7. Iso-surfaces of the likelihood computed by the probabilistic models of the recovered cycles (top row) and holes (bottom) of the synthetic dwarf galaxy (s.dwarf) depicted in Fig. 6(a).

Fig. 8 .
Fig. 8. Jellyfish-like dwarf galaxy particles (a) and Alpha filtration of the ASAP subset (b).Iso-surfaces of the likelihood computed by the probabilistic models of the recovered holes (c-e).

Table 1
Comparison of the number of simplices constructed by several methods and filtrations with lowest numbers marked in bold.

Table 2
Comparison of MRD and MAD for several methods and filtrations with best values marked in bold.Suffix (s) and (b) mark the results for the small or big structure respectively.