Robust Dynamic Classifier Selection for Remote Sensing Image Classification

Dynamic classifier selection (DCS) is a classification technique that, for each new sample to be classified, selects and applies the most competent classifier from a pool of available ones. We propose a novel DCS model (R-DCS) based on the robustness of each classifier's prediction: the extent to which the classifier's parameters can be altered without changing its prediction. To define and compute this robustness, we adopt methods from the theory of imprecise probabilities. Additionally, two selection strategies for the R-DCS model are presented and applied to remote sensing images. The experimental results demonstrate that our model successfully incorporates uncertainty about the model parameters without losing performance.


I. INTRODUCTION
Image classification in remote sensing is the process of assigning land cover classes to pixels. Recent advances in remote sensing technology, including hyperspectral imaging (HSI) and Light Detection And Ranging (LiDAR) systems, facilitate and improve the acquisition of relevant information [1].
Making use of multiple data sources enables a more comprehensive interpretation of the scene and improved classification performance [2]-[6]. We develop here a novel multi-source classification method based on the concept of Dynamic Classifier Selection (DCS) [7], which we extend to the framework of imprecise probabilities.
A DCS method dynamically selects, for each test sample, the classifier with the highest probability of correctly classifying it. The key question is how to select the most competent classifier for any given query sample. Usually, this choice is based on a local region of the feature space in which the query sample is located. Most works define this local region by applying the K-Nearest Neighbors technique, which groups samples with similar features [8], [9]. In this work, we group samples differently, by incorporating the concept of robustness to the model specification.
Despite the huge progress in image classification, current machine learning methods are not yet sufficiently robust to perturbations in the data and to model errors to reliably support high-stakes applications [10], [11]. The work in [12] analyzed the global sensitivity of a maximum a posteriori (MAP) configuration of a discrete probabilistic graphical model (PGM) with respect to perturbations of its parameters, and provided an algorithm to evaluate the robustness of the MAP configuration with respect to those perturbations. For a family of PGMs, the maximum perturbation level that does not alter the MAP solution is called the critical perturbation threshold. In a classification problem, these thresholds determine the level to which the classifier parameters can be altered without changing the prediction. The experiments in [12] established a strong correlation between these robustness measures and the accuracy of the corresponding classifiers. This property, combined with DCS, was applied to classification in our earlier work [13], but only for the case of binary classes and two classifiers.
Here we build further on this idea, developing a robust classification method with multiple classes and multiple classifiers.
In particular, we build on Naive Bayes Classifiers (NBCs), but the proposed framework can be extended to other classification models. We define and compute the perturbation thresholds based on concepts from the theory of imprecise probabilities. In particular, we use the Imprecise Dirichlet Model (IDM) [14] to extend the specification of local probabilities in the model to corresponding credal sets. This imprecise-probabilistic extension of an NBC is called a Naive Credal Classifier (NCC) [15]. Specifically, we perturb an NCC by varying the value of the hyperparameter that determines the degree of imprecision in the IDM. Thus, the perturbation threshold of an NCC is the maximum value of this hyperparameter under which the NCC remains determinate.
Based on this imprecise-probabilistic measure for the robustness of a class prediction, we here propose a robust DCS (R-DCS) model and apply it to remote sensing image classification. We first extract features from single or multiple data sources. The extracted features carry different types of information, such as spectral, spatial and elevation information about the captured scene. Afterwards, classifiers are constructed from the different types of features and are used for dynamic selection in R-DCS.
We provide two selection strategies for R-DCS: Rt and R-LA. The Rt strategy selects classifiers by only considering the value of their perturbation thresholds. While conceptually simple, this approach does not always perform well because the exact relation between perturbation thresholds and performance differs from one classifier to another. The second strategy, R-LA, improves upon this by determining the empirical relation between the perturbation thresholds of the different classifiers and their probabilities of correctly classifying the instance under consideration. Experimental results on two real data sets with HSI and LiDAR data demonstrate the efficiency of the proposed method for sensor fusion and classification.
This paper is organized as follows. The NBC and its imprecise-probabilistic extension, the NCC, are introduced in Section II. In Section III, we first present the computation of perturbation thresholds for NCCs; we then describe the proposed R-DCS model by introducing its two selection strategies and explaining how R-DCS operates in multi-source data classification. Experiments on HSI and LiDAR data are reported in Section IV. We conclude the paper in Section V.

II. NAIVE BAYES AND NAIVE CREDAL CLASSIFIERS

A. Naive Bayes Classifiers
Let C denote the class variable, which takes values c in a finite set C, and let m denote the number of features. The i-th feature variable F_i takes values f_i in a finite set F_i. For notational convenience, we gather all feature variables in a single vector F = (F_1, ..., F_m), with values f = (f_1, ..., f_m). An NBC is a popular probabilistic model in which the features are conditionally independent given the class. Thus, the MAP estimate of the class under an NBC becomes

ĉ = arg max_{c ∈ C} P(c|f) = arg max_{c ∈ C} (1/Z) P(c) ∏_{i=1}^{m} P(f_i|c),   (1)

where Z = ∑_{c ∈ C} P(c) ∏_{i=1}^{m} P(f_i|c) is the partition function. The (conditional) probabilities that appear on the right-hand side are typically learned from data. To avoid zero probabilities, we adopt Laplace smoothing:

P(c) = (n(c) + 1) / (n + |C|),   P(f_i|c) = (n(c, f_i) + 1) / (n(c) + |F_i|),   (2)

with n the total number of data points, n(c) the number of data points with class c, and n(c, f_i) the number of data points with class c and i-th feature value f_i.
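A minimal sketch of such a classifier may help make the counting notation concrete. The following is an illustrative toy implementation with our own naming, not the authors' code; it learns the counts n, n(c) and n(c, f_i) and applies the Laplace-smoothed MAP rule (the partition function cancels in the arg max):

```python
from collections import Counter

class NaiveBayes:
    """Minimal discrete NBC with Laplace smoothing (illustrative sketch)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.n = len(y)                       # n: total number of data points
        self.n_c = Counter(y)                 # n(c)
        self.n_cf = Counter()                 # n(c, f_i), keyed by (c, i, f_i)
        self.m = len(X[0])
        self.vals = [sorted({x[i] for x in X}) for i in range(self.m)]  # domains F_i
        for x, c in zip(X, y):
            for i, f in enumerate(x):
                self.n_cf[(c, i, f)] += 1
        return self

    def joint(self, x, c):
        """P(c) * prod_i P(f_i | c), each factor Laplace-smoothed."""
        p = (self.n_c[c] + 1) / (self.n + len(self.classes))
        for i, f in enumerate(x):
            p *= (self.n_cf[(c, i, f)] + 1) / (self.n_c[c] + len(self.vals[i]))
        return p

    def predict(self, x):
        """MAP estimate; the partition function Z cancels in the arg max."""
        return max(self.classes, key=lambda c: self.joint(x, c))
```

On a small toy set, fitting and predicting follows the usual scikit-learn-like fit/predict convention, although the class itself is ours.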

B. Naive Credal Classifiers
The Naive Credal Classifier (NCC) is an extension of the NBC to the framework of imprecise probabilities that can be used to robustify the inferences of an NBC. Basically, the idea is to consider an NBC whose local probabilities are only partially specified.
Instead of considering a probability mass function P(C) that contains the probabilities P(c) of each of the classes c ∈ C, an NCC considers a set of such probability mass functions, which we denote by 𝒫(C). Similarly, for every class c ∈ C and every i ∈ {1, ..., m}, it considers a set 𝒫(F_i|c) of conditional probability mass functions.
In particular, we use a version of the IDM [14] to construct these local sets, suitably adapted such that each set is guaranteed to contain the result of Laplace smoothing. The local set 𝒫(C) over the class variable C consists of all mass functions of the form

P(c) = (n(c) + 1 + s·t(c)) / (n + |C| + s)  for all c ∈ C,   (3)

where s is a fixed hyperparameter that determines the degree of imprecision and t(c) ranges over all probability mass functions on C with t(c) > 0 for all c ∈ C. For s = 0, the set collapses to the single Laplace-smoothed estimate. P(C) is taken to belong to 𝒫(C) if and only if, for all c ∈ C, P(c) is of the form above. For every i ∈ {1, ..., m} and c ∈ C, the local set 𝒫(F_i|c) is defined similarly.
If we choose a single probability mass function P(C) in 𝒫(C) and a single conditional probability mass function P(F_i|c) in 𝒫(F_i|c) for every c ∈ C and i ∈ {1, ..., m}, we obtain a single NBC. By doing this in every possible way, a set of NBCs is obtained. This set is an NCC. In this work, each base classifier for DCS will be an NCC.
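To make the local credal sets tangible, the sketch below computes the lower and upper probability of a class under our reading of the adapted IDM set above (the parametrization is chosen to be consistent with the threshold formulas later in the paper; the function and variable names are ours). The extremes correspond to letting t(c) tend to 0 and to 1:

```python
def idm_class_bounds(counts, c, s):
    """Lower/upper probability of class c under the adapted IDM local set:
    P(c) = (n(c) + 1 + s*t(c)) / (n + |C| + s), with t in the open simplex.
    For s = 0 the interval collapses to the Laplace-smoothed estimate."""
    n = sum(counts.values())        # n: total number of samples
    k = len(counts)                 # |C|: number of classes
    lower = (counts[c] + 1) / (n + k + s)      # t(c) -> 0
    upper = (counts[c] + 1 + s) / (n + k + s)  # t(c) -> 1
    return lower, upper
```

Choosing one value inside each such interval (jointly, via a single t) picks out one NBC; ranging over all choices yields the NCC.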

III. PERTURBATION THRESHOLDS AND R-DCS MODEL
Building on the definitions above, we first present the computation of perturbation thresholds for NCCs in this section. Next, inspired by the observation in [12] that instances with higher perturbation thresholds have a higher chance of being classified correctly, we illustrate the R-DCS model by introducing two selection strategies and their application to multi-source data classification.

A. Computation of perturbation thresholds for NCCs
An NCC is a set of NBCs obtained by choosing different (conditional) probability mass functions from the corresponding sets. If all these NBCs agree on which class to return, the output of the NCC is that class. Otherwise, the NCC is indeterminate and returns a set of possible classes. In this work, we perturb every local set by varying the value of s; the goal is to obtain the maximum value of s, called the perturbation threshold, under which the NCC remains determinate.
The following theorem from [12] reformulates the computation of such perturbation thresholds as an optimization problem for MAP inference.
Theorem 1: Let X be a variable taking values in a finite set Val(X), let 𝒫 be a set of candidate mass functions over X, and let x̂ be a MAP instantiation for a mass function P ∈ 𝒫. Then x̂ is the unique MAP instantiation for every P ∈ 𝒫 if and only if

min_{P ∈ 𝒫} [ P(x̂) / max_{x ∈ Val(X)∖{x̂}} P(x) ] > 1.   (4)

Theorem 1 was used in [12] to test the robustness of the MAP estimates in PGMs, and can be exploited to compute the perturbation thresholds for PGMs. In our case, we use a specific type of PGM, namely the Naive Bayes topology, and can thus specialize Theorem 1 to the following problem.
Let 𝒫(C|f) be the corresponding set of conditional probability mass functions, whose local sets contain the results of Laplace smoothing, and let ĉ be the MAP instantiation for the Laplace-smoothed P(C|f). Based on Theorem 1, ĉ is the unique MAP instantiation for every P(C|f) ∈ 𝒫(C|f) if and only if (i) ĉ is a MAP instantiation for some element of 𝒫(C|f) and (ii) the minimum ratio in Theorem 1 exceeds 1. As we adopt Laplace smoothing to learn the model, and the resulting mass function belongs to 𝒫(C|f) by construction, the first criterion is always satisfied. With the definition in (1), the second criterion is reformulated as

min_{P ∈ 𝒫} [ P(ĉ) ∏_{i=1}^{m} P(f_i|ĉ) / max_{c ∈ C∖{ĉ}} P(c) ∏_{i=1}^{m} P(f_i|c) ] > 1

⇔ max_{c ∈ C∖{ĉ}} max_{P ∈ 𝒫} [ P(c) ∏_{i=1}^{m} P(f_i|c) / P(ĉ) ∏_{i=1}^{m} P(f_i|ĉ) ] < 1

⇔ max_{P ∈ 𝒫} [ P(c^(2)) ∏_{i=1}^{m} P(f_i|c^(2)) / P(ĉ) ∏_{i=1}^{m} P(f_i|ĉ) ] < 1,   (8)

where c^(2) is the class that yields the highest probability P(c|f) given the feature vector f among all c ∈ C∖{ĉ}. Specifically, we use the IDM introduced in (3) to construct the local credal sets. By substituting (3) into (8), we find that, for any given feature vector f, the perturbation threshold s^(per) of an NCC is the maximum value of s that satisfies

α(c^(2); s) ∏_{i=1}^{m} β(f_i|c^(2); s) < 1,   (9)

where α(c^(2); s) is an unconditional criterion function of c^(2) and the perturbation level s, and β(f_i|c^(2); s) is a conditional criterion function of c^(2) and s, introduced for ease of presentation. These two criterion functions are computed for all i ∈ {1, ..., m} by

α(c^(2); s) = (n(c^(2)) + 1 + s) / (n(ĉ) + 1),   (10)

β(f_i|c^(2); s) = [(n(c^(2), f_i) + 1 + s)(n(ĉ) + |F_i| + s)] / [(n(ĉ, f_i) + 1)(n(c^(2)) + |F_i| + s)],   (11)

where s ≥ 0, and n(c) and n(c, f_i) have the same definitions as in Section II. In practical applications, we initialize s at 0 and increase its value by a fixed step (we use 0.1) in each iteration until s no longer satisfies (9). We will use this perturbation threshold as an indicator to define the selection strategies for R-DCS in the following section.
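The grid search over s described above can be sketched as follows. This is an illustrative implementation under our assumptions: it takes the NBC counts as plain dictionaries, and it checks the α·∏β dominance criterion against every rival class rather than only the runner-up c^(2), which is an equivalent but simpler-to-code test; all names are ours.

```python
def perturbation_threshold(n_c, n_cf, vals, x, c_hat, step=0.1, s_max=100.0):
    """Largest s (to within `step`) for which the NCC still returns the single
    class c_hat.  n_c[c] = n(c); n_cf[(c, i, f)] = n(c, f_i); vals[i] is the
    domain of feature F_i.  Dominance multiplies the alpha and beta criterion
    functions (upper probability of a rival over lower probability of c_hat,
    under the IDM local sets)."""
    rivals = [c for c in n_c if c != c_hat]

    def dominates(s):
        for c2 in rivals:
            r = (n_c[c2] + 1 + s) / (n_c[c_hat] + 1)        # alpha(c2; s)
            for i, f in enumerate(x):
                k = len(vals[i])                             # |F_i|
                up = (n_cf.get((c2, i, f), 0) + 1 + s) / (n_c[c2] + k + s)
                lo = (n_cf.get((c_hat, i, f), 0) + 1) / (n_c[c_hat] + k + s)
                r *= up / lo                                 # beta(f_i | c2; s)
            if r >= 1.0:                                     # dominance broken
                return False
        return True

    s = 0.0
    while s + step <= s_max and dominates(s + step):
        s = round(s + step, 10)
    return s
```

On the toy counts below, the criterion reduces to (3+s)(1+s)(2+s) < 18, which holds up to s = 0.7 on the 0.1 grid.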

B. Selection strategies for R-DCS
The key to DCS is to find the classifier with the highest probability of being correct for a given unseen sample. We here provide two selection strategies based on the perturbation thresholds defined in the previous section.
1) Rt strategy: In order to select the most competent classifier among a set of available ones, a first idea is to simply choose, for each sample, the classifier with the highest perturbation threshold. We refer to this strategy as Rt.
Let Ψ = {ψ_1, ψ_2, ..., ψ_L} be the base classifiers forming the DCS pool; in this work, each ψ_l ∈ Ψ is an NCC. Let X = {x_i} be a set of training samples and Y = {y_j} a set of test samples; each sample x_i ∈ R^m, y_j ∈ R^m is a vector of pixel values at a particular location in m image channels. We determine the perturbation thresholds defined in the previous section for all these samples, and denote by s^(per)_{l,i} the perturbation threshold of the l-th classifier ψ_l on sample i. Let λ_j ∈ {1, ..., L} denote the index of the base classifier that will be assigned to sample j. The Rt strategy selects for each test sample y_j the classifier ψ_{λ_j} ∈ Ψ that exhibits the highest perturbation threshold:

λ_j = arg max_{l ∈ {1, ..., L}} s^(per)_{l,j},   (12)

and the classifier ψ_{λ_j} is assigned to the sample y_j.
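The Rt rule is a per-sample arg max over the threshold matrix, which a short sketch makes explicit (a minimal illustration; the function name and data layout are ours):

```python
def rt_select(s_per):
    """Rt strategy sketch: s_per[l][j] is the perturbation threshold of base
    classifier l on test sample j.  Returns, for every test sample j, the
    index lambda_j of the classifier with the highest threshold."""
    n_test = len(s_per[0])
    return [max(range(len(s_per)), key=lambda l: s_per[l][j])
            for j in range(n_test)]
```

Ties are broken toward the lowest classifier index, an arbitrary convention of this sketch.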
2) R-LA strategy: R-LA chooses a classifier by estimating the accuracy of each classifier in a local region surrounding the sample in the space of perturbation thresholds. In particular, for each classifier we select the N training samples whose perturbation thresholds are closest to that of the test sample.
Let us define the perturbation distance between two data samples as the absolute value of the difference between their perturbation thresholds for a given classifier:

d_l(x_i, y_j) = | s^(per)_{l,i} − s^(per)_{l,j} |.   (13)

Let N_{l,j} be the set of N training samples that are the nearest neighbors of y_j in terms of d_l(x_i, y_j). For each sample y_j to be classified, we determine the most competent classifier ψ_{λ_j} as follows:

λ_j = arg max_{l ∈ {1, ..., L}} |Ñ_{l,j}| / N,   (14)

where Ñ_{l,j} is the subset of N_{l,j} composed of those training samples that are correctly classified by ψ_l. Classifier ψ_{λ_j} is then assigned to the sample y_j.

Fig. 1 illustrates this strategy with a fictitious example that contains ten training instances, whose thresholds for two classifiers are depicted in the plane. The threshold values of classifiers ψ_1 and ψ_2 are the x- and y-coordinates, respectively. Every instance in the training set corresponds to a black point. Consider now a test instance y_j whose pair of thresholds corresponds to the red dot, and let N = 3. Then the three dots marked with green triangles and the three marked with purple squares constitute the sets N_{1,j} and N_{2,j}, respectively. Next, we compare the accuracy of both classifiers on these sets of points. Whichever classifier performs best on them is the one that we use to classify this particular test instance.
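The R-LA rule can be sketched as follows: for each classifier, find the N nearest training samples in perturbation distance and score the classifier by its local accuracy on them (an illustrative sketch; all argument names are ours):

```python
def rla_select(s_train, correct, s_test, N=3):
    """R-LA strategy sketch:
    s_train[l][i] - perturbation threshold of classifier l on training sample i
    correct[l][i] - whether classifier l classifies training sample i correctly
    s_test[l]     - perturbation threshold of classifier l on the test sample
    Returns the index of the classifier with the highest local accuracy
    |N_tilde| / N on its own N nearest neighbours of the test sample."""
    L = len(s_train)

    def local_accuracy(l):
        # neighbours under the perturbation distance d_l = |s_l,i - s_l,j|
        order = sorted(range(len(s_train[l])),
                       key=lambda i: abs(s_train[l][i] - s_test[l]))
        return sum(correct[l][i] for i in order[:N]) / N

    return max(range(L), key=local_accuracy)
```

Note that each classifier is scored on its own neighbourhood, so the neighbour sets generally differ between classifiers, exactly as in the Fig. 1 example.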

C. R-DCS in multi-source data classification
We apply the proposed R-DCS model to multi-source data classification. The framework of multi-source data classification with R-DCS is illustrated in Fig. 2. This general model admits multiple types of features from one or more data sources. In this work, we extract features by applying morphological openings and closings with partial reconstruction on the different data sources, similarly to [16], [17], to generate morphological features.
In particular, for HSI data, spectral features are obtained from the original HSI and spatial features are generated by mathematical morphology. For LiDAR data, elevation features are generated by morphological operators. A separate classifier is constructed for each type of feature. In this way, a pool of classifiers is obtained for dynamic selection in the R-DCS model.
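For intuition, plain flat grey-scale openings and closings on a single image band can be sketched as below. Note that this is a simplification: the paper's features use openings and closings with partial reconstruction [16], [17], which this sketch omits, and all function names are ours.

```python
def grey_erode(img, k):
    """Grey-scale erosion of a 2D list of pixel values with a flat k x k
    structuring element (the window is clipped at the image border)."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[min(img[a][b]
                 for a in range(max(0, i - r), min(h, i + r + 1))
                 for b in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)] for i in range(h)]

def grey_dilate(img, k):
    """Grey-scale dilation: the max-filter counterpart of grey_erode."""
    h, w, r = len(img), len(img[0]), k // 2
    return [[max(img[a][b]
                 for a in range(max(0, i - r), min(h, i + r + 1))
                 for b in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)] for i in range(h)]

def morphological_features(band, sizes=(3, 5)):
    """Feature stack for one band: an opening (erosion then dilation) and a
    closing (dilation then erosion) per structuring-element size."""
    feats = []
    for k in sizes:
        feats.append(grey_dilate(grey_erode(band, k), k))  # opening
        feats.append(grey_erode(grey_dilate(band, k), k))  # closing
    return feats
```

Openings suppress bright structures smaller than the window while closings fill dark ones, which is what makes the stacked responses informative about spatial structure and, for LiDAR, elevation.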

IV. EXPERIMENTAL RESULTS AND ANALYSIS
We conduct experiments on two real data sets: an HSI data set and a combined HSI and LiDAR data set. We compare the Rt and R-LA methods of our proposed R-DCS model with the following schemes: 1) a K-nearest neighbors (KNN) classifier on the spectral features of the HSI; 2) NBCs with different features, i.e., NB-Spe (spectral features of the HSI), NB-Spa (morphological features of the HSI) and NB-Ele (morphological features of the LiDAR data); and 3) generalized graph-based fusion (GGF) [2]. Three widely used performance measures are adopted for quantitative assessment: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (κ). In the following experiments, half of the labeled samples are used for training and the rest for testing. Experimental results are averaged over 10 runs.

A. Experiments on HSI

The results reported in Table I and Figure 3 reveal that our method R-LA achieves the best performance. Compared with NB-Spe and NB-Spa, our Rt method shows a lower accuracy, which demonstrates that DCS based on the highest robustness alone may not improve the classification performance. In contrast, the proposed R-LA yields improved performance, which benefits from the adaptive selection based on the thresholds, ensuring that the classification accuracy for each test pixel is closer to the best result among the nearby pixels. Our method R-LA also obtains a better result than the feature-fusion-based classification method GGF.

B. Experiments on HSI and LiDAR
The second data set comes from the 2013 IEEE GRSS Data Fusion Contest [18]; we refer to it as GRSS2013. It was acquired over the University of Houston campus and the neighboring urban area in June 2012. It involves two types of data sources: an HSI and a LiDAR-derived Digital Surface Model (DSM). The HSI has 144 spectral bands and 349 × 1905 pixels. The ground truth provided for this data set contains 15 classes. The morphological features are generated with the same method as in [18].
The results are reported in Table II. Rt does not perform well enough, which again confirms that directly comparing the perturbation thresholds of different classifiers is not meaningful and that combination methods might not always be the best option. However, the proposed R-LA again yields the best performance in terms of OA, AA and κ. Compared with the feature-fusion-based method GGF, our method offers robustness to the model specification and achieves better performance at the same time. Moreover, our proposed model is parameter-free, which makes it more practical for real applications.

V. CONCLUSION
The main contribution of this work is a novel, robust dynamic classifier selection method, which we refer to as R-DCS. The experimental results demonstrate that the proposed R-DCS model with the R-LA strategy not only outperforms each of the individual classifiers it is based on, but also achieves better performance than the feature-fusion-based classification method GGF. Although the proposed model is computationally very simple, it naturally improves robustness to the model specification without sacrificing classification accuracy.

Fig. 2. The proposed multi-source classification framework with R-DCS. It involves three blocks: (i) feature extraction from the original data sources; (ii) classifier construction based on the extracted features; and (iii) dynamic classifier selection from the classifier pool.

Fig. 1. An illustration of the R-LA strategy. Three green triangles and three purple squares are selected to compute the local accuracy of ψ_1 and ψ_2, respectively.