A method for continuous-range sequence analysis with Jensen-Shannon divergence

Mutual Information (MI) is a useful information-theoretic tool for recognizing mutual dependence between data sets. Several methods have been developed for the estimation of MI when both data sets are of the discrete type or when both are of the continuous type. However, MI estimation between a discrete-range data set and a continuous-range data set has not received as much attention. We therefore present here a method for the estimation of MI for this case, based on the kernel density approximation. This calculation may be of interest in diverse contexts. Since MI is closely related to the Jensen-Shannon divergence, the method developed here is of particular interest in the problems of sequence segmentation and set comparison.


I Introduction
Mutual Information (MI) is a quantity whose theoretical base originates in Information Theory [1]. Since MI between two independent random variables (RV) is zero, a non-null value of MI between these variables gives a measure of mutual dependence. When analyzing two data sets X and Y (assumed to be the realization of two mutually dependent RVs), MI can give us a measure of the mutual dependence of these sets. Although MI may be straightforwardly calculated when the underlying probability distributions are known, this is not usually the case when only the data sets are available. Therefore, MI must be estimated from the data sets themselves. When X and Y are of the discrete type, MI may be estimated by substituting the joint probability of these variables by the relative frequency of appearance of each pair (x, y) in the data sequence [2,3]. For real-valued data sets (or discrete ones with a wide range) estimation of MI by frequency of appearance is not applicable. The binning method [4] in turn requires large bins or long sequences in order to produce reasonable results. Alternative proposals have been made for cases when both data sets are of the continuous type [5].
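For the purely discrete case, the frequency-of-appearance estimator mentioned above can be sketched as follows (a minimal illustration in Python; the function name is ours, and joint and marginal probabilities are simply replaced by relative frequencies, as in [2,3]):

```python
import numpy as np

def mi_discrete(x, y):
    """Plug-in MI estimate (in nats) for two aligned discrete sequences:
    probabilities are replaced by relative frequencies of appearance."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for vx in np.unique(x):
        px = np.mean(x == vx)                    # marginal frequency of vx
        for vy in np.unique(y):
            pxy = np.mean((x == vx) & (y == vy)) # joint frequency of (vx, vy)
            if pxy > 0.0:
                mi += pxy * np.log(pxy / (px * np.mean(y == vy)))
    return mi
```

For two identical balanced binary sequences this estimator approaches ln 2, the entropy of one bit in nats, while for independent sequences it stays close to zero (up to the well-known positive bias of the plug-in estimator).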
Estimation of MI between a discrete RV and a continuous one has not been so extensively considered, in spite of being a problem of interest in diverse situations. For instance, we could compare the day of the week (weekday vs. weekend, discrete) with traffic flow (continuous), quantifying this effect. In a different context we might wish to quantify the effect of a drug (administered or not, discrete) in medical treatment evaluation (electroencephalograms in epilepsy, continuous data). Ross [6] has proposed a scheme for estimating MI based on the nearest neighbour method [4]. Assuming a sequence of (x, y) pairs, with X discrete and Y continuous, the nearest neighbour method requires the pairs to be ordered by the Y values. This requirement makes the proposal impractical in some settings, sequence analysis for instance. The nearest neighbour method was also considered by Kraskov et al. [7], who suggest two ways of evaluating MI with it. An alternative definition of MI is presented by Gao et al. [8], also based on the distance between the elements of the sequence.
In this paper we propose a more direct method for estimating MI between a discrete and a continuous data set, based on the kernel density approximation (KDA) [4] for estimating the probability density function (PDF) of the continuous variable. For the discrete variable we make use of the usual frequency approximation [2,3]. Finally, MI is computed by Monte Carlo integration.
As shown by Grosse et al. [2], MI can be identified with the Jensen-Shannon divergence (JSD), a measure of dissimilarity between two probability distributions. The JSD is a non-negative functional that equals zero when the distributions being compared are the same. This property makes the JSD a useful tool for sequence segmentation [2,3]. Furthermore, in diverse contexts it is of interest to evaluate whether a given sequence matches a particular probability distribution. The most usual case is that of a normal distribution, but this is a more general problem. For instance, in satellite synthetic aperture radar (SAR) images the backscatter presents a multiplicative noise assumed to have an exponential distribution [9]. Also, models for cloud droplet spectra assume a Weibull distribution [10,11]. Several indirect methods have been proposed for the analysis of continuous-range sequences. Pereyra et al. [12] outlined a method based on the wavelet transform to analyze electroencephalograms. Recently, Mateos et al. [13] have proposed a mapping from continuous-valued sequences into discrete-state sequences prior to the JSD calculation. Several other mapping methods have been proposed in the literature to associate a discrete probability distribution with a real-valued series.
Here, by means of the KDA, we avoid resorting to any indirect method, approximating the probability densities of continuous-range variables with this non-parametric estimator. In Section II we present the calculation of MI and the arrangement for sequence segmentation with the JSD. In Section III we test the performance of this method through numerical experiments; we also apply it to edge detection in a satellite synthetic aperture radar (SAR) image. In Section IV we discuss the results obtained.

II Method
In this section we present our proposal for estimating MI between discrete and continuous RVs, based on the KDA estimator of a PDF. Let us consider a sequence of pairs (x, y), with x a variable of discrete range and y of continuous range. To calculate MI we resort only to the sequence itself, making use of no extra information. We start from the sequence of data pairs (x, y), and assume that these data are sampled from a joint probability density µ(x, y), unknown at first. From the joint PDF the marginal probabilities are obtained as

p(x) = ∫ µ(x, y) dy, (1a)
φ(y) = Σ_x µ(x, y). (1b)

The MI between the RVs X and Y is expressed in terms of these PDFs as [1]

I(X, Y) = Σ_x ∫ µ(x, y) ln [ µ(x, y) / (p(x) φ(y)) ] dy. (2)
Note that if the variables X and Y are statistically independent then µ(x, y) = p(x) φ(y), and in this case I(X, Y) = 0. In this way a non-null value of I(X, Y) gives a measure of the mutual dependence of these variables. Introducing the conditional PDFs

φ_x(y) = µ(x, y) / p(x), (3)

we may rewrite I(X, Y) as

I(X, Y) = Σ_x p(x) ∫ φ_x(y) ln [ φ_x(y) / φ(y) ] dy. (4)

i Kernel density approximation

To carry out the calculation in (4), knowledge of the conditional PDFs is necessary. As mentioned, these densities are assumed to be unknown, and have to be estimated from the data themselves.
Here we make use of the KDA [4], as summarized in the following. The conditional PDFs in (3) are estimated considering separately the data pairs with each given value of x. We define the sets

C_κ = { y_j : x_j = κ }, (5)

and for each set we approximate the conditional density using a KDA with a Gaussian kernel,

φ̂_κ(y) = (1 / n_κ) Σ_{y_j ∈ C_κ} (1 / (h_κ √(2π))) exp( −(y − y_j)² / (2 h_κ²) ). (6)

Note that the sum is over the y_j values in the set C_κ, and n_κ is the number of pairs in this set. The bandwidth h_κ is chosen as the optimal value reported in [4] and followed by Steuer et al. [5],

h_κ = s_κ (4 / (3 n_κ))^(1/5), (7)

where s_κ² is the variance of the sample. Sheather [14] considered alternative values to detect bimodality; however, as mentioned there, they make little visual difference.
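As an illustration, the Gaussian-kernel estimate in (6) can be sketched in Python as follows (a minimal version; we assume the bandwidth rule of thumb h = s (4/(3n))^(1/5) from the text, and the function name is ours):

```python
import numpy as np

def kde_gaussian(y_points, sample, h=None):
    """Gaussian-kernel density estimate of a PDF from a 1-D sample,
    evaluated at y_points. If no bandwidth is given, use the rule of
    thumb h = s * (4 / (3 n))**(1/5), with s the sample std deviation."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    if h is None:
        h = sample.std(ddof=1) * (4.0 / (3.0 * n)) ** 0.2
    y = np.atleast_1d(np.asarray(y_points, dtype=float))
    # One Gaussian bump centred on each sample point, averaged.
    z = (y[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))
```

For a sample drawn from a standard normal distribution, the estimate integrates to one and peaks near the true density value 1/√(2π) ≈ 0.399 at the origin.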
The marginal probability of X is approximated by the frequency of occurrence of each value,

p̂(x) = n_x / n, (8)

and the marginal probability density of Y by

φ̂(y) = Σ_x p̂(x) φ̂_x(y). (9)

We illustrate the results obtained with the KDA by an example: let us consider the joint probability distribution µ(x, y) in (10) and the corresponding marginal PDF in (11). We sampled 1000 pairs from this distribution for two different values of y_m, and from these pairs we estimated the conditional PDFs using the KDA. In Figs. 1A and 1B we plot the probability functions in (10) and (11) for two values of y_m, together with the corresponding approximations.
Figure 2: The segmentation problem. Consider a sequence S made up of two stationary subsequences S1 and S2, with n1 and n2 elements respectively. The problem consists in determining the value of n1, i.e., the point where the statistical properties change.
ii Monte Carlo integration

After approximating the PDFs, we have to compute the integrals in (4) to estimate MI. We recognize in these integrals an expectation value,

∫ φ_x(y) ln [ φ_x(y) / φ(y) ] dy = E_x[ ln ( φ_x(Y) / φ(Y) ) ], (12)

that can be estimated by Monte Carlo integration [15],

E_x[ ln ( φ_x(Y) / φ(Y) ) ] ≈ (1 / n_x) Σ_{y_j ∈ C_x} ln [ φ_x(y_j) / φ(y_j) ]. (13)

Here the sum is again restricted to the y_j values associated with a particular x value. Note that in this sum we make use of the kernel approximation of the conditional PDFs in (6). Substituting both approximations we finally get

Î(X, Y) = Σ_x p̂(x) (1 / n_x) Σ_{y_j ∈ C_x} ln [ φ̂_x(y_j) / φ̂(y_j) ]. (14)

iii Sequence segmentation

The JSD is a measure of dissimilarity between probability distributions. Originally proposed by Burbea and Rao [16] and Lin [17] as a symmetrized version of the Kullback-Leibler divergence [1,18], a generalized weighted JSD between two PDFs f_1, f_2 is defined as

JS_π[f_1, f_2] = H[π_1 f_1 + π_2 f_2] − π_1 H[f_1] − π_2 H[f_2], (15)

with π_i arbitrary weights satisfying π_1 + π_2 = 1.
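Putting the pieces together — the frequency estimate for p(x), Gaussian-kernel estimates for the conditional and marginal densities, and a sample average of the log-ratio as in (14) — the estimator might look as follows in Python (a sketch under our own naming, not the authors' code; note that Σ_x p̂(x) (1/n_x) Σ collapses into a single average over all n pairs):

```python
import numpy as np

def mi_discrete_continuous(x, y):
    """Estimate MI (in nats) between a discrete sequence x and a
    continuous sequence y: kernel conditional densities plus a Monte
    Carlo average of ln(phi_x(y_j) / phi(y_j)) over all pairs."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    n = y.size

    def kde(points, sample):
        # Gaussian KDA with the rule-of-thumb bandwidth of the text.
        m = sample.size
        h = sample.std(ddof=1) * (4.0 / (3.0 * m)) ** 0.2
        z = (points[:, None] - sample[None, :]) / h
        return np.exp(-0.5 * z ** 2).sum(axis=1) / (m * h * np.sqrt(2 * np.pi))

    values = np.unique(x)
    p = {v: np.mean(x == v) for v in values}          # p_hat(x)
    cond = {v: kde(y, y[x == v]) for v in values}     # phi_hat_x at every y_j
    marg = sum(p[v] * cond[v] for v in values)        # phi_hat at every y_j
    total = 0.0
    for v in values:
        idx = (x == v)
        total += np.sum(np.log(cond[v][idx] / marg[idx]))
    return total / n
```

For mutually dependent pairs (e.g., y drawn from a Gaussian whose mean depends on a binary x) the estimate is clearly positive, while for independent pairs it stays near zero.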
Here H is the Gibbs-Shannon entropy, defined for continuous-range variables as

H[f] = −∫ f(y) ln f(y) dy. (16)

As shown by Grosse et al. [2], the JSD may be interpreted as the MI between a discrete and a continuous variable, by identifying the weights π_i with the discrete variable probability in (1a),

π_i = p(x_i), (17)

and the probability densities f_i(y) with the conditional densities in (3),

f_i(y) = φ_{x_i}(y). (18)

With these identifications, the functionals in (15) and (4) are the same. The JSD and several generalizations have been successfully applied, for discrete-range sequences, to the sequence segmentation problem: the partition of a non-stationary sequence into stationary subsequences. We propose here the extension of this method to continuous-range sequences without resorting to discrete mapping, wavelet decomposition or any other indirect method of estimation of the probability distribution.
The procedure for sequence segmentation may be stated in the following way: let us consider a sequence S with n elements, made of two stationary subsequences S_1 and S_2 with n_1 and n_2 values respectively (n_1 + n_2 = n), schematically illustrated in Fig. 2. The aim is to determine the value of n_1, i.e., the position of the last element in S_1. In the algorithm proposed here we define a sliding window of fixed width over the sequence. The window is divided into two segments, each including ν_1 elements (see Fig. 3). We define the window position as that of the last element in the left section of the window. This window is displaced over the sequence, and the window position where the JSD reaches its maximum value is taken as the segmentation point.

Figure 3: The sliding window method. A sliding window is defined for sequence segmentation. The window is divided into two subwindows of equal size. The center of the window is considered as the window position. The window is displaced along the sequence and the JSD between the subwindows is calculated. The segmentation point is identified as the window position at which the JSD has its maximum value.
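The sliding-window scheme above can be sketched as follows (Python; the JSD between the two subwindows is computed through the MI identity of Section II, with weights proportional to the subwindow sizes — a minimal illustration, with function names of our own choosing):

```python
import numpy as np

def jsd_two_samples(a, b):
    """Estimated JSD between two continuous samples, via the MI identity
    (weights pi_i = n_i / n): kernel densities plus a Monte Carlo average
    of the log-ratios, each sample evaluated against the mixture."""
    def kde(points, sample):
        m = sample.size
        h = sample.std(ddof=1) * (4.0 / (3.0 * m)) ** 0.2
        z = (points[:, None] - sample[None, :]) / h
        return np.exp(-0.5 * z ** 2).sum(axis=1) / (m * h * np.sqrt(2 * np.pi))

    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = a.size, b.size
    pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)
    fa_a, fb_a = kde(a, a), kde(a, b)   # both densities evaluated on sample a
    fa_b, fb_b = kde(b, a), kde(b, b)   # both densities evaluated on sample b
    mix_a = pi1 * fa_a + pi2 * fb_a
    mix_b = pi1 * fa_b + pi2 * fb_b
    return (pi1 * np.mean(np.log(fa_a / mix_a))
            + pi2 * np.mean(np.log(fb_b / mix_b)))

def segment(seq, width):
    """Slide a window of `width` elements (split into two halves) along
    `seq`; return the split position where the JSD between the halves
    is maximal (the estimated segmentation point)."""
    half = width // 2
    best_pos, best_jsd = None, -np.inf
    for pos in range(half, len(seq) - half + 1):
        d = jsd_two_samples(seq[pos - half:pos], seq[pos:pos + half])
        if d > best_jsd:
            best_pos, best_jsd = pos, d
    return best_pos
```

On a 500-element sequence made of two Rayleigh subsequences with well-separated means, the maximum of the JSD falls at (or very near) the true change point.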

III Assessment results
In this section we present the results of our assessment of the proposed method by considering two applications: the detection of mutual dependence between two RV sequences and the segmentation of a sequence.
In the first case we generate sequences of two jointly distributed variables, one of discrete range and one of continuous range, and then we compute the MI between these variables. In the second case we consider sequences made of two subsequences generated from different distributions. We detect the segmentation point following the procedure described in the previous section. We also apply the method to detect the edges between homogeneous regions in SAR images.

i Mutual information between a discrete and a continuous variable
We computed the MI between discrete and continuous variables. We generated 100 data sets, sampling 1000 (x, y) pairs either from the distribution in (10), with different values of y_m or σ_g, or from the joint distribution in (19), defined in terms of the step function Θ(y) (Θ(y) = 0 for y < 0 and Θ(y) = 1 for y > 0), with different values of y_m or a. We estimated the MI, I(X, Y), from each set by the method described in the previous section. Given that we are sampling the data pairs from known distributions, we are also able to calculate the MI from the analytical expressions. In this way we may compare the results obtained from the approximation with the corresponding analytical results. In addition, we calculated the MI for samples of statistically independent variables to establish a significance value for the MI of the dependent variables. The analytical value in this case is zero, as already mentioned. The results of the calculation are shown in Fig. 4 for the distribution in (10) and in Fig. 5 for the distribution in (19). We include the average value of MI over the 100 data sets for the different values of the parameters; the bars correspond to the standard deviation in each set. A small underestimation of the MI value can be seen in the latter case. This may be attributed to a shortcoming of the KDA at the borders of the interval of the uniform distribution. Nevertheless, it is still possible to detect mutual dependence between the discrete and the continuous value sequences.
To consider the effect of sample size, we repeated the experiment with the distribution in (10) for different values of n, the number of data pairs in each set. We again generated 100 data sets of n data pairs each. The results are shown in Fig. 6 for three sets of parameters. A slightly increasing overestimation of MI can be appreciated as n decreases.

Figure 6: Mutual information estimation for the distribution in (10). The dots represent the average value for 100 data sets of different numbers of (x, y) pairs. Bars indicate the standard deviation, and dashed lines represent the analytical values of MI for the different sets of parameters.

Finally, we considered a usual situation, in which only one sample of data pairs is available. We sampled 1000 pairs from the distribution in (10), from the distribution in (19), and from a third distribution. For each sample we estimated the MI by the approximate method in (14). To set up a significance value for each sample, we generated 100 data sets of 1000 pairs of independent variables. The discrete values were sampled from the distribution p̂(x) = n_x / n, where n_x gives the number of times that the value x appears in the original sequence, and the continuous values were sampled, independently of the value of x, from a Gaussian distribution with mean m and variance s², where m is the mean value of the original sequence and s² the sample variance. We calculated the MI for each data set, and then the MI mean value and its variance. The results are included in Table 1. A clear difference can be seen between the MI of the dependent variables and those of the independent sequences.
ii Sequence segmentation

To test the sequence segmentation method, we generated sets of 500 sequences of 500 values each, divided into two subsequences of 250 values each. The sequences were generated from Rayleigh distributions with a different mean value for each subsequence. The mean value of the first subsequence is denoted by m_l, and the mean value of the second by m_r; we define the ratio of the mean values as r_m = m_l / m_r. Using the sliding window method, we analyzed a set with r_m = 5 with several window widths. In Fig. 7 we present the average value across the 500 sequences of the JSD at each window position, for the different widths considered. The average JSD has a maximum at position 250, the segmentation point, even for a narrow window of 20 elements (10 elements in each subwindow), although in this case statistical fluctuations are more noticeable. To test the sensitivity of the method we generated sets with r_m = 1.2, 1.5, 2, 5 and 10. The results of the algorithm, with a window of 50 elements, are included in Fig. 8. Even for the smallest ratio considered, the segmentation point can be detected.
Finally, we present an example of application of the segmentation algorithm to the detection of the edge between homogeneous regions in a SAR image. In SAR images the backscatter is affected by speckle, a multiplicative noise. In homogeneous regions, this noise in the backscatter amplitude is modelled by a Rayleigh distribution. In Fig. 9 we include a section of a SAR image of an Antarctic region, and the boundary detected between water and ice. On the right, a plot of the backscatter amplitude values along the highlighted lines in the image is included, together with the JSD. The detected boundary coincides well with the contour in the image.

IV Discussion and conclusions
In this paper we have presented a method for computing Mutual Information (MI) between discrete and continuous data sets or, alternatively, the JSD between continuous-range data sets. The algorithm developed gives a measure of dissimilarity without resorting to an indirect method like those proposed in [12,13]. Neither is it necessary to have the continuous values ordered, as in the nearest neighbour method [4,6]. In fact, the calculation in (14) is based only on the registered data as they were recorded.
The measure may be applied to two related problems: on the one hand, quantifying the mutual dependence between discrete and continuous data sets; on the other, quantifying the dissimilarity between two continuous data sets, as discussed in Section II. In Section III we applied the method to artificially generated pairs of variables, finding good agreement with the corresponding analytical values, as shown in Figs. 4 and 5, although a systematic underestimation occurs, mainly when the difference is given by the width in uniform distributions (Fig. 5B). We attribute this discrepancy to the abrupt decay of the uniform distribution at the borders of its interval, while the KDA with a Gaussian kernel extends to infinity. The MI values in these cases of mutually dependent variables are clearly distinguishable from the MI values of independent variables. We also considered the dependence of the results of this method on the length of the sequence. In Fig. 6 a slightly increasing overestimation of MI is seen with decreasing length. Nevertheless, there is good agreement for sequences of more than 400 pairs.
In real situations we frequently have only one sequence of (X, Y ) pairs. We have proposed a method for establishing a significance value by generating 100 sequences of independent variables with probability distributions given by the estimated marginal distribution for the discrete variable, and by a Gaussian distribution for the continuous variable with the same mean value and variance as the marginal distribution of the original sequence. We have considered sequences generated from three distributions. In all three cases MI establishes a clear difference between dependent and independent sets, as shown in Table 1.
It has been shown that the Jensen-Shannon divergence (JSD) is equivalent to MI [2]. Therefore, the calculation method developed here is also suitable for computing the JSD between two continuous-range data sets, and in this form the JSD may be applied to the sequence segmentation problem, as proposed in Section II-iii, where we suggested a method based on a fixed-length sliding window. We considered the segmentation of artificially generated sequences in Section III-ii. The JSD average at each position in the sequences exhibits a maximum at the segmentation point, as shown in Fig. 7. As we continue this work we will address the problem of comparing and analyzing electrophysiological signals. The segmentation method may also be of interest in detecting borders in images. Work along these lines will be published elsewhere.