Bayesian regression of piecewise homogeneous Poisson processes

In this paper, a Bayesian method for piecewise regression is adapted to handle counting processes data distributed as Poisson. A numerical code in Mathematica is developed and tested analyzing simulated data. The resulting method is valuable for detecting breaking points in the count rate of time series for Poisson processes.


I. Introduction
Bayesian statistics have revolutionized data analysis [1]. Techniques like the Generalized Lomb-Scargle Periodogram [2] allow us to obtain oscillation frequencies of time series with unprecedented accuracy. The Gregory and Loredo method [3] goes further allowing us to find and characterize periodic signals of any period and shape.
To detect non-periodical variations, the Exact Bayesian Regression of Piecewise Constant Functions by Marcus Hutter (hereafter Hutter's method) [4] is valuable. It permits to estimate the most probable partition of a data set in segments of constant signals, determining the number of segments and their borders, and in-segments means and variances. Hutter's method works with two continuous distributions: Normal, and Cauchy-Lorentz. The latter -the canonical example of a pathological distribution with undefined moments-, is also suitable to analyze data with other symmetric probability distributions, especially with heavy tails.
In the case of counting processes, especially for * E-mail: dsevilla@fceia.unr.edu.ar low rates, when data consist in non-negative small integers, methods specially designed to discrete probability distributions are necessary. Some regression methods, specially for non-homogeneous Poisson processes [5], were developed.
In this paper, Hutter's method is adapted for analyzing data distributed as Poisson. The results are summarized in a code in Mathematica [6]. It can be used to analyze data of several physical processes which follow the Poisson distribution (e.g., detection of photons in X-ray Astronomy, particles in nuclear disintegration, etc.), if sudden changes in detection rates are suspected.

II. Method
Hutter's method is summarized in Table 1 of Ref. [4] in a pseudo C code which is divided in two blocks. The first one calculates moments A k ij with k = 0, 1, 2 of the PDF of the statistical models for segments of data D ij := {n i+1 , . . . , n j }. The second one performs the regression from moments A k ij . The code developed in this work is divided in three blocks.
As the members of the Poisson distributions family are identified by one parameter -the mean rate r of the Poisson process-, the PDF of the models 070018-1 where P (r|I) is the prior of parameter r, P (D ij |r) is the likelihood of segment D ij for a given r, P (D ij ) is the global likelihood of the family, and I represents a prior information. Usually, the prior information consists of global quantities calculated from D := D 0N , i.e., from all the data set. For Poisson processes, only one quantity is necessary: the mean rater. Considering the conjugate prior of the Poisson distribution [7], the prior results For a Poisson process with rate r, the likelihood of a segment D ij is So, the moments of the posterior can be expressed in an analytical form As the second block of Hutter's code only needs the moments A k ij as inputs, it could work properly with no changes. It computes the evidence, the probability for k segments and its MAP estimation k, the probability of boundaries locations and the MAP locations of thek boundaries, the first and second in-segment moments, and an interesting regression curve that smooths the final result.
Nevertheless, for our specific problem, once the segments boundaries are obtained, we can estimate their means and variances straightforwardly, so we only use a part of Hutter's second block, which is shown in code block 2. The logical of the algorithm is explained in Ref. [4]. Finally, we calculate the in-segments means and estimate their statistical errors. If one segment of m elements has n counts, its mean rate is n/m, and its variance is approximately n/m 2 .
To determine the accuracy of the fit, it is also useful to estimate the uncertainties of the boundaries locations. A reasonable estimation of the uncertainties can be obtained from the second moments of the probability distributions of boundaries locations in the neighborhoods of the breaking points. If the probabilities of the boundaries locations are given by function B(k), the uncertainties approximately result where a is large enough to consider all the width of the peak corresponding to boundaryk, but small enough to avoid including peaks of other boundaries. In practice, if,k andl are the positions of consecutive boundaries, a good value for a is the minimum of (k −)/2 and (l −k)/2. Code block 3 uses these approaches to calculate the constant piecewise regression of data, and to estimate its statistical errors. The inputs are c[k], khat, B[i] and that. The outputs are list reg, which represents the best fit, and lists re1 and re2, which represent the minimum and maximum estimations considering the statistical errors of reg.    we can see that the probability of having 2 breaking points is very high. Figure 1 (top) also shows, in red line, the probability for the boundaries locations. Applying code block 3, we obtain the regression [ Fig. 1 (bottom), black dashed curve] and its error estimation [ Fig. 1 (bottom), gray zone]. The continuous blue line in Fig. 1 (bottom) indicates the rates used in simulation.

III. Applications and Discussion
The regression in the example above fits very well with the rate curve used in simulation. But sometimes regressions result qualitatively different to the rate curve, showing more or less breaking points, even for data simulated in the same conditions. This effect is due to chance. To show this issue, 2000 simulations with the same conditions were performed. In 992 of them, two breaking points were found. In the others, there were found zero (44), one (208), three (419), four (155), five (73), six (41), and seven or more (68) breaking points. For cases in which two breaking points were found, statistics of the most likely boundaries loca- tions and in-segment mean rates were calculated. Figure 2 shows histograms of those statistics. Figure 2 (top) shows histograms for boundaries locations. It is clear that the bigger the step, the smaller the uncertainty on its location. Figure 2 (bottom) shows histograms of in-segment rates. It is clear that the greater the rate, the smaller its relative statistical error. Figure 3 shows data and boundaries location probabilities for a simulation similar to the previous ones, but now with rates 3.0, 1.0 and 2.0. Comparing Fig. 1 (top) and Fig. 3 (top), we can see that in the latter one the boundaries locations are found more accurately.
Again, 2000 simulations with the same conditions were performed. In 1159 of them, two breaking points were found, while in the others, there were found zero (1), one (33), three (499), four (167), five (75), six (32), and seven or more (34) breaking points. Figure 4 shows histograms of the statistics of the most likely boundaries locations and in-segment rates for the simulations with two  It is important to note that the probability for the real curve to be completely inside the region defined by the error estimations of the regression is significantly less than one. It is easy to see why: if the errors were independent and equal to the standard error, the probability of satisfying n error conditions simultaneously would be 0.68 n . But even the actual probability could be lower, since it is clear that the errors must be dependent. Nevertheless, the error estimations presented here are useful to get an idea of the accuracy of the regression.
Finally, the capability to detect a breaking point with this code was tested for different count rates. To do this, simulated data sets of a single step in the count rate were used. Data sets consist in 100 Poisson distributed elements, the first 50 for a rate r 1 and the last 50 for a rate r 2 . 1000 simulations were   Table 1 shows that the smaller is the mean rate difference and the smaller are the mean rates, the more difficult is the detection of the step. This result is expected because in a Poisson distribution the variance is equal to the mean.

IV. Conclusions
In this work, a code for Bayesian regression of piecewise constant functions was adapted to handle data from Poisson processes. For this purpose, equations for calculating the moments of the posteriors of segments of data were found through Bayes theorem, considering the conjugate prior of the Poisson distribution as prior. These results, as well as part of Hutter's method, were used to calculate the most probable number of segments and their boundaries. Procedures for calculating in-segments mean rates and the uncertainties of mean rates and boundaries locations are also provided. The resulting method is summarized in a code in Mathematica.
The code was applied to simulated data. Firstly, two examples with tree segments were analyzed. The code performed well in both cases considering the dispersion of data, and the results improved in the case of higher mean rates and mean rates differences. This occurs because of the statistical dispersion of Poisson distributed data, which is greater than the mean rate if the mean rate is lower than one.
Finally, simulations of data of a single step were analyzed for different rates, and statistics of the regressions with only one breaking point are presented in a table. This table shows the effect of the rates and rate differences in the regression accuracy, and, together with the errors estimations provided by the code, can serve as an indicator of the reliability of the method.
Supplementary material including the source code for the algorithms can be found at the journal website [8].