@wiso · Last active August 29, 2015
Background
==========
Many shape analyses face the problem of parametrizing the background. Many approaches have been used in the past, and new ideas should be investigated, trying to define common recipe(s) while taking into account that different analyses can have different needs.
It would be good to extend the discussion to ATLAS, for example inside the statistical forum, but the concern is that, as experienced in the past, such an approach may not converge. So the idea is to start the discussion with a few people working on similar analyses ($\gamma\gamma$, $jj$, $\gamma j$, diboson, ...), to list the various possibilities and to evaluate the pros and cons.
These analyses share a similar background distribution of the invariant mass (smooth, decreasing) with a large number of events, and all search for a resonant signal.
Some functional forms have a theoretical motivation, but the detector effects are not negligible, so in general it is not mandatory to use a functional form that comes from theoretical arguments.
The main topics of the discussion are:
1. How to check whether a functional form is suitable, which means:
    1. is it able to describe the background?
    2. is it too elastic (does it have too many degrees of freedom) with respect to the data that are fitted?
    3. the background model can be compared to many samples: real data (possibly in control regions) / simulation / an Asimov dataset from a fit to data. What is the benchmark? <small>The problem with using real data is that the statistics are usually not sufficient for precise studies, and the data can contain signal. The use of simulation has the same statistical problem (in fact, to generate O(100M) events for the $H\to\gamma\gamma$ background, such studies use a "smeared MC", a non-full (sort of fast) simulation smeared to take detector effects into account). The problems with an Asimov dataset generated from a function fitted to data are two: first, signal can be present in the data; second, a choice of functional form (usually with a number of degrees of freedom much higher than that of the function to be tested) must be made.</small> Put another way: if I generate an Asimov dataset with function $f$ (which has many NDOF) and I use function $g$, which is completely different from $f$ and simpler, what is the impact on the fitted POI?
2. How to assign an error on the final POI ($\mu$, $\sigma\times Br$, $N_{sig}$, ...) reflecting the fact that the choice of the functional form may not be perfect. <small>For example, in the image below different functional forms give different values of $\mu_{H\to\gamma\gamma}$, even if some of them should be rejected since they do not fit the background decently.</small>
![Higgs peak fitted with many functional forms](http://precision-turra.mi.infn.it/peak_fits.png)
The methods can be classified in two groups:
* using parametric functions fitted to data to model the background
* using non-parametric templates (histograms) from the simulation
and consequently different tools are used (HistFactory / HistFitter or a manual implementation of the workspace).
Choice of the background function
-----------------------------------------
Evaluating how well a function fits the data is a fairly easy task and can be done, for example, with the probability of the reduced-$\chi^2$.
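As a concrete illustration, the $\chi^2$ probability can be computed from the fit $\chi^2$ and the number of degrees of freedom. A minimal pure-Python sketch follows; the closed form used below is valid only for an even number of degrees of freedom, and in practice one would use e.g. ROOT's `TMath::Prob` or `scipy.stats.chi2.sf`:

```python
import math

def chi2_sf(x, ndof):
    """P(chi^2 > x) for even ndof, via the closed-form series
    exp(-x/2) * sum_{i < ndof/2} (x/2)^i / i!"""
    assert ndof % 2 == 0, "closed form holds only for even ndof"
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(ndof // 2))

# toy example: a fit returned chi2 = 25.0 with 20 degrees of freedom
p_value = chi2_sf(25.0, 20)  # a very small p-value would signal a bad fit
```

A p-value close to zero flags a function unable to describe the background; a p-value suspiciously close to one can flag an over-elastic function.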
The diboson analysis evaluates this by fitting the real data with the proposed functional form. Then the errors on the fitted parameters are used to build a band (probably using [RooAbsReal::plotOn with the RooFit::VisualizeError option](https://root.cern.ch/root/html/RooAbsReal.html#RooAbsReal:plotOn) <font color='pink'>(to be confirmed)</font>). If the data are covered by the band, the fit is considered good. <small>In principle this approach is affected by the way the function is written, for example $[0] + [1] x$ vs $[0] + [1] (x - q)$.</small>
A common approach is to first define a set of functional forms, e.g. polynomials, exponentials of polynomials, Laurent series, ..., and to fit them to data. Of course functions with a higher nDOF will fit better, but this means having additional free parameters in the final fit and so a larger error; in addition, with a large nDOF there is the risk of fitting statistical fluctuations. A common procedure is the [F-test](https://en.wikipedia.org/wiki/F-test#Regression_problems). Since some analyses ($jj$) process the data continuously, the test should be repeated as the statistics increase.
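The F-test step can be sketched as follows: given the $\chi^2$ of two nested fits to the same binned spectrum, the F statistic measures whether the extra parameters improve the fit significantly. This is a hedged sketch with invented numbers; the p-value would come from the F distribution (e.g. `scipy.stats.f.sf`), which is not reimplemented here:

```python
def f_statistic(chi2_simple, chi2_complex, npar_simple, npar_complex, nbins):
    """F statistic comparing two nested fits on the same binned data.

    chi2_*  : chi^2 of each fit (the complex fit is never worse)
    npar_*  : number of free parameters of each function
    nbins   : number of fitted bins
    """
    numerator = (chi2_simple - chi2_complex) / (npar_complex - npar_simple)
    denominator = chi2_complex / (nbins - npar_complex)
    return numerator / denominator

# hypothetical numbers: a 3-parameter vs a 4-parameter function on 30 bins
F = f_statistic(35.2, 28.1, 3, 4, 30)
```

If the corresponding p-value is above a chosen threshold, the extra parameter is not justified and the simpler function is kept.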
At this point a candidate is selected from every family of functional forms (e.g. polynomial with 5 DOF, exponential with 3 DOF, ...). The candidates should then be compared in terms of goodness of fit, taking into account the fact that they have different nDOF.
In addition, some analyses use a criterion based on the *spurious signal*. This is defined as the number of signal events fitted in an s+b fit on a dataset which does not contain signal (usually a simulation).
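The spurious-signal test can be illustrated with a small toy: fit an s+b model to a signal-free spectrum and look at the fitted signal yield. In the sketch below (all shapes and numbers are invented for illustration) the background model is a straight line, the signal is a fixed-shape Gaussian at a hypothetical mass of 125, the "data" are a noiseless falling exponential, and the fit is a linear least squares over the three amplitudes:

```python
import math

def gauss(x, mu=125.0, sigma=2.0):
    """Fixed-shape signal template (unit amplitude)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))) / M[r][r]
    return x

# signal-free "Asimov" spectrum: a falling exponential, no fluctuations
xs = [100.0 + i for i in range(61)]
data = [1000.0 * math.exp(-x / 25.0) for x in xs]

# s+b model b0 + b1*x + s*gauss(x): least squares via the normal equations
cols = [[1.0] * len(xs), xs, [gauss(x) for x in xs]]
ATA = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(3)]
       for i in range(3)]
ATy = [sum(u * y for u, y in zip(cols[i], data)) for i in range(3)]
b0, b1, s = solve3(ATA, ATy)

# fitted signal events on a sample that contains none: the spurious signal
spurious_yield = s * sum(gauss(x) for x in xs)
```

Because the linear background cannot follow the curvature of the exponential, the signal template absorbs part of the mismodelling and the fitted yield is non-zero even though no signal was present.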
How to incorporate the systematics of the background
----------------------------------------------------------------
A non-perfect choice of the function modelling the background can lead to a systematic uncertainty on the POI (see the first figure above).
### Spurious signal
The spurious signal can be included in the likelihood to take into account the fact that the background function can partially absorb the signal:
$$\text{number of signal events} = \mu N_{SM} + \sigma_{ss}\times \theta_{ss} $$
where $\sigma_{ss}$ is the spurious signal for the chosen functional form and $\theta_{ss}$ is a normalized, constrained nuisance parameter.
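In code, this yield parametrization is a one-liner; a trivial sketch (names are invented, and in a real workspace the constraint on $\theta_{ss}$ would be a Gaussian term in the likelihood, written here as its negative log):

```python
def signal_yield(mu, n_sm, sigma_ss, theta_ss):
    """Number of signal events: mu * N_SM plus the spurious-signal term."""
    return mu * n_sm + sigma_ss * theta_ss

def constraint_nll(theta_ss):
    """Standard-normal constraint on the normalized nuisance parameter."""
    return 0.5 * theta_ss ** 2

# e.g. mu = 1 with N_SM = 100 expected events, a spurious signal of 8 events,
# and the nuisance pulled by half a sigma:
n_sig = signal_yield(1.0, 100.0, 8.0, 0.5)
```

The fit then trades the penalty `constraint_nll` against how much of the residual mismodelling the extra yield can absorb.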
### Discrete profiling
$H\to\gamma\gamma$ in CMS takes into account the uncertainty due to the unknown background functional form by profiling the choice with an additional discrete NP ([main documentation](http://arxiv.org/abs/1408.6865), [example test](https://indico.cern.ch/event/374085/contribution/2/material/slides/0.pdf), [minimal code](http://nbviewer.ipython.org/gist/wiso/497b3ea55ccc639db829)). The method has been developed for measurements, but it can be transposed to discovery and exclusion.
The advantage of this method is that it removes the need to choose a single background functional form (even if the set of functional forms to be considered still has to be chosen). One of the problems is that the resulting profile likelihood has some angular points that should probably be smoothed in some way. <small>For example, in the image below `exp pol2` and `poly5` contribute to the $1\sigma$ error, but with a slightly different curve `poly5` could be outside the $\Delta\chi^2=1$ variation.</small>
![Profile likelihood envelope from multiple functional forms](http://precision-turra.mi.infn.it/multiple_fit.png)
Some care should be taken when comparing functions with different nDOF.
When transposing this method to discovery, this means <font color="pink">(to be checked)</font> repeating the procedure multiple times with different background functional forms.
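The envelope construction can be sketched as follows: each candidate function contributes a $\chi^2(\mu)$ curve plus a penalty for its number of free parameters (one unit per parameter, one of the corrections discussed in the CMS reference), and the discrete profiling takes the minimum over the candidates at each $\mu$. The curves below are toy parabolas with invented numbers, not real fits:

```python
# toy chi2(mu) curves for two candidate background functions (invented numbers)
candidates = {
    "exp_pol2": dict(mu_hat=1.10, width=0.30, chi2_min=40.0, npars=3),
    "poly5":    dict(mu_hat=0.90, width=0.25, chi2_min=36.0, npars=6),
}

def corrected_chi2(f, mu, penalty_per_par=1.0):
    """chi2 of one candidate at this mu, plus a penalty for its free parameters."""
    return (f["chi2_min"] + ((mu - f["mu_hat"]) / f["width"]) ** 2
            + penalty_per_par * f["npars"])

def envelope(mu):
    """Discrete profiling: minimize over the candidate functions at each mu."""
    return min(corrected_chi2(f, mu) for f in candidates.values())

# scanning mu shows the envelope switching between functions, which is
# exactly what produces the angular points discussed above
scan = [(0.5 + 0.05 * i, envelope(0.5 + 0.05 * i)) for i in range(21)]
```

In this toy the best function near $\mu \approx 0.9$ is `poly5` despite its heavier penalty, while far from it `exp_pol2` takes over, so the envelope is non-smooth at the crossing.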
### Evaluate the bias on POI in a pessimistic case
Another approach is to fit the data (real or simulated) with a very elastic function (it can be a combination of different functional forms). Then some small random changes are made to the parameters to distort the spectrum, or some non-linear transformations are applied. This function is taken as the true model of the background.
An Asimov dataset is generated from the true model and a signal is injected on top of it.
The functional form to be tested (which is simpler than the true model) is used to build an s+b model. This model is fitted to the Asimov dataset, and the difference between the fitted $s$ and the injected $s$ is taken as the error on $s$ due to the choice of the background functional form.
This method is arbitrary and can lead to pessimistic results.
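A minimal version of such a bias study (all shapes and numbers invented): take a distorted exponential as the true background, inject a known Gaussian signal, estimate the background under the peak with a deliberately simpler model (here a pure exponential, i.e. a log-linear fit to the sidebands in closed form), and quote the difference between estimated and injected signal as the bias:

```python
import math

SIG_MU, SIG_SIGMA, SIG_YIELD = 125.0, 2.0, 50.0

def true_background(x):
    """'True' model: an exponential with an invented quadratic distortion."""
    return 1000.0 * math.exp(-x / 25.0) * (1.0 + 0.0005 * (x - 130.0) ** 2)

def signal(x):
    """Injected Gaussian signal, normalized to SIG_YIELD events (unit bins)."""
    return SIG_YIELD * math.exp(-0.5 * ((x - SIG_MU) / SIG_SIGMA) ** 2) \
        / (SIG_SIGMA * math.sqrt(2.0 * math.pi))

xs = [100.0 + i for i in range(61)]
data = [true_background(x) + signal(x) for x in xs]     # Asimov: no fluctuations

window = [i for i, x in enumerate(xs) if 119.0 <= x <= 131.0]   # mu +- 3 sigma
sideband = [i for i in range(len(xs)) if i not in window]

# simpler background model: a straight line in log(count), ordinary least
# squares on the sidebands (closed-form slope and intercept)
sx = [xs[i] for i in sideband]
sy = [math.log(data[i]) for i in sideband]
mx, my = sum(sx) / len(sx), sum(sy) / len(sy)
slope = sum((x - mx) * (y - my) for x, y in zip(sx, sy)) \
    / sum((x - mx) ** 2 for x in sx)
intercept = my - slope * mx

bkg_pred = [math.exp(intercept + slope * xs[i]) for i in window]
estimated = sum(data[i] for i in window) - sum(bkg_pred)
injected = sum(signal(xs[i]) for i in window)
bias = estimated - injected   # error caused by the too-simple background model
```

Here the pure exponential cannot reproduce the distortion, so the extrapolation into the signal window is off and part of the injected signal is wrongly absorbed into (or subtracted with) the background, which is precisely the effect this method tries to quantify.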
### Using theoretical uncertainty and simulations
With respect to the other methods, in this case the background estimation is *not real-data-driven*. The background model comes from full simulation (corrected, rescaled, ...). Usually it is not parametrized: the histogram of the observable is directly taken as the pdf of the background.
The considered uncertainties are the ones coming from theory (PDF, scales, ...), potentially also taking the difference between different generators. <small>The prerequisite of this method is that the errors cover the data as well as the approach with analytic functions does.</small>
Practically, this is implemented in `HistFactory` by providing a nominal histogram and several distorted histograms. One important point is to define how the interpolation between the nominal and the distorted histograms should be done.
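For example, one common choice in template-based tools is bin-by-bin "vertical" interpolation between the nominal and the $\pm 1\sigma$ distorted histograms as a function of the nuisance parameter $\alpha$. A sketch of the piecewise-exponential variant (toy bin contents, invented numbers):

```python
def interp_bin(nominal, up, down, alpha):
    """Piecewise-exponential vertical interpolation for one bin:
    alpha = +1 gives the up histogram, -1 the down one, 0 the nominal."""
    if alpha >= 0.0:
        return nominal * (up / nominal) ** alpha
    return nominal * (down / nominal) ** (-alpha)

nominal = [100.0, 80.0, 60.0]
up      = [110.0, 84.0, 66.0]   # +1 sigma variation (e.g. a scale uncertainty)
down    = [ 92.0, 77.0, 55.0]   # -1 sigma variation

half_sigma = [interp_bin(n, u, d, 0.5) for n, u, d in zip(nominal, up, down)]
```

The exponential form keeps the interpolated bin content positive, which a plain linear interpolation does not guarantee; the choice between such schemes is exactly the point raised above.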
#### Application to functional form
Actually the method above can also be used with functional forms; the problem is to find a way to morph the nominal function into the one derived on the non-nominal background.
A functional form is first fitted to the nominal simulation, then to the distorted simulations (PDF, scales, detector effects, ...). The parameters of the nominal fit will be quite different from those obtained on the distorted samples. Interpolating the parameters directly is quite difficult, since they are usually strongly correlated, so another way should be found to interpolate the functions between the various distorted samples. The variation can also be constrained.
In this respect the method is very similar to the one using histograms.
Other statistical topics
----------------------------
### Usage of `BumpHunter`
Some analyses ($jj$) use `BumpHunter` as the discovery tool. Even if this tool is fast and generic, it is not clear whether it is sufficient or whether a modelization is needed. In addition it cannot set exclusions, which means that the same analysis uses two very different methods to set limits and to claim discoveries.
### Fit at the limit of the spectrum
The most interesting regions are usually the ones at the upper edge of the fitted spectrum. This means that only one sideband is available, and so the extrapolation into the signal region is not reliable.
The easiest solution is to limit the fit range; this should be done by requiring a limit on the expected error.
### Moving fitting range
In some analyses ($\gamma\gamma$) it is quite difficult to find a unique analytic function that fits the whole range well. One solution is to define a smaller fit range around the tested mass value. Even if this makes it easier to find a functional form, it means that less information is used by the fit and that the error on the signal estimate due to the background can increase, since the background is less constrained.
### Blindness
If an analysis is blind (no access to data), no direct check of the agreement between the background model and the data can be done. The usual approach is to use validation regions, but even this is not common within the considered analyses and it is not clear how to do it.