# Multidimensional Unfolding

## Intro

Measured distributions involve effects from higher-order QED radiative corrections and detector smearing. An unfolding method is applied in order to correct for kinematic event migration between bins.

### Detector Smearing

Due to multiple scattering of the particles emerging from the physics process, the kinematic variables of the event may be reconstructed in adjacent bins at higher or lower values with respect to the true values. Particle interactions in the detector put a physical limit to the detector resolution. The detector resolution is simulated by a detailed Monte Carlo model involving all relevant detector materials. The size of a bin should obviously not be smaller than the actual detector resolution, otherwise the large fraction of migrating events would cause artificially large inflated uncertainties.

The four QED processes that contribute to the radiative corrections are illustrated in this figure: Initial and final state radiation as well as vacuum polarization and vertex correction. The photon radiation of the incoming (initial state radiation) and outgoing (final state radiation) leptons results in a difference between the apparent kinematic variables reconstructed from the incoming and outgoing leptons and those at the virtual-photon lepton vertex. Due to the emission of photons the apparent kinematic variables differ from the values on Born level such that not only events from processes under study may pass all the analysis criteria but also events from other processes, e.g. radiative (quasi-)elastic processes in a DIS event sample.

## The unfolding formalism for unpolarised data sample

If the event sample is binned in the kinematic variable x (here x can be considered as a single variable or a short notation for more than one variable in a multidimensional analysis), let ${\displaystyle X=(X(1),...,X(n_{b}))}$ be the vector of measured yields in the ${\displaystyle n_{b}}$ x-bins, so that the total number of events is ${\displaystyle X_{tot}=\sum _{i}^{n_{b}}X(i)}$. Due to instrumental and radiative smearing, each event is characterized by two quantities: the measured kinematics x and the Born kinematics x'. As a consequence of the finite bin size, the observed bin ${\displaystyle i}$ corresponding to the measured x may differ from the original bin ${\displaystyle j}$ corresponding to the Born kinematics x'.

The migration of events between bins due to detector smearing as well as radiative corrections is simulated by a Monte Carlo in order to extract a Migration matrix ${\displaystyle n(i,j)}$, where ${\displaystyle n(i,j)}$ is the number of events generated in the Born-bin ${\displaystyle j}$ that are measured in the bin ${\displaystyle i}$ due to kinematical smearing. The migration matrix is a ${\displaystyle n_{b}\times (n_{b}+1)}$ matrix, where the additional column ${\displaystyle n(i,0)}$ contains the background events that fall in the measured bin ${\displaystyle i}$ from outside the acceptance.

In the unfolding procedure is also needed the definition of the Born vector ${\displaystyle n^{B}(j)=(n^{B}(1),...,n^{B}(n_{b}))}$ that contains all the generated events in $\displaystyle 4\pi$ and that does not include radiative contributions. This vector is not available from the same Monte Carlo simulation, as all the events whose smeared kinematics is outside acceptance are missing. Additionally, radiative effects do not conserve the total DIS cross section. Therefore it should be determined with the use of a statistically independent Monte Carlo simulation with both radiative and instrumental effects switched off. For clearness this Monte Carlo will be called now on the Born Monte Carlo.

With the Born distribution ${\displaystyle n^{B}(j)}$, it is possible to define the conditional probability that an event with born kinematics x' is observed with measured kinematics x:

${\displaystyle S(i,j)\equiv {\frac {n(i,j)}{n^{B}(j)}}}$

and ${\displaystyle S(i,j)}$ is called the Smearing matrix. The background column ${\displaystyle j=0}$ can be extracted from the matrix in order to have a square matrix ${\displaystyle S'(i,j)}$, called the truncated Smearing matrix, and a vector ${\displaystyle S(i,0)=n(i,0)/n^{B}(0)}$ describing the background contributions.

The truncated smearing matrix only depends on the detector and radiative simulations, which are well known, and, being a relative quantity, is independent on the Monte Carlo model (strictly true if the finite bin size is neglected). On the contrary, the background vector ${\displaystyle S(i,0)}$ strongly depend on the Monte Carlo model for the background.

Fundamental ingredients to apply an unfolding correction:

- ${\displaystyle X(i)=(X(1),...,X(n_{b}))\rightarrow }$ Measured yield vector;
- ${\displaystyle n(i,j)\rightarrow }$ migration matrix (Monte Carlo);
- $\displaystyle n_B(j)\rightarrow$
${\displaystyle 4\pi }$ yield vector (Born Monte Carlo)
- ${\displaystyle S(i,j)=n(i,j)/n_{B}(j)\rightarrow }$ ${\displaystyle (n_{b}\times (n_{b}+1))}$ Smearing matrix;
- ${\displaystyle S'(i,j)\rightarrow }$ $\displaystyle (n_b \times n_b)$
Truncated Smearing matrix;
- ${\displaystyle S(i,0)\rightarrow }$ Background yield vector.



#### Normalization

All the quantities listed above should be normalized.

In particular:

- to normalize MC: normalizing the two MC productions (reconstructed and born) to each other via IEVGEN can only be done if both productions were run with the same generator settings and no selector. The safest thing to do is to normalize them both to cross section level, see MC Analysis Techniques page for details.

- to normalize Data: see Cross section (data) page.

Here all vectors and matrices are treated as they were already normalized.

#### The procedure

If ${\displaystyle B(j)=(B(1),...,B(n_{b}))}$ is the correct, unknown born distribution of the measured yields, it is related to the measured distribution $\displaystyle X(i)$ by:

${\displaystyle X(i)=\sum _{j=1}^{n_{b}}S'(i,j)B(j)+S(i,0)n^{B}(0)=}$
${\displaystyle \sum _{j=1}^{n_{b}}S'(i,j)B(j)+n(i,0)}$

If the square matrix $\displaystyle S'(i,j)$ is not singular, it can be inverted in order to access the born distribution:

                   $\displaystyle B(j)=\sum_{i=1}^{n_b} S'^{-1}(j,i)[X(i)-n(i,0)]$
Unfolded yield distribution

Schematic view of unfolding radiative/acceptance procedure

#### Error propagation

The final covariance matrix on the unfolded yield, $\displaystyle U_B$ can be separated into the contribution from the measured yield and the contribution from the smearing matrix,

  $\displaystyle U_B = U_{B,X} + U_{B,S}$


##### Contribution from measured yield

The contribution to the covariance matrix for the Born distribution ${\displaystyle B(j)}$ propagated from the uncertainty on the measured yield is, in matrix notation,

${\displaystyle U_{B,X}=S'^{-1}V(S'^{-1})^{T}}$

where $\displaystyle V$ is the covariance matrix of the measured yields, and, as they are independent Poisson variables, $\displaystyle V$ is diagonal and ${\displaystyle V(k,k)=\delta ^{2}X(k)=X(k)}$, therefore:

                  $\displaystyle U_{B,X}(i,j)=\sum_k^{n_b}S'^{-1}(i,k)X(k)S'^{-1}(j,k)$
Contribution to covariance matrix due to uncertainty of measured yield


The corresponding uncertainties can be written in terms of the dilution matrix $\displaystyle D(i,j)\equiv \frac{\partial B(j)}{\partial X(i)}$ :

${\displaystyle \delta _{X}^{2}B(i)=U_{B,X}(i,i)=\sum _{k}^{n_{b}}D^{2}(i,k)\delta ^{2}X(k).}$

With the unfolding procedure the unknown systematic uncertainties due to kinematic smearing are removed, but the price is that the kinematic bins are statistically correlated, therefore the statistical errors are inflated, as shown in the last equation.

##### Contribution from smearing matrix

The contribution $\displaystyle U_{B,A}$ from the smearing matrix can be propagated analytically, as

$\displaystyle U_{B,S}(i,j) = \sum_{a,b} \frac{\partial B(i)}{ \partial S(a,b) } \delta^2 S(a,b) \frac{\partial B(j)}{ \partial S(a,b) } = \sum_{a,b} S^{\prime -1}(i,a) B(b)^2 \delta^2 S(a,b) S^{\prime -1}(j,a)$

In the case of histograms and using the Poisson uncertainty assumption, $\displaystyle \delta^2 S(a,b) = S(a,b)$ , and this reduces to

                  $\displaystyle U_{B,S}(i,j) = \sum_{a,b} S^{\prime -1}(i,a) B(b)^2 S(a,b) S^{\prime -1}(j,a)$
Contribution to covariance matrix due to uncertainty of smearing matrix


This method is described in detail in Rebecca Lamb's thesis, Section 4.4.1 (pages 55-56) and Section 4.4.3 (page 59).

Historically, a numerical approach has also been used: the results are unfolded ${\displaystyle N}$ times with ${\displaystyle N}$ smearing matrices constructed by randomly varying each matrix element according to a Poisson distribution with rms equal to the rms of the unfolded results. The result spectra in each bin should have a Poisson-like shape (more ${\displaystyle N}$ is big more the shapes would be Poisson-like), and their standard deviation could be added to the statistical uncertainties. The analytic and numeric methods yield equivalent results, provided ${\displaystyle N}$ is large enough, although the analytic method is much faster to compute is is not subject to properly choosing the parameter ${\displaystyle N}$.

#### Fitting

After the unfolding procedure the corrected yield ${\displaystyle B(j)}$ are statistically correlated, and the covariance matrix ${\displaystyle U_{B}(i,j)}$ is not diagonal. This can result in a complication of the standard procedures of extracting the results.

For instance if a fit of the unfolded results ${\displaystyle B(j)}$ is needed in the least squares method the ${\displaystyle \chi ^{2}}$ should be defined with the general formula (see Least Squares Fit page):

                        ${\displaystyle \chi ^{2}=(B-f(p))^{T}U_{B}^{-1}(B-f(p))}$


where ${\displaystyle f(p)}$ is the fit function and ${\displaystyle p}$ is the ${\displaystyle n_{p}}$-dimensional parameters vector.

If ${\displaystyle f(p)}$ is linear, i.e.

${\displaystyle f(p)={\mathbf {A}}p}$,

and

${\displaystyle \chi ^{2}=(B-{\mathbf {A}}p)^{T}U_{B}^{-1}(B-{\mathbf {A}}p)}$,

a linear regression can be performed (see The linear regression) to extract the vector of the parameters:

                         ${\displaystyle p=({\mathbf {A}}^{T}U_{B}^{-1}{\mathbf {A}})^{-1}{\mathbf {A}}^{T}U_{B}^{-1}B}$


and its covariance matrix:

                         ${\displaystyle C_{p}=({\mathbf {A}}^{T}U_{B}^{-1}{\mathbf {A}})^{-1}}$


## Alternative procedure

Since the smearing matrix could be hard to invert, especially in a multidimensional analysis with a high number of bins, if the data in the end need to be fit, a new procedure can be used that includes both the unfolding and the fitting in one step. Starting again from the relation between the measured ${\displaystyle X(i)}$ and the Born ${\displaystyle B(j)}$ distribution:

${\displaystyle X(i)=\sum _{j=1}^{n_{b}}S'(i,j)B(j)+n(i,0)}$,

we can define the measured yields corrected for the background as ${\displaystyle X'(i)\equiv X(i)-n(i,0)}$, so that in matrix form:

${\displaystyle X'\equiv S'B}$

Using the same formalism of the previous paragraph (Fitting), we can define the ${\displaystyle \chi ^{2}}$ as follows:

                            ${\displaystyle \chi ^{2}=(X'-S'{\mathbf {A}}p)^{T}V^{-1}(X'-S'{\mathbf {A}}p)}$


where ${\displaystyle V}$ is the measured yield covariance matrix (diagonal).

So here we compare the measured data (after background subtraction) to the "smeared" fit values to get a ${\displaystyle \chi ^{2}}$. Again, minimizing the ${\displaystyle \chi ^{2}}$ we find for the parameters:

                         ${\displaystyle p=({\mathbf {A}}^{T}S'^{T}V^{-1}S'{\mathbf {A}})^{-1}{\mathbf {A}}^{T}S'^{T}V^{-1}X'}$


and its covariance matrix:

                         ${\displaystyle C_{p}=({\mathbf {A}}^{T}S'^{T}V^{-1}S'{\mathbf {A}})^{-1}}$


• Only one step is needed instead of two: unfolding and fitting are performed at the same time;
• The covariance matrix to be inverted here is the one before the unfolding, so it is diagonal;
• There is only the non-trivial inversion of the matrix
${\displaystyle {\mathbf {A}}^{T}S'^{T}V^{-1}S'{\mathbf {A}}}$

whose dimension is related to the number of parameters (${\displaystyle n_{p}}$) instead of the number of bins (${\displaystyle n_{b}}$)

Therefore this procedure is more convenient than the previous one if ${\displaystyle n_{p} (For instance in a multidimensional analysis).