PIG 8 - Datasets
Authors: Axel Donath, Christoph Deil, Regis Terrier & Atreyee Sinha
Created: Jan 4th, 2019
Withdrawn: Nov 30th, 2019
Status: withdrawn
Discussion: GH 1986
Abstract
An essential analysis feature of Gammapy will be the ability to fit models to gamma-ray data. This includes the simple use case of fitting spectral, spatial and combined models to binned datasets, but also more advanced scenarios such as joint-likelihood fitting of a model across multiple IACT observations, joint-likelihood fitting of IACT and Fermi-LAT data, combined analysis of gamma-ray data with flux points, or a combined spectral and cube analysis. The joint-likelihood approach also allows different event types to be combined in a single analysis.
For this reason we propose to introduce the abstraction layer of a Dataset in Gammapy. A dataset bundles the reduced data with a parametric source model, a background model, instrument response functions and the fit statistics function. It evaluates the model and returns the log-likelihood of the data, given the current model parameters. This approach allows us to introduce a uniform fitting interface, represented by a single Fit class, independent of the type of analysis (spectral, spatial, cube). In addition, datasets can be combined by adding their log-likelihood values and by linking or concatenating their model parameters. This enables all the combinations of joint-likelihood analyses described above.
Proposal
In general the proposal includes three basic types of classes: Dataset, Datasets and Fit. The Dataset bundles the model and data, the Datasets class combines multiple datasets for joint-likelihood fitting, and the Fit class represents the interface to the optimization methods implemented in gammapy.utils.fitting.
Here is an illustration of the proposed API:
from gammapy.utils.fitting import Dataset, Datasets, Fit

# fit a single dataset
dataset = Dataset()
dataset.stat()

fit = Fit(dataset)
fit.optimize()

# joint fit of multiple datasets
datasets = Datasets([Dataset(), Dataset()])
fit = Fit(datasets)
fit.optimize()

# or for convenience, pass a plain list of datasets
datasets = [Dataset(), Dataset()]
fit = Fit(datasets)
fit.optimize()
To enable the different analysis use cases we propose to introduce the MapDataset, SpectrumDataset, SpectrumDatasetOnOff and FluxPointsDataset classes.
Dataset
The Dataset class is the abstract base class for all datasets. It defines the minimal interface required for a dataset to work with the Fit class, and the basic user API. It serves as a base class for all built-in datasets, as well as for user-defined ones:
from gammapy.utils.fitting import Dataset


class UserDataset(Dataset):
    def __init__(self, model, data, mask_fit=None, mask_safe=None):
        self.model = model
        self.data = data
        self.parameters = model.parameters
        self.mask_safe = mask_safe
        self.mask_fit = mask_fit

    def stat_per_bin(self):
        # simple least-squares statistic as an example
        return (self.model.evaluate() - self.data) ** 2
Any dataset derived from the base class has to define a .stat_per_bin() method, which returns the fit statistic per bin given the current model parameters, and a .parameters attribute, which defines the parameters passed on to the Fit class.
The dataset also (optionally) defines two kinds of masks: mask_safe and mask_fit. The main idea is that the safe data range defined by mask_safe is set in advance, e.g. during data reduction, and not modified later, while mask_fit can be modified by users to manually define the fit range, or by algorithms, e.g. during flux point computation. For likelihood evaluation both masks are combined by the Dataset base class.
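As an illustration, the combination of the two masks in the base class could look like the following minimal sketch (a hypothetical implementation, using the method and attribute names introduced above):
import numpy as np

class Dataset:
    # illustration only: combine both masks when summing the fit statistic
    def stat(self):
        stat = self.stat_per_bin()
        if self.mask_safe is not None and self.mask_fit is not None:
            mask = self.mask_safe & self.mask_fit
        elif self.mask_safe is not None:
            mask = self.mask_safe
        elif self.mask_fit is not None:
            mask = self.mask_fit
        else:
            mask = None
        # sum the per-bin statistic over the selected bins only
        return np.sum(stat[mask]) if mask is not None else np.sum(stat)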
Datasets
To combine multiple datasets in a joint-likelihood analysis we propose to introduce a Datasets class. Its responsibility is to add up the log-likelihood values of the individual datasets and to join their parameter lists. The Datasets class also represents the main interface to the Fit class.
The following example shows how a joint-likelihood analysis of multiple observations is supposed to work:
from gammapy.utils.fitting import Datasets, Fit
from gammapy.cube import MapDataset, SkyModel, BackgroundModel

model = SkyModel.read("model.yaml")

map_datasets = []
for obs_id in [123, 124, ...]:
    # a separate background model per observation
    bkg_model = BackgroundModel.read("obs-{}/background.fits".format(obs_id))
    dataset = MapDataset(model=model, background_model=bkg_model)
    map_datasets.append(dataset)

datasets = Datasets(map_datasets)
datasets.stat()

fit = Fit(datasets)
The linking of the spectral model parameters is achieved natively, as the same model instance is passed to the different datasets, while a separate background model is created for every dataset.
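A minimal sketch of what the Datasets class could look like under these assumptions (hypothetical implementation; the actual API may differ):
class Datasets:
    def __init__(self, datasets):
        self.datasets = datasets

    @property
    def parameters(self):
        # join the parameter lists; parameters of shared model instances
        # appear only once, which effectively links them in the fit
        unique = []
        for dataset in self.datasets:
            for par in dataset.parameters:
                if par not in unique:
                    unique.append(par)
        return unique

    def stat(self):
        # the joint fit statistic is the sum over the individual datasets
        return sum(dataset.stat() for dataset in self.datasets)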
A joint spectral fit across multiple observations:
model = SpectralModel()
spectrum_dataset_1 = SpectrumDataset(model, ...)
spectrum_dataset_2 = SpectrumDataset(model, ...)
datasets = Datasets([spectrum_dataset_1, spectrum_dataset_2])
fit = Fit(datasets)
fit.optimize()
Combined spectral / flux points analysis:
model = SpectralModel()
spectrum_dataset = SpectrumDataset(model=model)

flux_points = FluxPoints.read("flux_points.ecsv")
flux_point_dataset = FluxPointsDataset(flux_points=flux_points, model=model, stat="chi2")

datasets = Datasets([flux_point_dataset, spectrum_dataset])
fit = Fit(datasets)
fit.optimize()
MapDataset
To enable the standard combined spectral and spatial analysis we propose to introduce a MapDataset class. A MapDataset bundles the counts data, source model, IRFs and background model corresponding to a given event selection.
It is supposed to work as follows:
from gammapy.cube import MapDataset, SkyModel, BackgroundModel, EDispMap, PSFMap
from gammapy.maps import Map
from gammapy.utils.fitting import Fit

model = SkyModel()
background_model = BackgroundModel()

counts = Map.read("counts.fits")
exposure = Map.read("exposure.fits")
edisp = EDispMap.read("edisp.fits")
psf = PSFMap.read("psf.fits")

dataset = MapDataset(
    counts=counts,
    exposure=exposure,
    edisp=edisp,
    psf=psf,
    model=model,
    background_model=background_model,
    mask_fit=None,
    mask_safe=None,
    stat="cash",
)

fit = Fit(dataset)
fit.optimize()
For the psf argument three values are possible: None, PSFKernel() or PSFMap(). In the first case the PSF convolution is skipped, in the second case a single PSF is assumed for the whole map, and in the last case a spatially varying PSF is handled. The same holds for the edisp argument.
In the proposed implementation the counts map and background model must be defined on the same map geometry. A separate setup step in MapDataset.__init__ creates cutouts of the exposure map for the individual model components, assigns the corresponding IRFs to each model component, and bundles everything into the already existing MapEvaluator object. The list of model evaluators is cached on the MapDataset and used later to efficiently compute the predicted number of counts.
The stat argument allows choosing between built-in fit statistics such as cash or cstat for convenience.
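For illustration, the cached evaluators could be used along these lines to compute the predicted counts (a sketch; the names _evaluators, compute_npred and background_model.evaluate are assumptions):
def npred(self):
    # start from the predicted background counts
    npred_total = self.background_model.evaluate()
    for evaluator in self._evaluators:
        # each evaluator folds one model component with its assigned IRFs
        # on a cutout geometry; in practice the result is added back at
        # the corresponding position in the full map
        npred_total += evaluator.compute_npred()
    return npred_total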
MapDatasetOnOff
For on-off based analyses a MapDatasetOnOff will be introduced. It inherits most of its functionality from MapDataset, but in addition implements the handling of the counts_off, acceptance and acceptance_off information:
from gammapy.cube import MapDatasetOnOff, SkyModel, BackgroundModel, EDispMap, PSFMap
from gammapy.maps import Map
from gammapy.utils.fitting import Fit

model = SkyModel()
background_model = BackgroundModel()

counts = Map.read("counts.fits")
counts_off = Map.read("counts_off.fits")
acceptance = Map.read("acceptance.fits")
acceptance_off = Map.read("acceptance_off.fits")
exposure = Map.read("exposure.fits")
edisp = EDispMap.read("edisp.fits")
psf = PSFMap.read("psf.fits")

dataset_onoff = MapDatasetOnOff(
    model=model,
    background_model=background_model,
    counts=counts,
    counts_off=counts_off,
    acceptance=acceptance,
    acceptance_off=acceptance_off,
    exposure=exposure,
    edisp=edisp,
    psf=psf,
    mask_fit=None,
    mask_safe=None,
    stat="wstat",
)

fit = Fit(dataset_onoff)
fit.optimize()
The background_model can optionally be defined, for simulation or for fitting it to the counts_off data.
The MapDatasetOnOff additionally implements the following derived quantities needed for the evaluation of the fit statistic:
dataset = MapDatasetOnOff()
dataset.alpha # defined by acceptance / acceptance_off
dataset.background # defined by alpha * counts_off
dataset.npred_sig() # defined by npred - background
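With stat="wstat" the per-bin fit statistic could then be evaluated roughly as follows (a sketch, assuming the wstat function available in gammapy.stats and the quantities defined above):
from gammapy.stats import wstat

def stat_per_bin(self):
    # wstat profiles the background from the on / off measurement
    return wstat(
        n_on=self.counts.data,
        n_off=self.counts_off.data,
        alpha=self.alpha.data,
        mu_sig=self.npred_sig().data,
    )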
SpectrumDataset
For spectral analysis we propose to introduce a SpectrumDataset:
from gammapy.spectrum import SpectrumDataset

model = SpectralModel()
background_model = SpectralBackgroundModel.read()

edisp = EnergyDispersion.read()
counts = CountsSpectrum.read()
exposure = CountsSpectrum.read()

dataset = SpectrumDataset(
    counts=counts,
    exposure=exposure,
    model=model,
    background_model=background_model,
    mask_fit=None,
    mask_safe=None,
    edisp=edisp,
    stat="cash",
)

dataset.npred()
dataset.stat_per_bin()
dataset.stat()
SpectrumDatasetOnOff
For on-off based spectral analysis we propose to introduce a SpectrumDatasetOnOff class, which again inherits from SpectrumDataset and handles the additional information accordingly:
from gammapy.spectrum import SpectrumDatasetOnOff

model = SpectralModel()

edisp = EnergyDispersion.read()
counts = CountsSpectrum.read()
counts_off = CountsSpectrum.read()
exposure = EffectiveAreaTable.read()
acceptance = CountsSpectrum.read()
acceptance_off = CountsSpectrum.read()

dataset = SpectrumDatasetOnOff(
    counts=counts,
    counts_off=counts_off,
    model=model,
    exposure=exposure,
    acceptance=acceptance,
    acceptance_off=acceptance_off,
    edisp=edisp,
    stat="wstat",
)

dataset.npred()
dataset.stat_per_bin()
dataset.stat()
We propose to refactor the existing SpectrumObservation object into the SpectrumDatasetOnOff class.
FluxPointsDataset
For fitting of flux points we propose to introduce a FluxPointsDataset:
from gammapy.spectrum import FluxPointsDataset

flux_points = FluxPoints.read("flux_points.ecsv")
model = PowerLaw()

dataset = FluxPointsDataset(
    flux_points=flux_points,
    model=model,
    mask_safe=None,
    mask_fit=None,
    stat="chi2",
)

fit = Fit(dataset)
fit.optimize()
The FluxPoints class also supports the likelihood SED format, which has to be handled in a special way by the FluxPointsDataset. The likelihood format stores the likelihood of each flux point as a function of energy and amplitude. Given the flux predicted by the spectral model, the likelihood can be interpolated directly. This could be supported by a specific option FluxPointsDataset(stat="likelihood") or similar.
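A sketch of how this interpolation could work for a single flux point; the column names norm_scan and dloglike_scan follow the SED likelihood format in gamma-astro-data-formats, while the helper function itself is hypothetical:
from scipy.interpolate import interp1d

def stat_from_profile(norm_scan, dloglike_scan, norm_predicted):
    # interpolate the stored likelihood profile at the norm predicted by
    # the spectral model (model flux divided by the reference flux)
    profile = interp1d(norm_scan, dloglike_scan, kind="quadratic")
    return profile(norm_predicted)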
Dataset helper / convenience methods
We propose to implement a few convenience methods on all datasets, e.g. a .residuals() method, which computes the residuals for the given data and the current state of the model:
dataset.residuals(method="diff")
A selection of methods for the residual computation would be available, such as data - model, (data - model) / model and (data - model) / sqrt(model).
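A minimal sketch of such a method for a counts-based dataset, assuming the .counts data and the .npred() method shown in the examples above:
import numpy as np

def residuals(self, method="diff"):
    # residuals between the data and the current model prediction
    residuals = self.counts - self.npred()
    if method == "diff":
        return residuals
    elif method == "diff/model":
        return residuals / self.npred()
    elif method == "diff/sqrt(model)":
        return residuals / np.sqrt(self.npred())
    else:
        raise ValueError("Invalid method: {!r}".format(method))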
For counts-based datasets an .excess() method can be implemented.
All dataset objects should implement a __str__ method, so that the following works:
print(dataset)
Specific to each dataset, information on the model, model parameters, data and fit statistics is printed.
We propose to implement a .peek() method on all datasets:
dataset.peek()
In an interactive environment this will show a default visualisation of the dataset including model, data, IRFs and residuals.
Simulation of MapDataset and SpectrumDataset
The MapDataset and SpectrumDataset classes will be the central classes for map and spectrum based analyses. In many cases it is useful to simulate counts from a predicted number of counts model. Direct sampling of binned data for the datasets could be supported as follows (illustrated for the MapDataset):
# allow counts=None on init
dataset = MapDataset(counts=None)
# to sample a counts map from npred
dataset.counts = dataset.fake(random_seed=0)
fit = Fit(dataset)
fit.optimize()
The .fake() method calls np.random.poisson() internally on the npred map and returns the resulting map. This works equivalently for the SpectrumDataset, SpectrumDatasetOnOff and MapDatasetOnOff. The on-off datasets require a background model for this to work.
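A sketch of what .fake() could do internally, assuming an .npred() method and the random_seed argument from the example above:
import numpy as np

def fake(self, random_seed=0):
    # sample a counts map from the predicted counts using Poisson statistics
    rng = np.random.RandomState(random_seed)
    npred = self.npred()
    npred.data = rng.poisson(npred.data)
    return npred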
Unbinned simulation (sampling of event lists) is addressed in a separate PIG.
Dataset serialization
For convenience all dataset classes should support serialization, implemented via .read() and .write() methods. For now we only consider serialization of the data of the datasets and not of the model, which might always stay separate. As the dataset has to orchestrate the serialization of multiple objects, such as different kinds of maps, flux points etc., one option is to introduce serialization with a YAML-based index file:
dataset = MapDataset.read("dataset.yaml")
dataset.write("dataset.yaml")
The index file points to the various files needed for initialization of the dataset. Here is an example:
dataset:
    type: map-dataset
    counts: "obs-123/counts.fits"
    exposure: "obs-123/exposure.fits"
    edisp: "obs-123/edisp.fits"
    psf: "obs-123/psf.fits"
    background-model: "obs-123/background.fits"
    model: "model.yaml"  # optional
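A sketch of how MapDataset.read() could consume such an index file; the key names follow the example above, while the helper name and the exact import locations are assumptions:
import yaml
from gammapy.maps import Map
from gammapy.cube import MapDataset, EDispMap, PSFMap, BackgroundModel

def read_map_dataset(filename):
    with open(filename) as fh:
        spec = yaml.safe_load(fh)["dataset"]
    # each entry points to a file holding one component of the dataset
    return MapDataset(
        counts=Map.read(spec["counts"]),
        exposure=Map.read(spec["exposure"]),
        edisp=EDispMap.read(spec["edisp"]),
        psf=PSFMap.read(spec["psf"]),
        background_model=BackgroundModel.read(spec["background-model"]),
    )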
Additionally we propose to introduce a single FITS file serialization for writing / reading datasets to and from disk:
dataset = MapDataset.read("dataset.fits")
dataset.write("dataset.fits")
The SpectrumDatasetOnOff in addition implements serialisation to the standard OGIP format described on the gamma-astro-data-formats PHA page.
The Datasets object could be serialized equivalently as a list of datasets:
- dataset:
    type: spectrum-dataset
    maps:
        counts: "obs-123/counts.fits"
        exposure: "obs-123/exposure.fits"
        edisp: "obs-123/edisp.fits"
    model: "model.yaml"
    background-model: "obs-123/background.fits"
    stat: "wstat"
- dataset:
    type: spectrum-dataset
    maps:
        counts: "obs-124/counts.fits"
        exposure: "obs-124/exposure.fits"
        edisp: "obs-124/edisp.fits"
    model: "model.yaml"
    background-model: "obs-124/background.fits"
    stat: "wstat"
- dataset:
    type: flux-point-dataset
    flux-points: "fermipy-flux-points.fits"
    model: "spectral-model.yaml"
    stat: "template"
Task List
This is a proposal for a list of pull requests implementing the proposed changes, ordered by priority:
1. Refactor the current FluxPointFit into a FluxPointsDataset class and make it work with the current Fit class. Ensure that the Fit class supports the old inheritance scheme as well as the new dataset on init. Update tests and tutorials (#2023).
2. Refactor the current MapFit into the MapDataset class. Only support a fixed energy dispersion and PSF first. Use the current MapEvaluator for model evaluation, but split out the background evaluation. Update tests and tutorials (#2026).
3. Implement the Datasets class in gammapy.utils.fitting.datasets, add tests using multiple MapDataset. Change the Fit interface to take a Datasets object or a list of MapDataset on input (#2030).
4. Enable joint-likelihood analyses by filtering the Datasets.parameters for unique parameters only. Add a tutorial for joint-likelihood fitting of multiple observations (#2045).
5. Implement the MapDataset.setup() method, using a list of MapEvaluator objects. Add an .evaluation_radius() attribute to all the spatial models. Only support a fixed PSF and edisp per model component (#2071).
6. Add support for PSF maps to MapDataset.setup(). Extend the .setup() method to look up the correct PSF for the given model component. A PSFMap has to be passed on MapDataset.__init__() (v0.14).
7. Add support for energy dispersion maps to MapDataset. Extend the .setup() method to look up the correct EDisp for a given model component. An EDispMap has to be passed on MapDataset.__init__() (v0.14).
8. Add a tutorial for joint-likelihood fitting of IACT and Fermi-LAT data, based on the joint-crab dataset (v0.14).
9. Add a name attribute to datasets and a Datasets.__getitem__ method (v0.14).
10. Implement dataset serialization to YAML (v0.14).
11. Implement datasets serialization to a single FITS file (#2264).
12. Add the direct likelihood evaluation to the FluxPointsDataset, by interpolating the flux point likelihood profiles (v0.14).
13. Implement MapDataset.fake(), SpectrumDataset.fake() and SpectrumDatasetOnOff.fake() methods (v0.14).
14. Implement the convenience helper methods .residuals(), .peek() and .__str__ for all datasets (v0.14).
Outlook
Parallel evaluation of datasets
For efficient joint-likelihood fits of multiple observations, parallel processing should be used. The obvious entry point is to evaluate one dataset per process and join the likelihoods at the end. The current handling of model references (the same model instance is passed to different datasets to achieve a generic parameter linking by updating the model parameters in place), sets a limitation on the parallel evaluation, because Python objects can’t easily be shared between multiple processes. We are aware of this issue, but propose to solve it later, because it does not affect the proposed API for datasets.
Lazy loading of Datasets
For the use case of inspecting or stacking individual datasets (e.g. one MapDataset per observation) it could be advantageous to implement a lazy-loading or generator interface for Datasets.read(), such that the individual datasets are only read from disk on request and are not all loaded into memory when calling .read(). We leave this question unaddressed in this PIG, but mention it for completeness.
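A possible generator-based sketch (hypothetical helper, not part of the proposed API):
from gammapy.cube import MapDataset

def iter_datasets(filenames):
    # read each dataset from disk only when it is requested
    for filename in filenames:
        yield MapDataset.read(filename)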
Decision
The authors decided to withdraw the PIG. Most of the proposed changes are implemented.