PIG 8 - Datasets

  • Author: Axel Donath, Christoph Deil, Regis Terrier & Atreyee Sinha

  • Created: Jan 4th, 2019

  • Withdrawn: Nov 30th, 2019

  • Status: withdrawn

  • Discussion: GH 1986

Abstract

An essential analysis feature of Gammapy will be the possibility to fit models to gamma-ray data. This includes the simple use-case of fitting spectral, spatial and combined models to binned datasets, but also includes more advanced scenarios such as joint-likelihood fitting of a model across multiple IACT observations, joint-likelihood fitting of IACT and Fermi-LAT data, combined analysis of gamma-ray data with flux points or a combined spectral and cube analysis. The joint-likelihood approach also allows for combining different event types in a single analysis.

For this reason we propose to introduce the abstraction layer of a Dataset in Gammapy. A dataset bundles the reduced data with a parametric source model, background model, instrument response functions and the fit statistics function. It evaluates the model and returns the log-likelihood of the data, given the current model parameters. This approach makes it possible to introduce a uniform fitting interface, represented by a single Fit class, independent of the type of analysis (spectral, spatial, cube). In addition, datasets can be combined by adding their log-likelihood values and by linking and concatenating their model parameters. This enables all the joint-likelihood analysis combinations described above.

Proposal

In general the proposal includes three basic types of classes: Dataset, Datasets and Fit. The Dataset bundles the model and data, the Datasets class combines multiple datasets for joint-likelihood fitting, and the Fit class represents the interface to the optimization methods, which are implemented in gammapy.utils.fitting. Here is an illustration of the proposed API:

from gammapy.utils.fitting import Dataset, Datasets, Fit

# a single dataset: evaluate the fit statistic and fit it
dataset = Dataset()

dataset.stat()

fit = Fit(dataset)
fit.optimize()

# joint fit of multiple datasets
datasets = Datasets([Dataset(), Dataset()])

fit = Fit(datasets)
fit.optimize()

# or, for convenience, pass a plain list of datasets
datasets = [Dataset(), Dataset()]

fit = Fit(datasets)
fit.optimize()

To enable the different analysis use-cases we propose to introduce a MapDataset, MapDatasetOnOff, SpectrumDataset, SpectrumDatasetOnOff and FluxPointsDataset.

Dataset

The Dataset class is the abstract base class for all datasets. It defines the minimal interface required for a dataset to work with the Fit class and the basic user API. It serves as a base class for all built-in datasets, as well as for user defined ones:

from gammapy.utils.fitting import Dataset


class UserDataset(Dataset):
    def __init__(self, model, data, mask_fit=None, mask_safe=None):
        self.model = model
        self.data = data
        self.parameters = model.parameters
        self.mask_safe = mask_safe
        self.mask_fit = mask_fit

    def stat_per_bin(self):
        # fit statistic per bin, here a simple chi2-like example
        return (self.model.evaluate() - self.data) ** 2

Any dataset derived from the base class has to define a .stat_per_bin() method, which returns the fit statistic per bin given the current model parameters, and a .parameters attribute, which defines the parameters passed on to the Fit class.

The dataset also (optionally) defines two kinds of masks: mask_safe and mask_fit. The main purpose is that the safe data range defined by mask_safe is set in advance, e.g. during data reduction, and not modified later, while mask_fit can be modified by users to manually define the fit range, or by algorithms, e.g. during flux point computation. For the likelihood evaluation both masks are combined by the Dataset base class.
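
As an illustration, here is a minimal sketch of how the base class could combine the two masks when evaluating the total fit statistic. The attribute and method names follow the proposal; the implementation details are only indicative:

import numpy as np


class Dataset:
    # illustrative base class; subclasses implement stat_per_bin(), see above
    mask_safe = None
    mask_fit = None

    @property
    def mask(self):
        # combine the safe data range with the user or algorithm defined fit range
        if self.mask_safe is None:
            return self.mask_fit
        if self.mask_fit is None:
            return self.mask_safe
        return self.mask_safe & self.mask_fit

    def stat(self):
        # total fit statistic, restricted to the combined mask
        stat = self.stat_per_bin()
        if self.mask is not None:
            stat = stat[self.mask]
        return np.sum(stat)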

Datasets

To combine multiple datasets in a joint-likelihood analysis we propose to introduce a Datasets class. Its responsibility is to add up the log-likelihood values of the individual datasets and join the parameter lists. The Datasets class also represents the main interface to the Fit class.
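
A minimal sketch of this behaviour, assuming each dataset exposes a stat() method and a parameters list as described above (the names below are illustrative, not the final implementation):

class Datasets:
    # illustrative container for a joint fit across several datasets
    def __init__(self, datasets):
        self.datasets = list(datasets)

    @property
    def parameters(self):
        # join the parameter lists; parameters shared between datasets
        # (i.e. the same model instance) are kept only once, so they stay linked
        parameters = []
        for dataset in self.datasets:
            for par in dataset.parameters:
                if par not in parameters:
                    parameters.append(par)
        return parameters

    def stat(self):
        # the joint fit statistic is the sum over the individual datasets
        return sum(dataset.stat() for dataset in self.datasets)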

The following example shows how a joint-likelihood analysis of multiple observations is supposed to work:

from gammapy.utils.fitting import Datasets, Fit
from gammapy.cube import SkyModel, MapDataset, BackgroundModel

model = SkyModel.read()

map_datasets = []

for obs_id in [123, 124, ...]:
    bkg_model = BackgroundModel.read("obs-{}/background.fits".format(obs_id))
    dataset = MapDataset(model=model, background_model=bkg_model)
    map_datasets.append(dataset)

datasets = Datasets(map_datasets)
datasets.stat()

fit = Fit(datasets)

The linking of the source model parameters is naturally achieved, because the same model instance is passed to the different datasets, while a separate background model is created for every dataset.

A joint spectral fit across multiple observations:

model = SpectralModel()

spectrum_dataset_1 = SpectrumDataset(model, ...)
spectrum_dataset_2 = SpectrumDataset(model, ...)

datasets = Datasets([spectrum_dataset_1, spectrum_dataset_2])

fit = Fit(datasets)
fit.optimize()

Combined spectral / flux points analysis:

model = SpectralModel()

spectrum_dataset = SpectrumDataset(model=model)

flux_points = FluxPoints.read()
flux_point_dataset = FluxPointsDataset(flux_points=flux_points, model=model, stat="chi2")

datasets = Datasets([flux_point_dataset, spectrum_dataset])

fit = Fit(datasets)
fit.optimize()

MapDataset

To enable the standard combined spectral and spatial analysis we propose to introduce a MapDataset class. A MapDataset bundles the counts data, source model, IRFs and background model corresponding to a given event selection.

It is supposed to work as follows:

from gammapy.cube import MapDataset, SkyModel, BackgroundModel, EDispMap, PSFMap
from gammapy.maps import Map
from gammapy.utils.fitting import Fit

model = SkyModel()
background_model = BackgroundModel()

counts = Map.read("counts.fits")
exposure = Map.read("exposure.fits")
edisp = EDispMap.read("edisp.fits")
psf = PSFMap.read("psf.fits")

dataset = MapDataset(
    counts=counts,
    exposure=exposure,
    edisp=edisp,
    psf=psf,
    model=model,
    background_model=background_model,
    mask_fit=None,
    mask_safe=None,
    stat="cash",
)

fit = Fit(dataset)
fit.optimize()

For the psf argument three kinds of values are possible: None, a PSFKernel or a PSFMap. In the first case the PSF convolution is skipped, in the second case a single PSF is assumed for the whole map, and in the last case a spatially varying PSF is handled. The same holds for the edisp argument.
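
A hedged sketch of this dispatch is shown below; the get_psf_kernel() call on the PSFMap is an assumption about its interface and is used only for illustration:

def get_kernel(psf, geom, position):
    # PSFKernel and PSFMap refer to the classes mentioned in the text above
    # None: skip the PSF convolution entirely
    if psf is None:
        return None
    # PSFKernel: a single kernel, assumed valid for the whole map
    if isinstance(psf, PSFKernel):
        return psf
    # PSFMap: look up a kernel at the position of the model component
    # (assumed interface, for illustration only)
    return psf.get_psf_kernel(position=position, geom=geom)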

In the proposed implementation the counts map and background model must be defined on the same map geometry. A separate setup step in MapDataset.__init__ creates cutouts of the exposure map for the individual model components, assigns the corresponding IRFs to each component, and bundles everything into the already existing MapEvaluator object. The list of model evaluators is cached on the MapDataset and used later to efficiently compute the predicted number of counts.
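
The predicted counts computation could then loop over the cached evaluators roughly as follows. This is a simplified sketch working on plain arrays; the real implementation operates on Map objects and cutout geometries:

import numpy as np


def npred(background, evaluators):
    # background: evaluated background model as an array on the full geometry
    # evaluators: callables returning (cutout_slices, npred_cutout) per component
    npred_total = np.array(background, dtype=float)
    for evaluator in evaluators:
        cutout_slices, npred_cutout = evaluator()
        # paste the predicted counts of the component back onto the full map
        npred_total[cutout_slices] += npred_cutout
    return npred_total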

For convenience, the stat argument allows the user to choose between built-in fit statistics such as cash or cstat.
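
Internally this could be a simple lookup of the per-bin statistic function by name. The cash formula below is the standard one up to terms constant in the model parameters, and is shown only to illustrate the mechanism; Gammapy provides such functions in gammapy.stats:

import numpy as np


def cash(n_on, mu_on):
    # Cash statistic per bin (dropping terms constant in the model parameters)
    return 2 * (mu_on - n_on * np.log(mu_on))


STAT_FUNCTIONS = {"cash": cash}


def stat_per_bin(counts, npred, stat="cash"):
    # the dataset looks up the chosen fit statistic by name
    return STAT_FUNCTIONS[stat](counts, npred)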

MapDatasetOnOff

For on-off based analyses a MapDatasetOnOff will be introduced. It inherits most of its functionality from MapDataset, but in addition implements the handling of the counts_off, acceptance and acceptance_off information:

from gammapy.cube import MapDatasetOnOff

model = SkyModel()
background_model = BackgroundModel()

counts = Map.read("counts.fits")
counts_off = Map.read("counts_off.fits")

acceptance = Map.read("acceptance.fits")
acceptance_off = Map.read("acceptance_off.fits")

exposure = Map.read("exposure.fits")
edisp = EDispMap.read("edisp.fits")
psf = PSFMap.read("psf.fits")

dataset_onoff = MapDatasetOnOff(
    model=model,
    background_model=background_model,
    counts=counts,
    counts_off=counts_off,
    acceptance=acceptance,
    acceptance_off=acceptance_off,
    exposure=exposure,
    edisp=edisp,
    psf=psf,
    mask_fit=None,
    mask_safe=None,
    stat="wstat",
)

fit = Fit(dataset_onoff)
fit.optimize()

The background_model can optionally be defined, for simulation or for fitting to the counts_off data.

The MapDatasetOnOff additionally implements the following methods needed for the evaluation of the fit statistic:

dataset = MapDatasetOnOff()
dataset.alpha  # defined by acceptance / acceptance_off
dataset.background  # defined by alpha * counts_off
dataset.npred_sig()  # defined by npred - background
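
A minimal sketch of these derived quantities, written with plain arrays for illustration (the real class operates on Map objects):

class OnOffQuantities:
    # illustrative only; mirrors the definitions in the comments above
    def __init__(self, counts_off, acceptance, acceptance_off, npred):
        self.counts_off = counts_off
        self.acceptance = acceptance
        self.acceptance_off = acceptance_off
        self._npred = npred

    @property
    def alpha(self):
        # ratio of the on and off acceptances
        return self.acceptance / self.acceptance_off

    @property
    def background(self):
        # background estimate in the on region
        return self.alpha * self.counts_off

    def npred_sig(self):
        # predicted signal counts, i.e. total prediction minus background
        return self._npred - self.background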

SpectrumDataset

For spectral analysis we propose to introduce a SpectrumDataset:

from gammapy.spectrum import SpectrumDataset

model = SpectralModel()
background_model = SpectralBackgroundModel.read()
edisp = EnergyDispersion.read()

counts = CountsSpectrum.read()
exposure = CountsSpectrum.read()

dataset = SpectrumDataset(
    counts=counts,
    exposure=exposure,
    model=model,
    background_model=background_model,
    mask_fit=None,
    mask_safe=None,
    edisp=edisp,
    stat="cash"
    )

dataset.npred()
dataset.stat_per_bin()
dataset.stat()

SpectrumDatasetOnOff

For on-off based spectral analysis we propose to introduce a SpectrumDatasetOnOff class which again inherits from SpectrumDataset and handles the additional information accordingly:

from gammapy.spectrum import SpectrumDatasetOnOff

model = SpectralModel()

edisp = EnergyDispersion.read()

counts = CountsSpectrum.read()
counts_off = CountsSpectrum.read()

exposure = CountsSpectrum.read()

acceptance = CountsSpectrum.read()
acceptance_off = CountsSpectrum.read()

dataset = SpectrumDatasetOnOff(
    counts=counts,
    counts_off=counts_off,
    model=model,
    exposure=exposure,
    acceptance=acceptance,
    acceptance_off=acceptance_off,
    edisp=edisp,
    stat="wstat",
    )

dataset.npred()
dataset.stat_per_bin()
dataset.stat()

We propose to refactor the existing SpectrumObservation object into the SpectrumDatasetOnOff class.

FluxPointsDataset

For fitting of flux points we propose to introduce a FluxPointsDataset:

from gammapy.spectrum import FluxPointsDataset

flux_points = FluxPoints.read("flux_points.ecsv")
model = PowerLaw()
dataset = FluxPointsDataset(
    flux_points=flux_points,
    model=model,
    mask_safe=None,
    mask_fit=None,
    stat="chi2",
)

fit = Fit(dataset)
fit.optimize()

The FluxPoints class also supports the likelihood format, which has to be handled in a special way by the FluxPointsDataset. The likelihood format stores, for each flux point, the likelihood as a function of energy and amplitude. Given the flux predicted by the spectral model, the likelihood can be interpolated directly. This could be supported by a specific option such as FluxPointsDataset(stat="likelihood") or similar.
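
A hedged sketch of the interpolation step, assuming the likelihood format provides, for each energy bin, a grid of reference flux values and the corresponding statistic profile (the array layout here is an assumption for illustration):

import numpy as np
from scipy.interpolate import interp1d


def stat_per_bin(flux_ref, stat_profile, flux_pred):
    # flux_ref: (n_bins, n_scan) reference flux values per energy bin
    # stat_profile: (n_bins, n_scan) likelihood profile values per energy bin
    # flux_pred: (n_bins,) flux predicted by the spectral model per energy bin
    stat = []
    for flux_scan, profile, flux in zip(flux_ref, stat_profile, flux_pred):
        interp = interp1d(flux_scan, profile, kind="quadratic", fill_value="extrapolate")
        stat.append(float(interp(flux)))
    return np.array(stat)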

Dataset helper / convenience methods

We propose to implement a few convenience methods on all datasets. E.g. a .residuals() method, which computes the residual for the given data and state of the model:

dataset.residuals(method="diff")

A selection of methods for the residual computation would be available, such as data - model, (data - model) / model and (data - model) / sqrt(model).
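
A sketch of a corresponding implementation for counts based datasets; the method names are illustrative, and the real method would operate on maps or spectra and take the masks into account:

import numpy as np


def residuals(data, model, method="diff"):
    # data: measured counts, model: predicted counts
    if method == "diff":
        return data - model
    if method == "diff/model":
        return (data - model) / model
    if method == "diff/sqrt(model)":
        return (data - model) / np.sqrt(model)
    raise ValueError("Invalid method: {}".format(method))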

For counts based datasets an .excess() method can be implemented.

All dataset objects should implement a __str__ method, so that the following works:

print(dataset)

Specific to each dataset, information on the model, model parameters, data and fit statistics is printed.

We propose to implement a .peek() method on all datasets:

dataset.peek()

In an interactive environment this will show a default visualisation of the dataset including model, data, IRFs and residuals.

Simulation of MapDataset and SpectrumDataset

The MapDataset and SpectrumDataset classes will be the central classes for map and spectrum based analyses. In many cases it’s useful to simulate counts from a predicted number of counts model. The direct sampling of binned data for the datasets could be supported as follows (illustrated for the MapDataset):

# allow counts=None on init
dataset = MapDataset(counts=None)

# to sample a counts map from npred
dataset.counts = dataset.fake(random_seed=0)

fit = Fit(dataset)
fit.optimize()

This calls np.random.poisson() internally on the npred map and returns the resulting map. It works equivalently for the SpectrumDataset, SpectrumDatasetOnOff and MapDatasetOnOff. The on-off datasets require a background model for this to work.
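
A minimal sketch of the fake() logic, assuming npred is the predicted counts (shown here as a plain array):

import numpy as np


def fake(npred, random_seed=0):
    # draw Poisson counts from the predicted number of counts
    random_state = np.random.RandomState(random_seed)
    return random_state.poisson(npred)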

Unbinned simulation (sampling of event lists) is addressed in a separate PIG.

Dataset serialization

For convenience all dataset classes should support serialization, implemented via .read() and .write() methods. For now we only consider the serialization of the data of the datasets and not that of the model, which might always stay separate. As the dataset has to orchestrate the serialization of multiple objects, such as different kinds of maps, flux points etc., one option is to introduce the serialization with a YAML based index file:

dataset = MapDataset.read("dataset.yaml")
dataset.write("dataset.yaml")

The index file points to the various files needed for the initialization of the dataset. Here is an example:

dataset:
    type: map-dataset
    counts: "obs-123/counts.fits"
    exposure: "obs-123/exposure.fits"
    edisp: "obs-123/edisp.fits"
    psf: "obs-123/psf.fits"
    background-model: "obs-123/background.fits"
    model: "model.yaml" # optionally
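
A rough sketch of how such a read() method could parse the index file; the loading is simplified here (everything is read as a plain Map, and the model handling is omitted):

import yaml

from gammapy.maps import Map


def read_map_dataset(filename):
    # parse the YAML index file and load the referenced FITS files
    with open(filename) as fh:
        spec = yaml.safe_load(fh)["dataset"]

    kwargs = {}
    for key in ["counts", "exposure", "edisp", "psf", "background-model"]:
        path = spec.get(key)
        if path is not None:
            kwargs[key.replace("-", "_")] = Map.read(path)

    # a real implementation would construct and return a MapDataset here
    return kwargs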

Additionally we propose to introduce a single FITS file serialization for writing / reading datasets to and from disk:

dataset = MapDataset.read("dataset.fits")
dataset.write("dataset.fits")

The SpectrumDatasetOnOff in addition implements serialisation to the standard OGIP format described on the gamma-astro-data-formats PHA page.

The Datasets object could be serialized equivalently as a list of datasets:

- dataset:
    type: spectrum-dataset
    maps:
        counts: "obs-123/counts.fits"
        exposure: "obs-123/exposure.fits"
        edisp: "obs-123/edisp.fits"
    model: "model.yaml"
    background-model: "obs-123/background.fits"
    likelihood: "wstat"

- dataset:
    type: spectrum-dataset
    maps:
        counts: "obs-124/counts.fits"
        exposure: "obs-124/exposure.fits"
        edisp: "obs-124/edisp.fits"
    model: "model.yaml"
    background-model: "obs-124/background.fits"
    likelihood: "wstat"

- dataset:
    type: flux-point-dataset
    flux-points: "fermipy-flux-points.fits"
    model: "spectral-model.yaml"
    likelihood: "template"

Task List

This is a proposal for a list of pull requests implementing the proposed changes, ordered by priority:

1. Refactor the current FluxPointFit into a FluxPointsDataset class and make it work with the current Fit class. Ensure that the Fit class supports the old inheritance scheme as well as the new dataset on init. Update tests and tutorials (#2023).

2. Refactor the current MapFit into the MapDataset class. Only support a fixed energy dispersion and PSF first. Use the current MapEvaluator for model evaluation, but split out the background evaluation. Update tests and tutorials (#2026).

3. Implement the Datasets class in gammapy.utils.fitting.datasets, add tests using multiple MapDataset. Change the Fit interface to take a Datasets object or list of MapDataset on input (#2030).

4. Enable joint-likelihood analyses by filtering the Datasets.parameters for unique parameters only. Add a tutorial for joint-likelihood fitting of multiple observations (#2045).

5. Implement MapDataset.setup() method, using a list of MapEvaluator objects. Add an .evaluation_radius() attribute to all the spatial models. Only support a fixed PSF and edisp per model component (#2071).

6. Add support for psf maps to MapDataset.setup(). Extend the .setup() method to look up the correct PSF for the given model component. A PSFMap has to be passed on MapDataset.__init__() (v0.14).

7. Add support for energy dispersion maps to MapDataset. Extend the .setup() method to look up the correct EDisp for a given model component. An EdispMap has to be passed on MapDataset.__init__() (v0.14).

8. Add tutorial for joint-likelihood fitting of IACT and Fermi-LAT data, based on the joint-crab dataset (v0.14).

9. Add a name attribute to datasets and a Datasets.__getitem__ method (v0.14).

10. Implement dataset serialization to YAML (v0.14).

11. Implement datasets serialization to a single FITS file (#2264).

12. Add the direct likelihood evaluation to the FluxPointsDataset, by interpolating flux point likelihood profiles (v0.14).

13. Implement MapDataset.fake(), SpectrumDataset.fake() and SpectrumDatasetOnOff.fake() methods (v0.14).

14. Implement convenience helper methods .residuals(), .peek() and .__str__ for all datasets (v0.14).

Outlook

Parallel evaluation of datasets

For efficient joint-likelihood fits of multiple observations, parallel processing should be used. The obvious entry point is to evaluate one dataset per process and join the likelihoods at the end. The current handling of model references (the same model instance is passed to different datasets to achieve generic parameter linking by updating the model parameters in place) sets a limitation on the parallel evaluation, because Python objects can’t easily be shared between multiple processes. We are aware of this issue, but propose to solve it later, because it does not affect the proposed API for datasets.

Lazy loading of Datasets

For the use-case of inspecting or stacking individual datasets (e.g. a MapDataset per observation) it could be advantageous to implement a lazy-loading or generator interface for Datasets.read(), such that the individual datasets are only read from disk on request and are not loaded into memory when calling .read(). We leave this question unaddressed in this PIG, but mention it for completeness.

Decision

The authors decided to withdraw the PIG. Most of the proposed changes have been implemented.