.. include:: ../../references.txt .. _pig-008: **************** PIG 8 - Datasets **************** * Author: Axel Donath, Christoph Deil, Regis Terrier & Atreyee Sinha * Created: Jan 4th, 2019 * Withdrawn: Nov 30th, 2019 * Status: withdrawn * Discussion: `GH 1986`_ Abstract ======== An essential analysis feature of Gammapy will be the possibility to fit models to gamma-ray data. This includes the simple use-case of fitting spectral, spatial and combined models to binned datasets, but also includes more advanced scenarios such as joint-likelihood fitting of a model across multiple IACT observations, joint-likelihood fitting of IACT and Fermi-LAT data, combined analysis of gamma-ray data with flux points or a combined spectral and cube analysis. The joint-likelihood approach also allows for combining different event types in a single analysis. For this reason we propose to introduce the abstraction layer of a ``Dataset`` in Gammapy. A dataset bundles the reduced data with a parametric source model, background model, instrument response functions and the fit statistics function. It evaluates the model and returns the log-likelihood of the data, given the current model parameters. This approach allows to introduce a uniform fitting interface represented by a single ``Fit`` class, independent of the type of the analysis (spectral, spatial, cube). In addition Datasets can be combined by adding their log-likelihood values, linking and concatenating their model parameters. This enables all combinations of a joint-likelihood analyses described above. Proposal ======== In general the proposal includes three basic types of classes: ``Dataset``, ``Datasets`` and ``Fit ``. The ``Dataset`` bundles the model and data, the ``Datasets`` class combines multiple datasets for joint-likelhood fitting and the ``Fit`` class represents the interface to the optimization methods, which are implemented in ``gammapy.utils.fitting``. Here is an illustration of the proposed API: .. code:: from gammapy.utils.fitting import Dataset, Datasets, Fit dataset = Dataset() dataset.stat() fit = Fit(dataset) fit.optimize() datasets = Datasets([Dataset(), Dataset()]) fit = Fit(datasets) fit.optimize() # or for convenience datasets = [Dataset(), Dataset()] fit = Fit(datasets) fit.optimize() To enabled the different analysis use-cases we propose to introduce a ``MapDataset``, ``SpectrumDataset``, ``SpectrumDatasetOnOff`` and ``FluxPointsDataset``. Dataset ------- The ``Dataset`` class is the abstract base class for all datasets. It defines the minimal required interface for a dataset to work with the ``Fit`` class and basic user API. It serves as a base class for all built-in datasets, as well as for user defined ones: .. code:: from gammapy.utils.fitting import Dataset class UserDataset(Dataset): def __init__(self, model, data, mask_fit=None, mask_safe=None): self.model = model self.data = data self.parameters = model.parameters self.mask_safe = None self.mask_fit = None def stat_per_bin(): return (self.model.evaluate() - self.data) ** 2 Any dataset derived from the base class has to define a ``.stat_per_bin()`` method, that returns the log-likelihood given the current model parameters and a ``.parameters`` attribute, which defines the parameters passed on to the ``Fit`` class. The dataset also (optionally) defines two kinds of masks: the ``mask_safe`` and ``mask_fit``. The main purpose is that safe data range defined by ``mask_safe`` is set in advance e.g. during data reduction and not modified later, while the ``mask_fit`` can be modified by users to manually define the fit range or by algorithms e.g. flux point computation. For likelihood evaluation both mask are combined by the ``Dataset`` base class. Datasets -------- To combine multiple datasets in a joint-likelihood analysis we propose to introduce a ``Datasets`` class. Its responsibility is to add up the log-likelihood values of the individual datasets and join the parameter lists. The ``Datasets`` class also represents the main interface to the ``Fit`` class. The following example shows how a joint-likelihood analysis of multiple observations is supposed to work: .. code :: from gammapy.utils.fitting import Datasets from gammapy.cube import SkyModel model = SkyModel.read() map_datasets = [] for obs_id in [123, 124, ...]: bkg_model = BackgroundModel.read("obs-{}/background.fits".format(obs_id)) dataset = MapDataset(model=model, background=bkg_model) map_datasets.append(dataset) datasets = Datasets(map_datasets) datasets.stat() fit = Fit(datasets) The linking of parameters of the spectral model is natively achieved, as the same instance of the spectral model is passed to two different datasets, while the background model is created for every dataset separately. A joined spectral fit across multiple observations: .. code :: model = SpectralModel() spectrum_dataset_1 = SpectrumDataset(model, ...) spectrum_dataset_2 = SpectrumDataset(model, ...) datasets = Datasets([spectrum_dataset_1, spectrum_dataset_2]) fit = Fit(datasets) fit.optimize() Combined spectral / flux points analysis: .. code :: model = SpectralModel() spectrum_dataset = SpectrumDataset(model=model) flux_points = FluxPoints.read() flux_point_dataset = FluxPointsDataset(flux_points=flux_points, model=model, likelihood="chi2") datasets = Datasets([flux_point_dataset, spectrum_dataset]) fit = Fit(datasets) fit.optimize() MapDataset ---------- To enable the standard combined spectral and spatial analysis we propose to introduce a ``MapDataset`` class. A ``MapDataset`` bundles the counts data, source model, IRFs, background model corresponding to a given event selection. It is supposed to work as following: .. code :: from gammapy.cube import MapDataset model = SkyModel() background = BackgroundModel() counts = Map.read("counts.fits") exposure = Map.read("exposure.fits") edisp = EDispMap.read("edisp.fits") psf = PSFMap.read("psf.fits") dataset = MapDataset( counts=counts, exposure=exposure, edisp=edisp, psf=psf, model=model, background=background_model, likelihood="cash" mask_fit=None, mask_safe=None, stat="cash" ) fit = Fit(dataset) fit.optimize() For the psf argument three values are possible: ``None``, ``PSFKernel()`` or ``PSFMap``. In the first case the PSF convolution is skipped. In the second case a single PSF for the whole map is assumed. In the last case a spatially varying PSF is handled. The same holds for the ``edisp`` argument. In the proposed implementation the counts map and background model must be defined on the same map geometry. A separate setup step in ``MapDataset.__init__`` creates cutouts of the exposure map for the individual model components and assigns the corresponding IRFs to the model component and bundles all into the already existing ``MapEvaluator()`` object. The list of model evaluators is cached on the ``MapDataset`` and used later to efficiently compute the predicted number of counts. The ``stat`` argument allows to choose between built-in fit statistics such as ``cash`` or ``cstat`` for convenience. MapDatasetOnOff --------------- For on-off based analyses a ``MapDatasetOnOff`` will be introduced. It inherits most of its functionality from the ``MapDataset``, but implements in addition the handling of ``counts_off`` and ``acceptance`` and ``acceptance_off`` information: .. code :: from gammapy.cube import MapDatasetOnOff model = SkyModel() background_model = BackgroundModel() counts = Map.read("counts.fits") counts_off = Map.read("counts_off.fits") acceptance = Map.read("acceptance.fits") acceptance_off = Map.read("acceptance_off.fits") exposure = Map.read("exposure.fits") edisp = EdispMap.read("edisp.fits") psf = PSFMap.read("psf.fits") dataset_onoff = MapDatasetOnOff( model=model, background_model=background_model, counts=counts, counts_off=counts_off, acceptance=acceptance, aceptance_off=acceptance_off, exposure=exposure, edisp=edisp, psf=psf, mask_fit=mask_fit, mask_safe=mask_safe, stat="wstat" ) fit = Fit(dataset) fit.optimize() The ``background_model`` can be defined optionally for simulation or fitting to the ``counts_off`` data. The ``MapDatasetOnOff`` additionally implements the following methods needed for the evaluation of the fit statistic: .. code:: dataset = MapDatasetOnOff() dataset.alpha # defined by acceptance / acceptance_off dataset.background # defined by alpha * counts_off dataset.npred_sig() # defined by npred - background SpectrumDataset --------------- For spectral analysis we propose to introduce a ``SpectrumDataset``: .. code :: from gammapy.spectrum import SpectrumDataset model = SpectralModel() background_model = SpectralBackgroundModel.read() edisp = EnergyDispersion.read() counts = CountsSpectrum.read() exposure = CountsSpectrum.read() dataset = SpectrumDataset( counts=counts, exposure=exposure, model=model, background_model=background_model, mask_fit=mask_fit, mask_safe=mask_safe, edisp=edisp, stat="cash" ) dataset.npred() dataset.stat_per_bin() dataset.stat() SpectrumDatasetOnOff -------------------- For on-off based spectral analysis we propose to introduce a ``SpectrumDatasetOnOff`` class which again inherits from ``SpectrumDataset`` and handles the additional information accordingly: .. code :: from gammapy.spectrum import SpectrumDatasetOnOff model = SpectralModel() edisp = EnergyDispersion.read() counts = CountsSpectrum.read() counts_off = CountsSpectrum.read() aeff = EffectiveAreaTable.read() dataset = SpectrumDatasetOnOff( counts=counts, counts_off=counts_off, model=model, exposure=exposure, acceptance=acceptance, acceptance_off=acceptance_off, edisp=edisp, stat="wstat", ) dataset.npred() dataset.stat_per_bin() dataset.stat() We propose to refactor the existing ``SpectrumObservation`` object into the ``SpectrumDatasetsOnOff`` class. FluxPointsDataset ----------------- For fitting of flux points we propose to introduce a ``FluxPointsDataset``: .. code:: from gammapy.spectrum import FluxPointsDataset flux_points = FluxPoints.read("flux_points.ecsv") model = PowerLaw() dataset = FluxPointsDataset( flux_points=flux_points, model=model, mask_safe=mask_safe mask_fit=mask_fit, stat="chi2" ) fit = Fit(dataset) fit.optimize() The ``FluxPoint`` class also supports the ``likelihood`` format, which has to be implemented in a special way in the ``FluxPointsDataset``. The ``likelihood`` format stores the likelihood of the flux point, depending on energy and amplitude. Given a predicted flux by the spectral model the likelihood can be directly interpolated. This could be supported by a specific option ``FluxPointsDataset(stat="likelihood")`` or similar. Dataset helper / convenience methods ------------------------------------ We propose to implement a few convenience methods on all datasets. E.g. a ``.residuals()`` method, which computes the residual for the given data and state of the model: .. code:: dataset.residuals(method="diff") A selection of methods for the residual computation would ne available, such as ``data - model``, ``(data - model) / model`` and ``(data - model) / sqrt(model)``. For counts based datasets an ``.excess()`` method can be implmented. All dataset objects should implement a ``__str__`` method, so that the following works: .. code:: print(dataset) Specific to each dataset information on the model, model parameters, data and fit statistics is printed. We propose to implement a ``.peek()`` method on all datasets: .. code:: dataset.peek() In an interactive environment this will show a default visualisation of the dataset including model, data, IRFs and residuals. Simulation of MapDataset and SpectrumDataset -------------------------------------------- The ``MapDataset`` and ``SpectrumDataset`` classes will be the central classes for map and spectrum based analyses. In many cases it's useful to simulate counts from a predicted number of counts model. The direct sampling of binned data for the datasets could be supported as following (illustrated for the ``MapDataset``): .. code:: # allow counts=None on init dataset = MapDataset(counts=None) # to sample a counts map from npred dataset.counts = dataset.fake(random_seed=0) fit = Fit(dataset) fit.optimize() Which calls ``np.random.poisson()`` internally on the npred map and returns the resulting map. This works equivalently for the ``SpectrumDataset``, ``SpectrumDatasetOnOff`` and ``MapDatasetOnOff``. The on-off dataset require a background model for this to work. Unbinned simulation (sampling of event lists) is addressed in a separate PIG. Dataset serialization --------------------- For convenience all dataset classes should support serialization, implemented via ``.read()`` and ``.write()`` methods. For now we only consider the serialization of the data of the datasets and not the of the model, which might always stay separate. As the dataset has to orchestrate the serialization of multiple objects, such as different kind of maps, flux-points etc. one option is to introduce the serialization with a YAML based index file: .. code :: dataset = MapDataset.read("dataset.yaml") dataset.write("dataset.yaml") Where the index file points to the various files needed for initialization of the dataset. Here is an example: .. code:: dataset: type: map-dataset counts: "obs-123/counts.fits" exposure: "obs-123/exposure.fits" edisp: "obs-123/edisp.fits" psf: "obs-123/psf.fits" background-model: "obs-123/background.fits" model: "model.yaml" # optionally Addtionally we propose to introduce a single FITS file serializiaton for writing / reading datasets to and from disk: .. code:: dataset = MapDataset.read("dataset.fits") dataset.write("dataset.fits") The ``SpectrumDatasetOnOff`` in addition implements serialisation to the standard OGIP format described on the `gamma-astro-data-formats PHA`_ page. The ``Datasets`` object could be serialized equivalently as a list of datasets: .. code:: - dataset: type: spectrum-dataset maps: counts: "obs-123/counts.fits" exposure: "obs-123/exposure.fits" edisp: "obs-123/edisp.fits" model: "model.yaml" background-model: "obs-123/background.fits" likelihood: "wstat" - dataset: type: spectrum-dataset maps: counts: "obs-124/counts.fits" exposure: "obs-124/exposure.fits" edisp: "obs-124/edisp.fits" model: "model.yaml" background-model: "obs-124/background.fits" likelihood: "wstat" - dataset: type: flux-point-dataset flux-points : "fermipy-flux-points.fits" model: "spectral-model.yaml" likelihood: "template" Task List ========= This is a proposal for a list of pull requests implementing the proposed changes, ordered by priority: 1. Refactor the current ``FluxPointFit`` into a ``FluxPointsDataset`` class and make it work with the current ``Fit`` class. Ensure that the ``Fit`` class supports the old inheritance scheme as well as the new dataset on init. Update tests and tutorials (`#2023 `_). 2. Refactor the current ``MapFit`` into the ``MapDataset`` class. Only support a fixed energy dispersion and psf first. Use the current ``MapEvaluator`` for model evaluation, but split out the background evaluation. Update test and tutorials (`#2026 `_) 3. Implement the ``Datasets`` class in ``gammapy.utils.fitting.datasets``, add tests using multiple ``MapDataset``. Change the ``Fit`` interface to take a ``Datasets`` object or list of ``MapDataset`` on input (`#2030 `_). 4. Enable joint-likelihood analyses by filtering the ``Datasets.parameters`` for unique parameters only. Add a tutorial for joint-likelihood fitting of multiple observations (`#2045 `_). 5. Implement ``MapDataset.setup()`` method, using a list of ``MapEvaluator`` objects. Add an ``.evaluation_radius()`` attribute to all the spatial models. Only support a fixed PSF and edisp per model component (`#2071 `_). 6. Add support for psf maps to ``MapDataset.setup()``. Extend the ``.setup()`` method to look up the correct PSF for the given model component. A ``PSFMap`` has to be passed on ``MapDataset.__init__()`` (**v0.14**). 7. Add support for energy dispersion maps to ``MapDataset``. Extend the ``.setup()`` method to look up the correct EDisp for a given model component. An ``EdispMap`` has to be passed on ``MapDataset.__init__()`` (**v0.14**). 8. Add tutorial for joint-likelihood fitting of IACT and Fermi-LAT data, based on the joint-crab dataset (**v0.14**). 9. Add a ``name`` attribute to datasets and a ``Datasets.__getitem__`` method (**v0.14**). 10. Implement dataset serialization to yaml (**v0.14**). 11. Implement datasets serialization to a single FITS file (`#2264 `_). 12. Add the direct likelihood evaluation to the ``FluxPointsDataset``, by interpolating flux point likelihood profiles (**v0.14**). 13. Implement ``MapDataset.fake()``, ``SpectrumDataset.fake()`` and ``SpectrumDatasetOnOff.fake()`` methods (**v0.14**). 14. Implement convenience helper methods ``.residuals()``, ``.peek()`` and ``.__str__`` for all datasets (**v0.14**). Outlook ======= Parallel evaluation of datasets ------------------------------- For efficient joint-likelihood fits of multiple observations, parallel processing should be used. The obvious entry point is to evaluate one dataset per process and join the likelihoods at the end. The current handling of model references (the same model instance is passed to different datasets to achieve a generic parameter linking by updating the model parameters in place), sets a limitation on the parallel evaluation, because Python objects can't easily be shared between multiple processes. We are aware of this issue, but propose to solve it later, because it does not affect the proposed API for datasets. Lazy loading of Datasets ------------------------ For the use-case of inspecting or stacking individual datasets (e.g. MapDataset per observation) it could be advantegeous to implement a lazy-loading or generator interface for ``Datasets.read()``. Such that the individual datasets are only read from disk on requests and are not loaded in memory when calling ``.read()``. We leave this question un-addressed in this PR but mention it for completeness. Decision ======== The authors decided to withdraw the PIG. Most of the proposed changes are implemented. .. _gammapy: https://github.com/gammapy/gammapy .. _gammapy-web: https://github.com/gammapy/gammapy-webpage .. _`gamma-astro-data-formats PHA`: https://gamma-astro-data-formats.readthedocs.io/en/latest/spectra/ogip/index.html .. _`GH 1986`: https://github.com/gammapy/gammapy/pull/1986