PIG 12 - High level interface
Author: José Enrique Ruiz, Christoph Deil, Axel Donath, Regis Terrier, Lars Mohrmann
Created: Jun 6, 2019
Accepted: Aug 19, 2019
Status: accepted
Discussion: GH 2219
Abstract
The high level interface is one of the projects considered in the Gammapy roadmap for Gammapy v1.0 (see PIG 3 - Plan for dropping Python 2.7 support). It should be easy to use and allow users to do the most common analysis tasks and workflows quickly. It would be built on top of the existing Gammapy code-base, initially as a separate layer; developing it is nevertheless likely to inform improvements in code organisation throughout Gammapy.
Achieving a stable high level interface should allow us to continue improving the Gammapy code-base without breaking user-defined workflows or recipes built with this high level interface.
We propose to develop a high level interface Python API, similar to Fermipy or HAP in HESS, based on a single Analysis class communicating with a set of tool classes, and that supports config-file driven analysis of the main IACT source analysis use cases.
What we have
We have been using Click to develop a very small set of tools for an embryonic Gammapy command line interface. Among the existing tools (gammapy image, gammapy info, gammapy download, gammapy jupyter), only gammapy image can be considered as potentially needed in a data analysis process: it creates a counts image from an event-list file and an image that serves as a reference geometry. Hence, we have a code set-up in gammapy.scripts that we will not use for the moment to expose the high level interface API, but to develop a very small set of specific command line tools (e.g. to perform long-running processing tasks, see the Command line tools section below).
We have a set of Jupyter notebooks as examples of tutorials and recipes demonstrating the use of Gammapy. These notebooks are continuously tested and are one of the pillars of the user documentation. With the help of these notebooks we could check that most of the use cases are covered by the high level interface. We will have to translate most of them to use the high level interface, but we could also use them as a basis for experimental automated workflows driven by parametrized notebooks executed with papermill (see Outlook section below). Moreover, some Python scripts have recently been added to perform benchmarks of Gammapy; we could rewrite some or all of these benchmarks to use the high level interface.
We also have some high level analysis classes in the API that chain several atomic actions and provide rough estimated results for more complex processes (e.g. SpectrumAnalysisIACT, LightCurveEstimator). These classes would serve as a basis to design and prototype some of the tools of the high level interface.
Proposal
We will develop a high level interface Python API which uses a parameter-value configuration file defined by the user. This API would be used in Python scripts, notebooks or IPython sessions to perform simple, common IACT analyses.
We then see two main ways to use the high level interface API:
Within an IPython session or notebook, mostly dealing with a manager object to perform specific tasks in an interactive analysis process
In a Python script or notebook, declaring the orchestration of the tasks with a manager object for an automated process
This high level interface API is similar to what is done in Fermipy or HAP in HESS, and also includes options to save and recover session states, as well as serialization of intermediate data products and logging. It is flexible enough to allow the user to work with the API at any stage of the analysis process, not only from start to end or in an automated process.
Use cases
The use cases covered are in the scope of a single analysis and model, not parametrized variations in a multidimensional grid space of variables, and within a single region (e.g. 10 deg region with 5 sources).
The main use cases for analysis to be covered are:
3D map analysis
2D map analysis
1D spectrum analysis
Light curve estimation
Including the main methods for data reduction, modeling and fitting:
On vs on/off data reduction
Different background models
Joint vs stacked likelihood fitting
Diagnostics (residuals, significance, TS)
Spectral flux points
Making a SED may be the final part of the analysis, as many SED methods require the full model and all energy data.
Configuration file
The configuration file will be in YAML format, exposing the parameters and values needed for each of the tasks involved in the analysis process. To generate the config file, we could add a gammapy analysis config command line tool which dumps the config file with all lines commented out; users can then uncomment and fill in the parameters and values they care about. As an alternative, users could copy & paste from a config file example in the docs. We will develop a schema, validate against it and give good error messages on read.
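As a rough illustration of the schema validation mentioned above, the sketch below checks an already-parsed configuration (a plain Python dict, as a YAML loader would return) against a small hand-written schema and collects readable error messages. The schema layout, the `validate_config` helper, the selected keys and the list of logging levels are assumptions made for illustration only, not the final Gammapy schema.

```python
# Hypothetical schema: section -> key -> (expected type, allowed values or None).
SCHEMA = {
    "reduce": {
        "type": (str, {"1D", "3D"}),
        "stacked": (bool, None),
        "background": (str, {"irf", "reflected", "ring"}),
    },
    "logging": {
        "level": (str, {"debug", "info", "warning", "error"}),
    },
}


def validate_config(config, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for section, params in config.items():
        if section not in schema:
            errors.append(f"unknown section '{section}'")
            continue
        if not isinstance(params, dict):
            errors.append(f"section '{section}' must be a mapping")
            continue
        for key, value in params.items():
            if key not in schema[section]:
                errors.append(f"unknown parameter '{section}.{key}'")
                continue
            expected_type, allowed = schema[section][key]
            if not isinstance(value, expected_type):
                errors.append(
                    f"'{section}.{key}' should be of type {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
            elif allowed is not None and value not in allowed:
                errors.append(
                    f"'{section}.{key}' must be one of {sorted(allowed)}, got {value!r}"
                )
    return errors


good = {"reduce": {"type": "3D", "stacked": True}}
bad = {"reduce": {"type": "4D", "stacked": "yes"}, "fitt": {}}

print(validate_config(good))
for message in validate_config(bad):
    print(message)
```

Collecting all problems before reporting, rather than stopping at the first one, lets a user fix a config file in a single pass.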
We roughly sketch below an example of a prototype configuration file in YAML format, just to illustrate how a structured schema could expose most of the parameters and values needed in a data analysis process. The configuration file should be explicit enough for users to understand which parameters to edit in order to define a specific configuration for an analysis session or workflow, and should use units for quantities where it makes sense, e.g. “angle: 3 deg” instead of “angle: 3”. The final schema for this configuration file will be reached iteratively during the development of the high level interface and its later improvements, also taking user feedback into account.
Prototype configuration file.
analysis:
    process:
        # add options to allow either in-memory or disk-based processing
        out_folder: "."  # default is current working directory
        store_per_obs: {true, false}
    reduce:
        type: {"1D", "3D"}
        stacked: {true, false}
        background: {"irf", "reflected", "ring"}
        roi: max_offset
        exclusion: exclusion.fits
    fit:
        energy_min, energy_max
    logging:
        level: debug
grid:
    spatial: center, width, binsz
    energy: min, max, nbins
    time: min, max
    # PSF RAD and EDISP MIGRA not exposed for now
    # Per-obs energy_safe and roi_max ?
observations:
    data_store: $GAMMAPY_DATA/cta-1dc/index/gps/
    ids: 110380, 111140, 111159
    conesearch:
        ra:
        dec:
        radius:
        energy_min:
        energy_max:
        time_min:
        time_max:
model:
    # Model configuration will mostly be designed in different PIG
    sources:
        source_1:
            spectrum: powerlaw
            spatial: shell
    diffuse: gal_diffuse.fits
    background:
        IRF
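The "value plus unit" convention (e.g. "angle: 3 deg") implies that string values are split into a number and a unit when the file is read. The following is a minimal standard-library sketch of that step. The `parse_quantity` helper and the list of accepted units are hypothetical; in Gammapy this would presumably be delegated to astropy quantities rather than a hand-written parser.

```python
import re

# Assumed set of accepted units, for illustration only.
KNOWN_UNITS = {"deg", "arcmin", "TeV", "GeV", "s", "h"}

# A number (optionally with decimals and exponent) followed by a unit name.
_QUANTITY = re.compile(r"^\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)\s*([A-Za-z]+)\s*$")


def parse_quantity(text):
    """Split a string such as '3 deg' into (3.0, 'deg'), or raise ValueError."""
    match = _QUANTITY.match(str(text))
    if match is None:
        raise ValueError(f"expected '<number> <unit>' (e.g. '3 deg'), got {text!r}")
    value, unit = match.groups()
    if unit not in KNOWN_UNITS:
        raise ValueError(f"unknown unit {unit!r} in {text!r}")
    return float(value), unit


print(parse_quantity("3 deg"))
print(parse_quantity("0.02 deg"))
```

A bare number such as `3` is rejected on purpose, so that a missing unit is reported when the file is read instead of being silently interpreted.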
API design
The design of the high level interface API is driven by the use cases considered, the different tools (tasks) identified and their responsibilities, as well as the need for a main Analysis session object that internally drives and orchestrates the different tools involved, their inputs and products.
The Analysis session object will be initialized with a configuration file (see prototype configuration file above) and will be responsible for instantiating and running the different tool classes (see session workflow below). The tools are middle management agents (e.g. MapMaker, ReflectedBgEstimator, …) responsible for performing the different tasks identified in the use cases covered.
The Analysis session object will provide access to every object involved and every data structure produced during the session. Analysis method calls will produce and modify datasets (e.g. models, maps), but in between method calls advanced users can do a lot of custom processing on their own by scripting with the Gammapy Python toolbox.
The code of the Analysis class, as well as any other class that may be needed, will be placed in gammapy.scripts. This module will also contain the set of command line tools provided, where some small cleaning and refactoring may be needed (e.g. removing the gammapy image command line tool).
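To make the division of responsibilities described in this section more concrete, here is a toy sketch of a session object that instantiates a tool according to the configuration and keeps the resulting products accessible. Only the names Analysis, MapMaker and ReflectedBgEstimator come from the text above. The class bodies, the `TOOLS` registry, the `run` method and the placeholder strings returned by the stub tools are all hypothetical and do not reflect the actual Gammapy classes.

```python
class MapMaker:
    """Stub tool: stands in for 3D data reduction."""

    def __init__(self, settings):
        self.settings = settings

    def run(self, observations):
        return [f"map dataset for obs {obs_id}" for obs_id in observations]


class ReflectedBgEstimator:
    """Stub tool: stands in for 1D reflected-region data reduction."""

    def __init__(self, settings):
        self.settings = settings

    def run(self, observations):
        return [f"spectrum dataset for obs {obs_id}" for obs_id in observations]


class Analysis:
    """Session object: picks tools from the config and stores their products."""

    # Hypothetical mapping from the `reduce.type` setting to a tool class.
    TOOLS = {"3D": MapMaker, "1D": ReflectedBgEstimator}

    def __init__(self, config):
        self.config = config
        self.observations = []
        self.datasets = []

    def select_observations(self):
        self.observations = list(self.config["observations"]["ids"])

    def reduce_data(self):
        settings = self.config["reduce"]
        tool = self.TOOLS[settings["type"]](settings)
        self.datasets = tool.run(self.observations)


config = {"observations": {"ids": [110380, 111140]}, "reduce": {"type": "1D"}}
analysis = Analysis(config)
analysis.select_observations()
analysis.reduce_data()
print(analysis.datasets)
```

Because the products live as plain attributes on the session object, a user can inspect or replace `analysis.datasets` between method calls, which is the kind of custom processing described above.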
Serialisation
It will be possible to save and recover session states with their associated data products. Using a boolean parameter in the configuration file, the user could also choose to work with serialised intermediate products delivered by the tools instead of keeping them in memory. The state serialisation will be a mix of YAML (e.g. models, state) and FITS files (e.g. maps), where the delegated tools should know how to serialise and read themselves. The solution to address serialization of different datasets by the different tools is not in the scope of this PIG.
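The split between a small text file describing the state and separate files holding the bulky data products can be sketched as follows. To stay within the standard library, this sketch uses JSON as a stand-in for both YAML and FITS. The `write_state` and `read_state` helpers and the file names are hypothetical and are not part of any proposed API.

```python
import json
import tempfile
from pathlib import Path


def write_state(folder, state, maps):
    """Write `state` to state.json and each map to its own file in `folder`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    index = {}
    for name, values in maps.items():
        filename = f"{name}.json"  # a real implementation would write FITS here
        (folder / filename).write_text(json.dumps(values))
        index[name] = filename
    # The state file only references the map files by name.
    document = {"state": state, "maps": index}
    (folder / "state.json").write_text(json.dumps(document, indent=2))


def read_state(folder):
    """Inverse of `write_state`: return (state, maps)."""
    folder = Path(folder)
    document = json.loads((folder / "state.json").read_text())
    maps = {
        name: json.loads((folder / filename).read_text())
        for name, filename in document["maps"].items()
    }
    return document["state"], maps


with tempfile.TemporaryDirectory() as tmp:
    write_state(
        tmp,
        {"step": "reduce_data", "obs_ids": [110380]},
        {"counts": [[0, 1], [2, 3]]},
    )
    state, maps = read_state(tmp)

print(state["step"], maps["counts"])
```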
Session workflow
$ mkdir analysis
$ cd analysis
$ edit gammapy_analysis_config.yaml
Then the user would type ipython or jupyter notebook, or write a script with the code below.
from gammapy.analysis import Analysis
analysis = Analysis(config)
analysis.select_observations()
# Select observations using the parameters defined in the configuration file.
analysis.reduce_data() # often slow, can be hours
# If the user wants they can save all results from data reduction and re-start later.
# This stores config, datasets, ... all the analysis class state.
# analysis.write()
# analysis = Analysis.read()
analysis.optimise() # often slow, can be hours
# Again, we could write and read, do the slow things only once.
# e.g. supervisor comes in and asks about significance of some model component or whatever.
# analysis.write()
# analysis = Analysis.read()
# Since anything is accessible from the Analysis object
# many advanced use cases can be done with the Analysis API.
analysis.model("source_42").spectrum.plot()
# Should we need energy_binning for the SED points in config or only here?
sed = analysis.spectral_points("source_42", energy_binning)
Command line tools
In addition to gammapy analysis config, we will have a gammapy analysis data_reduction and a gammapy analysis optimise which perform the long processing tasks from the terminal, outside an IPython session or Jupyter notebook, using all the information from the config file and/or saved state.
gammapy analysis config: dumps a template configuration file
gammapy analysis data_reduction: performs a data reduction process
gammapy analysis optimise: performs a model fitting process
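The three subcommands listed above could be wired up roughly as in the sketch below. It uses argparse from the standard library instead of Click (which the existing Gammapy command line interface uses) so that it runs without extra dependencies. The template content, the `--config` option and the returned messages are assumptions for illustration only.

```python
import argparse

# Hypothetical template; the real one would be generated from the schema.
TEMPLATE = """\
reduce:
    type: 3D
    stacked: true
logging:
    level: debug
"""


def commented_template(template=TEMPLATE):
    """Return the template with every line commented out."""
    return "\n".join(f"# {line}" for line in template.splitlines()) + "\n"


def build_parser():
    parser = argparse.ArgumentParser(prog="gammapy")
    commands = parser.add_subparsers(dest="command", required=True)
    analysis = commands.add_parser("analysis")
    steps = analysis.add_subparsers(dest="step", required=True)
    steps.add_parser("config", help="dump a template configuration file")
    for name, text in [
        ("data_reduction", "perform a data reduction process"),
        ("optimise", "perform a model fitting process"),
    ]:
        step = steps.add_parser(name, help=text)
        # The `--config` option name is an assumption.
        step.add_argument("--config", default="gammapy_analysis_config.yaml")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.step == "config":
        return commented_template()
    # Placeholder: a real tool would build an Analysis object and run the step.
    return f"would run {args.step} using {args.config}"


print(main(["analysis", "config"]), end="")
print(main(["analysis", "optimise", "--config", "my.yaml"]))
```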
Outlook
Some of the use cases not covered by this high level interface API are the following:
Generation of simulated events and/or counts
Iterative source detection methods
Complex or memory-intensive processing of light curves
These use cases can actually be addressed using Gammapy as a Python toolbox, though in the near future some of them could be progressively incorporated into the high level interface. For example, making a lightcurve could be part of one Analysis (like it is in Fermi), or could be done at a higher level, creating Analysis instances and running them for each time bin. This exercise of pro/con is left to the PIG 6 - CTA observation handling. We could expect that, if we define a solid high level interface API, restructuring all of Gammapy to be tools-based, with good tool chains and config handling, will eventually be achieved.
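The "higher level" option for light curves, with one Analysis instance per time bin, could start from a helper like the one sketched here, which derives one configuration per bin from a base configuration. The `grid.time.min` / `grid.time.max` keys, the helper names and the use of plain floats for times are assumptions for illustration only.

```python
import copy


def time_bins(t_min, t_max, n_bins):
    """Split [t_min, t_max] into n_bins equal-width (start, stop) intervals."""
    width = (t_max - t_min) / n_bins
    return [(t_min + i * width, t_min + (i + 1) * width) for i in range(n_bins)]


def configs_per_time_bin(config, n_bins):
    """Yield one deep-copied config per time bin, with the time range narrowed."""
    time = config["grid"]["time"]
    for start, stop in time_bins(time["min"], time["max"], n_bins):
        binned = copy.deepcopy(config)
        binned["grid"]["time"] = {"min": start, "max": stop}
        yield binned


base = {"grid": {"time": {"min": 58000.0, "max": 58004.0}}}
for cfg in configs_per_time_bin(base, 4):
    # Each `cfg` would be handed to its own Analysis instance, e.g. Analysis(cfg).
    print(cfg["grid"]["time"])
```

Deep-copying the base configuration keeps the per-bin runs independent, at the cost of repeating the data reduction for each bin.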
We could explore the use of papermill to run workflows defined in a notebook using the values of the parameter-value configuration file defined for the high level interface API. The notebooks could be provided as skeleton-templates for specific use cases or built by the user. This option would provide a rich-text formatted report of the analysis process execution.
One extra dimension that arises from the development of this high level interface API is the possibility to capture data provenance as structured logs of task executions in a session or in an automated workflow. Capturing provenance in this way is more useful if we also provide the means for it to be easily queried and inspected in multiple ways, which would also allow forensic studies of the research analysis process and improve reproducibility and reuse by the community. This work will be the scope of another PIG.
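As an illustration of what a structured log of task executions could mean in practice, the sketch below records one dictionary per task call using a decorator. The record fields, the `record_provenance` decorator and the in-memory `PROVENANCE` list are hypothetical. The actual design is left to the separate PIG mentioned above.

```python
import functools
import json
import time

PROVENANCE = []  # in-memory log; a real session would persist this


def record_provenance(task):
    """Decorator that appends one structured record per call of `task`."""

    @functools.wraps(task)
    def wrapper(*args, **kwargs):
        start = time.time()
        status = "failed"
        try:
            result = task(*args, **kwargs)
            status = "ok"
            return result
        finally:
            # Runs on success and on failure, so every call leaves a record.
            PROVENANCE.append(
                {
                    "task": task.__name__,
                    "parameters": kwargs,
                    "start": start,
                    "duration": time.time() - start,
                    "status": status,
                }
            )

    return wrapper


@record_provenance
def reduce_data(*, stacked=True):
    return "datasets"


reduce_data(stacked=False)
print(json.dumps(PROVENANCE[-1], indent=2))
```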
Alternatives
A different approach to this high level interface API is that of command line tools executed from the terminal, as done by Fermitools and ctools, where each tool is simple/atomic enough to allow users to inspect the output results before deciding how to run and set the parameter values for the next tool. A similar approach could be taken with the Gammapy high level interface API, but inside a notebook or IPython session.
Concerning the code implementation, ctapipe tools provide a solution based on Python traitlets, acting as an extensible framework to easily transform Python classes into command line tools. We will explore the adoption of this approach after Gammapy v1.0, since it requires a considerable refactoring effort on the Gammapy code-base.
Another config-file based solution is what is implemented in Enrico. It performs basic orchestrated analysis workflows using a set of input parameters that the user provides via a configuration file. The user may be guided in the declaration of values for the config file using an assistant command line tool for config-file building, which asks for parameter values and also provides defaults. This is done in Enrico with enrico_config and enrico_xml, where each workflow is set up and then run with its own command line tool. In our case, we define the workflow steps in a simple Python script and declare parameter-value pairs in a configuration file. The Python script is then executed to run the workflow.
Python scripts and/or notebook files could also be generated with an assistant command line tool. The user could then edit and tweak the config files, scripts or notebooks. There isn’t much precedent for this workflow in science, but it is a standard technique used by many dev-ops and programming tools, for example the Angular CLI or cookiecutter.
Task list
Required for Gammapy v1.0
Prototype for a manager class, agents, tools, etc.
Define a syntax for the declaration of parameter-value pairs needed for all tools in the analysis process.
Develop the session manager class responsible for driving the orchestration of tools in the analysis process.
Develop the tool classes responsible for performing each of the tasks in the analysis process.
Design use cases and/or choose among the existing tutorials or benchmarks those that may be translated into high level interface notebooks.
Provide notebooks using the high level interface API for each of the chosen tutorials, benchmarks and/or use cases identified.
Add documentation for the high level interface API and clean the list of documentation tutorials, making a clear distinction between those using the high level interface API and those using Gammapy as a Python toolbox.
Extra features (command line tools)
Develop the small set of helper command line tools described above.
Develop an assistant command line tool that produces Python scripts and/or notebooks using the high level interface API.
Clean and refactor the gammapy.scripts module to remove old and unused command line tools.
Clean the present documentation on gammapy.scripts and transform it into documentation of the helper command line tools.