PIG 4 - Setup for tutorial notebooks and data
- Author: José Enrique Ruiz, Christoph Deil 
- Created: May 16, 2018 
- Accepted: Oct 4, 2018 
- Status: accepted 
- Discussion: GH 1419 
Abstract
For the past few years, we have had tutorial notebooks and example datasets in a
second gammapy-extra repository, as well as other example datasets placed
in different repositories like gamma-cat and gammapy-fermi-lat-data.
The motivation was to keep the main gammapy code repository small. But we
always had problems with code, tutorials and data changing and versions not
being linked.
We propose to move the notebooks to the gammapy repository so that code and
tutorials can be version-coupled, and to only use stable datasets in tutorials
to mostly avoid the versioning issues. The datasets will remain in the
gammapy-extra repository.
To ship tutorials and datasets to users, we propose to add a gammapy
download command. The gammapy-extra repository will remain as a repository
for developers and as one place where datasets can be put, but it will not be
mentioned to users.
What we have
We have the gammapy repository for code, and the gammapy-extra repository for tutorial notebooks, example datasets and a few other things.
The gammapy repository currently is 12 MB and the gammapy-extra
repository is 161 MB. In gammapy-extra/notebooks, we have ~ 30 tutorial
notebooks, each 20 kB to 1 MB in size, i.e. a few MB in total. Most of the size
comes from PNG output images in the notebooks, and they usually change on
re-run, i.e. even though git compresses a bit, the repo grows by up to 1 MB
every time a notebook is edited. The datasets we access from the tutorials are
maybe 20 or 30 MB; a lot of the datasets we have there are old and should be
cleaned up or removed. The reason the notebooks and datasets were split out from
the code was to keep the code repository small and avoid it growing to 100 MB or
even 1 GB over time.
This separation of code vs. notebooks and datasets has been a problem in Gammapy for years.
Given that Gammapy code changes (and probably always will, even if the pace slows over the years), the tutorials have to be version-coupled to the code. A related question is how tutorial notebooks, datasets used in tutorials and other datasets not used in the tutorials are shipped to users. Related discussions can be found in e.g. GH 1237, GH 1369, GH 700, GH 431, GH 405, GH 228 and GH 1131 (we probably missed a few).
Proposal
This proposal is limited in scope. It outlines a few key changes in the setup for notebooks and example data that mostly solve the versioning and shipping issues of tutorials and datasets. Other related issues that may appear will be addressed iteratively, with or without an extra PIG.
To solve the versioning issue for notebooks, we propose to move the notebooks
from gammapy-extra/notebooks to gammapy/notebooks. We propose to store
the notebooks in the repository without output cells filled. Removing output
cells before committing has the advantage that the files are small, and that the
diff in the pull request is also small and review becomes possible. On the other
hand, the output is not directly visible on GitHub. Note that in any case, a
rendered version of the notebooks will be available via the docs; this is
already in place. We count on developer documentation and code review to
guarantee empty-output notebooks stored in gammapy/notebooks, though we can
also explore nbstripout for a potential implementation of an automated
mechanism to remove outputs from notebooks in the GitHub repository.
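As an illustration of what removing output cells amounts to, here is a minimal sketch using only the Python standard library. It assumes the notebooks are stored in the nbformat 4 JSON layout, where each code cell carries an "outputs" list and an "execution_count". The helper names strip_outputs and strip_file are hypothetical; this is not the nbstripout implementation and not part of Gammapy.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat 4 notebook dict with code-cell outputs removed."""
    # Deep copy via a JSON round-trip so the caller's dict is left untouched.
    stripped = json.loads(json.dumps(notebook))
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def strip_file(path: str) -> None:
    """Hypothetical helper: strip outputs from a notebook file in place."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(strip_outputs(notebook), fh, indent=1)
        fh.write("\n")
```

A developer could call strip_file on each edited notebook before committing. An automated tool would do the same job without relying on anyone remembering to run it.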
During the documentation build, the notebooks will be tested and executed automatically, so the static HTML-formatted notebooks rendered in the documentation will contain output cells. In contrast, links to Binder notebooks and download links to .ipynb files will point to empty-output notebooks.
To solve the versioning issue for datasets, we propose to only use stable
example datasets. Examples are gammapy-fermi-lat-data or
gammapy-extra/datasets/cta-1dc or the upcoming HESS DL3 DR1 or
joint-crab datasets. Datasets can be in gammapy-extra or at any other
URL, but even if they are in gammapy-extra, they should not be "live"
datasets. If an issue is found or something is improved, a separate new dataset
should be added, instead of changing the existing one. So versioning of example
datasets is not needed.
To ship notebooks and example data to users, we propose to introduce a gammapy
download command. This work, and the discussion of how it should work in detail,
started in GH 1369. Roughly, the idea is that users will use gammapy
download to download a version of the notebooks matching the version of the
Gammapy code, by fetching the files from GitHub. A gammapy download
tutorials command will download all notebooks and the related input datasets.
No output datasets from the notebooks will be downloaded. All files will be
copied into a $CWD/gammapy-tutorials folder, the datasets placed in a
datasets subfolder and the notebooks into a notebooks-x.x subfolder
accounting for the version downloaded. The management of updating the
gammapy-tutorials folder after a local update of gammapy is left up to
the user.
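The folder layout described above can be sketched as follows. This is only an illustration of the naming scheme: the function tutorial_paths is hypothetical and is not the actual implementation of the gammapy download command.

```python
from pathlib import Path


def tutorial_paths(version: str, cwd: Path) -> dict:
    """Hypothetical helper: target folders for a tutorials download.

    Datasets are not versioned, so they share one folder. Notebooks go
    into a folder whose name carries the Gammapy version they match.
    """
    root = cwd / "gammapy-tutorials"
    return {
        "root": root,
        "datasets": root / "datasets",
        "notebooks": root / f"notebooks-{version}",
    }


if __name__ == "__main__":
    for name, path in tutorial_paths("0.8", Path.cwd()).items():
        print(name, path)
```

Because the version is part of the notebooks folder name, downloading the tutorials again after upgrading Gammapy would add a new notebooks folder next to the old one instead of overwriting it.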
The URLs of the input files used by the notebooks should be noted in the
tutorials/notebooks.yaml file in the Gammapy repository, which also defines
the list of notebooks to download as tutorials. For the different stable
releases, the list of tutorials to download, their locations and the datasets
used are declared in YAML files placed in the download/tutorials folder of the
gammapy-webpage GitHub repository. Likewise, the conda working environments of
stable releases are declared in files placed in the download/install folder
of that repository. The datasets are not versioned and are declared in the
download/data folder.
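To make the role of such an index concrete, here is a rough sketch of how a download step might use it. The schema is an assumption: the keys "url" and "datasets", the notebook names and the example.org URLs are all hypothetical, and a plain dict stands in for the parsed YAML content. The real files may be structured differently.

```python
# Hypothetical index, as it might look after parsing a YAML file.
NOTEBOOK_INDEX = {
    "first_steps": {
        "url": "https://example.org/notebooks/first_steps.ipynb",
        "datasets": ["cta-1dc"],
    },
    "analysis_3d": {
        "url": "https://example.org/notebooks/analysis_3d.ipynb",
        "datasets": ["cta-1dc", "joint-crab"],
    },
}


def files_to_download(index: dict) -> tuple:
    """Collect the notebook URLs and the de-duplicated dataset names."""
    notebook_urls = [entry["url"] for entry in index.values()]
    datasets = []
    for entry in index.values():
        for name in entry.get("datasets", []):
            if name not in datasets:  # several notebooks may share a dataset
                datasets.append(name)
    return notebook_urls, datasets
```

Declaring the datasets per notebook means that only the input data the tutorials actually use needs to be fetched, and a dataset shared by several notebooks is fetched once.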
As far as we can see, for testing and online Binder these changes don’t
introduce significant improvements or new problems, though a Dockerfile in the
gammapy repository will be needed to have these notebooks running in Binder.
This is a change that will just affect developers and users on their local
machines.
Alternatives
One alternative would be to keep the notebooks in gammapy-extra, and to
couple the version with gammapy somehow, e.g. via a git submodule pointer,
or via a config file in one of the repos or on gammapy.org with the version to
be used. The mono repo approach seems simpler and better.
For shipping the notebooks, one option is to include them in the source and
binary distribution as package data, instead of downloading them from the web.
For datasets this is not a good option: it would limit us to 10 - 30 MB max,
i.e. we would get a split between some datasets distributed this way, and larger
ones still via gammapy download. Overall it doesn’t seem useful; note that
we also don’t ship HTML documentation with the code, but separately.
Decision
This PIG was extensively discussed on GitHub, as well as in online and in-person meetings. It was then implemented in summer 2018, and we shipped the new setup with Gammapy v0.8 and tried the development workflow for a few weeks. The solution works well so far and does solve the notebook and dataset issues that motivated the work. It was finally approved during the Gammapy coding sprint on Oct 4, 2018.
