PIG 4 - Setup for tutorial notebooks and data
- Author: José Enrique Ruiz, Christoph Deil
- Created: May 16, 2018
- Accepted: Oct 4, 2018
- Status: accepted
- Discussion: GH 1419
Abstract
For the past years, we have had tutorial notebooks and example datasets in a
second gammapy-extra repository, as well as other example datasets placed in
different repositories like gamma-cat and gammapy-fermi-lat-data.
The motivation was to keep the main gammapy code repository small.
But we always had problems with code, tutorials and data changing and versions
not being linked.
We propose to move the notebooks to the gammapy repository so that code and
tutorials can be version-coupled, and to only use stable datasets in tutorials to
mostly avoid the versioning issues. The datasets will remain in the gammapy-extra
repository.
To ship tutorials and datasets to users, we propose to add a gammapy download
command. The gammapy-extra repository will remain as a repository for
developers and as one place where datasets can be put, but it will not be mentioned
to users.
What we have
We have the gammapy repository for code, and the gammapy-extra repository for tutorial notebooks, example datasets and a few other things.
The gammapy repository currently is 12 MB and the gammapy-extra repository
is 161 MB. In gammapy-extra/notebooks, we have ~ 30 tutorial notebooks, each
20 kB to 1 MB in size, i.e. a few MB in total. Most of the size comes from PNG
output images in the notebooks, and they usually change on re-run, i.e. even
though git compresses a bit, the repo grows by up to 1 MB every time a notebook
is edited. The datasets we access from the tutorials are maybe 20 or 30 MB; many
of the datasets we have there are old and should be cleaned up or removed.
The reason the notebooks and datasets were split out from the code was to keep the
code repository small and avoid it growing to 100 MB or even 1 GB over time.
This separation of code vs. notebooks and datasets has been a problem in Gammapy for years.
Given that Gammapy code changes (and probably always will, even if the pace becomes slower over the years), the tutorials have to be version-coupled to the code. A related question is how tutorial notebooks, datasets used in tutorials and other datasets not used in the tutorials are shipped to users. For related discussions, see e.g. GH 1237, GH 1369, GH 700, GH 431, GH 405, GH 228, GH 1131 (we probably missed a few).
Proposal
This proposal is limited. It outlines a few key changes in the setup for notebooks and example data that mostly solve the versioning and shipping issues of tutorials and datasets. Other related issues that may appear will be addressed iteratively, with or without an extra PIG.
To solve the versioning issue for notebooks, we propose to move the notebooks
from gammapy-extra/notebooks to gammapy/notebooks. We propose to store
the notebooks in the repository without output cells filled. Removing output
cells before committing has the advantage that the files are small, and that
the diff in the pull request is also small and review becomes possible.
On the other hand, the output is not directly visible on GitHub. Note that in
any case, a rendered version of the notebooks will be available via the docs,
which is already in place. We count on developer documentation and code review
to guarantee that only empty-output notebooks are stored in gammapy/notebooks,
though we can also explore nbstripout as a potential automated mechanism to
remove outputs from notebooks in the GitHub repository.
During the documentation build, the notebooks will be tested and executed automatically, so the static HTML-formatted notebooks rendered in the documentation will contain output cells. In contrast, links to Binder notebooks and download links to .ipynb files will point to empty-output notebooks.
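As an illustration of what "empty-output" means for a notebook file, here is a minimal sketch that clears outputs from a notebook in the nbformat 4 JSON layout. It is not how nbstripout is implemented and is not part of this proposal; the function names `strip_outputs` and `has_outputs` are hypothetical, and the example notebook is made up.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat 4 notebook with code-cell outputs cleared."""
    stripped = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


def has_outputs(notebook: dict) -> bool:
    """True if any code cell still carries outputs or an execution count."""
    return any(
        cell.get("cell_type") == "code"
        and (cell.get("outputs") or cell.get("execution_count") is not None)
        for cell in notebook.get("cells", [])
    )


# Made-up two-cell notebook, standing in for a file loaded with json.load().
example = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 3,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                    "execution_count": 3,
                }
            ],
        },
    ],
}

clean = strip_outputs(example)
```

A check like `has_outputs` could be used during code review or in a test to flag notebooks committed with outputs, but whether to automate this is left open here.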
To solve the versioning issue for datasets, we propose to only use stable
example datasets. Examples are gammapy-fermi-lat-data or
gammapy-extra/datasets/cta-1dc or the upcoming HESS DL3 DR1 or
joint-crab datasets. Datasets can be in gammapy-extra or at any other
URL, but even if they are in gammapy-extra, they should not be "live"
datasets. If an issue is found or something is improved, a separate new dataset
should be added, instead of changing the existing one. So versioning of example
datasets is not needed.
To ship notebooks and example data to users, we propose to introduce a gammapy
download command. This work, and the discussion of how it should work in detail,
started in GH 1369. Roughly, the idea is that users will use gammapy
download to download a version of the notebooks matching the version of the
Gammapy code, by fetching the files from GitHub. A gammapy download tutorials
command will download all notebooks and the related input datasets. No output
datasets from the notebooks will be downloaded. All files will be copied into a
$CWD/gammapy-tutorials folder, with the datasets placed in a datasets subfolder
and the notebooks in a notebooks-x.x subfolder named after the version
downloaded. Updating the gammapy-tutorials folder after a
local update of gammapy is left up to the user.
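The folder layout described above can be sketched as follows. This is only an illustration of the naming scheme, not the code of the gammapy download command; the function name `tutorial_layout`, the example working directory and the `"0.8"` release string format are assumptions.

```python
from pathlib import Path


def tutorial_layout(cwd: Path, release: str) -> dict:
    """Compute the target folders for a tutorials download (hypothetical helper)."""
    base = cwd / "gammapy-tutorials"
    return {
        "base": base,
        # Datasets are not versioned, so they share one folder.
        "datasets": base / "datasets",
        # Notebooks are version-coupled to the code, so one folder per release.
        "notebooks": base / f"notebooks-{release}",
    }


# Example with a made-up working directory; a real call would use Path.cwd().
layout = tutorial_layout(Path("work"), "0.8")
```

Keeping one notebooks folder per release means that notebooks for different Gammapy versions can sit side by side, while the datasets folder is shared between them.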
The URLs of the input files used by the notebooks should be listed in the
tutorials/notebooks.yaml file in the Gammapy repository, which also serves
as the list of notebooks to download as tutorials. For the different stable releases,
the list of tutorials to download, their locations and the datasets used are declared in
YAML files placed in the download/tutorials folder of the gammapy-webpage GitHub repository.
Likewise, the conda working environments of stable releases are declared in files placed
in the download/install folder of that repository. The datasets are not versioned and
are similarly declared in the download/data folder.
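To make the role of these index files more concrete, here is a minimal sketch of how a tutorials index and a data index could be combined into a download plan. The field names (`name`, `url`, `datasets`), the notebook names and the `example.org` URLs are all hypothetical; the actual YAML schema is not specified in this document. Plain Python structures stand in for the parsed YAML.

```python
# Hypothetical parsed content of a per-release tutorials index.
tutorials_index = [
    {
        "name": "example_a",
        "url": "https://example.org/notebooks/example_a.ipynb",
        "datasets": ["cta-1dc"],
    },
    {
        "name": "example_b",
        "url": "https://example.org/notebooks/example_b.ipynb",
        "datasets": ["cta-1dc", "joint-crab"],
    },
]

# Hypothetical parsed content of the unversioned data index.
data_index = {
    "cta-1dc": ["https://example.org/data/cta-1dc/index.fits.gz"],
    "joint-crab": ["https://example.org/data/joint-crab/spectra.fits"],
}


def plan_downloads(tutorials: list, datasets: dict) -> tuple:
    """Return (notebook URLs, dataset file URLs) needed for the given tutorials."""
    notebook_urls = [entry["url"] for entry in tutorials]
    # Each dataset is fetched once, even if several notebooks use it.
    needed = sorted({name for entry in tutorials for name in entry["datasets"]})
    missing = [name for name in needed if name not in datasets]
    if missing:
        raise KeyError(f"datasets not declared in the data index: {missing}")
    data_urls = [url for name in needed for url in datasets[name]]
    return notebook_urls, data_urls


notebook_urls, data_urls = plan_downloads(tutorials_index, data_index)
```

Only input datasets referenced by the listed notebooks end up in the plan, which matches the intent that no other files are downloaded with the tutorials.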
As far as we can see, for testing and online Binder these changes don’t introduce
significant improvements or new problems, though a Dockerfile in the gammapy
repository will be needed to have these notebooks running in Binder. This is a
change that will just affect developers and users on their local machines.
Alternatives
One alternative would be to keep the notebooks in gammapy-extra, and to
couple the version with gammapy somehow, e.g. via a git submodule pointer,
or via a config file in one of the repos or on gammapy.org with the version to
be used. The mono repo approach seems simpler and better.
For shipping the notebooks, one option is to include them in the source and
binary distribution as package data, instead of downloading them from the web.
For datasets this is not a good option: it would limit us to 10 - 30 MB max,
i.e. we would get a split between some datasets distributed this way, and larger
ones still via gammapy download. Overall it doesn’t seem useful; note that
we also don’t ship HTML documentation with the code, but separately.
Decision
This PIG was extensively discussed on GitHub, as well as in online and in-person meetings. It was then implemented in summer 2018, and we shipped the new setup with Gammapy v0.8 and tried the development workflow for a few weeks. The solution works well so far and does solve the notebook and dataset issues that motivated the work. It was finally approved during the Gammapy coding sprint on Oct 4, 2018.