.. include:: ../../references.txt

.. _pig-004:

*********************************************
PIG 4 - Setup for tutorial notebooks and data
*********************************************

* Author: José Enrique Ruiz, Christoph Deil
* Created: May 16, 2018
* Accepted: Oct 4, 2018
* Status: accepted
* Discussion: `GH 1419`_

Abstract
========

For the past years, we have had tutorial notebooks and example datasets in a
second ``gammapy-extra`` repository, as well as other example datasets placed
in different repositories like ``gamma-cat`` and ``gammapy-fermi-lat-data``.
The motivation was to keep the main ``gammapy`` code repository small. But we
always had problems with code, tutorials and data changing and versions not
being linked.

We propose to move the notebooks to the ``gammapy`` repository so that code
and tutorials can be version-coupled, and to only use stable datasets in
tutorials, which mostly avoids the versioning issues. The datasets will remain
in the ``gammapy-extra`` repository.

To ship tutorials and datasets to users, we propose to add a ``gammapy
download`` command. The ``gammapy-extra`` repository will remain as a
repository for developers and as one place where datasets can be put, but it
will not be mentioned to users.

What we have
============

We have the `gammapy`_ repository for code, and the `gammapy-extra`_
repository for tutorial notebooks, example datasets and a few other things.
The ``gammapy`` repository currently is 12 MB and the ``gammapy-extra``
repository is 161 MB.

In ``gammapy-extra/notebooks``, we have about 30 tutorial notebooks, each
20 kB to 1 MB in size, i.e. a few MB in total. Most of the size comes from PNG
output images in the notebooks, and they usually change on re-run. So even
though git compresses a bit, the repo grows by up to 1 MB every time a
notebook is edited. The datasets we access from the tutorials are maybe 20 or
30 MB. A lot of the datasets we have there are old and should be cleaned up
or removed.

The reason the notebooks and datasets were split out from the code was to keep
the code repository small and avoid it growing to 100 MB or even 1 GB over
time.

This separation of code vs. notebooks and datasets has been a problem in
Gammapy for years. Given that Gammapy code changes (and probably always will,
even if the pace becomes slower and slower over the years), the tutorials
have to be version-coupled to the code. A related question is how tutorial
notebooks, datasets used in tutorials and other datasets not used in the
tutorials are shipped to users.

Some related discussions may be found in the following references, see e.g.
`GH 1237`_, `GH 1369`_, `GH 700`_, `GH 431`_, `GH 405`_, `GH 228`_,
`GH 1131`_ (probably missed a few).

Proposal
========

This proposal is limited. It outlines a few key changes in the setup for
notebooks and example data that mostly solve the versioning and shipping
issues of tutorials and datasets. Other related issues that may appear will be
addressed iteratively, with or without an extra PIG.

To solve the versioning issue for notebooks, we propose to move the notebooks
from ``gammapy-extra/notebooks`` to ``gammapy/notebooks``. We propose to store
the notebooks in the repository without output cells filled. Removing output
cells before committing has the advantage that the files are small, and that
the diff in the pull request is also small, so review becomes possible. On the
other hand, the output is not directly visible on Github. Note that in any
case a rendered version of the notebooks will be available via the docs; that
is already in place.

We count on developer documentation and code review to guarantee that only
empty-output notebooks are stored in ``gammapy/notebooks``, though we can also
explore `nbstripout`_ as a potential implementation of an automated mechanism
to remove outputs from notebooks in the Github repository.

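
As an illustration of what an empty-output notebook means in practice, the
following is a minimal sketch in plain Python. It is not the `nbstripout`_
implementation and not part of Gammapy; the function names ``strip_outputs``
and ``is_stripped`` are hypothetical. It only assumes the nbformat 4 layout,
where every code cell carries an ``outputs`` list and an ``execution_count``
field.

.. code-block:: python

    import json


    def strip_outputs(notebook):
        """Return a copy of a notebook dict with code cell outputs removed.

        Hypothetical helper; assumes the nbformat 4 JSON layout.
        """
        # Deep copy via a JSON round-trip so the input is left untouched.
        stripped = json.loads(json.dumps(notebook))
        for cell in stripped.get("cells", []):
            if cell.get("cell_type") == "code":
                cell["outputs"] = []
                cell["execution_count"] = None
        return stripped


    def is_stripped(notebook):
        """Check that no code cell carries outputs or an execution count."""
        return all(
            not cell.get("outputs") and cell.get("execution_count") is None
            for cell in notebook.get("cells", [])
            if cell.get("cell_type") == "code"
        )

A check like ``is_stripped`` could in principle back up code review, e.g. in
a test that loads every file in ``gammapy/notebooks`` with ``json.load``, but
whether such a check is wanted is left open here.
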
In the process of documentation building the notebooks will be tested and
executed automatically, so the static HTML-formatted notebooks rendered in the
documentation will contain output cells. In contrast, links to Binder
notebooks and download links to .ipynb files will point to empty-output
notebooks.

To solve the versioning issue for datasets, we propose to only use stable
example datasets. Examples are `gammapy-fermi-lat-data`_ or
``gammapy-extra/datasets/cta-1dc`` or the upcoming `HESS DL3 DR1`_ or
``joint-crab`` datasets. Datasets can be in ``gammapy-extra`` or at any other
URL, but even if they are in ``gammapy-extra``, they should not be "live"
datasets. If an issue is found or something is improved, a separate new
dataset should be added, instead of changing the existing one. So versioning
of example datasets is not needed.

To ship notebooks and example data to users, we propose to introduce a
``gammapy download`` command. Work and discussion on how it should work in
detail started in `GH 1369`_. Roughly, the idea is that users will use
``gammapy download`` to download a version of the notebooks matching the
version of the Gammapy code, by fetching the files from Github.

A ``gammapy download tutorials`` command will download all notebooks and the
related input datasets. No output datasets from the notebooks will be
downloaded. All files will be copied into a ``$CWD/gammapy-tutorials`` folder,
with the datasets placed in a ``datasets`` subfolder and the notebooks in a
``notebooks-x.x`` subfolder named after the downloaded version. Updating the
``gammapy-tutorials`` folder after a local update of ``gammapy`` is left up
to the user. The URLs of the input files used by the notebooks should be
noted in the ``tutorials/notebooks.yaml`` file in the Gammapy repository,
which also lists the notebooks to download as tutorials.

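
The folder layout described above can be sketched as follows. This is only an
illustration of the target paths, not the ``gammapy download``
implementation: the function name ``plan_tutorial_download`` is hypothetical,
the URLs in the usage note are placeholders, and placing each file directly
under its subfolder by file name is an assumption.

.. code-block:: python

    from pathlib import Path


    def plan_tutorial_download(version, notebook_urls, dataset_urls, cwd="."):
        """Map each source URL to a local target path (hypothetical helper).

        Notebooks go to ``gammapy-tutorials/notebooks-<version>`` and
        datasets to ``gammapy-tutorials/datasets``, both below ``cwd``.
        """
        root = Path(cwd) / "gammapy-tutorials"
        notebooks_dir = root / f"notebooks-{version}"
        datasets_dir = root / "datasets"

        plan = {}
        for url in notebook_urls:
            # Assumption: keep only the file name of each URL.
            plan[url] = notebooks_dir / url.rsplit("/", 1)[-1]
        for url in dataset_urls:
            plan[url] = datasets_dir / url.rsplit("/", 1)[-1]
        return plan

For example, calling ``plan_tutorial_download("0.8",
["https://example.org/nb/first_steps.ipynb"], [])`` maps the notebook URL to
``gammapy-tutorials/notebooks-0.8/first_steps.ipynb`` below the current
working directory. Because the notebook folder carries the version while the
``datasets`` folder does not, several notebook versions can share one copy of
the stable datasets.
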
For the different stable releases, the list of tutorials to download, their
locations and the datasets used are declared in YAML files placed in the
``download/tutorials`` folder of the `gammapy-webpage`_ Github repository.
Similarly, the conda working environments of stable releases are declared in
files placed in the ``download/install`` folder of that repository. The
datasets are not versioned and are declared in the ``download/data`` folder.

As far as we can see, for testing and online Binder these changes don't
introduce significant improvements or new problems, though a Dockerfile in the
``gammapy`` repository will be needed to have these notebooks running in
Binder. This is a change that will just affect developers and users on their
local machines.

Alternatives
============

One alternative would be to keep the notebooks in ``gammapy-extra``, and to
couple the version with ``gammapy`` somehow, e.g. via a git submodule pointer,
or via a config file in one of the repos or on gammapy.org with the version to
be used. The mono repo approach seems simpler and better.

For shipping the notebooks, one option is to include them in the source and
binary distribution as package data, instead of downloading them from the
web. For datasets this is not a good option: it would limit us to 10 - 30 MB
max, i.e. we would get a split between some datasets distributed this way and
larger ones still distributed via ``gammapy download``. Overall it doesn't
seem useful; note that we also don't ship HTML documentation with the code,
but separately.

Decision
========

This PIG was extensively discussed on Github, as well as in online and
in-person meetings. It was then implemented in summer 2018, and we shipped the
new setup with Gammapy v0.8 and tried the development workflow for a few
weeks. The solution works well so far and does solve the notebook and dataset
issues that motivated the work. It was finally approved during the Gammapy
coding sprint on Oct 4, 2018.

.. _GH 1419: https://github.com/gammapy/gammapy/pull/1419
.. _GH 1369: https://github.com/gammapy/gammapy/pull/1369
.. _GH 1237: https://github.com/gammapy/gammapy/issues/1237
.. _GH 1131: https://github.com/gammapy/gammapy/issues/1131
.. _GH 700: https://github.com/gammapy/gammapy/pull/700
.. _GH 431: https://github.com/gammapy/gammapy/pull/431
.. _GH 405: https://github.com/gammapy/gammapy/issues/405
.. _GH 228: https://github.com/gammapy/gammapy/issues/288
.. _gammapy: https://github.com/gammapy/gammapy
.. _gammapy-extra: https://github.com/gammapy/gammapy-extra
.. _gammapy-fermi-lat-data: https://github.com/gammapy/gammapy-fermi-lat-data
.. _HESS DL3 DR1: https://www.mpi-hd.mpg.de/hfm/HESS/pages/dl3-dr1/
.. _nbstripout: https://github.com/kynan/nbstripout