diff --git a/01-conda.md b/01-conda.md new file mode 100644 index 0000000..cbd32b7 --- /dev/null +++ b/01-conda.md @@ -0,0 +1,416 @@ +--- +title: Package management +teaching: 15 +exercises: 15 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Identify the main Python libraries used in atmosphere and ocean science and the relationships between them. +- Explain the advantages of Anaconda over other Python distributions. +- Extend the number of packages available via conda using conda-forge. +- Create a conda environment with the libraries needed for these lessons. +- Open a Jupyter Notebook ready for use in these lessons + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- What are the main Python libraries used in atmosphere and ocean science? +- How do I install and manage all the Python libraries that I want to use? +- How do I interact with Python? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## The PyAOS stack + +Before we jump in and start analysing our netCDF precipitation data files, +we need to consider what Python libraries are best suited to the task. + +For reading, writing and analysing data stored in the netCDF file format, +atmosphere and ocean scientists will typically do most of their work with either the +[xarray](https://xarray.pydata.org/en/stable/) or [iris](https://scitools.org.uk/iris/) libraries. +These libraries are built on top of more generic data science libraries like numpy and matplotlib, +to make the types of analysis we do faster and more efficient. +To learn more about the PyAOS "stack" shown in the diagram below +(i.e. the collection of libraries that are typically used for +data analysis and visualisation in the atmosphere and ocean sciences), +check out the [overview of the PyAOS stack](https://pyaos.github.io/stack/) +at the PyAOS community site. + +![](fig/01-pyaos-stack.svg){alt='PyAOS stack'} + +## Python distributions for data science + +Now that we've identified the Python libraries we might want to use, +how do we go about installing them? + +Our first impulse might be to use the Python package installer (pip), +but it really only works for libraries written in pure Python. +This is a major limitation for the data science community, +because many scientific Python libraries have C and/or Fortran dependencies. +To spare people the pain of installing these dependencies, +a number of scientific Python "distributions" have been released over the years. +These come with the most popular data science libraries and their dependencies pre-installed, +and some also come with a package manager to assist with installing +additional libraries that weren't pre-installed. +Today the most popular distribution for data science is [Anaconda](https://www.anaconda.com/distribution/), +which comes with a package (and environment) manager called [conda](https://conda.io/docs/). + +## Introducing conda + +According to the [latest documentation](https://docs.anaconda.com/anaconda/#anaconda-navigator-or-conda), +Anaconda comes with over 250 of the most widely used data science libraries (and their dependencies) pre-installed. +In addition, there are several thousand more libraries available via the `conda install` command, +which can be executed using the Bash Shell or Anaconda Prompt (Windows only). +It is also possible to install packages using the Anaconda Navigator graphical user interface. + +::::::::::::::::::::::::::::::::::::::::: callout + +## conda in the shell on windows + +If you're on a Windows machine and the `conda` command isn't available at the Bash Shell, +you'll need to open the Anaconda Prompt program (via the Windows start menu) +and run the command `conda init bash` (this only needs to be done once). +After that, your Bash Shell will be configured to use `conda` going forward. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +For instance, the popular `xarray` library could be installed using the following command, + +```bash +$ conda install xarray +``` + +(Use `conda search -f {package_name}` to find out if a package you want is available.) + +OR using Anaconda Navigator: + +![](fig/01-navigator-xarray.png){alt='Anaconda Navigator xarray search'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## Miniconda + +If you don't want to install the entire Anaconda distribution, +you can install [Miniconda](https://conda.pydata.org/miniconda.html) instead. +It essentially comes with conda and nothing else. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Advanced conda + +For a relatively small/niche field of research like atmosphere and ocean science, +one of the most important features that Anaconda provides is the +[Anaconda Cloud](https://anaconda.org) website, +where the community can contribute conda installation packages. +This is critical because many of our libraries have a small user base, +which means they'll never make it into the top few thousand data science libraries +supported by Anaconda. + +You can search Anaconda Cloud to find the command needed to install the package. +For instance, here is the search result for the `iris` package: + +![](fig/01-iris-search.png){alt='Iris search on Anaconda Cloud'} + +As you can see, there are often multiple versions of the same package up on Anaconda Cloud. +To try and address this duplication problem, +[conda-forge](https://conda-forge.github.io/) has been launched, +which aims to be a central repository that contains just a single (working) version +of each package on Anaconda Cloud. +You can therefore expand the selection of packages available via `conda install` +beyond the chosen few thousand by adding the conda-forge channel: + +```bash +$ conda config --add channels conda-forge +``` + +OR + +![](fig/01-navigator-conda-forge.png){alt='Anaconda Navigator conda-forge'} + +We recommend not adding any other third-party channels unless absolutely necessary, +because mixing packages from multiple channels can cause headaches like binary incompatibilities. + +## Software installation for these lessons + +For these particular lessons we will use `xarray`, +but all the same tasks could be performed with `iris`. +We'll also install +[`dask`](https://dask.org/) (`xarray` uses this for parallel processing), +[`netCDF4`](https://unidata.github.io/netcdf4-python/) (`xarray` requires this to read netCDF files), +[`cartopy`](https://scitools.org.uk/cartopy/) (to help with geographic plot projections), +[`cmocean`](https://matplotlib.org/cmocean/) (for nice color palettes) and +[`cmdline_provenance`](https://cmdline-provenance.readthedocs.io/en/latest/) +(to keep track of our data processing steps). +We don't need to worry about installing [jupyter](https://jupyter.org/) +(we will be using the jupyter notebook) because it already comes +pre-installed with Anaconda. + +We could install these libraries from Anaconda Navigator (not shown) +or using the Bash Shell or Anaconda Prompt (Windows): + +```bash +$ conda install xarray dask netCDF4 cartopy cmocean cmdline_provenance +``` + +If we then list all the libraries that we've got installed, +we can see that `jupyter`, `dask`, `xarray`, `netCDF4`, `cartopy`, `cmocean`, `cmdline_provenance` +and their dependencies are now there: + +```bash +$ conda list +``` + +(This list can also be viewed in the environments tab of the Navigator.) + +::::::::::::::::::::::::::::::::::::::::: callout + +## Creating separate environments + +If you've got multiple data science projects on the go, +installing all your packages in the same conda environment can get a little messy. +(By default they are installed in the root/base environment.) +It's therefore common practice to +[create separate conda environments](https://conda.io/docs/user-guide/tasks/manage-environments.html) +for the various projects you're working on. + +For instance, we could create an environment called `pyaos-lesson` for this lesson. +The process of creating a new environment can be managed in the environments tab +of the Anaconda Navigator or via the following Bash Shell / Anaconda Prompt commands: + +```bash +$ conda create -n pyaos-lesson jupyter xarray dask netCDF4 cartopy cmocean cmdline_provenance +$ conda activate pyaos-lesson +``` + +(it's `conda deactivate` to exit) + +Notice that in this case we had to include jupyter in the list of packages to install. +When you create a brand new conda environment, +it doesn't automatically come with the pre-installed packages +that are in the base environment. + +You can have lots of different environments, + +```bash +$ conda info --envs +``` + +```output +# conda environments: +# +base * /anaconda3 +pyaos-lesson /anaconda3/envs/pyaos-lesson +test /anaconda3/envs/test +``` + +the details of which can be exported to a YAML configuration file: + +```bash +$ conda env export -n pyaos-lesson -f pyaos-lesson.yml +$ cat pyaos-lesson.yml +``` + +```output +name: pyaos-lesson +channels: + - conda-forge + - defaults +dependencies: + - cartopy=0.16.0=py36h81b52dc_1 + - certifi=2018.4.16=py36_0 + - cftime=1.0.1=py36h7eb728f_0 + - ... +``` + +Other people (or you on a different computer) can then re-create that exact environment +using the YAML file: + +```bash +$ conda env create -f pyaos-lesson.yml +``` + +For ease of sharing the YAML file, +it can be uploaded to your account at the Anaconda Cloud website, + +```bash +$ conda env upload -f pyaos-lesson.yml +``` + +so that others can re-create the environment by simply refering to your Anaconda username: + +```bash +$ conda env create damienirving/pyaos-lesson +$ conda activate pyaos-lesson +``` + +The ease with which others can recreate your environment (on any operating system) +is a huge breakthough for reproducible research. + +To delete the environment: + +```bash +$ conda env remove -n pyaos-lesson +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Interacting with Python + +Now that we know which Python libraries we want to use and how to install them, +we need to decide how we want to interact with Python. + +The most simple way to use Python is to type code directly into the interpreter. +This can be accessed from the bash shell: + +``` +$ python +Python 3.7.1 (default, Dec 14 2018, 13:28:58) +[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin +Type "help", "copyright", "credits" or "license" for more information. +>>> print("hello world") +hello world +>>> exit() +$ +``` + +The `>>>` prompt indicates that you are now talking to the Python interpreter. + +A more powerful alternative to the default Python interpreter is IPython (Interactive Python). +The [online documentation](https://ipython.readthedocs.io/en/stable/) +outlines all the special features that come with IPython, +but as an example, it lets you execute bash shell commands +without having to exit the IPython interpreter: + +``` +$ ipython +Python 3.7.1 (default, Dec 14 2018, 13:28:58) +Type 'copyright', 'credits' or 'license' for more information +IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help. + +In [1]: print("hello world") +hello world + +In [2]: ls +data/ script_template.py +plot_precipitation_climatology.py + +In [3]: exit +$ +``` + +(The IPython interpreter can also be accessed via the Anaconda Navigator +by running the QtConsole.) + +While entering commands to the Python or IPython interpreter line-by-line +is great for quickly testing something, +it's clearly impractical for developing longer bodies of code +and/or interactively exploring data. +As such, Python users tend to do most of their code development and data exploration +using either an Integrated Development Environment (IDE) or Jupyter Notebook: + +- Two of the most common IDEs are [Spyder](https://www.spyder-ide.org/) + and [PyCharm](https://www.jetbrains.com/pycharm/) + (the former comes with Anaconda) + and will look very familiar to anyone + who has used MATLAB or R-Studio. +- [Jupyter Notebooks](https://jupyter.org/) run in your web browser + and allow users to create and share documents that contain live code, + equations, visualizations and narrative text. + +We are going to use the Jupyter Notebook to explore our precipitation data +(and the plotting functionality of xarray) in the next few lessons. +A notebook can be launched from our `data-carpentry` directory +using the Bash Shell: + +```bash +$ cd ~/Desktop/data-carpentry +$ jupyter notebook & +``` + +(The `&` allows you to come back and use the bash shell without closing +your notebook first.) + +Alternatively, you can launch Jupyter Notebook from the Anaconda Navigator +and navigate to the `data-carpentry` directory before creating a new +Python 3 notebook: + +![](fig/01-launch-notebook.png){alt='Launch Jupyter notebook'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## JupyterLab + +The Jupyter team have recently launched +[JupyterLab](https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906) +which combines the Jupyter Notebook with many of the features common to an IDE. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Install the Python libraries required for this lesson + +If you haven't already done so, +go ahead and install the `xarray`, `dask`, `netCDF4`, `cartopy`, `cmocean` and `cmdline_provenance` +packages using either the Anaconda Navigator or Bash Shell. + +Remember that you'll need to add the conda-forge channel first. + +(You may like to create a separate `pyaos-lesson` conda environment, +but this is not necessary to complete the lessons. +If you create a new environment rather than using the base environment, +you'll need to install jupyter too.) + +::::::::::::::: solution + +The [setup menu](https://carpentries-lab.github.io/python-aos-lesson/setup.html) +at the top of the page +contains drop-down boxes explaining how to install the Python libraries +using the Bash Shell or Anaconda Navigator. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Launch a Jupyer Notebook + +In preparation for the next lesson, +open a new Jupyter Notebook **in your `data-carpentry` directory** +by entering `jupyter notebook &` at the Bash Shell +or by clicking the Jupyter Notebook launch button in the Anaconda Navigator. + +If you use the Navigator, +the Jupyter interface will open in a new tab of your default web browser. +Use that interface to navigate to the `data-carpentry` directory +that you created specifically for these lessons +before clicking to create a new Python 3 notebook: + +![](fig/01-launch-notebook.png){alt='Launch Jupyter notebook'} + +Once your notebook is open, +import `xarray`, `catropy`, `matplotlib` and `numpy` +using the following Python commands: + +```python +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +``` + +(Hint: Hold down the shift and return keys to execute a code cell in a Jupyter Notebook.) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- xarray and iris are the core Python libraries used in the atmosphere and ocean sciences. +- Use conda to install and manage your Python environments. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/02-visualisation.md b/02-visualisation.md new file mode 100644 index 0000000..24baa81 --- /dev/null +++ b/02-visualisation.md @@ -0,0 +1,384 @@ +--- +title: Data processing and visualisation +teaching: 20 +exercises: 40 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Import the xarray library and use the functions it contains. +- Convert precipitation units to mm/day. +- Calculate and plot the precipitation climatology. +- Use the cmocean library to find colormaps designed for ocean science. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I create a quick plot of my CMIP data? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +As a first step towards making a visual comparison of the +ACCESS-CM2 and ACCESS-ESM1-5 historical precipitation climatology, +we are going to create a quick plot of the ACCESS-CM2 data. + +```python +accesscm2_pr_file = "data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc" +``` + +We will need a number of the libraries introduced in the previous lesson. + +```python +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +``` + +Since geographic data files can often be very large, +when we first open our data file in xarray it simply loads the metadata +associated with the file (this is known as "lazy loading"). +We can then view summary information about the contents of the file +before deciding whether we'd like to load some or all of the data into memory. + +```python +ds = xr.open_dataset(accesscm2_pr_file) +print(ds) +``` + +```output + +Dimensions: (bnds: 2, lat: 144, lon: 192, time: 60) +Coordinates: + * time (time) datetime64[ns] 2010-01-16T12:00:00 ... 2014-12-16T12:00:00 + * lon (lon) float64 0.9375 2.812 4.688 6.562 ... 355.3 357.2 359.1 + * lat (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38 +Dimensions without coordinates: bnds +Data variables: + time_bnds (time, bnds) datetime64[ns] ... + lon_bnds (lon, bnds) float64 ... + lat_bnds (lat, bnds) float64 ... + pr (time, lat, lon) float32 ... +Attributes: + CDI: Climate Data Interface version 1.9.8 (https://mpi... + source: ACCESS-CM2 (2019): \naerosol: UKCA-GLOMAP-mode\na... + institution: CSIRO (Commonwealth Scientific and Industrial Res... + Conventions: CF-1.7 CMIP-6.2 + activity_id: CMIP + branch_method: standard + branch_time_in_child: 0.0 + branch_time_in_parent: 0.0 + creation_date: 2019-11-08T08:26:37Z + data_specs_version: 01.00.30 + experiment: all-forcing simulation of the recent past + experiment_id: historical + external_variables: areacella + forcing_index: 1 + frequency: mon + further_info_url: https://furtherinfo.es-doc.org/CMIP6.CSIRO-ARCCSS... + grid: native atmosphere N96 grid (144x192 latxlon) + grid_label: gn + initialization_index: 1 + institution_id: CSIRO-ARCCSS + mip_era: CMIP6 + nominal_resolution: 250 km + notes: Exp: CM2-historical; Local ID: bj594; Variable: p... + parent_activity_id: CMIP + parent_experiment_id: piControl + parent_mip_era: CMIP6 + parent_source_id: ACCESS-CM2 + parent_time_units: days since 0950-01-01 + parent_variant_label: r1i1p1f1 + physics_index: 1 + product: model-output + realization_index: 1 + realm: atmos + run_variant: forcing: GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2,... + source_id: ACCESS-CM2 + source_type: AOGCM + sub_experiment: none + sub_experiment_id: none + table_id: Amon + table_info: Creation Date:(30 April 2019) MD5:e14f55f257cceaf... + title: ACCESS-CM2 output prepared for CMIP6 + variable_id: pr + variant_label: r1i1p1f1 + version: v20191108 + cmor_version: 3.4.0 + tracking_id: hdl:21.14100/b4dd0f13-6073-4d10-b4e6-7d7a4401e37d + license: CMIP6 model data produced by CSIRO is licensed un... + CDO: Climate Data Operators version 1.9.8 (https://mpi... + history: Tue Jan 12 14:50:25 2021: ncatted -O -a history,p... + NCO: netCDF Operators version 4.9.2 (Homepage = http:/... +``` + +We can see that our `ds` object is an `xarray.Dataset`, +which when printed shows all the metadata associated with our netCDF data file. + +In this case, +we are interested in the precipitation variable contained within that xarray Dataset: + +```python +print(ds["pr"]) +``` + +```output + +[1658880 values with dtype=float32] +Coordinates: + * time (time) datetime64[ns] 2010-01-16T12:00:00 ... 2014-12-16T12:00:00 + * lon (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1 + * lat (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38 +Attributes: + standard_name: precipitation_flux + long_name: Precipitation + units: kg m-2 s-1 + comment: includes both liquid and solid phases + cell_methods: area: time: mean + cell_measures: area: areacella +``` + +We can actually use either the `ds["pr"]` or `ds.pr` syntax to access the precipitation +`xarray.DataArray`. + +To calculate the precipitation climatology, +we can make use of the fact that xarray DataArrays have built in functionality +for averaging over their dimensions. + +```python +clim = ds["pr"].mean("time", keep_attrs=True) +print(clim) +``` + +```output + +array([[1.8461452e-06, 1.9054805e-06, 1.9228980e-06, ..., 1.9869783e-06, + 2.0026005e-06, 1.9683730e-06], + [1.9064508e-06, 1.9021350e-06, 1.8931637e-06, ..., 1.9433096e-06, + 1.9182237e-06, 1.9072245e-06], + [2.1003202e-06, 2.0477617e-06, 2.0348527e-06, ..., 2.2391034e-06, + 2.1970161e-06, 2.1641599e-06], + ..., + [7.5109556e-06, 7.4777777e-06, 7.4689174e-06, ..., 7.3359679e-06, + 7.3987890e-06, 7.3978440e-06], + [7.1837171e-06, 7.1722038e-06, 7.1926393e-06, ..., 7.1552149e-06, + 7.1576678e-06, 7.1592167e-06], + [7.0353467e-06, 7.0403985e-06, 7.0326828e-06, ..., 7.0392648e-06, + 7.0387587e-06, 7.0304386e-06]], dtype=float32) +Coordinates: + * lon (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1 + * lat (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38 +Attributes: + standard_name: precipitation_flux + long_name: Precipitation + units: kg m-2 s-1 + comment: includes both liquid and solid phases + cell_methods: area: time: mean + cell_measures: area: areacella +``` + +Now that we've calculated the climatology, +we want to convert the units from kg m-2 s-1 +to something that we are a little more familiar with like mm day-1. + +To do this, consider that +1 kg of rain water spread over 1 m2 of surface is 1 mm in thickness +and that there are 86400 seconds in one day. +Therefore, 1 kg m-2 s-1 = 86400 mm day-1. + +The data associated with our xarray DataArray is simply a numpy array, + +```python +type(clim.data) +``` + +```output +numpy.ndarray +``` + +so we can go ahead and multiply that array by 86400 and update the units attribute accordingly: + +```python +clim.data = clim.data * 86400 +clim.attrs["units"] = "mm/day" + +print(clim) +``` + +```output + +array([[0.15950695, 0.16463352, 0.16613839, ..., 0.17167493, 0.17302468, + 0.17006743], + [0.16471735, 0.16434446, 0.16356934, ..., 0.16790195, 0.16573453, + 0.1647842 ], + [0.18146767, 0.17692661, 0.17581128, ..., 0.19345854, 0.18982219, + 0.18698342], + ..., + [0.64894656, 0.64607999, 0.64531446, ..., 0.63382763, 0.63925537, + 0.63917372], + [0.62067316, 0.61967841, 0.62144403, ..., 0.61821057, 0.6184225 , + 0.61855632], + [0.60785395, 0.60829043, 0.60762379, ..., 0.60819248, 0.60814875, + 0.6074299 ]]) +Coordinates: + * lon (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1 + * lat (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38 +Attributes: + standard_name: precipitation_flux + long_name: Precipitation + units: mm/day + comment: includes both liquid and solid phases + cell_methods: area: time: mean + cell_measures: area: areacella +``` + +We could now go ahead and plot our climatology using matplotlib, +but it would take many lines of code to extract all the latitude and longitude information +and to setup all the plot characteristics. +Recognising this burden, +the xarray developers have built on top of `matplotlib.pyplot` to make the visualisation +of xarray DataArrays much easier. + +```python +fig = plt.figure(figsize=[12,5]) + +ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + +clim.plot.contourf( + ax=ax, + levels=np.arange(0, 13.5, 1.5), + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units} +) +ax.coastlines() + +plt.show() +``` + +![](fig/02-visualisation-viridis.png){alt='Precipitation climatology'} + +The default colorbar used by matplotlib is `viridis`. +It used to be `jet`, +but that was changed a couple of years ago in response to the +[\#endtherainbow](https://www.climate-lab-book.ac.uk/2014/end-of-the-rainbow/) campaign. + +Putting all the code together +(and reversing viridis so that wet is purple and dry is yellow)... + +```python +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np + + +accesscm2_pr_file = "data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc" + +ds = xr.open_dataset(accesscm2_pr_file) + +clim = ds["pr"].mean("time", keep_attrs=True) + +clim.data = clim.data * 86400 +clim.attrs["units"] = "mm/day" + +fig = plt.figure(figsize=[12,5]) +ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) +clim.plot.contourf( + ax=ax, + levels=np.arange(0, 13.5, 1.5), + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap="viridis_r", +) +ax.coastlines() +plt.show() +``` + +![](fig/02-visualisation-viridis_r.png){alt='Precipitation climatology'} + + +::::::::::::::::::::::::::::::::::::::: challenge + +## Color palette + +Copy and paste the final slab of code above into your own Jupyter notebook. + +The viridis color palette doesn't seem quite right for rainfall. +Change it to the "haline" [cmocean](https://matplotlib.org/cmocean/) palette +used for ocean salinity data. + +::::::::::::::: solution + +```python +import cmocean + +... +clim.plot.contourf( + ax=ax, + ... + cmap=cmocean.cm.haline_r, +) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Season selection + +Rather than plot the annual climatology, +edit the code so that it plots the June-August (JJA) season. + +(Hint: the [groupby]() functionality can be used to +group all the data into seasons prior to averaging over the time axis) + +::::::::::::::: solution + +```python +clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) +clim.data = clim.data * 86400 +clim.attrs['units'] = "mm/day" + +clim.sel(season="JJA").plot.contourf( + ax=ax, + ... +) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Add a title + +Add a title to the plot which gives the name of the model +(taken from the `ds` attributes) +followed by the words "precipitation climatology (JJA)" + +::::::::::::::: solution + +```python +model = ds.attrs["source_id"] +title = f"{model} precipitation climatology (JJA)" +plt.title(title) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Libraries such as xarray can make loading, processing and visualising netCDF data much easier. +- The cmocean library contains colormaps custom made for the ocean sciences. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/03-functions.md b/03-functions.md new file mode 100644 index 0000000..f812014 --- /dev/null +++ b/03-functions.md @@ -0,0 +1,315 @@ +--- +title: Functions +teaching: 15 +exercises: 25 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define a function that takes parameters. +- Use docstrings to document functions. +- Break our existing plotting code into a series of small, single-purpose functions. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I define my own functions? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +In the previous lesson +we created a plot of the ACCESS-CM2 historical precipitation climatology +using the following commands: + +```python +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +import cmocean + + +accesscm2_pr_file = "data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc" + +ds = xr.open_dataset(accesscm2_pr_file) + +clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + +clim.data = clim.data * 86400 +clim.attrs['units'] = "mm/day" + +fig = plt.figure(figsize=[12,5]) +ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) +clim.sel(season="JJA").plot.contourf( + ax=ax, + levels=np.arange(0, 13.5, 1.5), + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r, +) +ax.coastlines() + +model = ds.attrs["source_id"] +title = f"{model} precipitation climatology (JJA)" +plt.title(title) + +plt.show() +``` + +![](fig/03-functions-accesscm2-jja.png){alt='Precipitation climatology'} + +If we wanted to create a similar plot for a different model and/or different month, +we could cut and paste the code and edit accordingly. +The problem with that (common) approach is that it increases the chances of a making a mistake. +If we manually updated the season to 'DJF' for the `clim.sel(season=` command +but forgot to update it when calling `plt.title`, for instance, +we'd have a mismatch between the data and title. + +The cut and paste approach is also much more time consuming. +If we think of a better way to create this plot in future +(e.g. we might want to add gridlines using `plt.gca().gridlines()`), +then we have to find and update every copy and pasted instance of the code. + +A better approach is to put the code in a function. +The code itself then remains untouched, +and we simply call the function with different input arguments. + +```python +def plot_pr_climatology(pr_file, season, gridlines=False): + """Plot the precipitation climatology. + + Args: + pr_file (str): Precipitation data file + season (str): Season (3 letter abbreviation, e.g. JJA) + gridlines (bool): Select whether to plot gridlines + + """ + + ds = xr.open_dataset(pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + + clim.data = clim.data * 86400 + clim.attrs["units"] = "mm/day" + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=np.arange(0, 13.5, 1.5), + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r, + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + model = ds.attrs["source_id"] + title = f"{model} precipitation climatology ({season})" + plt.title(title) +``` + +The docstring allows us to have good documentation for our function: + +```python +help(plot_pr_climatology) +``` + +```output +Help on function plot_pr_climatology in module __main__: + +plot_pr_climatology(pr_file, season, gridlines=False) + Plot the precipitation climatology. + + Args: + pr_file (str): Precipitation data file + season (str): Season (3 letter abbreviation, e.g. JJA) + gridlines (bool): Select whether to plot gridlines +``` + +We can now use this function to create exactly the same plot as before: + +```python +plot_pr_climatology("data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc", "JJA") +plt.show() +``` + +![](fig/03-functions-accesscm2-jja.png){alt='Precipitation climatology'} + +Plot a different model and season: + +```python +plot_pr_climatology("data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc", "DJF") +plt.show() +``` + +![](fig/03-functions-accessesm-djf.png){alt='Precipitation climatology'} + +Or use the optional `gridlines` input argument +to change the default behaviour of the function +(keyword arguments are usually used for options +that the user will only want to change occasionally): + +```python +plot_pr_climatology( + "data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc", + "DJF", + gridlines=True, +) +plt.show() +``` + +![](fig/03-functions-accessesm-djf-gridlines.png){alt='Precipitation climatology'} + +::::::::::::::::::::::::::::::::::::::: challenge + +## Short functions + +Our `plot_pr_climatology` function works, but at 16 lines of code it's starting to get a little long. +In general, people can only fit around 7-12 pieces of information in their short term memory. +The readability of your code can therefore be greatly enhanced +by keeping your functions short and sharp. +The speed at which people can analyse their data is usually limited +by the time it takes to read/understand/edit their code +(as opposed to the time it takes the code to actually run), +so the frequent use of short, +well documented functions can dramatically speed up your data science. + +1. Cut and paste the `plot_pr_climatology` function (defined in the notes above) into your own notebook and try running it with different input arguments. +2. Break the contents of `plot_pr_climatology` down into a series of smaller functions, such that it reads as follows: + +```python +def plot_pr_climatology(pr_file, season, gridlines=False): + """Plot the precipitation climatology. + + Args: + pr_file (str): Precipitation data file + season (str): Season (3 letter abbreviation, e.g. JJA) + gridlines (bool): Select whether to plot gridlines + + """ + + ds = xr.open_dataset(pr_file) + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + clim = convert_pr_units(clim) + create_plot(clim, ds.attrs["source_id"], season, gridlines=gridlines) + plt.show() +``` + +In other words, you'll need to define new +`convert_pr_units` and `create_plot` +functions using code from the existing `plot_pr_climatology` function. + +::::::::::::::: solution + +```python +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = da.data * 86400 + da.attrs['units'] = "mm/day" + + return da + + +def create_plot(clim, model, season, gridlines=False): + """Plot the precipitation climatology. + + Args: + clim (xarray.DataArray): Precipitation climatology data + model (str) : Name of the climate model + season (str): Season + + Kwargs: + gridlines (bool): Select whether to plot gridlines + + """ + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=np.arange(0, 13.5, 1.5), + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r, + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + title = f"{model} precipitation climatology ({season})" + plt.title(title) + + +def plot_pr_climatology(pr_file, season, gridlines=False): + """Plot the precipitation climatology. + + Args: + pr_file (str): Precipitation data file + season (str): Season (3 letter abbreviation, e.g. JJA) + gridlines (bool): Select whether to plot gridlines + + """ + + ds = xr.open_dataset(pr_file) + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + clim = convert_pr_units(clim) + create_plot(clim, ds.attrs["source_id"], season, gridlines=gridlines) + plt.show() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Writing your own modules + +We've used functions to avoid code duplication in this particular notebook/script, +but what if we wanted to convert precipitation units from kg m-2 s-1 to mm/day +in a different notebook/script? + +To avoid cutting and pasting from this notebook/script to another, +the solution would be to place the `convert_pr_units` function +in a separate script full of similar functions. + +For instance, +we could put all our unit conversion functions in a script called `unit_conversion.py`. +When we want to convert precipitation units (in any script or notebook we're working on), +we can simply import that "module" and use the `convert_pr_units` function: + +```python +import unit_conversion +clim.data = unit_conversion.convert_pr_units(clim.data) +``` + +No copy and paste required! + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Define a function using `def name(...params...)`. +- The body of a function must be indented. +- Call a function using `name(...values...)`. +- Use `help(thing)` to view help for something. +- Put docstrings in functions to provide help for that function. +- Specify default values for parameters when defining a function using `name=value` in the parameter list. +- The readability of your code can be greatly enhanced by using numerous short functions. +- Write (and import) modules to avoid code duplication. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/04-cmdline.md b/04-cmdline.md new file mode 100644 index 0000000..8c9931c --- /dev/null +++ b/04-cmdline.md @@ -0,0 +1,518 @@ +--- +title: Command line programs +teaching: 20 +exercises: 30 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Use the `argparse` library to manage command-line arguments in a program. +- Structure Python scripts according to a simple template. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write my own command line programs? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We've arrived at the point where we have successfully defined the functions +required to plot the precipitation data. + +We could continue to execute these functions from the Jupyter notebook, +but in most cases notebooks are simply used to try things out +and/or take notes on a new data analysis task. +Once you've scoped out the task (as we have for plotting the precipitation climatology), +that code can be transferred to a Python script so that it can be executed at the command line. +It's likely that your data processing workflows will include command line utilities +from the CDO and NCO projects in addition to Python code, +so the command line is the natural place to manage your workflows +(e.g. using shell scripts or make files). + +In general, the first thing that gets added to any Python script is the following: + +```python +if __name__ == "__main__": + main() +``` + +The reason we need these two lines of code +is that running a Python script in bash is very similar to importing that file in Python. +The biggest difference is that we don't expect anything to happen when we import a file, whereas when running a script we expect to see some output +(e.g. an output file, figure and/or some text printed to the screen). + +The `__name__` variable exists to handle these two situations. +When you import a Python file `__name__` is set to the name of that file +(e.g. when importing script.py, `__name__` is `script`), +but when running a script in bash `__name__` is always set to `__main__`. +The convention is to call the function that produces the output `main()`, +but you can call it whatever you like. + +The next thing you'll need is a library to parse the command line for input arguments. +The most widely used option is [argparse](https://docs.python.org/3/library/argparse.html). + +Putting those together, +here's a template for what most python command line programs look like: + +```bash +$ cat script_template.py +``` + +```python +import argparse + +# +# All your functions (that will be called by main()) go here. +# + +def main(inargs): + """Run the program.""" + + print("Input file: ", inargs.infile) + print("Output file: ", inargs.outfile) + + +if __name__ == "__main__": + + description = "Print the input arguments to the screen." + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("infile", type=str, help="Input file name") + parser.add_argument("outfile", type=str, help="Output file name") + + args = parser.parse_args() + main(args) +``` + +By running `script_template.py` at the command line +we'll see that `argparse` handles all the input arguments: + +```bash +$ python script_template.py in.nc out.nc +``` + +```output +Input file: in.nc +Output file: out.nc +``` + +It also generates help information for the user: + +```bash +$ python script_template.py -h +``` + +```output +usage: script_template.py [-h] infile outfile + +Print the input arguments to the screen. + +positional arguments: + infile Input file name + outfile Output file name + +optional arguments: + -h, --help show this help message and exit +``` + +and issues errors when users give the program invalid arguments: + +```bash +$ python script_template.py in.nc +``` + +```output +usage: script_template.py [-h] infile outfile +script_template.py: error: the following arguments are required: outfile +``` + +Using this template as a starting point, +we can add the functions we developed previously to a script called +`plot_precipitation_climatology.py`. + +```bash +$ cat plot_precipitation_climatology.py +``` + +```python +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +import cmocean +import argparse + + +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = darray.data * 86400 + da.attrs["units"] = "mm/day" + + return da + + +def create_plot(clim, model, season, gridlines=False): + """Plot the precipitation climatology. + + Args: + clim (xarray.DataArray): Precipitation climatology data + model (str): Name of the climate model + season (str): Season + + Kwargs: + gridlines (bool): Select whether to plot gridlines + + """ + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=np.arange(0, 13.5, 1.5), + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r, + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + title = f"{model} precipitation climatology ({season})" + plt.title(title) + + +def main(inargs): + """Run the program.""" + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + clim = convert_pr_units(clim) + + create_plot(clim, ds.attrs["source_id"], inargs.season) + plt.savefig( + inargs.output_file, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) + + +if __name__ == "__main__": + description='Plot the precipitation climatology.' + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") + parser.add_argument("season", type=str, help="Season to plot") + parser.add_argument("output_file", type=str, help="Output file name") + + args = parser.parse_args() + + main(args) +``` + +... and then run it at the command line: + +```bash +$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc MAM pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412-MAM-clim.png +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Choices + +For this series of challenges, +you are required to make improvements to the `plot_precipitation_climatology.py` script +that you downloaded earlier from the setup tab at the top of the page. + +For the first improvement, +edit the line of code that defines the season command line argument +(`parser.add_argument("season", type=str, help="Season to plot")`) +so that it only allows the user to input a valid three letter abbreviation +(i.e. `["DJF", "MAM", "JJA", "SON"]`). + +(Hint: Read about the `choices` keyword argument +at the [argparse tutorial](https://docs.python.org/3/howto/argparse.html).) + +::::::::::::::: solution + +```python +parser.add_argument( + "season", + type=str, + choices=["DJF", "MAM", "JJA", "SON"], + help="Season to plot", +) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Gridlines + +Add an optional command line argument that allows the user to add gridlines to the plot. + +(Hint: Read about the `action="store_true"` keyword argument +at the [argparse tutorial](https://docs.python.org/3/howto/argparse.html).) + +::::::::::::::: solution + +Make the following additions to `plot_precipitation_climatology.py` +(code omitted from this abbreviated version of the script is denoted `...`): + +```python +... + +def main(inargs): + + ... + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + ) + +... + +if __name__ == "__main__": + + ... + + parser.add_argument( + "--gridlines", + action="store_true", + default=False, + help="Include gridlines on the plot", + ) + +... + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Colorbar levels + +Add an optional command line argument that allows the user to specify the tick levels used in the colourbar + +(Hint: You'll need to use the `nargs='*'` keyword argument.) + +::::::::::::::: solution + +## Solution + +Make the following additions to `plot_precipitation_climatology.py` +(code omitted from this abbreviated version of the script is denoted `...`): + +```python +... + +def create_plot(clim, model_name, season, gridlines=False, levels=None): + """Plot the precipitation climatology. + ... + Kwargs: + gridlines (bool): Select whether to plot gridlines + levels (list): Tick marks on the colorbar + + """ + + if not levels: + levels = np.arange(0, 13.5, 1.5) + + ... + + clim.sel(season=season).plot.contourf( + ax=ax, + ..., + ) + +... + +def main(inargs): + + ... + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels, + ) + +... + +if __name__ == "__main__": + + ... + + parser.add_argument( + "--cbar_levels", + type=float, + nargs="*", + default=None, + help="list of levels / tick marks to appear on the colorbar", + ) + +... + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Free time + +Add any other options you'd like for customising the plot (e.g. title, axis labels, figure size). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## plot\_precipitation\_climatology.py + +At the conclusion of this lesson your `plot_precipitation_climatology.py` script +should look something like the following: + +::::::::::::::: solution + +```python +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +import cmocean +import argparse + + +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = darray.data * 86400 + da.attrs["units"] = "mm/day" + + return da + + +def create_plot(clim, model, season, gridlines=False, levels=None): + """Plot the precipitation climatology. + + Args: + clim (xarray.DataArray): Precipitation climatology data + model (str): Name of the climate model + season (str): Season + + Kwargs: + gridlines (bool): Select whether to plot gridlines + levels (list): Tick marks on the colorbar + + """ + + if not levels: + levels = np.arange(0, 13.5, 1.5) + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=levels, + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r, + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + title = f"{model} precipitation climatology ({season})" + plt.title(title) + + +def main(inargs): + """Run the program.""" + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + clim = convert_pr_units(clim) + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels, + ) + plt.savefig( + inargs.output_file, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) + + +if __name__ == "__main__": + description = "Plot the precipitation climatology." + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") + parser.add_argument("season", type=str, help="Season to plot") + parser.add_argument("output_file", type=str, help="Output file name") + + parser.add_argument( + "--gridlines", + action="store_true", + default=False, + help="Include gridlines on the plot", + ) + parser.add_argument( + "--cbar_levels", + type=float, + nargs="*", + default=None, + help="list of levels / tick marks to appear on the colorbar", + ) + + args = parser.parse_args() + main(args) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Libraries such as `argparse` can be used the efficiently handle command line arguments. +- Most Python scripts have a similar structure that can be used as a template. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/05-git.md b/05-git.md new file mode 100644 index 0000000..ef3ac96 --- /dev/null +++ b/05-git.md @@ -0,0 +1,794 @@ +--- +title: Version control +teaching: 35 +exercises: 0 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Configure `git` the first time it is used on a computer. +- Create a local Git repository. +- Go through the modify-add-commit cycle for one or more files. +- Explain what the HEAD of a repository is and how to use it. +- Identify and use Git commit numbers. +- Compare various versions of tracked files. +- Restore old versions of files. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I record the revision history of my code? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Follow along + +For this lesson participants follow along command by command, +rather than observing and then completing challenges afterwards. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +A version control system stores an authoritative copy of your code in a repository, +which you can't edit directly. + +Instead, you checkout a working copy of the code, +edit that code, +then commit changes back to the repository. + +In this way, +the system records a complete revision history (i.e. of every commit), +so that you can retrieve and compare previous versions at any time. + +This is useful from an individual viewpoint, +because you don't need to store multiple (but slightly different) copies of the same script. + +![](fig/05-git-file-mess.gif){alt='File mess'} + +It's also useful from a collaboration viewpoint +(including collaborating with yourself across different computers) +because the system keeps a record of who made what changes and when. + +## Setup + +When we use Git on a new computer for the first time, +we need to configure a few things. + +```bash +$ git config --global user.name "Your Name" +$ git config --global user.email "you@email.com" +``` + +This user name and email will be associated with your subsequent Git activity, +which means that any changes pushed to +[GitHub](https://github.com/), +[BitBucket](https://bitbucket.org/), +[GitLab](https://gitlab.com/) or +another Git host server later on in this lesson will include this information. + +You only need to run these configuration commands once - +git will remember then for next time. + +We then need to navigate to our `data-carpentry` directory +and tell Git to initialise that directory as a Git repository. + +```bash +$ cd ~/Desktop/data-carpentry +$ git init +``` + +If we use `ls` to show the directory's contents, +it appears that nothing has changed: + +```bash +$ ls -F +``` + +```output +data/ script_template.py +plot_precipitation_climatology.py +``` + +But if we add the `-a` flag to show everything, +we can see that Git has created a hidden directory within `data-carpentry` called `.git`: + +```bash +$ ls -F -a +``` + +```output +./ data/ +../ plot_precipitation_climatology.py +.git/ script_template.py +``` + +Git stores information about the project in this special sub-directory. +If we ever delete it, +we will lose the project's history. + +We can check that everything is set up correctly +by asking Git to tell us the status of our project: + +```bash +$ git status +``` + +```output +$ git status +On branch main + +Initial commit + +Untracked files: + (use "git add ..." to include in what will be committed) + + data/ + plot_precipitation_climatology.py + script_template.py + +nothing added to commit but untracked files present (use "git add" to track) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Branch naming + +If you're running an older version of Git, +you may see `On branch master` instead of `On branch main` +at the top of the `git status` output. +Since 2021, Git has followed a move in the developer community +to change the default branch name from "master" to "main" +for cultural sensitivity reasons, avoiding "master/slave" terminology. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Tracking changes + +The "untracked files" message means that there's a file/s in the directory +that Git isn't keeping track of. +We can tell Git to track a file using `git add`: + +```bash +$ git add plot_precipitation_climatology.py +``` + +and then check that the right thing happened: + +```bash +$ git status +``` + +```output +On branch main + +Initial commit + +Changes to be committed: + (use "git rm --cached ..." to unstage) + + new file: plot_precipitation_climatology.py + +Untracked files: + (use "git add ..." to include in what will be committed) + + data/ + script_template.py +``` + +Git now knows that it's supposed to keep track of `plot_precipitation_climatology.py`, +but it hasn't recorded these changes as a commit yet. +To get it to do that, +we need to run one more command: + +```bash +$ git commit -m "Initial commit of precip climatology script" +``` + +```output +[main (root-commit) 8e69d70] Initial commit of precip climatology script + 1 file changed, 75 insertions(+) + create mode 100644 plot_precipitation_climatology.py +``` + +When we run `git commit`, +Git takes everything we have told it to save by using `git add` +and stores a copy permanently inside the special `.git` directory. +This permanent copy is called a commit (or revision) +and its short identifier is `8e69d70` +(Your commit may have another identifier.) + +We use the `-m` flag (for "message") +to record a short, descriptive, and specific comment that will help us remember later on what we did and why. +If we just run `git commit` without the `-m` option, +Git will launch `nano` (or whatever other editor we configured as `core.editor`) +so that we can write a longer message. + +If we run `git status` now: + +```bash +$ git status +``` + +```output +On branch main +Untracked files: + (use "git add ..." to include in what will be committed) + + data/ + script_template.py + +nothing added to commit but untracked files present (use "git add" to track) +``` + +it tells us everything is up to date. +If we want to know what we've done recently, +we can ask Git to show us the project's history using `git log`: + +```bash +$ git log +``` + +```output +commit 8e69d7086cb7c44a48a096122e5324ad91b8a439 +Author: Damien Irving +Date: Wed Mar 3 15:46:48 2021 +1100 + + Initial commit of precip climatology script +``` + +`git log` lists all commits made to a repository in reverse chronological order. +The listing for each commit includes +the commit's full identifier +(which starts with the same characters as +the short identifier printed by the `git commit` command earlier), +the commit's author, +when it was created, +and the log message Git was given when the commit was created. + +Let's go ahead and open our favourite text editor and +make a small change to `plot_precipitation_climatology.py` +by editing the `description` variable +(which is used by argparse in the help information it displays at the command line). + +```python +description = "Plot the precipitation climatology for a given season." +``` + +When we run `git status` now, +it tells us that a file it already knows about has been modified: + +```bash +$ git status +``` + +```output +On branch main +Changes not staged for commit: + (use "git add ..." to update what will be committed) + (use "git restore ..." to discard changes in working directory) + + modified: plot_precipitation_climatology.py + +Untracked files: + (use "git add ..." to include in what will be committed) + + data/ + script_template.py + +no changes added to commit (use "git add" and/or "git commit -a") +``` + +The last line is the key phrase: +"no changes added to commit". +We have changed this file, +but we haven't told Git we will want to save those changes +(which we do with `git add`) +nor have we saved them (which we do with `git commit`). +So let's do that now. It is good practice to always review +our changes before saving them. We do this using `git diff`. +This shows us the differences between the current state +of the file and the most recently saved version: + +```bash +$ git diff +``` + +``` +$ git diff +diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py +index 58903f5..6c12b29 100644 +--- a/plot_precipitation_climatology.py ++++ b/plot_precipitation_climatology.py +@@ -62,7 +62,7 @@ def main(inargs): + + + if __name__ == '__main__': +- description = "Plot the precipitation climatology." ++ description = "Plot the precipitation climatology for a given season." + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") +``` + +The output is a little cryptic because +it's actually a series of commands for tools like editors and `patch` +telling them how to reconstruct one file given the other, +but the + and - markers clearly show what has been changed. + +After reviewing our change, it's time to commit it: + +```bash +$ git commit -m "Small improvement to help information" +``` + +```output +On branch main +Changes not staged for commit: + modified: plot_precipitation_climatology.py + +Untracked files: + data/ + script_template.py + +no changes added to commit +``` + +Whoops: +Git won't commit because we didn't use `git add` first. +Let's fix that: + +```bash +$ git add plot_precipitation_climatology.py +$ git commit -m "Small improvement to help information" +``` + +```output +[main 35f22b7] Small improvement to help information + 1 file changed, 1 insertion(+), 1 deletion(-) +``` + +Git insists that we add files to the set we want to commit +before actually committing anything. This allows us to commit our +changes in stages and capture changes in logical portions rather than +only large batches. +For example, +suppose we're writing our thesis using LaTeX +(the plain text `.tex` files can be tracked using Git) +and we add a few citations +to the introduction chapter. +We might want to commit those additions to our `introduction.tex` file +but *not* commit the work we're doing on the `conclusion.tex` file +(which we haven't finished yet). + +To allow for this, +Git has a special *staging area* +where it keeps track of things that have been added to +the current changeset but not yet committed. + +![](fig/05-git-staging-area.svg){alt='The Git Staging Area'} + +Let's do the whole edit-add-commit process one more time to +watch as our changes to a file move from our editor +to the staging area +and into long-term storage. +First, +we'll tweak the section of the script that imports all the libraries we need, +by putting them in the order suggested by the +[PEP 8 - Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/#imports). +The convention is to import packages from the [Python Standard Library](https://docs.python.org/3/library/) first, +then other external packages, +then your own modules (with a blank line between each grouping). + +```python +import argparse + +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +import cmocean +``` + +```bash +$ git diff +``` + +```output +diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py +index 6c12b29..c6beb12 100644 +--- a/plot_precipitation_climatology.py ++++ b/plot_precipitation_climatology.py +@@ -1,9 +1,10 @@ ++import argparse ++ + import xarray as xr + import cartopy.crs as ccrs + import matplotlib.pyplot as plt + import numpy as np + import cmocean +-import argparse + + + def convert_pr_units(da): +``` + +Let's save our changes: + +```bash +$ git add plot_precipitation_climatology.py +$ git commit -m "Ordered imports according to PEP 8" +``` + +```output +[main a6cea2c] Ordered imports according to PEP 8 + 1 file changed, 2 insertions(+), 1 deletion(-) +``` + +check our status: + +```bash +$ git status +``` + +```output +On branch main +Untracked files: + (use "git add ..." to include in what will be committed) + + data/ + script_template.py + +nothing added to commit but untracked files present (use "git add" to track) +``` + +and look at the history of what we've done so far: + +```bash +$ git log +``` + +```output +commit a6cea2cd4facde6adfdde3a08ff9413b45479623 (HEAD -> main) +Author: Damien Irving +Date: Wed Mar 3 16:01:45 2021 +1100 + + Ordered imports according to PEP 8 + +commit 35f22b74b11ed7993b23f9b4554b03ffc295e823 +Author: Damien Irving +Date: Wed Mar 3 15:55:18 2021 +1100 + + Small improvement to help information + +commit 8e69d7086cb7c44a48a096122e5324ad91b8a439 +Author: Damien Irving +Date: Wed Mar 3 15:46:48 2021 +1100 + + Initial commit of precip climatology script +``` + +## Exploring history + +As we saw earlier, we can refer to commits by their identifiers. +You can refer to the *most recent commit* of the working +directory by using the identifier `HEAD`. + +To demonstrate how to use `HEAD`, +let's make a trival change to `plot_precipitation_climatology.py` +by inserting a comment. + +```python +# A random comment +``` + +Now, let's see what we get. + +```bash +$ git diff HEAD plot_precipitation_climatology.py +``` + +```output +diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py +index c6beb12..c11707c 100644 +--- a/plot_precipitation_climatology.py ++++ b/plot_precipitation_climatology.py +@@ -6,6 +6,7 @@ import matplotlib.pyplot as plt + import numpy as np + import cmocean + ++# A random comment + + def convert_pr_units(darray): + """Convert kg m-2 s-1 to mm day-1. +``` + +which is the same as what you would get if you leave out `HEAD` (try it). +The real benefit of using the HEAD notation is the ease with which you can refer to previous commits. +We do that by adding `~1` to refer to the commit one before `HEAD`. + +```bash +$ git diff HEAD~1 plot_precipitation_climatology.py +``` + +```output +diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py +index 6c12b29..c11707c 100644 +--- a/plot_precipitation_climatology.py ++++ b/plot_precipitation_climatology.py +@@ -1,10 +1,12 @@ ++import argparse ++ + import xarray as xr + import cartopy.crs as ccrs + import matplotlib.pyplot as plt + import numpy as np + import cmocean +-import argparse + ++# A random comment + + def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. +``` + +If we want to see the differences between older commits we can use `git diff` +again, but with the notation `HEAD~2`, `HEAD~3`, and so on, to refer to them. + +We can also refer to commits using +those long strings of digits and letters +that `git log` displays. +These are unique IDs for the changes, +and "unique" really does mean unique: +every change to any set of files on any computer +has a unique 40-character identifier. +Our first commit (HEAD~2) was given the ID +`8e69d7086cb7c44a48a096122e5324ad91b8a439`, +but you only have to use the first seven characters +for git to know what you mean: + +```bash +$ git diff 8e69d70 plot_precipitation_climatology.py +``` + +```output +diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py +index 58903f5..c11707c 100644 +--- a/plot_precipitation_climatology.py ++++ b/plot_precipitation_climatology.py +@@ -1,10 +1,12 @@ ++import argparse ++ + import xarray as xr + import cartopy.crs as ccrs + import matplotlib.pyplot as plt + import numpy as np + import cmocean +-import argparse + ++# A random comment + + def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. +@@ -62,7 +64,7 @@ def main(inargs): + + + if __name__ == '__main__': +- description = "Plot the precipitation climatology." ++ description = "Plot the precipitation climatology for a given season." + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") +``` + +Now that we can save changes to files and see what we've changed —- how +can we restore older versions of things? +Let's suppose we accidentally overwrite our file: + +```bash +$ echo "whoops" > plot_precipitation_climatology.py +$ cat plot_precipitation_climatology.py +``` + +```output +whoops +``` + +`git status` now tells us that the file has been changed, +but those changes haven't been staged: + +```bash +$ git status +``` + +```output +On branch main +Changes not staged for commit: + (use "git add ..." to update what will be committed) + (use "git restore ..." to discard changes in working directory) + + modified: plot_precipitation_climatology.py + +Untracked files: + (use "git add ..." to include in what will be committed) + + data/ + script_template.py + +no changes added to commit (use "git add" and/or "git commit -a") +``` + +We can put things back the way they were at the time of our last commit +by using `git restore`: + +```bash +$ git restore plot_precipitation_climatology.py +$ cat plot_precipitation_climatology +``` + +```output +import argparse + +import xarray as xr +import cartopy.crs as ccrs +... +``` + +The random comment that we inserted has been lost +(that change hadn't been committed) +but everything else that was in our last commit is there. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Checking out with Git + +If you're running a different version of Git, +you may see a suggestion for `git checkout` instead of `git restore`. +As of Git version 2.29, +`git restore` is still an experimental command +and operates as a specialized form of `git checkout`. +`git checkout HEAD plot_precipitation_climatology` +is the equivalent command. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: challenge + +## plot\_precipitation\_climatology.py + +At the conclusion of this lesson your `plot_precipitation_climatology.py` script +should look something like the following: + +::::::::::::::: solution + +```python +import argparse + +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +import cmocean + + +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = da.data * 86400 + da.attrs["units"] = "mm/day" + + return darray + + +def create_plot(clim, model, season, gridlines=False, levels=None): + """Plot the precipitation climatology. + + Args: + clim (xarray.DataArray): Precipitation climatology data + model (str): Name of the climate model + season (str): Season + + Kwargs: + gridlines (bool): Select whether to plot gridlines + levels (list): Tick marks on the colorbar + + """ + + if not levels: + levels = np.arange(0, 13.5, 1.5) + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=levels, + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + title = f"{model} precipitation climatology ({season})" + plt.title(title) + + +def main(inargs): + """Run the program.""" + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time") + clim = convert_pr_units(clim) + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels + ) + plt.savefig( + inargs.output_file, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) + + +if __name__ == "__main__": + description = "Plot the precipitation climatology for a given season." + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") + parser.add_argument("season", type=str, help="Season to plot") + parser.add_argument("output_file", type=str, help="Output file name") + + parser.add_argument( + "--gridlines", + action="store_true", + default=False, + help="Include gridlines on the plot", + ) + parser.add_argument( + "--cbar_levels", + type=float, + nargs="*", + default=None, + help="list of levels / tick marks to appear on the colorbar", + ) + + args = parser.parse_args() + main(args) + +``` + +::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `git config` to configure a user name, email address, editor, and other preferences once per machine. +- `git init` initializes a repository. +- `git status` shows the status of a repository. +- Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded). +- `git add` puts files in the staging area. +- `git commit` saves the staged content as a new commit in the local repository. +- Always write a log message when committing changes. +- `git diff` displays differences between commits. +- `git restore` recovers old versions of files. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/06-github.md b/06-github.md new file mode 100644 index 0000000..8353550 --- /dev/null +++ b/06-github.md @@ -0,0 +1,375 @@ +--- +title: GitHub +teaching: 25 +exercises: 0 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Explain what remote repositories are and why they are useful. +- Push to or pull from a remote repository. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I make my code available on GitHub? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Follow along + +For this lesson participants follow along command by command, +rather than observing and then completing challenges afterwards. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Creating a remote repository + +Version control really comes into its own when we begin to collaborate with +other people (including ourselves for those who work on multiple computers). +We already have most of the machinery we need to do this; the +only thing missing is to copy changes from one repository to another. + +Systems like Git allow us to move work between any two repositories. In +practice, though, it's easiest to use one copy as a central hub, and to keep it +on the web rather than on someone's laptop. Most programmers use hosting +services like [GitHub](https://github.com), [BitBucket](https://bitbucket.org) or +[GitLab](https://gitlab.com/) to hold those master copies. + +Let's start by sharing the changes we've made to our current project with the +world. Log in to GitHub, then click on the icon in the top right corner to +create a new repository called `data-carpentry`: + +![](fig/06-github-create-repo-01.png){alt='Creating a Repository on GitHub (Step 1)'} + +Name your repository "data-carpentry" and then click "Create Repository": + +![](fig/06-github-create-repo-02.png){alt='Creating a Repository on GitHub (Step 2)'} + +As soon as the repository is created, GitHub displays a page with a URL and some +information on how to configure your local repository: + +![](fig/06-github-create-repo-03.png){alt='Creating a Repository on GitHub (Step 3)'} + +This effectively does the following on GitHub's servers: + +```bash +$ mkdir data-carpentry +$ cd data-carpentry +$ git init +``` + +Our local repository still contains our earlier work on `plot_precipitation_climatology.py`, +but the remote repository on GitHub doesn't contain any files yet. + +The next step is to connect the two repositories. We do this by making the +GitHub repository a "remote" for the local repository. +The home page of the repository on GitHub includes the string we need to +identify it: + +![](fig/06-github-find-repo-string.png){alt='Where to Find Repository URL on GitHub'} + +Click on the 'SSH' link to change the protocol from HTTPS to SSH if +SSH isn't already selected. + +::::::::::::::::::::::::::::::::::::::::: callout + +## HTTPS vs. SSH + +We use SSH here because, while it requires some additional configuration, +it is a security protocol widely used by many applications. +If you want to use HTTPS instead, +you'll need to create a [Personal Access Token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token). + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Copy that location from the browser, go into the local `data-carpentry` repository, +and run this command: + +```bash +$ git remote add origin git@github.com:DamienIrving/data-carpentry.git +``` + +Make sure to use the location for your repository rather than Damien's: the only +difference should be your username instead of `DamienIrving`. + +We can check that the command has worked by running `git remote -v`: + +```bash +$ git remote -v +``` + +```output +origin git@github.com/DamienIrving/data-carpentry.git (push) +origin git@github.com/DamienIrving/data-carpentry.git (fetch) +``` + +The name `origin` is a local nickname for your remote repository. We could use +something else if we wanted to, but `origin` is by far the most common choice. + +## Connecting to GitHub with SSH + +Before we can connect to our remote repository, +we need to set up a way for our computer to authenticate with GitHub +so it knows it's us trying to connect to our remote repository. + +We are going to set up the method that is commonly used by many different services +to authenticate access on the command line. +This method is called Secure Shell Protocol (SSH). +SSH is a cryptographic network protocol that allows secure communication between computers +using an otherwise insecure network. + +SSH uses what is called a key pair. +This is two keys that work together to validate access. +One key is publicly known and called the public key, +and the other key called the private key is kept private. +Very descriptive names. + +You can think of the public key as a padlock, +and only you have the key (the private key) to open it. +You use the public key where you want a secure method of communication, +such as your GitHub account. +You give this padlock, or public key, to GitHub and say +"lock the communications to my account with this so that only computers +that have my private key can unlock communications and send git commands as my GitHub account." + +What we will do now is the minimum required to set up the SSH keys +and add the public key to a GitHub account. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Keeping your keys secure + +You shouldn't really forget about your SSH keys, since they keep your account secure. +It's good practice to audit your secure shell keys every so often. +Especially if you are using multiple computers to access your account. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The first thing we are going to do is check if this has already been done on the computer we're on. +Because generally speaking, this setup only needs to happen once and then you can forget about it. +We can run the list command to check what key pairs already exist on our computer. + +```bash +ls -al ~/.ssh +``` + +(Your output is going to look a little different depending on whether or not SSH +has ever been set up on the computer you are using.) + +I have not set up SSH on my computer, so my output is + +```output +ls: cannot access '/home/damien/.ssh': No such file or directory +``` + +If SSH has been set up on the computer you're using, +the public and private key pairs will be listed. +The file names are either `id_ed25519`/`id_ed25519.pub` or `id_rsa`/`id_rsa.pub` +depending on how the key pairs were set up. + +If they don't exist on your computer, you can use this command to create them. + +```bash +$ ssh-keygen -t ed25519 -C "you@email.com" +``` + +The `-t` option specifies which type of algorithm to use +and `-C` attaches a comment to the key (here, your email). + +If you are using a legacy system that doesn't support the Ed25519 algorithm, use: +`$ ssh-keygen -t rsa -b 4096 -C "you@email.com"` + +```output +Generating public/private ed25519 key pair. +Enter file in which to save the key (/home/damien/.ssh/id_ed25519): +``` + +We want to use the default file, so just press Enter. + +```output +Created directory '/home/damien/.ssh'. +Enter passphrase (empty for no passphrase): +``` + +Now, it is prompting us for a passphrase. +If you're using a laptop that other people sometimes have access to, +you might want to create a passphrase. +Be sure to use something memorable or save your passphrase somewhere, +as there is no "reset my password" option. + +```output +Enter same passphrase again: +``` + +After entering the same passphrase a second time, we receive the confirmation. + +```output +Your identification has been saved in /home/damien/.ssh/id_ed25519 +Your public key has been saved in /home/damien/.ssh/id_ed25519.pub +The key fingerprint is: +SHA256:SMSPIStNyA00KPxuYu94KpZgRAYjgt9g4BA4kFy3g1o you@email.com +The key's randomart image is: ++--[ED25519 256]--+ +|^B== o. | +|%*=.*.+ | +|+=.E =.+ | +| .=.+.o.. | +|.... . S | +|.+ o | +|+ = | +|.o.o | +|oo+. | ++----[SHA256]-----+ +``` + +The "identification" is actually the private key. +You should never share it. +The public key is appropriately named. +The "key fingerprint" is a shorter version of a public key. + +Now that we have generated the SSH keys, +we will find the SSH files when we check. + +```bash +ls -al ~/.ssh +``` + +```output +drwxr-xr-x 1 Damien staff 197121 0 Jul 16 14:48 ./ +drwxr-xr-x 1 Damien staff 197121 0 Jul 16 14:48 ../ +-rw-r--r-- 1 Damien staff 197121 419 Jul 16 14:48 id_ed25519 +-rw-r--r-- 1 Damien staff 197121 106 Jul 16 14:48 id_ed25519.pub +``` + +Now we have a SSH key pair and we can run this command +to check if GitHub can read our authentication. + +```bash +ssh -T git@github.com +``` + +```output +The authenticity of host 'github.com (192.30.255.112)' can't be established. +RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8. +This key is not known by any other names +Are you sure you want to continue connecting (yes/no/[fingerprint])? y +Please type 'yes', 'no' or the fingerprint: yes +Warning: Permanently added 'github.com' (RSA) to the list of known hosts. +git@github.com: Permission denied (publickey). +``` + +Right, we forgot that we need to give GitHub our public key! + +First, we need to copy the public key. +Be sure to include the `.pub` at the end, +otherwise you're looking at the private key. + +```bash +cat ~/.ssh/id_ed25519.pub +``` + +```output +ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDmRA3d51X0uu9wXek559gfn6UFNF69yZjChyBIU2qKI you@email.com +``` + +Now, going to GitHub.com, +click on your profile icon in the top right corner to get the drop-down menu. +Click "Settings," then on the settings page, click "SSH and GPG keys," +on the left side "Account settings" menu. +Click the "New SSH key" button on the right side. +Now, you can add a title +(e.g. "work laptop" so you can remember where the original key pair files are located), +paste your SSH key into the field, and click the "Add SSH key" to complete the setup. + +![](fig/06-github-add-ssh-keys.png){alt='Adding SSH heys on GitHub'} + +Now that we've set that up, let's check our authentication again from the command line. + +```bash +$ ssh -T git@github.com +``` + +```output +Hi Damien! You've successfully authenticated, but GitHub does not provide shell access. +``` + +Good! This output confirms that the SSH key works as intended. +We are now ready to push our work to the remote repository. + +```bash +$ git push origin main +``` + +```output +Counting objects: 9, done. +Delta compression using up to 4 threads. +Compressing objects: 100% (6/6), done. +Writing objects: 100% (9/9), 821 bytes, done. +Total 9 (delta 2), reused 0 (delta 0) +To github.com:DamienIrving/data-carpentry.git + * [new branch] master -> master +Branch master set up to track remote branch master from origin. +``` + +We can pull changes from the remote repository to the local one as well: + +```bash +$ git pull origin main +``` + +```output +From github.com:DamienIrving/data-carpentry.git + * branch master -> FETCH_HEAD +Already up-to-date. +``` + +Pulling has no effect in this case because the two repositories are already synchronised. +If someone else had pushed some changes to the repository on GitHub, though, +this command would download them to our local repository. + +## Sharing code with yourself or others + +If we logged onto a different computer +(e.g. a supercomputing facility or our desktop computer at home) +we could access a copy of our code by "cloning" it. + +```bash +$ git clone git@github.com:DamienIrving/data-carpentry.git +``` + +Since our repository is public, +anyone (e.g. research collaborators) could clone the repository +by getting the location from the corresponding page on GitHub: + +![](fig/06-github-clone.png){alt='Cloning a repository on GitHub'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## Working with others + +Someone who clones your repository can't push changes directly to it +(unless you add them as a collaborator). +They could, however, "fork" your repository and submit suggested changes via a "pull request". +Collaborators and pull requests are beyond the scope of this lesson, +but you may come across them as you get more experienced with +using git and GitHub. + + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- A local Git repository can be connected to one or more remote repositories. +- You can use the SSH protocol to connect to remote repositories. +- `git push` copies changes from a local repository to a remote repository. +- `git pull` copies changes from a remote repository to a local repository. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/07-vectorisation.md b/07-vectorisation.md new file mode 100644 index 0000000..00df303 --- /dev/null +++ b/07-vectorisation.md @@ -0,0 +1,401 @@ +--- +title: Vectorisation +teaching: 15 +exercises: 15 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Use surface land fraction data to mask the land or ocean. +- Use the vectorised operations available in the `numpy` library to avoid looping over array elements. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I avoid looping over each element of large data arrays? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +A useful addition to our `plot_precipitation_climatology.py` script +would be the option to apply a land or ocean mask. +To do this, we need to use the land area fraction file. + +```python +import numpy as np +import xarray as xr + + +accesscm2_sftlf_file = "data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc" + +ds = xr.open_dataset(accesscm2_sftlf_file) +sftlf = ds["sftlf"] +print(sftlf) +``` + +```output + +array([[100., 100., 100., ..., 100., 100., 100.], + [100., 100., 100., ..., 100., 100., 100.], + [100., 100., 100., ..., 100., 100., 100.], + ..., + [ 0., 0., 0., ..., 0., 0., 0.], + [ 0., 0., 0., ..., 0., 0., 0.], + [ 0., 0., 0., ..., 0., 0., 0.]], dtype=float32) +Coordinates: + * lat (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38 + * lon (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1 +Attributes: + standard_name: land_area_fraction + long_name: Percentage of the grid cell occupied by land (including... + comment: Percentage of horizontal area occupied by land. + units: % + original_units: 1 + history: 2019-11-09T02:47:20Z altered by CMOR: Converted units fr... + cell_methods: area: mean + cell_measures: area: areacella +``` + +The data in a sftlf file assigns each grid cell a percentage value +between 0% (no land) to 100% (all land). + +```python +print(sftlf.data.max()) +print(sftlf.data.min()) +``` + +```output +100.0 +0.0 +``` + +To apply a mask to our plot, +the value of all the data points we'd like to mask needs to be set to `np.nan`. +The most obvious solution to applying a land mask, for example, +might therefore be to loop over each cell in our data array and decide whether +it is a land point (and thus needs to be set to `np.nan`). + +(For this example, we are going to define land as any grid point that is more than 50% land.) + +```python +ds = xr.open_dataset("data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc") +clim = ds['pr'].mean("time", keep_attrs=True) + +nlats, nlons = clim.data.shape +for y in range(nlats): + for x in range(nlons): + if sftlf.data[y, x] > 50: + clim.data[y, x] = np.nan +``` + +While this approach technically works, +the problem is that (a) the code is hard to read, +and (b) in contrast to low level languages like Fortran and C, +high level languages like Python and Matlab are built for usability +(i.e. they make it easy to write concise, readable code) as opposed to speed. +This particular array is so small that the looping isn't noticably slow, +but in general looping over every data point in an array should be avoided. + +Fortunately, there are lots of numpy functions +(which are written in C under the hood) +that allow you to get around this problem +by applying a particular operation to an entire array at once +(which is known as a vectorised operation). +The `np.where` function, for instance, +allows you to make a true/false decision at each data point in the array +and then perform a different action depending on the answer. + +The developers of xarray have built-in the `np.where` functionality, +so creating a new DataArray with the land masked becomes a one-line command: + +```python +clim_ocean = clim.where(sftlf.data < 50) +print(clim_ocean) +``` + +```output + +array([[ nan, nan, nan, ..., nan, + nan, nan], + [ nan, nan, nan, ..., nan, + nan, nan], + [ nan, nan, nan, ..., nan, + nan, nan], + ..., + [7.5109556e-06, 7.4777777e-06, 7.4689174e-06, ..., 7.3359679e-06, + 7.3987890e-06, 7.3978440e-06], + [7.1837171e-06, 7.1722038e-06, 7.1926393e-06, ..., 7.1552149e-06, + 7.1576678e-06, 7.1592167e-06], + [7.0353467e-06, 7.0403985e-06, 7.0326828e-06, ..., 7.0392648e-06, + 7.0387587e-06, 7.0304386e-06]], dtype=float32) +Coordinates: + * lon (lon) float64 0.9375 2.812 4.688 6.562 ... 353.4 355.3 357.2 359.1 + * lat (lat) float64 -89.38 -88.12 -86.88 -85.62 ... 86.88 88.12 89.38 +Attributes: + standard_name: precipitation_flux + long_name: Precipitation + units: kg m-2 s-1 + comment: includes both liquid and solid phases + cell_methods: area: time: mean + cell_measures: area: areacella +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Mask option + +Modify `plot_precipitation_climatology.py` so that the user can choose to apply a mask +via the following `argparse` option: + +```python +parser.add_argument( + "--mask", + type=str, + nargs=2, + metavar=("SFTLF_FILE", "REALM"), + default=None, + help="""Provide sftlf file and realm to mask ('land' or 'ocean')""", +) +``` + +This should involve defining a new function called `apply_mask()`, +in order to keep `main()` short and readable. + +Test to see if your mask worked by plotting the ACCESS-CM2 climatology for JJA: + +```bash +$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc JJA pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412-JJA-clim_land-mask.png --mask data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc ocean +``` + +![](fig/07-vectorisation-ocean-mask.png){alt='Ocean masked rainfall plot'} + +Commit the changes to git and then push to GitHub. + +::::::::::::::: solution + +Make the following additions to `plot_precipitation_climatology.py` +(code omitted from this abbreviated version of the script is denoted `...`): + +```python +def apply_mask(da, sftlf_file, realm): + """Mask ocean or land using a sftlf (land surface fraction) file. + + Args: + da (xarray.DataArray): Data to mask + sftlf_file (str): Land surface fraction file + realm (str): Realm to mask + + """ + + ds = xr.open_dataset(sftlf_file) + + if realm == 'land': + masked_da = da.where(ds['sftlf'].data < 50) + else: + masked_da = da.where(ds['sftlf'].data > 50) + + return masked_da + +... + +def main(inargs): + """Run the program.""" + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + clim = convert_pr_units(clim) + + if inargs.mask: + sftlf_file, realm = inargs.mask + clim = apply_mask(clim, sftlf_file, realm) + + ... + +if __name__ == "__main__": + + description = "Plot the precipitation climatology for a given season." + parser = argparse.ArgumentParser(description=description) + + ... + + parser.add_argument( + "--mask", + type=str, + nargs=2, + metavar=("SFTLF_FILE", "REALM"), + default=None, + help="""Provide sftlf file and realm to mask ('land' or 'ocean')""", + ) + + args = parser.parse_args() + main(args) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## plot\_precipitation\_climatology.py + +At the conclusion of this lesson your `plot_precipitation_climatology.py` script +should look something like the following: + +::::::::::::::: solution + +```python +import argparse + +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +import cmocean + + +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = da.data * 86400 + da.attrs["units"] = "mm/day" + + return da + + +def apply_mask(da, sftlf_file, realm): + """Mask ocean or land using a sftlf (land surface fraction) file. + + Args: + da (xarray.DataArray): Data to mask + sftlf_file (str): Land surface fraction file + realm (str): Realm to mask + + """ + + ds = xr.open_dataset(sftlf_file) + + if realm == 'land': + masked_da = da.where(ds['sftlf'].data < 50) + else: + masked_da = da.where(ds['sftlf'].data > 50) + + return masked_da + + +def create_plot(clim, model, season, gridlines=False, levels=None): + """Plot the precipitation climatology. + + Args: + clim (xarray.DataArray): Precipitation climatology data + model (str): Name of the climate model + season (str): Season + + Kwargs: + gridlines (bool): Select whether to plot gridlines + levels (list): Tick marks on the colorbar + + """ + + if not levels: + levels = np.arange(0, 13.5, 1.5) + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=levels, + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + title = f"{model} precipitation climatology ({season})" + plt.title(title) + + +def main(inargs): + """Run the program.""" + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + clim = convert_pr_units(clim) + + if inargs.mask: + sftlf_file, realm = inargs.mask + clim = apply_mask(clim, sftlf_file, realm) + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels + ) + plt.savefig( + inargs.output_file, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) + + +if __name__ == "__main__": + description = "Plot the precipitation climatology for a given season." + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") + parser.add_argument("season", type=str, help="Season to plot") + parser.add_argument("output_file", type=str, help="Output file name") + + parser.add_argument( + "--gridlines", + action="store_true", + default=False, + help="Include gridlines on the plot", + ) + parser.add_argument( + "--cbar_levels", + type=float, + nargs="*", + default=None, + help="list of levels / tick marks to appear on the colorbar", + ) + parser.add_argument( + "--mask", + type=str, + nargs=2, + metavar=("SFTLF_FILE", "REALM"), + default=None, + help="""Provide sftlf file and realm to mask ('land' or 'ocean')""", + ) + + args = parser.parse_args() + main(args) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- For large arrays, looping over each element can be slow in high-level languages like Python. +- Vectorised operations can be used to avoid looping over array elements. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/08-defensive.md b/08-defensive.md new file mode 100644 index 0000000..48ee505 --- /dev/null +++ b/08-defensive.md @@ -0,0 +1,801 @@ +--- +title: Defensive programming +teaching: 15 +exercises: 15 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Signal errors by raising exceptions. +- Use try-except blocks to catch and handle exceptions. +- Explain what an assertion is. +- Add assertions that check the program's state is correct. +- Use a logging framework to report on program activity. +- Identify sources of more advanced lessons on code testing. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I make my programs more reliable? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Scientist's nightmare + +If you needed any motivation to learn and employ the principles of defensive programming, +look no further than [this article](https://science.sciencemag.org/content/314/5807/1856). +It documents the experience of a researcher who had to retract five published papers - +three of which were in *Science* - because his code had inadvertently switched +the rows and columns of a data table. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Now that we've written `plot_precipitation_climatology.py`, +how can we be sure that it's producing reliable results? + +The first step toward getting the right answers from our programs +is to assume that mistakes *will* happen +and to guard against them. +This is called defensive programming, +and there are a number of tools and approaches at our disposal for doing this. +Broadly speaking, we can raise and handle errors to check and respond to program inputs, +use assertions to make sure nothing crazy or unexpected has happened, +write unit tests to make sure each component of our program produces expected outputs, +and use a logging framework to report on program activity. +In this lesson, we'll look at how error handling, assertions and logging +can make the unit conversion in our program more reliable, +and we'll provide links to further information on unit testing. + +## Types of errors + +There are essentially two kinds of errors that can arise in Python: +*syntax errors* and *exceptions*. +You're probably familiar with the former: + +```python +rainfall = 5 +if rainfall > 10 + print("heavy rainfall") +``` + +```error + if rainfall > 10 + ^ +SyntaxError: expected ':' +``` + +Once a statement or expression is syntactically correct, +it may cause an error when an attempt is made to execute it. +Errors detected during execution are called exceptions +(i.e. an exception from normal behaviour) +and there are lots of different types of exceptions: + +```python +10 * (1/0) +``` + +```error +ZeroDivisionError: division by zero +``` + +```python +4 + spam*3 +``` + +```error +NameError: name 'spam' is not defined +``` + +```python +"2" + 2 +``` + +```error +TypeError: can only concatenate str (not "int") to str +``` + +## Raising errors + +With respect to defensive programming, +it can sometimes be useful to raise your own exceptions (using the `raise` keyword). + +```python +infile = "temperature_data.txt" +file_format = infile.split(".")[-1] +if file_format != "nc": + raise ValueError(f"{infile} does not have the netCDF file extension .nc") +``` + +```error +ValueError Traceback (most recent call last) +/var/folders/6v/vrpsky6j509dff7250jyg8240000gp/T/ipykernel_12425/3612736507.py in + 2 file_format = infile.split(".")[-1] + 3 if file_format != "nc": +----> 4 raise ValueError(f"{infile} does not have the netCDF file extension .nc") + +ValueError: temperature_data.txt does not have the netCDF file extension .nc +``` + +In this case we've chosen to raise a `ValueError`, +but we could pick any of the [builtin exception types](https://docs.python.org/3/library/exceptions.html#exception-hierarchy) +(including just a generic `Exception`) +or define our own [custom exception class](https://towardsdatascience.com/how-to-define-custom-exception-classes-in-python-bfa346629bca). + +In the context of our `plot_precipitation_climatology.py` script, +we currently multiply our data by 86400 regardless of what the input units are. +It would be better if we modified the `main` function so that the program +multiplied by 86400 if the input units are kg m-2 s-1, +performed no unit conversion if the input units are mm/day, +or halted with an informative error message if the input data have some other units. + +```python +input_units = clim.attrs["units"] +if input_units == "kg m-2 s-1": + clim = convert_pr_units(clim) +elif input_units == "mm/day": + pass +else: + raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""") +``` + +## Handling errors + +As we've seen in the examples above, +if exceptions aren't dealt with the program crashes. +The error message upon crashing is sometimes be easy to understand +(particularly if you wrote the `raise` statement yourself) +but can often be cryptic. + +If we'd rather the program didn't crash when a particular exception occurs, +we can use a try-except block to catch and handle the exception. +The syntax of the try-except block is: + +```python +try: + +except Exception: + +``` + +The code in the except block is only executed if an exception occurred in the try block. +The except block is required with a try block, even if it contains only the `pass` statement +(i.e. ignore the exception and carry on). +For example, +let's say there's a calculation in our program where we need to divide +by the number of available weather stations. +If there were no weather stations available, +by default the program would crash. + +```python +quantity = 500 +n_stations = 0 + +scaled_quantity = quantity / n_stations +print(scaled_quantity) +``` + +```error +ZeroDivisionError Traceback (most recent call last) +/var/folders/6v/vrpsky6j509dff7250jyg8240000gp/T/ipykernel_12425/3927438267.py in + 2 n_stations = 0 + 3 +----> 4 scaled_quantity = quantity / n_stations + 5 print(scaled_quantity) + +ZeroDivisionError: division by zero +``` + +If we'd prefer the program simply continue with a `scaled_quantity` value of NaN, +we could catch and handle the `ZeroDivisionError`. + +```python +import numpy as np + + +quantity = 500 +n_stations = 0 + +try: + scaled_quantity = quantity / n_stations +except ZeroDivisionError: + scaled_quantity = np.nan + +print(scaled_quantity) +``` + +```output +nan +``` + +In the context of our `plot_precipitation_climatology.py` script, +the variable attributes from the input netCDF file are stored in a dictionary +that we access via `clim.attrs`. +The dictionary keys are the names of the variable attributes +(e.g. `standard_name`, `long_name`, `units`) +and the dictionary values are the values corresponding to those keys +(e.g. `precipitation_flux`, `Precipitation`, `kg m-2 s-1`). +If the input data file didn't have a units attribute associated with the precipitation variable, +our program would currently fail with a `KeyError` +(which Python raises when you ask for a key that isn't in a dictionary). + +```python +example_dict = { + "standard_name": "precipitation_flux", + "long_name": "Precipitation", +} +units = example_dict["units"] +``` + +```error +KeyError Traceback (most recent call last) +/var/folders/6v/vrpsky6j509dff7250jyg8240000gp/T/ipykernel_12425/2679443625.py in + 1 example_dict = { + 2 "standard_name": "precipitation_flux", + 3 "long_name": "Precipitation", + 4 } +----> 5 units = example_dict["units"] + +KeyError: 'units' +``` + +It's fine that our program would crash in this situation +(we can't continue if we don't know the units of the input data), +but the error message doesn't explicitly tell the user that the +input file requires a units attribute. +To make this crystal clear, +we could use a try-except block in the `main` function to catch the `KeyError` +and re-define a better error message. + +```python +try: + input_units = clim.attrs["units"] +except KeyError: + raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute") +``` + +## Assertions + +Unexpected behaviour in a program can sometimes propagate a long way +before triggering an exception or producing a perplexing result. +For instance, +if a calculation produces a non-physical value for precipitation (e.g. a negative value) +that value could be used in various downstream calculations of drought and fire risk +(combined with values for temperature, wind, humidity, etc). +The final plot of the forest fire danger index might look wrong +(or not, which would be even worse) +to the scientist who wrote and executed the code, +but it wouldn't be immediately obvious that the precipitation calculation +was the source of the problem. + +In order to avoid propagation, +it's best to nip unexpected behaviour in the bud right when it occurs. +One way to do this is to add assertions to your code. +An assertion is simply a statement that something must be true at a certain point in a program. +When Python sees one, +it evaluates the assertion's condition. +If it's true, +Python does nothing, +but if it's false, +Python halts the program immediately +and raises an `AssertionError` with a custom error message. + +To demonstrate an assertion in action, +consider this piece of code that halts if any precipitation data is negative: + +```python +import numpy as np + +pr_data = np.array([1.5, 2.3, 0.7, -0.2, 4.4]) +assert pr_data.min() >= 0.0, "There is at least one negative precipitation value" +``` + +```error +--------------------------------------------------------------------------- +AssertionError Traceback (most recent call last) + in () +----> 1 assert pr_data.min() >= 0.0, "There is at least one negative precipitation value" + +AssertionError: There is at least one negative precipitation value +``` + +With respect to our command line program, +one thing to check would be that the climatological precipitation values lie within +a sensible range after the unit conversion. +The [world record highest daily rainfall total](https://wmo.asu.edu/content/world-greatest-twenty-four-hour-1-day-rainfall) +is 1825mm (at Reunion Island in 1966), +so climatological values across the globe should be more than 0 but less than 2000 mm/day. +We could add the following assertions to our `convert_pr_units` function +to catch unexpected precipitation values. + +```python +assert darray.data.min() >= 0.0, "There is at least one negative precipitation value" +assert darray.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day" +``` + +Assertions are also used in unit testing +(each test culminates in an assertion), +but that topic is beyond the scope of this lesson. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Testing and continuous integration + +An assertion checks that something is true at a particular point in the program. +For programs that are more complex (or research critical) than `plot_precipitation_climatology.py`, +it's a good idea to take the next step and check the overall behavior of entire pieces (or units) of code. +Related concepts like unit testing and continuous integration are beyond the scope of this lesson, +but *Research Software Engineering With Python* has a [chapter on testing](https://merely-useful.github.io/py-rse/testing.html) +that is well worth a read. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Logging + +So far we've considered how to make our programs halt or handle the situation when things go wrong. +Another option in our defensive programming toolkit is to have our programs report their own activity. + +For example, +let's say we're working with relative humidity data. +We wouldn't typically expect to encounter any values over 100%, but it is physically possible. +Rather than halt the program if a value over 100% occurs, +we might therefore want our program to simply report the maximum relative humidity. +We can then decide whether to trust the output or not +(e.g. a value of 100.1% might be ok but not 150%). + +To do this reporting, +our first instinct might be to add a `print` statement to the program. + +```python +rh_data = np.array([1.5, 20.4, 100.1, 76.3, 54.4]) +rh_max = rh_data.max() +print(f"The maximum relative humidity was {rh_max}%") +``` + +```output +The maximum relative humidity was 100.1% +``` + +The problem with this approach is that information printed to the screen is lost +once we close our command line session. +Constantly adding, removing or commenting out `print` statements is also tedious and error-prone. + +A better approach is to use a logging framework, +such as Python's `logging` library. +This lets us leave debugging statements in our code and turn them on or off at will. +Let's start by replacing our `print` statement with a `logging` command. + +```python +import logging + +rh_data = np.array([1.5, 20.4, 100.1, 76.3, 54.4]) +rh_max = rh_data.max() +logging.debug(f"The maximum relative humidity was {rh_max}%") +``` + +Whoops! +There's no output because by default the logging library only reports information +at the "warning" level and above. +In order of increasing severity, the available levels are: + +- `debug`: very detailed information used for localizing errors. +- `info`: confirmation that things are working as expected. +- `warning`: something unexpected happened, but the program will keep going. +- `error`: something has gone badly wrong, but the program hasn't hurt anything. +- `critical`: potential loss of data, security breach, etc. + +If we want to see the output from less severe levels (i.e. turn our debugging statements on), +we'd need to change the minimum level in the logging configuration. +We can also provide the name of a file to write the logging information to, +so that it isn't lost when we finish our command line session. + +```python +# for loop only required in notebooks +for handler in logging.root.handlers[:]: + logging.root.removeHandler(handler) + +logging.basicConfig(level=logging.DEBUG, filename="log.txt")) + +rh_data = np.array([1.5, 20.4, 100.1, 76.3, 54.4]) +rh_max = rh_data.max() +logging.debug(f"The maximum relative humidity was {rh_max}%") +``` + +(The for loop is needed to turn off the background logging the notebook does itself. +It's not needed in a Python script.) + +Notice that we've used capital `logging.DEBUG` (which is an integer value) to set the logging level, +as opposed to the `logging.debug` function that is used for logging a message. + +By setting the logging level to "DEBUG", +our output "log.txt" file will now capture all logging information +with a flag of 'debug' or higher - that is, +all logging outputs will be written to our log file. + +```bash +$ cat log.txt +``` + +```output +DEBUG:root:The maximum relative humidity was 100.1% +``` + +In the context of the `plot_precipitation_climatology.py` script, +it would be nice to know whether or not unit conversion was performed. +To do this we just need a few small changes to the script. +We need to import the logging library at the top of the script, + +```python +import logging +``` + +and set the logging configuration and add a `logging.info` command in the `main` function. + +```python +def main(inargs): + """Run the program.""" + + logging.basicConfig(level=logging.DEBUG, filename="log.txt") + + ... + + if input_units == 'kg m-2 s-1': + clim = convert_pr_units(clim) + logging.info("Units converted from kg m-2 s-1 to mm/day") + elif input_units == "mm/day": + pass + else: + raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""") + + ... +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Update your script + +Update your working copy of `plot_precipitation_climatology.py` +with the changes from this lesson. + +This will mean your `main` function will now read as follows, + +```python +def main(inargs): + """Run the program.""" + + logging.basicConfig(level=logging.DEBUG, filename="log.txt") + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + + try: + input_units = clim.attrs["units"] + except KeyError: + raise KeyError(f"Precipitation variable in {inargs.pr_file} must have a units attribute") + + if input_units == "kg m-2 s-1": + clim = convert_pr_units(clim) + logging.info("Units converted from kg m-2 s-1 to mm/day") + elif input_units == "mm/day": + pass + else: + raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""") + + if inargs.mask: + sftlf_file, realm = inargs.mask + clim = apply_mask(clim, sftlf_file, realm) + + create_plot( + clim, + dset.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels + ) + plt.savefig( + inargs.output_file, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) +``` + +and your `convert_pr_units` as: + +```python +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = da.data * 86400 + da.attrs["units"] = "mm/day" + + assert da.data.min() >= 0.0, "There is at least one negative precipitation value" + assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day" + + return da +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Verbose reporting + +Add two new command line options to `plot_precipitation_climatology.py`. + +The first should change the logging level so reporting from the program is more verbose. + +```python +parser.add_argument( + "-v", + "--verbose", + action="store_true", + default=False, + help="Change the minimum logging reporting level from WARNING (default) to INFO", +) +``` + +The second should allow the user to specify the name of the log file. +(If they don't specify a name then the logging information is printed to the screen.) + +```python +parser.add_argument( + "--logfile", + type=str, + default=None, + help="Name of log file (by default logging information is printed to the screen)", +) +``` + +::::::::::::::: solution + +The basic configuration command at the top of the `main` function +(`logging.basicConfig(level=logging.DEBUG, filename="log.txt")`) +needs to be replaced with the following: + +```python +log_lev = logging.INFO if inargs.verbose else logging.WARNING +logging.basicConfig(level=log_lev, filename=inargs.logfile) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Error handling for land/ocean masks + +In the previous lesson we added an `apply_mask` function to our script. +Update the following if statement in that function +so that it raises a `ValueError` if the realm is not `ocean` or `land`. + +```python +if realm == "land": + masked_da = da.where(ds["sftlf"].data < 50) +else: + masked_da = da.where(ds["sftlf"].data > 50) +``` + +::::::::::::::: solution + +```python +if realm.lower() == 'land': + masked_da = da.where(ds["sftlf"].data < 50) +elif realm.lower() == 'ocean': + masked_da = da.where(ds["sftlf"].data > 50) +else: + raise ValueError("""Mask realm is not 'ocean' or 'land'""") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## plot\_precipitation\_climatology.py + +At the conclusion of this lesson your `plot_precipitation_climatology.py` script +should look something like the following: + +::::::::::::::: solution + +```python +import logging +import argparse + +import xarray as xr +import cartopy.crs as ccrs +import matplotlib.pyplot as plt +import numpy as np +import cmocean + + +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = da.data * 86400 + da.attrs["units"] = "mm/day" + + assert da.data.min() >= 0.0, "There is at least one negative precipitation value" + assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day" + + return da + + +def apply_mask(da, sftlf_file, realm): + """Mask ocean or land using a sftlf (land surface fraction) file. + + Args: + da (xarray.DataArray): Data to mask + sftlf_file (str): Land surface fraction file + realm (str): Realm to mask + + """ + + ds = xr.open_dataset(sftlf_file) + if realm.lower() == "land": + masked_da = da.where(ds["sftlf"].data < 50) + elif realm.lower() == 'ocean': + masked_da = da.where(ds["sftlf"].data > 50) + else: + raise ValueError("""Mask realm is not 'ocean' or 'land'""") + + return masked_da + + +def create_plot(clim, model, season, gridlines=False, levels=None): + """Plot the precipitation climatology. + + Args: + clim (xarray.DataArray): Precipitation climatology data + model (str): Name of the climate model + season (str): Season + + Kwargs: + gridlines (bool): Select whether to plot gridlines + levels (list): Tick marks on the colorbar + + """ + + if not levels: + levels = np.arange(0, 13.5, 1.5) + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=levels, + extend='max', + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r, + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + title = f"{model} precipitation climatology ({season})" + plt.title(title) + + +def main(inargs): + """Run the program.""" + + log_lev = logging.DEBUG if inargs.verbose else logging.WARNING + logging.basicConfig(level=log_lev, filename=inargs.logfile) + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds['pr'].groupby("time.season").mean("time", keep_attrs=True) + + try: + input_units = clim.attrs["units"] + except KeyError: + raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute") + + if input_units == "kg m-2 s-1": + clim = convert_pr_units(clim) + logging.info("Units converted from kg m-2 s-1 to mm/day") + elif input_units == "mm/day": + pass + else: + raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""") + + if inargs.mask: + sftlf_file, realm = inargs.mask + clim = apply_mask(clim, sftlf_file, realm) + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels + ) + plt.savefig( + inargs.output_file, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) + + +if __name__ == '__main__': + description='Plot the precipitation climatology for a given season.' + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") + parser.add_argument("season", type=str, help="Season to plot") + parser.add_argument("output_file", type=str, help="Output file name") + + parser.add_argument( + "--gridlines", + action="store_true", + default=False, + help="Include gridlines on the plot", + ) + parser.add_argument( + "--cbar_levels", + type=float, + nargs="*", + default=None, + help="list of levels / tick marks to appear on the colorbar", + ) + parser.add_argument( + "--mask", + type=str, + nargs=2, + metavar=("SFTLF_FILE", "REALM"), + default=None, + help="""Provide sftlf file and realm to mask ('land' or 'ocean')""", + ) + parser.add_argument( + "-v", + "--verbose", + action="store_true", + default=False, + help="Change the minimum logging reporting level from WARNING (default) to DEBUG", + ) + parser.add_argument( + "--logfile", + type=str, + default=None, + help="Name of log file (by default logging information is printed to the screen)", + ) + + args = parser.parse_args() + main(args) + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do. +- You can raise exceptions in your own code. +- Put try-except blocks in programs to catch and handle exceptions. +- Put assertions in programs to check their state as they run. +- Use a logging framework instead of `print` statements to report program activity. +- The are more advanced lessons you can read to learn about code testing. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/09-provenance.md b/09-provenance.md new file mode 100644 index 0000000..304fbbb --- /dev/null +++ b/09-provenance.md @@ -0,0 +1,609 @@ +--- +title: Data provenance +teaching: 10 +exercises: 20 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Automate the process of recording the history of what was entered at the command line to produce a given data file or image. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can keep track of my data processing steps? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We've now successfully created a command line program - `plot_precipitation_climatology.py` - +that calculates and plots the precipitation climatology for a given season. +The last step is to capture the provenance of that plot. +In other words, we need a log of all the data processing steps +that were taken from the intial download of the data file to the end result +(i.e. the PNG image). + +The simplest way to do this is to follow the lead of the +[NCO](https://nco.sourceforge.net/) +and [CDO](https://code.mpimet.mpg.de/projects/cdo) command line tools, +which insert a record of what was executed at the command line +into the history attribute of the output netCDF file. +If fact, +they were both used in the pre-processing of the data files +used in these lessons: + +```python +import xarray as xr + + +accessesm15_pr_file = "data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc" +ds = xr.open_dataset(accessesm15_pr_file) + +print(ds.attrs['history']) +``` + +```output +Tue Jan 12 14:50:35 2021: ncatted -O -a history,pr,d,, pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc +Tue Jan 12 14:48:10 2021: cdo seldate,2010-01-01,2014-12-31 /g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/pr/gn/latest/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc +2019-11-15T04:32:57Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards. +``` + +Fortunately, there is a Python package called [cmdline-provenance](https://cmdline-provenance.readthedocs.io/en/latest/) +that creates NCO/CDO-style records of what was executed at the command line. +We can use it to add to the command log, +before inserting the updated log into the metadata associated with our output image file. + +To start, let's import the `cmdline_provenance` library with the other imports at the top of our +`plot_precipitation_climatology.py` script, + +```python +import cmdline_provenance as cmdprov +``` + +and then make the following updates to the original line of code +responsible for saving the image to file: + +```python +new_log = cmdprov.new_log(infile_logs={inargs.pr_file: dset.attrs["history"]}) +plt.savefig( + inargs.output_file, + metadata={"History": new_log}, + dpi=200, + bbox_inches="tight", + facecolor="white", +) +``` + +When we execute `plot_precipitation_climatology.py`, +`cmdprov.new_log` will create a record of what was entered at the command line. +The name of the input precipitation file and its history attribute +have been provided using the `infile_logs` argument, +so that `cmdprov.new_log` can append the history of that file to the new command log. +The `metadata` argument for `plt.savefig` +has then been used to save the new command log to the image metadata. + +To see this in action, +we can run the program, + +```bash +$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc SON pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png +``` + +and then view the metadata of the new PNG image file. +There are a number of different programs available to do this, +but they can often be tricky to install. +Fortunately, +`conda` is used to install programs written in many different languages, +not just Python. +There are [installation recipes](https://anaconda.org/conda-forge/exiftool) +for a command line program called [`exiftool`](https://exiftool.org/) on anaconda.org, +so let's go ahead and install that: + +```bash +$ conda install exiftool +``` + +Once installed, +we can use it to view the metadata associated with our image file: + +```bash +$ exiftool pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png +``` + +```output +ExifTool Version Number : 12.17 +File Name : pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png +Directory : . +File Size : 315 KiB +File Modification Date/Time : 2021:02:08 10:16:56+11:00 +File Access Date/Time : 2021:02:08 10:16:58+11:00 +File Inode Change Date/Time : 2021:02:08 10:16:56+11:00 +File Permissions : rw-r--r-- +File Type : PNG +File Type Extension : png +MIME Type : image/png +Image Width : 2400 +Image Height : 1000 +Bit Depth : 8 +Color Type : RGB with Alpha +Compression : Deflate/Inflate +Filter : Adaptive +Interlace : Noninterlaced +Software : Matplotlib version3.3.3, https://matplotlib.org/ +History : Mon Feb 08 09:45:17 2021: /Users/damien/opt/anaconda3/envs/pyaos-lesson/bin/python code/plot_precipitation_climatology.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc SON pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png.Tue Jan 12 14:50:35 2021: ncatted -O -a history,pr,d,, pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc.Tue Jan 12 14:48:10 2021: cdo seldate,2010-01-01,2014-12-31 /g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/pr/gn/latest/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc.2019-11-15T04:32:57Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards. . +Pixels Per Unit X : 7874 +Pixels Per Unit Y : 7874 +Pixel Units : meters +Image Size : 2400x1000 +Megapixels : 2.4 +``` + +Now that we've successfully added a command log to our PNG image, +we might want to think about how our script should handle other image formats (e.g. PDF, EPS)? +For PNG files you can pick whatever metadata keys you like (hence we picked "History"), +but other formats only allow specific keys. +For now we can add raise an error so that the program halts +if someone tries to generate a format that isn't PNG, +and in the exercises we'll add more valid formats to the script. + +```python +image_format = inargs.output_file.split(".")[-1]) +if image_format != "png": + raise ValueError("Output file must have the PNG image file extension .png") +``` + +Putting this altogether, +here's what the `main` function in `plot_precipitation_climatology.py` looks like: + +```python +def main(inargs): + """Run the program.""" + + log_lev = logging.DEBUG if inargs.verbose else logging.WARNING + logging.basicConfig(level=log_lev, filename=inargs.logfile) + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + + try: + input_units = clim.attrs["units"] + except KeyError: + raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute") + + if input_units == "kg m-2 s-1": + clim = convert_pr_units(clim) + logging.info("Units converted from kg m-2 s-1 to mm/day") + elif input_units == "mm/day": + pass + else: + raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""") + + if inargs.mask: + sftlf_file, realm = inargs.mask + clim = apply_mask(clim, sftlf_file, realm) + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels, + ) + + image_format = inargs.output_file.split(".")[-1] + if image_format != "png": + raise ValueError("Output file must have the PNG image file extension .png") + new_log = cmdprov.new_log(infile_logs={inargs.pr_file: ds.attrs["history"]}) + plt.savefig( + inargs.output_file, + metadata={"History": new_log}, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Handling different image formats + +The [`plt.savefig` documentation](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html) +provides information on the metadata keys accepted by +PNG, PDF, EPS and PS image formats. + +Using that information as a guide, +add a new function called `get_log_and_key` +to the `plot_precipitation_climatology.py` script. +It should take the precipitation file name, history attribute, +and plot type as arguments and return the updated command log and +the appropriate metadata key for PNG, PDF, EPS or PS image formats. + +When you're done, +the final lines of the `main` function should read as follows: + +```python +log_key, new_log = get_log_and_key( + inargs.pr_file, + ds.attrs["history"], + inargs.output_file.split(".")[-1], +) +plt.savefig( + inargs.output_file, + metadata={log_key: new_log}, + dpi=200, + bbox_inches="tight", + facecolor="white", +) +``` + +::::::::::::::: solution + +The new function could read as follows: + +```python +def get_log_and_key(pr_file, history_attr, plot_type): + """Get key and command line log for image metadata. + + Different image formats allow different metadata keys. + + Args: + pr_file (str): Input precipitation file + history_attr (str): History attribute from pr_file + plot_type (str): File format for output image + + """ + + valid_keys = {"png": "History", + "pdf": "Title", + "eps": "Creator", + "ps" : "Creator", + } + if plot_type in valid_keys.keys(): + log_key = valid_keys[plot_type] + else: + raise ValueError(f"Image format not one of: {*[*valid_keys],}") + new_log = cmdprov.new_log(infile_logs={pr_file: history_attr}) + + return log_key, new_log +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Writing the command log to netCDF files + +Instead of calculating the seasonal climatology and creating a plot all in the one script, +we could have decided to make it a two step process. +The first script in the process could take the original precipitation data file and +output a new netCDF file containing the seasonal climatology, +and the second script could take the seasonal climatology file +and create a plot from that. + +If the first script reads as follows, +how would you edit it so that an updated command log +is included in the history attribute of the output netCDF file? + +```python +import argparse + +import numpy as np +import xarray as xr + + +def convert_pr_units(darray): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + """ + + da.data = da.data * 86400 + da.attrs["units"] = "mm/day" + + assert da.data.min() >= 0.0, "There is at least one negative precipitation value" + assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day" + + return da + + +def main(inargs): + """Run the program.""" + + log_lev = logging.DEBUG if inargs.verbose else logging.WARNING + logging.basicConfig(level=log_lev, filename=inargs.logfile) + + in_dset = xr.open_dataset(inargs.pr_file) + clim = in_ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + + try: + input_units = clim.attrs["units"] + except KeyError: + raise KeyError(f"Precipitation variable in {inargs.pr_file} does not have a units attribute") + + if input_units == "kg m-2 s-1": + clim = convert_pr_units(clim) + logging.info("Units converted from kg m-2 s-1 to mm/day") + elif input_units == "mm/day": + pass + else: + raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""") + + out_ds = clim.to_dataset() + out_ds.attrs = in_ds.attrs + out_ds.to_netcdf(inargs.output_file) + + +if __name__ == "__main__": + description = "Calculate the seasonal precipitation climatology." + parser = argparse.ArgumentParser(description=description) + parser.add_argument("pr_file", type=str, help="Precipitation data file") + parser.add_argument("output_file", type=str, help="Output file name") + parser.add_argument( + "-v", + "--verbose", + action="store_true", + default=False, + help="Change the minimum logging reporting level from WARNING (default) to DEBUG", + ) + parser.add_argument( + "--logfile", + type=str, + default=None, + help="Name of log file (by default logging information is printed to the screen)", + ) + args = parser.parse_args() + main(args) +``` + +::::::::::::::: solution + +The beginning of the script would need to be updated to import +the `cmdline_provenance` library: + +```python +import cmdline_provenance as cmdprov +``` + +The body of the `main` function would then need to be updated +to write the new command log to the history attribute of the output dataset: + +```python +out_ds.attrs = in_ds.attrs +new_log = cmdprov.new_log(infile_logs={inargs.pr_file: in_ds.attrs["history"]}) +out_ds.attrs["history"] = new_log +out_ds.to_netcdf(inargs.output_file) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## plot\_precipitation\_climatology.py + +At the conclusion of this lesson your `plot_precipitation_climatology.py` script +should look something like the following: + +::::::::::::::: solution + +```python +import logging +import argparse + +import numpy as np +import matplotlib.pyplot as plt +import xarray as xr +import cartopy.crs as ccrs +import cmocean +import cmdline_provenance as cmdprov + + +def convert_pr_units(da): + """Convert kg m-2 s-1 to mm day-1. + + Args: + da (xarray.DataArray): Precipitation data + + """ + + da.data = da.data * 86400 + da.attrs["units"] = "mm/day" + + assert da.data.min() >= 0.0, "There is at least one negative precipitation value" + assert da.data.max() < 2000, "There is a precipitation value/s > 2000 mm/day" + + return da + + +def apply_mask(darray, sftlf_file, realm): + """Mask ocean or land using a sftlf (land surface fraction) file. + + Args: + da (xarray.DataArray): Data to mask + sftlf_file (str): Land surface fraction file + realm (str): Realm to mask + + """ + + ds = xr.open_dataset(sftlf_file) + if realm.lower() == 'land': + masked_da = da.where(ds["sftlf"].data < 50) + elif realm.lower() == 'ocean': + masked_da = da.where(ds["sftlf"].data > 50) + else: + raise ValueError("""Mask realm is not 'ocean' or 'land'""") + + return masked_da + + +def create_plot(clim, model, season, gridlines=False, levels=None): + """Plot the precipitation climatology. + + Args: + clim (xarray.DataArray): Precipitation climatology data + model (str): Name of the climate model + season (str): Season + + Kwargs: + gridlines (bool): Select whether to plot gridlines + levels (list): Tick marks on the colorbar + + """ + + if not levels: + levels = np.arange(0, 13.5, 1.5) + + fig = plt.figure(figsize=[12,5]) + ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) + clim.sel(season=season).plot.contourf( + ax=ax, + levels=levels, + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": clim.units}, + cmap=cmocean.cm.haline_r + ) + ax.coastlines() + if gridlines: + plt.gca().gridlines() + + title = f"{model} precipitation climatology ({season})" + plt.title(title) + + +def get_log_and_key(pr_file, history_attr, plot_type): + """Get key and command line log for image metadata. + + Different image formats allow different metadata keys. + + Args: + pr_file (str): Input precipitation file + history_attr (str): History attribute from pr_file + plot_type (str): File format for output image + + """ + + valid_keys = {"png": "History", + "pdf": "Title", + "svg": "Title", + "eps": "Creator", + "ps" : "Creator", + } + + assert plot_type in valid_keys.keys(), f"Image format not one of: {*[*valid_keys],}" + log_key = valid_keys[plot_type] + new_log = cmdprov.new_log(infile_logs={pr_file: history_attr}) + + return log_key, new_log + + +def main(inargs): + """Run the program.""" + + log_lev = logging.INFO if inargs.verbose else logging.WARNING + logging.basicConfig(level=log_lev, filename=inargs.logfile) + + ds = xr.open_dataset(inargs.pr_file) + + clim = ds["pr"].groupby("time.season").mean("time", keep_attrs=True) + + try: + input_units = clim.attrs["units"] + except KeyError: + raise KeyError("Precipitation variable in {inargs.pr_file} must have a units attribute") + + if input_units == "kg m-2 s-1": + clim = convert_pr_units(clim) + logging.info("Units converted from kg m-2 s-1 to mm/day") + elif input_units == "mm/day": + pass + else: + raise ValueError("""Input units are not 'kg m-2 s-1' or 'mm/day'""") + + if inargs.mask: + sftlf_file, realm = inargs.mask + clim = apply_mask(clim, sftlf_file, realm) + + create_plot( + clim, + ds.attrs["source_id"], + inargs.season, + gridlines=inargs.gridlines, + levels=inargs.cbar_levels, + ) + + log_key, new_log = get_log_and_key( + inargs.pr_file, + ds.attrs["history"], + inargs.output_file.split('.')[-1], + ) + plt.savefig( + inargs.output_file, + metadata={log_key: new_log}, + dpi=200, + bbox_inches="tight", + facecolor="white", + ) + + +if __name__ == "__main__": + description = "Plot the precipitation climatology for a given season." + parser = argparse.ArgumentParser(description=description) + + parser.add_argument("pr_file", type=str, help="Precipitation data file") + parser.add_argument("season", type=str, help="Season to plot") + parser.add_argument("output_file", type=str, help="Output file name") + + parser.add_argument( + "--gridlines", + action="store_true", + default=False, + help="Include gridlines on the plot", + ) + parser.add_argument( + "--cbar_levels", + type=float, + nargs="*", + default=None, + help="list of levels / tick marks to appear on the colorbar", + ) + parser.add_argument( + "--mask", + type=str, + nargs=2, + metavar=("SFTLF_FILE", "REALM"), + default=None, + help="""Provide sftlf file and realm to mask ('land' or 'ocean')""", + ) + parser.add_argument( + "-v", + "--verbose", + action="store_true", + default=False, + help="Change the minimum logging reporting level from WARNING (default) to DEBUG", + ) + parser.add_argument( + "--logfile", + type=str, + default=None, + help="Name of log file (by default logging information is printed to the screen)", + ) + + args = parser.parse_args() + main(args) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- It is possible (in only a few lines of code) to record the provenance of a data file or image. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/10-large-data.md b/10-large-data.md new file mode 100644 index 0000000..43263cf --- /dev/null +++ b/10-large-data.md @@ -0,0 +1,440 @@ +--- +title: Large data +teaching: 30 +exercises: 30 +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Inspect netCDF chunking. +- Import the dask library and start a client with parallel workers. +- Calculate and plot the maximum daily precipitation for a high resolution model. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How do I work with multiple CMIP files that won't fit in memory? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +So far we've been working with small, +individual data files that can be comfortably read into memory on a modern laptop. +What if we wanted to process a larger dataset +that consists of many files and/or much larger file sizes? +For instance, +let's say the next step in our global precipitation analysis +is to plot the daily maximum precipitation over the 1850-2014 period +for the high resolution CNRM-CM6-1-HR model. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Data download + +Instructors teaching this lesson can download the CNRM-CM6-1-HR +daily precipitation data from the Earth System Grid Federation (ESGF). +See the [instructor notes](https://carpentries-lab.github.io/python-aos-lesson/guide/index.html) for details. +Since it is a very large download (45 GB), +learners are not expected to download the data. +(None of the exercises at the end of the lesson require downloading the data.) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +At the Unix shell, +we can inspect the dataset and see that the daily maximum precipitation data +for the 1850-2014 period has been broken up into 25-year chunks and +spread across seven netCDF files, +most of which are 6.7 GB in size. + +```bash +$ ls -lh data/pr_day*.nc +``` + +```output +-rw-r----- 1 irving staff 6.7G 26 Jan 12:09 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18500101-18741231.nc +-rw-r----- 1 irving staff 6.7G 26 Jan 13:14 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18750101-18991231.nc +-rw-r----- 1 irving staff 6.7G 26 Jan 14:17 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19000101-19241231.nc +-rw-r----- 1 irving staff 6.7G 26 Jan 15:20 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19250101-19491231.nc +-rw-r----- 1 irving staff 6.7G 26 Jan 16:24 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19500101-19741231.nc +-rw-r----- 1 irving staff 6.7G 26 Jan 17:27 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19750101-19991231.nc +-rw-r----- 1 irving staff 4.0G 26 Jan 18:05 pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_20000101-20141231.nc +``` + +In order to work with these files, +we can use `xarray` to open a "multifile" dataset as though it were a single file. +Our first step is to open a new Jupyter notebook and +import a library with a rather unfortunate name. + +```python +import glob +``` + +The `glob` library contains a single function, also called `glob`, +that finds files whose names match a pattern. +We provide those patterns as strings: +the character `*` matches zero or more characters, +while `?` matches any one character, +just like at the Unix shell. + +```python +pr_files = glob.glob("data/pr_day*.nc") +pr_files.sort() +print(pr_files) +``` + +```output +['/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18500101-18741231.nc', + '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_18750101-18991231.nc', + '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19000101-19241231.nc', + '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19250101-19491231.nc', + '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19500101-19741231.nc', + '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_19750101-19991231.nc', + '/Users/irving/Desktop/data-carpentry/data/pr_day_CNRM-CM6-1-HR_historical_r1i1p1f2_gr_20000101-20141231.nc'] +``` + +Recall that when we first open data in `xarray` +it simply ("lazily") loads the metadata associated with the data +and shows summary information about the contents of the dataset +(i.e. it doesn't read the actual data into memory). +This may take a little time for a large multifile dataset. + +```python +import xarray as xr + +ds = xr.open_mfdataset(pr_files, chunks={"time": "500MB"}) + +print(ds) +``` + +```output + +Dimensions: (axis_nbounds: 2, lat: 360, lon: 720, time: 60265) +Coordinates: + * lat (lat) float64 -89.62 -89.12 -88.62 -88.13 ... 88.62 89.12 89.62 + * lon (lon) float64 0.0 0.5 1.0 1.5 2.0 ... 358.0 358.5 359.0 359.5 + * time (time) datetime64[ns] 1850-01-01T12:00:00 ... 2014-12-31T12:... +Dimensions without coordinates: axis_nbounds +Data variables: + time_bounds (time, axis_nbounds) datetime64[ns] dask.array + pr (time, lat, lon) float32 dask.array +Attributes: + Conventions: CF-1.7 CMIP-6.2 + creation_date: 2019-05-23T12:33:53Z + description: CMIP6 historical + title: CNRM-CM6-1-HR model output prepared for CMIP6 and... + activity_id: CMIP + contact: contact.cmip@meteo.fr + data_specs_version: 01.00.21 + dr2xml_version: 1.16 + experiment_id: historical + experiment: all-forcing simulation of the recent past + external_variables: areacella + forcing_index: 2 + frequency: day + further_info_url: https://furtherinfo.es-doc.org/CMIP6.CNRM-CERFACS... + grid: data regridded to a 359 gaussian grid (360x720 la... + grid_label: gr + nominal_resolution: 50 km + initialization_index: 1 + institution_id: CNRM-CERFACS + institution: CNRM (Centre National de Recherches Meteorologiqu... + license: CMIP6 model data produced by CNRM-CERFACS is lice... + mip_era: CMIP6 + parent_experiment_id: p i C o n t r o l + parent_mip_era: CMIP6 + parent_activity_id: C M I P + parent_source_id: CNRM-CM6-1-HR + parent_time_units: days since 1850-01-01 00:00:00 + parent_variant_label: r1i1p1f2 + branch_method: standard + branch_time_in_parent: 0.0 + branch_time_in_child: 0.0 + physics_index: 1 + product: model-output + realization_index: 1 + realm: atmos + references: http://www.umr-cnrm.fr/cmip6/references + source: CNRM-CM6-1-HR (2017): aerosol: prescribed monthl... + source_id: CNRM-CM6-1-HR + source_type: AOGCM + sub_experiment_id: none + sub_experiment: none + table_id: day + variable_id: pr + variant_label: r1i1p1f2 + EXPID: CNRM-CM6-1-HR_historical_r1i1p1f2 + CMIP6_CV_version: cv=6.2.3.0-7-g2019642 + dr2xml_md5sum: 45d4369d889ddfb8149d771d8625e9ec + xios_commit: 1442-shuffle + nemo_gelato_commit: 84a9e3f161dade7_8250e198106a168 + arpege_minor_version: 6.3.3 + history: none + tracking_id: hdl:21.14100/223fa794-73fe-4bb5-9209-8ff910f7dc40 +``` + +We can see that our `ds` object is an `xarray.Dataset`, +but notice now that each variable has type `dask.array` +with a `chunksize` attribute. +Dask will access the data chunk-by-chunk (rather than all at once), +which is fortunate because at 45GB the full size of our dataset +is much larger than the available RAM on our laptop (17GB in this example). +Dask can also distribute chunks across multiple cores if we ask it to +(i.e. parallel processing). + +So how big should our chunks be? +As a general rule they need to be small enough to fit comfortably in memory +(because multiple chunks can be in memory at once), +but large enough to avoid the time cost associated with asking Dask to +manage/schedule lots of little chunks. +The [Dask documentation](https://docs.dask.org/en/latest/array-chunks.html) +suggests that chunk sizes between 10MB-1GB are common, +so we've set the chunk size to 500MB in this example. +Since our netCDF files are chunked by time, +we've specified that the 500MB Dask chunks should also be along that axis. +Performance would suffer dramatically if our Dask chunks +weren't aligned with our netCDF chunks. + +We can have the Jupyter notebook display the size and shape of our chunks, +just to make sure they are indeed 500MB. + +```python +ds['pr'].data +``` + +```output + Array Chunk +Bytes 62.48 GB 499.74 MB +Shape (60265, 360, 720) (482, 360, 720) +Count 307 Tasks 150 Chunks +Type float32 numpy.ndarray +``` + +Now that we understand the chunking information contained in the metadata, +let's go ahead and calculate the daily maximum precipitation. + +```python +pr_max = ds["pr"].max("time", keep_attrs=True) +print(pr_max) +``` + +```output + +dask.array +Coordinates: + * lat (lat) float64 -89.62 -89.12 -88.62 -88.13 ... 88.62 89.12 89.62 + * lon (lon) float64 0.0 0.5 1.0 1.5 2.0 ... 357.5 358.0 358.5 359.0 359.5 +Attributes: + long_name: Precipitation + units: kg m-2 s-1 + online_operation: average + cell_methods: area: time: mean + interval_operation: 900 s + interval_write: 1 d + standard_name: precipitation_flux + description: at surface; includes both liquid and solid phases fr... + history: none + cell_measures: area: areacella +``` + +It seems like the calculation happened instataneously, +but it's actually just another "lazy" feature of `xarray`. +It's showing us what the output of the calculation would look like (i.e. a 360 by 720 array), +but `xarray` won't actually do the computation +until the data is needed (e.g. to create a plot or write to a netCDF file). + +To force `xarray` to do the computation +we can use `.compute()` with the `%%time` Jupyter notebook command +to record how long it takes: + +```python +%%time +pr_max_done = pr_max.compute() +``` + +```output +CPU times: user 4min 1s, sys: 54.2 s, total: 4min 55s +Wall time: 3min 44s +``` + +By processing the data chunk-by-chunk, +we've avoided the memory error we would have generated +had we tried to handle the whole dataset at once. +A completion time of 3 minutes and 44 seconds isn't too bad, +but that was only using one core. +We can try and speed things up by using a `dask` "client" +to run the calculation in parallel across multiple cores: + +```python +from dask.distributed import Client + +client = Client() + +client +``` + +```output +Client Cluster +Scheduler: tcp://127.0.0.1:59152 Workers: 4 +Dashboard: http://127.0.0.1:8787/status Cores: 4 + Memory: 17.18 GB +``` + +(Click on the dashboard link to watch what's happening on each core.) + +```python +%%time +pr_max_done = pr_max.compute() +``` + +```output +CPU times: user 10.2 s, sys: 1.12 s, total: 11.3 s +Wall time: 2min 33s +``` + +By distributing the calculation across all four cores +the processing time has dropped to 2 minutes and 33 seconds. +It's faster than, but not a quarter of, the original 3 minutes and 44 seconds +because there's a time cost associated with +setting up and coordinating jobs across all the cores. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Parallel isn't always best + +For smaller or very complex data processing tasks, +the time associated with setting up and coordinating jobs across multiple cores +can mean it takes longer to run in parallel than on just one core. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Now that we've computed the daily maximum precipitation, +we can go ahead and plot it: + +```python +import matplotlib.pyplot as plt +import cartopy.crs as ccrs +import numpy as np +import cmocean + + +pr_max_done.data = pr_max_done.data * 86400 +pr_max_done.attrs["units"] = "mm/day" + +fig = plt.figure(figsize=[12,5]) +ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) +pr_max_done.plot.contourf( + ax=ax, + levels=np.arange(0, 450, 50), + extend="max", + transform=ccrs.PlateCarree(), + cbar_kwargs={"label": pr_max.units}, + cmap=cmocean.cm.haline_r, +) +ax.coastlines() + +model = ds.attrs["source_id"] +title = f"Daily maximum precipitation, 1850-2014 ({model})" +plt.title(title) + +plt.show() +``` + +![](fig/10-pr-max.png){alt='Daily maximum precipitation'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## Writing your own Dask-aware functions + +So far leveraging Dask for our large data computations has been relatively simple, +because almost all of `xarray`'s built-in operations (like `max`) work on Dask arrays. + +If we wanted to chunk and parallelise code involving operations +that aren't built into `xarray` +(e.g. an interpolation routine from the SciPy library) +we'd first need to use the `apply_ufunc` or `map_blocks` function +to make those operations "Dask-aware". +The [xarray tutorial](https://xarray-contrib.github.io/xarray-tutorial/scipy-tutorial/06_xarray_and_dask.html#Automatic-parallelization-with-apply_ufunc-and-map_blocks) +from SciPy 2020 explains how to do this. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Alternatives to Dask + +While `dask` can be a good option for working with large data, +it's not always the best choice. +For instance, +in this lesson we've talked about the time cost associated with +setting up and coordinating jobs across multiple cores, +and the fact that it can take a bit of time and effort +to make a function Dask-aware. + +What other options/tactics do you have when working with a large, multi-file dataset? +What are the pros and cons of these options? + +::::::::::::::: solution + +## Solution + +Other options include writing a loop to process each netCDF file one at a time +or a series of sub-regions one at a time. + +By breaking the data processing down like this you might avoid the +need to make your function/s Dask-aware. +If the loop only involves tens (as opposed to hundreds) files/regions, +the loop might also not be too slow. +However, using Dask is neater and more scalable, +because the code you write looks essentially the same regardless +of whether you're running the code on a powerful supercomputer or a personal laptop. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Find out what you're working with + +Setup a Dask client in your Jupyter notebook. +How many cores do you have available and how much memory? + +::::::::::::::: solution + +## Solution + +Run the following two commands in your notebook to setup the client +and then type `client` to view the display of information: + +```python +from dask.distributed import Client +client = Client() +client +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Applying this to your own work + +In your own research, +are there any data processing tasks that could benefit +from chunking and/or parallelisation? +If so, +how would you implement it? +What size would your chunks be and along what axis? +Are all the operations involved already Dask-aware? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Libraries such as dask and xarray can make loading, processing and visualising netCDF data much easier. +- Dask can speed up processing through parallelism but care may be needed particularly with data chunking. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..f19b804 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,13 @@ +--- +title: "Contributor Code of Conduct" +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html diff --git a/LICENSE.md b/LICENSE.md new file mode 100644 index 0000000..7632871 --- /dev/null +++ b/LICENSE.md @@ -0,0 +1,79 @@ +--- +title: "Licenses" +--- + +## Instructional Material + +All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) +instructional material is made available under the [Creative Commons +Attribution license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the license +terms. + +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that your work + is derived from work that is Copyright (c) The Carpentries and, where + practical, linking to ), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do so in + any reasonable manner, but not in any way that suggests the licensor endorses + you or your use. + +- **No additional restrictions**---You may not apply legal terms or + technological measures that legally restrict others from doing anything the + license permits. With the understanding that: + +Notices: + +* You do not have to comply with the license for elements of the material in + the public domain or where your use is permitted by an applicable exception + or limitation. +* No warranties are given. The license may not give you all of the permissions + necessary for your intended use. For example, other rights such as publicity, + privacy, or moral rights may limit how you use the material. + +## Software + +Except where otherwise noted, the example programs and other software provided +by The Carpentries are made available under the [OSI][osi]-approved [MIT +license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal in +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. + +## Trademark + +"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library +Carpentry" and their respective logos are registered trademarks of [Community +Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[mit-license]: https://opensource.org/licenses/mit-license.html +[ci]: https://communityin.org/ +[osi]: https://opensource.org diff --git a/config.yaml b/config.yaml new file mode 100644 index 0000000..9d59026 --- /dev/null +++ b/config.yaml @@ -0,0 +1,86 @@ +#------------------------------------------------------------ +# Values for this lesson. +#------------------------------------------------------------ + +# Which carpentry is this (swc, dc, lc, or cp)? +# swc: Software Carpentry +# dc: Data Carpentry +# lc: Library Carpentry +# cp: Carpentries (to use for instructor training for instance) +# incubator: The Carpentries Incubator +carpentry: 'cp' + +# Overall title for pages. +title: 'Python for Atmosphere and Ocean Scientists' + +# Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: 2018-01-15 + +# Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' + +# Life cycle stage of the lesson +# possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' + +# License of the lesson materials (recommended CC-BY 4.0) +license: 'CC-BY 4.0' + +# Link to the source repository for this lesson +source: 'https://github.com/fishtree-attempt/python-aos-lesson/' + +# Default branch of your lesson +branch: 'main' + +# Who to contact if there are any issues +contact: 'irving.damien@gmail.com' + +# Navigation ------------------------------------------------ +# +# Use the following menu items to specify the order of +# individual pages in each dropdown section. Leave blank to +# include all pages in the folder. +# +# Example ------------- +# +# episodes: +# - introduction.md +# - first-steps.md +# +# learners: +# - setup.md +# +# instructors: +# - instructor-notes.md +# +# profiles: +# - one-learner.md +# - another-learner.md + +# Order of episodes in your lesson +episodes: +- 01-conda.md +- 02-visualisation.md +- 03-functions.md +- 04-cmdline.md +- 05-git.md +- 06-github.md +- 07-vectorisation.md +- 08-defensive.md +- 09-provenance.md +- 10-large-data.md + +# Information for Learners +learners: + +# Information for Instructors +instructors: + +# Learner Profiles +profiles: + +# Customisation --------------------------------------------- +# +# This space below is where custom yaml items (e.g. pinning +# sandpaper and varnish versions) should live + diff --git a/data/change_time_units.py b/data/change_time_units.py new file mode 100644 index 0000000..98f34ca --- /dev/null +++ b/data/change_time_units.py @@ -0,0 +1,40 @@ +import argparse +import pdb + +import iris +import cf_units +import cmdline_provenance as cmdprov + + +def main(inargs): + """Run the program.""" + + cube = iris.load_cube(inargs.infile, inargs.variable) + + time_axis = cube.coord('time') + new_units = cf_units.Unit(inargs.new_time_units, calendar=time_axis.units.calendar) + time_axis.convert_units(new_units) + + new_log = cmdprov.new_log(infile_history={inargs.infile: cube.attributes['history']}) + cube.attributes['history'] = new_log + + if inargs.infile == inargs.outfile: + cube.data # to realise lazy data to allow file overwrite + iris.save(cube, inargs.outfile) + + +if __name__ == '__main__': + + extra_info ="""example: $ python change_time_units.py pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc + precipitation_flux "days since 2001-01-01 00:00:00" test.nc""" + + description='Change the units of the time axis' + parser = argparse.ArgumentParser(description=description, epilog=extra_info) + + parser.add_argument("infile", type=str, help="Input file name") + parser.add_argument("variable", type=str, help="Variable to select from file") + parser.add_argument("new_time_units", type=str, help="New time axis units") + parser.add_argument("outfile", type=str, help="Output file name") + + args = parser.parse_args() + main(args) \ No newline at end of file diff --git a/data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc b/data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc new file mode 100644 index 0000000..afb8144 Binary files /dev/null and b/data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc differ diff --git a/data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc b/data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc new file mode 100644 index 0000000..611897f Binary files /dev/null and b/data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc differ diff --git a/data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc b/data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc new file mode 100644 index 0000000..fdc5bb8 Binary files /dev/null and b/data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc differ diff --git a/data/sftlf_fx_ACCESS-ESM1-5_historical_r1i1p1f1_gn.nc b/data/sftlf_fx_ACCESS-ESM1-5_historical_r1i1p1f1_gn.nc new file mode 100644 index 0000000..d9642dd Binary files /dev/null and b/data/sftlf_fx_ACCESS-ESM1-5_historical_r1i1p1f1_gn.nc differ diff --git a/fig/01-iris-search.png b/fig/01-iris-search.png new file mode 100644 index 0000000..889c234 Binary files /dev/null and b/fig/01-iris-search.png differ diff --git a/fig/01-launch-notebook.png b/fig/01-launch-notebook.png new file mode 100644 index 0000000..09acb47 Binary files /dev/null and b/fig/01-launch-notebook.png differ diff --git a/fig/01-navigator-conda-forge.png b/fig/01-navigator-conda-forge.png new file mode 100644 index 0000000..771aaba Binary files /dev/null and b/fig/01-navigator-conda-forge.png differ diff --git a/fig/01-navigator-xarray.png b/fig/01-navigator-xarray.png new file mode 100644 index 0000000..1c1ca93 Binary files /dev/null and b/fig/01-navigator-xarray.png differ diff --git a/fig/01-pyaos-stack.svg b/fig/01-pyaos-stack.svg new file mode 100644 index 0000000..f7431d5 --- /dev/null +++ b/fig/01-pyaos-stack.svg @@ -0,0 +1,4 @@ + + + +
numpy
(numerical arrays)
numpy...
scipy
(general science libraries)
scipy...
matplotlib
(static plots)
matplotlib...
cartopy
(geographic map projections)
cartopy...
xarray
(generic multi-dimensional labelled arrays)
 
OR 

iris
(climate-specific data analysis)
xarray...
dask
(parallel processing)
dask...
Viewer does not support full SVG 1.1
\ No newline at end of file diff --git a/fig/01-pyaos-stack.xml b/fig/01-pyaos-stack.xml new file mode 100644 index 0000000..1092987 --- /dev/null +++ b/fig/01-pyaos-stack.xml @@ -0,0 +1 @@ +5VjLctowFP0aZtJFMn4AZVsoTaal02mZPrLqCPtiK5ElV5Z59Ot7ZUuAYxOcBlgkbJDOlSVxzvXVER1/lKyuJUnjzyIE1vGccNXx33c8z+16ffzSyLpE+oNeCUSShmbQFpjSv2BAx6A5DSGrDFRCMEXTKhgIziFQFYxIKZbVYXPBqqumJIIaMA0Iq6M/aajiEh30nC1+AzSK7cquYyIzEtxHUuTcrNfx/HnxKcMJsXOZ8VlMQrHcgfxxxx9JIVTZSlYjYJpbS1v53Ic90c2+JXDV5gHfL59YEJaD3XKxMbW2ZCxjqmCakkD3lyh4xx/GKmHYc7FJGI04tmXJx3BOGRsJJiRiXHB8aJgpKe7BgsiJU3wwEpIshtDMVN+8+T0LkApWO5D5MdcgElByjUNMtGt4NXnXtf3lVsXNmHhHQc8KSEzmRJupt+xhwxDYTGa3gcs+05zMsBHpBs+TdG1RnG8T2IyUFrnAsSApJmWZ1GSdvamPqqmFRKmqQFX2jSQNKlkhA6QfEB9q2vX670wgoWGolxk2ZcTxxes7LcXrHkG73l7tQrrQcxsOEP2T67dzaGja9GtKWiAL6OOaI1YsUkWfuu5FBBxkkS24InAUx3MYnUkiKTSnTvPyLzahev0zJlT/cDHA0yBlQqFGLStCpgjyp1nBx15ZNegOzije28PiBUQq0bqW4yYibZUK9RKkCMmR4g69CxVcK/lydWt7BB9DN7fpCH5AbUmktXXeA5ZjkupxyaqQ62rOxDKIUeorwrnQr5/gv3ddD4N5YXoYTX9g1zHtm+NZGm/gHib0VJZm41O3/EGIVtl0Tc4VhrewcU6VTlhR9UvDV3jAlt1bG+K4ld2Y7tvgHSi1NtcCkiuBkJAqFpHghE2ESA/Rm4lcBlB5mRWREagKpH/KoxJIYCj5onoreBafVv72PqPIsMMuY1VYxNbliGtrqWtRjreqy5AmwDOq2dWmgcxw16hnzXc+zans3TlJdK3hsyz9bxfUPPeXb6ecvcb5o57qmTav5RaopFlL0QNG0XHAZZZCQOeF+iFRRK+Ouq8zul/no59Opmye+2zy3d7BWup2eycqprPo7zgdDycfJ+n3W3f69ZOYLC8bb+BHunng7fr+DBcPlIRgvWClqQkgyyiPXueFo3pU97zTeR/sbv8mKmI7/8X5438= \ No newline at end of file diff --git a/fig/02-visualisation-viridis.png b/fig/02-visualisation-viridis.png new file mode 100644 index 0000000..9e5b7a8 Binary files /dev/null and b/fig/02-visualisation-viridis.png differ diff --git a/fig/02-visualisation-viridis_r.png b/fig/02-visualisation-viridis_r.png new file mode 100644 index 0000000..2fa20cb Binary files /dev/null and b/fig/02-visualisation-viridis_r.png differ diff --git a/fig/03-functions-accesscm2-jja.png b/fig/03-functions-accesscm2-jja.png new file mode 100644 index 0000000..007b2e3 Binary files /dev/null and b/fig/03-functions-accesscm2-jja.png differ diff --git a/fig/03-functions-accessesm-djf-gridlines.png b/fig/03-functions-accessesm-djf-gridlines.png new file mode 100644 index 0000000..90370b0 Binary files /dev/null and b/fig/03-functions-accessesm-djf-gridlines.png differ diff --git a/fig/03-functions-accessesm-djf.png b/fig/03-functions-accessesm-djf.png new file mode 100644 index 0000000..dc15526 Binary files /dev/null and b/fig/03-functions-accessesm-djf.png differ diff --git a/fig/05-git-file-mess.gif b/fig/05-git-file-mess.gif new file mode 100644 index 0000000..067b183 Binary files /dev/null and b/fig/05-git-file-mess.gif differ diff --git a/fig/05-git-staging-area.svg b/fig/05-git-staging-area.svg new file mode 100644 index 0000000..c74c298 --- /dev/null +++ b/fig/05-git-staging-area.svg @@ -0,0 +1,93 @@ + + + + + + + + .git + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + git add + + + + + + git commit + + staging area + + repository + + diff --git a/fig/06-github-add-ssh-keys.png b/fig/06-github-add-ssh-keys.png new file mode 100644 index 0000000..7987cc2 Binary files /dev/null and b/fig/06-github-add-ssh-keys.png differ diff --git a/fig/06-github-clone.png b/fig/06-github-clone.png new file mode 100644 index 0000000..c1491d9 Binary files /dev/null and b/fig/06-github-clone.png differ diff --git a/fig/06-github-create-repo-01.png b/fig/06-github-create-repo-01.png new file mode 100644 index 0000000..2f76d1f Binary files /dev/null and b/fig/06-github-create-repo-01.png differ diff --git a/fig/06-github-create-repo-02.png b/fig/06-github-create-repo-02.png new file mode 100644 index 0000000..0ad9782 Binary files /dev/null and b/fig/06-github-create-repo-02.png differ diff --git a/fig/06-github-create-repo-03.png b/fig/06-github-create-repo-03.png new file mode 100644 index 0000000..b47f7df Binary files /dev/null and b/fig/06-github-create-repo-03.png differ diff --git a/fig/06-github-find-repo-string.png b/fig/06-github-find-repo-string.png new file mode 100644 index 0000000..dcc8ca9 Binary files /dev/null and b/fig/06-github-find-repo-string.png differ diff --git a/fig/07-vectorisation-ocean-mask.png b/fig/07-vectorisation-ocean-mask.png new file mode 100644 index 0000000..f51afa8 Binary files /dev/null and b/fig/07-vectorisation-ocean-mask.png differ diff --git a/fig/10-pr-max.png b/fig/10-pr-max.png new file mode 100644 index 0000000..5e4201d Binary files /dev/null and b/fig/10-pr-max.png differ diff --git a/fig/README.md b/fig/README.md new file mode 100644 index 0000000..c8ef643 --- /dev/null +++ b/fig/README.md @@ -0,0 +1,2 @@ +Open `01-pyaos-stack.xml` using the online [draw.io](https://www.draw.io/) editor +to make changes to the `01-pyaos-stack.png` figure. \ No newline at end of file diff --git a/helper_lesson_check.md b/helper_lesson_check.md new file mode 100644 index 0000000..9c0fdcf --- /dev/null +++ b/helper_lesson_check.md @@ -0,0 +1,33 @@ +## Helper lesson check + +Once you've installed the required software for the workshop (following the instructions given to participants)... + +1. Clone this repository: + +``` +$ git clone https://github.com/carpentries-lab/python-aos-lesson.git +$ cd python-aos-lesson +``` + +2. Install the `jupyter`, `xarray` and `cmocean` libraries in a new conda environment called `pyaos-lesson`: + +``` +$ conda config --add channels conda-forge +$ conda create -n pyaos-lesson jupyter xarray cmocean +``` + +3. Activate the new environment by typing `source activate pyaos-lesson`. If that doesn't work, try `conda activate pyaos-lesson`. + +4. Install the `cmdline-provenance` library: + +``` +(pyaos-lesson) $ pip install cmdline-provenance +``` + +5. Run the final version of the script that participants will develop throughout the workshop: + +``` +(pyaos-lesson) $ python code/plot_precipitation_climatology_final.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc DJF test.png +``` + +It should produce an image file (`test.png`) as well as a log of command line entries (`test.txt`). diff --git a/index.md b/index.md new file mode 100644 index 0000000..00ec53e --- /dev/null +++ b/index.md @@ -0,0 +1,64 @@ +--- +site: sandpaper::sandpaper_site +--- + +Python is rapidly emerging as the programming language of choice for data analysis +in the atmosphere and ocean sciences. +By consulting online tutorials and help pages, +most researchers in this community are able to pick up the basic syntax and programming constructs +(e.g. loops, lists and conditionals). +This self-taught knowledge is sufficient to get work done, +but it often involves spending hours to do things that should take minutes, +reinventing a lot of wheels, +and a nagging uncertainty at the end of it all +regarding the reliability and reproducibility of the results. +To help address these issues, +these Data Carpentry lessons cover a suite of programming and data management best practices +that aren't so easy to glean from a quick Google search. + +The skills covered in the lessons are taught in the context of a typical data analysis task: +creating a command line program that plots the precipitation climatology for any given month, +so that two different CMIP6 models (ACCESS-CM2 and ACCESS-ESM1-5) can be compared visually. + +::::::::::::::::::::::::::::::::::::::::: callout + +## raster vs vector data + +These lessons work with raster or "gridded" data that are stored as a uniform grid of values using the netCDF file format. +This is the most common data format and file type in the atmosphere and ocean sciences; +essentially all output from weather, climate and ocean models is gridded data stored as a series of netCDF files. + +The other data type that atmosphere and ocean scientists tend to work with is geospatial vector data. +In contrast to gridded raster data, +these vector data are composed of discrete geometric locations (i.e. x, y values) +that define the shape of a spatial point, line or polygon. +They are not stored using the netCDF file format and are not covered in these lessons. +[Data Carpentry](https://www.datacarpentry.org/lessons/) have separate lessons on working with geospatial vector data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +Participants must have some familiarity with Python and the Unix shell. +They don't need to be highly proficient, +but a basic understanding of Python syntax, +beginner-level programming constructs (e.g. loops and conditionals) +and filesystem navigation using the shell +(e.g. the `ls` and `cd` commands) is required. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Citation + +To cite these lessons, please refer to the following paper: + +Irving D (2019). [Python for atmosphere and ocean scientists](https://jose.theoj.org/papers/10.21105/jose.00037). +*Journal of Open Source Education*. 2(11), 37. doi:10.21105/jose.00037 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + + diff --git a/instructor-notes.md b/instructor-notes.md new file mode 100644 index 0000000..a7ee435 --- /dev/null +++ b/instructor-notes.md @@ -0,0 +1,54 @@ +--- +title: Instructor Notes +--- + +## Teaching notes + +These materials are designed with the core [Carpentries teaching practices](https://carpentries.org/workshops/) in mind. +All of the episodes involve the instructor live coding +(in a Jupyter notebook, bash shell or text editor, depending on the lesson), +using these online notes as a guide +(it can be a good idea to print a copy of the notes to have next to you while live coding). +For most of the lessons, participants simply watch the live coding +and then get an opportunity to try the new concepts that are introduced by completing the challenges. +The only exception to this format are lessons on version control and GitHub, +where the participants are required to follow along with the instructor, +typing and executing every command as the instructor enters them on the screen. +This is done to reinforce the repetitive "add, commit, push" workflow in git. + +## Downloads + +At the beginning of the workshop, +participants are required to download a number of data files +(instructions at the [setup page](https://carpentries-lab.github.io/python-aos-lesson/setup.html)). +In the [first lesson](https://carpentries-lab.github.io/python-aos-lesson/01-conda/index.html), +they are then required to install some python libraries (`jupyter`, `xarray`, `cmocean`, etc). +Both these tasks can be problematic at venues with slow wifi, +so it is often a good idea to ask participants to download the data +and install the libraries prior to the workshop. + +The large data lesson involves the analysis of a 45GB dataset. +Instructors can download the relevant data files from the +Earth System Grid Federation (ESGF) following the +[CMIP6 Guidance for Data Users](https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html#3-accessing-model-output). +(Workshop participants do not need to download this data.) +Use the following search terms to locate the data at your nearest ESGF node: + +- Source ID: CNRM-CM6-1-HR +- Experiment ID: historical +- Variant Label: r1i1p1f2 +- Grid Label: gr +- Table ID: day +- Variable: pr + +## Software and code + +The [setup page](https://carpentries-lab.github.io/python-aos-lesson/setup.html) +gives details of the software installation instructions that can provided to participants. + +You can also send the +[helper lesson check](https://github.com/carpentries-lab/python-aos-lesson/blob/gh-pages/helper_lesson_check.md) +to helpers prior to the workshop, +so that they can test that all the software and code is working correctly. + + diff --git a/learner-profiles.md b/learner-profiles.md new file mode 100644 index 0000000..434e335 --- /dev/null +++ b/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. Please add content here. diff --git a/md5sum.txt b/md5sum.txt new file mode 100644 index 0000000..ee682d5 --- /dev/null +++ b/md5sum.txt @@ -0,0 +1,21 @@ +"file" "checksum" "built" "date" +"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2024-07-25" +"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2024-07-25" +"config.yaml" "0295d6b581fa4c3251f193b086810abc" "site/built/config.yaml" "2024-07-25" +"helper_lesson_check.md" "1d1b53176140a057b588969f9f60fa61" "site/built/helper_lesson_check.md" "2024-07-25" +"index.md" "0e62af27fdd16f8c6f2c9e1fef67e750" "site/built/index.md" "2024-07-25" +"paper.md" "32ace9442c642b5e25ce659e7ae9df72" "site/built/paper.md" "2024-07-25" +"episodes/01-conda.md" "a6686033d7b73c398bf6f559e29b0065" "site/built/01-conda.md" "2024-07-25" +"episodes/02-visualisation.md" "1e0da181aa29d53a614a7358d5379434" "site/built/02-visualisation.md" "2024-07-25" +"episodes/03-functions.md" "5f76d28ee961529d9fdf812a01cc5e81" "site/built/03-functions.md" "2024-07-25" +"episodes/04-cmdline.md" "ba7a163f8c93824a88a42045723108c6" "site/built/04-cmdline.md" "2024-07-25" +"episodes/05-git.md" "6b10244dccd7f145ba1ad6c915ac06b8" "site/built/05-git.md" "2024-07-25" +"episodes/06-github.md" "3ae16cc2084d7dca01805a73eabc556a" "site/built/06-github.md" "2024-07-25" +"episodes/07-vectorisation.md" "04d52cf7cd3a849c406c5c96cca4671c" "site/built/07-vectorisation.md" "2024-07-25" +"episodes/08-defensive.md" "4e709227290c2bffb21d737b55cb67f9" "site/built/08-defensive.md" "2024-07-25" +"episodes/09-provenance.md" "91a9a5b5e34b594ca49d77c319264eb8" "site/built/09-provenance.md" "2024-07-25" +"episodes/10-large-data.md" "909dba426cb9409b6124181662c31af4" "site/built/10-large-data.md" "2024-07-25" +"instructors/instructor-notes.md" "858a8aea24f020d8732ac71b2601f2e6" "site/built/instructor-notes.md" "2024-07-25" +"learners/reference.md" "4e0dcbc7892af6f9610d44d356e66617" "site/built/reference.md" "2024-07-25" +"learners/setup.md" "d86e41df032ef642f234e5014b28178a" "site/built/setup.md" "2024-07-25" +"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2024-07-25" diff --git a/paper.md b/paper.md new file mode 100644 index 0000000..2ab06e3 --- /dev/null +++ b/paper.md @@ -0,0 +1,34 @@ +--- +title: 'Python for Atmosphere and Ocean Scientists' +tags: +- python +- meteorology +- oceanography +authors: +- name: Damien B Irving + orcid: 0000-0003-1258-5002 + affiliation: 1 +affiliations: +- name: Commonwealth Scientific and Industrial Research Organisation (CSIRO) + index: 1 +date: 23 October 2018 +bibliography: paper.bib +--- + +# Summary + +Python is rapidly emerging as the programming language of choice for data analysis in the atmosphere and ocean sciences. By consulting online tutorials and help pages, most researchers in this community are able to pick up the basic syntax and programming constructs (e.g., loops, lists and conditionals). This self-taught knowledge is sufficient to get work done, but it often involves spending hours to do things that should take minutes, reinventing a lot of wheels, and a nagging uncertainty at the end of it all regarding the reliability and reproducibility of the results. To help address these issues, the [Python for Atmosphere and Ocean Scientists](https://carpentries-lab.github.io/python-aos-lesson/) educational materials cover a suite of programming and data management best practices that are not so easy to glean from a quick Google search. + +The materials contain everything required to run a one-day workshop. This includes data files, software installation instructions and lesson plans (complete with exercises and solutions), which double as teaching notes for instructors and a reference for learners to refer back to. For those unable to attend a workshop, it is also possible to work through the lessons independently. The skills covered in the lessons are presented in the context of a typical data analysis task: creating a command line program that plots the average rainfall for any given month, so that the output from two different global climate models can be compared visually. After giving an overview of the PyAOS stack (i.e., the ecosystem of libraries used in the atmosphere and ocean sciences) and the management of software environments using conda, the lessons introduce the basic Python commands required to create the plot. Those commands are then refactored to be more modular/reusable (using functions) before being transferred to a stand-alone script that can be executed from the command line. Changes to that script are then tracked using version control as further edits are made to implement common defensive programming strategies and to record the provenance of the input data files and output figures. Along the way, the basics of the Network Common Data Form (netCDF) file format and associated “climate and forecasting” metadata convention are introduced. The raster (or “gridded”) output from weather, climate and/or ocean models is almost universally archived using this format. + +## Statement of need + +The inspiration for the materials was the world-renowned Software Carpentry initiative [@Wilson2014]. In 2013, the author was involved in organising the first ever Software Carpentry workshops outside of Europe and North America, one of which was held alongside the annual conference of the Australian Meteorological and Oceanographic Society (AMOS) in Melbourne, Australia. He then trained up as a Software Carpentry instructor and taught workshops alongside the AMOS conference from 2014-2017, as well as other ad hoc workshops in various meteorology and oceanography departments. While these workshops were very popular and well received, there was clearly demand for a workshop designed specifically for atmosphere and ocean scientists. Instead of teaching generic skills in the hope that people would figure out how to apply them in their own context (i.e., in the context of netCDF files, the PyAOS stack, etc.), such a workshop would teach programming [@Wilson2014a], data management and reproducible research [@Irving2016] skills in the atmosphere and ocean science context. This idea of discipline (or data-type) specific workshops was the driving force behind the establishment of the Data Carpentry initiative, so it was with their assistance that these materials were developed. The first pilot workshops were held in 2018 at the AMOS Conference in Sydney and at Woods Hole Oceanographic Institution in Massachusetts. The materials have since been formally endorsed and released for use by the wider Data Carpentry instructor community. + +Carpentries workshops (i.e., Data Carpentry and Software Carpentry) are taught by volunteer instructors, trained in pedagogy [@Wilson2018], who focus on creating a motivating and engaging environment for learners. The one-day Python for Atmosphere and Ocean Scientists workshop is no different, and incorporates a number of teaching practices that make it effective and enjoyable. These include a code of conduct to ensure the workshop is welcoming, live coding by the instructor, peers teaching peers, continuous opportunity for feedback, collaborative note taking, a lesson design that avoids cognitive overload, regular challenges/quizzes and learners work in a familiar environment by using their own laptop [@Wilson2014]. Of course, the materials can also be adapted and used in other forums. The first few modules, for instance, are being used in a third year undergraduate subject at the University of Wollongong. + +## Future directions + +Going forward, it is hoped that these materials can be collaboratively updated and improved (at the associated [GitHub repository](https://github.com/carpentries-lab/python-aos-lesson)) by the atmosphere and ocean science community [@Devenyi2018]. Additional topics that could be covered include strategies and best practices for remapping data from one grid to another, processing very large data arrays (i.e., to reduce processing time and to avoid memory errors) and workflow management. These topics are especially relevant for the oceanography community, as ocean models tend to output large data arrays on complex grids. + +# References diff --git a/reference.md b/reference.md new file mode 100644 index 0000000..456f940 --- /dev/null +++ b/reference.md @@ -0,0 +1,9 @@ +--- +title: 'Glossary' +--- + +## Glossary + +FIXME + + diff --git a/setup.md b/setup.md new file mode 100644 index 0000000..2e8283b --- /dev/null +++ b/setup.md @@ -0,0 +1,138 @@ +--- +title: Setup +--- + +## Data + +In preparation for these lessons, +you will need to download the following two Python scripts and four netCDF files +and place them in a new folder/directory: + +1. Make a new folder in your Desktop called `data-carpentry`. +2. Download [script\_template.py][template-script] and [plot\_precipitation\_climatology.py][precip-script] and move them into that folder. +3. Make a new folder in your `data_carpentry` folder called `data`. + Download the following files and place them in that folder: + - [pr\_Amon\_ACCESS-CM2\_historical\_r1i1p1f1\_gn\_201001-201412.nc][pr-accesscm-file] + - [pr\_Amon\_ACCESS-ESM1-5\_historical\_r1i1p1f1\_gn\_201001-201412.nc][pr-accessesm-file] + - [sftlf\_fx\_ACCESS-CM2\_historical\_r1i1p1f1\_gn.nc][sftlf-accesscm-file] + - [sftlf\_fx\_ACCESS-ESM1-5\_historical\_r1i1p1f1\_gn.nc][sftlf-accessesm-file] + +## Software installation + +In order to complete the lessons, +you will need access to the following: + +- The bash shell (the Z-shell is also fine, which is default on new Macs) +- A text editor +- Git +- Anaconda (which is a Python distribution) + +If you don't already have these installed, +please follow The Carpentries [software installation instructions](https://carpentries.github.io/workshop-template/#setup). +(You do not need to install R, which is also listed at that site.) + +::::::::::::::::::::::::::::::::::::::::: callout + +## Troubeshooting + +If you have any trouble with software installation, +The Carpentries maintain a list of common issues on their +[Configuration Problems and Solutions wiki page](https://github.com/carpentries/workshop-template/wiki/Configuration-Problems-and-Solutions). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Your workshop instructor may also ask that you install the python packages introduced in the +[first lesson](https://carpentries-lab.github.io/python-aos-lesson/01-conda/index.html) +ahead of time. +You can do this via the command line or by using the Anaconda Navigator: + +### Installation of python packages: via the command line + +(Windows users may need to open the Anaconda Prompt program +and run `conda init bash` to make conda available at the Bash Shell.) + +**Step 1** + +Add the conda-forge channel: + +```bash +$ conda config --add channels conda-forge +``` + +**Option 1 for Step 2 (recommended if you're new to using Anaconda)** + +Install the packages in the base conda environment: + +```bash +$ conda install xarray dask netCDF4 cartopy cmocean cmdline_provenance +``` + +**Option 2 for Step 2** + +Create a new environment called `pyaos-lesson` and install the packages there: + +```bash +$ conda create -n pyaos-lesson jupyter xarray dask netCDF4 cartopy cmocean cmdline_provenance +``` + +You can activate this new environment as follows: + +```bash +$ conda activate pyaos-lesson +``` + +(Or `source activate pyaos-lesson` if that doesn't work.) + +Type `conda deactivate` to exit that environment. + + +### Installation of python packages: via the Anaconda Navigator + +Once you've opened the Anaconda Navigator program +(which can be found at the start menu on Windows), +head to the "Environments" tab. + +You can install the packages into the "base" environment +(recommended if you're new to Anaconda) +OR create a new environment called `pyaos-lesson` +by clicking the "create" button at the bottom of the environment +list before doing the following: + +**Step 1** + +Add the `conda-forge` channel. +![](fig/01-navigator-conda-forge.png){alt='Anaconda Navigator add conda-forge'} + +**Step 2** + +Install the `xarray`, `dask`, `netCDF4`, `cartopy`, `cmocean` and `cmdline_provenance` +packages one-by-one (click "apply" to install once selected). +If you've created a new environment, +you'll need to install `jupyter` too. +![](fig/01-navigator-xarray.png){alt='Anaconda Navigator install xarray'} + + +## Software check + +To check that everything is installed correctly, follow the instructions below. + +**Bash Shell** + +- *Linux*: Open the Terminal program via the applications menu. The default shell is usually Bash. If you aren't sure what yours is, type `echo $SHELL`. If the shell listed is not bash, type `bash` and press Enter to access Bash. +- *Mac*: Open the Applications Folder, and in Utilities select Terminal. +- *Windows*: Open the Git Bash program via the Windows start menu. + +**Git** + +- At the Bash Shell, type `git --version`. You should see the version of your Git program listed. + +**Anaconda** + +- At the Bash Shell, type `python --version`. You should see the version of your Python program listed, with a reference to Anaconda (i.e. the default Python program on your laptop needs to be the Anaconda installation of Python). + +[pr-accesscm-file]: https://carpentries-lab.github.io/python-aos-lesson/data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc +[pr-accessesm-file]: https://carpentries-lab.github.io/python-aos-lesson/data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc +[sftlf-accesscm-file]: https://carpentries-lab.github.io/python-aos-lesson/data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc +[sftlf-accessesm-file]: https://carpentries-lab.github.io/python-aos-lesson/data/sftlf_fx_ACCESS-ESM1-5_historical_r1i1p1f1_gn.nc +[template-script]: https://carpentries-lab.github.io/python-aos-lesson/code/script_template.py +[precip-script]: https://carpentries-lab.github.io/python-aos-lesson/code/plot_precipitation_climatology.py