authors : Julien Le Sommer, Aurélie Albert and Redouane Lguensat (MEOM group, IGE)
This page provides a curated list of software used for data analysis in the MEOM group, online resources on how to use it and general advice on how to proceed with ocean data analysis. Please, keep in mind that (i) this list is not exhaustive and that (ii) it may evolve with time.
A nice mental picture for understanding most of our data analysis tasks is the notion of a data analysis pipeline. Our data analyses generally combine several steps, each corresponding to an individual piece of software. Our data flows through the pipeline and gets transformed at each step by a particular piece of software. Ideally, these pipelines should be as automated as possible so that our work is easily reproducible.
A key principle for building data analysis pipelines is to rely as much as possible on pre-existing software. In practice, most of the steps in an analysis pipeline are very generic (e.g. reading, writing or plotting data), so we can just use pre-existing code. A large fraction of our work therefore simply involves gluing together existing pieces of code. This is why modern software is now made as modular as possible.
If you need to write new code, it should focus in priority on what is specific to your analysis. Some building blocks of your data analysis pipeline are indeed more specific to your needs than others. For these key specific steps, you may have to write a dedicated module, but you should always first check whether someone has already implemented something close to what you need.
Knowing what is currently feasible with your software environment is therefore key to designing your own analyses. This requires keeping up a routine curiosity about software. In practice, it means spending time going through online videos and tutorials (because you learn better through examples). It also means being aware of minimal good practices in scientific computing.
On this page, you will find a curated list of software, tutorials and examples that we hope will improve your awareness of your technological environment. Do not hesitate to suggest new links and updates to this page.
To work with the MEOM group, you will need this minimal software configuration and set-up:
- install anaconda (python distribution and package manager) and learn how to use it
- learn how to manage conda environments
- install git (software version control manager), create a github account and learn how to use it :
- [cloning](https://help.github.com/articles/importing-a-git-repository-using-the-command-line/) a repository
- using gist
- using pull requests (advanced)
A small (and old) presentation about basic git : https://drive.google.com/file/d/13CBG1wGUQJpawkAM0zHzLJQivtwx_t3w/view?usp=sharing
Caution : you may need to configure git to work through our network proxy.
Resources on how to better interact with your computer (basic).
- unix operating system: software carpentry tutorial
- automation and make : software carpentry tutorial, SCons
- version control with git: software carpentry tutorial, becoming a git guru
Resources on how to build and distribute software (advanced):
- packaging and distributing python projects : user guide, setuptools (a minimal setup.py sketch is given after this list)
- testing and continuous integration : software carpentry tutorial, travis-ci
- documenting your projects : A guide to python documentation with numpydoc, readthedocs, write the docs community
- a blog post of templates for python command line scripts
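As a complement to the packaging links above, here is a minimal setup.py sketch based on setuptools; the package name, metadata and dependencies are placeholders to adapt to your own project.

```python
# setup.py -- minimal packaging sketch with setuptools (all names are placeholders)
from setuptools import setup, find_packages

setup(
    name="my_analysis_tools",                # hypothetical package name
    version="0.1.0",
    description="Tools for ocean data analysis",
    packages=find_packages(),                # automatically discover sub-packages
    install_requires=["numpy", "xarray"],    # runtime dependencies
)
```

With such a file at the root of your repository, `pip install -e .` installs the package in editable mode, which is convenient while developing.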
Jupyter notebooks are great for sharing your work because they allow you to mix code, text and visualization in the same document. Because Jupyter notebooks can run code in different languages, they are also great for building complex data analysis pipelines. We therefore strongly recommend using Jupyter notebooks.
- official documentation
- using the ipython kernel
- sharing notebooks with nbviewer
- recommended best practices with Jupyter notebooks
- Jupyter notebooks tips, tricks and shortcuts
- examples from the Python Data Science Handbook, and from the group
- building reproducible analysis pipelines with nbflow (advanced)
- diffing and merging notebooks with nbdime
For people who have never used python before :
- Introduction to python through examples :
- numpy is the fundamental package for scientific computing with Python : tutorial. Here is a collection of resources on numpy. From python to numpy includes a lot of example code and links to tutorials. A useful tutorial from Euroscipy 2019. (A short numpy/scipy sketch is given after this list.)
- scipy is a Python library used for scientific computing and statistical analysis. Here is a collection of resources on scipy.
- more python examples : from the Python Data Science Handbook, from the earthpy website; see also the scipy lecture notes
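If you have never used numpy or scipy, the following minimal sketch gives a flavour of both; the signal is synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

# Build a synthetic noisy signal with numpy
time = np.linspace(0.0, 10.0, 500)
series = np.sin(2.0 * np.pi * 0.5 * time) + 0.3 * np.random.randn(time.size)

# Basic numpy reductions
print("mean:", series.mean(), "std:", series.std())

# Simple statistics with scipy: linear regression of the signal against time
slope, intercept, r_value, p_value, stderr = stats.linregress(time, series)
print("trend:", slope, "correlation:", r_value)
```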
We strongly recommend using the following packages.
- There are several interfaces for handling netCDF files in python. Here is a tutorial for the netCDF4 interface.
- pandas is a great package for handling time series and labelled data in python; here is a 10-minute tour of pandas. See also this example. Here is a collection of resources on pandas. See also this brief introduction and this example of time series analysis with pandas.
- xarray implements N-dimensional variants of the core pandas data structures. In practice, xarray provides an in-memory representation of the content of a collection of netCDF files. (A short sketch combining xarray and dask is given after this list.)
  - official documentation (with interesting links to videos and tutorials)
  - xarray tutorials by S. Hoyer and by N. Fauchereau
- Dask is a flexible parallel computing library for analytic computing in python.
  - official documentation
  - a good introduction
  - a video on parallel and distributed computing with dask
  - some examples
  - slides on visualizing parallel computation
  - combining xarray and dask
  - QuickStart with Dask.distributed
- oocgcm is a project that provides tools for processing and analysing the outputs of general circulation models and gridded satellite data in the field of Earth system science.
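To make the xarray/dask combination more concrete, here is a minimal sketch of a lazy analysis of a set of model outputs; the file pattern, variable name and dimension names are hypothetical and should be adapted to your data.

```python
import xarray as xr

# Open a collection of netCDF files as a single dataset.
# The chunks argument activates dask: data are read and processed lazily, chunk by chunk.
ds = xr.open_mfdataset("model_output_*.nc", chunks={"time": 12})  # hypothetical file pattern

# Lazily define a time mean of sea surface temperature (hypothetical variable name "sst")
sst_mean = ds["sst"].mean(dim="time")

# Nothing has been computed so far; .compute() actually runs the dask task graph
sst_mean = sst_mean.compute()

# For one-dimensional time series, converting to pandas gives access to its time series tools
sst_series = ds["sst"].mean(dim=["lat", "lon"]).to_series()
print(sst_series.head())
```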
The Jupyter book "An Introduction to Earth and Environmental Data Science" by Ryan Abernathey is a good introduction to modern computing software, programming tools and best practices that are broadly applicable to the analysis and visualization of Earth and Environmental data: https://earth-env-data-science.github.io/intro
Machine learning (ML) methods have recently led to impressive performance in the signal, image and vision fields; in particular, neural networks have made a strong come-back thanks to a technique called deep learning (DL) (Nature paper). Maybe you have heard about AlphaGo, the algorithm that beat the world champion at the game of Go (story), or about the Microsoft team that developed an algorithm able to surpass human-level performance on the classification of ImageNet, a popular image dataset (paper). These are just a glimpse of what DL is doing right now; for more information, you can take a look at these excellent articles 1, 2. In our group we are starting to investigate the possibility of transferring knowledge gained in the machine learning community to tackle ocean-related inverse problems.
For newcomers to ML, we strongly recommend Andrew Ng's ML course (it is an old course, so it still uses Matlab). The DL specialization, again by Andrew Ng, is then a great introduction to DL (course) (it dates from 2017, so this time it uses Python!).
The following packages are mainly used by our team for ML:
- scikit-learn: built on top of NumPy, SciPy and matplotlib, scikit-learn is the standard package used in industry and education for machine learning with Python. Tutorials. (FYI: its first released version was developed by INRIA researchers.) A minimal scikit-learn sketch is given after this list.
- Tensorflow: Developed by Google, TF is currently the most used Python library for DL according to Github pull requests history and Google trends.
- Keras: Keras is a high-level neural networks API that can run on top of TensorFlow, meaning that it is more "user friendly" than TensorFlow (it allows easier prototyping; classical layers are basically ready to use with minimal code). In our group, we rely on Keras for direct applications that require optimizing popular existing neural network techniques. Some tutorials can be found here.
- PyTorch (developed by Facebook) is gaining ground and can be considered a strong competitor to TF. It is used by our group for more complex applications where new neural network design choices are needed. It has the advantage of relying on dynamic graphs (define-by-run), which is a more natural way of programming; more details can be found here. (Note that TF has recently added a new functionality that allows the use of dynamic graphs.)
- Other libraries for DL in Python exist such as Theano (historical library developed by MILA Montréal), Caffe, CNTK (developed by Microsoft), MXNET (Amazon), fast.ai, etc.
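To give an idea of the scikit-learn workflow mentioned above, here is a minimal, self-contained regression sketch; the data are random and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression problem (purely illustrative)
rng = np.random.RandomState(0)
X = rng.randn(500, 5)
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.1 * rng.randn(500)

# Standard scikit-learn workflow: split, fit, predict, evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("test mean squared error:", mean_squared_error(y_test, y_pred))
```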
Some other useful links and material for further reading:
- Course materials for the Data Science at Scale Specialization at Coursera https://github.com/AlessandroMozzato/datasci_course_materials
- (French) Stéphane Mallat - Apprentissage par invariants en grande dimension https://www.youtube.com/watch?v=kgicutzlq50
- Machine learning with sklearn, lecture https://www.youtube.com/watch?v=Cte8FYCpylk
- Machine learning for time series data in Python https://www.youtube.com/watch?v=ZgHGCfwExw0
- Machine Learning for Analyzing Complex Time Series https://www.youtube.com/watch?v=8lv3rf1zWkQ
- Jupyter notebooks for the code samples of the book "Deep Learning with Python" https://github.com/fchollet/deep-learning-with-python-notebooks
- Python for probability, statistics and machine learning http://nbviewer.jupyter.org/github/unpingco/Python-for-Probability-Statistics-and-Machine-Learning/tree/master/
- Machine learning with TensorFlow https://github.com/BinRoot/TensorFlow-Book
There are currently too many libraries for visualizing data with python (see this python data visualization tour); this may seem exciting or overwhelming, depending on your point of view... In practice, you should distinguish libraries that focus on interactive data visualization (great for investigating your datasets in Jupyter notebooks) from libraries that focus on static data visualization (needed for writing papers and reports). Several of the more recent visualization libraries in python implement concepts from the Grammar of Graphics.
- General purpose visualization:
  - static visualization :
    - matplotlib is the standard 2D and 3D plotting library for python. See this tutorial, some explanations on how plots work, and the euroscipy 2019 tutorial.
    - seaborn is a great complement to matplotlib for statistical data visualization with python (default plots look good with seaborn) : tutorial, examples; here is a nice collection of resources. (Note that seaborn is well integrated with holoviews.)
    - ggplot is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. See also this video.
    - altair is a new declarative statistical visualization library : documentation, tutorial notebook.
  - interactive visualization (within Jupyter notebooks) :
    - HoloViews lets you store your data in an annotated format that is instantly visualizable, with immediate access to both the numeric data and its visualization : video demo, video demo, example notebooks, list of holoviews elements, holoviews for visualizing distribution data.
    - Bokeh is an interactive data visualization library that implements the grammar of graphics : example gallery, notebook gallery, video demo.
    - bqplot is a Grammar of Graphics-based interactive plotting framework for the Jupyter notebook.
    - plotly is another interactive visualization library, probably more oriented towards making charts and dashboards for companies : documentation, examples.
- Visualizing geographical data :
  - static visualization :
    - Basemap is an extension to matplotlib that allows you to plot geographical data : documentation (no longer maintained)
    - Cartopy provides cartographic tools for python (developed by the MetOffice) : documentation. (A minimal Cartopy example is given after this list.)
  - interactive visualization (within Jupyter notebooks) :
- How to choose a colormap:
  - see this resource on perceptually uniform colormaps
  - see this discussion of a better default colormap for matplotlib
  - see also this discussion of colorbar manipulation for bathymetry
  - seaborn color palette tutorial
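As a minimal illustration of static geographic plotting with matplotlib and Cartopy, here is a sketch that draws a synthetic global field on a map; the data are purely illustrative, and viridis is one of the perceptually uniform colormaps discussed above.

```python
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# Synthetic global field on a regular latitude/longitude grid (purely illustrative)
lon = np.linspace(-180.0, 180.0, 181)
lat = np.linspace(-90.0, 90.0, 91)
lon2d, lat2d = np.meshgrid(lon, lat)
field = np.cos(np.radians(lat2d)) * np.sin(np.radians(3.0 * lon2d))

# Plate Carree map with coastlines; viridis is a perceptually uniform colormap
ax = plt.axes(projection=ccrs.PlateCarree())
mesh = ax.pcolormesh(lon2d, lat2d, field, transform=ccrs.PlateCarree(), cmap="viridis")
ax.coastlines()
plt.colorbar(mesh, orientation="horizontal", label="synthetic field")
plt.show()
```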
Python is great for fast prototyping of production code. It also has the reputation of being rather slow compared with other languages (for instance Fortran, C or C++). There are therefore a lot of options for accelerating python code, with classical solutions generally involving interfacing python code with faster languages. This is the general idea behind f2py, Weave and Cython. Although some of these solutions can be very helpful for interfacing pre-existing legacy code, we here promote the use of numba. Numba gives you the power to speed up your applications with a few annotations, without having to switch languages or Python interpreters. (A minimal numba sketch is given after the list below.)
- see this discussion on how to optimize python code with numba, cython and fortran in jupyter notebooks with magics
- see this blog post on the optimization of non-uniform fourier transforms
- see this blog post on accelerating python code with numba
- and this series of videos on numba : video1, video2, video3
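To illustrate the "few annotations" idea, here is a minimal numba sketch: a plain Python loop compiled to machine code by the @njit decorator (the function itself is a toy example).

```python
import numpy as np
from numba import njit

@njit  # compile this function to machine code on its first call
def running_sum(values):
    # Plain Python loop: slow in pure Python, fast once compiled by numba
    total = 0.0
    out = np.empty_like(values)
    for i in range(values.size):
        total += values[i]
        out[i] = total
    return out

data = np.random.randn(1_000_000)
result = running_sum(data)  # first call triggers compilation, later calls are fast
```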
- Statistics in python with statsmodels : documentation, examples
- Image processing with python
- Filtering and time series analysis with scipy.signal (a minimal filtering sketch is given after this list)
- Python for signal processing
- PyMC3 Probabilistic modelling in python
- managing dates and time intervals with arrow
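As an example of the kind of filtering scipy.signal provides, here is a minimal low-pass filtering sketch applied to a synthetic time series.

```python
import numpy as np
from scipy import signal

# Synthetic time series: a slow oscillation plus high-frequency noise
t = np.arange(0.0, 100.0, 0.1)
x = np.sin(2.0 * np.pi * 0.05 * t) + 0.5 * np.random.randn(t.size)

# 4th-order Butterworth low-pass filter (cutoff expressed as a fraction of the Nyquist frequency)
b, a = signal.butter(4, 0.1)
x_filtered = signal.filtfilt(b, a, x)  # zero-phase filtering, so no time shift
```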
(section under construction)
- CDFTOOLS : https://github.com/meom-group/CDFTOOLS
- DCM : https://servforge.legi.grenoble-inp.fr/projects/DCM
- mkmov