
Data analysis software used in the MEOM group and how to learn it.

authors: Julien Le Sommer, Aurélie Albert and Redouane Lguensat (MEOM group, IGE)

This page provides a curated list of software used for data analysis in the MEOM group, online resources on how to use it, and general advice on how to proceed with ocean data analysis. Please keep in mind that (i) this list is not exhaustive and (ii) it may evolve over time.

General advice

A nice mental picture for understanding most of our data analysis tasks is the notion of a data analysis pipeline. Our data analyses generally combine several steps, each corresponding to an individual piece of software. Our data flows through the pipeline and gets transformed at each step by a particular piece of software. Ideally, these pipelines should be as automated as possible so that our work is easily reproducible.
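As a toy illustration of this idea (all function and file names here are hypothetical), a pipeline is just a chain of small functions, each transforming the output of the previous one:

```python
# Toy sketch of a data analysis pipeline: each step is a small function,
# and the pipeline simply chains them. All names here are hypothetical.

def read_data(path):
    """Step 1: read raw data (here, plain numbers from a text file)."""
    with open(path) as f:
        return [float(line) for line in f]

def compute_anomaly(values):
    """Step 2: transform the data (here, remove the mean)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def write_result(values, path):
    """Step 3: write the result for the next tool in the chain."""
    with open(path, "w") as f:
        f.writelines(f"{v}\n" for v in values)

def run_pipeline(src, dst):
    """Running the pipeline end-to-end makes the analysis reproducible."""
    write_result(compute_anomaly(read_data(src)), dst)
```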

A key principle for building data analysis pipelines is to rely as much as possible on pre-existing software. In practice, most of the steps in an analysis pipeline are very generic (e.g. reading, writing or plotting data), so we can just use pre-existing code. A large fraction of our work therefore involves gluing together existing pieces of code. This is why modern software is made as modular as possible.

If you need to write new code, it should focus primarily on what is specific to your analysis. Some building blocks of your data analysis pipeline are indeed more specific to your needs than others. For these key specific steps, you might have to write a dedicated module, but you should always first ask whether someone has already implemented something close to what you need.

Knowing what is currently feasible with your software environment is therefore key to designing your own analyses. This requires maintaining a routine curiosity about software. In practice, it means spending time going through online videos and tutorials (because you learn better through examples). It also means being aware of minimal good practices in scientific computing.

On this page, you will find a curated list of software, tutorials and examples that we hope will improve your awareness of your technological environment. Do not hesitate to suggest new links and updates to this page.

Recommended base configuration

To work with the MEOM group, you will need this minimal software configuration and set-up:

A small (and old) presentation about basic git: https://drive.google.com/file/d/13CBG1wGUQJpawkAM0zHzLJQivtwx_t3w/view?usp=sharing

Caution: you may need to configure git to work through our network proxy.
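As a sketch of what that configuration typically looks like (the proxy address below is a placeholder, ask the system administrators for the actual one):

```bash
# Placeholder proxy address: replace with the actual proxy of our network.
git config --global http.proxy http://proxy.example.org:3128
git config --global https.proxy http://proxy.example.org:3128

# To undo the setting when working outside the proxied network:
git config --global --unset http.proxy
git config --global --unset https.proxy
```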

Software engineering

Resources on how to better interact with your computer (basic):

Resources on how to build and distribute software (advanced):

Jupyter notebooks

Jupyter notebooks are great for sharing your work because they let you mix code, text and visualization in the same document. Because Jupyter notebooks can run code in different languages, they are also great for building complex data analysis pipelines. We therefore strongly recommend using Jupyter notebooks.

Python language

Basic scientific python

For people who have never used python before:

Geoscientific data analysis in python:

We strongly recommend using the following packages.

The Jupyter book "An Introduction to Earth and Environmental Data Science" by Ryan Abernathey is a good introduction to modern computing software, programming tools and best practices that are broadly applicable to the analysis and visualization of Earth and environmental data: https://earth-env-data-science.github.io/intro
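To give a concrete flavour of this style of analysis (the file name and variable name below are hypothetical), a package like xarray keeps dimensions, coordinates and metadata attached to the arrays, so reading and reducing a NetCDF dataset takes only a few lines:

```python
import xarray as xr

# Open a (hypothetical) NetCDF file of model output.
ds = xr.open_dataset("ocean_model_output.nc")

# Compute the time mean of a (hypothetical) sea surface temperature
# variable, then plot it: axis labels come for free from the metadata.
sst_mean = ds["sst"].mean(dim="time")
sst_mean.plot()
```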

Machine Learning with python:

Machine learning (ML) methods have recently achieved impressive performance in signal, image and vision applications; in particular, neural networks have made a strong comeback thanks to a technique called deep learning (DL) (Nature paper). You may have heard of AlphaGo, the algorithm that beat the world champion at the game of Go (story), or of the Microsoft team that developed an algorithm surpassing human-level performance on the classification of ImageNet, a popular image dataset (paper). These are just a glimpse of what DL is doing right now; for more information, you can take a look at these excellent articles 1,2. In our group, we are starting to investigate how knowledge gained in the machine learning community can be transferred to ocean-related inverse problems.

For newcomers to ML, we strongly recommend Andrew Ng's ML course (it is an old course, so it still uses Matlab). Then the DL specialization, again by Andrew Ng, is a great introduction to DL (course) (it dates from 2017, so this time it is Python!).

The following packages are mainly used by our team for ML:

  • scikit-learn: built on top of NumPy, SciPy and matplotlib, scikit-learn is the standard package used by industry and education for machine learning with Python (a minimal usage sketch is shown after this list). Tutorials (FYI: its first released version was developed by INRIA researchers).
  • Tensorflow: developed by Google, TF is currently the most widely used Python library for DL, according to GitHub pull request history and Google trends.
  • Keras: Keras is a high-level neural networks API that can run on top of TensorFlow, meaning that it is more "user friendly" than TensorFlow (it allows easier prototyping; classical layers are basically ready to use with minimal code). In our group, we rely on Keras for direct applications that require optimizing popular existing neural network techniques. Some tutorials can be found here.
  • PyTorch (developed by Facebook) is gaining ground and can be considered a strong competitor to TF. It is used by our group for more complex applications where new neural network design choices are needed. It has the advantage of relying on dynamic graphs (define-by-run), which is a more natural way of programming; more details can be found here. (Note that TF has recently added a new functionality allowing the use of dynamic graphs.)
  • Other Python libraries for DL exist, such as Theano (a historical library developed by MILA, Montréal), Caffe, CNTK (developed by Microsoft), MXNet (Amazon), fast.ai, etc.
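As promised above, here is a minimal scikit-learn sketch (the data is synthetic, purely for illustration): virtually all of its estimators share the same fit/predict pattern, which is what makes the library so easy to pick up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration: 200 samples, 3 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# The fit/predict pattern shared by virtually all scikit-learn estimators.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```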

Some other useful links and material for further reading:

Data visualization with python:

There are currently too many libraries for visualizing data with python (see this python data visualization tour); this may seem exciting or overwhelming depending on your point of view... In practice, you should distinguish libraries that focus on interactive data visualization (great for investigating your datasets in Jupyter notebooks) from libraries that focus on static data visualization (needed for writing papers and reports). Several of the more recent python visualization libraries implement concepts from the Grammar of Graphics.
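For static figures, matplotlib remains the common denominator that most of these libraries build on; a minimal sketch (with made-up data) looks like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data, purely for illustration.
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("figure.png", dpi=150)  # static output, ready for a paper
```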

Optimizing python codes

Python is great for rapid prototyping, but it also has a reputation for being rather slow compared to other languages (for instance Fortran, C or C++). There are therefore many options for accelerating python code; classical solutions generally involve interfacing python with faster languages, which is the general idea behind f2py, Weave and Cython. Although some of these solutions can be very helpful for interfacing pre-existing legacy code, we here promote the use of numba. Numba gives you the power to speed up your applications with a few annotations, without having to switch languages or Python interpreters.
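As a small illustration of those "few annotations" (the function itself is a toy example), numba compiles a plain Python function to machine code with a single decorator, so the explicit loops below run at near-native speed:

```python
import numpy as np
from numba import jit

@jit(nopython=True)  # compile to machine code; no Python interpreter in the loop
def pairwise_mean_distance(points):
    """Toy example: mean pairwise distance between 2D points."""
    n = points.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dx = points[i, 0] - points[j, 0]
            dy = points[i, 1] - points[j, 1]
            total += (dx * dx + dy * dy) ** 0.5
    return total / (n * (n - 1) / 2)

points = np.random.rand(1000, 2)
print(pairwise_mean_distance(points))  # first call compiles, later calls are fast
```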

Other useful python packages and jupyter notebooks

Other software used in MEOM group

(section under construction)