Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User interface and better solvers for GAP fitting #305

Open
wants to merge 91 commits into
base: master
Choose a base branch
from

Conversation

max-veit
Copy link
Contributor

@max-veit max-veit commented Dec 2, 2020

This PR aims to improve the user interface of the potential-fitting step, starting with fitting of SOAP-GAP potentials. In particular, it provides an interface to automate many of the common tasks involved in GAP fitting, allowing non-interactive (batch-style) fitting based on a flexible Python interface and easing the transition from QUIP.

Edit 24.06.2021: Incorporate the RKHS solver from #354 - description below:

Implement an RKHS solver for the sparse GPR problem.

Solving the GPR problem using the "normal" equation is very ill-conditioned.
This implements an alternative solver that is much better behaved.
Might be good to also include different options in terms of using solve or lstsq,
let's see where this goes - however this already works.

Main changes:

  • Simplify the train_gap_model function (now gaptools.fit_gap_simple()):
    • Improve modularity, remove dependence on StructureManagers
    • Move kernel computation out of main function
    • Make regularizers more predictable by removing implicit variance normalization (the user should almost always specify the variance / function scale explicitly)
  • Add a script that reads a parameter file to control fit options
    • Example parameter file included
  • Move the solver of the sparse GPR "normal" equations out of the fitting functions
    • Add RKHS and QR solvers as options of the solver class

In progress or not yet implemented (wishlist):

  • Feature sparsification
  • Training with virials
  • Automatic variance scaling if requested

(Probably for future PRs:)

  • Multi-kernel (multi-SOAP) fitting
  • Automate uncertainty estimation (fitting with different data subsets)
  • Interface to automate regularizer optimization (including saving kernels)

Tasks before review:

  • Add tests, examples, and complete documentation of any added functionality
    • Make sure the autogenerated documentation appears in Sphinx and is
      formatted correctly (ask @max-veit if you need help with this task).
    • Write additional documentation in the Sphinx (not docstrings!) to
      explain the feature and its usage in plain English
    • Make sure the examples run (!) and produce the expected output
  • Run make lint on the project, ensure it passes
    • Run make pretty-cpp and make pretty-python, check that the
      auto-formatting is sensible
    • Review variable names, make sure they're descriptive (ask a friend)
    • Review all TODOs, convert to issues wherever possible
    • Make sure draft, in-progress, and commented-out code is moved to its
      own branch
  • Merge with master and resolve any conflicts

max-veit and others added 22 commits May 6, 2020 15:43
(Gaussian Approximation Potentials)
and add a sensible default to baseline values in KRR class (in case we
encounter species that weren't in the training set; this is better than
failing with a KeyError)
(Update to version of CURFilter and FPSFilter from PR #265)
Get up to speed with the latest PRs (and eventually feat/gap_pred as well)
Account for name change and add capability to do FPS as well as CUR
(mainly useful for test runs on smaller machines)
I was initially intending this to work just by changing the working
directory, since there are a few other files written that would be a
pain to give the option to specify names for.  But I think it's worth
at least leaving the potential filename flexible.
And turn off computing gradients while sparsifying (they should be
turned on again when computing the gradient kernel)
@felixmusil
Copy link
Contributor

A few comments/questions to this draft:

  • I feel it would be beneficial to separate file saving and computation functionality, e.g. calculate_and_sparsify.
  • Does the WORKDIR parameter work as intended atm ?
  • Do you plan to integrate training with virials into gaptools ?
  • Do you plan on handling partial reference, e.g. some structures with energy but no forces and vice versa ?
  • Do you plan to handle computation of the kernel on multiple processor ?
  • Is this the right place to get an ase.Atoms sanitizer ? (The full periodic and no pbc cases are already around so only partial periodicity really needs to be implemented)
  • Putting parameters in a json (or yml) file is the right way I think.
  • Do you plan to integrate a cross validation in the executable ?
  • Does the dedicated ipi interface belong to this PR ?

I guess many of these suggestions can be offloaded to the wishlist of the next PR too.

@max-veit
Copy link
Contributor Author

max-veit commented Dec 7, 2020

One at a time:

I feel it would be beneficial to separate file saving and computation functionality, e.g. `calculate_and_sparsify`.

Sure, although some of that was intended as diagnostic information or just to have a few "save points" in case the fit fails e.g. due to lack of time or memory. If you're referring specifically to this line:
https://github.com/cosmo-epfl/librascal/blob/9ae737649005fb1773a431b73db452060f8169cf/bindings/rascal/models/gaptools.py#L98
then yes, I suppose as long as that's included and easily accessible in the final model, then writing the sparse points is no longer necessary. But some of these files (e.g. writing out kernel files) will be useful for planned functionality, like regularizer optimization. Let's discuss this further once the PR is closer to being ready.

Does the WORKDIR parameter work as intended atm ?

It's read by the fit_gap.py script from the json parameter file, not from environment variables (although I guess that could also be made a possibility).

Do you plan to integrate training with virials into gaptools ?

Yes, I'll just go ahead and add that to the wish list.

Do you plan on handling partial reference, e.g. some structures with energy but no forces and vice versa ?

I suppose so, though I don't quite see the use case - usually you'll want to fit on a dataset where everything is computed at the same level of theory, and usually that also means having all structures with the same type of data (e.g. all with forces or all without). In any case, I don't think it would be too hard to implement.

Do you plan to handle computation of the kernel on multiple processor ?

That may indeed be useful, but I think it'll have to wait for another PR.

Is this the right place to get an ase.Atoms sanitizer ? (The full periodic and no pbc cases are already around so only partial periodicity really needs to be implemented)

Not really, but I don't see an easy way to automatically turn on atoms wrapping (which I think should be the default option tbh), so it was easier at the time to put in 5 lines of Python and just do it myself. For the non-periodic case we can use the rascal.neighbourlist.structure_manager.sanitize_non_periodic_structure function.

Putting parameters in a json (or yml) file is the right way I think.

Yep, this was deliberately done for compatibility with the rest of librascal, and especially the representation parameters are very clear when stored in this format.

Do you plan to integrate a cross validation in the executable ?

This would be part of the regularizer optimization, which might be better off in its own PR.

Does the dedicated ipi interface belong to this PR ?

That was a mistake, development has been moved to its own branch.

@max-veit max-veit mentioned this pull request Mar 25, 2021
11 tasks
Especially to get the updated sparsification utils and zundel notebook
parameters for smoother automatic testing
Also swap labels to resolve #380
Stresses (i.e. cell gradients) are going to be tricky...
Also add a method to store features into lists of ASE Atoms, so that
these can be used instead of rascal's internal AtomsLists

Finally, selection methods now work with lists of ASE Atoms
max-veit and others added 12 commits June 3, 2022 17:57
(basically, this just removes kvec_generator.cc from compilation,
because it was throwing mysterious bounds-checking errors for Eigen arrays.
Will need to resolve properly once feat/reciprocal_space_soap is
merged in)
- temporarily exclude kvec_generator from tests (kvec_generator.cc is
  currently incompatible with gcc11)
- unpin jinja2 version, since older versions fail with latest markupsafe
  (pallets/markupsafe#304) and it is now
  compatible with latest nbsphinx
  (spatialaudio/nbsphinx#563)
Users can still specify it if they need it, but this is broken in
multiple ways right now.
Apparently this was updated in skcosmo recently...
This was previously implemented but execution never reached there.  This
should therefore be considered an experimental feature until tests can
be made.
@max-veit max-veit marked this pull request as ready for review July 11, 2022 09:55
@max-veit
Copy link
Contributor Author

Ok, this is almost ready for review -- it has all the functionality I was planning to add here, now just needs some testing.

The main major change since the last time this was updated was to add a Kernel class that works with feature matrices explicitly, rather than rascal StructureManagers. This is of course less efficient than using the native rascal Kernel class, but it also means we can work with representations that aren't currently computed in librascal (like LODE).

@max-veit max-veit requested a review from PicoCentauri August 26, 2022 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants