Find a problem that you are excited to spend time implementing and releasing, develop high-quality code, and release it as a well-structured and documented GitHub repository.
There are a few heuristics that you might use to identify a good problem:
- Start with a problem we saw in the course (e.g. an implementation of a single algorithm like a spectral PDE solver) and then branch out into related extensions (e.g. generalize the solver to different coordinate systems or geometries).
- Look for creative but underappreciated older methods papers in your field, especially where the author’s code is unavailable, where the implementation is in a very different language like FORTRAN or Perl, or where the provided code is poorly documented. A clean, easy-to-use implementation of an uncommon method often proves very valuable.
- Find a very common task or method in your field (e.g., stacking microscopy images, or synchronizing experimental recordings), and use tricks from vectorization, dynamic programming or linear algebra to speed up the task.
- It's okay if you pick a hard problem and ultimately can't solve it. The goal is to write good code with good documentation and structure.
- Don't pick a problem that is so hard or open-ended that it delays getting started writing code. Your goal should be to get started writing code as soon as possible. You can always get something simple working and then add more features later.
- Please don't copy or very lightly modify code that is primarily taken from a blog post, a GitHub repo you found, or purely generated by an LLM without any checking for correctness. Remember that the project will be public on GitHub and thus searchable on Google.
- Don't do an ML project unless someone in your group is experienced with it or you are willing to read ahead and make sure you can follow best statistical practices. If you want to use a deep learning model, make sure it's appropriate for the scale of your problem, and first check whether simpler methods, like boosted trees or ridge regression, would be more appropriate. Be extremely careful about validation, hyperparameter tuning, and train-test splits.
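The validation advice in the last bullet can be sketched with a simple baseline. This is a minimal illustration, not a required workflow: the synthetic data, the 60/20/20 split, the candidate `alphas`, and the helper names `ridge_fit` and `mse` are all assumptions made for the example. It fits closed-form ridge regression (without an intercept, since the toy data is zero-mean), tunes the regularization strength on a validation set, and evaluates on the test set only once.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights for design matrix X and targets y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def mse(X, y, w):
    """Mean squared error of the linear prediction X @ w."""
    return float(np.mean((X @ w - y) ** 2))


# Synthetic data (an assumption for illustration): linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=300)

# Shuffle once, then split 60/20/20 into train, validation, and test sets.
idx = rng.permutation(len(X))
train, val, test = idx[:180], idx[180:240], idx[240:]

# Tune the hyperparameter on the validation set only.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = [mse(X[val], y[val], ridge_fit(X[train], y[train], a)) for a in alphas]
best_alpha = alphas[int(np.argmin(val_errors))]

# Touch the test set exactly once, after all tuning is finished.
w_best = ridge_fit(X[train], y[train], best_alpha)
test_error = mse(X[test], y[test], w_best)
print(f"best alpha = {best_alpha}, held-out test MSE = {test_error:.3f}")
```

If a more complex model cannot clearly beat a baseline like this on the held-out test set, the simpler method is probably the better choice for the project.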
Beyond providing a setting to try out some of the ideas we are learning in this class, I am hoping that this project will have residual value to you after the course is over. Having prior experience with open-source development and visible existing code examples may prove useful to you in your graduate research, and potentially on the academic and industry job market. By posting the code publicly on GitHub, your code will help others in the future who are trying to solve similar problems.
- I would prefer groups of 4 people, for a total of 8 projects per course.
- You’re allowed (and encouraged) to work on something relevant to your research group's work, but please make the GitHub repo self-contained. You should plan to write substantial new code for this project (although re-factoring a “rough” implementation is okay), so please use your best judgment to ensure that you get the most value out of this project.
- If there is a method you’d like to add to a large, existing package that is widely used in your field (e.g. Biopython, Astropy, scikit-learn, sktime), check with me about submitting a well-structured PR to the main repo instead of implementing a standalone package. Please confirm with the repo maintainers that your feature would be welcome, and ask what format and testing are required for a PR to be accepted. Generally, the larger the repo’s userbase, the smaller the addition should be, and the more testing it will need to pass. However, the potential impact could be huge.
Problem scope: 20%
- Contains an interesting and challenging problem, and makes a good-faith effort to approach it.
- Creativity: An unexpected application or novel algorithm or interpretation of an algorithm is exciting and appreciated.
- Thoroughness: Makes a thorough attempt to solve the problem, even if ultimately unsuccessful.
Code quality: 40%
- Logical structure, minimal redundancy or repeated code
- Variables and objects have appropriate scope
- Use of appropriate abstractions
- Code legibility and style
- Unit tests or other tests to ensure correctness
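As one possible illustration of the testing and legibility criteria above, here is a small documented function with three tests. The function `moving_average` and the test names are hypothetical examples, not part of any required interface, and the docstring layout follows the NumPy style as one reasonable convention among several.

```python
import numpy as np


def moving_average(x, window):
    """Return the moving average of a 1D array.

    Parameters
    ----------
    x : array_like
        Input samples.
    window : int
        Number of consecutive samples to average; must satisfy
        1 <= window <= len(x).

    Returns
    -------
    numpy.ndarray
        Array of length len(x) - window + 1.
    """
    x = np.asarray(x, dtype=float)
    if not 1 <= window <= len(x):
        raise ValueError("window must be between 1 and len(x)")
    return np.convolve(x, np.ones(window) / window, mode="valid")


def test_constant_signal_is_unchanged():
    # Averaging a constant signal must return the same constant.
    assert np.allclose(moving_average(np.full(10, 3.0), 4), 3.0)


def test_known_values():
    # Hand-computed case: means of (1, 2), (2, 3), (3, 4).
    assert np.allclose(moving_average([1, 2, 3, 4], 2), [1.5, 2.5, 3.5])


def test_invalid_window_raises():
    # Bad input should fail loudly rather than return garbage.
    try:
        moving_average([1, 2, 3], 5)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an oversized window")
```

Assuming pytest is the chosen test runner, saving this in a file whose name starts with `test_` lets `pytest` discover and run the three test functions automatically.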
Documentation: 20%
- README contains installation instructions
- README contains example usage and minimum working example
- I’m not requiring a written report this term, and so if you have benchmarks or results, please put them in a section of your repo’s README.md file. Please use best practices for publication-quality writing and figure-making.
- Major functions and classes have documentation
Talk: 20%
- Only one group member needs to present, though you are welcome to structure this however you’d like.
- These will be ~10 minutes plus 2 minutes for questions, during the last few sessions of the course (5 talks per class).
- Please be ready to present the class session before you are scheduled, just in case someone can’t come on their scheduled presentation day.
- You can organize these however you want, but if you would prefer a template: 5-8 minutes background, 3 minutes on problem formulation, 5 minutes on your solution and any pitfalls or dead ends, and remaining time on future directions, applications, and connections to other interesting ideas.
You can pick anything that interests you and involves writing code. Here are just some ideas:
- Implement the orthogonality-constrained optimizer of Edelman et al.
- Implement a minimal finite-element solver for PDEs in 2D, and compare and contrast with the finite-difference method
- Recreate the key results of Kauffman’s random Boolean circuits paper
- Recreate the key results of Lenski et al.'s digital organisms paper
- Recreate (computationally) the chimera states of coupled oscillators described in Abrams and Strogatz
- Implement the Vicsek model of flocking, including the extension to N-D, and probe the phase transition. Consider how you might extend the model to larger swarms, such as by using the kernelized approach described in this paper
- Implement a minimal version of HAVOK or Dynamic Mode Decomposition, two data-driven methods for discovering the underlying dynamics of a system based on time series data.
- Using the logistic map or another minimal system, implement the Ott, Grebogi, and Yorke method for controlling chaos.
- Our LA, Anish Pandya, has put together a list of additional project ideas here
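To give a sense of scale for one of the ideas above, here is a minimal sketch of Dynamic Mode Decomposition applied to a toy linear system. It is only a starting point under stated assumptions: the function name `dmd_eigenvalues`, the decaying-rotation test system, and the choice of `rank=2` are all made up for illustration, and a real project would also need to handle noise, rank selection, and mode reconstruction.

```python
import numpy as np


def dmd_eigenvalues(snapshots, rank):
    """Estimate discrete-time eigenvalues from a (n_features, n_times) snapshot matrix."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]  # snapshot pairs (x_k, x_{k+1})
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]  # truncate to the chosen rank
    # Project the best-fit linear map Y ≈ A X onto the leading singular vectors.
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s
    return np.linalg.eigvals(A_tilde)


# Toy system (an assumption for illustration): a slowly decaying 2D rotation.
r, theta = 0.95, 0.3
A = r * np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])

# Generate a trajectory x_{k+1} = A x_k and store it column by column.
snapshots = np.empty((2, 50))
snapshots[:, 0] = [1.0, 0.0]
for k in range(1, 50):
    snapshots[:, k] = A @ snapshots[:, k - 1]

eigs = dmd_eigenvalues(snapshots, rank=2)
print("recovered |lambda|:", np.abs(eigs))
print("recovered angles: ", np.sort(np.angle(eigs)))
```

For this noise-free system the recovered eigenvalues should have magnitude close to `r` and angles close to `±theta`, which makes it a convenient correctness check before moving on to real time series data.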
- Pineau Lab machine learning reproducibility checklist
- It’s not necessary to include everything in this guide, but it’s a great guide to what the research community thinks good, reusable code should look like.
- Quantum Reinforcement Learning with the Grover method
- Modelling the contractile dynamics of muscle
- Tight binding and Anderson localization on complex graphs
- Neural System Identification by Training Recurrent Neural Networks
- Assimilating a realistic neuron model onto a reduced-order model
- Testing particle phenomenology beyond the Standard Model with Bayesian classification
- Monte Carlo sampling for many-body systems
- William’s dynamical systems repo from NeurIPS 2021
- tsfresh time series featurization library
- darts forecasting library
- JAX
- An example pull request to the widely-used sklearn machine learning package, which implements varimax PCA:
Example class projects from other courses: