OTS Data Science Co-Learning Resources

Collection space for resources pertaining to the OpenTechSchool Data Science Co-Learning Meetup. All resources are free unless otherwise noted. Please let us know if there are any broken or outdated links.

Feedback, further resources and PRs welcome!



Contents

  1. Python
  2. R
  3. Databases
  4. Machine Learning
  5. Math and Statistics
  6. Social Media and News
  7. Competitions
  8. Datasets

Python

  • Introduction to Programming with Python from OTS: Perfect for anyone with limited exposure to programming. Covers everything you need to get started with Python. Available in English, German, Spanish, Russian, Korean, and Romanian.

  • Automate the Boring Stuff with Python: Teaches Python as a means to get things done. Free if you use the website, also available as a book and a Udemy course.

  • Python Course: Probably one of the most comprehensive free tutorials available, it covers everything from "Hello World" to advanced OOP. The author, Bernd Klein, also offers quite a bit beyond the core material, e.g. Advanced Topics, Numerical Python, Machine Learning, and Tkinter Tutorial. All the material is offered in English and German.

  • DataCamp Community Tutorials: Lots of tutorials, many user-created. Lots of Python, but also R, SQL, Git, stats, etc.

  • Introduction to Data Processing with Python from OTS: A work in progress, our own tutorial builds on the foundations of Introduction to Programming with Python. You'll learn how to install and use Jupyter notebooks, load data, analyze a survey, and visualize your data. We've got quite a bit in the pipeline, so be sure to check back soon.

  • Pandas Tutorial: This tutorial consists of a series of Jupyter Notebooks introducing the fundamentals of the Pandas module. The notebooks can be freely downloaded. The Pandas tutorial, as well as tutorials on a number of other data science related topics, are also available as email courses.

  • StatsModels Tutorial: Repo of Jupyter Notebooks dealing with the StatsModels module. The material is a little old and not very well organized, but there are still a few gems there for anyone doing statistics in Python.

  • SciPy Lecture Notes: A brief (1-2 hours per module) introduction to the tools and techniques of Python's SciPy module.

  • CS109 Data Science: A very comprehensive, very challenging course from Harvard's School of Engineering and Applied Sciences. Uses Python.



R

Note: R is, unlike Python, definitely not a general-purpose language – although it can perform some general-purpose tasks. It is specifically designed for statistical computing and graphics, so many courses teach R in conjunction with stats, data science, etc.

  • Swirl: Swirl is an R package that allows you to learn R interactively in the R console.

  • Introduction to R from DataCamp: A solid, 6-part intro to the basics of the R language with 4-5 hours of material. If you're interested in continuing with DataCamp you can purchase a subscription for 22€ per month, which gives you access to 137 courses (R and Python) and a number of career and skill tracks.

  • DataCamp Community Tutorials: Lots of tutorials, many user-created. Lots of R, but also Python, SQL, Git, stats, etc.

  • R for Data Science from Garrett Grolemund and Hadley Wickham: An excellent introduction to data science via R by two heavyweights of the R community, it is broken down into 5 parts, corresponding to steps in the data science process: Explore, Wrangle, Program, Model, and Communicate. You'll learn the "tidy" approach to data, and immediately use libraries such as dplyr, tidyr, and ggplot2. Some basic knowledge of R can be helpful, but isn't absolutely necessary (DataCamp's intro is more than enough). It's also available as a book.

  • Advanced R from Hadley Wickham: A companion website to the book of the same name, it introduces more advanced features (and quirks) of the R language, e.g. style, exception handling, functional programming, R's C interface, etc.

  • Data Science Specialization: A comprehensive, challenging, 10-part course from Johns Hopkins University & Coursera. Covers R, the data science workflow, stats, and some machine learning.

  • R Tutorial: A basic intro to R and stats from the University of Georgia, Department of Mathematics.

  • Sharp Sight Labs: A useful blog with short tutorials on the nuts and bolts of data analysis in R, with a focus on tidyverse tools and on developing fluency.

  • The Analytics Edge: Semester-long course from MIT & edX. Covers stats/DS using real-world examples.

  • Quick-R: A website dedicated to helping individuals with some background in statistics transition to R.

Machine Learning

MOOCs and Tutorials

  • Intro to Machine Learning (Py): An excellent introduction to applied ML from Udacity. The course focuses on the ML library scikit-learn. Part of Udacity's Data Analyst Nanodegree, it takes an estimated 10 weeks to complete.

  • Machine Learning (Octave/Matlab): A popular introduction to the theory behind common ML algorithms, from Coursera founder and Stanford professor Andrew Ng. It takes an estimated 11 weeks to complete. A certificate is available for Coursera subscribers, but the material is free for everyone. Use of Octave/Matlab is only required when pursuing a certificate.

  • Chris Albon's personal website: Lots of short tutorials. Mostly ML, but also web scraping, regular expressions, visualization, etc. Chris has also written a book.

  • Deep Learning: An online version of the popular deep learning textbook.

  • Natural Language Processing with Python: Free online version of the popular NLP book. Uses NLTK. Updated for Python 3.

  • Kaggle Titanic Tutorial (R): A tutorial aimed at Kaggle's Titanic: Machine Learning from Disaster. Begins with some basics, then moves on to decision trees, feature engineering, and random forests.

  • Kaggle Titanic Tutorial (Py): Machine learning with scikit-learn and TensorFlow.

  • Machine Learning Mastery from Jason Brownlee (R/Python): Includes lots of self-study tutorials covering beginner to advanced topics in machine learning and statistics. Brownlee also offers some ebooks for $37-47, in case you're looking for more depth and/or structure.

  • fast.ai: A website dedicated to making the power of deep learning accessible to all.

Toolkits

  • Scikit-Learn (Py): Simple and efficient tools for data mining and data analysis. Accessible to everybody, and reusable in various contexts. Built on NumPy, SciPy, and matplotlib. Open source, commercially usable - BSD license.

  • Keras (Py): A Python deep learning library. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.

  • TensorFlow (Py): An open source machine learning framework.

  • PyTorch (Py): A deep learning framework for fast, flexible experimentation.

  • Natural Language Toolkit (Py): NLTK is a leading platform for building Python programs to work with human language data.

  • caret (R): The caret package (short for _C_lassification _A_nd _RE_gression _T_raining) is a set of functions that attempt to streamline the process for creating predictive models.

  • class (R): Various functions for classification, including k-nearest neighbour, Learning Vector Quantization and Self-Organizing Maps.

  • stats (R): Offers a number of functions for supervised and unsupervised learning.

Math and Statistics

  • Statistics with R Specialization: A popular, semester-long statistics course from Duke University & Coursera. Focus is on stats with programming assignments in R. That said, it is possible to make it through the course without knowing much about R. Topics: inference, correlation, regression, Bayesian statistics.

  • Computational Linear Algebra for Coders: From the good people at fast.ai.

  • Linear Algebra - Foundations to Frontiers: A popular, semester-long course from the University of Texas at Austin and edX. Challenging, but doesn't assume too much math experience. Programming exercises require Matlab, but it's possible to finish the course with R, Python, etc.

Social Media and News

People

  • Dr. Rachael Tatman (R/Py): Data Scientist @kaggle & Linguistics PhD. Data science, stats, R, Python, NLP and linguistics. Live coding data science on twitch and other interactive projects.

  • Mara Averick (R): tidyverse dev advocate @rstudio #rstats, #datanerd, #civictech 💖er, 🏀 stats junkie, using #data4good (&or 🥇 fantasy sports), lesser ½ of @batpigandme 🦇🐽

  • Julia Silge (R): Data science and visualization at @StackOverflow, #rstats, author of Text Mining with R, parenthood.

  • Maëlle Salmon (R): PhD in statistics. 💙#rstats. Living the FOSS dream working for @rOpenSci & @LockeData. Onboarding co-editor at @rOpenSci. #rladies. Member of @rweekly_org team.

  • Hadley Wickham (R): R, data, visualisation. Creator of, and contributor to, numerous R libraries.

  • David Robinson (R): Chief Data Scientist at @DataCamp, #rstats fan/evangelist.

  • Jake VanderPlas (Py): Data scientist in academia & exploring what that means with a great team at @UWeScience. Visiting researcher at @Google; dad to two girls; author of @pydatasci.

  • Wes McKinney (Py): Data science toolmaker at https://ursalabs.org/ . Creator of pandas, @IbisData. @ApacheArrow @ApacheParquet PMC. Wrote Python for Data Analysis.

  • Allen Downey (Py): Professor at Olin College, author of Think Python, blauthor of Probably Overthinking It, and stark raving Bayesian.

  • Hugo Bowne-Anderson (Py): Data scientist, writer, mathematician, educator. Does all of these @DataCamp.

  • Julia Evans: Julia writes about lots of stuff. Check out her blog.

DS Websites, News, Etc.

  • Analytics Vidhya: A comprehensive data science website providing resources for pretty much everything related to data science/analysis. Their ambitious blurb: Learn everything about analytics.

  • KDnuggets: A website covering everything DS and ML related.

  • Data Machina: A weekly digest of data science curiosities, machine intelligence, data geekery, and other amenities.

  • Revolution Analytics (R): Daily news about using open source R for big data analysis, predictive modeling, data science, and visualization since 2008.

  • Becoming a Data Scientist: A website covering DS-related topics beyond just methods. Also offers useful information on career development.

  • Data Science Berlin: A collection of information related to DS in Berlin and Germany.

Competitions

  • DrivenData: Data science competitions geared towards social causes, including health, education, and development.

  • Kaggle: A popular data science competition platform.

Datasets

  • UCI Machine Learning Repository: More than 400 ML datasets.

  • Kaggle: More than 8,000 datasets of varying quality covering numerous topics.

  • Rdatasets: A collection of 1161 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. Curated by Vincent Arel-Bundock on GitHub.

  • Awesome Public Datasets: A VERY large list of tidied, public data sets. Most of the datasets are free, some are not.

  • IMDb Datasets: Lots and lots of movie data from IMDb.

  • Gun Violence Database: A crowdsourced database of gun violence incidents in the US.

  • UN Data: Data from diverse UN sub-organizations.

  • Eurostat: EU data and statistics.

  • Gapminder: An independent Swedish foundation dedicated to fighting misconceptions about global development, Gapminder offers datasets related to various development indicators.

  • Open Source Psychometrics Project: A website providing a collection of interactive personality tests with detailed results that can be taken for personal entertainment or to learn more about personality assessment. The tests range from very serious to not so much. Special focus is given to the strengths, weaknesses and validity of the various systems.

  • OpenStreetMap: Collaborative project to create a free editable map of the world. Geographic data can be downloaded as XML files.

  • The Stanford Open Policing Project: Standardized data on interactions between police and public, e.g. vehicle and pedestrian stops, from law enforcement departments across the USA.

  • Our World in Data: An online publication giving an overview of global living conditions. Topics covered: health, food provision, the growth and distribution of incomes, violence, rights, wars, culture, energy use, education, and environmental changes. Charts generally include an option to download the data.

  • Open Food Facts: Collaborative database of food products from around the world.

  • Climate Data Online: Global climate data from the National Climatic Data Center, U.S. Department of Commerce.

  • Data Sets: From the book Applied Regression Analysis and Generalized Linear Models.

  • United States Census Bureau: Demographic data from the USA.