beingmojo/datasciencemasters

 
 

The Open-Source Data Science Masters

This is a fork of The Open-Source Data Science Masters that experiments with different curriculum topics and themes.

License here.

The Open Source Data Science Curriculum

History

Fundamentals

Intro to Data Science [UW / Coursera](https://www.coursera.org/course/datasci)
	Topics: Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization.

Skills

Matrices and Linear Algebra fundamentals
	Linear Algebra / Levandosky [Stanford / Book](http://www.amazon.com/Linear-Algebra-Steven-Levandosky/dp/0536667470/ref=sr_1_1?ie=UTF8&qid=1376546498&sr=8-1&keywords=linear+algebra+levandosky#)
	Coding the Matrix: Linear Algebra through Computer Science Applications [Brown / Coursera](https://www.coursera.org/course/matrix)
Hash Functions, Binary Tree, O(n)
Relational Algebra
DB Basics
Inner, Outer, Cross, Theta join
CAP Theorem
Tabular data
Entropy
Data Frames and Series
Sharding
OLAP
Multidimensional Data Model
	ETL
Reporting vs. BI vs. Analytics
JSON & XML
NoSQL
Regex
Vendor Landscape
Env setup
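
The join types listed above can be tried out without a database. The following is a minimal sketch in plain Python, assuming two small hypothetical tables (`users` and `orders`, invented here for illustration). It treats a theta join as a filtered cross join and an inner join as a theta join on key equality.

```python
# Hypothetical toy tables, invented for illustration only.
users = [
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Grace"},
    {"user_id": 3, "name": "Alan"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 25},
    {"order_id": 11, "user_id": 1, "amount": 40},
    {"order_id": 12, "user_id": 2, "amount": 15},
]


def cross_join(left, right):
    """Pair every left row with every right row (Cartesian product)."""
    return [(l, r) for l in left for r in right]


def theta_join(left, right, predicate):
    """Keep only the pairs of the cross join that satisfy the predicate."""
    return [(l, r) for l, r in cross_join(left, right) if predicate(l, r)]


def inner_join(left, right, key):
    """A theta join whose predicate is equality on a shared key."""
    return theta_join(left, right, lambda l, r: l[key] == r[key])


def left_outer_join(left, right, key):
    """Like inner_join, but unmatched left rows are kept with None."""
    rows = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            rows.extend((l, r) for r in matches)
        else:
            rows.append((l, None))
    return rows


inner = inner_join(users, orders, "user_id")
outer = left_outer_join(users, orders, "user_id")
```

Here the user with no orders is absent from `inner` but appears in `outer` paired with `None`, which is the practical difference between the two joins.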

Maths and Stats

Skills

Descriptive statistics
Exploratory Data Analysis
Histograms
Percentiles and outliers
Probability theory
Bayes Theorem
Random Variables
Cumulative Distribution Function (CDF)
Distributions: continuous (Normal/Gaussian) and discrete (Poisson)
Skewness
ANOVA
Probability Density Functions

Central Limit Theorem
Monte Carlo Method
Hypothesis testing
p-value
Chi squared test
Estimation
Confidence intervals (CI)
MLE
Kernel Density Estimate
Regression
Covariance
Correlation
Pearson Coefficient
Causation
Least squares fit
Euclidean Distance
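
Several of the items above (covariance, the Pearson coefficient, least squares fit) are closely related. The following is a minimal sketch using only the Python standard library; the sample data is made up so that it lies exactly on the line y = 2x + 1.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson coefficient: covariance scaled by both standard deviations."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)


def least_squares_fit(xs, ys):
    """Slope and intercept of the line minimising squared vertical error."""
    slope = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = least_squares_fit(xs, ys)
r = pearson(xs, ys)  # approximately 1.0 for perfectly linear data
```

Note that the slope is just the covariance divided by the variance of x, which is why a strong correlation and a good linear fit go together. Neither says anything about causation.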

Computing

Toolbox / Programming Languages / Software stacks

Skills

Unix CLI: installing programs and packages
Bash basics
	cat, grep, wget etc
	piping
	understand stdio
Python
Regex
MS Excel w/ Analysis ToolPak
Java
R, RStudio, Rattle
IBM SPSS
Weka, Knime, RapidMiner
Hadoop distribution of choice
Spark, Storm
Flume, Scribe, Chukwa
Nutch, Talend, Scraperwiki
Webscraper, Flume, Sqoop
tm, RWeka, NLTK
RHIPE
D3.js, ggplot2, Shiny
IBM Languageware
Cassandra, MongoDB

Algorithms, data structures and databases

Programming

Skills

Variables
Vectors
Matrices
Arrays
Factors
Lists
Data Frames
Reading CSV data
Reading Raw data
Manipulate Data Frames
Functions
Factor Analysis
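
Reading CSV data and pulling out typed columns is often the first step before any data frame work. The following is a minimal sketch using Python's standard `csv` module; the inline CSV text and the helper names `read_csv_rows` and `column` are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file.
raw_text = "name,age,city\nAda,36,London\nGrace,45,New York\n"


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


def column(rows, name, cast=str):
    """Extract one column as a list, converting each value with `cast`."""
    return [cast(row[name]) for row in rows]


rows = read_csv_rows(raw_text)
ages = column(rows, "age", int)
```

The `csv` module returns every field as a string, so the explicit `cast` step is where numeric columns get their types.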

Applied methods

Data Munging and integration

Data munging is the art of converting or mapping data from one "raw" form into another format that is more convenient to consume, usually with the help of semi-automated tools. Expect data wrangling of some sort to take up much of your workday; 80% is the figure commonly quoted.

Skills

Dimensionality & Numerosity Reduction
Normalization
Data Scrubbing
Handling missing values
Unbiased estimators
Binning sparse values
Feature Extraction
Denoising
Sampling
Stratified Sampling
Principal Component Analysis
Summary of Data Formats
Data Discovery
Data Sources & Acquisition
Data Integration
Data Fusion
Transformation and enrichment
Data survey
Google OpenRefine
How Much Data
Using ETL
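
Two of the skills above, handling missing values and normalization, can be sketched in a few lines. This is a minimal illustration in plain Python, assuming missing values are represented as `None` and that mean imputation and min-max scaling are acceptable choices for the data at hand (they are not always).

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up column with one missing reading.
raw = [10.0, None, 30.0, 20.0]
filled = impute_mean(raw)
scaled = min_max_normalize(filled)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is worth comparing against simply dropping the incomplete rows.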

Visualization

Skills

Data Exploration in R (Hist, boxplot etc)
Uni, Bi and multivariate Viz
ggplot2
Histogram & Pie (Uni)
Tree and Tree Map
Scatter Plot
Line Charts
Survey Plot
Timeline
Decision Tree
D3.js
InfoVis
IBM ManyEyes
Tableau

Data mining and analysis

Machine Learning

Skills

Numerical Var
Categorical Var
Supervised Learning
Unsupervised Learning
Concepts, Inputs and Attributes
Training and Test Data
Classifier
Prediction
Lift
Overfitting
Bias and variance
Classification
	Trees and classification
	Classification rate
	Decision trees
	Boosting
	Naive Bayes Classifiers
	K-Nearest neighbour
Regression
	Logistic regression
	Ranking
	Linear regression
	Perceptron
Clustering
	Hierarchical clustering
	K-means clustering
Neural Networks
Sentiment analysis
Collaborative Filtering
Tagging
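
To make the ideas of training data, test data, classifier and classification rate concrete, here is a minimal k-nearest-neighbour sketch in plain Python. The two-cluster data set is made up for illustration, and a real project would more likely use a library implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated clusters labelled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
# Held-out points the classifier has not seen.
held_out = [((1.1, 1.0), "a"), ((5.1, 5.0), "b")]

# Classification rate: fraction of held-out points labelled correctly.
accuracy = sum(knn_predict(train, x) == y for x, y in held_out) / len(held_out)
```

Measuring accuracy on held-out points rather than on the training set is what guards against mistaking overfitting for good performance.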

Text Mining / NLP

Skills

Corpus
Named Entity Recognition
Text Analysis
UIMA
Term Document Matrix
Term Frequency and weight
Support Vector Machines
Association rules
Market Basket Analysis
Feature Extraction
Use Mahout
Use Weka
Use NLTK
Classify Text
Vocabulary Mapping
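
A term-document matrix, listed above, is simple to build by hand. This is a minimal sketch with the standard library only, assuming a naive lowercase tokenizer and two made-up documents; tools such as NLTK or tm offer far more robust tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[i] in docs[j]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


# Made-up corpus of two short documents.
docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = term_document_matrix(docs)
```

Each row is a term and each column a document, so the raw counts here are the term frequencies that weighting schemes then rescale.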

Big Data

Map reduce fundamentals
Hadoop
HDFS
Data Replication Principles
Setup Hadoop (IBM / Cloudera / HortonWorks)
Name & Data nodes
Job and task tracker
M/R Programming
Sqoop: Loading Data in HDFS
Flume, Scribe: For Unstructured Data
SQL with Pig
DWH with Hive
Scribe, Chukwa For Weblog
Using Mahout
ZooKeeper, Avro
Storm: Hadoop Realtime
RHadoop, RHIPE
rmr
Cassandra
MongoDB, Neo4j
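
The map, shuffle and reduce phases behind MapReduce can be imitated on a single machine. The following is a minimal word-count sketch in plain Python; it is not Hadoop code, and the function names and sample documents are hypothetical, chosen only to mirror the three phases.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data beats ideas"])
```

Because each mapper sees only its own document and each reducer only its own key, a real cluster can run both phases in parallel across many nodes.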

General Resources:

Contribute

Please Share and Contribute Your Ideas -- it's Open Source!

A note on direction

This is an introduction geared toward those with at least a minimum understanding of programming, and (perhaps obviously) an interest in the components of Data Science (like statistics and distributed computing). Out of personal preference and need for focus, the curriculum assumes and mainly uses Python tools and resources, except where marked as R, Java etc.
