CAR Data Science Toolkit

The CAR Data Science Toolkit is a collection of common data science tools and algorithms, implemented and documented as simply as possible so that data journalists can learn from and understand them.

Tools currently implemented include:

  • Clustering algorithms: DBSCAN; k-means clustering
  • Classification: Naive Bayes classifier; k-nearest neighbors
  • Similarity metrics: Euclidean distance; Jaccard similarity; cosine similarity; Pearson similarity; Hamming distance
  • MapReduce workflow that calculates pairwise document similarity based on TF-IDF weights.
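
To give a feel for the similarity metrics listed above, here is a minimal sketch of four of them in plain Python. This is an illustration only, not the toolkit's own code: the function names and signatures are assumptions chosen for this example, and the toolkit's implementations may differ.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention assumed here: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

For example, `euclidean_distance([0, 0], [3, 4])` returns `5.0`, and `hamming_distance("karolin", "kathrin")` returns `3`. Distances shrink as items become more alike, while similarities grow, so the two kinds of metric should not be mixed without converting one to the other.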