Here are the resources and topics necessary for Data Scientist and Machine Learning roles

xandie985/data-scientist-roadmap2024

Data Scientist Roadmap 2024

Description

Mastering the tools in this guide — including programming languages, machine learning libraries, and cloud platforms — is crucial for data science success.

I've categorized them based on difficulty:

  • Green text: Mandatory and easiest
  • Yellow text: Moderately difficult
  • Red text: Toughest, intended for pros (color codes are present here)

Structure:

List of tools, libraries and concepts


Programming Languages:

  • Python
    • GRIND 75 - Questions and multiple solutions.
  • R

Frameworks & Libraries:


Cloud Platforms & Services:

  • Docker (Containerization platform)
  • Learn any one of the following:
    • GCP (Google Cloud Platform)
      • Cloud Storage
      • Compute Engine
      • Cloud SQL
      • Cloud Functions
      • BigQuery
      • Vertex AI (successor to AI Platform)
    • Azure (Microsoft Azure)
      • Blob Storage
      • Virtual Machines
      • SQL Database / Azure Database for PostgreSQL/MySQL
      • Azure Functions
      • Azure Synapse Analytics
      • Azure Machine Learning
    • AWS (Amazon Web Services)
      • AWS S3
      • AWS EC2
      • AWS RDS
      • AWS Lambda
      • AWS Redshift
      • AWS SageMaker
  • Kubeflow (Cloud-native machine learning platform)
  • Kubernetes (Container orchestration platform)

Data Tools & Libraries:

  • SQL (including OLAP & OLTP variations)
    • SQLBolt, a simple, interactive SQL tutorial. [2H]
  • Pandas
  • Elasticsearch
  • Dask (Parallel computing library for big data)
  • Spark (Large-scale data processing framework)
  • Airbyte (Open-source data integration platform)
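
To make the OLTP/OLAP distinction in the SQL entry above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical, and SQLite only stands in for a real transactional or analytical database: the point is the shape of the queries, not the engine.

```python
import sqlite3

# In-memory database with a small, made-up orders table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 30.0), ("alice", 15.0), ("carol", 50.0)],
)

# OLTP-style query: fetch a single row by its primary key.
oltp_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style query: scan and aggregate many rows into a summary.
olap_rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()

conn.close()

print(oltp_row)
print(olap_rows)
```

OLTP workloads are dominated by short lookups and writes like the first query, while OLAP workloads are dominated by aggregations over many rows like the second.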

Web Development Frameworks:

  • FastAPI
  • Uvicorn (ASGI server commonly used to run FastAPI apps)
  • Streamlit (Machine learning app development framework)

Machine Learning Concepts:

  • Supervised Learning
    • Regression
    • Classification
  • Unsupervised Learning
    • Clustering
    • Dimensionality Reduction
  • Recommendation Systems
  • Time Series Forecasting
  • Natural Language Processing (NLP)
    • Text Mining
    • Natural Language Understanding (NLU)
      • Sentiment Analysis
      • Named Entity Recognition (NER)
      • Question Answering (QA)
    • Natural Language Generation (NLG)
  • Deep Learning Techniques
    • Convolutional Neural Networks (CNNs)
    • Long Short-Term Memory networks (LSTMs)
    • Generative AI
  • Reinforcement Learning
  • Bayesian Optimization
  • Statistics

DevOps & MLOps Tools:

  • Airflow (Workflow orchestration tool)
  • MLflow (Machine learning lifecycle management)
  • Prometheus (Monitoring and alerting system)
  • Grafana (Data visualization and analytics tool)
  • Git version control (e.g., GitLab, GitHub)

Data Visualization Tools:

  • Tableau
  • Matplotlib (Python plotting library)
  • Seaborn (Statistical data visualization library built on top of Matplotlib)
  • Power BI (Microsoft business intelligence platform)

Other:

  • ETL (Extract, Transform, Load) processes

  • Optimisation algorithms (can be broader than just machine learning)

  • Distributed training

  • Curse of dimensionality

  • Financial modeling

    • MIT Course: Mathematics With Applications In Finance
      • The purpose of the class is to expose undergraduate and graduate students to the mathematical concepts and techniques used in the financial industry. Mathematics lectures are mixed with lectures illustrating the corresponding application in the financial industry.
  • LLMs

    • LangChain agents
    • Prompt engineering
    • RAG
    • Fine-tuning

Interviews

Notes and Study Material

  • Neural Networks
    • Part 1: Basics, Gradient Descent, Backpropagation, Learning Rate, Activation Functions.
    • Part 2: Primitive systems, RNN, GRU and LSTM, Transformers, BERT
    • A quick recap, designed for a last review before any interview.
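
As a small taste of the Part 1 topics (gradient descent and the learning rate), here is a minimal sketch in plain Python. It is not taken from the notes themselves: the toy loss, the function names, and the two learning rates are assumptions chosen only to show how the step size affects convergence.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def toy_grad(x):
    """Gradient of the toy loss f(x) = (x - 3)^2, whose minimum is at x = 3."""
    return 2.0 * (x - 3.0)


# A small learning rate shrinks the error on every step.
small_lr = gradient_descent(toy_grad, start=0.0, learning_rate=0.1, steps=100)

# A learning rate that is too large overshoots the minimum and diverges.
large_lr = gradient_descent(toy_grad, start=0.0, learning_rate=1.1, steps=100)

print(small_lr, large_lr)
```

For this quadratic loss, each step multiplies the distance to the minimum by `1 - 2 * learning_rate`, so the iterates converge only when that factor has magnitude below 1.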

Work in progress:

  1. Updating the PyTorch material with notebooks containing code & concepts. (3/20 done)
  2. Updating notes for Neural Networks covering the basics, RNN, GRU, LSTM, Transformers, etc.