
A compilation of my projects in Data Science, Artificial Intelligence, Machine Learning, Natural Language Processing, and Data Analytics


thitirat-mnc/DataSci-ML-Portfolio


Portfolio

Table of Contents

Data Science & Machine Learning Projects

  • Credit Card Market Segmentation and Cluster Prediction - Explore Project

    • Python, Exploratory Data Analysis (EDA), Pandas, Clustering, PCA, Logistic Regression

    CreditCard

  • Restaurant Rating Classification - Explore Project

    • Python, BERT, Logistic Regression, Pandas

    Restaurant Review



Artificial Intelligence (AI) Projects

  • Facial Expression Recognition System - Explore Project

    • Python, Exploratory Data Analysis (EDA), Pandas, OpenCV, Image augmentation, Data normalization, Tensorflow, CNNs, RESNET

    Overview
    • A system that automatically monitors people's emotions and expressions based on facial images
      • The dataset comprises 2000 images with facial key-point annotations and 20,000 facial images, each labeled with facial expression categories.
      • The tasks include detecting facial key points and categorizing each face into one of five emotion categories.
    • Tasks:
      • Perform image visualizations to understand the dataset.
      • Perform image augmentation to increase dataset diversity.
      • Conduct data normalization and prepare training data for model training.
      • Build deep Convolutional Neural Network (CNN) and residual neural network (ResNet) models for facial key-point detection.
      • Save the trained model for deployment.
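
The normalization and augmentation steps above can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists of 0-255 pixel values and key points stored as (x, y) pixel coordinates; the function names and the choice of a horizontal flip are illustrative assumptions, not the project's actual code.

```python
def normalize(image, max_value=255.0):
    """Scale pixel intensities from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


def flip_horizontal(image, keypoints):
    """Mirror an image left-to-right and move its (x, y) key points to match."""
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # A point in column x ends up in column (width - 1 - x) after the flip.
    flipped_points = [(width - 1 - x, y) for x, y in keypoints]
    return flipped_image, flipped_points


# Toy 2x3 "image" with two hypothetical key points.
image = [[0, 51, 255],
         [102, 204, 0]]
keypoints = [(0, 1), (2, 0)]

flipped, moved = flip_horizontal(image, keypoints)
scaled = normalize(image)
```

A real pipeline would likely also swap left/right key-point labels (for example, left eye and right eye) after a flip, which this sketch leaves out.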
  • Brain Tumor Detection and Localization - Explore Project

    • Python, Exploratory Data Analysis (EDA), Pandas, scikit-learn, OpenCV, Image Segmentation, Tensorflow, ResNet50, ResUNet
    Overview
    • Improve the speed and accuracy of brain tumor detection and localization based on MRI scans
    • Tasks:
      • Perform data visualizations to understand the dataset.
      • Train a classifier model to detect tumors.
      • Train a ResUNet segmentation model to localize the tumor if one exists.
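
The brain tumor project describes a two-stage flow: classify first, then segment only when a tumor is detected. A minimal sketch of that control flow follows. The `classify` and `segment` callables, the 0.5 threshold, and the toy stand-in models are all hypothetical; in the project these roles would be played by the trained classifier and the ResUNet.

```python
def detect_and_localize(scan, classify, segment, threshold=0.5):
    """Run the classifier first; run segmentation only if a tumor is predicted."""
    tumor_probability = classify(scan)
    if tumor_probability < threshold:
        # No tumor predicted, so the segmentation model is skipped entirely.
        return {"has_tumor": False, "probability": tumor_probability, "mask": None}
    return {"has_tumor": True, "probability": tumor_probability, "mask": segment(scan)}


# Hypothetical stand-ins for the trained models, for illustration only.
def toy_classify(scan):
    return max(max(row) for row in scan)


def toy_segment(scan):
    return [[1 if pixel > 0.5 else 0 for pixel in row] for row in scan]


healthy = [[0.1, 0.2], [0.0, 0.3]]
suspicious = [[0.1, 0.9], [0.7, 0.2]]

healthy_result = detect_and_localize(healthy, toy_classify, toy_segment)
suspicious_result = detect_and_localize(suspicious, toy_classify, toy_segment)
```

Gating the segmentation step on the classifier output avoids running the heavier model on scans that are predicted to be tumor-free.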


Machine Learning for NLP Projects

  • Business Idea Generator App (BizGen) using Langchain and Large Language Model (LLM) - Explore Project

    • Python, Langchain, OpenAI, LLM


  • Aspect Category and Polarity Classification - Explore Project

    • Python, Pandas, nltk toolkit, Spacy, Logistic Regression, DAN, CNNs

    Overview
      • The dataset contains 3156 rows of text drawn from restaurant reviews. The text was tokenized with nltk.word_tokenize, and symbols outside the English alphabet were removed with regular expressions.
      • The tasks include categorizing each text review into one of five aspect categories and into one of four sentiments.
    • Tasks :
      • Built a bag-of-words logistic regression model as a baseline for both sentiment and aspect classification, with features created from the cleaned text.
      • Oversampled the training set by duplicating examples with the conflict label so that this label is better represented.
      • Trained both multi-class and multi-label logistic regression models for aspect classification.
      • For the multi-label setting, trained a separate binary logistic regression model for each aspect and combined their predictions.
      • For the deep learning models, tried both pre-trained GloVe 300-dimensional word embeddings from stanford.edu and Word2Vec.
      • Built a Deep Averaging Network (DAN) and a Convolutional Neural Network (CNN).
      • Tuned hyperparameters using grid search.
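
The oversampling step for the conflict label can be sketched as below. This assumes training examples are (text, label) pairs and that oversampling means repeating each conflict example a fixed number of times; the function name, the factor of 3, and the sample reviews are illustrative assumptions.

```python
from collections import Counter


def oversample(examples, target_label="conflict", factor=3):
    """Return the examples with every `target_label` example repeated `factor` times."""
    result = []
    for text, label in examples:
        copies = factor if label == target_label else 1
        result.extend([(text, label)] * copies)
    return result


# Made-up training examples, for illustration only.
train = [
    ("great food", "positive"),
    ("slow service", "negative"),
    ("tasty but overpriced", "conflict"),
    ("it was fine", "neutral"),
]

balanced = oversample(train)
label_counts = Counter(label for _, label in balanced)
```

Oversampling should be applied to the training split only, so that duplicated examples do not leak into evaluation data.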
  • Next word Prediction - Explore Project

    • Python, Tensorflow, Pandas

      Overview
      • Predict the next word from the preceding context and the word's first letter
        • The training set is drawn from https://huggingface.co/datasets/gigaword
        • The development set provided for evaluation contains 94,825 rows with 3 columns: 'context', 'first letter', and 'prediction'.
        • The 'first letter' column provides the initial letter of the word to be predicted for each context, while the 'prediction' column contains the actual word that is to be generated.
      • Tasks :
        • For the trigram model, a counter dictionary is used to count the number of occurrences of each trigram in the training data.
        • The model is trained in small batches, with a batch size of 2048, to accommodate the large size of the training set.
        • Once the model has been trained, the probability of each word is computed based on its frequency in the training data.
        • The trigram model is then used to predict the next word in a given context by selecting the word with the highest probability.
        • For kenlm (pre-trained 5-gram model), the next word in a given context is generated by looping over each word in the model's vocabulary and selecting the word with the highest probability.
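
The trigram approach described above can be sketched with a counter dictionary. This assumes whitespace tokenization and lowercasing, which the project may not use; the function names and the toy sentences are hypothetical, and the batch size of 2048 is taken from the description.

```python
from collections import Counter, defaultdict


def train_trigrams(sentences, batch_size=2048):
    """Count, for each pair of words, how often each third word follows it."""
    counts = defaultdict(Counter)
    # Process the corpus in batches so it never has to be tokenized all at once.
    for start in range(0, len(sentences), batch_size):
        for sentence in sentences[start:start + batch_size]:
            tokens = sentence.lower().split()
            for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
                counts[(w1, w2)][w3] += 1
    return counts


def predict_next(counts, context, first_letter):
    """Return (word, probability) for the likeliest next word starting with first_letter."""
    tokens = context.lower().split()
    if len(tokens) < 2:
        return None
    followers = counts.get(tuple(tokens[-2:]), {})
    candidates = {w: c for w, c in followers.items() if w.startswith(first_letter)}
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    # Relative frequency of the chosen word among all words seen after this context.
    return best, candidates[best] / sum(followers.values())


corpus = [
    "the stock market fell sharply",
    "the stock market rose today",
    "the stock market fell again",
]
model = train_trigrams(corpus)
```

With this toy corpus, `predict_next(model, "the stock market", "f")` picks "fell", since it follows "stock market" in two of the three sentences. Returning `None` for unseen contexts is a simplification; a fuller model would back off to bigram or unigram counts.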

  • Named Entity Recognition (NER) for Thai Language - Explore Project

    • Python, Scikit-learn, Pandas, pythainlp

      Overview
      • Named Entity Recognition (NER) for Thai Language
        • The training and development data for this project were in Thai, and were first tokenized and separated by '|' using the pythainlp library (newmm dictionary).
        • The resulting text was then tagged with entity types, including 'ORG', 'PER', 'MEA', 'LOC', 'TTL', 'DTM', 'NUM', 'DES', 'MISC', 'TRM', and 'BRN', using 'B_' before each tag.
        • Each word and tag were separated by '\t', while sentences were separated by '\n'.
        • To preprocess the training data for the models, each word and tag in the dataset were split and stored in two separate lists: one for token sequences and one for label sequences.
      • Tasks :
        • Implemented a Conditional Random Fields (CRF) baseline that uses only the word, the previous word, and the next word as features.
        • Added conjunctive features to the model, which took the form of {word i-1 – word i – word i+1}, resembling bigram and trigram features to capture more contextual information about the words.
        • Explored the use of conjunctive part-of-speech (POS) tags as a feature to recognize named entities based on grammatical context, using the pythainlp pos_tag (orchid_ud).
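
The baseline and conjunctive word features can be sketched as per-token feature dictionaries, a format that CRF toolkits such as sklearn-crfsuite accept. Which CRF library the project used is not stated, so the feature names, the `<BOS>`/`<EOS>` padding markers, and the `|` joiner are illustrative assumptions.

```python
def token_features(tokens, i):
    """Build baseline and conjunctive word features for the token at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<BOS>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return {
        # Baseline features: the word and its immediate neighbors.
        "word": word,
        "prev_word": prev_word,
        "next_word": next_word,
        # Conjunctive features: neighbors joined into bigram- and trigram-like strings.
        "prev+word": f"{prev_word}|{word}",
        "word+next": f"{word}|{next_word}",
        "prev+word+next": f"{prev_word}|{word}|{next_word}",
    }


def sentence_features(tokens):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# A short, already tokenized Thai sentence used purely as an example.
tokens = ["นาย", "สมชาย", "ไป", "กรุงเทพ"]
features = sentence_features(tokens)
```

POS-based conjunctive features could be built the same way by passing a parallel list of tags and joining neighboring tags instead of words.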



Research Paper

  • Enhancing GPT-3.5 for Thai Intent Classification via Cross-Lingual Prompts, Chain-of-Thought, and Self-Consistency - Read Paper

    cross_lingual_prompt
    • Proposes a method for Thai intent classification that combines Large Language Models (LLMs) with cross-lingual techniques.
    • Enhances classification performance by prompting GPT-3.5 in English rather than Thai.
    • Our Cross-Lingual Chain-of-Thought prompt template (XCoT) integrates role-assigning and cross-lingual steps, improving LLM performance and surpassing standard prompts.
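
The self-consistency step named in the paper title is commonly implemented as a majority vote over several sampled model answers. The sketch below shows only that voting step, under the assumption that each sampled response has already been reduced to an intent label; the function name and the example labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def self_consistent_label(sampled_labels):
    """Return the most frequent label and the share of samples that agree with it."""
    if not sampled_labels:
        raise ValueError("at least one sampled label is required")
    # Normalize case and whitespace so trivially different answers count as one vote.
    votes = Counter(label.strip().lower() for label in sampled_labels)
    label, count = votes.most_common(1)[0]
    return label, count / len(sampled_labels)


# Hypothetical intent labels from five sampled chain-of-thought responses.
samples = ["check_balance", "Check_Balance ", "transfer", "check_balance", "transfer"]
final_label, agreement = self_consistent_label(samples)
```

The agreement ratio can double as a rough confidence signal: a low value suggests the sampled reasoning paths disagree on the intent.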


Projects Repositories

CreditCard Restaurant Review Netflix-top10 BizGen


Data Engineer Workshop

Data-EngineerR2DE
