- Data Science & Machine Learning
- Artificial Intelligence (AI)
- Machine Learning for NLP
- Research Paper
- Data Analysis and Visualization
- Netflix Top 10 and Financial Data Analysis
- My Tableau Dashboard
- Google Spreadsheet / Excel
- Project Repositories
- Data Engineer Workshop
-
Credit Card Market Segmentation and Cluster Prediction - Explore Project
- Python, Exploratory Data Analysis (EDA), Pandas, Clustering, PCA, Logistic Regression
-
Restaurants Rating Classification - Explore Project
- Python, BERT, Logistic Regression, Pandas
-
Facial Expression Recognition System - Explore Project
- Python, Exploratory Data Analysis (EDA), Pandas, OpenCV, Image augmentation, Data normalization, TensorFlow, CNNs, ResNet
Overview
- A system that automatically monitors people's emotions and expressions based on facial images
- The dataset comprises 2,000 images with facial key-point annotations and 20,000 facial images, each labeled with a facial expression category.
- The tasks include detecting facial key points and categorizing each face into one of five emotion categories.
- Tasks:
- Perform image visualizations to understand the dataset.
- Perform image augmentation to increase dataset diversity.
- Conduct data normalization and prepare training data for model training.
- Build deep Convolutional Neural Network (CNN) and residual neural network (ResNet) models for facial key-point detection.
- Save the trained model for deployment.
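The augmentation and normalization steps above can be illustrated with a minimal, dependency-free sketch. It assumes grayscale images stored as nested lists of 0-255 pixel values and key points stored as (x, y) pixel coordinates; the function names are hypothetical, and the project itself would more likely rely on OpenCV or NumPy arrays for this.

```python
from typing import List, Tuple

Image = List[List[int]]                 # grayscale image: rows of 0-255 pixel values
Keypoints = List[Tuple[float, float]]   # (x, y) pixel coordinates


def flip_horizontal(image: Image, keypoints: Keypoints) -> Tuple[Image, Keypoints]:
    """Mirror the image left-to-right and move the key points with it."""
    width = len(image[0])
    flipped = [list(reversed(row)) for row in image]
    # A pixel in column x moves to column (width - 1 - x); y is unchanged.
    mirrored = [(width - 1 - x, y) for x, y in keypoints]
    return flipped, mirrored


def normalize(image: Image) -> List[List[float]]:
    """Scale 0-255 pixel values into the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


# Tiny 2x3 example image with one annotated key point at column 0, row 1.
image = [[0, 51, 255], [102, 153, 204]]
flipped, points = flip_horizontal(image, [(0, 1)])
print(flipped)   # [[255, 51, 0], [204, 153, 102]]
print(points)    # [(2, 1)]
```

The point of the sketch is that a geometric augmentation has to be applied to the annotations as well as to the pixels; otherwise the augmented key points no longer sit on the face. Depending on how the key points are named, left/right pairs (for example the two eyes) may also need to be swapped after a flip.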
-
Brain Tumor Detection and Localization - Explore Project
- Python, Exploratory Data Analysis (EDA), Pandas, scikit-learn, OpenCV, Image Segmentation, TensorFlow, ResNet50, ResUNet
Overview
- Improve the speed and accuracy of brain tumor detection and localization based on MRI scans.
- The data comprises 3,929 brain MRI scans with tumor locations, from https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation
- The tasks include classifying whether a tumor exists and, if it does, localizing it.
- Tasks:
- Perform data visualizations to understand the dataset.
- Train a classifier model to detect whether a tumor is present.
- Train a ResUNet segmentation model to localize the tumor if one exists.
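The two-stage flow described above (classify first, segment only when a tumor is detected) might be wired together roughly as follows. This is a sketch under assumptions: `classify` and `segment` are hypothetical callables standing in for the trained classifier and ResUNet models, and the 0.5 threshold is an illustrative default rather than a value taken from the project.

```python
from typing import Callable, List, Optional

Mask = List[List[int]]


def detect_and_localize(
    scan: object,
    classify: Callable[[object], float],
    segment: Callable[[object], List[List[float]]],
    threshold: float = 0.5,
) -> Optional[Mask]:
    """Return a binary tumor mask, or None when the classifier sees no tumor."""
    tumor_probability = classify(scan)
    if tumor_probability < threshold:
        return None  # skip the segmentation model for scans classified as tumor-free
    pixel_probabilities = segment(scan)
    # Turn per-pixel probabilities into a 0/1 mask.
    return [[1 if p >= threshold else 0 for p in row] for row in pixel_probabilities]


# Stand-in models for illustration only.
mask = detect_and_localize(
    scan="scan-placeholder",
    classify=lambda s: 0.9,
    segment=lambda s: [[0.1, 0.8], [0.7, 0.2]],
)
print(mask)  # [[0, 1], [1, 0]]
```

Running the classifier first means the segmentation model is only invoked on scans flagged as containing a tumor.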
-
Business Idea Generator App (BizGen) using Langchain and Large Language Model (LLM) - Explore Project
- Python, Langchain, OpenAI, LLM
-
Aspect Category and Polarity Classification - Explore Project
- Python, Pandas, nltk toolkit, Spacy, Logistic Regression, DAN, CNNs
Overview
- The dataset contains 3,156 rows. The text is drawn from restaurant reviews, tokenized using nltk.word_tokenize, and cleaned of non-English alphabet symbols using regular expressions.
- The tasks include categorizing each text review into one of five aspect categories and into one of four sentiments.
- Tasks:
- Built a bag-of-words logistic regression model as a baseline for both sentiment and aspect classification. The features are created from the cleaned text.
- Performed oversampling by duplicating the 'conflict'-label examples in the training set, giving that label more representation during training.
- Trained both multi-class and multi-label logistic regression models for aspect classification.
- For multi-label, used a binary logistic regression model to train each aspect model separately, and combine the end result prediction.
- For the deep learning models, tried pre-trained 300-dimensional GloVe word embeddings from stanford.edu as well as Word2Vec.
- Built a Deep Averaging Network (DAN) and a Convolutional Neural Network (CNN).
- Tuned hyperparameters using grid search.
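The oversampling step in the task list can be sketched as below. The data layout (a list of `(text, label)` tuples), the function name, and the duplication factor are assumptions made for illustration; the source only states that the 'conflict' examples were multiplied in the training set.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str]  # (review text, sentiment label)


def oversample_label(
    examples: List[Example], label: str, factor: int, seed: int = 0
) -> List[Example]:
    """Repeat every example carrying `label` so that it appears `factor` times."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    result: List[Example] = []
    for text, y in examples:
        copies = factor if y == label else 1
        result.extend([(text, y)] * copies)
    # Shuffle so the duplicated rows are not clustered together.
    random.Random(seed).shuffle(result)
    return result


train = [
    ("great pasta", "positive"),
    ("slow service", "negative"),
    ("tasty but overpriced", "conflict"),
]
balanced = oversample_label(train, label="conflict", factor=3)
print(len(balanced))  # 5
```

Duplication of this kind should be applied to the training split only, so that the development and test sets keep their original label distribution.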
-
Next Word Prediction - Explore Project
-
Python, TensorFlow, Pandas
Overview
- Predict the next word in a context, given the word's first letter.
- The training set is drawn from https://huggingface.co/datasets/gigaword
- The development set provided for evaluation contains 94,825 rows with 3 columns:
- 'context' column, 'first letter' column, and 'prediction' column.
- The 'first letter' column provides the initial letter of the word to be predicted for each context, while the 'prediction' column contains the actual word that is to be generated.
- Tasks:
- For the trigram model, a counter dictionary is used to count the number of occurrences of each trigram in the training data.
- The model is trained in small batches, with a batch size of 2048, to accommodate the large size of the training set.
- Once the model has been trained, the probability of each word is computed based on its frequency in the training data.
- The trigram model is then used to predict the next word in a given context by selecting the word with the highest probability.
- For KenLM (a pre-trained 5-gram model), the next word in a given context is generated by looping over each word in the model's vocabulary and selecting the word with the highest probability.
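The trigram approach described above can be sketched with a counter dictionary keyed on the two preceding words. This is a minimal illustration, not the project's code: the function names are hypothetical, batching is omitted, and restricting candidates to words that start with the given first letter is an assumption about how the 'first letter' column is used.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

TrigramCounts = Dict[Tuple[str, str], Counter]


def train_trigram_counts(sentences: Iterable[List[str]]) -> TrigramCounts:
    """Count how often each word follows each pair of preceding words."""
    counts: TrigramCounts = defaultdict(Counter)
    for tokens in sentences:
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def predict_next(
    counts: TrigramCounts, context: List[str], first_letter: str
) -> Optional[str]:
    """Pick the most frequent follower of the last two words that starts with `first_letter`."""
    if len(context) < 2:
        return None
    followers = counts.get((context[-2], context[-1]))
    if not followers:
        return None
    candidates = {w: c for w, c in followers.items() if w.startswith(first_letter)}
    if not candidates:
        return None
    # Within one context, the highest count is also the highest relative frequency.
    return max(candidates, key=candidates.get)


corpus = [
    ["the", "stock", "market", "fell"],
    ["the", "stock", "market", "rose"],
    ["the", "stock", "market", "fell"],
    ["the", "stock", "price", "rose"],
]
model = train_trigram_counts(corpus)
print(predict_next(model, ["the", "stock"], "m"))     # market
print(predict_next(model, ["stock", "market"], "f"))  # fell
```

When no trigram matches the context, this sketch returns `None`; a fuller version could back off to bigram or unigram counts.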
-
-
Named Entity Recognition (NER) for Thai Language - Explore Project
-
Python, Scikit-learn, Pandas, pythainlp
Overview
- Named Entity Recognition (NER) for the Thai language
- The training and development data for this project were in Thai, and were first tokenized and separated by '|' using the pythainlp library (newmm dictionary).
- The resulting text was then tagged with entity types, including 'ORG', 'PER', 'MEA', 'LOC', 'TTL', 'DTM', 'NUM', 'DES', 'MISC', 'TRM', and 'BRN', using 'B_' before each tag.
- Each word and tag were separated by '\t', while sentences were separated by '\n'.
- To preprocess the training data for the models, each word and tag in the dataset were split and stored in two separate lists: one for token sequences and one for label sequences.
- Tasks:
- Implemented Conditional Random Fields (CRF) with only the word, the previous word, and the next word as features, as a baseline.
- Added conjunctive features to the model, which took the form of {word i-1 – word i – word i+1}, resembling bigram and trigram features to capture more contextual information about the words.
- Explored the use of conjunctive part-of-speech (POS) tags as a feature to recognize named entities based on grammatical context, using the pythainlp pos_tag (orchid_ud).
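The preprocessing and feature design described above can be sketched as follows. Several details are assumptions: the file is taken to hold one `word\ttag` pair per line with a blank line between sentences, non-entity tokens are assumed to be tagged 'O', and the feature names and the `<BOS>`/`<EOS>` padding tokens are hypothetical. The per-token feature dictionaries follow the style commonly fed to CRF toolkits.

```python
from typing import Dict, List, Tuple


def read_tagged_text(text: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Split tagged text into parallel token sequences and label sequences."""
    token_seqs: List[List[str]] = []
    label_seqs: List[List[str]] = []
    tokens: List[str] = []
    labels: List[str] = []
    for line in text.split("\n"):
        if not line:  # an empty line closes the current sentence
            if tokens:
                token_seqs.append(tokens)
                label_seqs.append(labels)
                tokens, labels = [], []
            continue
        word, tag = line.rsplit("\t", 1)
        tokens.append(word)
        labels.append(tag)
    if tokens:  # last sentence when the text does not end with a blank line
        token_seqs.append(tokens)
        label_seqs.append(labels)
    return token_seqs, label_seqs


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Baseline word/previous/next features plus their conjunctions."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<BOS>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return {
        "word": word,
        "prev_word": prev_word,
        "next_word": next_word,
        # Conjunctive features: neighbouring words joined into one feature value.
        "prev|word": f"{prev_word}|{word}",
        "word|next": f"{word}|{next_word}",
        "prev|word|next": f"{prev_word}|{word}|{next_word}",
    }


sample = "นาย\tB_TTL\nสมชาย\tB_PER\n\nไป\tO\nเชียงใหม่\tB_LOC\n"
token_seqs, label_seqs = read_tagged_text(sample)
features = [[token_features(s, i) for i in range(len(s))] for s in token_seqs]
print(label_seqs)  # [['B_TTL', 'B_PER'], ['O', 'B_LOC']]
```

POS-based conjunctions could be added in the same way, by joining the POS tags of neighbouring tokens into a single feature value.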
-
-
Enhancing GPT-3.5 for Thai Intent Classification via Cross-Lingual Prompts, Chain-of-Thought, and Self-Consistency - Read Paper
- This paper proposes a method for Thai intent classification using Large Language Models (LLMs) and cross-lingual techniques.
- It enhances classification performance by prompting GPT-3.5 in English rather than Thai.
- Our Cross-Lingual Chain-of-Thought prompt template (XCoT) integrates role-assigning and cross-lingual steps, improves LLM performance, and surpasses standard prompts.
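To make the ideas in the title concrete, here is a rough sketch of a role-assigning, cross-lingual chain-of-thought prompt combined with self-consistency voting. The prompt wording is purely hypothetical and is not the paper's XCoT template; `sample` is an assumed callable that sends a prompt to an LLM and returns its final answer, and the intent names are invented for the example.

```python
from collections import Counter
from typing import Callable, List


def build_cross_lingual_prompt(utterance: str, intents: List[str]) -> str:
    """Assemble an English prompt around a Thai utterance (illustrative wording only)."""
    options = ", ".join(intents)
    return (
        "You are an expert customer-service assistant.\n"          # role-assigning step
        "Step 1: Translate the Thai message into English.\n"       # cross-lingual step
        "Step 2: Reason step by step about what the user wants.\n"  # chain-of-thought step
        f"Step 3: Answer with exactly one intent from: {options}.\n"
        f"Message: {utterance}\n"
        "Final intent:"
    )


def self_consistent_intent(
    prompt: str, sample: Callable[[str], str], n_samples: int = 5
) -> str:
    """Query the model several times and return the most frequent answer."""
    votes = Counter(sample(prompt).strip().lower() for _ in range(n_samples))
    return votes.most_common(1)[0][0]


# Stand-in sampler that replays canned answers instead of calling a real model.
canned = iter(["check_balance", "transfer", "check_balance", "Check_Balance ", "transfer"])
prompt = build_cross_lingual_prompt("ขอเช็คยอดเงินในบัญชี", ["check_balance", "transfer"])
print(self_consistent_intent(prompt, lambda p: next(canned)))  # check_balance
```

Self-consistency here is a majority vote over several sampled answers, so it only helps when the model is sampled with some randomness (for example a non-zero temperature).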