[PUBLISHER] Merge #32
* PUSH NOTE : portfolio cleanup.md

* PUSH NOTE : nlp overview.md

* PUSH NOTE : data exploration.md

* PUSH NOTE : ml overview.md

* PUSH NOTE : copart internship.md
zaiquiriw authored Nov 15, 2023
1 parent 2e6e7a1 commit 09854e8
Showing 5 changed files with 401 additions and 0 deletions.
253 changes: 253 additions & 0 deletions docs/ML Work/data exploration.md
@@ -0,0 +1,253 @@
---
share: true
category: ML Work
---
# Data Exploration
## The Premise
>"In class, we covered how to do data exploration with statistical functions in R. In this assignment, you recreate that functionality in C++ code. This will prepare us to write algorithms in C++ in future assignments"

For me, this is both a review of C++ and a refresher on what correlation actually measures.

### Notes
- I deliberated over whether to return the range as the min and max or as the difference between the two; I eventually chose to return the min and max.
- I can't get relative links to work at the moment, so I hope it's fine that this links to the file hosted on the main site.
## Conclusion
### The Code
```cpp
#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <cmath>

using namespace std;

// TODO: Convert the double-vector parameters to accept any numeric type
// Reference: - Iterators: https://www.geeksforgeeks.org/iterators-c-stl/
//            - What gets passed into sort: https://cplusplus.com/reference/iterator/RandomAccessIterator/

class Explore
{
public:
    // Calculate the sum of a vector
    double sum_vector(vector<double> vect)
    {
        double sum = 0;
        for (size_t i = 0; i < vect.size(); i++)
        {
            sum += vect[i];
        }
        return sum;
    }

    // Calculate the mean of a vector
    double mean_vector(vector<double> vect)
    {
        double mean = sum_vector(vect) / vect.size();
        return mean;
    }

    // Calculate the median of a vector
    double median_vector(vector<double> vect)
    {
        double median;
        // Use an iterator because it is probably better -internet
        vector<double>::iterator it;
        // Find the center depending on whether the length is even or odd
        sort(vect.begin(), vect.end());
        if (vect.size() % 2 == 0) // If there is an even number of elements
        {
            it = vect.begin() + vect.size() / 2 - 1;
            median = (*it + *(it + 1)) / 2;
        }
        else // If there is an odd number of elements
        {
            it = vect.begin() + vect.size() / 2;
            median = *it;
        }
        return median;
    }

    // Calculate the range of a vector as {min, max}
    vector<double> range_vector(vector<double> vect)
    {
        vector<double> range = {min_vector(vect), max_vector(vect)};
        return range;
    }

    // Calculate the max of a vector (just for range)
    double max_vector(vector<double> vect)
    {
        double max;
        vector<double>::iterator it;
        sort(vect.begin(), vect.end());
        it = vect.end() - 1;
        max = *it;
        return max;
    }

    // Calculate the min of a vector (just for range)
    double min_vector(vector<double> vect)
    {
        double min;
        vector<double>::iterator it;
        sort(vect.begin(), vect.end());
        it = vect.begin();
        min = *it;
        return min;
    }

    // Calculate the covariance of two vectors
    // Cov(x,y) = sum((x_i - mean(x)) * (y_i - mean(y))) / (n - 1)
    double covar_vector(vector<double> x, vector<double> y)
    {
        double sum = 0;
        double mean_x = mean_vector(x);
        double mean_y = mean_vector(y);
        for (size_t i = 0; i < x.size(); i++)
        {
            double x_i_diff = x[i] - mean_x;
            double y_i_diff = y[i] - mean_y;
            double y_times_x_diff = x_i_diff * y_i_diff;
            // cout << x_i_diff << " * " << y_i_diff << " = " << y_times_x_diff << endl;
            sum = sum + y_times_x_diff;
        }
        return sum / (x.size() - 1);
    }

    // Calculate the correlation of two vectors
    // Cor(x,y) = Cov(x,y) / (standard_deviation(x) * standard_deviation(y))
    // Using the hint from the assignment:
    // "sigma of a vector can be calculated as the square root of variance(v,v)"
    double cor_vector(vector<double> x, vector<double> y)
    {
        double covar = covar_vector(x, y);
        double sigma_x = sqrt(covar_vector(x, x));
        double sigma_y = sqrt(covar_vector(y, y));
        return covar / (sigma_x * sigma_y);
    }

    // Run a suite of statistical functions on a vector
    void print_stats(vector<double> vect)
    {
        cout << "Sum: " << sum_vector(vect) << endl;
        cout << "Mean: " << mean_vector(vect) << endl;
        cout << "Median: " << median_vector(vect) << endl;
        vector<double> range = range_vector(vect);
        cout << "Range: " << range[0] << ", " << range[1] << endl;
    }
};

int main(int argc, char **argv)
{
    ifstream inFS;
    string line;
    string rm_in, medv_in;
    const int MAX_LEN = 1000;
    vector<double> rm(MAX_LEN), medv(MAX_LEN);

    cout << "Opening file Boston.csv." << endl;

    inFS.open("Boston.csv");
    if (!inFS.is_open())
    {
        cout << "Error opening file Boston.csv." << endl;
        return 1;
    }

    cout << "Reading line 1 of Boston.csv." << endl;
    getline(inFS, line);

    // Echo the heading
    cout << "Headings: " << line << endl;

    // Read data; stop as soon as a full rm,medv pair can no longer be read
    int numObservations = 0;
    while (getline(inFS, rm_in, ',') && getline(inFS, medv_in, '\n'))
    {
        rm.at(numObservations) = stod(rm_in);
        medv.at(numObservations) = stod(medv_in);

        numObservations++;
    }

    rm.resize(numObservations);
    medv.resize(numObservations);

    cout << "New Length: " << rm.size() << endl;

    cout << "Closing file Boston.csv." << endl;
    inFS.close(); // Done

    cout << "Number of records: " << numObservations << endl;

    // Create an Explore object to use the stats functions
    Explore explore;

    cout << "\nStats for rm" << endl;
    explore.print_stats(rm);

    cout << "\nStats for medv" << endl;
    explore.print_stats(medv);

    cout << "\nCovariance = " << explore.covar_vector(rm, medv) << endl;

    cout << "\nCorrelation = " << explore.cor_vector(rm, medv) << endl;

    cout << "\nProgram terminated." << endl;
}

```
### Returns
```bash
Opening file Boston.csv.
Reading line 1 of Boston.csv.
Headings: rm,medv
New Length: 506
Closing file Boston.csv.
Number of records: 506
Stats for rm
Sum: 3180.03
Mean: 6.28463
Median: 6.2085
Range: 3.561, 8.78
Stats for medv
Sum: 11401.6
Mean: 22.5328
Median: 21.2
Range: 5, 50
Covariance = 4.49345
Correlation = 0.69536
Program terminated.
```

### Built into R or C++
It was clearly easier to use R's built-in functions than to go through and write these functions in C++. This indicates the value of using R for machine learning going forward: a more straightforward way to analyze data helps us understand our data and our models.

I will note that someone fluent in C++ would do better than I did, considering I spent a while just stumbling around trying to remember how to import a library.
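
As an aside, the standard `<numeric>` and `<algorithm>` headers already cover a chunk of the boilerplate I wrote by hand, which narrows (but does not close) the gap with R. A minimal sketch, using a small stand-in vector rather than the full `rm` column:

```cpp
// Minimal sketch: sum, mean, and range with the C++ standard library.
// The values below are stand-ins, not the real Boston.csv data.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> v = {6.5, 6.4, 7.2, 7.0};

    double sum = std::accumulate(v.begin(), v.end(), 0.0); // like sum(v) in R
    double mean = sum / v.size();                          // like mean(v)
    auto mm = std::minmax_element(v.begin(), v.end());     // like range(v)

    std::cout << "Sum: " << sum << "  Mean: " << mean
              << "  Range: " << *mm.first << ", " << *mm.second << std::endl;
}
```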

### Statistical Value
What statistical measures did I evaluate:
- ***Mean***: The average of the dataset, representing its typical value. Knowing what values tend to be is important for understanding the general trend of the data, and individual values can be compared against it to spot outliers.
- ***Median***: The center value of the data in sorted order (or the average of the two center values when the count is even; see the sketch after this list). It gives a measure of central tendency that is robust to skewed data and large outliers.
- ***Range***: The minimum and maximum values the dataset takes. It shows how the data is bounded, both to see how far outliers sit from the center and to get a sense of the data's scale.
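
Since the even-length case is easy to trip over (the median there is the average of the two middle values, not a single center element), here is a small hedged sketch of an alternative that avoids a full sort by using `std::nth_element`. This is not the assignment code, just a side note:

```cpp
// Sketch: median without a full sort, via std::nth_element.
// For an even count, the two middle values are averaged.
#include <algorithm>
#include <iostream>
#include <vector>

double median(std::vector<double> v)
{
    auto mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    double upper = v[mid];                  // the (mid+1)-th smallest value
    if (v.size() % 2 != 0)
        return upper;                       // odd count: single middle value
    // Even count: the largest value of the lower half pairs with 'upper'
    double lower = *std::max_element(v.begin(), v.begin() + mid);
    return (lower + upper) / 2.0;
}

int main()
{
    std::cout << median({3, 1, 2}) << std::endl;    // prints 2
    std::cout << median({4, 1, 3, 2}) << std::endl; // prints 2.5
}
```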

Whenever we organize data for a machine learning algorithm, we must understand the data ourselves to have any hope of predicting a trend from it. Values like the mean or median give easy-to-understand generalizations, so we can get a sense of what the data means without analyzing every point. With more powerful summary tools and methods of analysis, we can grow more confident in our understanding of the data.

If we can see a general trend in our descriptive statistics, we have better reason to believe a model will eventually capture the specific trend we hope to predict from that dataset.

### Covariance and Correlation
Given two attributes that may or may not be related, we can compute the *covariance* and *correlation* between them. Covariance tells us how one attribute's values move with another's: if the covariance of x and y is a large positive number, then as x goes up, y tends to go up as well, while a large negative covariance means they move in opposite directions. Correlation is the same quantity scaled to the range [-1, 1], which makes it comparable across attributes with different units and scales.

These values differ from those above in that they are not just measurements of a single attribute but an extrapolation about relationships and patterns across attributes. This is very useful in ML, where the end goal is figuring out how data tends to relate to certain results; knowing how attributes correlate directly supports predicting outcomes, or just understanding complex relationships.
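
To convince myself the scaling works the way I described, here is a compact hedged sketch (separate from the assignment code above) that computes both quantities with `std::inner_product`; on a pair of vectors that are an exact positive linear match, the correlation comes out to 1:

```cpp
// Sketch: covariance in one pass with std::inner_product, then correlation
// as the covariance rescaled by both standard deviations into [-1, 1].
// Assumes x and y have the same length, with at least two elements.
#include <cmath>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

double mean(const std::vector<double>& v)
{
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

double covar(const std::vector<double>& x, const std::vector<double>& y)
{
    double mx = mean(x), my = mean(y);
    // Accumulate (x_i - mx) * (y_i - my) across both vectors
    double s = std::inner_product(x.begin(), x.end(), y.begin(), 0.0,
                                  std::plus<>(),
                                  [mx, my](double a, double b) { return (a - mx) * (b - my); });
    return s / (x.size() - 1);
}

double cor(const std::vector<double>& x, const std::vector<double>& y)
{
    return covar(x, y) / (std::sqrt(covar(x, x)) * std::sqrt(covar(y, y)));
}

int main()
{
    std::vector<double> x = {1, 2, 3, 4};
    std::vector<double> y = {2, 4, 6, 8}; // y is exactly 2 * x
    std::cout << "Cov: " << covar(x, y) << "  Cor: " << cor(x, y) << std::endl;
    // Prints Cov: 3.33333  Cor: 1
}
```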
56 changes: 56 additions & 0 deletions docs/ML Work/ml overview.md
@@ -0,0 +1,56 @@
---
share: true
category: ML Work
title: ML Work
---

# Machine Learning Work
Back in 2022 I took an introduction to machine learning with the wonderful [Karen Mazidi](https://www.linkedin.com/in/mazidiaiconsulting/), who gave us a broad overview of both data science fundamentals and basic machine learning. The class was project-based, with a focus on documenting the process.

## Learning in R
We covered using R for a variety of algorithms:
- Linear Regression
- Logistic Regression
- Naive Bayes
- kNN
- k-means Clustering
- Decision Trees and Random Forests
- Support Vector Machines

We also covered best practices for picking good attributes for analysis, cleaning up data in CSVs, and visualizing the results.

## Learning in Python
We then pivoted to implementing these solutions in the Python package ecosystem, using:
- NumPy
- Pandas
- Scikit-Learn
- Seaborn

These let us implement all the algorithms we had just learned. Python also allowed us to branch into neural networks, using *Keras* for an implementation of zero-shot classification.

>[!note]
>We even touched on Hidden Markov Models and Bayesian nets, but not much on their implementation

## Continued Work
This class carried into my work in NLP, which happened to coincide with the emergence of OpenAI's ChatGPT. [[nlp overview|Check it out]] for work on more advanced neural nets.

## The Projects
> [!seealso] While I will leave my original code open source on [GitHub](https://github.com/zaiquiriw/ml-portfolio/tree/main), the following brief summaries link to PDFs summarizing the projects.
- [[data exploration|Low Level Basics]]: I worked a little bit in C++ to implement basic statistics calculations, just to make sure I had the swing of things.
- ***Linear Models***: Assuming a problem's input and output are linearly related, there are multiple ways to create a predictive supervised model:
- [[linear regression.pdf|Linear Regression]]
- [[linear classification.pdf|Linear Classification]]: Naive Bayes and Logistic Regression
- [[from scratch.pdf|Building From Scratch]]
- ***Similarities***: If, instead of predicting a target value, we just want to understand the data, there are many methods for breaking down complex (and high-dimensional) data. This works hand in hand with dimensionality reduction; I have more on the subject [[similarities main.pdf|here]]. The kNN and k-means algorithms were both good tools for improving the performance of our previous models.
	- [[similarities regression.pdf|Using similarities to improve regression]]
- [[similarities classification.pdf|Using similarities to improve classification]]
- [[Clustering.pdf|Clustering Spotify genres]]
- [[Dimensionality.pdf|Failed dimensionality reduction of Spotify]]
- ***Support Vector Machines***: SVMs divide data in a way that maximizes the margin between classes. While we might visualize the data being split by a line for classification, the method works not only for high-dimensional classification but for regression as well. I have a much better description [[SVM and ensemble.pdf|here]].
	- [[ClassificationSVM.pdf|Classification with SVMs]]
- [[RegressionSVM.pdf|Regression with SVMs]]
- ***Neural Networks***: While I explore neural networks in full in my later [[nlp overview|Natural Language Processing]] class, we still got some experience with how neural networks work in Keras and TensorFlow. Using RNNs, CNNs, and even fine-tuning Google's MobileNet V2, I was able to create a pretty good [[keras_image_recognition.pdf|rice identification model]].


Above all, I would recommend reading my [[keras_image_recognition.pdf|Keras Image Classification]] paper for a good picture of the progress made in this class. Needless to say, this was maybe the second most impactful class of my degree, right before [[nlp overview|NLP]].
28 changes: 28 additions & 0 deletions docs/nlp overview.md
@@ -0,0 +1,28 @@
---
share: true
---
# Natural Language Processing
Continuing my study in machine learning, I decided to focus on language processing and take a class on NLP. The class focused on learning the various libraries and ML techniques used to understand language, scaling up from basic Python all the way to deep learning. We covered:
- Foundational NLP distinctions like parts of speech and the levels of words, sentences, and corpora
- Basic Python usage with NLTK for preprocessing
- WordNet and building word relationships
- N-gram models for language generation
- Context-free grammars
- NumPy, pandas, scikit-learn, and seaborn
- Naive Bayes and Logistic Regression for NLP
- Keras for CNNs, RNNs, LSTMs, and GRUs
- Using embeddings along with encoders and decoders

For all of these topics we did various projects to practice implementing what we learned and sharing it in Jupyter notebooks.

## The Projects
If you would like to see the code and notebook work for these projects, they are still posted on [[https://github.com/zaiquiriw/nlp-portfolio|GitHub]]! Here are some short summaries of my work in NLP; I am particularly proud of my [[Summary_of_Attention_Article.pdf|analysis of attention as an explainability metric]] if you would like to read it.

- [[wordnet.pdf|Wordnets]]: This is an exploration of how wordnets can reveal complex meanings of words not simply found in the definition
- [[ngrams-assignment.pdf|N-grams]]: Just a brief description of ngrams to illustrate their usefulness
- [[summary.pdf|Web scraping for LLMs]]: I used BeautifulSoup to scrape the web for an LLM
- [[text-classification.pdf|Text Classification]]: I used simple neural networks with the goal of building a model that could imitate characters (in this case Rick and Morty's voice and tone)
- [[Summary_of_Attention_Article.pdf|The Impact of Attention]]: This short paper summarizes a paper asking "Is Attention Explanation?" and bridges the creation of modern GPTs into the now-pressing alignment problem and other consequences of modern attention. A personal favorite project, where I explored the upheaval in AI research caused by the sudden prominence of new AI techniques.
- [[RickMortyTwo.pdf|More Rick And Morty]]: I like to have fun, so I did a take two on classifying text based on the Rick and Morty voice. However, it came out as more of a study in how you can't squeeze data into fitting your use case; you just have to work with the data you have.

I came out of this class *really* wanting to do more research, but I did not want to jump right into a master's. Perhaps one day, but I need a break after 16 or so years of schooling. I do feel very comfortable in data science, and I value that greatly!
43 changes: 43 additions & 0 deletions docs/obsidian/copart internship.md
@@ -0,0 +1,43 @@
---
share: true
category: obsidian
---
# Data Science at Copart
The company Copart is right next to my house. As of two days ago, they have a data scientist internship position open! I meet all of the requirements and have the portfolio to support it. Now I just need to put it all together for an application.

## The Job

> Copart is looking for a data scientist who will work closely with IT and various other departments to drive insight into data and deliver machine learning solutions to improve Copart's operations. This data scientist will also design/manipulate large scale data sets from a multitude of sources, work to operationalize and integrate machine learning solutions into Copart's current products and visualize and report on findings and results to provide insight to the organization.

**Job Duties**

- Develop new predictive models using advanced techniques
- Apply critical thinking to ensure data integrity and quality control is applied to each dataset, model and other analysis prior to presenting with internal customers
- Coordinate with different functional teams to operationalize, and monitor machine learning solutions
- Apply statistical methodologies such as cluster and regression analysis, if necessary.
- Act as a proponent of data science/analytics to senior leadership and others by being able to explain the benefits of machine learning, and other techniques.
- 6 Months of experience (relevant academic internships & projects can be considered in lieu of professional experience) with machine learning, statistical modeling, and data mining techniques.
- Bachelor or Master's degree in highly quantitative field (computer science, mathematics, machine learning, statistics) or equivalent experience
- Proficiency in either R or Python
- Proficiency in data sourcing/manipulation in SQL
- Experience applying various machine learning techniques, specifically neural networks and gradient boosted machines, and understanding the key parameters that affect their performance
- Strong data visualization skills using open source tools (plotly, ggplot2, shiny)
- Experience with both supervised and unsupervised modeling techniques

This is, simply, exactly what I have experience in. But I haven't done anything in the field recently.

## What I Want to Do
I am terrified of applying to jobs, so I am going to compromise between not applying and applying right now. First I have to prep my past work on the subject:
- [ ] [[portfolio cleanup|Combine my work]] from Mazidi's NLP and ML classes into one portfolio in markdown
- [ ] Host it on my URL with a home page summary that links to my linkedin and github
- [ ] Clean up my github
- [ ] Clean up my Resume to be focused on Data Science
- [ ] Review focus points on the list of requirements:
- [ ] Python basic programming questions
	- [ ] Proficiency with ggplot2, pandas, matplotlib, PyTorch, and scikit-learn
- [ ] Mild R review
	- [ ] Neural network review, and how to use dimensionality reduction and gradient descent to improve ML models


