diff --git a/docs/ML Work/Classification.md b/docs/ML Work/Classification.md deleted file mode 100644 index 5172be0..0000000 --- a/docs/ML Work/Classification.md +++ /dev/null @@ -1,361 +0,0 @@ ---- -title: "Classification" -author: "Zachary Canoot & Gray Simpson" -share: true -category: "ML Work" ---- -# Classification -This time we will be using linear models in order to classify observations. Linear models like logistic regression and Naive Bayes work by finding the probability of a target variable given a predictor variable. This means we are predicting a class as opposed to a continuous value like in linear regression. These models are great for data with outliers and are easy to implement and interpret. Linear regression isn't very flexible however, and Naive Bayes makes a naive assumption that predictors are independent. - -## What is Our Data? -The weather data we used in our quantitative didn't have a suitable categorical target field, so we are switching to [income census data](https://www.kaggle.com/datasets/rdcmdev/adult-income-dataset). The data has a great binary classification in the form of an IncomeClass attribute that only states whether a given person's income is below or above 50k. We have plenty of categories for each person, and continuous measurements like age and work hours. - -The census itself is from the year 1994, and spans various socieo-economic groups. We both trying to predict this income classification based on all of the data, as well as just get an understanding of some key predictors in the data. - -With IncomeClass as our target, lets analyze the data! - -### Reading the Data -The data is stored as two files, with rows just delimited by commas, so we read them in to one whole data frame, and label the headers manual using our source as a reference. It's worth noting that this data was extracted with the intention of creating a classification model, so the two files are meant to be training and test data, but we are going to re-distribute the data later. -```r -income_train <- read.table("adult.data", sep=",", header=FALSE) -income_test <- read.table("adult.test", sep=",", header=FALSE) -income <- rbind(income_test, income_train) -colnames(income) <- c("Age", "WorkClass", "Weight", "Education", "YearsEdu", "Marital-Status", "Job", "Relationship", "Race", "Sex", "CapitalGain", "CapitalLoss", "HoursWorked", "NativeCountry", "IncomeClass") -#Just to check to make sure it read properly -str(income) -``` -Now we want to turn the qualitative data into factors. - -Find all attributes of income that are non-numeric - - sapply() returns a logical object of every attribute run through the given function - - which() returns all of the true indices of a logical object - - income[,] extracts the attributes (See help(Extract)) - - We then lapply, with as.factor forcing them to be factors in a list - -Then just factor them. -```r -# Note here that while sapply returns a vector, lapply returns a list -income[, sapply(income, is.character)] <- lapply(income[, sapply(income, is.character)], as.factor) -# Checking our work -str(income) -``` - -Now the data is a bit cleaner we can start to look at it! -```r -summary(income) -``` -Now that we can really see our factor's options, I see a couple skewed data points: - - Twice as many men as women! Hope those numbers are better in 2022! 
 - A large percentage of the data is for natives to the US, which is kind of expected
 - Weight: this represents how much of the overall population the census takers thought a particular row represented. I must admit that, at the time of writing, I don't know how to account for statistical weight, but considering our model only needs to match training data, not other data from 1994, we are safe to ignore it.

The data looks very clean! The one anomaly is how the target column, IncomeClass, is stored: some levels have a "." at the end, which we would like to remove. So let's go ahead and condense that, remove the Weight attribute, and create our training and test data.

```r
# Simply reassign the levels
levels(income$IncomeClass) <- c("<=50k", "<=50k", ">50k", ">50k")
levels(income$IncomeClass)
# Then remove the Weight attribute using its index
income <- income[, -3]
income
```

Then we are good to start exploring!

## Training Data Exploration
### Splitting Training Data
We split the data into training and test sets using an 80/20 split.
```r
set.seed(42069)
trainindex <- sample(1:nrow(income),nrow(income)*.8,replace=FALSE)
train <- income[trainindex,]
test <- income[-trainindex,]
# Cleaning up earlier data
rm("income", "income_test", "income_train")
```

### Textual Measurements
And what does that training data look like?

We could use individual metrics, like the mean, or count the levels of our factors:
```r
mean(train$Age)
nlevels(train$WorkClass)
```

But we can just do that in `summary()`.

```r
summary(train)
```
The summary above is good for making sure there are no errors in the data, and for spotting skews we can deal with. For this data, there sure are a lot of men native to America, but as said earlier that is expected. Looking a bit more:

```r
sum(is.na(train))
head(train)
tail(train)
```
We get an example of what's at the start and end of the data set, and make sure there are no NAs. The census people really keep their data clean.

For one more look, let's see some correlation data. Curious how much capital loss went up with age? We can see below... well, not much honestly.
```r
cor(train$Age, train$CapitalLoss)
```

#### Text Analysis Conclusion
We fear the skew of our data towards one type of person (married men about to hit their 40s) will make the models we produce perform well on our dataset but fail to achieve any real-world accuracy. Obviously, if this model were actually destined to predict in the real world whether people's income was above or below a certain level (in the 1990s), then if we had all this data we would probably already know their income. So the model is a pointless but fun experiment...

Regardless, it is worth noting that a transformation of the data before running logistic regression or Naive Bayes could produce better results, but that is beyond the scope of this experiment.

While it is probably a realistic distribution of income class (3 people with less than 50k for every person over 50k), the model may just guess that everyone doesn't make that much money due to the skew. This is actually a lot more important than skewed predictors, as our eventual precision/recall could be quite bad. For now, simply observing this is good enough, but it should be considered for the final analysis (and perhaps in our comparison between Bayes and logistic regression).
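Before moving on, a quick sketch to make that class skew concrete. This just tabulates the `IncomeClass` factor in the training split built above; exact counts depend on the seed:

```r
# Raw counts and proportions of the target classes in the training split
table(train$IncomeClass)
prop.table(table(train$IncomeClass))
# A roughly 3:1 split means a model that always guesses "<=50k" is already
# about 75% accurate, which is the baseline any classifier here has to beat.
```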
- -### Visual Analysis -We want to see how our target, IncomeClass relates to our numerical data: -```r -plot(x = train$IncomeClass, y=train$Age, ylab="", xlab="Income", main="Age") -plot(x = train$IncomeClass, y=train$CapitalLoss, ylab="", xlab="Income", main="Capital Loss") -plot(x = train$IncomeClass, y=train$CapitalGain, ylab="", xlab="Income", main="Capital Gain") -plot(x = train$IncomeClass, y=train$HoursWorked, ylab="", xlab="Income", main="Hours Worked") -``` - -Numerical trends are just easier to spot, especially the effect of age on IncomeClass. You can definitely see in the ease graphs, particular age and hours worked, that there are *some* grounds to predict this income classification based on the predictor data. - -For another view: -```r -cdplot(train$Age, train$IncomeClass) -breaks <- (0:10)*10 -plot(train$IncomeClass ~ findInterval(train$HoursWorked, breaks)) -plot(train$Sex, train$IncomeClass) -``` - -Above we can see a couple trends relating to Income Class: - - Women don't make as much as men - - It seems the more hours worked, the higher your chances of making it over 50k - - Right around 50 years old is when people were the most likely to make >50k - -#### Visual Analysis Conclusion -There are so many different factors in this data, that we think assuming the factors are independent could harm the -eventual accuracy of our linear models. While we can graph individual factors relation to the target, there are complicated relationships between the predictor data. We may be able to guess that more education would lead to a higher income, but an in-depth analysis of how gender or native country may hamper access to education isnt represented by just the relationship from gender to income. To the final product, it just *looks* like you can bet women make less money, even if that may be due to a compaction of other factors. - -Just a couple trends are seen above, and they still tell us that there is some merit to this data being alble to predict relations between our predictors and our target. Now it is time to see if all of those predictors together have a good chance of classifying them into the >50k or <=50k levels. - -## Classification Regression -### Logistic Regression -```r -glm1 <- glm(IncomeClass~., data=train, family=binomial) -summary(glm1) -``` -Ah! Well we sure do get to get to view the impact of every level (dummy variable) on the output model. Before analyzing the coefficients predicted by the model, I want to examine which attributes were better for the model as compared to others. - -#### Explanation -The data produced by the model is the coefficients of each predictor. The coefficient represents the effect the value of the predictor has on our target. If we have a positive coefficient like age, as age goes up we can expect the probability of our target (IncomeClass) to go up. The final model then considers each of these coefficients in it's prediction. Different parts of the data are: -- Deviance Residuals: -- The Null Deviance: -- Residual Deviance: -- Degrees of Freedom: -- AIC: -- Fisher Scoring Iterations: -- Standard Error: -- Z Value: -- P Value: - -#### Looking at P-Values -A coefficient estimate's p-values can tell us which features are valuable predictors. However, because the data is mostly qualitative, each level of each factor has a different impact on the data. - -WorkClass seems like it is a good predictor *overall*, but if a given person's WorkClass is Never-worked, well the p-value is huge! 
Now, obviously if you have never worked your income isn't going to be very high, and the model estimates a high negative correlation. Yet the P-Value is super high! - -This could be due to a number of factors: -- The sample size of people who have never worked in this data is much smaller than the total population. -- Our target factor is skewed, so this predictor can't differ too much from the null hypothesis -- People who have never worked have varying life experiences, so the final accuracy of their coefficients isn't going to be able to fit the data - -As humans we can see the this coefficient should be significant, so perhaps this isn't the best dataset for logistic regression. The summary of the model basically is this: - -> While factors that you would expect to negatively impact income class do have large negative coefficents their p-values are very large because the overall target is very skewed (probably) towards what they are predicting (low income). - -#### Probability Warning -Another issue with the data is the warning: - -> `Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred` - -This error occurs when our model fits the data so well it is most likely too perfect. This means there is *somewhat likely* an error in our data. We can check it by looking at a couple predictions: - -```r -head(predict(glm1, train, type="response"), 30) # Looking at some probabilities -``` - -Looking at just 30 fitted probabilities we see that not every single probability is 1 or 0, but another warning: - -> Warning: prediction from a rank-deficent fit may be misleading - -This means our number of linearly independent columns does not equal the number of parameters. Funny enough, the actual model throws out what it believes are perfectly colinear variables, causing this warning. The solution would then be to remove the colinear attributes, which will be done in just a moment. - -#### Initial Impressions - -Dismissing those issues, good predictors are: -- Age -- Work Class -- Education (Specifically higher education) -- Job -- Marriage Status -- Sex -- Hours Worked - -This model makes me wonder what would happen if we selected a sample from this dataset that is less skewed, but I'm unsure what this would do to the accuracy of this model in the real world. - -#### Improving the Model -We wanted to see if removing predictors would help the overall accuracy, especially given that our predictors are somewhat dependent on each other. A brief search revealed that the anova function can show how adding each predictor effects the model. - -```r -anova(glm1, test="Chisq") -``` - -Looking below, we can tell that each addition to the model is statistically relevant - -#### Conclusions -There are issues with the data, mostly a high bias and a skewed target variable, but our current model still could give good predictions given a similar data set. If you took another sample of census data just after this one it could probably predict income class a bit - -### Naive Bayes Model -```r -library(e1071) -nb1 <- naiveBayes(train$IncomeClass~., data=train) -nb1 -``` - -Naive Bayes produces a model that first finds the prior probability (A-priori, or the probability of having <=50k or >50k with no considerations of other data) and then finds the probability of the income given each condition independently. For example the table for Sex states that the probability that someone is female given that you make less than 50k is ~40%, while if a person makes more than 50k the chance they are a woman is ~15%. 
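Those per-level probabilities can also be pulled straight from the fitted object instead of being read off the long printout. A small sketch, assuming the usual structure of an `e1071::naiveBayes` object:

```r
# Prior class distribution the model starts from (the "A-priori" table)
nb1$apriori
# Conditional table for Sex: one row per income class, one column per level
nb1$tables$Sex
# For a continuous predictor like Age, the table instead holds the per-class
# mean and standard deviation used by the Gaussian likelihood
nb1$tables$Age
```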
- -We also see the results for quantified predictors. For a continuous predictor like age, the mean age for people <=50k is 36.85352 while people >50k are older at a mean of 44.32006 years old. - -The model may just be finding the independent probabilities of the target event given each predictor but using all of the probabilities at once can provide a pretty good guess. Good enough to predict our training data! - -#### Issues in the Data -It's worth noting once again that our predictors may not be completely independent but our model here assumes they are. That is why we call it naive! With such a large amount of data, probability can overcome the shortcomings of this assumption and we could get reasonably accurate predictions - -### Predictions -```r -p1 <- predict(glm1, newdata=test, type="response") -pred1 <- ifelse(p1>0.5, ">50k", "<=50k") -head(pred1) -head(test$IncomeClass) -cm1 <- caret::confusionMatrix(as.factor(pred1), reference=test$IncomeClass) -cm1 -``` - - - -```r -p2 <- predict(nb1, newdata=test, type="class") -head(p2) -head(test$IncomeClass) -cm2 <- caret::confusionMatrix(as.factor(p2), test$IncomeClass) -cm2 -``` - -```r -cm1$byClass -``` - - -#### Initial conclusion -The initial conclusion to be drawn from our predictions is that our accuracy for both our models is okay, and our logistic regression model did better than our Naive Bayes. This could probably be due to Naive Bayes often doing better with small data sets while logistic regression works better with large datasets. On the other hand the logistic regression model might have still been overwhelmed by the amount of factors, and the accuracy was only ~84%. - -The confusion matrix tells us True Positive, False Positive, True Negative, and False Negative results from applying the model to the test data. We can use the ratios between these numbers to evaluate useful metrics like accuracy or sensitivity. - -#### The Confusion Matrix -``` - Reference -Prediction <=50k >50k - <=50k 6898 1225 - >50k 498 1148 -``` -Just for an example we are looking at the naive bayes confusion matrix. -- 6898: The number of True Positives -- 1148: The number of True Negatives -- 498: The number of False Negatives -- 1225: The number of False Positives - -We can use these to calculate other metrics - -#### Accuracy -``` -Logistic R.: ~85% -Naive Bayes: ~82% -``` - -The diagonals, or our true results, divided by all of our predictions is our accuracy, or the percentage we were correct. As you can see, our logistic regression model was accurate more of the time. Most likely because it thrived more with the large amount of data. - -#### Sensitivity & Specificity -``` -Logistic R.: 0.9287 and 0.5874 -Naive Bayes: 0.9327 and 0.4838 -``` -Naive Bayes had a higher sensitivity, which is the number of true positives out of true positives + false negatives (the number of positives in the data). If we were trying to perhaps locate all people with "low" income but didn't care about our accuracy with people above 50k, the stat shows naive bayes could be useful. - -Specificity is the measure of true negatives in the negative class. We can tell then that we were much better at identifying our people with <=50k income than people with >50k income. However, logistic regression was still better than Naive Bayes in this stat. - -Well you ignore part of the data and perhaps get to ignore issues ini your model (like ignoring a bunch of false negatives), these are great for getting what matters out of data. 
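To tie those numbers back to the Naive Bayes confusion matrix shown above, both ratios can be recomputed by hand. A small sketch using the counts from that matrix, with <=50k treated as the positive class:

```r
# Counts taken from the Naive Bayes confusion matrix above
TP <- 6898  # predicted <=50k, actually <=50k
TN <- 1148  # predicted >50k,  actually >50k
FN <- 498   # predicted >50k,  actually <=50k
FP <- 1225  # predicted <=50k, actually >50k

TP / (TP + FN)  # sensitivity, ~0.9327
TN / (TN + FP)  # specificity, ~0.4838
```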
#### Kappa
```
Logistic R.: 0.5519
Naive Bayes: 0.4648
```

Whoa! These aren't the best numbers, but considering this is a measure of accuracy that corrects for predictions made by chance, I'm surprised they are as high as they are. The data set was skewed, so it seemed a large margin of our models' success was due to random chance. According to a reference on kappa scores, though, these numbers are in "moderate agreement" with what is expected.

Kappa is great for datasets where the chance of randomly getting a prediction right is high. Of course, there isn't a consensus on what the number means on a scale, but it's still generally useful.

#### ROC Curves and AUC

```r
library(ROCR)
head(p1)
head(test$IncomeClass)
pr <- prediction(p1, test$IncomeClass)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
# Compute AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

```r
library(ROCR)
p2raw <- predict(nb1, newdata=test, type="raw")[,2]
pr2 <- prediction(p2raw, as.numeric(test$IncomeClass))
prf2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
plot(prf2)

# Compute AUC
auc <- performance(pr2, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

#### Matthews Correlation Coefficient (MCC)
```r
# Logistic Regression
mltools::mcc(as.factor(pred1), test$IncomeClass)
```

```r
# Naive Bayes
mltools::mcc(p2, test$IncomeClass)
```

0.4771213 is smack dab in between a perfect model (1) and a model that is no better than chance (0). Pretty good!

### Strengths and Weaknesses
Logistic regression is essentially attempting to draw a line between classes. It ends up being computationally inexpensive and easy to understand, and it does its job well if the classes are easy to separate. But because of its simplicity as a line, it just isn't flexible enough to capture complex non-linear decision boundaries. Naive Bayes is also simple, with the added bonus that it works well in high dimensions (complex data sets) *if* they aren't too big. It is simple, however, because it assumes variables are independent, and it ends up lagging behind on larger data sets.

### Summary of Metrics
Accuracy, being the ratio of correct predictions to all predictions, is broadly useful. But often we are searching for subsections of accuracy. Sensitivity is good for measuring how well we detect one (the positive) class while ignoring the other. Specificity, on the other hand, is the ratio of correctly identified negatives within the negative class. This means we can use these metrics to see how well our model guesses the class that matters to us. If we want general accuracy that also accounts for the chance of getting the prediction randomly correct, Kappa is great for checking that.

Now ROC... well, it graphs the true positive rate against the false positive rate (sensitivity against 1 - specificity). Unfortunately we tried until the deadline to get this to work for Naive Bayes, but we swear we understand what it means! The name, Receiver Operating Characteristic curve, comes from signal detection theory, so it doesn't do much to remind us what it means. Basically, it graphs a model's trade-off between sensitivity and specificity, and the area under the curve (AUC) then represents how well the model is capable of distinguishing between classes.

The MCC is a metric that gives a good value only if you get a reliable rate in all 4 cells of the confusion matrix. The values are considered in proportion to the sizes of the positive and negative classes.
Rather then combining the sensitivity and specificity of a metric into a single metric (like with an F1-Score), MCC considers the size of of negative samples. MCC's account for class distribution makes it great at providing an accuracy rating for the whole model rated from -1 to 1. - - -## Conclusion -We have 1 large takeaway from this data, linear data has limitations, and none of that is helped by having a skewed data set. In the future we would like to select a data set that has less of skewed target, or at least try to sample this data at a better ratio again. It was fun to look at though! diff --git a/docs/ML Work/Regression.md b/docs/ML Work/Regression.md deleted file mode 100644 index 9bea7d5..0000000 --- a/docs/ML Work/Regression.md +++ /dev/null @@ -1,242 +0,0 @@ ---- -title: "Regression" -author: "Zachary Canoot & Gray Simpson" -category: "ML Work" -share: true ---- -# Regression - - Using the Hungary Dataset [Weather in Szeged 2006-2016](https://www.kaggle.com/datasets/budincsevity/szeged-weather) -Found on Kaggle. - - Our goal is to see if we can see how other weather factors, such as Wind Speed and Humidity, relate to the difference between Apparent Temperature and actual Temperature. Though we identify apparent temperature as a very good predictor of the difference, we do not use this in this assignment as we are interested in exploring more the other factors that influence the disparity. - - Linear Regression is one of many supervised models of Machine Learning that functions by finding a trend in given data, using one or more input parameters to find a line of best fit, though it is not always a straight line. As shown by the name, linear regression models assume that the relationship between relevant attributes is linear. The model will predict coefficients for the effect of each predictor. It has low variance due to its linear nature, but with such an assumption, it will also be very high bias. - - -### Data Exploration - First, we read the data in, then divide our data up into training and testing. We have to add a column for the data we are interested in learning about, however, it is simply the difference between two other columns. -```r -df <- read.csv("weatherHistory.csv") -#Here we'll add the data that we are interested in: difference in Apparent Temp and Temp. -df$Temperature.Diff <- df$Temperature..C. - df$Apparent.Temperature..C. -#We'll also convert some data to factors for ease. -df$Precip.Type <- as.factor(df$Precip.Type) -df$Summary <- as.factor(df$Summary) -str(df) -#Now we'll divide into train and test. -set.seed(8) -trainindex <- sample(1:nrow(df),nrow(df)*.8,replace=FALSE) -train <- df[trainindex,] -test <- df[-trainindex,] -``` - - - -Next, we want to explore our training data. -```r -names(df) -dim(df) -head(df) -colMeans(df[4:11]) - -#Noticing the mean of 0 of df$Loud.Cover, lets check its sum in specific. -sum(df$Loud.Cover) - -#Let's see if we have any NAs more generally, now. -colSums(is.na(df)) -#Okay, so we don't have any NAs. - -#Now, lets see how R would summarize this data. -summary(df) -#We would also like to look at this particular aspect to see how the different values pan out. -summary(df$Summary) - -``` - One thing we notice is that there is an attribute labeled 'Loud Cover' that all values are 0 in. Therefore, this will be an aspect that we will ignore. - - However, in the summary, we can notice that there is a minimum value of 0 on Pressure, which has an average and max similar to each other. We can assume a 0 is a NA value here. 
- - The other values that are 0 we can't make assumptions on validity. - - If Wind Speed is 0, so will Wind Bearing, we can realize from looking through the data in passing. - Since there is no place in earth without wind, we can also state that these values aren't accurate. After we look a the data some more, we'll decide how we want to clean them up. - - We cannot come to a clear resolution on other attributes. - - - We'll pull up some graphs to get a better idea of what we have to do, now. Yellow dots are null precipitation days, green is rain, and blue is snow. -```r -cor(df[4:7]) -boxplot(df$Temperature.Diff,main="Apparent Temperature") -boxplot(df$Humidity,main="Humidity") -boxplot(df$Wind.Speed..km.h.,main="Wind Speed") -#pairs(df[4:7],main="Temperature, Humidity, and Wind Correlations") -plot(df$Temperature.Diff,df$Wind.Speed..km.h.,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) -plot(df$Temperature.Diff,df$Humidity,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) -plot(df$Temperature.Diff, df$Temperature..C.,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) - -``` - We can notice that there are some outliers in humidity at the 0 line we'll want to sort out, as well. - - We also notice a sort of "knife" shape in the data when being compared, at around 10 degrees Celsius. - By and large, we have such a large amount of data, it's difficult to notice quick correlations, aside from wind speed and temperature when it is snowing/below freezing. - - That's why we have Machine Learning, we suppose, even if it implies that linear regression may not be the best fit for this data set. - - While we would like to predict the results without the base temperature, we can see that is is very clearly related and helpful. - - - - Now, we'll clean up the data according to what we found. We'll clean up only what is referenced, but we will delete what we are uncertain about, since we have such a large amount of data. -```r -df[,6:7][df[,6:7]==0] <- NA -df[,13:13][df[,13:13]==0] <- NA -df <- na.omit(df) -summary(df) -``` - - - Now we'll do some more graphs. -```r -cor(df[4:7]) -boxplot(df$Temperature.Diff,main="Apparent Temperature") -boxplot(df$Humidity,main="Humidity") -boxplot(df$Wind.Speed..km.h.,main="Wind Speed") -#pairs(df[4:7],main="Temperature, Humidity, and Wind Correlations") -plot(df$Temperature.Diff,df$Wind.Speed..km.h.,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) -plot(df$Temperature.Diff,df$Humidity,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) -plot(df$Temperature.Diff, df$Temperature..C.,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) - -``` - - Before we move on to linear regression, we have one more question before we investigate disparities in the data. - - Since this project has multiple contributors, perhaps there are even more hidden NA's. - - Let's see what the data looks like if we remove data where there is no difference. -```r -diffOnly <- df -diffOnly[,3:4][diffOnly[,3:3]==diffOnly[,4:4]] <- NA -diffOnly <- na.omit(diffOnly) -plot(diffOnly$Temperature.Diff,diffOnly$Wind.Speed..km.h.,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) -plot(diffOnly$Temperature.Diff,diffOnly$Humidity,pch=21,bg=c("yellow","green","blue")[as.integer(df$Precip.Type)]) -#These graphs show the overall descriptor of the weather. There are 27 options. 
-plot(diffOnly$Temperature.Diff,diffOnly$Wind.Speed..km.h.,pch=21,bg=c("white","aquamarine","blue","brown","green","yellow","red","cyan","darkgray","darkgreen","magenta","orange","dodgerblue","forestgreen","gold","sienna","thistle","violet","springgreen","slateblue","wheat","tomato","yellowgreen","tan","lightblue","hotpink","darkred")[as.integer(df$Summary)]) -plot(diffOnly$Temperature.Diff,diffOnly$Humidity,pch=21,bg=c("white","aquamarine","blue","brown","green","yellow","red","cyan","darkgray","darkgreen","magenta","orange","dodgerblue","forestgreen","gold","sienna","thistle","violet","springgreen","slateblue","wheat","tomato","yellowgreen","tan","lightblue","hotpink","darkred")[as.integer(df$Summary)]) -plot(df$Temperature.Diff, df$Temperature..C.,pch=21,bg=c("white","aquamarine","blue","brown","green","yellow","red","cyan","darkgray","darkgreen","magenta","orange","dodgerblue","forestgreen","gold","sienna","thistle","violet","springgreen","slateblue","wheat","tomato","yellowgreen","tan","lightblue","hotpink","darkred")[as.integer(df$Summary)]) - -``` - This does not seem to have effected data trends very much. - - Looking at the data based on Summary does not help us much, but we can notice that the cloud of data that does not seem to have much trend is in the violet(18) and slateblue (20), or rather Mostly Cloudy and Partly Cloudy. There isn't much helpful we can do with that information at this time, however. - - Since that seemed to cause no changes, but may have helped clean it up a small amount, let's write it to df. We'll also clean up the train and test data again. -```r -df <- diffOnly -trainindex <- sample(1:nrow(df),nrow(df)*.8,replace=FALSE) -train <- df[trainindex,] -test <- df[-trainindex,] -``` - - - Now, we'll move on to the regression. - - -### Linear Regression: Simple - Let's start with a linear regression model with one predictor, wind speed, and summarize it. -```r -simplelinreg <- lm(Temperature.Diff~Wind.Speed..km.h.,data=train) -summary(simplelinreg) -``` - So, it's better than nothing it seems. The R^2 isn't great, but we can see that there's enough of a correlation to count. We could get a better reading by using the actual temperature, since those are very closely related, but one goal of this is learning to understand how the change in temperature works based on other factors. - - Lets plot the residual errors, and evaluate. -```r -par(mfrow=c(2,2)) -plot(simplelinreg) -``` - We can see that the trends are fairly close to the given lines. They are in no way perfect, but they seem to get the gist. The most concerning piece seems to be Residuals vs Leverage. The given line implies that we do have outlier (y-axis) leverage (x-axis) values that may influence our trend line. It may be something such as an issue during a case of severe weather, or a broken device used in data collection at that time. - - As well, this is only the data from our simple regression. We will be able to see how other models compare at a later time. - - - -### Linear Regression: Multiple - Let's up the complexity, now. We'll build a multiple linear regression model, and see if we can improve the accuracy. -```r -multlinreg <- lm(Temperature.Diff~Humidity+Wind.Speed..km.h.+Precip.Type,data=train) -summary(multlinreg) -par(mfrow=c(2,2)) -plot(multlinreg) -``` - We understand from our data exploration that Humidity, Wind Speed, and Precipitation Type all relate to the data in different ways. 
We can find different trends depending on what we're looking at, so we can ask the model to reference all of that data when its processing now. When the precipitation type was rain, it didn't add much to figuring things out, but knowing that it was in the snow range was very helpful. - - It's doing better than our simple model, getting the R^2 up much more and a lower RSE. The Residuals vs Leverage chart looks like it has encountered some issues, however the two entirely separate sections does match up with some inconsistent trends that we noticed when we were graphing the attributes we planned on working with. The Residuals vs Fitted and Scale-Location graphs look comparatively stellar. Normal Q-Q is about the same. - - -### Linear Regression: Combinations - Now let's go a step even farther. We'll use a combination of predictors, interaction effects, and polynomial regression to see if we can get even more accurate. -```r -combolinreg <- lm(Temperature.Diff~poly(Humidity*Wind.Speed..km.h.)+Precip.Type+Summary,data=train) -summary(combolinreg) -par(mfrow=c(2,2)) -plot(combolinreg) -``` - Here, we added Summary as well as an interaction effect with precipitation. We made this decision based on the cloud of Partly Cloudy values that didn't seem to follow other data, and we can see that some specific Summary values were quite helpful in the result, and some were not. - - Overall, though, R^2 is up a bit more, and RSE is down. It's not a huge change, but it does help. Humidity and Wind Speed seemed to have some similar trends and attributes when we graphed them, and the type of weather is related to the type of precipitation, which is why we had those certain attributes marked as an interaction effect. - - The residuals are now very different from the other two models' results. It seems like the values are much more as intended, horizontal where they should be to indicate a good fit, though Q-Q seems to be the same. The outlying x and y observations also seem to be different than the ones the other models denoted. - - -### Evaluation - With each new type, using more aspects of different Machine Learning model results, we were able to increase the model's ability to find lines within the data, which should help us when we predict our results. In general, the more ways we let the data interact, the better the resulting model seemed to be, so long as we did not do it blindly. Depending on how related certain attributes are, they need to be treated differently, since some attributes are a result of other attributes that may be included in the data. - - So, in the end, the combination of interaction effects and multiple regression provided the best trends. Simple regression did not seem to fit the data well at all in comparison to the other models. The combination data may have only been a little about +.02 better on R^2 than multiple regression, however it is a significant enough change to be useful in data prediction. - - - -### Predictions - Using the three models, we will predict and evaluate using the metric correlation and MSE. 
-```r -simplepred <- predict(simplelinreg,newdata=test) -simplecor <- cor(simplepred,test$Temperature.Diff) -simplemse <- mean((simplepred-test$Temperature.Diff)^2) -simplermse <- sqrt(simplemse) -multpred <- predict(multlinreg,newdata=test) -multcor <- cor(multpred,test$Temperature.Diff) -multmse <- mean((multpred-test$Temperature.Diff)^2) -multrmse <- sqrt(multmse) -combopred <- predict(combolinreg,newdata=test) -combocor <- cor(combopred,test$Temperature.Diff) -combomse <- mean((combopred-test$Temperature.Diff)^2) -combormse <- sqrt(combomse) -#Output results -print("-------Simple Model-------") -print(paste("Correlation: ", simplecor)) -print(paste("MSE: ", simplemse)) -print(paste("RMSE: ", simplermse)) -print("-------Multiple Model-------") -print(paste("Correlation: ", multcor)) -print(paste("MSE: ", multmse)) -print(paste("RMSE: ", multrmse)) -print("-------Combo Model-------") -print(paste("Correlation: ", combocor)) -print(paste("MSE: ", combomse)) -print(paste("RMSE: ", combormse)) - - -anova(simplelinreg,multlinreg,combolinreg) -``` - Judging by these values, we can verify our evaluation that the combination linear regression model was the best model for our data. The low MSE (mean squared error) in comparison to the simple and multiple models means that the mistakes made were smaller than others. The RMSE (root MSE) says that we were off, on average, .939 degrees Celsius. While this isn't entirely accurate, the range of the value was around -5 degrees to +10 degrees, and if someone was predicting the weather the average person would likely be tolerant of a one degree difference. In that sense, the multiple regression would also be considered accurate enough to be helpful. The simple model is not terrible either, however the low correlation and high MSE do support the fact that there is much more room for improvement. - - The difference in temperature is not extremely related to the wind speed, as we attempted in that first model. While it is a factor, the apparent temperature is a multifaceted issue better represented by numerous other effects, such as Humidity, precipitation, etc. - - Our results were also very good considering we were purposely avoiding using one aspect of given data, and that there were disparities showing what may have been differences due to how different contributors to the data set initially reading data in different ways. - - In summary, we can extract a surprising amount of data about the disparity in temperature based on wind, humidity, precipitation, and even descriptors of the sky. The more we combine usage of different attributes, acknowledging how they interact and work together, the better a result we can get. - - - - diff --git a/docs/ML Work/RegressionSVM.md b/docs/ML Work/RegressionSVM.md deleted file mode 100644 index 8213e2b..0000000 --- a/docs/ML Work/RegressionSVM.md +++ /dev/null @@ -1,353 +0,0 @@ ---- -title: "Regression with SVM" -author: Zachary Canoot -share: true -category: ML Work ---- - -# SVM Regression - -Support Vector Machines can divide data into classes by a hyperplane in -multidimensional space. This line separates classes by finding minimum distance -of margins between support vectors. Once we calculate support vectors for our -model (given an input of slack in the margins optimized with validation data), -we can then classify the data in relation to the margins on the hyperplane. - -In the case of Regression, we apply this logic to fit a line to the data (as -opposed to divide the data). 
Classification minimizes the margins such that all -examples on either side of the margins are assumed to be classified correctly. -SVM Regression's minimization function will instead find a hyper plane that fits -the data within a certain accuracy. Specifically, the support vectors have the -largest amount of error (distance) from the hyperplane, and everything within -those support vectors (the margin) is assumed correctly fitted. Thus the -hyperplane fits the data like simple regression. - -We are going to apply this algorithm to a simple 1 target, 1 predictor data set, -and get a nice visual demonstration of the hyperplane. - -## Exploring our Data - -Our data, found on [Kaggle](https://www.kaggle.com/datasets/sohier/calcofi) is a -very detailed and large collection of samples of larval fish data. However, I'm -only interested in the temperature of the ocean water in relation to it's -salinity. So lets read it in and trim it down to just the necessary columns and -a handleable training size. After finishing the document, I realized I had to -trim the file size for github, so note that the actual data on kaggle is much -larger! - -> Note: We are going to sample the data to a smaller size right now, knowing -> that the last slide for SVM warned this may be neccessary - -We think it is worth leaving depth in, sense this will probably be a good -predictor of temperature, if not as easy to see visually. - -> I'm leaving in the 2 code chunks below as comments, but they are referring to -> data *before* I trimmed down the csv file to upload to github! - -```r -# This takes a second -# ocean_data <- read.csv("bottle.csv") -``` - -```r -# Selecting the wanted columns: 5= T_degC, 6=Depthm, and 7=Salnty -# df <- ocean_data[c(5,6,7)] -# Removing NAs -# which_nas <- apply(df, 1, function(X) any(is.na(X))) -# nas <- length(which(which_nas)) -# size <- nrow(df) -# ratio <- nas/size -# size <- format(size, big.mark = ",", scientific = FALSE) -# nas <- format(length(which(which_nas)), big.mark = ",", scientific = FALSE) -# sprintf("%%%.2f of the %s large dataset contains NA's. Removing %s", ratio*100, size, nas) -# df <- na.omit(df) -``` - -864,863 rows! That's still quite large! Lets reduce to exactly 10,000 - -```r -# set.seed(8) -# df <- df[sample(1:nrow(df),10000,replace=FALSE),] -# nrow(df) -# head(df) -``` - -```r -# write.csv(df,"ocean_data.csv") -``` - -We can then split into training, testing, and validation data - -```r -# The line below reads in the modified csv -set.seed(8) -df <- read.csv("ocean_data.csv") -spec <- c(train=.6, test=.2, validate=.2) -i <- sample(cut(1:nrow(df), nrow(df)*cumsum(c(0,spec)), labels=names(spec))) -train <- df[i=="train",] -test <- df[i=="test",] -vald <- df[i=="validate",] -``` - -## Graphical and Text Analysis - -Lets look at the data we are using to build our models: - -```r -summary(train) -``` - -Some things to note: - -1. A min depth of 0 is odd, but makes sense as a surface reading -2. The outliers don't seem that far, so this might be some good data to look at -3. From the mean values we can gather our average case: At a depth of of 228 - meters, the temperature was \~11 degrees Celsius (51.8 Fahrenheit), with a - Salinity of \~34 percent, which matches averages easily found online. -4. There weren't that many data reads in extremely salty waters (\>36%) - -I want to graph the correlation from salinity to temperature, a 3d graph of -depth, temperature, and salinity, etc. 
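The two-dimensional views below cover most of this, but for the 3-D view mentioned above, a minimal sketch could look like the following (this assumes the `scatterplot3d` package is installed; it is not used elsewhere in this write-up):

```r
library(scatterplot3d)
# One possible 3-D scatter of the three variables in the training split
scatterplot3d(train$Salnty, train$Depthm, train$T_degC,
              xlab = "Salinity", ylab = "Depth (m)", zlab = "Temperature (C)",
              pch = 19, angle = 55)
```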
- -```r -library(ggplot2) -theme_set(theme_minimal()) -mid <- mean(train$Depthm) -ggplot(train, aes(x=Salnty, y=T_degC)) + - geom_point(pch=19, aes(color = Depthm), size=1) + - geom_smooth(formula="y~x", method="lm", color="red", linetype=2) + - labs(title="Salinity vs Temperature", x="Salinity (Percentage)", y - ="Temperature (C)") + - scale_color_gradient2(midpoint=mid, low = "red", mid = "blue", high = "red", space = "lab") - -ggplot(train, aes(x=Depthm, y=T_degC)) + - geom_point(pch=19, aes(color = Salnty), size=1) + - geom_smooth(formula="y~x", method="lm", color="red", linetype=2) + - labs(title="Depth vs Temperature", x="Depth (Meters)", y - ="Temperature (C)") -``` - -We can see by this scatter plot, along with the convenient smoothing line that -there is definitely a trend or correlation in the data. The smoothing line shows -what the result of a linear regression might be on the model, and it's -confidence interval (indicated by the gray border on the line) indicates the -model... holds some water. - -Both Depth and Salinity have a strong correlation, but it seems Depth might have -an exponential decay like relationship with temperature. That makes me wonder -how the best way to portray these relationships. - -```r -cor(df) -``` - -Looking at the correlation as well as our graphs, simple visual analysis tells -us that both depth and salinity are useful predictors of temperature, and a -linear regression model would be useful. However, the relationships are complex, -and we may find that a SVM regression model with a polynomial kernel might yield -a good result. This is because it would be able to mold together our 2 -predictors. - -Lets have a multiple-predictor linear regression model to compare our SVM -results to and then carry on. - -```r -# Creating a lm using both predictors and graphing predictions -lm <- lm(T_degC ~ Salnty + Depthm, train) -pred <- predict(lm, newdata=test) -preddf <- data.frame(temp_pred=pred, salinity=test$Salnty) -summary(lm) -cor <-cor(pred, test$T_degC) -mse <- mean((pred-test$T_degC)^2) -sprintf("Correlation of our prediction: %s", cor) -sprintf("Mean Squared Error: %s", mse) - -ggplot(test, aes(x=Salnty, y=T_degC)) + - geom_point(pch=19, aes(color = Depthm), size=1) + - geom_point(pch=1, color="yellow",data=preddf, aes(x=salinity, y=temp_pred), size=.1) + - labs(title="Salinity vs Temperature", x="Salinity (Percentage)", y - ="Temperature (C)") + - scale_color_gradient2(midpoint=mid, low = "red", mid = "blue", high = "red", space = "lab") -``` - -## Performing SVM Regression - -Lets just create a model for each type, tune the hyperparameters, and analyze -the results. - -### Linear Kernel - -```r -library(e1071) -svmlin <- svm(T_degC~Depthm+Salnty, data=train, kernel="linear", cost=10, scale=TRUE) -summary(svmlin) -pred <- predict(svmlin, newdata=test) -cor_svmlin <- cor(pred, test$T_degC) -mse_svmlin <- mean((pred - test$T_degC)^2) -sprintf("Correlation of our prediction: %s", cor_svmlin) -sprintf("Mean Squared Error: %s", mse_svmlin) -``` - -That is a little worse then our baseline, lets try and tune it using a smaller -subsample - -```r -# Just using the ranges used in examples -tune_svmlin <- tune(svm, T_degC ~ Depthm + Salnty, data = vald, kernel="linear", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100))) -tune_svmlin$best.model -``` - -We ran it with 1000 rows, and got a cost value of 1, then ran it with 3000 and -got 100, then ran it with 5000, and then decided to wait for the 10000 to finish -and got.. well 100. 
This suggests I should increase the range of costs tried, but also that the linear model isn't the best solution for our data, considering the cost had to be so high.

```r
svmlin <- svm(T_degC~Depthm+Salnty, data=train, kernel="linear", cost=5, scale=TRUE)
pred <- predict(svmlin, newdata=test)
tuned_cor_svmlin <- cor(pred, test$T_degC)
tuned_mse_svmlin <- mean((pred - test$T_degC)^2)
sprintf("Correlation of our prediction: %s", tuned_cor_svmlin)
sprintf("Mean Squared Error: %s", tuned_mse_svmlin)
```

The error was higher and the correlation was lower, so we didn't improve. Tuning also barely affected the results!

### Polynomial Kernel

```r
svmpoly <- svm(T_degC~Depthm+Salnty, data=train, kernel="polynomial", cost=5, scale=TRUE)
summary(svmpoly)
pred <- predict(svmpoly, newdata=test)
cor_svmpoly <- cor(pred, test$T_degC)
mse_svmpoly <- mean((pred - test$T_degC)^2)
sprintf("Correlation of our prediction: %s", cor_svmpoly)
sprintf("Mean Squared Error: %s", mse_svmpoly)
```

Well, that didn't go well! MSE is even higher than the baseline, so I can try tuning it.

```r
# tune_svmpoly <- tune(svm, T_degC ~ Depthm + Salnty, data = vald, kernel="polynomial",
#                      ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# tune_svmpoly$best.model
```

This would take a while to run, and it would simply keep trying to increase the cost. But then that produces a model that overfits the training data! It seems this simply isn't the best way to fit this data, and it is a bit worse than simple regression.

### Radial Kernel

My only hope...

```r
svmrad <- svm(T_degC~Depthm+Salnty, data=train, kernel="radial", cost=5, gamma=.5, scale=TRUE)
summary(svmrad)
pred <- predict(svmrad, newdata=test)
cor_svmrad <- cor(pred, test$T_degC)
mse_svmrad <- mean((pred - test$T_degC)^2)
sprintf("Correlation of our prediction: %s", cor_svmrad)
sprintf("Mean Squared Error: %s", mse_svmrad)
```

Whoa, what a great result! That is a good correlation and what seems to be a pretty good MSE, and both are much better than our baseline linear regression. Let's try to tune it.

```r
tune_svmrad <- tune(svm, T_degC ~ Depthm + Salnty, data = vald, kernel="radial", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100), gamma=c(0.5, 1, 2, 3, 4)))
tune_svmrad$best.model
```

```r
svmrad <- svm(T_degC~Depthm+Salnty, data=train, kernel="radial", cost=10, gamma=1, scale=TRUE)
pred <- predict(svmrad, newdata=test)
cor_svmrad <- cor(pred, test$T_degC)
mse_svmrad <- mean((pred - test$T_degC)^2)
sprintf("Correlation of our prediction: %s", cor_svmrad)
sprintf("Mean Squared Error: %s", mse_svmrad)
```

It made the results marginally worse, which suggests some overfitting: the parameters chosen on the validation data did not transfer perfectly back to the training data.

Either way, good results! Let's graph it!

```r
preddf2 <- data.frame(temp_pred=pred, salinity=test$Salnty)

ggplot(test, aes(x=Salnty, y=T_degC)) +
    geom_point(pch=19, aes(color = Depthm), size=1) +
    geom_point(pch=3, color="green",data=preddf2, aes(x=salinity, y=temp_pred), size=.2) +
    geom_point(pch=3, color="yellow",data=preddf, aes(x=salinity, y=temp_pred), size=.2) +
    labs(title="Salinity vs Temperature", x="Salinity (Percentage)", y="Temperature (C)") +
    scale_color_gradient2(midpoint=mid, low = "red", mid = "blue", high = "red", space = "lab")
```

The green points indicate our new predictions; they sit just a bit lower than those of our linear regression model from the start, which suggests less susceptibility to outliers.
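Before the analysis, one compact way to line the models up side by side is to collect the metrics already computed above into a single table. This is a sketch that just reuses those variables; the linear and radial SVM rows use the tuned fits, while the polynomial row uses the untuned one:

```r
# Gather the test-set metrics computed in the chunks above
data.frame(
  model       = c("linear regression", "svm linear", "svm polynomial", "svm radial"),
  correlation = c(cor, tuned_cor_svmlin, cor_svmpoly, cor_svmrad),
  MSE         = c(mse, tuned_mse_svmlin, mse_svmpoly, mse_svmrad)
)
```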
## Analysis

I would love to visualize the radial kernel to get an idea of how it works on this data, but quoting StatQuest: "Because the Radial Kernel finds Support Vector Classifiers in infinite dimensions, it's not possible to visualize what it does". While I know that you can still get an approximation, I'm going to go ahead and skip figuring out how to do that for now.

The main topic to analyze here is why the radial kernel was better than the polynomial or the linear kernel. Each kernel represents a different method of transforming the data so that a hyperplane may divide the data, or in this case, fit a function.

- The linear kernel is simple: it fits a hyperplane to the data.

- The polynomial kernel transforms the data in such a way as to mimic adding more features to the data set, really just by mapping the input data to a polynomial of a higher degree. By mapping values into a higher-degree space, say the second degree, what is really a circular classification boundary can now have a straight line drawn through it.

- The radial kernel compares the distance between every two observations in the input data, and scales the data by the value of that distance. This mimics nearest neighbor, where the model predicts every value with increasing weight supplied to its neighbors. The kernel can then map the input to a higher (infinite) dimensional space where it is easiest to fit a hyperplane that best maximizes the margins of the model... it's not exactly easy to wrap your brain around.

Because we found in our initial analysis that the relationship between our two predictors and our target was a complicated combination of a not-so-linear relationship and a perhaps-exponential relationship, it makes sense that a complicated model like SVM with a radial kernel would have been able to find better results.

To think of this in terms of the data, the radial kernel was best able to interpret how depth related to temperature compared to salinity. As depth increased, it got colder, and salinity mattered less. The interactions between the predictors were important in this data! So while linear regression was good at approximating the relationship, the radial kernel was *probably* able to better fit the data.

## Link Dump!

I referenced the book for a lot of ggplot2 stuff, as well as code on SVM.

diff --git a/docs/ML Work/data exploration.md b/docs/ML Work/data exploration.md
deleted file mode 100644
index be372fb..0000000
--- a/docs/ML Work/data exploration.md
+++ /dev/null
@@ -1,253 +0,0 @@
---
share: true
category: ML Work
---
# Data Exploration
## The Premise
>"In class, we covered how to do data exploration with statistical functions in R. In this assignment, you recreate that functionality in C++ code. This will prepare us to write algorithms in C++ in future assignments"

For me this is both a review of C++ and a review of what correlation is.

### Notes
- I deliberated whether to return range as the min and max, or as the difference between the two. I eventually chose to just return a min and max.
- I can't get relative links to work at the moment, hope it's fine that it is linking to the file hosted on the main site

## Conclusion
### The Code
```cpp
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
#include <cmath>

using namespace std;

// TODO: Convert double vectors to taking in (explicitly) any numeric value
// Reference: - Iterators: https://www.geeksforgeeks.org/iterators-c-stl/
//            - What gets passed into sort: https://cplusplus.com/reference/iterator/RandomAccessIterator/

class Explore
{
public:
    // Calculate the sum of the vector
    double sum_vector(vector<double> vect)
    {
        double sum = 0;
        for (int i = 0; i < vect.size(); i++)
        {
            sum += vect[i];
        }
        return sum;
    }

    // Calculate the mean of a vector
    double mean_vector(vector<double> vect)
    {
        double mean = sum_vector(vect) / vect.size();
        return mean;
    }

    // Calculate the median of a vector
    double median_vector(vector<double> vect)
    {
        double median;
        // Use an iterator because it is probably better -internet
        vector<double>::iterator it;
        // Find the center if it is even or odd
        sort(vect.begin(), vect.end());
        if (vect.size() % 2 == 0) // If there is an even number of elements
        {
            it = vect.begin() + vect.size() / 2 - 1;
            median = (*it + *(it + 1)) / 2;
        }
        else // if there is an odd number of elements
        {
            it = vect.begin() + vect.size() / 2;
            median = *it;
        }
        return median;
    }

    // Calculate the range of a vector
    vector<double> range_vector(vector<double> vect)
    {
        vector<double> range = {max_vector(vect), min_vector(vect)};
        return range;
    }

    // Calculate the max of a vector (Just for range)
    double max_vector(vector<double> vect)
    {
        double max;
        vector<double>::iterator it;
        sort(vect.begin(), vect.end());
        it = vect.end() - 1;
        max = *it;
        return max;
    }

    // Calculate the min of a vector (Just for range)
    double min_vector(vector<double> vect)
    {
        double min;
        vector<double>::iterator it;
        sort(vect.begin(), vect.end());
        it = vect.begin();
        min = *it;
        return min;
    }

    // Calculate the covariance of two vectors
    // Cov(x,y) = E((x-x_mean)(y-y_mean)) / (n-1)
    double covar_vector(vector<double> x, vector<double> y)
    {
        double sum = 0;
        double mean_x = mean_vector(x);
        double mean_y = mean_vector(y);
        for (int i = 0; i < x.size(); i++)
        {
            float x_i_diff = x[i] - mean_x;
            float y_i_diff = y[i] - mean_y;
            float y_times_x_diff = x_i_diff * y_i_diff;
            // cout << x_i_diff << " * " << y_i_diff << " = " << y_times_x_diff << endl;
            sum = sum + y_times_x_diff;
        }
        return sum / (x.size() - 1);
    }

    // Calculate the correlation of two vectors
    // Cor(x,y) = Cov(x,y)/(standard_deviation(x)*standard_deviation(y))
    // Using the hint from the assignment:
    // "sigma of a vector can be calculated as the square root of variance(v,v)"
    double cor_vector(vector<double> x, vector<double> y)
    {
        double covar = covar_vector(x, y);
        double sigma_x = sqrt(covar_vector(x, x));
        double sigma_y = sqrt(covar_vector(y, y));
        return covar / (sigma_x * sigma_y);
    }

    // Run suite of statistical functions on a vector
    void print_stats(vector<double> vect)
    {
        cout << "Sum: " << sum_vector(vect) << endl;
        cout << "Mean: " << mean_vector(vect) << endl;
        cout << "Median: " << median_vector(vect) << endl;
        vector<double> range = range_vector(vect);
        cout << "Range: " << range[1] << ", " << range[0] << endl;
    }
};

int main(int argc, char **argv)
{
    ifstream inFS;
    string line;
    string rm_in, medv_in;
    const int MAX_LEN = 1000;
    vector<double> rm(MAX_LEN), medv(MAX_LEN);

    cout << "Opening file Boston.csv."
<< endl; - - inFS.open("Boston.csv"); - if (!inFS.is_open()) - { - cout << "Error opening file Boston.csv." << endl; - return 1; - } - - cout << "Reading line 1 of Boston.csv." << endl; - getline(inFS, line); - - // echo heading - cout << "Headings: " << line << endl; - - // read data - int numObservations = 0; - while (inFS.good()) - { - getline(inFS, rm_in, ','); - getline(inFS, medv_in, '\n'); - rm.at(numObservations) = stof(rm_in); - medv.at(numObservations) = stof(medv_in); - - numObservations++; - } - - rm.resize(numObservations); - medv.resize(numObservations); - - cout << "New Length: " << rm.size() << endl; - - cout << "Closing file Boston.csv." << endl; - inFS.close(); // Done - - cout << "Number of records: " << numObservations << endl; - - // Create an Explore object to use stats functions - Explore explore; - - cout << "\nStats for rm" << endl; - explore.print_stats(rm); - - cout << "\nStats for medv" << endl; - explore.print_stats(medv); - - cout << "\n Covariance = " << explore.covar_vector(rm, medv) << endl; - - cout << "\n Correlation = " << explore.cor_vector(rm, medv) << endl; - - cout << "\nProgram terminated." << endl; -} - -``` - -### Returns -```bash -Opening file Boston.csv. -Reading line 1 of Boston.csv. -Headings: rm,medv -New Length: 506 -Closing file Boston.csv. -Number of records: 506 - -Stats for rm -Sum: 3180.03 -Mean: 6.28463 -Median: 6.2085 -Range: 3.561, 8.78 - -Stats for medv -Sum: 11401.6 -Mean: 22.5328 -Median: 21.2 -Range: 5, 50 - - Covariance = 4.49345 - - Correlation = 0.69536 - -Program terminated. -``` - -### Built into R or C++ -I believe that it was clearly easier to use R functions vs going through and making these functions in C++. This indicates the value of use using R into the future for machine learning. A more straight forward way to analyze data will allow us to understand our data models. - -I will note that someone fluent in C++ would do better than I did, considering I was just stumbling around for a bit trying to remember how import a library for a second there. - -### Statistical Value -What statistical measures did I evaluate: -- ***Mean***: A mean is an average of the data set, and represents the typical or most likely value from a dataset. Knowing what values in a dataset tend to be is important in understanding what general trend of all the data in your dataset is. You can compare values to this to find outliers and such. -- ***Median***: The center value of the data as it is in sorted order. This tells you a central tendency independent of skewed data or large outliers -- ***Range***: is the minimum and maximum values the values of the dataset might take. It is useful to understand how a dataset is bounded, both to see how far outliers might be from the center of a dataset, and just to get an idea for what the data looks like in scale. - -Whenever we are organizing data for a machine learning algorithm, we must understand our data ourselves to have a hope of predicting a trend given our data. Looking at values like mean or median tell us easy to understand generalizations about data so that we may get a general understanding of the meaning of our data without having to analyze every data point. Given more and more powerful generalization tools, or methods of analyzation, we can grow more and more confident in the understanding of our data. 
If we can see a general trend in these descriptive statistics, we can be more confident that a model will eventually be able to capture the specific relationship we hope to predict from the dataset.

### Covariance and Correlation
Given two attributes that may or may not be related, we can find the *covariance* and *correlation* between them. Covariance tells us how one attribute's values move with another's: if the covariance of x with y is a large positive number, then y tends to rise as x rises, while a negative covariance means the two move in opposite directions. Correlation is the same quantity rescaled to the range [-1, 1], which makes it uniform and comparable across attributes regardless of their units; both are one-line calls (`cov()` and `cor()`) in R, as in the sketch above.

These values differ from the measures above in that they are not just measurements of a single attribute but an extrapolation from the data about relationships and patterns. This is very useful when working in ML, since our end goal is figuring out how the data tends to relate to certain results; knowing how attributes correlate directly supports predicting outcomes, or simply understanding complex relationships.
\ No newline at end of file
diff --git a/docs/ML Work/ml overview.md b/docs/ML Work/index.md
similarity index 98%
rename from docs/ML Work/ml overview.md
rename to docs/ML Work/index.md
index c0e18ff..ef08b85 100644
--- a/docs/ML Work/ml overview.md
+++ b/docs/ML Work/index.md
@@ -1,10 +1,7 @@
 ---
 share: true
-category: ML Work
 title: ML Work
 ---
-
-# Machine Learning Work
 Back in 2022 I took an introduction to Machine Learning with the wonderful [Karen Mazidi](https://www.linkedin.com/in/mazidiaiconsulting/), who gave us a large overview of both data science basics and basic machine learning. The class was project based, with a focus on providing documentation of the process.
 
 ## Learning in R
diff --git a/docs/nlp overview.md b/docs/NLP Work/index.md
similarity index 98%
rename from docs/nlp overview.md
rename to docs/NLP Work/index.md
index 7dff610..5a6875b 100644
--- a/docs/nlp overview.md
+++ b/docs/NLP Work/index.md
@@ -1,7 +1,7 @@
 ---
 share: true
+title: NLP Work
 ---
-# Natural Language Processing
 Continuing my study in machine learning, I decided to focus on language processing and take a class on NLP. The class focused on learning the various libraries and ML techniques we use to understand language, scaling that up in Python all the way to deep learning. We covered:
 - Foundational NLP distinctions like parts of speech and words, sentences, and corpora
 - Basic Python usage with NLTK for preprocessing
diff --git a/docs/NLP Work/summary.md b/docs/NLP Work/summary.md
deleted file mode 100644
index 711fdd0..0000000
--- a/docs/NLP Work/summary.md
+++ /dev/null
@@ -1,20 +0,0 @@
# Project 1: Parsing Data!
This assignment is an example of a few things you can do in Python:
1. Reading in a file (.csv)
2. Reading arguments from the command line
3. Using text tools like regex to format strings
4. Saving dictionaries to pickle files
5. Printing formatted data

To run the script, grab the contents of the project 1 directory from [here](https://github.com/zaiquiriw/nlp-portfolio), make sure you are in that directory, and run:
```
python contact_parser.py data/data.csv
```

Assuming you have Python set up, that should run the script!

## Is Python the right tool for this?
Python is such a quick tool to use that I think basic text processing is what it is *meant* to do. While coding, you can easily run methods in the console to test a regex or a function definition.
The ability to test and iterate quickly does mean the language hides a couple of efficiency-saving options from me. Built-in functions like capitalize() hide their implementation from the user, which could cost efficiency, or even correctness, in other situations.

## What did I learn?
I got into the nitty-gritty of some regex features, like certain Unicode character classes that I unfortunately didn't use in the final product. I also hadn't used the `*` or `**` operators before to unpack lists and dictionaries into function arguments. Most of the rest of the project was a *needed* review of plain Python coding; even if it wasn't the most difficult work, I had to recall a lot of material for the final product.
diff --git a/docs/assets/graph.html b/docs/assets/graph.html
new file mode 100644
index 0000000..541324f
--- /dev/null
+++ b/docs/assets/graph.html
@@ -0,0 +1,155 @@
diff --git a/docs/nlp overview/index.md b/docs/nlp overview/index.md
deleted file mode 100644
index 3d9d1d1..0000000
--- a/docs/nlp overview/index.md
+++ /dev/null
@@ -1,29 +0,0 @@
---
share: true
category: nlp overview
---
# Natural Language Processing
Continuing my study in machine learning, I decided to focus on language processing and take a class on NLP. The class focused on learning the various libraries and ML techniques we use to understand language, scaling that up in Python all the way to deep learning. We covered:
- Foundational NLP distinctions like parts of speech and words, sentences, and corpora
- Basic Python usage with NLTK for preprocessing
- WordNet and building word relationships
- N-gram models for language generation
- Context-free grammars
- NumPy, pandas, scikit-learn, and seaborn
- Naive Bayes and logistic regression for NLP
- Keras for CNNs, RNNs, LSTMs, and GRUs
- Using embeddings along with decoders and encoders

For all of these topics we did various projects to practice implementing what we learned and to share it in Jupyter notebooks.

## The Projects
If you would like to view the code and notebook work related to these projects, it is still posted on [[https://github.com/zaiquiriw/nlp-portfolio|github]]! Here are some short summaries of my work in NLP; I am especially proud of my [[Summary_of_Attention_Article.pdf|analysis of attention as an explainability metric]] if you would like to take a look.

- [[wordnet.pdf|Wordnets]]: An exploration of how wordnets can reveal complex meanings of words not found in the definition alone
- [[ngrams-assignment.pdf|N-grams]]: A brief description of n-grams to illustrate their usefulness
- [[summary.pdf|Netscraping for LLM's]]: I used BeautifulSoup to scrape the web for an LLM
- [[text-classification.pdf|Text Classification]]: I used simple neural networks with the goal of building a network that could be trained to imitate characters (in this case, Rick and Morty's voice and tone)
- [[Summary_of_Attention_Article.pdf|The Impact of Attention]]: This short paper summarizes an article asking whether attention is explanation, and bridges the creation of modern GPTs into the now-pressing alignment problem and other consequences of modern attention. A personal favorite project, where I explored the upheaval in AI research caused by the sudden prominence of new AI techniques.
- [[RickMortyTwo.pdf|More Rick and Morty]]: I like to have fun, so I did a take two on classifying text in the Rick and Morty voice. It came out as more of a study in how you can't squeeze data to fit your use case; you have to work with the data you have.

I came out of this class *really* wanting to do more research, but I did not want to jump right into a masters. Perhaps one day, but I need a break after 16 or so years of schooling. I do feel very comfortable in data science, and I value that greatly!
\ No newline at end of file
diff --git a/docs/tags.md b/docs/tags.md
deleted file mode 100644
index c4a3cae..0000000
--- a/docs/tags.md
+++ /dev/null
@@ -1 +0,0 @@
-[TAGS]
\ No newline at end of file
diff --git a/encryptcontent.cache b/encryptcontent.cache
new file mode 100644
index 0000000..f2f2d20
--- /dev/null
+++ b/encryptcontent.cache
@@ -0,0 +1,4 @@
+kdf_iterations: 10000
+obfuscate: {}
+password: {}
+userpass: {}
diff --git a/mkdocs.yml b/mkdocs.yml
index 1653e0a..292d37e 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -3,7 +3,7 @@ site_description: All things ZaiquiriW
 site_url: https://zaiquiri.github.io
 
 theme:
-  name: 'material'
+  name: material
   logo: assets/meta/favicons.png
   favicon: assets/meta/favicons.png
   custom_dir: overrides
@@ -64,9 +64,6 @@ markdown_extensions:
       anchor_linenums: true
  - pymdownx.tasklist:
      custom_checkbox: true
- - pymdownx.emoji:
-     emoji_index: !!python/name:materialx.emoji.twemoji
-     emoji_generator: !!python/name:materialx.emoji.to_svg
  - admonition
  - toc:
      permalink: true
@@ -83,7 +80,7 @@ plugins:
  - git-revision-date-localized:
      type: date
      fallback_to_build_date: true
-      locale: fr
+      locale: en
      custom_format: "%A %d %B %Y"
      enable_creation_date: true
  - ezlinks:
@@ -94,8 +91,6 @@ plugins:
      custom-attributes: 'assets/css/custom_attributes.css'
  - custom-attributes:
      file: 'assets/css/custom_attributes.css'
-  - tags:
-      tags_file: tags.md
  - encryptcontent:
      title_prefix: '🔐'
      summary: 'Private page'
diff --git a/overrides/hooks/__pycache__/on_env.cpython-310.pyc b/overrides/hooks/__pycache__/on_env.cpython-310.pyc
new file mode 100644
index 0000000..5a87b4a
Binary files /dev/null and b/overrides/hooks/__pycache__/on_env.cpython-310.pyc differ
diff --git a/overrides/hooks/__pycache__/on_files.cpython-310.pyc b/overrides/hooks/__pycache__/on_files.cpython-310.pyc
new file mode 100644
index 0000000..e772ffa
Binary files /dev/null and b/overrides/hooks/__pycache__/on_files.cpython-310.pyc differ
diff --git a/overrides/hooks/__pycache__/on_page_markdown.cpython-310.pyc b/overrides/hooks/__pycache__/on_page_markdown.cpython-310.pyc
new file mode 100644
index 0000000..86e3dde
Binary files /dev/null and b/overrides/hooks/__pycache__/on_page_markdown.cpython-310.pyc differ