Commit dc0514f: binference fix
adarsh0806 committed Jan 26, 2018 (1 parent: 76ec01b)
Showing 1 changed file with 15 additions and 24 deletions.
@@ -27,7 +27,7 @@
"\n",
"One thing to consider is the independence of these features amongst each other. For example if a child looks nervous at the event then the likelihood of that person being a threat is not as much as say if it was a grown man who was nervous. To break this down a bit further, here there are two features we are considering, age AND nervousness. Say we look at these features individually, we could design a model that flags ALL persons that are nervous as potential threats. However, it is likely that we will have a lot of false positives as there is a strong chance that minors present at the event will be nervous. Hence by considering the age of a person along with the 'nervousness' feature we would definitely get a more accurate result as to who are potential threats and who aren't. \n",
"\n",
"This is the 'Naive' bit of the theorem where it considers each feature to be independant of each other which may not always be the case and hence that can affect the final judgement.\n",
"This is the 'Naive' bit of the theorem where it considers each feature to be independent of each other which may not always be the case and hence that can affect the final judgement.\n",
"\n",
"In short, the Bayes theorem calculates the probability of a certain event happening(in our case, a message being spam) based on the joint probabilistic distributions of certain other events(in our case, a message being classified as spam). We will dive into the workings of the Bayes theorem later in the mission, but first, let us understand the data we are going to work with."
]
@@ -102,7 +102,7 @@
},
"source": [
">**Instructions: **\n",
"* Convert the values in the 'label' colum to numerical values using map method as follows:\n",
"* Convert the values in the 'label' column to numerical values using map method as follows:\n",
"{'ham':0, 'spam':1} This maps the 'ham' value to 0 and the 'spam' value to 1.\n",
"* Also, to get an idea of the size of the dataset we are dealing with, print out number of rows and columns using \n",
"'shape'."
@@ -130,7 +130,7 @@
"\n",
"Here we'd like to introduce the Bag of Words(BoW) concept which is a term used to specify the problems that have a 'bag of words' or a collection of text data that needs to be worked with. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually and the order in which the words occur does not matter. \n",
"\n",
"Using a process which we will go through now, we can covert a collection of documents to a matrix, with each document being a row and each word(token) being the column, and the corresponding (row,column) values being the frequency of occurrance of each word or token in that document.\n",
"Using a process which we will go through now, we can covert a collection of documents to a matrix, with each document being a row and each word(token) being the column, and the corresponding (row,column) values being the frequency of occurrence of each word or token in that document.\n",
"\n",
"For example: \n",
"\n",
Expand All @@ -153,7 +153,7 @@
"[count vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) method which does the following:\n",
"\n",
"* It tokenizes the string(separates the string into individual words) and gives an integer ID to each token.\n",
"* It counts the occurrance of each of those tokens.\n",
"* It counts the occurrence of each of those tokens.\n",
"\n",
"** Please Note: ** \n",
"\n",
@@ -449,7 +449,7 @@
">>**\n",
"Instructions:**\n",
"Create a matrix with the rows being each of the 4 documents, and the columns being each word. \n",
"The corresponding (row, column) value is the frequency of occurrance of that word(in the column) in a particular\n",
"The corresponding (row, column) value is the frequency of occurrence of that word(in the column) in a particular\n",
"document(in the row). You can do this using the transform() method and passing in the document data set as the \n",
"argument. The transform() method returns a matrix of numpy integers, you can convert this to an array using\n",
"toarray(). Call the array 'doc_array'\n"
@@ -503,7 +503,7 @@
"source": [
"Congratulations! You have successfully implemented a Bag of Words problem for a document dataset that we created. \n",
"\n",
"One potential issue that can arise from using this method out of the box is the fact that if our dataset of text is extremely large(say if we have a large collection of news articles or email data), there will be certain values that are more common that others simply due to the structure of the language itself. So for example words like 'is', 'the', 'an', pronouns, grammatical contructs etc could skew our matrix and affect our analyis. \n",
"One potential issue that can arise from using this method out of the box is the fact that if our dataset of text is extremely large(say if we have a large collection of news articles or email data), there will be certain values that are more common that others simply due to the structure of the language itself. So for example words like 'is', 'the', 'an', pronouns, grammatical constructs etc could skew our matrix and affect our analyis. \n",
"\n",
"There are a couple of ways to mitigate this. One way is to use the `stop_words` parameter and set its value to `english`. This will automatically ignore all words(from our input text) that are found in a built in list of English stop words in scikit-learn.\n",
"\n",
@@ -627,7 +627,7 @@
"Now that we have our dataset in the format that we need, we can move onto the next portion of our mission which is the algorithm we will use to make our predictions to classify a message as spam or not spam. Remember that at the start of the mission we briefly discussed the Bayes theorem but now we shall go into a little more detail. In layman's terms, the Bayes theorem calculates the probability of an event occurring, based on certain other probabilities that are related to the event in question. It is composed of a prior(the probabilities that we are aware of or that is given to us) and the posterior(the probabilities we are looking to compute using the priors). \n",
"\n",
"Let us implement the Bayes Theorem from scratch using a simple example. Let's say we are trying to find the odds of an individual having diabetes, given that he or she was tested for it and got a positive result. \n",
"In the medical field, such probabilies play a very important role as it usually deals with life and death situatuations. \n",
"In the medical field, such probabilies play a very important role as it usually deals with life and death situations. \n",
"\n",
"We assume the following:\n",
"\n",
Expand All @@ -645,13 +645,13 @@
"\n",
"<img src=\"images/bayes_formula.png\" height=\"242\" width=\"242\">\n",
"\n",
"* `P(A)` is the prior probability of A occuring independantly. In our example this is `P(D)`. This value is given to us.\n",
"* `P(A)` is the prior probability of A occurring independently. In our example this is `P(D)`. This value is given to us.\n",
"\n",
"* `P(B)` is the prior probability of B occuring independantly. In our example this is `P(Pos)`.\n",
"* `P(B)` is the prior probability of B occurring independently. In our example this is `P(Pos)`.\n",
"\n",
"* `P(A|B)` is the posterior probability that A occurs given B. In our example this is `P(D|Pos)`. That is, **the probability of an individual having diabetes, given that, that individual got a positive test result. This is the value that we are looking to calculate.**\n",
"\n",
"* `P(B|A)` is the likelihood probability of B occuring, given A. In our example this is `P(Pos|D)`. This value is given to us."
"* `P(B|A)` is the likelihood probability of B occurring, given A. In our example this is `P(Pos|D)`. This value is given to us."
]
},
{
Expand All @@ -662,7 +662,7 @@
"\n",
"`P(D|Pos) = (P(D) * P(Pos|D) / P(Pos)`\n",
"\n",
"The probability of getting a positive test result `P(Pos)` can be calulated using the Sensitivity and Specificity as follows:\n",
"The probability of getting a positive test result `P(Pos)` can be calculated using the Sensitivity and Specificity as follows:\n",
"\n",
"`P(Pos) = [P(D) * Sensitivity] + [P(~D) * (1-Specificity))]`"
]
@@ -838,7 +838,7 @@
"* Probability that Gary Johnson says 'environment': 0.1 ---> `P(E|G)`\n",
"\n",
"\n",
"And let us also assume that the probablility of Jill Stein giving a speech, `P(J)` is `0.5` and the same for Gary Johnson, `P(G) = 0.5`. \n",
"And let us also assume that the probability of Jill Stein giving a speech, `P(J)` is `0.5` and the same for Gary Johnson, `P(G) = 0.5`. \n",
"\n",
"\n",
"Given this, what if we had to find the probabilities of Jill Stein saying the words 'freedom' and 'immigration'? This is where the Naive Bayes'theorem comes into play as we are considering two features, 'freedom' and 'immigration'.\n",
@@ -1010,14 +1010,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And as we can see, just like in the Bayes' theorem case, the sum of our posteriors is equal to 1. Congratulations! You have implemented the Naive Bayes' theorem from scratch. Our analysis shows that there is only a 6.6% chance that Jill Stein of the Green Party uses the words 'freedom' and 'immigration' in her speech as compard the the 93.3% chance for Gary Johnson of the Libertarian party."
"And as we can see, just like in the Bayes' theorem case, the sum of our posteriors is equal to 1. Congratulations! You have implemented the Naive Bayes' theorem from scratch. Our analysis shows that there is only a 6.6% chance that Jill Stein of the Green Party uses the words 'freedom' and 'immigration' in her speech as compared the the 93.3% chance for Gary Johnson of the Libertarian party."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another more generic example of Naive Bayes' in action is as when we search for the term 'Sacramento Kings' in a search engine. In order for us to get the results pertaining to the Scramento Kings NBA basketball team, the search engine needs to be able to associate the two words together and not treat them individually, in which case we would get results of images tagged with 'Sacramento' like pictures of city landscapes and images of 'Kings' which could be pictures of crowns or kings from history when what we are looking to get are images of the basketball team. This is a classic case of the search engine treating the words as independant entities and hence being 'naive' in its approach. \n",
"Another more generic example of Naive Bayes' in action is as when we search for the term 'Sacramento Kings' in a search engine. In order for us to get the results pertaining to the Scramento Kings NBA basketball team, the search engine needs to be able to associate the two words together and not treat them individually, in which case we would get results of images tagged with 'Sacramento' like pictures of city landscapes and images of 'Kings' which could be pictures of crowns or kings from history when what we are looking to get are images of the basketball team. This is a classic case of the search engine treating the words as independent entities and hence being 'naive' in its approach. \n",
"\n",
"\n",
"Applying this to our problem of classifying messages as spam, the Naive Bayes algorithm *looks at each word individually and not as associated entities* with any kind of link between them. In the case of spam detectors, this usually works as there are certain red flag words which can almost guarantee its classification as spam, for example emails with words like 'viagra' are usually classified as spam."
@@ -1175,19 +1175,10 @@
"One of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features. In our case, each word is treated as a feature and there are thousands of different words. Also, it performs well even with the presence of irrelevant features and is relatively unaffected by them. The other major advantage it has is its relative simplicity. Naive Bayes' works well right out of the box and tuning it's parameters is rarely ever necessary, except usually in cases where the distribution of the data is known. \n",
"It rarely ever overfits the data. Another important advantage is that its model training and prediction times are very fast for the amount of data it can handle. All in all, Naive Bayes' really is a gem of an algorithm!\n",
"\n",
"Congratulations! You have succesfully designed a model that can efficiently predict if an SMS message is spam or not!\n",
"Congratulations! You have successfully designed a model that can efficiently predict if an SMS message is spam or not!\n",
"\n",
"Thank you for learning with us!"
]
}
],
"metadata": {
