0
00:00:00,500 --> 00:00:02,370
Hello, and welcome!
1
00:00:02,370 --> 00:00:06,400
In this video, we’ll be covering the process of building decision trees.
2
00:00:06,400 --> 00:00:09,080
So let’s get started!
3
00:00:09,080 --> 00:00:11,769
Consider the drug dataset again.
4
00:00:11,769 --> 00:00:17,980
The question is, “How do we build the decision tree based on that dataset?”
5
00:00:17,980 --> 00:00:22,670
Decision trees are built using recursive partitioning to classify the data.
6
00:00:22,670 --> 00:00:26,650
Let’s say we have 14 patients in our dataset.
7
00:00:26,650 --> 00:00:31,740
The algorithm chooses the most predictive feature to split the data on.
8
00:00:31,740 --> 00:00:36,170
What is important in making a decision tree is to determine “which attribute is the
9
00:00:36,170 --> 00:00:41,050
best, or most predictive, to split the data on.”
10
00:00:41,050 --> 00:00:46,820
Let’s say we pick “Cholesterol” as the first attribute to split the data on.
11
00:00:46,820 --> 00:00:50,149
It will split our data into 2 branches.
12
00:00:50,149 --> 00:00:56,629
As you can see, if the patient has high “Cholesterol,” we cannot say with high confidence that Drug B
13
00:00:56,629 --> 00:00:59,050
is suitable for that patient.
14
00:00:59,050 --> 00:01:04,080
Also, if the patient’s “Cholesterol” is normal, we still don’t have sufficient
15
00:01:04,080 --> 00:01:11,700
evidence or information to determine if either Drug A or Drug B is, in fact, suitable.
16
00:01:11,700 --> 00:01:16,250
This is an example of bad attribute selection for splitting the data.
17
00:01:16,250 --> 00:01:19,590
So, let’s try another attribute.
18
00:01:19,590 --> 00:01:23,310
Again, we have our 14 cases.
19
00:01:23,310 --> 00:01:27,780
This time, we pick the “Sex” attribute of the patients.
20
00:01:27,780 --> 00:01:32,869
It will split our data into 2 branches, Male and Female.
21
00:01:32,869 --> 00:01:38,200
As you can see, if the patient is Female, we can say Drug B might be suitable for her
22
00:01:38,200 --> 00:01:40,040
with high certainty.
23
00:01:40,040 --> 00:01:46,479
But, if the patient is Male, we don’t have sufficient evidence or information to determine
24
00:01:46,479 --> 00:01:49,829
if Drug A or Drug B is suitable.
25
00:01:49,829 --> 00:01:55,560
However, it is still a better choice in comparison with the “Cholesterol” attribute, because
26
00:01:55,560 --> 00:01:58,860
the resulting nodes are more pure.
27
00:01:58,860 --> 00:02:04,360
That is, each node contains mostly Drug A or mostly Drug B patients.
28
00:02:04,360 --> 00:02:10,920
So, we can say the “Sex” attribute is more significant than “Cholesterol,” or
29
00:02:10,919 --> 00:02:15,000
in other words, it’s more predictive than the other attribute.
30
00:02:15,000 --> 00:02:21,200
Indeed, “predictiveness” is based on the decrease in the “impurity” of the nodes.
31
00:02:21,200 --> 00:02:27,810
We’re looking for the best feature to decrease the ”impurity” of patients in the leaves,
32
00:02:27,810 --> 00:02:31,209
after splitting them up based on that feature.
33
00:02:31,209 --> 00:02:37,390
So, the “Sex” feature is a good candidate in this case, because it almost separates
34
00:02:37,390 --> 00:02:38,840
the patients into pure groups.
35
00:02:38,840 --> 00:02:42,260
Let’s go one step further.
36
00:02:42,260 --> 00:02:48,560
For the Male patient branch, we again test other attributes to split the subtree.
37
00:02:48,560 --> 00:02:51,790
We test “Cholesterol” again here.
38
00:02:51,790 --> 00:02:55,550
As you can see, it results in even more pure leaves.
39
00:02:55,550 --> 00:02:59,580
So, we can easily make a decision here.
40
00:02:59,580 --> 00:03:04,280
For example, if a patient is “Male”, and his “Cholesterol” is “High”, we can
41
00:03:04,280 --> 00:03:13,060
certainly prescribe Drug A, but if it is “Normal”, we can prescribe Drug B with high confidence.
42
00:03:13,060 --> 00:03:18,550
As you might notice, the choice of attribute to split data is very important, and it is
43
00:03:18,550 --> 00:03:22,360
all about “purity” of the leaves after the split.
44
00:03:22,360 --> 00:03:29,850
A node in the tree is considered “pure” if, in 100% of the cases, its samples fall into
45
00:03:29,850 --> 00:03:33,230
a specific category of the target field.
46
00:03:33,230 --> 00:03:40,379
In fact, the method uses recursive partitioning to split the training records into segments
47
00:03:40,379 --> 00:03:44,230
by minimizing the “impurity” at each step.
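To make the recursive partitioning idea concrete, here is a minimal Python sketch (not code from the video; the record layout and the best_feature helper are illustrative assumptions, and the real selection criterion is developed next):

```python
# A minimal, illustrative sketch of recursive partitioning. Each record is
# assumed to be a dict of categorical features plus a "Drug" target label.

def best_feature(records, features):
    # Placeholder: the video develops the real selection criterion
    # (entropy / information gain) below; here we naively take the first.
    return features[0]

def is_pure(records):
    # A node is pure when every record carries the same target label.
    return len({r["Drug"] for r in records}) == 1

def build_tree(records, features):
    # Stop splitting when the node is pure or no features remain.
    if is_pure(records) or not features:
        return {"leaf": [r["Drug"] for r in records]}
    best = best_feature(records, features)
    branches = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        branches[value] = build_tree(subset, [f for f in features if f != best])
    return {"split_on": best, "branches": branches}
```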
48
00:03:44,230 --> 00:03:50,020
The “impurity” of nodes is calculated using the “Entropy” of the data in the node.
49
00:03:50,020 --> 00:03:53,990
So, what is “Entropy”?
50
00:03:53,990 --> 00:04:00,670
Entropy is the amount of information disorder, or the amount of randomness in the data.
51
00:04:00,670 --> 00:04:06,720
The entropy of a node depends on how random the data in that node is, and it is calculated
52
00:04:06,720 --> 00:04:08,460
for each node.
53
00:04:08,460 --> 00:04:14,990
In decision trees, we're looking for trees that have the smallest entropy in their nodes.
54
00:04:14,990 --> 00:04:20,579
The entropy is used to calculate the homogeneity of the samples in that node.
55
00:04:20,579 --> 00:04:25,930
If the samples are completely homogeneous, the entropy is zero, and if the samples are
56
00:04:25,930 --> 00:04:30,030
equally divided, it has an entropy of one.
57
00:04:30,030 --> 00:04:35,890
This means, if all the data in a node are Drug A, or all are Drug B, then the entropy
58
00:04:35,890 --> 00:04:43,440
is zero, but if half of the data are Drug A and the other half are Drug B, then the entropy is
59
00:04:43,440 --> 00:04:44,850
one.
60
00:04:44,850 --> 00:04:50,750
You can easily calculate the entropy of a node using the frequency table of the attribute
61
00:04:50,750 --> 00:04:56,810
through the Entropy formula, where P is for the proportion or ratio of a category, such
62
00:04:56,810 --> 00:05:01,889
as Drug A or B. Please remember, though, that you don’t
63
00:05:01,889 --> 00:05:07,320
have to calculate these by hand, as they are easily calculated by the libraries or packages that
64
00:05:07,320 --> 00:05:09,229
you use.
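For reference, the Entropy formula referred to here can be written as follows, where p_i is the proportion of each category (such as Drug A or Drug B) among the samples in the node:

```latex
% Entropy of the samples S in a node; p_i is the proportion of category i
H(S) = -\sum_{i} p_i \log_2(p_i)
```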
65
00:05:09,229 --> 00:05:14,710
As an example, let’s calculate the entropy of the dataset before splitting it.
66
00:05:14,710 --> 00:05:20,000
We have 9 occurrences of Drug B and 5 of Drug A.
67
00:05:20,000 --> 00:05:25,660
You can plug these numbers into the Entropy formula to calculate the impurity of the target
68
00:05:25,660 --> 00:05:28,270
attribute before splitting it.
69
00:05:28,270 --> 00:05:31,200
In this case, it is 0.94.
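A few lines of Python confirm this number (a sketch, not part of the video):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 9 occurrences of Drug B and 5 of Drug A before splitting:
print(round(entropy([9, 5]), 2))  # 0.94
```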
70
00:05:31,200 --> 00:05:36,000
So, what is entropy after splitting?
71
00:05:36,000 --> 00:05:40,860
Now we can test different attributes to find the one with the most “predictiveness,”
72
00:05:40,860 --> 00:05:43,830
that is, the one that results in purer branches.
73
00:05:43,830 --> 00:05:49,500
Let’s first select the “Cholesterol” of the patient and see how the data gets split,
74
00:05:49,500 --> 00:05:51,890
based on its values.
75
00:05:51,890 --> 00:05:58,449
For example, when it is “normal,” we have 6 for Drug B, and 2 for Drug A.
76
00:05:58,449 --> 00:06:04,699
We can calculate the Entropy of this node based on the distribution of Drug A and B,
77
00:06:04,699 --> 00:06:07,460
which is 0.81 in this case.
78
00:06:07,460 --> 00:06:14,830
But, when Cholesterol is “High,” the data is split into 3 for Drug B and 3 for Drug A.
79
00:06:14,830 --> 00:06:18,890
Calculating its entropy, we can see it would be 1.0.
80
00:06:18,890 --> 00:06:23,830
We should go through all the attributes and
81
00:06:23,830 --> 00:06:29,430
calculate the “Entropy” after the split, and then choose the best attribute.
82
00:06:29,430 --> 00:06:32,880
Ok, let’s try another field.
83
00:06:32,880 --> 00:06:37,139
Let’s choose the Sex attribute for the next check.
84
00:06:37,139 --> 00:06:43,440
As you can see, when we use the Sex attribute to split the data and its value is “Female,”
85
00:06:43,440 --> 00:06:49,740
we have 3 patients that responded to Drug B, and 4 patients that responded to Drug A.
86
00:06:49,740 --> 00:06:56,280
The entropy for this node is 0.98, which is not very promising.
87
00:06:56,280 --> 00:07:02,461
However, on the other side of the split, when the value of the Sex attribute is Male, the
88
00:07:02,461 --> 00:07:08,229
result is more pure with 6 for Drug B and only 1 for Drug A.
89
00:07:08,229 --> 00:07:11,770
The entropy for this group is 0.59.
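The same entropy helper reproduces the branch values quoted for both candidate attributes (the class counts are the ones stated in the video):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Cholesterol split: Normal -> 6 Drug B / 2 Drug A, High -> 3 / 3
print(round(entropy([6, 2]), 3))  # 0.811
print(round(entropy([3, 3]), 3))  # 1.0
# Sex split: Female -> 3 Drug B / 4 Drug A, Male -> 6 / 1
print(round(entropy([3, 4]), 3))  # 0.985
print(round(entropy([6, 1]), 3))  # 0.592
```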
90
00:07:11,770 --> 00:07:19,289
Now, the question is, between the Cholesterol and Sex attributes, which one is a better
91
00:07:19,289 --> 00:07:20,960
choice?
92
00:07:20,960 --> 00:07:27,250
Which one is better as the first attribute to divide the dataset into 2 branches?
93
00:07:27,250 --> 00:07:33,620
Or, in other words, which attribute results in more pure nodes for our drugs?
94
00:07:33,620 --> 00:07:40,950
Or, in which tree do we have less entropy after splitting compared to before splitting?
95
00:07:40,950 --> 00:07:48,889
The “Sex” attribute with entropy of 0.98 and 0.59, or the “Cholesterol” attribute
96
00:07:48,889 --> 00:07:55,110
with entropy of 0.81 and 1.0 in its branches?
97
00:07:55,110 --> 00:08:00,940
The answer is, “The tree with the higher information gain after splitting.”
98
00:08:00,940 --> 00:08:05,360
So, what is information gain?
99
00:08:05,360 --> 00:08:11,020
Information gain is the amount by which the level of certainty increases after splitting.
100
00:08:11,020 --> 00:08:16,970
It is the entropy of a tree before the split minus the weighted entropy after the split
101
00:08:16,970 --> 00:08:18,940
by an attribute.
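In symbols (my notation, not shown in the video), where S_v is the subset of samples taking value v for attribute A:

```latex
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```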
102
00:08:18,940 --> 00:08:23,789
We can think of information gain and entropy as opposites.
103
00:08:23,789 --> 00:08:31,060
As entropy, or the amount of randomness, decreases, the information gain, or amount of certainty,
104
00:08:31,060 --> 00:08:32,440
increases, and vice-versa.
105
00:08:32,440 --> 00:08:39,990
So, constructing a decision tree is all about finding attributes that return the highest
106
00:08:39,990 --> 00:08:41,610
information gain.
107
00:08:41,610 --> 00:08:46,740
Let’s see how “information gain” is calculated for the Sex attribute.
108
00:08:46,740 --> 00:08:52,470
As mentioned, the information gain is the entropy of the tree before the split, minus
109
00:08:52,470 --> 00:08:56,000
the weighted entropy after the split.
110
00:08:56,000 --> 00:09:01,180
The entropy of the tree before the split is 0.94.
111
00:09:01,180 --> 00:09:08,310
The portion of Female patients is 7 out of 14, and its entropy is 0.985.
112
00:09:08,310 --> 00:09:18,680
Also, the portion of Male patients is 7 out of 14, and the entropy of the Male node is 0.592.
113
00:09:18,680 --> 00:09:24,310
The result of the expression in square brackets here is the weighted entropy after the split.
114
00:09:24,310 --> 00:09:30,850
So, the information gain of the tree if we use the “Sex” attribute to split the dataset
115
00:09:30,850 --> 00:09:34,220
is 0.151.
116
00:09:34,220 --> 00:09:39,930
As you can see, we will consider the entropy over the distribution of samples falling under
117
00:09:39,930 --> 00:09:45,750
each leaf node, and we’ll take a weighted average of that entropy – weighted by the
118
00:09:45,750 --> 00:09:49,850
proportion of samples falling under that leaf.
119
00:09:49,850 --> 00:09:55,360
We can calculate the information gain of the tree if we use “Cholesterol” as well.
120
00:09:55,360 --> 00:09:56,360
It is 0.048.
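A short sketch puts the pieces together and reproduces both gains (full precision gives 0.152 for Sex, which the video quotes as 0.151):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, branches):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent)
    weighted = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - weighted

# Sex split: Female -> [3, 4], Male -> [6, 1]
print(round(information_gain([9, 5], [[3, 4], [6, 1]]), 3))  # 0.152 (quoted as 0.151)
# Cholesterol split: Normal -> [6, 2], High -> [3, 3]
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
```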
121
00:09:56,360 --> 00:10:02,380
Now, the question is, “Which attribute is more suitable?”
122
00:10:02,380 --> 00:10:08,200
Well, as mentioned, the tree with the higher information gain after splitting.
123
00:10:08,200 --> 00:10:11,540
This means the “Sex” attribute.
124
00:10:11,540 --> 00:10:16,390
So, we select the “Sex” attribute as the first splitter.
125
00:10:16,390 --> 00:10:21,490
Now, what is the next attribute after branching by the “Sex” attribute?
126
00:10:21,490 --> 00:10:27,450
Well, as you can guess, we should repeat the process for each branch, and test each of
127
00:10:27,450 --> 00:10:32,529
the other attributes, continuing until we reach the purest leaves.
128
00:10:32,529 --> 00:10:34,800
This is the way that you build a decision tree!
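In practice, as noted earlier, a library handles all of this for you. Here is a minimal sketch with scikit-learn, using entropy as the split criterion; the tiny encoded dataset is made up for illustration and is not the 14-patient dataset from the video:

```python
from sklearn.tree import DecisionTreeClassifier

# Encoded features: [Sex (0 = Female, 1 = Male), Cholesterol (0 = Normal, 1 = High)]
X = [[0, 0], [0, 1], [1, 0], [1, 0], [1, 1], [0, 1]]
y = ["B", "B", "B", "B", "A", "B"]

# criterion="entropy" selects splits by information gain, as in this video.
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

# A Male patient with High cholesterol:
print(clf.predict([[1, 1]]))  # ['A']
```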
129
00:10:34,800 --> 00:10:36,490
Thanks for watching!