0
00:00:00,500 --> 00:00:02,370
Hello, and welcome!
1
00:00:02,370 --> 00:00:06,400
In this video, we’ll be covering the process of building decision trees.
2
00:00:06,400 --> 00:00:09,080
So let’s get started!
3
00:00:09,080 --> 00:00:11,769
Consider the drug dataset again.
4
00:00:11,769 --> 00:00:17,980
The question is, “How do we build the decision tree based on that dataset?”
5
00:00:17,980 --> 00:00:22,670
Decision trees are built using recursive partitioning to classify the data.
6
00:00:22,670 --> 00:00:26,650
Let’s say we have 14 patients in our dataset.
7
00:00:26,650 --> 00:00:31,740
The algorithm chooses the most predictive feature to split the data on.
8
00:00:31,740 --> 00:00:36,170
What is important in making a decision tree is to determine “which attribute is the
9
00:00:36,170 --> 00:00:41,050
best, or most predictive, to split the data on.”
10
00:00:41,050 --> 00:00:46,820
Let’s say we pick “Cholesterol” as the first attribute to split the data on.
11
00:00:46,820 --> 00:00:50,149
It will split our data into 2 branches.
12
00:00:50,149 --> 00:00:56,629
As you can see, if the patient has high “Cholesterol,” we cannot say with high confidence that Drug B
13
00:00:56,629 --> 00:00:59,050
is suitable for that patient.
14
00:00:59,050 --> 00:01:04,080
Also, if the patient’s “Cholesterol” is normal, we still don’t have sufficient
15
00:01:04,080 --> 00:01:11,700
evidence or information to determine if either Drug A or Drug B is, in fact, suitable.
16
00:01:11,700 --> 00:01:16,250
This is an example of bad attribute selection for splitting the data.
17
00:01:16,250 --> 00:01:19,590
So, let’s try another attribute.
18
00:01:19,590 --> 00:01:23,310
Again, we have our 14 cases.
19
00:01:23,310 --> 00:01:27,780
This time, we pick the “Sex” attribute of the patients.
20
00:01:27,780 --> 00:01:32,869
It will split our data into 2 branches, Male and Female.
21
00:01:32,869 --> 00:01:38,200
As you can see, if the patient is Female, we can say Drug B might be suitable for her
22
00:01:38,200 --> 00:01:40,040
with high certainty.
23
00:01:40,040 --> 00:01:46,479
But, if the patient is Male, we don’t have sufficient evidence or information to determine
24
00:01:46,479 --> 00:01:49,829
if Drug A or Drug B is suitable.
25
00:01:49,829 --> 00:01:55,560
However, it is still a better choice in comparison with the “Cholesterol” attribute, because
26
00:01:55,560 --> 00:01:58,860
the resulting nodes are more pure.
27
00:01:58,860 --> 00:02:04,360
That is, each node contains mostly Drug A or mostly Drug B patients.
28
00:02:04,360 --> 00:02:10,920
So, we can say the “Sex” attribute is more significant than “Cholesterol,” or
29
00:02:10,919 --> 00:02:15,000
in other words, it’s more predictive than the other attribute.
30
00:02:15,000 --> 00:02:21,200
Indeed, “predictiveness” is based on the decrease in the “impurity” of the nodes.
31
00:02:21,200 --> 00:02:27,810
We’re looking for the best feature to decrease the ”impurity” of patients in the leaves,
32
00:02:27,810 --> 00:02:31,209
after splitting them up based on that feature.
33
00:02:31,209 --> 00:02:37,390
So, the “Sex” feature is a good candidate in this case, because it almost separates
34
00:02:37,390 --> 00:02:38,840
the patients into pure groups.
35
00:02:38,840 --> 00:02:42,260
Let’s go one step further.
36
00:02:42,260 --> 00:02:48,560
For the Male patient branch, we again test other attributes to split the subtree.
37
00:02:48,560 --> 00:02:51,790
We test “Cholesterol” again here.
38
00:02:51,790 --> 00:02:55,550
As you can see, it results in even more pure leaves.
39
00:02:55,550 --> 00:02:59,580
So, we can easily make a decision here.
40
00:02:59,580 --> 00:03:04,280
For example, if a patient is “Male”, and his “Cholesterol” is “High”, we can
41
00:03:04,280 --> 00:03:13,060
certainly prescribe Drug A, but if it is “Normal”, we can prescribe Drug B with high confidence.
42
00:03:13,060 --> 00:03:18,550
As you might notice, the choice of attribute to split data is very important, and it is
43
00:03:18,550 --> 00:03:22,360
all about “purity” of the leaves after the split.
44
00:03:22,360 --> 00:03:29,850
A node in the tree is considered “pure” if, in 100% of the cases, its samples fall into
45
00:03:29,850 --> 00:03:33,230
a specific category of the target field.
46
00:03:33,230 --> 00:03:40,379
In fact, the method uses recursive partitioning to split the training records into segments
47
00:03:40,379 --> 00:03:44,230
by minimizing the “impurity” at each step.
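To make the recursive partitioning idea concrete, here is a minimal Python sketch (not code from the video; the record layout and the best_feature helper are illustrative assumptions, and the real selection criterion is developed next):

```python
# A minimal, illustrative sketch of recursive partitioning. Each record is
# assumed to be a dict of categorical features plus a "Drug" target label.

def best_feature(records, features):
    # Placeholder: the video develops the real selection criterion
    # (entropy / information gain) below; here we naively take the first.
    return features[0]

def is_pure(records):
    # A node is pure when every record carries the same target label.
    return len({r["Drug"] for r in records}) == 1

def build_tree(records, features):
    # Stop splitting when the node is pure or no features remain.
    if is_pure(records) or not features:
        return {"leaf": [r["Drug"] for r in records]}
    best = best_feature(records, features)
    branches = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        branches[value] = build_tree(subset, [f for f in features if f != best])
    return {"split_on": best, "branches": branches}
```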
48
00:03:44,230 --> 00:03:50,020
The “impurity” of nodes is calculated using the “Entropy” of the data in the node.
49
00:03:50,020 --> 00:03:53,990
So, what is “Entropy”?
50
00:03:53,990 --> 00:04:00,670
Entropy is the amount of information disorder, or the amount of randomness in the data.
51
00:04:00,670 --> 00:04:06,720
The entropy of a node depends on how random the data in that node is, and it is calculated
52
00:04:06,720 --> 00:04:08,460
for each node.
53
00:04:08,460 --> 00:04:14,990
In decision trees, we're looking for trees that have the smallest entropy in their nodes.
54
00:04:14,990 --> 00:04:20,579
The entropy is used to calculate the homogeneity of the samples in that node.
55
00:04:20,579 --> 00:04:25,930
If the samples are completely homogeneous, the entropy is zero, and if the samples are
56
00:04:25,930 --> 00:04:30,030
equally divided, it has an entropy of one.
57
00:04:30,030 --> 00:04:35,890
This means, if all the data in a node are Drug A, or all are Drug B, then the entropy
58
00:04:35,890 --> 00:04:43,440
is zero, but if half of the data are Drug A and the other half are Drug B, then the entropy is
59
00:04:43,440 --> 00:04:44,850
one.
60
00:04:44,850 --> 00:04:50,750
You can easily calculate the entropy of a node using the frequency table of the attribute
61
00:04:50,750 --> 00:04:56,810
through the Entropy formula, where P is for the proportion or ratio of a category, such
62
00:04:56,810 --> 00:05:01,889
as Drug A or B. Please remember, though, that you don’t
63
00:05:01,889 --> 00:05:07,320
have to calculate these by hand, as they are easily calculated by the libraries or packages that
64
00:05:07,320 --> 00:05:09,229
you use.
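For reference, the Entropy formula referred to here can be written as follows, where p_i is the proportion of each category (such as Drug A or Drug B) among the samples in the node:

```latex
% Entropy of the samples S in a node; p_i is the proportion of category i
H(S) = -\sum_{i} p_i \log_2(p_i)
```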
65
00:05:09,229 --> 00:05:14,710
As an example, let’s calculate the entropy of the dataset before splitting it.
66
00:05:14,710 --> 00:05:20,000
We have 9 occurrences of Drug B and 5 of Drug A.
67
00:05:20,000 --> 00:05:25,660
You can plug these numbers into the Entropy formula to calculate the impurity of the target
68
00:05:25,660 --> 00:05:28,270
attribute before splitting it.
69
00:05:28,270 --> 00:05:31,200
In this case, it is 0.94.
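A few lines of Python confirm this number (a sketch, not part of the video):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 9 occurrences of Drug B and 5 of Drug A before splitting:
print(round(entropy([9, 5]), 2))  # 0.94
```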
70
00:05:31,200 --> 00:05:36,000
So, what is entropy after splitting?
71
00:05:36,000 --> 00:05:40,860
Now we can test different attributes to find the one with the most “predictiveness,”
72
00:05:40,860 --> 00:05:43,830
that is, the one that results in purer branches.
73
00:05:43,830 --> 00:05:49,500
Let’s first select the “Cholesterol” of the patient and see how the data gets split,
74
00:05:49,500 --> 00:05:51,890
based on its values.
75
00:05:51,890 --> 00:05:58,449
For example, when it is “normal,” we have 6 for Drug B, and 2 for Drug A.
76
00:05:58,449 --> 00:06:04,699
We can calculate the Entropy of this node based on the distribution of Drug A and B,
77
00:06:04,699 --> 00:06:07,460
which is 0.81 in this case.
78
00:06:07,460 --> 00:06:14,830
But, when Cholesterol is “High,” the data is split into 3 for Drug B and 3 for Drug A.
79
00:06:14,830 --> 00:06:18,890
Calculating its entropy, we can see it would be 1.0.
80
00:06:18,890 --> 00:06:23,830
We should go through all the attributes and
81
00:06:23,830 --> 00:06:29,430
calculate the “Entropy” after the split, and then choose the best attribute.
82
00:06:29,430 --> 00:06:32,880
Ok, let’s try another field.
83
00:06:32,880 --> 00:06:37,139
Let’s choose the Sex attribute for the next check.
84
00:06:37,139 --> 00:06:43,440
As you can see, when we use the Sex attribute to split the data and its value is “Female,”
85
00:06:43,440 --> 00:06:49,740
we have 3 patients that responded to Drug B, and 4 patients that responded to Drug A.
86
00:06:49,740 --> 00:06:56,280
The entropy for this node is 0.98, which is not very promising.
87
00:06:56,280 --> 00:07:02,461
However, on the other side of the split, when the value of the Sex attribute is Male, the
88
00:07:02,461 --> 00:07:08,229
result is more pure with 6 for Drug B and only 1 for Drug A.
89
00:07:08,229 --> 00:07:11,770
The entropy for this group is 0.59.
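The same entropy helper reproduces the branch values quoted for both candidate attributes (the class counts are the ones stated in the video):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Cholesterol split: Normal -> 6 Drug B / 2 Drug A, High -> 3 / 3
print(round(entropy([6, 2]), 3))  # 0.811
print(round(entropy([3, 3]), 3))  # 1.0
# Sex split: Female -> 3 Drug B / 4 Drug A, Male -> 6 / 1
print(round(entropy([3, 4]), 3))  # 0.985
print(round(entropy([6, 1]), 3))  # 0.592
```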
90
00:07:11,770 --> 00:07:19,289
Now, the question is, between the Cholesterol and Sex attributes, which one is a better
91
00:07:19,289 --> 00:07:20,960
choice?
92
00:07:20,960 --> 00:07:27,250
Which one is better as the first attribute to divide the dataset into 2 branches?
93
00:07:27,250 --> 00:07:33,620
Or, in other words, which attribute results in more pure nodes for our drugs?
94
00:07:33,620 --> 00:07:40,950
Or, in which tree do we have less entropy after splitting compared to before splitting?
95
00:07:40,950 --> 00:07:48,889
The “Sex” attribute with entropy of 0.98 and 0.59, or the “Cholesterol” attribute
96
00:07:48,889 --> 00:07:55,110
with entropy of 0.81 and 1.0 in its branches?
97
00:07:55,110 --> 00:08:00,940
The answer is, “The tree with the higher information gain after splitting.”
98
00:08:00,940 --> 00:08:05,360
So, what is information gain?
99
00:08:05,360 --> 00:08:11,020
Information gain is the amount by which the level of certainty increases after splitting.
100
00:08:11,020 --> 00:08:16,970
It is the entropy of a tree before the split minus the weighted entropy after the split
101
00:08:16,970 --> 00:08:18,940
by an attribute.
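In symbols (my notation, not shown in the video), where S_v is the subset of samples taking value v for attribute A:

```latex
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```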
102
00:08:18,940 --> 00:08:23,789
We can think of information gain and entropy as opposites.
103
00:08:23,789 --> 00:08:31,060
As entropy, or the amount of randomness, decreases, the information gain, or amount of certainty,
104
00:08:31,060 --> 00:08:32,440
increases, and vice-versa.
105
00:08:32,440 --> 00:08:39,990
So, constructing a decision tree is all about finding attributes that return the highest
106
00:08:39,990 --> 00:08:41,610
information gain.
107
00:08:41,610 --> 00:08:46,740
Let’s see how “information gain” is calculated for the Sex attribute.
108
00:08:46,740 --> 00:08:52,470
As mentioned, the information gain is the entropy of the tree before the split, minus
109
00:08:52,470 --> 00:08:56,000
the weighted entropy after the split.
110
00:08:56,000 --> 00:09:01,180
The entropy of the tree before the split is 0.94.
111
00:09:01,180 --> 00:09:08,310
The portion of Female patients is 7 out of 14, and its entropy is 0.985.
112
00:09:08,310 --> 00:09:18,680
Also, the portion of Male patients is 7 out of 14, and the entropy of the Male node is 0.592.
113
00:09:18,680 --> 00:09:24,310
The result of the expression in square brackets here is the weighted entropy after the split.
114
00:09:24,310 --> 00:09:30,850
So, the information gain of the tree if we use the “Sex” attribute to split the dataset
115
00:09:30,850 --> 00:09:34,220
is 0.151.
116
00:09:34,220 --> 00:09:39,930
As you can see, we will consider the entropy over the distribution of samples falling under
117
00:09:39,930 --> 00:09:45,750
each leaf node, and we’ll take a weighted average of that entropy – weighted by the
118
00:09:45,750 --> 00:09:49,850
proportion of samples falling under that leaf.
119
00:09:49,850 --> 00:09:55,360
We can calculate the information gain of the tree if we use “Cholesterol” as well.
120
00:09:55,360 --> 00:09:56,360
It is 0.048.
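A short sketch puts the pieces together and reproduces both gains (full precision gives 0.152 for Sex, which the video quotes as 0.151):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, branches):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent)
    weighted = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - weighted

# Sex split: Female -> [3, 4], Male -> [6, 1]
print(round(information_gain([9, 5], [[3, 4], [6, 1]]), 3))  # 0.152 (quoted as 0.151)
# Cholesterol split: Normal -> [6, 2], High -> [3, 3]
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
```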
121
00:09:56,360 --> 00:10:02,380
Now, the question is, “Which attribute is more suitable?”
122
00:10:02,380 --> 00:10:08,200
Well, as mentioned, the tree with the higher information gain after splitting.
123
00:10:08,200 --> 00:10:11,540
This means the “Sex” attribute.
124
00:10:11,540 --> 00:10:16,390
So, we select the “Sex” attribute as the first splitter.
125
00:10:16,390 --> 00:10:21,490
Now, what is the next attribute after branching by the “Sex” attribute?
126
00:10:21,490 --> 00:10:27,450
Well, as you can guess, we should repeat the process for each branch, and test each of
127
00:10:27,450 --> 00:10:32,529
the other attributes, continuing until we reach the purest leaves.
128
00:10:32,529 --> 00:10:34,800
This is the way that you build a decision tree!
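In practice, as noted earlier, a library handles all of this for you. Here is a minimal sketch with scikit-learn, using entropy as the split criterion; the tiny encoded dataset is made up for illustration and is not the 14-patient dataset from the video:

```python
from sklearn.tree import DecisionTreeClassifier

# Encoded features: [Sex (0 = Female, 1 = Male), Cholesterol (0 = Normal, 1 = High)]
X = [[0, 0], [0, 1], [1, 0], [1, 0], [1, 1], [0, 1]]
y = ["B", "B", "B", "B", "A", "B"]

# criterion="entropy" selects splits by information gain, as in this video.
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)

# A Male patient with High cholesterol:
print(clf.predict([[1, 1]]))  # ['A']
```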
129
00:10:34,800 --> 00:10:36,490
Thanks for watching!