Credit Card Fraud Detection
Anonymized credit card transactions labeled as fraudulent or genuine.
Using Logistic Regression and SVM (with an RBF kernel) on a highly skewed data set.
Download the data set from here:
https://www.kaggle.com/dalpozz/creditcardfraud
Ideas:
______
--> Dropped the Time column from the original dataset and normalized the Amount column. (Done; see the preprocessing sketch after this list.)
--> Further analyse each column (feature) to see whether other features can be discarded.
--> Do PCA to project the complete data in a meaningful way. (Work in progress.)
--> Use Neural Networks (with more than one hidden layer) for classification.
--> Show a comparative analysis of different parameters and kernel types in the SVM function calls. This may not yield very different results.
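A minimal sketch of that preprocessing step, assuming pandas and scikit-learn, with the Kaggle CSV saved as creditcard.csv (the file name is an assumption):

    # Preprocessing sketch: drop Time, standardize Amount.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv('creditcard.csv')
    data = data.drop('Time', axis=1)

    # Standardize Amount to zero mean / unit variance, then drop the raw column.
    data['normAmount'] = StandardScaler().fit_transform(
        data['Amount'].values.reshape(-1, 1))
    data = data.drop('Amount', axis=1)

    X = data.drop('Class', axis=1)   # V1..V28 + normAmount
    y = data['Class']                # 1 = fraud, 0 = genuine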
Dependencies:
_____________
python 3.5
scikit-learn
pandas
numpy (optional)
tensorflow (optional: comment out the import and everything works fine; the idea is to use tensorflow in the future)
matplotlib
How to run:
Call python cc.py (or python3, however your environment for Python 3 is configured).
You can run Logistic Regression / SVM separately using cc_2.py by commenting out the appropriate lines:
use the function printing_Kfold_cross for Logistic Regression and
diff_kernels() for SVM with different kernels (all kernels will run).
Pass the complete dataset (temp[0], temp[1]) or the undersampled one (temp[2], temp[3]) as parameters to the function, as in the sketch below.
SVM runs for a really long time on the complete dataset; the graphs for this run are included.
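A minimal usage sketch based on the notes above; the exact function signatures live in cc_2.py, so the calls below are assumptions:

    # In cc_2.py, comment in/out whichever call you need.
    # temp holds the complete data in (temp[0], temp[1])
    # and the undersampled data in (temp[2], temp[3]).
    printing_Kfold_cross(temp[2], temp[3])   # Logistic Regression, undersampled set
    diff_kernels(temp[0], temp[1])           # SVM, all kernels, complete set (slow)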
Note:
Please ignore the function using_SVM; it needs proper structuring and further analysis.
Graphs comparing the recall of SVM and Logistic Regression on the complete data and on the undersampled data are included.
Questions to explore:
The choice of L1/L2 penalty needs further analysis. (L2 is used here; maybe the lbfgs solver is faster.)
Does SVM perform worse on skewed data?
Is recall a good metric for analysing the performance? It really depends on the data:
whether false positives are much less tolerable than false negatives (you do not have cancer, but the model predicts that you do),
or the other way around. A sketch of computing recall follows.
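For reference, a minimal sketch of computing recall and precision with scikit-learn; y_true and y_pred are hypothetical placeholder labels:

    # Recall vs. precision on hypothetical predictions (1 = fraud).
    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    y_true = [0, 0, 0, 1, 1, 1, 1, 0]
    y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

    # Recall = TP / (TP + FN): fraction of actual frauds that were caught.
    print(recall_score(y_true, y_pred))       # 0.75 -> 3 of 4 frauds caught
    # Precision = TP / (TP + FP): fraction of fraud flags that were correct.
    print(precision_score(y_true, y_pred))    # 0.75 -> 3 of 4 flags correct
    print(confusion_matrix(y_true, y_pred))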
Some results:
(C:\ProgramData\Anaconda3) C:\Users\gsk69>m:
(C:\ProgramData\Anaconda3) M:\>cd "Course stuff\Machine Learning\projects\Data Sets\credit card"
(C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card>python cc_2.py
USING Logistic Regression::
Best Mean: 0.612260 with inverse regularization strength 10.000000
USING SVM :
Best Mean: 0.684470 with inverse regularization strength 10.0000 using rbf kernel
USING SVM :
Best Mean: 0.754652 with inverse regularization strength 1.0000 using poly kernel
USING SVM :
Best Mean: 0.275240 with inverse regularization strength 1.0000 using sigmoid kernel
(C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card>
_______________________________________________________________________________________________________________
currently running rbf kernel
USING SVM :
Best Mean: 0.607521 with inverse regularization strength 1.2800 using rbf kernel
SVM with rbf took 6244.27 seconds for completion.
currently running linear kernel
USING SVM :
Best Mean: 0.782996 with inverse regularization strength 0.0200 using linear kernel
SVM with linear took 1137.45 seconds for completion.
currently running sigmoid kernel
USING SVM :
Best Mean: 0.274755 with inverse regularization strength 0.1600 using sigmoid kernel
SVM with sigmoid took 1676.50 seconds for completion.
currently running poly kernel
USING SVM :
Best Mean: 0.711231 with inverse regularization strength 0.1600 using poly kernel
SVM with poly took 1000.95 seconds for completion.
____________________________________________________________________________________
Run details:
Used BaggingClassifier with default values and base classifier = SVM (SVC/LinearSVC).
All kernels ran with max_iter set to 1000.
Used LinearSVC for the linear kernel. A sketch of this setup follows.
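A minimal sketch of that setup, assuming an older scikit-learn where the bagging parameter is named base_estimator (newer releases renamed it to estimator):

    # Bagging sketch for the run described above; all other parameters are defaults.
    from sklearn.ensemble import BaggingClassifier
    from sklearn.svm import SVC, LinearSVC

    # One bagged SVM per kernel; LinearSVC stands in for SVC on the linear kernel.
    bag_rbf = BaggingClassifier(base_estimator=SVC(kernel='rbf', max_iter=1000))
    bag_linear = BaggingClassifier(base_estimator=LinearSVC(max_iter=1000))
    # Fit like any scikit-learn classifier: bag_rbf.fit(X, y)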
currently running rbf kernel
USING SVM :
Best Mean: 0.604330 with inverse regularization strength 1.2800 using rbf kernel
rbf took 3615.35 seconds for completion.
currently running linear kernel
USING SVM :
Best Mean: 0.727653 with inverse regularization strength 1.2800 using linear kernel
linear took 6100.82 seconds for completion.
currently running sigmoid kernel
USING SVM :
Best Mean: 0.274755 with inverse regularization strength 0.3200 using sigmoid kernel
sigmoid took 1713.92 seconds for completion.
currently running poly kernel
USING SVM :
Best Mean: 0.703166 with inverse regularization strength 0.3200 using poly kernel
poly took 1635.25 seconds for completion.
Total run time = 13065.34s ≈ 3h 38m