\documentclass{NSF}
% https://arxiv.org/pdf/1605.07277.pdf
% Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples
% Nicolas Papernot and Patrick McDaniel The Pennsylvania State University University Park, PA {ngp5056,mcdaniel}@cse.psu.ed
% https://www.usenix.org/system/files/sec19-demontis.pdf
% Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks
% Ambra Demontis, Marco Melis, and Maura Pintor, University of Cagliari, Italy; Matthew Jagielski, Northeastern University; Battista Biggio, University of Cagliari, Italy, and Pluribus One; Alina Oprea and Cristina Nita-Rotaru, Northeastern University; Fabio Roli, University of Cagliari, Italy, and Pluribus One
% https://www.usenix.org/conference/usenixsecurity19/presentation/demontis
% This paper is included in the Proceedings of the 28th USENIX Security Symposium.
% August 14–16, 2019 • Santa Clara, CA, USA 978-1-939133-06-9
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{mathpazo}
\usepackage{framed}
\usepackage{epstopdf}
\epstopdfDeclareGraphicsRule{.gif}{png}{.png}{convert gif:#1 png:\OutputFile}
\AppendGraphicsExtensions{.gif}
\usepackage{rotating}
\usepackage{framed}
\usepackage{colortbl}
\usepackage{wrapfig}
\usepackage{pgfplots}
\definecolor{maroon}{cmyk}{0,0.87,0.68,0.32}
\setlength{\parskip}{0.5mm}
\usepackage{indentfirst}
\setlength{\parindent}{0.75cm}
\newcommand{\quart}[4]{\begin{picture}(80,4)%1
{\color{black}\put(#3,2){\circle*{4}}\put(#1,2){\line(1,0){#2}}}\end{picture}}
\usepackage{longtable}
\usepackage[most]{tcolorbox}
\usepackage{caption}
\usepackage{multirow}
\usepackage{comment}
\usepackage{pifont}
\usepackage{array}
\usepackage{enumitem}
\usepackage{hyperref}
\newenvironment{myitemize}
{ \begin{itemize}[topsep=0pt,bottomsep=0pt,itemsep=0,leftmargin=*]
\setlength{\itemsep}{0pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt} }
{ \end{itemize} }
\newenvironment{mysmallize}
{ \begin{itemize}[topsep=0pt,bottomsep=0pt,itemsep=0,leftmargin=*]
\setlength{\itemsep}{0pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt} }
{ \end{itemize} }
\newenvironment{mynumns}
{ \begin{enumerate}[topsep=0pt,bottomsep=0pt,itemsep=0,leftmargin=*]
\setlength{\itemsep}{0pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt} }
{ \end{enumerate} }
\definecolor{ao(english)}{rgb}{0.0, 0.5, 0.0}
\newcommand{\be}{\begin{mynumns}}
\newcommand{\ee}{\end{mynumns}}
\newcommand{\bi}{\begin{myitemize}}
\newcommand{\ei}{\end{myitemize}}
\newcommand{\bii}{\begin{mysmallize}}
\newcommand{\eii}{\end{mysmallize}}
\newcommand{\tion}[1]{\S\ref{tion:#1}}
\newcommand{\tbl}[1]{Table~\ref{tbl:#1}}
\newcommand{\fig}[1]{Figure~\ref{fig:#1}}
\newcommand{\eq}[1]{Equation~\ref{eq:#1}}
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{mathpazo}
\definecolor{Gray}{gray}{0.9}
\hyphenation{}
\newcommand{\jnote}[1]{{\color{blue}[JEFF: #1]}}
%\newcommand{\IT}{{\bf {\sffamily TINKER}}}
\newtheorem{criteria}{Success Criteria}
\usepackage[tikz]{bclogo}
\usepackage{tikz}
\def\checkmark{\tikz\fill[scale=0.3](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}
\def\firstcircle{(90:1.75cm) circle (2.5cm)}
\def\secondcircle{(210:1.75cm) circle (2.5cm)}
\def\thirdcircle{(330:1.75cm) circle (2.5cm)}
\newenvironment{eval}[1]%
{\noindent\begin{minipage}[c]{\linewidth}%
\begin{bclogo}[couleur=gray!25,%
arrondi=0.1, % barre=zigzag,%
logo=\bcattention,%
ombre=true]{~#1} \begin{criteria}\small}%
{\end{criteria}\end{bclogo}\end{minipage}\vspace{2mm}}
\newcommand{\IT}{{\sffamily {\em MASS~CONFUSION}}}
\newcommand{\TITLE}{SATC: What can Security Learn from SE Research? (The {\IT} project)}
\usepackage[labelfont=small,font=bf]{caption}
\captionsetup{font+=sf}
\usepackage{pifont}
\usepackage{listings}
\usepackage{tcolorbox}
\newtcolorbox{blockquote}{colback=gray!15,boxrule=0.4pt,colframe=gray!15,fonttitle=\bfseries,top=2pt,bottom=2pt}
\setlength{\fboxsep}{1.7pt}
\begin{document}
\ProjectTitle{\TITLE}
\ProjectAuthor{Tim Menzies, NC State}
%\renewcommand{\attn}
\begin{nsfsummary}
\begin{center}
{\bf \TITLE}\\\vspace{1mm}
Tim Menzies, IEEE Fellow, NC State
\end{center}
Some important software engineering (SE) principles are under-applied in the security domain.
For example, years of research in SE on
{\em configuration} and {\em search-based SE} has shown just how quickly
automatic tools can generate thousands of substitute versions of a software system.
By quickly configuring thousands of substitutes,
then randomly switching between them,
we {\em confuse} adversaries trying to attack a system
(since they can never tell which version of the system they are attacking).
This {\em confusion-based} approach can extensively mitigate against {\em static adversaries}; i.e. those who are unaware that we are trying to {\em confuse} them. But what about {\em adaptive adversaries} that
adjust their attacks based on feedback from our defences? Or {\em transfer attacks}
that reuse examples which previously confused the defenders?
For this purpose,
we propose extending our existing {\em confusion}
approach with better defenses for
adaptive and transfer attackers. Specifically, we conjecture that we can better defend against
adaptation and transfer by increasing the variance in the decision boundaries within our
defenders. To test that conjecture, we will first
implement our extended defender tools, then test them on a wide range of data sets.
%\vspace{4mm}
\noindent
\underline{{\bf INTELLECTUAL MERIT:}}
This work is a unique combination of SE and security research.
We offer a new synthesis of SE and security principles
that applies {\em configuration engineering} and {\em search-based SE} to
the problem of adversarial learning for security applications.
We characterize defending against adversaries
as a configuration problem, where the defender
{\em confuses} the adversary by automatically
and continually adjusting its own configuration.
Such a {\em confusion-based} approach has not previously been
well explored in the literature.
While certainly useful in the security domain, the general principle here of very fast generation of
alternate system designs is also useful for any domain where it is required to explore a range of options (e.g. in
requirements engineering, during refactoring). Usually, such exploration is conducted
to (a)~remove bugs in the software or (b)~show domain experts a range of implementation options. Here, we say that such configuration adaptations can also be assessed via nonfunctional requirements such as
security. Going forward, we can see this work changing much of software design since it would show that
issues of security can be explored (and enhanced) via automatic tools for a wide range of software systems.
% \vspace{4mm}
\noindent
\underline{{\bf BROADER IMPACTS:}}
With the growing reliance on information technology, cybercrime is a serious threat to the economic, military and other industrial sectors within the United States
(and elsewhere). Malware is one of the important factors that undermine Internet security. New malware is increasingly difficult to detect with current detection technologies.
The core problem being explored here (security defense against adversaries) is also a pressing
problem in SE. The more computational resources we give developers, the more likely it becomes that ``black hat'' developers
will use those resources for nefarious purposes. Here, we propose ``turning the tables'' and using those same computational resources for defensive purposes (by generating thousands of equivalent configurations of our software in order to confuse attackers).
We focus on an issue of tremendous socio-economic importance.
Many high-stakes applications, such as finance, hiring, admissions, and criminal justice, use software.
That software is threatened by black-hat programmers who mount adversarial attacks
on our software systems. The technology explored here will ensure the safer use of software,
more secure from adversarial attack.
PI Menzies will continue his established tradition of graduating research students from historically under-represented groups. This work will inform the curriculum of the various NC State NSF-funded REUs (research experiences for undergraduates).
In that program, places are reserved for students from traditionally under-represented areas (e.g. economically challenged regions of the state of North Carolina) and/or students from universities lacking advanced research facilities. While some of the concepts of this grant would be too advanced for that group, some of the simpler concepts and case studies would be suitable for lectures. Also, we will prepare and present demonstration tools for the regular NC State on-campus summer schools for high school students (where we will show by example that data scientists have choices in how fair they make their models).
% \vspace{4mm}
\noindent \underline{{\bf KEYWORDS:}} security, adversarial learning, configuration, search-based software engineering
\end{nsfsummary}
\begin{nsfdescription}
\thispagestyle{plain}
\begin{center}
{\bf \TITLE}\\
{Tim Menzies, IEEE Fellow, NC State}
\end{center}
%Papernot Demontis
\section{Overview}\label{tion:intro}
\noindent
We propose the application of recent results in
{\em configuration of software systems} (from SE research) to an open security problem:
{\em defending the detectors of malicious software against adversarial attacks}. Such attacks are designed to fool defending software into treating some malicious input as benign~\cite{wang2019security}.
We conjecture:
\begin{quote}
{\em Next-generation SE configuration methods can confuse, and thwart, an adversary trying to hide malware, by making malicious software detectors more complex using ensemble methods (see \fig{ensemble}).}
\end{quote}
\begin{wrapfigure}{r}{2in}
\includegraphics[width=2in]{fig/ensemble.png}
\caption{ Ensembles = decisions via polling multiple learners.}
\label{fig:ensemble}
\end{wrapfigure}
\noindent It should be noted that prior results are very negative about
our conjecture~\cite{he2017adversarial}~\cite{DBLP:conf/iclr/TramerKPGBM18}~\cite{papernot2016transferability}. For example, Tramer et al.~\cite{DBLP:conf/iclr/TramerKPGBM18} and others
find that new attacks that substantially improve the transferability of adversarial examples can greatly reduce the achieved robust accuracy. Papernot et al. say that adversarial attackers can use their own learners to find ways to defeat a malicious software detector, even if (a)~the adversary has no knowledge of the defender's algorithm and even if (b)~the defender uses ensembles to confuse the attacker~\cite{papernot2016transferability}.
We believe that those prior results are incomplete since they did not (a)~{\bf explore enough classifiers} nor (b)~{\bf explore the right kinds of classifiers}.
% Prior security research on defending against adversaries
% using {\em ensemble learning} (see \fig{ensemble}) did not take full advantage of the
For example, as discussed below, our {\IT}.1 prototype found that within 1000 classifiers, as few as ten might actually be useful for malicious software detection. Yet security researchers build their defenders using very small ensembles~\cite{khasawneh2017rhmd,demontis2019adversarial,papernot2016transferability}; e.g. Khasawneh et al. used only six members~\cite{khasawneh2017rhmd} in their ensembles. We argue that such small ensembles barely scratch the surface of the large space of options within ensemble generation. For example,
\tbl{hyperparameters} lists billions of configuration options within a detector (in this case, a deep learner looking for malicious code inside a PDF file). Single learners can be combined into ensembles using
the methods of \tbl{methods}. That space is very large and mostly unexplored (by other researchers).
Accordingly, to better explore that space,
PI Menzies and his student Mr. Rui Shu applied their SE methods for configuring large systems.
Preliminary results showed that the {\IT}.1 prototype can mitigate against numerous attackers
including FGSM~\cite{DBLP:journals/corr/GoodfellowSS14},
BIM~\cite{KurakinGB17a},
JSMA~\cite{papernot2016limitations},
DeepFool~\cite{moosavi2016deepfool} and
Carlini\_Wagner~\cite{carlini2017towards,DBLP:conf/ccs/Carlini017}.
\subsection{Summary of Approach}
We propose {\IT}.2 to (a)~address the shortcomings of that first prototype and (b)~gain a deeper understanding of how to defend against adversaries.
This research will combine recent SE results into software configuration with insights from
Khasawneh et al.\cite{khasawneh2017rhmd} and Demontis et al.~\cite{demontis2019adversarial}
to implement the {\IT}.2 defence against adversaries. In this proposal, we will:
\bi
\item Review our preliminary results with this {\IT} approach; as well as
\item Describe the SE configuration techniques we use to implement it.
\ei
While a successful prototype, {\IT}.1 has certain limits to its theoretical foundations and its evaluation.
This proposal will fix those issues in a new system called {\IT}.2, to be developed as part
of this research. {\IT}.2 will be evaluated using the data sets and comparison algorithms of \tbl{baseline}, then released
as open source software (in Github, under an MIT license).
\subsection{Outcomes}
When successful, {\IT}.2 would be a better way to build and maintain malicious software detectors
while also defending against adversarial attacks.
Also, {\IT}.2 will address an important open issue in security research.
According to Demontis et al.~\cite{demontis2019adversarial}, methods for confusing an adversarial attacker have an unwanted side-effect: specifically, they can reduce malware detector performance.
But if {\IT}.2 is successful, then that is evidence of a
loophole in Demontis et al.'s argument. As discussed below, {\IT}.2 can increase decision boundary complexity, thus defeating adversaries, while at the same time maintaining (to a large extent)
the efficacy of malicious software detectors.
\begin{table} \caption{Deep learning detector for PDF files containing malware. If each numeric range is divided into ten bins, this table lists
$11^2*10^5*7*4*3=1,016,400,000$ options.}
\label{tbl:hyperparameters}
\scriptsize
\begin{tabular}{r|l|r}
\hline
\rowcolor[HTML]{ECF4FF}
\textbf{Hyperparameter} & \bf{Ranges} & {\bf Combinations}\\ \hline
Hidden layer activation function & elu, relu, selu, sigmoid, softmax, tanh, hard\_sigmoid, softplus, softsign, linear, exponential & 11 \\
Output layer activation function & elu, relu, selu, sigmoid, softmax, tanh, hard\_sigmoid, softplus, softsign, linear, exponential & 11 \\
First layer dense & quniform(30, 150, 1) & 10 \\
Second layer dense & quniform(30, 50, 1) & 10\\
Third layer dense & quniform(10, 32, 1) & 10\\
Drop out rate & uniform(0.0, 0.5) & 10\\
Number of epochs & quniform(5, 20, 1) & 10 \\
Optimizer & Adadelta, Adagrad, Adam, Adamax, NAdam, RMSprop, SGD & 7\\
Batch size & 16, 32, 64, 128 & 4\\
Learning rate & 0.001, 0.01, 0.1 & 3 \\\hline
\multicolumn{3}{c}{\textbf{quniform($low$, $high$, $q$)} returns $round(uniform(low, high)/q)*q$
while \textbf{uniform($low$, $high$)} returns a value uniformly between $low$ and $high$.}
\end{tabular}
\end{table}\begin{table}
\caption{Comparing our methods to prior
work from the software security literature~\cite{DBLP:conf/iclr/TramerKPGBM18}~\cite{DBLP:conf/icml/PangXDCZ19}~\cite{DBLP:journals/corr/abs-1709-03423}. In summary, other researchers explore a narrow range of ensemble generation options (shown on the left) while our current {\IT}.1 prototype tries the options on the right.
The {\IT}.2
research proposed here will try options
across all these seven dimensions.}\label{tbl:methods}
\footnotesize
\begin{tabular}{|r|c|p{1in}|l|p{5cm}|}\cline{3-5}
\multicolumn{2}{c}{~}&\multicolumn{3}{|c|}{This proposal will explore across this range of options.}\\\hline
\rowcolor{blue!20} &id &Prior work & In-between & Methods from {\IT}.1 \\\hline
number of classifiers &\ding{182} &10 to 100 & 100 to 1000 & 1000 to 10,000\\
\rowcolor{gray!20}using which classifiers? &\ding{183} &all & some. & only those that pass certain tests\\
exploit hyperparameter variance &\ding{184}& no & some & yes \\
\rowcolor{gray!20}feature space & \ding{185}&Varied & some variation & not varied\\
learner hyperparameters (e.g. \tbl{hyperparameters}) &\ding{186}& use default & some tuning & much tuning\\
\rowcolor{gray!20} ensemble voting procedure & \ding{187}&hard-wired & some variation & tuned via hyperparameter optimization\\
seek optimum &
\ding{188}& no &some & yes
\end{tabular}
\normalsize
\begin{tabular}{|p{.98\linewidth}|}
\bi
\item[\ding{182}]
We build ensembles using orders of magnitude more classifiers
than previous work.
\item[\ding{183}]
When dealing with a large number
of classifiers, it is important to assess them before using them. For example,
after generating 1000 classifiers we discard 990.
\item[\ding{184}]
\tion{khas} of this proposal
presents
Khasawneh et al.'s argument that defending classifiers must be assessed by how much variance they introduce into the decision boundary.
\item[\ding{185}]
Other ensemble generation methods in the security domain
make extensive use of feature engineering via, for example, (a)~elaborate and time-consuming instrumentation of hardware and software APIs~\cite{khasawneh2017rhmd};
or (b)~expensive CPU support when (e.g.) the first few layers of a deep learner convert a large input
space into a smaller number of useful synthetic features~\cite{dai19} (this can be very slow if deep learners
must repeatedly analyze data 100s of times in order to converge on useful internal weights).
We conjecture that it would be just as effective, yet cheaper and faster,
to leave the features as they are while exploring large ensembles generated by varying
learner hyperparameters.
\item[\ding{186}] Rather than use a learner's default settings, we
control settings via hyperparameter optimization.
\item[\ding{187}] Ensembles use a ``voting rule'' that polls the ensemble members to make final decisions. There are many possible voting rules (e.g. majority vote), so we use automatic tools to adjust these rules for different data sets.
\item[\ding{188}] The current {\IT}.1 prototype builds its models from classifiers generated by an optimization process; i.e. from a pool of
classifiers with excellent performance.
\tion{khas} of this proposal
shows results from PAC learning suggesting that the best way to confuse an adversary might be to inject a little error into the defender (since that makes it harder for the attacker to learn the policies of the defender). Hence, for
{\IT}.2, we need to try classifier optimization policies that use classifiers that are somewhat less than the best.
\ei
\\ \hline\end{tabular}
\end{table}
\subsection{Novelty}
In this proposal, two key terms, explained
later on, will be:
\bi
\item {\em Epsilon domination:}
a heuristic approach to very rapidly select the best region within a set of parameters;
\item {\em Boundary variance:}
a method to select members of an ensemble such that it is hard for attackers to reverse engineer the decision boundary.
\ei
To the best of our knowledge, these two technologies have yet to be applied to software security.
As to the SE literature,
other SE research papers discuss
security and configuration.
Several of those authors~\cite{Hilton2017TradeoffsIC,8360943,vassallo_carmine_2020_3860985,rahman2019seven} have recently proposed agents that report on known software configuration
issues (some of which have security implications). While important work, that research comments
on the existence of
those issues, but does not repair them.
Our combination of
{\em epsilon domination}
and {\em boundary variance}, on the other hand, is a search-based optimizer
that explores a massive space of possible configurations to automatically select the configurations that satisfy some predicate
(e.g. increase our ability to defend against an adversary). Hence, while that other configuration
research explores hundreds of known bad configurations, we explore millions of good configurations
looking for ways to make them even better.
\section{Background}
\subsection{About Malicious Software}
With the growing reliance on information technology, cybercrime is a serious threat to the economic, military and other industrial sectors within the United States
(and elsewhere)~\cite{bissell2019cost}~\cite{jang2014survey}~\cite{chang2019evaluating}~\cite{opderbeck2015cybersecurity}~\cite{rovskot2020cybercrime}.
Malicious software is one of the important factors that undermine Internet security~\cite{dai19}.
Such software comprises programs or files intended to ``damage'' computers, computer systems, or networks. The many types of damage include loss of functionality and/or loss of privacy
(e.g. when the damaged software leaks private information about users) and/or loss of security (e.g. when the damaged system is used to launch attacks on other systems). Following Euh et al.~\cite{euh2020comparative}, we say that ``damaged software'' exhibits some type of misbehavior that goes against the interests of users after it is implanted into their computers.
A 2019 study by Accenture projected that the damage caused by malware traffic and cybercrime will be US\$5.2 trillion over the next five years~\cite{bissell2019cost}.
Alarmingly, that cost is growing. That same report documents that, in the United States, the annual average cost of malicious software to organizations grew 29\% in the last year. Looking forward, it is expected
to see more kinds of malicious software~\cite{bissell2019cost} including:
\bi
\item
{\em Malware software} that is hidden within files; e.g. PDF files can carry malicious code~\cite{smith01}.
\item
{\em Ransomware} that encrypts a victim's files, then demands a ransom to restore access to the data.
A March 2019 ransomware attack on aluminium producer Norsk Hydro caused 60 million pounds in remediation costs. The attack brought production to a halt at 170 sites around the world (some 22,000 computers were affected across 40 different networks)~\cite{bissell2019cost}.
\item
{\em Industrial espionage software} that infects, then destroys, industrial machinery; e.g. the Stuxnet virus used to damage centrifuges at
Iran’s Natanz uranium enrichment facility~\cite{langner2011stuxnet}.
\item
In the social network realm, Facebook estimated that hackers stole user information from nearly 30 million people through malicious software~\cite{facebookreport}.
\item
According to the International Data Corporation (IDC), the Android operating system covers around 85\% of the world’s smartphone market. Because of its increasing popularity, Android is drawing the attention of malware developers and cyber-attackers. Android malware families that are popular are spyware, bots, Adware, Potentially Unwanted Applications (PUA), Trojans, and Trojan spyware, which affect millions of Android users~\cite{bhat2019survey}.
\item
Other kinds of malicious software include
(a)~{\em scareware} that socially engineers
anxiety, or the perception of a threat, to manipulate users into e.g. buying unwanted software;
(b)~{\em adware} that throws advertisements up on your screen (most often within a web browser); and (c)~software that infects your computer then, without your permission or knowledge, mounts a {\em denial of service attack} on other computers.
\ei
% \tbl{hyperparameters} shows the billions of options within the parameter space
% of just of one malware detector (that used deep learning).
% Preliminary results with {\em mass confusion} approach (see \tion{results})
% show that, when building an ensemble for security purposes, it is necessary to explore, and reject, a very large number of classifiers.
% Just by way of contrast, we note that prior work built their
% ensembles from only 10 to 100 classifier
% or less
% Given the very many ways we might build an ensemble (see tbl{hyperparameters} and \tbl{methods})
% this seems
% a very limited sample.
% But how to generated and assess so many classifiers?
% As discussed below, one result from our {\IT}.1 prototype is
% exploring all those classifiers via standard methods is very slow
% (this is especially true when exploring many options of deep learners like
% \tble{hyperparameters} since the training time for deep learners is so slow).
% Recent wok by PI Menzies~\cite{nairFSEbadlearenr,DODGE,ZHEsectuirty,EMSEDODGEpaper} shows
% that, at least for SE domains, it is possible to quickly build and explore and assess a very large space of classifiers.
% Hence, for the {\IT}.2 research proposed here, we will use
% {\em epsilon-domination search} (described in \tion{eps}) to quickly tune a ``complex'' ensemble of many malware detectors
% (where ``complex'' means ``large variance in the decision boundary'').
% This is importance since
% Khasawneh et al.\cite{khasawneh2017rhmd} and Demontis et al.~\cite{demontis2019adversarial}
% assert in \tion{khas} that such ``complexity'' blocks adversaries in their attempts to slip past
% defences.
% We recommend the {\em epsilon domination } approach (that explores in the internal
% search space of a learner) for generating malware ensembles since
% (a)~it is very fast and (b)~it does not require extensive or expensive domain engineering or CPU-intensive algorithms. By way of contrast, other ensemble generation methods in the security domain
% need (a)~elaborate and time confusion instrumentation of hardware and software APIs~\cite{khasawneh2017rhmd};
% or (b)~expensive CPU support when (e.g.) the first few layers of a deep learner convert a large input
% space into a smaller number of useful synthetic features~\cite{dai19} (this can be very slow if deep learners
% must repeatedly analyze data 100s of times in order to converge on useful internal weights).
In response to the growing number of malicious software incidents, system administrators and operating system vendors add malicious software detectors to their environments to detect dangerous data. Euh et al. caution that it is difficult to deal with advanced malware variants, such as polymorphism~\cite{sikorsi2012practical,alazab2015profiling}, packing~\cite{roundy2013binary}, obfuscation~\cite{murad2010evading}, etc. They note that manually designed and constructed malware detectors cannot keep pace with the flood of new malware variants.
When manual methods fail, automatic methods can be of assistance. Euh et al. say that a {\em malicious software detector} is a machine learner that utilizes known detection patterns to verify whether a new application poses a threat. Such malware detectors can be generated using many techniques (a minimal sketch follows the list below):
\bi
\item
Build classifiers to examine a web page for malicious content~\cite{canali2011prophiler}.
\item
Apply supervised learning to detect malicious web pages~\cite{eshete2012binspect}.
\item
Construct multiple classifier systems to classify spam emails~\cite{biggio2010multiple}.
\item
Build classifiers to detect malicious PDF files~\cite{xu2016automatically}.
\item
Apply machine learning to detect Android malware~\cite{grosse2017adversarial}.
\item
Design supervised learning algorithms to classify HTTP logs~\cite{liu2017robust}.
\item
Design machine learning models to classify Ransomware~\cite{munoz2017towards}.
\item
Detect malicious PowerShell commands using deep neural networks~\cite{hendler2018detecting}.
\ei
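For concreteness, the following is a minimal sketch, not drawn from any of the systems cited above, of this kind of supervised malicious software detector: a classifier is trained on labelled benign/malicious feature vectors and then used to score new inputs. The synthetic data, the feature count, and the choice of a random forest are illustrative assumptions only.
\begin{lstlisting}[language=Python]
# Minimal sketch of a supervised malicious software detector.
# The synthetic features stand in for (e.g.) static features extracted
# from PDF files; label 1 = malicious, 0 = benign.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=5000, n_features=40,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

detector = RandomForestClassifier(n_estimators=200, random_state=0)
detector.fit(X_tr, y_tr)                      # learn "detection patterns"
print(classification_report(y_te, detector.predict(X_te)))
\end{lstlisting}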
% \input{rui}
\subsection{ Adversarial Attacks (Hacking at the Decision Boundary)}
%At present, there are a large number of malware dynamic detection
% and a good detection rate can be obtained (e.g.~\cite{7422770}), but malware with evasion behavior cannot be effectively handled. Commonly used dynamic detection evasion meth- ods use code reuse technical against detectors (e.g.~\cite{7163058}) or use mimicry attack to evade detectors (eg. ~\cite{7509925}). Smutz and Stavrou~\cite{Smutz2016WhenAT} proposed the method of using mutual agree- ment analysis combined with ensemble learning for eva- sion malware detection
All malicious software detectors share the same weakness: an adversary watching the defender can learn how to design malware that slips past the defenses. To understand how that can be accomplished, we need
to understand decision boundaries within learners. For example:
\bi
\item
PDF files can be used to carry computer viruses~\cite{smith01}.
\item
Defending software can protect against maliciousness by learning features that predict for PDF injection.
\item
The defending learners learn a decision boundary that separates different classes of examples (see \fig{boundary}). As examples approach the boundary, they enter a region where learners
have trouble distinguishing between (e.g.) malicious and safe PDF files.
\item
A second adversarial learner can watch the defender to learn what features
of malware make it fall towards that decision boundary.
This adversary then alters (e.g.) their PDF injections such that they fall on the attacker's decision boundary.
\item
This means the defender can no longer recognize the attack.
\ei
\begin{wrapfigure}{r}{2.5in}
\includegraphics[width=2.5in]{fig/boundary.png}\caption{All learners learn a {\em decision boundary} that separates examples of different classes. }
\label{fig:boundary}
\end{wrapfigure}
Papernot et al. say that ``adversarial examples'' are inputs specially crafted to cause a machine learning model to produce an incorrect output~\cite{papernot2016transferability}.
In \fig{boundary}, such adversarial examples fall
close to the lines indicating the decision boundary.
One thing to notice about \fig{boundary} is that the decision boundary found by different learners is (approximately) the same shape. That is, in this small example, the data has a certain ``shape'' and all
the learners are finding nearly the same shape. This implies that, at least for this example,
we could use one learner (e.g. decision trees) to find adversarial examples that could confuse
another learner (e.g. a neural network).
% esnembles = XXX 5 experts
Papernot et al.~\cite{papernot2016transferability} asked what information the adversary needs in order to learn adversarial examples.
They
studied defenders who use some malware detector LEARNER1 for recognizing malicious examples. If an attacker
applies some other learner (which we will call LEARNER2) to a log of the behaviour of a defended system,
then, they showed empirically, adversarial examples from LEARNER2 are also adversarial examples for LEARNER1.
That is, attackers can find ways to defeat the defenses {\em without having to know any details about the defending learner}. This is a troubling conclusion since it tells us that there exist general methods for attacking any set of malware detectors.
One defense against an adversary might be to use a learner that builds decision boundaries via some extra dimensions synthesized
in a learner-specific manner. For example:
\bi
\item
The first few layers of a deep neural network or the kernel of a support vector machine first convert data into some extra dimensional space before learning a classifier.
\item In theory, if the extra dimensions generated by some defending LEARNER1 find a radically different shape for the decision boundary, then some external
attacking LEARNER2 might have trouble relearning that boundary
(and, hence, also have difficulty discovering adversarial examples).
\item
Unfortunately, in practice, this has not proved to be a useful defense against adversaries. Papernot et al.~\cite{papernot2016transferability} showed
that adversarial examples transfer both between learners that use raw dimensions (kth-nearest neighbor, decision trees, logistic regression) as well
as those that use extra dimensions (deep neural networks, support vector machines); the sketch below illustrates this transfer effect on a toy example.
\ei
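The following toy sketch illustrates that transfer effect under strong simplifying assumptions (synthetic data, a logistic-regression surrogate whose input gradient has a closed form, and a random-forest defender); it is not taken from the cited studies, and the perturbation size is arbitrary.
\begin{lstlisting}[language=Python]
# Toy illustration of transfer: craft FGSM-style examples against a
# surrogate (LEARNER2), then test them on a different defender (LEARNER1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

defender  = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)  # LEARNER1
surrogate = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)                         # LEARNER2

# For logistic regression the input gradient of the loss is (p - y) * w,
# so an FGSM step against the surrogate needs no access to the defender.
w = surrogate.coef_.ravel()
p = surrogate.predict_proba(X_te)[:, 1]
grad = (p - y_te)[:, None] * w[None, :]
X_adv = X_te + 0.3 * np.sign(grad)

print("defender accuracy, clean examples      :", defender.score(X_te, y_te))
print("defender accuracy, transferred examples:", defender.score(X_adv, y_te))
\end{lstlisting}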
\begin{wrapfigure}{r}{1.5in}
\includegraphics[width=1.5in]{fig/kh.png}
\caption{Decision boundaries for two random classifiers. From Khasawneh et al.\cite{khasawneh2017rhmd}.}
\label{fig:kh}
\end{wrapfigure}
\noindent Khasawneh et al.\cite{khasawneh2017rhmd} and Demontis et al.~\cite{demontis2019adversarial} explain the Papernot et al. conclusion as follows.
They argue that, for defeating an adversary,
the {\bf properties of the learner} are {\em less important} than
the {\bf properties of the decision boundary}.
Specifically,
Demontis et al.~\cite{demontis2019adversarial}
rank various defence strategies from best to worst defenders, then show
that the best defenders have more ``wriggle'' in their decision boundary. They found
empirically that increased variance in the position of the boundary predicted which defense was most successful.
For an explanation
of the Demontis et al. results, we turn
to some comments from
Khasawneh et al.~\cite{khasawneh2017rhmd}.
Firstly, Khasawneh et al.
offer an intuition based on \fig{kh}.
This figure shows the decision boundaries of two learners trying to learn boundaries
$H_1$ and $H_2$. The diagram is deliberately drawn so the decision boundaries are not mutually exclusive: $H_1$ recognizes malware in regions 2,3 and $H_2$
recognizes malware in regions 3,4. To fool both, the malware has to find the region which both detectors say is ``normal'' (in \fig{kh}, that is region~1).
To isolate region~1,
an attacker has to learn a decision boundary of a higher complexity class which, as Khasawneh et al. observe, requires an exponentially larger number of samples.
Using this example, Khasawneh et al. propose defeating adversaries with a large set of candidate learners (which they generate by training on different features), of which a random subset is used for malware detection at any given time.
To back up that intuition, Khasawneh et al.~\cite{khasawneh2017rhmd} offer a theoretical discussion
of how poorly an attacker can locate the decision boundary (and, hence, be able
to find adversarial examples). In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. Its analysis is general
to any learner seeking to learn a hypothesis from a sample of data.
Hence PAC is a unifying framework
for discussing limits to learning
for all the machine learning algorithms currently in widespread use~\cite{valiant84}.
Given an ensemble of $H$ learners, each with its own error $e(\hat{H_i})$, then according to PAC learning theory, the upper bound on the error of the attacker is
\begin{equation}\label{one}
2(\mathit{max}\; e(\hat{H_i}))
\end{equation}
i.e. twice the worst error of any one defending learner. This result has a clear intuition: the more mistakes the defender makes, the harder it becomes for the attacker to learn the policies of the defender. Khasawneh et al.~\cite{khasawneh2017rhmd} warn that this kind of defense has an unwanted side-effect. Specifically, it can reduce malware detector performance.
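For example (an illustrative calculation only, not a result from that prior work): if an ensemble's members have errors $e(\hat{H_i}) \in \{0.02, 0.05, 0.10\}$, then Equation~\ref{one} bounds the attacker's error at $2 \times 0.10 = 0.20$; deliberately admitting a member with error $0.25$ raises that bound to $0.50$, but at the cost of weaker baseline detection.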
In this proposal we call this the {\bf trade principle:}
\begin{quote}
{\em The above theorem (Equation~\ref{one}) suggests a trade-off between the accuracy of the defender under no reverse-engineering vs. the susceptibility to being reverse-engineered: using low-accuracy but high-diversity classifiers allow the defender to induce a higher error rate on the attacker, but will also degrade the baseline performance against the target. }
\end{quote}
The Khasawneh et al. remarks are a clear research challenge; i.e. {\bf how much to surrender?}
PAC learning tells us that when we are trying to confuse an adversary, there will be {\em some} reduction in defender efficacy. The engineering challenge will be to {\em minimize that reduction}.
Another issue relates to the {\bf value of adjusting the decision boundary}. Here, we propose {\IT}, i.e. creating an ensemble
from a very large set of classifiers.
But one theoretical problem with our proposal is that a learner that makes decisions from a set of boundaries (e.g. an ensemble learner) is still a learner with its own decision boundary (albeit assembled from its parts). Hence, in theory,
it might still be possible for an adversary to learn how to confuse an ensemble learner.
Right now, the empirical evidence is split on that point. Papernot et al.~\cite{papernot2016transferability} found that even when they used ensemble learning, an adversary could still find adversarial examples within its own learner, which then ``transferred'' to the defending ensemble (i.e. an example that was adversarial
in the attacker's learner was also an adversarial example within the defending ensemble).
But in our own experiments
with {\IT}.1, we found that it is possible to mitigate attacks using ensembles (those experiments are discussed below).
The research of this proposal would seek to understand the reason for these inconsistent results.
As shown in \tbl{methods} (right-hand-side column), {\IT}.1 built ensembles using different
design choices.
But which of those different choices actually matter? Which can be improved? These questions are addressed below.
\subsection{Preliminary Results with {\IT}.1}
This section reviews
the {\IT}.1 results.
That review, plus the above theory,
will lead to a set of research questions
that must be addressed in {\IT}.2.
{\IT} is a hyperparameter optimizer that tries many
configurations of the learner $C_1,C_2,...$ before it recommends some configuration $C'$ as the ``best'' configuration.
Since any attacker watching the same data might also be able to learn $C'$, the premise of ${\IT}$ is:
\begin{center}
{\em To \underline{most} confuse an attacker, use some configuration that is \underline{most} distant from $C'$.}
\end{center}
The engineering challenge of {\IT} is to find that distant
configuration while, at the same time, also selecting configurations that are effective for detecting malicious software. To that end,
{\IT}.1 runs in four stages:
\[\mathit{optimize}(N_1) \rightarrow \mathit{explore}(N_2,N_3) \rightarrow \mathit{select}(N_4) \rightarrow \mathit{weight}(W)
\]
where $\{N_1,N_2,N_3,N_4,W\}$ are control parameters, explained below.
{\IT}.1's {\em optimize, explore}
stages assume that, prior to being attacked,
there exists {\em DATA$_1$}; i.e. a track record
of prior examples of malicious and non-malicious inputs.
{\em Optimize} explores the parameter space of the learner (e.g. \tbl{hyperparameters}) in order to improve that learner and {\em explore}
prunes away the less useful optimization results. Two issues that we have to address in this proposal are that
(a)~{\em optimize} can be slow to execute\footnote{Accordingly, our {\bf RQ1} of \S\ref{rq} will try to speed that up.} and
(b)~our current implementation
only adjusts the internals of the learner and not the input features\footnote{Accordingly, our {\bf RQ2} of \S\ref{rq} will assess the merits
of feature vs parameter perturbation.}.
Once the attack starts,
it generates a second set of examples
{\em DATA$_2$} that show how well the defender can respond to the attack.
{\em Select} picks the learners that best address the attack, then
{\em weight} combines those together to form the final ensemble.
One drawback with the current implementation of {\em select} is that it ignores
Khasawneh et al.'s {\em trade principle}\footnote{Accordingly, our {\bf RQ3} of \S\ref{rq} will check if using less-than-best optimizations results is effective.}
so clearly we need to further explore the balance between
effectively confusing the attacker and effectively recognizing malicious
software\footnote{Which we will do in {\bf RQ4} of \S\ref{rq}.}.
Another issue is that
{\em DATA$_2$} is needed during the {\em select}
and {\em weight} stages. That is to say, {\IT}.1 needs access
to information on the specific
attacker that is active at the current time (which, in turn, means our tool only
works for known attacks and not for previously-unseen and currently undetected attacks)\footnote{Accordingly, in our {\bf RQ5} of \S\ref{rq},
we will try to avoid that.}.
Yet another issue with {\IT}.1 is that the
{\em weight} sub-routine uses a rather simple ensemble combination rule (weighted majority vote)\footnote{Accordingly, in {\bf RQ6} of \S\ref{rq},
we need to explore a wider range of options.}.
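To make the four stages concrete, the following is a deliberately simplified, hypothetical sketch of the $\mathit{optimize} \rightarrow \mathit{explore} \rightarrow \mathit{select} \rightarrow \mathit{weight}$ flow. It is {\em not} the {\IT}.1 implementation: the toy data, the tiny values of $N_1,N_2,N_4$, the scikit-learn learner, and the accuracy-based weights are all illustrative assumptions, and the distance-to-$C'$ criterion is omitted for brevity.
\begin{lstlisting}[language=Python]
# Hypothetical sketch of an optimize -> explore -> select -> weight pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
# X1/y1 stand in for DATA_1 (pre-attack); X2/y2 stand in for DATA_2 (post-attack).
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

def sample_config():
    # Draw one candidate configuration (cf. the hyperparameter table).
    return dict(hidden_layer_sizes=(int(rng.integers(30, 150)),),
                alpha=10.0 ** rng.uniform(-5, -1),
                learning_rate_init=float(rng.choice([0.001, 0.01, 0.1])))

def fit(cfg):
    return MLPClassifier(max_iter=200, random_state=0, **cfg).fit(X1, y1)

# Stage 1 (optimize): evaluate N1 candidate configurations on pre-attack data.
N1 = 20                                            # kept tiny for this sketch
pool = [(cfg, fit(cfg)) for cfg in (sample_config() for _ in range(N1))]

# Stage 2 (explore): prune to the N2 candidates that perform acceptably pre-attack.
N2 = 10
pool = sorted(pool, key=lambda p: -p[1].score(X1, y1))[:N2]

# Stage 3 (select): once attack data arrives, keep the N4 members that handle it best.
N4 = 5
selected = sorted(pool, key=lambda p: -p[1].score(X2, y2))[:N4]

# Stage 4 (weight): weighted-majority vote, weights = post-attack accuracy.
W = np.array([m.score(X2, y2) for _, m in selected])
votes = np.array([m.predict(X2) for _, m in selected])
ensemble_pred = (W @ votes) / W.sum() >= 0.5
print("ensemble accuracy on (stand-in) attack data:", (ensemble_pred == y2).mean())
\end{lstlisting}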
While we are critical of certain aspects of {\IT}.1, the current
results with that system are most encouraging.
Using the data of Table~\ref{tbl:dataInPhase},
{\IT}.1 was used to defend against the attackers described in \tbl{attack}:
BIM-A, BIM-B~\cite{kurakin2016adversarial}, DeepFool~\cite{moosavi2016deepfool}, FGSM~\cite{goodfellow2014explaining}, JSMA~\cite{papernot2016limitations}, and Carlini-Wagner~\cite{carlini2017towards}~\cite{DBLP:conf/ccs/Carlini017}.
In all these experiments:
\bi
\item
The deep learner specified in
\tbl{hyperparameters} was used to recognize malicious PDF files.
\item
The {\em optimize, explore, select, weight}
process described above was applied
(so BIM-A, BIM-B, DeepFool, FGSM, JSMA, and Carlini-Wagner
started their attack after the {\em explore}
stage).
\item
During the {\em optimize} stage, $N_1=1000$ evaluations were made while trying to improve
accuracy.
\ei
\begin{table}[!t]
\centering
\small
\begin{tabular}{r|r|r|r|r|p{2cm}|r}
\hline
\rowcolor[HTML]{ECF4FF}
\multicolumn{1}{c|}{\cellcolor[HTML]{ECF4FF}} & \multicolumn{2}{c|}{\cellcolor[HTML]{ECF4FF}\textbf{Training Phase}} & \multicolumn{2}{c|}{\cellcolor[HTML]{ECF4FF}\textbf{Testing Phase}} & \multicolumn{2}{c}{\cellcolor[HTML]{ECF4FF}\textbf{Total}} \\ \cline{2-7}
\rowcolor[HTML]{ECF4FF}
\multicolumn{1}{c|}{\multirow{-2}{*}{\cellcolor[HTML]{ECF4FF}\textbf{Dataset}}} & \multicolumn{1}{c|}{\cellcolor[HTML]{ECF4FF}\textbf{Benign}} & \multicolumn{1}{c|}{\cellcolor[HTML]{ECF4FF}\textbf{Malicious}} & \multicolumn{1}{c|}{\cellcolor[HTML]{ECF4FF}\textbf{Benign}} & \multicolumn{1}{c|}{\cellcolor[HTML]{ECF4FF}\textbf{Malicious}} & \multicolumn{1}{c|}{\cellcolor[HTML]{ECF4FF}\textbf{Benign}} & \multicolumn{1}{c}{\cellcolor[HTML]{ECF4FF}\textbf{Malicious}} \\ \hline
NSL-KDD & 67,343 & 58,530 & 12,833 & 9,711 & 80,176 & 68,341 \\
CSE-CIC-IDS2018 & 535,701 & 102,379 & 133,926 & 25,595 & 669,627 & 127,974 \\
CIC-IDS-2017 & 363,410 & 89,050 & 90,853 & 22,262 & 454,263 & 111,312 \\
CICAndMal2017 & 193,777 & 224,873 & 48,445 & 56,218 & 242,222 & 281,091 \\
Contagio PDF Malware & 8,821 & 9,199 & 2,205 & 2,300 & 11,026 & 11,499 \\\hline
\multicolumn{7}{|p{6.3in}|}{
\bi
\item
\textit{NSL-KDD}~\cite{nsl-kdd} dataset is an improved version of KDD'99 dataset~\cite{tavallaee2009detailed}, which recorded network traffic under different types of attacks.
\item
\textit{CIC-IDS-2017}~\cite{sharafaldin2018toward} comprises normal traffic and simulated abnormal data caused by intentional attacks on a test network. Built using the NetFlowMeter Network Traffic Flow analyser, it holds a week of data collection with 225,745 packets and over 80 features.
\item
\textit{CSE-CIC-IDS2018}~\cite{sharafaldin2018toward}, collected
in 2018 by the Canadian Institute for Cybersecurity (CIC) on AWS (Amazon Web Services),
includes various attack scenarios such as Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside.
\item
\textit{CICAndMal2017}~\cite{lashkari2018toward} is an Android malware dataset that collects 426 malicious and 1,700 benign applications collected from 2015 to 2017 by researchers at the University of New Brunswick (UNB). The malicious samples are split into four categories (Adware, Ransomware, Scareware, SMS Malware) and 42 families. In addition to providing the APK files, the authors also ran each malicious sample on real Android smartphones and captured network traffic during installation, before restart, and after restart.
\item
\textit{Contagio PDF Malware}~\cite{contagio-pdf} dataset is widely available and designated for signature research and testing. This source of data sets was selected because it contains a large number
of labeled benign and malicious PDF documents, including a relatively large number from targeted attacks.\ei}\\\hline
\end{tabular}
\caption{Some data sets used in our experiments.}\label{tbl:dataInPhase}
\end{table}
\begin{table}[!t]
\small
\begin{tabular}{|p{\linewidth}|}\hline
With the \textit{\textbf{Fast Gradient Sign Method (FGSM)}}~\cite{DBLP:journals/corr/GoodfellowSS14}, data in a dataset $x$ is modified by adding or subtracting an almost imperceptible error of $\epsilon$ as follows: $x_{adv} = x + \epsilon * sign(\nabla_{x}\mathcal{L}(\theta, x, y))$ where
$x_{adv}$ is the adversarial dataset, $x$ is the original data, $y$ is the original label, $\theta$ denotes the model parameters, $\nabla$ is the gradient, and $\mathcal{L}$ is the loss function. To minimize the perturbation, FGSM uses the sign of the gradient, not its actual value, and scales that by a small factor $\epsilon$.\\
\rowcolor{blue!10}
\textit{\textbf{Basic Iterative Method (BIM)}}~\cite{KurakinGB17a} is an iterative version of FGSM where a small perturbation is added in each iteration. The BIM-A method stops the iteration as soon as misclassification is achieved while, in BIM-B, iterations stop after a fixed number of rounds.\\
\textit{\textbf{Jacobian-based Salience Map (JSMA)}}~\cite{papernot2016limitations} is an iterative method that achieves misclassification of an input as any pre-specified class.
The feature $x_{(i)}$ with maximum {\em saliency} is perturbed by some value $\epsilon$, where
{\em saliency} is defined as follows (intuitively, it seeks the changes to $x$ that
most affect the classification $c$):
{\scriptsize\[
S^{+}(x_{(i)}, c) = \begin{cases}
0 \quad \text{if $\frac{\partial f(x)_{(c)}}{\partial x_{(i)}} < 0$ or ${\mathlarger\sum}_{c^{'}\neq c} \frac{\partial f(x)_{(c^{'})}}{\partial x_{(i)}} > 0$} \\
\\
-\frac{\partial f(x)_{(c)}}{\partial x_{(i)}} \cdot {\mathlarger\sum}_{c^{'}\neq c} \frac{\partial f(x)_{(c^{'})}}{\partial x_{(i)}} \quad \text{otherwise} \\
\end{cases}
\]}\\
\rowcolor{blue!10}
\textit{\textbf{DeepFool}}~\cite{moosavi2016deepfool} works in an iterative manner to minimize the Euclidean distance between perturbed samples and original samples (i.e., the $L_{2}$ distance metric). An approximation to the decision boundary is calculated and then a perturbation is applied perpendicular to that boundary. The algorithm terminates once misclassification is achieved.\\
\textit{\textbf{Carlini-Wagner (C \& W)}} attack~\cite{carlini2017towards}~\cite{DBLP:conf/ccs/Carlini017} seeks the smallest perturbation that can cause a misclassification. If we consider input $x \in [0,1]^{n}$ and noise $\delta \in [0,1]^{n}$, this attack finds the adversarial instance by finding the smallest noise $\delta \in \mathbb{R}^{n}$ added to the input $x$ that will change the classification to a class $t$. The noise level is measured in terms of $L_{p}$ distance. Finding the minimum $L_{p}$ distance of $\delta$ while ensuring that the output will have a class $t$ is not a trivial optimization problem. Instead, the Carlini-Wagner attack solves an optimization problem of the form $\underset{\delta \in \mathbb{R}^{n}}{\min} \quad ||\delta||_{p} + c \cdot f(x + \delta)$
where $x + \delta \in [0,1]^{n}$, $f$ is an objective function that drives the input $x$ to be misclassified, and $c > 0$ is a suitably chosen constant. The $L_{0}$, $L_{2}$, and $L_{\infty}$ norms are considered.
\\\hline
\end{tabular}
\caption{In our experiments, {\IT} was attacked by these methods.}\label{tbl:attack}
\end{table}
\noindent
\fig{rescue} shows some results after BIM-A attacked {\IT}.1.
That figure's
horizontal red line shows the performance of the $C'$
malicious software detector
(measured as the accuracy of the defender's classifier trying to recognize malicious software). Note that this performance is achieved after the {\em optimize} and {\em explore} stages and {\em before} the attacker arrives.
~{\bf Model0} shows the impact of the attack
before {\IT}.1 tries any mitigations.
The purple dots show a range of classifiers generated during the {\em explore} stage. These dots are spread across the x-axis, which measures the Euclidean distance between the $C'$ of the red line and the configuration of each purple dot.
The {\em rescue set} shows the classifiers that are far from the optimizer's best configuration $C'$ and which, during {\em select}, were found to perform well against the attack.
This {\em rescue set} is used in the {\em weight} stage to build an ensemble.
\definecolor{green1}{rgb}{0.12, 0.58, 0.29}
\definecolor{yellow1}{rgb}{0.99, 0.64, 0.22}
\definecolor{red1}{rgb}{0.91, 0.10, 0.20}
\definecolor{blue1}{rgb}{0.18, 0.46, 0.90}
\begin{wrapfigure}{r}{2.3in}
\includegraphics[width=2.3in]{fig/rescue.png}
\caption{ {\IT}.1 defending Contagio PDF against
BIM-A.}
\label{fig:rescue}
\end{wrapfigure}
\fig{all20} shows results from six attackers and five data sets (so the \fig{rescue} results appear in \fig{all20}.c).
That figure shows results with the various attackers as well as a ``straw man'' ensemble method where the defender divides her data into ten 90\% samples and uses results from their median (in the literature, this is called an ``average weight ensemble''). In that figure:
\bi
\item
The \colorbox{blue1}{\textcolor{white}{blue}} and \colorbox{red1}{red} bars show the accuracy before and after the attack
(respectively).
\item
The \colorbox{yellow1}{yellow} bars show the accuracy after applying {\IT}.1.
\ei
From \fig{all20}, we highlight three key observations:
\bi
\item {\em After} the attack and {\em before} {\IT}.1 applies
its defenses, it is clear that the attacker can damage the defender (evidence: the
\colorbox{red1}{red} bars are much lower than the \colorbox{blue1}{\textcolor{white}{blue}} bars).
\item
{\em After} {\IT}.1 applies
its defenses, it still cannot fully mitigate against the attacks (evidence: the
\colorbox{yellow1}{yellow} bars are lower than the pre-attack
\colorbox{blue1}{\textcolor{white}{blue}} bars). Note that this is to be expected
(indeed, it is an example of Khasawneh et al.'s {\bf trade principle} described above; i.e. when we are trying to confuse an adversary, there will be some reduction in defender efficacy, and the engineering challenge is to minimize that reduction).
\item
That said,
{\IT}.1 extensively mitigates the attacks (evidence: the
\colorbox{yellow1}{yellow} bars are much taller than the \colorbox{red1}{red} bars). This result encourages us to further explore
{\IT}.1 (hence this proposal).
\ei
% While these results motivate us to explore {\IT}.1, they come with all the caveats listed above; i.e.
% \bi
% \item The optimization stage needs to be faster (hence, {\bf RQ1}).
% \item The {\em explore} stage made some arbitrary design decisions that need more
% exploration (hence, {\bf RQ2}).
% \item Also, the {\em explore} stage did not take advantage of Khasawneh et al.;s
% {\bf trade principle} which conjectures that better defences might result if we draw our ensemble from (slightly) less-than-optimal classifiers. This conjecture, backed up by PAC learning theory, merits further exploration (hence, {\bf RQ3}).
% \item Ideally, the {\em select} phase builds its ensembles without feedback from
% the specific attacker that is currently active (hence, {\bf RQ4}).
% \item {\IT}.1's {\em weight} stage made some arbitrary design decisions that need more
% exploration (hence, {\bf RQ5}).
% \ei
~
\section{Research Questions}\label{rq}
As said above, {\IT}.1
comprises four stages, {\em optimize, explore, select, weight},
which are controlled by the parameters $N_1,N_2,N_3,N_4,W$.
This section reviews the internal design decisions of those stages in
detail. Based on shortcomings in those decisions, we propose various research questions that will result in the {\IT}.2
system.% that we will use in {\bf RQ7}
\subsection{Evaluation Principles}
Carlini et al.~\cite{carlini2019evaluating} provide practical advice for evaluating defenses that are intended to be robust to adversarial examples. Specifically, that work offers a list of principles for rigorous evaluations and a checklist of recommendations. Motivated by this work, here we list the suggestions that are applicable to our work and treat them as our guideline. We also add some other principles that go beyond Carlini et al.'s suggestions.
\bi
\item \textit{P1}: State a precise threat model.
\item \textit{P2}: Release pre-trained models and source code.
\item \textit{P3}: Report clean model accuracy when not under attack.
\item \textit{P4}: Generate an attack success rate vs. perturbation budget curve.
\item \textit{P5}: Describe the attacks applied, including all attack hyperparameters.
\item \textit{P6}: Apply a diverse set of attacks.
\item \textit{P7}: Record the execution time of hyperparameter learning phase.
\item \textit{P8}: Any work with a stochastic component needs to repeat experiments 20 times, study effect size, and apply a significance test (a minimal sketch follows this list).
\item \textit{P9}:
\item \textit{P10}:
\item \textit{P11}:
\item \textit{P12}:
\ei
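To make \textit{P8} concrete, the following is a minimal sketch of how such a comparison might be scripted. It is illustrative only: the two score lists stand in for 20 repeated runs of two hypothetical defenders, the significance test is the Mann-Whitney U test from \textit{scipy}, and the effect size is a hand-rolled Cliff's delta (the $|\delta| < 0.147$ ``negligible'' threshold is a common convention, not a requirement of this proposal).
{\small
\begin{verbatim}
# Minimal sketch (illustrative only): significance test + effect size for P8.
# The accuracy lists below are hypothetical stand-ins for 20 repeated runs.
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    # Fraction of (x, y) pairs where x wins, minus fraction where y wins.
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

defender_a = [0.81, 0.79, 0.83, 0.80, 0.82] * 4   # 20 runs of treatment A
defender_b = [0.74, 0.76, 0.73, 0.75, 0.77] * 4   # 20 runs of treatment B

stat, p = mannwhitneyu(defender_a, defender_b, alternative="two-sided")
delta = cliffs_delta(defender_a, defender_b)
print(f"p-value={p:.4f}  Cliff's delta={delta:.2f}")
# Report a real difference only if p < 0.05 and |delta| is non-negligible (e.g. > 0.147).
\end{verbatim}}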
\subsection{Stage1: $N_1$ {\em Optimizations}}
Every learner comes with control parameters that adjust how it generates a model.
For example, \tbl{hyperparameters} lists some of the parameters that might be used to configure a deep learner trying to recognize (e.g.)
malicious software hiding inside a PDF file.
{\IT}.1 uses hyperparameter optimization
to decide which hyperparameter settings to use.
The Tree-structured Parzen Estimator (TPE) algorithm
described in \tbl{egconfig}
is used to control the complex exploration of the deep learner hyperparameter space.
But even with TPE's intelligent culling of bad optimizations, this {\em optimize} stage can be very slow.
We found we need at least $N_1=1000$ evaluations
(in step four, listed above) to achieve good
classification results for malicious software
(on the ``Contagio PDF Malware'' example of
Table~\ref{tbl:dataInPhase}).
In practice, that meant TPE required five to ten hours to terminate. This is troubling since, when attackers have more CPU than defenders, they might be able to learn to adapt faster
than we can defend. Hence we must ask:
\begin{blockquote}
\noindent
\textbf{Research Question 1}:
How can we find and evaluate classifiers faster (via $M_1\ll N_1$ evaluations)?
\end{blockquote}
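To make the cost of the current {\em optimize} stage concrete, here is a minimal sketch of how such a TPE-driven loop might be scripted, assuming the open-source \textit{hyperopt} library. The search space and the \texttt{train\_and\_score} stub are illustrative placeholders (not {\IT}.1's actual code); the point is only that each of the $N_1=1000$ evaluations pays the full cost of training a fresh classifier.
{\small
\begin{verbatim}
# Minimal sketch (illustrative only) of a TPE-driven optimize stage.
# Assumes the hyperopt library; train_and_score is a hypothetical stand-in
# for training a classifier on Data_1 and returning its loss.
import random
from hyperopt import fmin, tpe, hp, Trials

space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),
    "layers":        hp.choice("layers", [1, 2, 3, 4]),
    "dropout":       hp.uniform("dropout", 0.0, 0.5),
}

def train_and_score(params):
    # Stand-in for the expensive step: train a deep learner with these
    # hyperparameters and return its holdout loss. Here: a random number.
    return random.random()

trials = Trials()
best = fmin(train_and_score, space, algo=tpe.suggest,
            max_evals=1000, trials=trials)     # N_1 = 1000 evaluations
print(best)
\end{verbatim}}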
Much research has explored ways to speed up the process of generating and evaluating multiple configuration options (see \tbl{egconfig}).
While these methods can automatically explore different configurations, they all suffer from the same problem:
{\em exploring benchmark sets for different configurations is very slow}. This is particularly true when trying to configure deep learners like \tbl{hyperparameters} since, every time we need to test a configuration, we must pause and train a new deep learner (which can be a slow process).
\begin{wrapfigure}{r}{4in}
\includegraphics[width=4in]{fig/res1.png}
\caption{{\IT}.1 results: five data sets, six attackers.
The y-axis shows how accurately the defender can recognize malicious software
(so {\em larger} y-axis values mean {\em more} security).}
\label{fig:all20}
\end{wrapfigure}
\begin{table}[!htbp]
\small
\begin{tabular}{|p{.99\linewidth}|}\hline
\rowcolor{blue!10}
In \textit{TPE}~\cite{bergstra2011algorithms},
the results of the optimizations seen so far are divided on some loss $y^*$;
TPE then builds two models: $L$ and $G$. These models compute the likelihood of examples having losses \underline{{\bf L}}ess than or \underline{{\bf G}}reater than $y^*$.
Next, some generation process proposes candidate optimizations $\{x_1,x_2,...\}$. Proposals that maximize ${L(x_i)/G(x_i)}$ are ``best'' since they are most likely under the
lower-loss model
(while being unlikely under the higher-loss model). These
``best'' proposals are then tested via some evaluation function (in our case, learning a classifier from {\em Data$_1$} to recognize malicious software).
This results in new examples, which means TPE can run again, this time with updated $L,G$ models (see the sketch after this table).\\
\textit{Grid search}~\cite{bergstra2011algorithms, tantithamthavorn2016automated} is a ``brute force'' hyperparameter optimizer that wraps a learner in for-loops that walk through a wide range of the learner's control parameters. After just a handful of options, grid search suffers from the ``curse of dimensionality''; hence, it can waste much CPU while still missing important optimizations~\cite{bergstra2012random}.\\
\rowcolor{blue!10}
An alternative to grid search is some form of \textit{random search}~\cite{bergstra2012random} that stochastically samples the search space. For example,
stochastic evolutionary algorithms run in ``generations'' where each new generation is seeded from the best examples selected from the last generation~\cite{goldberg2006genetic}. {\em Simulated annealing} is a special form of evolutionary algorithm where the population size is one~\cite{kirkpatrick1983optimization,Menzies:2007a}.
\\
\textit{Genetic algorithms (GAs)} are another form of random search where the population size is greater than one and new mutants are created by crossing over parts of the better members of the current population~\cite{goldberg2006genetic,Panichella:2013}.
There are many forms of GAs including NSGA-II~\cite{deb2002fast}, IBEA~\cite{zitzler2004indicator}, MOEA/D~\cite{zhang2007moea} and many others.
\\\rowcolor{blue!10}
{\em Reinforcement learning} takes feedback from prior executions to update weights on current options~\cite{DBLP:journals/corr/abs-1902-06583,sutton2018reinforcement,Li:2017}.
\\
{\em Surrogate optimization}~\cite{alipourfard2017cherrypick, snoek2012practical,brochu2010tutorial}
uses the current model to make guesses about the as-yet-unevaluated options~\cite{eggensperger2013towards}, then evaluates the most ``informative'' guess (where ``informative'' is defined via some domain-specific ``acquisition function''
that might, e.g., favor guesses with the highest value as well as the highest variance).
By only evaluating the most informative guess, surrogate optimization can
make far fewer evaluations to find better configurations (compared to more expensive prior work~\cite{guo2013variability, sarkar2015cost, nair2018finding, nair2017using}).
In the literature, surrogate optimization is also called ``Bayesian optimization'' or
``sequential model-based optimization''.
\\\hline
\end{tabular}
\caption{Standard algorithms used to automatically configure software.}\label{tbl:egconfig}
\end{table}
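To make the TPE entry of \tbl{egconfig} concrete, the following toy sketch shows just the candidate-selection step for a single numeric hyperparameter, using Gaussian kernel density estimates for the $L$ and $G$ models. It is an illustration of the $L(x)/G(x)$ ratio under simplifying assumptions, not the full algorithm of Bergstra et al.~\cite{bergstra2011algorithms}.
{\small
\begin{verbatim}
# Toy sketch of TPE's candidate-selection step (one numeric hyperparameter).
import numpy as np
from scipy.stats import gaussian_kde

rng    = np.random.default_rng(1)
xs     = rng.uniform(0, 1, 50)                       # settings tried so far
losses = (xs - 0.3) ** 2 + rng.normal(0, 0.01, 50)   # their observed losses

y_star = np.quantile(losses, 0.25)          # split the history on some loss y*
L = gaussian_kde(xs[losses <  y_star])      # density of low-loss ("good") settings
G = gaussian_kde(xs[losses >= y_star])      # density of high-loss ("bad") settings

candidates = rng.uniform(0, 1, 100)         # proposals from some generation process
scores     = L(candidates) / G(candidates)  # prefer likely-under-L, unlikely-under-G
best       = candidates[np.argmax(scores)]  # the next configuration to evaluate
print(f"next configuration to evaluate: {best:.3f}")
\end{verbatim}}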
PI Menzies has had much recent success using
{\em epsilon domination}~\cite{deb2005evaluating} to dramatically speed up automatic configuration. In optimization, a solution is said to \textit{dominate} other solutions if and only if it is better in at least one objective and no worse in the other objectives. The set of solutions that are not dominated by any other feasible solution forms the \textit{Pareto frontier}. The yellow dots in Figure~\ref{fig:pareto} (left-hand side) show an example of the output space under $\epsilon$-dominance.
In that figure, the grid pattern shows regions of similar performance (and the axes might be recall versus false alarm).
When there exists some $\epsilon$ value below which it is useless or impossible to distinguish results, then it is superfluous to explore anything less than $\epsilon$~\cite{deb2005evaluating}.
In that case, the space of output performance results divides into a small number of cells (see the grid pattern, left-hand side, of Figure~\ref{fig:pareto}).
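A minimal sketch of these two ideas (Pareto dominance and $\epsilon$-sized output cells) is shown below, assuming that every objective is to be minimized; the function and variable names are illustrative only.
{\small
\begin{verbatim}
# Minimal sketch (illustrative): Pareto dominance and epsilon-sized output cells,
# assuming every objective is to be minimized.

def dominates(a, b):
    # a dominates b: no worse on every objective, strictly better on at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def cell(scores, eps=0.2):
    # Map an output vector (e.g. false alarm, 1 - recall) to its epsilon grid cell.
    return tuple(int(s // eps) for s in scores)

a, b = (0.10, 0.20), (0.15, 0.20)
print(dominates(a, b))   # True: better on the first objective, no worse on the second
print(cell(a), cell(b))  # (0, 1) (0, 1): indistinguishable at eps = 0.2
\end{verbatim}}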
\begin{figure}[!t]
\begin{minipage}{2in}
\includegraphics[width=2in]{fig/pareto.png}
\end{minipage}~~~~~\begin{minipage}{4.2in}
\footnotesize
\begin{tabular}{|p{4.2in}|}\hline
DODGE($\epsilon$) proceeds as follows:
\be
\item
Assign weights $w=0$ to configuration options.
\item
$N=30$ times repeat:
\be
\item
Randomly pick
options, favoring those with the most weight;
\item
Configure and execute data pre-processors and learners using those options;
\item
Divide the output scores into regions of size $\epsilon=0.2$;
\item
If some new configuration has scores within $\epsilon$ of
prior configurations then...
\bi
\item
...reduce the weight of those configuration options ($w=w-1$);
\item otherwise, increase their weight ($w=w+1$).
\ei
\ee
\item Return the best option found in the above.
\ee\\\hline
Note that, after Step 2d, the choices made in subsequent rounds of Step 2a will avoid options whose scores arrive within $\epsilon$ of
other observed scores.\\\hline
\end{tabular}
\end{minipage}
\caption{Left-hand side: outputs $f_1,f_2$, divided into cells of size $(\epsilon=0.1)^2$,
where the goal
is to minimize $f_1,f_2$. Gray dots show feasible solutions. Yellow dots show the
best solutions found so far.\newline
Right-hand side: the DODGE algorithm (a runnable sketch of this loop appears after the figure).}\label{fig:pareto}
\end{figure}
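The sketch below is a minimal, runnable rendering of the DODGE loop from Figure~\ref{fig:pareto} (right-hand side). The option table and the \texttt{evaluate} function are hypothetical stand-ins for real pre-processor/learner configurations and for training on real data; only the weighting-by-$\epsilon$ logic is the point here.
{\small
\begin{verbatim}
# Minimal DODGE-style sketch (illustrative). OPTIONS and evaluate() are
# hypothetical stand-ins for real pre-processor / learner configurations.
import random
random.seed(1)

OPTIONS = [("smote", "knn"), ("smote", "svm"), ("none", "knn"),
           ("none", "svm"),  ("norm",  "rf"),  ("norm", "svm")]
weights = {opt: 0 for opt in OPTIONS}          # step 1: all weights start at zero
EPS, seen_cells = 0.2, set()

def evaluate(option):
    # Stand-in for configuring + running a pre-processor and learner,
    # then returning an output score (e.g. recall). Here: a random number.
    return random.random()

best_option, best_score = None, -1.0
for _ in range(30):                            # step 2: repeat N = 30 times
    top  = max(weights.values())
    pool = [o for o in OPTIONS if weights[o] == top]   # 2a: favor heaviest options
    option = random.choice(pool)
    score  = evaluate(option)                  # 2b: configure and execute
    region = int(score // EPS)                 # 2c: epsilon-sized output region
    if region in seen_cells:                   # 2d: result is not novel...
        weights[option] -= 1                   # ...so penalize these options
    else:
        weights[option] += 1                   # novel result: reward these options
        seen_cells.add(region)
    if score > best_score:
        best_option, best_score = option, score
print(best_option, round(best_score, 2))       # step 3: return the best option found
\end{verbatim}}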
When that happens, the configuration problem reduces to finding
one configuration per cell. This can result
in a very large reduction in the configuration search space.
For example, to statistically evaluate a learner, it is common
to call that
learner 10 times, each time with 90\% of the
data. Suppose across those 10 runs, there is a mean $\pm 5\%$
variation in the output performance. Assuming normality,
then scores that differ by less than $\epsilon=2\times 1.96\times 0.05=0.196$ (the width of a 95\% confidence interval) are statistically indistinguishable. Hence, for learners evaluated on (say) $N=2$ dimensions (e.g. the grid pattern on the left of Figure~\ref{fig:pareto}),
those scores divide into just $(1/\epsilon)^{N}=(1/0.196)^2\approx 26$ different regions.
To test the impact of $\epsilon$-domination,
PI Menzies and his student Amritanshu Agrawal~\cite{dodge}
have ``dodged'' around the