Android Malware Detection

Machine Learning Models and Fake-Shop Dataset

Please note that the repository contains neither pre-trained machine learning models nor datasets. The MAL2 project has, however, collected, curated and annotated an archived HTML corpus of fake-shops, the largest of its kind, which is free to use for research and scientific purposes. Click here to request access to the corpus dataset.

What does the Android malware-dataset of the MAL2 project include?

The MAL2 Android Malware Ground-Truth dataset was compiled in two iterations. The first iteration was kept small, with 56,392 APKs, and was used to test the proof-of-concept prototype in the project. For this purpose, 45,676 APKs (27,965 of them labeled "Benign") were used as training data, 5,076 as test data and 5,640 as validation data. It contains samples from 430 different PUA families and 25 Trojan families. In the final iteration, 790 thousand APKs consisting of Malware, Adware, Probably Clean and Google Play samples were collected, and their correct allocation was verified using the IKARUS scanner. The developed MAL2 framework was then used to extract features from the ground-truth dataset. The resulting text data is part of the Android malware dataset, which contains 860 GB of data and is available in chunks, free to use for research and scientific purposes.
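
The split sizes quoted for the first iteration can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative and not part of the MAL2 framework, and it assumes the "Benign" count refers to the training portion, as the sentence above suggests.

```python
# Split sizes as stated in the text for the first dataset iteration.
splits = {"training": 45_676, "test": 5_076, "validation": 5_640}
benign_training = 27_965  # assumed to be a subset of the training split

total = sum(splits.values())


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


split_shares = {name: share(count, total) for name, count in splits.items()}
benign_share = share(benign_training, splits["training"])

print(total)         # 56392
print(split_shares)  # {'training': 81.0, 'test': 9.0, 'validation': 10.0}
print(benign_share)  # 61.2
```

The three parts add up to the stated 56,392 APKs, which corresponds to roughly an 81/9/10 split, with about 61% of the training samples labeled benign.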

Background on the MAL2 research project

Neural Networks are a dominant force in machine learning and are responsible for the massive momentum in deep learning in numerous application domains. MAL2 will apply Deep Neural Networks and Unsupervised Learning to advance cybercrime prevention by a) automating the discovery of fraudulent eCommerce and b) evaluating the capabilities of detecting Potentially Harmful Apps (PHAs) in Android operating systems.

Online shopping is commonplace, with 61.6% of Austrians already using this form of commerce. The turnover of the top 250 online shops in Austria in 2016 was €2.3 billion, which corresponds to growth of around 9 percent compared to the previous year. Ripping off customers through fraudulent eCommerce shops is a rapidly growing area of cybercrime. Since July 2013, the Internet Ombudsman (ÖIAT) has offered preventive information and maintained a blacklist on the "Watchlist Internet" portal. Exposing such fake offerings, however, is a labour-intensive, manual task, because dozens or more of these copies often exist at the same time; every week, more than 150 new fake online shops are entered for manual verification. MAL2 provides means for advancing the automation and detection of fake-shop cybersquatting through machine learning technologies by classifying sites based on their structural similarity.

With over two billion monthly active devices, the Android operating system for tablets, phones and smart devices is by far the most widespread mobile operating system in the world. Four million new malware programs were released for this platform in 2016. The total market share of exploits that target the Android platform is 21%, which makes it the second most targeted platform for exploit attacks. By Q4 2016, 0.71% of all devices had potentially harmful applications (PHAs) installed. The goal of the project is to train a Neural Network to evaluate the discoverability and explainability of upcoming attack patterns.

The classification capabilities of Neural Networks rely heavily on the quality of the underlying datasets, and even more on the granularity of the extracted features. To date, no web-archive dataset of fraudulent eCommerce sites has been collected and released. MAL2 will harvest and curate two large-scale Ground-Truth datasets, consisting of a) malware/benign applications and b) web-archives of fake-shops, to train its machine learning detection models in the application domains.

Currently there is a lack of technology supporting an integrated solution for large-scale feature extraction and Neural Network training. The goal of the MAL2 project is (i) to release an Open Source framework which provides integrated functionality along the required pipeline, from data extraction and feature composition up to Neural Network training and analysis of results, (ii) to execute its components at large scale with Hadoop and GPU cluster support, and (iii) to publish the harvested Ground-Truth datasets, the extracted features and the trained Neural Networks in both application domains on open data platforms. To visualize the project's results and to raise awareness of cybercrime prevention among the general public, two demonstrators are deployed at Watchlist Internet that allow live inspection of the trustworthiness of eCommerce sites and Android apps.

MAchine Learning detection of MALicious content - Research Proposal

www.malzwei.at - project website
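
The background above mentions classifying sites based on their structural similarity. As a rough illustration of what such a comparison could look like, the sketch below reduces each HTML page to its sequence of opening tags and compares two pages by the Jaccard similarity of their tag bigrams. This is an assumption made for illustration only, not the method used in MAL2; the names `TagCollector`, `tag_shingles` and `structural_similarity` are hypothetical, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, k=2):
    """Return the set of runs of k consecutive opening tags in the page."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    tags = collector.tags
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def structural_similarity(html_a, html_b, k=2):
    """Jaccard similarity of the two pages' tag shingles, between 0.0 and 1.0."""
    a, b = tag_shingles(html_a, k), tag_shingles(html_b, k)
    if not a and not b:
        return 1.0  # two pages without any tag runs count as identical
    return len(a & b) / len(a | b)


# Two pages with the same layout but different text, and one unrelated layout.
shop_a = "<html><body><div><h1>Shoes</h1><p>Cheap</p></div></body></html>"
shop_b = "<html><body><div><h1>Bags</h1><p>Sale</p></div></body></html>"
other = "<html><body><table><tr><td>x</td></tr></table></body></html>"

print(structural_similarity(shop_a, shop_b))  # 1.0
print(structural_similarity(shop_a, other))
```

Because text content is ignored, copies of the same shop template score as identical even when product names differ, which is the intuition behind comparing structure rather than wording. A real classifier would use far richer features than tag bigrams.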
