This repository contains our final project for the Computer Engineering degree at the National Telecommunications Institute, located in Santa Rita do Sapucaí, Brazil, titled "Evidencing CAPTCHA Vulnerabilities using Convolutional Neural Networks".
This document describes the development of a Convolutional Neural Network that assesses the reliability of CAPTCHAs, the security mechanisms present on many websites. We present a theoretical review of the technologies used and of related scientific works, followed by the experiments, metrics, and details of the construction and operation of the neural network. Finally, we present the work's results.
This project is comprised of the following directories:
- dataset-generator: Code for generating the artificial dataset used to train the neural network.
- neural-network: Code for training and testing the neural network.
- experiments: Miscellaneous files and scripts used in the project.
- results: Results collected from the experiments.
To generate an artificial dataset for training the neural network, run the following command in the dataset-generator directory:

$ python fies-generate.py <number_of_samples>

Where <number_of_samples> is the number of sample CAPTCHAs to be generated. This command can take a long time to run, depending on the number of samples. The images will be saved in the dataset/raw folder.
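The fies-generate.py script itself is not reproduced here; as a rough illustration of the idea, a minimal text-CAPTCHA generator can be written with Pillow (the function and label format below are assumptions for illustration, not the project's actual code):

```python
import random
import string
from PIL import Image, ImageDraw

def generate_captcha(text, size=(160, 60)):
    """Render `text` onto a white image and overlay random noise lines
    (uses Pillow's default font; a real generator would vary fonts and
    apply distortions)."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, 20), text, fill="black")
    for _ in range(5):  # noise lines to make recognition harder
        start = (random.randrange(size[0]), random.randrange(size[1]))
        end = (random.randrange(size[0]), random.randrange(size[1]))
        draw.line([start, end], fill="gray", width=1)
    return img

def random_label(length=5):
    """Random alphanumeric label; the character set is an assumption."""
    return "".join(random.choices(string.ascii_uppercase + string.digits, k=length))

label = random_label()
image = generate_captcha(label)
# image.save("dataset/raw/" + label + ".png")  # output path as described above
```

Each image would be saved under a filename containing its label, so the ground truth is recoverable during training.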
After generating the dataset, the images must be filtered and segmented before they can be used to train the neural network. To do that, run the following command:

$ python fies-filter.py

The segmented images will be saved in subdirectories of the dataset/segmented directory.
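fies-filter.py's exact method is not shown here; one common way to segment a binarized CAPTCHA into individual characters is a vertical-projection split, sketched below (a simplification that assumes characters do not overlap horizontally):

```python
import numpy as np

def segment_columns(binary):
    """Split a binarized image (character pixels = 1, background = 0) into
    per-character slices: columns containing no ink mark the boundaries
    between adjacent characters."""
    ink_per_column = binary.sum(axis=0)
    segments, start = [], None
    for x, ink in enumerate(ink_per_column):
        if ink > 0 and start is None:
            start = x                            # a character begins
        elif ink == 0 and start is not None:
            segments.append(binary[:, start:x])  # the character ends
            start = None
    if start is not None:                        # character touching the right edge
        segments.append(binary[:, start:])
    return segments

# Toy image: two 2-column "characters" separated by a blank column
img = np.zeros((4, 7), dtype=int)
img[:, 1:3] = 1
img[:, 4:6] = 1
chars = segment_columns(img)  # two slices of shape (4, 2)
```

Each slice would then be resized to a fixed shape and written to its class subdirectory under dataset/segmented.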
With our dataset ready, we can start training the network. To do that, run the following command in the neural-network folder:
$ python train-network.py
WARNING: This step can consume large amounts of RAM (about 8 GB for 72,000 segmented images). Close any unnecessary programs before running it.
You can uncomment the following lines to enable hardware acceleration on OpenCL-enabled devices (like AMD graphics cards). This can greatly speed up the training process:
import plaidml.keras
plaidml.keras.install_backend()
Various parameters of the network can be changed by editing this script, as shown below:
num_samples = 2000 # number of samples to use on training
epochs = 1024 # number of epochs of training
learning_rate = 1e-3 # learning rate of the network
batch_size = 128 # batch size
validation_split = 0.66 # fraction of the dataset reserved for validation
min_delta = 1e-6 # minimum change in validation accuracy that counts as an improvement
patience = 10 # number of epochs without improvement before stopping training
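The min_delta and patience parameters control early stopping, as in Keras's EarlyStopping callback: training halts after `patience` consecutive epochs without an improvement larger than `min_delta`. The logic can be sketched in plain Python (the function below is an illustration, not the project's code):

```python
def early_stop_epoch(val_accuracies, min_delta=1e-6, patience=10):
    """Return the 1-based epoch at which training would stop, or None if
    it runs through the whole history without triggering early stopping."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best + min_delta:
            best = acc   # genuine improvement: reset the counter
            wait = 0
        else:
            wait += 1    # no meaningful improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Validation accuracy plateaus after epoch 3, so with patience=10 the
# run stops 10 epochs later, at epoch 13.
history = [0.50, 0.60, 0.70] + [0.70] * 20
stop = early_stop_epoch(history)
```

With min_delta as small as 1e-6, essentially any increase in validation accuracy resets the patience counter.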
The trained model will be saved to the models folder.
The following libraries were used in this project:
- NumPy: Scientific computing package for Python
- OpenCV: Computer vision library
- Keras: High-level machine learning library that runs atop TensorFlow
- TensorFlow: High-performance machine learning library
- Pillow: Image creation and manipulation library
- PlaidML: Keras backend used to enable GPU acceleration on OpenCL-enabled devices
- Matplotlib: Chart plotting library
- Memory Profiler: Memory profiler for Python
- Marcelo V. C. Aragão (https://github.com/marcelovca90)
- Daniel S. P. Neves (https://github.com/danielpontello)
- Fernanda C. Avelar
- Karina V. V. Ribeiro