Botrecon is a simple command line tool that can help you secure your network from botnets. Just collect your network traffic (see details), point botrecon to the netflow capture file and wait for the results.
The newest version of Python is generally recommended, but Python 3.6 and above is supported and should work. Versions below 3.8 may not fully support all file types (e.g. pickle5), which might cause some tests to fail depending on your configuration.
Download the repository
git clone https://github.com/mhubl/botrecon.git
Move into the directory
cd botrecon
Install the dependencies
pip install -r requirements.txt
Install the package
pip install .
BotRecon should now be usable as botrecon. If it isn't, make sure your appropriate site-packages directory or ~/.local/bin is in your $PATH.
If you want to modify the code to fit your requirements, requirements-dev.txt contains dependencies for running tests and linting. It's also recommended to use the -e flag with pip for an editable install.
pip install -r requirements-dev.txt
pip install -e .
And afterwards, from the base directory:
Running tests
pytest
Verifying style
flake8
The default models use netflow capture files and require the following columns:
- source address
- protocol
- destination port
- source port
- state
- duration
- total bytes
- source bytes
The names specified above are guaranteed to work, but if different ones are used botrecon will try to find the correct ones. In case a column cannot be located, an error message containing the missing name will be printed. If the file does not contain column names, it is expected that there are exactly 8 columns and botrecon assumes that the order is as listed above. To read more about the netflow format see the Wikipedia article on NetFlow and the OpenArgus website.
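Before running botrecon on a large capture, it can be useful to check the header yourself. The sketch below is only an illustration of such a pre-flight check and is not botrecon's own column matching logic, which is not described here. The helper names `normalize` and `check_columns` are hypothetical, and the normalization rule (lowercase, whitespace removed) is an assumption.

```python
import csv
import io

# The eight columns listed above, in the order assumed for header-less files.
REQUIRED_COLUMNS = [
    "source address", "protocol", "destination port", "source port",
    "state", "duration", "total bytes", "source bytes",
]


def normalize(name):
    """Lowercase a column name and remove all whitespace (assumed rule)."""
    return "".join(name.lower().split())


def check_columns(header):
    """Return the required columns that cannot be found in `header`."""
    present = {normalize(col) for col in header}
    return [col for col in REQUIRED_COLUMNS if normalize(col) not in present]


# A made-up capture whose header lacks the "source bytes" column.
sample = io.StringIO(
    "Source Address,Protocol,Destination Port,Source Port,State,Duration,TotalBytes\n"
    "10.0.0.5,tcp,443,51234,CON,1.25,4096\n"
)
header = next(csv.reader(sample))
missing = check_columns(header)
print(missing)  # ['source bytes']
```

An empty list means every expected column was found under some spelling that normalizes to the listed name.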
The verbose option enables additional status and log messages during execution. The debug option disables additional error handling, so full tracebacks should be printed in most cases unrelated to parameter parsing, and it implicitly enables --verbose (unless --silent is enabled). Additionally, if the data is batchified, debug does not display the progress bar during predictions and prints a line on each iteration instead.
Botrecon supports custom models, but also provides three default ones.
rforest: A Random Forest classifier and the default option. This is potentially the best performing classifier of the three default models. The returned scores are probabilities, ranging between 0 and 1.
svm: A Support Vector Machine classifier using the RBF kernel. Potentially a bit worse than the default. It uses the Nystrom method to approximate the kernel matrix, which causes high memory usage; consider using --batchify if you encounter memory issues. This classifier does not support multiprocessing out of the box. The returned scores are not probabilities: any score above 0 is a positive classification, and higher values mean higher confidence.
rforest-experimental: Also a Random Forest classifier, but trained on a different training dataset in order to generalize better. It might perform better than the default, but it also might not, so use it at your discretion. As with the default rforest, scores are probabilities ranging between 0 and 1.
To improve accuracy or adapt botrecon to your data you may want to use your own custom machine learning model. This is supported using the -M/--custom-model option. If a custom model is specified, the data is passed to it without any transformations being applied, with the following exceptions:
- column containing source addresses is dropped
- column names are transformed to lowercase and all spaces are stripped
The model is expected to implement a predict_proba, decision_function or predict method. The score is then calculated as the mean of the classification scores for each host. The model is loaded using pickle, so it needs to be picklable. If you're using a model that does not support that (e.g. a keras hdf5 model), you can create a custom sklearn estimator by subclassing BaseEstimator and load the model inside it. You can also use a similar subclassing approach with transformers if you need to import more packages than are made available by default (sklearn and category_encoders).
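To make the per-host scoring concrete, here is a minimal sketch of averaging per-flow scores by source address. It assumes that every flow with the same source address contributes equally to that host's score; the function name `score_hosts` and the sample values are hypothetical and not taken from botrecon's code.

```python
from collections import defaultdict
from statistics import mean


def score_hosts(hosts, flow_scores):
    """Average the per-flow classification scores for each source address."""
    grouped = defaultdict(list)
    for host, score in zip(hosts, flow_scores):
        grouped[host].append(score)
    return {host: mean(scores) for host, scores in grouped.items()}


# One entry per flow: the source address and the score the model gave it.
hosts = ["10.0.0.5", "10.0.0.5", "10.0.0.9", "10.0.0.9"]
flow_scores = [1.0, 0.5, 0.0, 0.25]

host_scores = score_hosts(hosts, flow_scores)
print(host_scores)  # {'10.0.0.5': 0.75, '10.0.0.9': 0.125}
```

With a model that only implements predict, the per-flow scores would be class labels, so under this assumption the host score is the fraction of that host's flows classified as positive.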
Basic usage
botrecon path/to/netflow/capture/file.csv
Using different filetypes
botrecon -t feather path/to/netflow/capture/file.feather
Saving output to a file
botrecon path/to/netflow/capture/file.csv path/to/desired/output.csv
Using a custom model
botrecon -M /path/to/custom/model.pkl path/to/netflow/capture/file.csv
Displaying prediction progress using a progress bar and splitting the data into 100 even batches to save memory with certain classifiers
botrecon --batchify 1 % path/to/netflow/capture/file.csv
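The relationship between the two --batchify types can be sketched as follows. This is an illustration under assumptions, not botrecon's implementation: it assumes that a percentage p yields about 100 / p batches (so `1 %` gives 100), that the upper bound of 100 is allowed, and that rows are split into nearly equal contiguous slices. The names `batch_count` and `split_rows` are hypothetical.

```python
import math


def batch_count(n_rows, value, kind):
    """Translate a (value, kind) pair into a number of batches (assumed rule)."""
    if kind == "%":
        if not 0 < value <= 100:
            raise ValueError("percentage must be between 0 and 100")
        n_batches = math.ceil(100 / value)
    elif kind == "batches":
        if value != int(value) or not 0 < value < n_rows:
            raise ValueError("batches must be a positive integer below the row count")
        n_batches = int(value)
    else:
        raise ValueError('kind must be "%" or "batches"')
    # Never create more batches than there are rows.
    return min(n_batches, n_rows)


def split_rows(n_rows, n_batches):
    """Yield (start, stop) index pairs covering n_rows in nearly equal slices."""
    base, extra = divmod(n_rows, n_batches)
    start = 0
    for i in range(n_batches):
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop


print(batch_count(1000, 1, "%"))        # 100
print(batch_count(1000, 4, "batches"))  # 4
print(list(split_rows(10, 3)))          # [(0, 4), (4, 7), (7, 10)]
```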
Usage: botrecon [OPTIONS] INPUT_FILE [OUTPUT_FILE]
Get a list of infected hosts based on network traffic
BotRecon takes data about the network traffic and uses machine learning to
find hosts infected with botnet malware. It can also be used as an
interface to use with your own machine learning model if it supports the
scikit-learn API.
INPUT_FILE is a path to the file with captured NetFlow traffic. Data
should be in a csv format unless a different --type is specified. BotRecon
expects the following data:
source address
protocol
destination port
source port
state
duration
totalbytes
sourcebytes
OUTPUT_FILE is a path to the desired output file location. It will be
saved as a .csv
Options:
-M, --custom-model PATH Path to your own custom model, has to
conform with scikit-learn specifications -
one of predict, predict_proba, or
decision_function must be implemented. The
model will receive the raw data from
INPUT_FILE. It is recommended to use an
sklearn pipeline ending with a classifier.
-m, --model [rforest|svm|rforest-experimental]
One of the available, predefined models for
classifying the traffic. This parameter is
ignored if --custom-model/-M is passed. A
description of all available models can be
found in README.md [default: rforest]
-j, --jobs INTEGER Number of parallel jobs to use for
predicting. Negative values will match the
cpu count. Only applies to classifiers that
support multiprocessing (such as the default
random forest). [default: -1]
-c, --min-count INTEGER Minimum netflow count required for a host to
be evaluated. If set to 0 or lower no hosts
are filtered. [default: 0]
-b, --batchify <FLOAT TEXT>... Divide data into batches before predicting.
Helpful for classifiers that have high
memory usage or for large amounts of data.
To use specify a value and then type. Type
can be either "%" or "batches". If "%" the
value has to be a float between 0 and 100.
If "batches" it has to be a positive integer
lower than the number of rows in data (after
filtering). If no verbosity options are
passed, this enables a progress bar for
predicting. Example: `--batchify 5 %`
-v, --verbose Increases the default verbosity of the
application.
-s, --silent Completely disables console output from the
application.
-d, --debug Enable debug mode.
-i, --ignore-invalid, --ignore-invalid-addresses
Controls the behavior in regards to invalid
host addresses in the data. Setting this flag
will make invalid addresses be silently
ignored instead of raising an error. Ignored
addresses will not be considered during
classification. Only applies if filtering by
IPs/ranges, no validity requirements are
enforced otherwise.
-r, --range, --ip TEXT An IP address, network, or a path to a file
containing a list with one of either per
line. If specified, hosts not on the list
will be ignored. Can be passed multiple
times.
-y, --yes, --confirm Automatically accepts any prompts shown by
the application. Currently the only prompt
appears when more than 50 infected hosts
were identified and no output file was
specified.
-t, --type [csv|feather|fwf|stata|json|pickle|parquet|excel]
Type of the input file. Some types may
require additional python modules to work.
[default: csv]
-V, --version Show the version and exit.
-h, --help Show this message and exit.
For a more detailed documentation see README.md
https://github.com/mhubl/botrecon
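The interaction between -r/--range and -i/--ignore-invalid described in the help text can be pictured with Python's standard ipaddress module. This is a sketch of the documented behavior under assumptions, not the code botrecon runs; `parse_ranges` and `filter_hosts` are hypothetical names, and the addresses are made up.

```python
import ipaddress


def parse_ranges(entries):
    """Turn addresses and networks given as text into network objects."""
    # A single address such as "192.168.1.7" becomes a /32 network.
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def filter_hosts(hosts, networks, ignore_invalid=False):
    """Keep only hosts that fall inside one of the given networks."""
    kept = []
    for host in hosts:
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            if ignore_invalid:
                continue  # silently skip addresses that cannot be parsed
            raise
        if any(address in network for network in networks):
            kept.append(host)
    return kept


networks = parse_ranges(["10.0.0.0/24", "192.168.1.7"])
hosts = ["10.0.0.5", "192.168.1.7", "172.16.0.1", "not-an-ip"]
selected = filter_hosts(hosts, networks, ignore_invalid=True)
print(selected)  # ['10.0.0.5', '192.168.1.7']
```

Without `ignore_invalid=True`, the unparsable entry raises a ValueError, which mirrors the documented default of raising an error on invalid addresses when filtering by IPs or ranges.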
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
PLEASE NOTE THAT THE ABOVE NOTICE ALSO APPLIES TO THE ATTACHED MACHINE LEARNING MODELS. NO GUARANTEES ARE MADE IN REGARDS TO THEIR ACCURACY AND ABILITY TO PROPERLY DETECT ANY THREATS, THEY ARE NOT A REPLACEMENT FOR ANY OTHER SOLUTIONS OR SECURITY POLICIES.