- [2024/02/07] 🚀 White Paper full benchmark results are available
- [2024/05/19] OllaBench GUI demo and project brief are available at DevPost
- [2024/05/09] Benchmarking of models is running while a white paper is being developed. Early results indicate mainstream LLM models do not score high - a sign of a good benchmark.
- [2024/04/19] OllaBench v.0.2 is out. Benchmark dataset and sample LLM response results were uploaded. A white paper with benchmark analysis of mainstream open-weight models including the newly released Llama3 will be shared as soon as possible!
- [2024/02/20] OllaBench v.0.2 Development Agenda is out. Will be twice as powerful 💥
- [2024/02/12] 90sec Project Video Brief
- [2024/02/07] 🚀 OllaGen1 is Launched!
Introduction to Interdependent Cybersecurity
Interdependent cybersecurity addresses the complexities and interconnectedness of various systems, emphasizing the need for collaborative and holistic approaches to mitigate risks. This field focuses on how different components, from technology to human factors, influence each other, creating a web of dependencies that must be managed to ensure robust security.
Background and Rationale
Despite significant investments in cybersecurity, many organizations struggle to effectively manage cybersecurity risks due to the increasing complexity and interdependence of their systems. Notably, human factors account for half of the long-lasting challenges in interdependent cybersecurity. Agent-Based Modeling powered by Large Language Models emerges as a promising solution as it is excellent at capturing individual characteristics, allowing the micro-level agent behaviors to collectively generate emergent macro-level structures. Evaluating LLMs in this context is crucial for legal compliance and effective application development. However, traditional evaluation frameworks for large language models often neglect the human factor and cognitive computing capabilities essential for interdependent cybersecurity. The paper introduces OllaBench, a novel evaluation framework designed to fill this gap by assessing LLMs on their ability to reason about human-centric interdependent cybersecurity scenarios, thereby enhancing their application in interdependent cybersecurity threat modeling and risk management.
Main Conclusions
Here we show that OllaBench effectively evaluates the accuracy, wastefulness, and consistency of LLMs in answering scenario-based information security compliance/non-compliance questions. The results indicate that while commercial LLMs perform best overall, smaller open-weight models also show promising capabilities. The most accurate models are not the most efficient in terms of tokens spent on wrong answers, which unnecessarily increases the cost of adopting these models. Finally, the best performing models are not only accurate but also consistent in the way they answer questions.
Context and Impact
The findings from OllaBench highlight the opportunities and the importance of fine-tuning existing large language models to address human factors in interdependent cybersecurity. By providing a comprehensive tool for assessing LLM performance in human-centric, complex, interdependent cybersecurity scenarios, this work advances the field by closing gaps in evaluating large language models in deeply complex interdisciplinary areas such as human-centric interdependent cybersecurity threat modeling and risk management. The findings also contribute to the development of more reliable and effective cybersecurity systems, ultimately enhancing organizational resilience against evolving cyber threats.
❗IMPORTANT❗
The Dataset Generator and test datasets are in the OllaGen1 subfolder.
You need either a local LLM stack (nvidia TensorRT-LLM with Llama_Index in my case) or an OpenAI API key to generate new OllaBench datasets.
OpenAI throttles requests per minute, which may cause significant delays when generating big datasets.
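If you write your own calls to OpenAI (for example, in custom evaluation code around the generated datasets), a simple retry-with-backoff wrapper helps absorb those rate limits. This is a minimal sketch, not part of OllaGen1; it assumes the openai Python package (v1+), an `OPENAI_API_KEY` in your environment, and a placeholder model name.

```python
import random
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat_with_backoff(prompt: str, model: str = "gpt-3.5-turbo", max_retries: int = 5) -> str:
    """Call the Chat Completions API, backing off when rate-limited."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            # Exponential backoff with jitter before retrying.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Still rate-limited after retries")
```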
When the OllaBench white paper is published (later in March), the OllaBench benchmark scripts and leaderboard results will be made available.
You can grab the evaluation datasets to run with your own evaluation code. Note that the datasets (csv files) are for zero-shot evaluation. It is recommended that you modify the OllaBench Generator 1 (OllaGen1) params.json with your desired specs and run OllaGen1.py to generate fresh, UNSEEN datasets that match your custom needs. Check the OllaGen-1 README for more details.
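As a minimal sketch of that workflow, the snippet below loads params.json, lets you tweak it, and reruns the generator. It assumes params.json sits next to OllaGen1.py in the OllaGen1 subfolder and that the script runs without extra arguments; the actual parameter names and invocation are documented in the OllaGen-1 README.

```python
import json
import subprocess
import sys

# Load and inspect OllaGen1's generation parameters.
# Only "llm_framework" is mentioned in this README; edit other keys per the OllaGen-1 README.
with open("OllaGen1/params.json", "r", encoding="utf-8") as f:
    params = json.load(f)
print(params)

# params["llm_framework"] = "openai"  # example: switch the backing LLM framework

with open("OllaGen1/params.json", "w", encoding="utf-8") as f:
    json.dump(params, f, indent=2)

# Generate fresh, unseen datasets with your custom specs.
subprocess.run([sys.executable, "OllaGen1.py"], cwd="OllaGen1", check=True)
```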
OllaBench will evaluate your models within the Ollama model zoo using the OllaGen1 default datasets. You can quickly spin up Ollama with Docker Desktop/Compose and download LLMs to Ollama. Please check the Installation section below for more details.
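Once Ollama is running (via Docker or the native installer), a quick sketch like the following can confirm that a model is downloaded and responding before you point OllaBench at it. It assumes Ollama's default REST endpoint at http://localhost:11434 and the third-party requests package; the model name is only an example.

```python
import requests  # pip install requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port

# Pull a model into the local Ollama instance (equivalent to `ollama pull llama2`).
requests.post(f"{OLLAMA_URL}/api/pull", json={"name": "llama2", "stream": False}, timeout=600)

# Sanity-check that the model answers a prompt.
resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "llama2", "prompt": "Reply with OK.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```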
The following tested system settings show successful operation for running the OllaGen1 dataset generator and OllaBench.
- Primary Generative AI model: Llama2
- Python version: 3.10
- Windows version: 11
- GPU: nvidia geforce RTX 3080 Ti
- Minimum RAM: [your normal ram use]+[the size of your intended model] (see the rough sizing sketch after this list)
- Disk space: [your normal disk use]+[minimum software requirements]+[the size of your intended model]
- Minimum software requirements: nvidia CUDA 12 (nvidia CUDA toolkit), Microsoft MPI, MSVC compiler, llama_index
- Additional system requirements: docker compose and other related docker requirements if you use Docker stack
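As a rough, purely illustrative reading of the RAM and disk estimates above (every number below is a hypothetical placeholder, not a measured requirement):

```python
# Hypothetical placeholder numbers -- substitute your own measurements.
normal_ram_use_gb = 8       # RAM your system already uses under typical load
model_size_gb = 4           # e.g., a quantized 7B model file
software_overhead_gb = 20   # CUDA toolkit, MPI, Python stack, etc.

min_ram_gb = normal_ram_use_gb + model_size_gb
extra_disk_gb = software_overhead_gb + model_size_gb  # on top of your normal disk use

print(f"Plan for roughly {min_ram_gb} GB of RAM and about {extra_disk_gb} GB of additional disk space.")
```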
This quick install is for a single Windows PC use case (without Docker) and for when you need to use OllaGen1 to generate your own datasets. I assume you have an nvidia GPU installed.
- Go to TensorRT-LLM for Windows and follow the Quick Start section to install TensorRT-LLM and the prerequisites.
- If you plan to use OllaGen1 with a local LLM, go to Llama_Index for TensorRT-LLM and follow the instructions to install Llama_Index and prepare models for TensorRT-LLM.
- If you plan to use OllaGen1 with OpenAI, please follow OpenAI's instructions to add the API key to your system environment. You will also need to change the `llm_framework` param in OllaGen1's params.json to `openai`.
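Before running the generator against OpenAI, it can help to confirm the key is actually visible to Python. A tiny sketch, assuming the standard `OPENAI_API_KEY` variable name used by OpenAI's client:

```python
import os

# The OpenAI client reads the key from the OPENAI_API_KEY environment variable.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set -- add it to your system environment first.")
print("OpenAI API key found in the environment.")
```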
You can verify the key prerequisites with the following commands:
- Python: `python -V`
- nvidia CUDA 12: `nvcc -V`
- Microsoft MPI: `mpiexec -help`
The following instructions are mainly for the Docker use case.
If you are using Windows, you need to install WSL. The Windows Subsystem for Linux (WSL) is a compatibility layer introduced by Microsoft that enables users to run Linux binary executables natively on Windows 10 and Windows Server 2019 and later versions. WSL provides a Linux-compatible kernel interface developed by Microsoft, which can then run a Linux distribution on top of it. See here for information on how to install it. In this setup, we use Debian Linux. You can verify Linux was installed by executing
wsl -l -v
You enter WSL by executing the command `wsl` from a Windows command-line window.
Please disregard this step if you are using a Linux system.
The NVIDIA Container Toolkit is a powerful set of tools that allows users to build and run GPU-accelerated Docker containers. It leverages NVIDIA GPUs to enable the deployment of containers that require access to NVIDIA graphics processing units for computing tasks. This toolkit is particularly useful for applications in data science, machine learning, and deep learning, where GPU resources are critical for processing large datasets and performing complex computations efficiently. Installation instructions are here.
Please disregard if your computer does not have a GPU.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
- Install Docker Desktop and Ollama with these instructions.
Please go to OllaGen1 subfolder and follow the instructions to generate the evaluation datasets.