- [2024/02/07] 🚀 White Paper full benchmark results are available
- [2024/05/19] OllaBench GUI demo and project brief are available at DevPost
- [2024/05/09] Benchmarking of models is running while a white paper is being developed. Early results indicate mainstream LLM models do not score high - a sign of a good benchmark.
- [2024/04/19] OllaBench v.0.2 is out. Benchmark dataset and sample LLM response results were uploaded. A white paper with benchmark analysis of mainstream open-weight models including the newly released Llama3 will be shared as soon as possible!
- [2024/02/20] OllaBench v.0.2 Development Agenda is out. Will be twice as powerful 💥
- [2024/02/12] 90sec Project Video Brief
- [2024/02/07] 🚀 OllaGen1 is Launched!
Introduction to Interdependent Cybersecurity
Interdependent cybersecurity addresses the complexities and interconnectedness of various systems, emphasizing the need for collaborative and holistic approaches to mitigate risks. This field focuses on how different components, from technology to human factors, influence each other, creating a web of dependencies that must be managed to ensure robust security.
Background and Rationale
Despite significant investments in cybersecurity, many organizations struggle to effectively manage cybersecurity risks due to the increasing complexity and interdependence of their systems. Notably, human factors account for half of the long-lasting challenges in interdependent cybersecurity. Agent-Based Modeling powered by Large Language Models emerges as a promising solution as it is excellent at capturing individual characteristics, allowing the micro-level agent behaviors to collectively generate emergent macro-level structures. Evaluating LLMs in this context is crucial for legal compliance and effective application development. However, traditional evaluation frameworks for large language models often neglect the human factor and cognitive computing capabilities essential for interdependent cybersecurity. The paper introduces OllaBench, a novel evaluation framework designed to fill this gap by assessing LLMs on their ability to reason about human-centric interdependent cybersecurity scenarios, thereby enhancing their application in interdependent cybersecurity threat modeling and risk management.
Main Conclusions
Here we show that OllaBench effectively evaluates the accuracy, wastefulness, and consistency of LLMs in answering scenario-based information security compliance/non-compliance questions. The results indicate that while commercial LLMs perform best overall, smaller open-weight models also show promising capabilities. The most accurate models are not the most efficient in terms of tokens spent on wrong answers, which unnecessarily increases the cost of adopting these models. Finally, the best performing models are not only accurate but also consistent in the way they answer questions.
Context and Impact
The findings from OllaBench highlight the opportunities and the importance of fine-tuning existing large language models to address human factors in interdependent cybersecurity. By providing a comprehensive tool for assessing LLM performance in human-centric, complex, interdependent cybersecurity scenarios, this work advances the field by closing gaps in evaluating large language models in deeply complex interdisciplinary areas such as human-centric interdependent cybersecurity threat modeling and risk management. The findings also contribute to the development of more reliable and effective cybersecurity systems, ultimately enhancing organizational resilience against evolving cyber threats.
❗IMPORTANT❗
The Dataset Generator and test datasets are in the OllaGen1 subfolder.
You need either a local LLM stack (nvidia TensorRT-LLM with Llama_Index in my case) or an OpenAI API key to generate new OllaBench datasets.
OpenAI throttles requests per minute, which may cause significant delays when generating big datasets.
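If you write your own calls to OpenAI (for example, in custom evaluation code around the generated datasets), a simple retry-with-backoff wrapper helps absorb those rate limits. This is a minimal sketch, not part of OllaGen1; it assumes the openai Python package (v1+), an `OPENAI_API_KEY` in your environment, and a placeholder model name.

```python
import random
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat_with_backoff(prompt: str, model: str = "gpt-3.5-turbo", max_retries: int = 5) -> str:
    """Call the Chat Completions API, backing off when rate-limited."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            # Exponential backoff with jitter before retrying.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Still rate-limited after retries")
```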
When the OllaBench white paper is published (later in March), the OllaBench benchmark scripts and leaderboard results will be made available.
You can grab the evaluation datasets to run with your own evaluation code. Note that the datasets (csv files) are for zero-shot evaluation. It is recommended that you modify the OllaBench Generator 1 (OllaGen1) params.json with your desired specs and run OllaGen1.py to generate fresh, UNSEEN datasets that match your custom needs. Check the OllaGen-1 README for more details.
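As a minimal sketch of that workflow, the snippet below loads params.json, lets you tweak it, and reruns the generator. It assumes params.json sits next to OllaGen1.py in the OllaGen1 subfolder and that the script runs without extra arguments; the actual parameter names and invocation are documented in the OllaGen-1 README.

```python
import json
import subprocess
import sys

# Load and inspect OllaGen1's generation parameters.
# Only "llm_framework" is mentioned in this README; edit other keys per the OllaGen-1 README.
with open("OllaGen1/params.json", "r", encoding="utf-8") as f:
    params = json.load(f)
print(params)

# params["llm_framework"] = "openai"  # example: switch the backing LLM framework

with open("OllaGen1/params.json", "w", encoding="utf-8") as f:
    json.dump(params, f, indent=2)

# Generate fresh, unseen datasets with your custom specs.
subprocess.run([sys.executable, "OllaGen1.py"], cwd="OllaGen1", check=True)
```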
OllaBench will evaluate your models within the Ollama model zoo using the OllaGen1 default datasets. You can quickly spin up Ollama with Docker Desktop/Compose and download LLMs to Ollama. Please check the Installation section below for more details.
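Once Ollama is running (via Docker or the native installer), a quick sketch like the following can confirm that a model is downloaded and responding before you point OllaBench at it. It assumes Ollama's default REST endpoint at http://localhost:11434 and the third-party requests package; the model name is only an example.

```python
import requests  # pip install requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port

# Pull a model into the local Ollama instance (equivalent to `ollama pull llama2`).
requests.post(f"{OLLAMA_URL}/api/pull", json={"name": "llama2", "stream": False}, timeout=600)

# Sanity-check that the model answers a prompt.
resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={"model": "llama2", "prompt": "Reply with OK.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```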
The following tested system settings show successful operation for running the OllaGen1 dataset generator and OllaBench.
- Primary Generative AI model: Llama2
- Python version: 3.10
- Windows version: 11
- GPU: nvidia geforce RTX 3080 Ti
- Minimum RAM: [your normal ram use]+[the size of your intended model] (see the rough sizing sketch after this list)
- Disk space: [your normal disk use]+[minimum software requirements]+[the size of your intended model]
- Minimum software requirements: nvidia CUDA 12 (nvidia CUDA toolkit), Microsoft MPI, MSVC compiler, llama_index
- Additional system requirements: docker compose and other related docker requirements if you use Docker stack
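As a rough, purely illustrative reading of the RAM and disk estimates above (every number below is a hypothetical placeholder, not a measured requirement):

```python
# Hypothetical placeholder numbers -- substitute your own measurements.
normal_ram_use_gb = 8       # RAM your system already uses under typical load
model_size_gb = 4           # e.g., a quantized 7B model file
software_overhead_gb = 20   # CUDA toolkit, MPI, Python stack, etc.

min_ram_gb = normal_ram_use_gb + model_size_gb
extra_disk_gb = software_overhead_gb + model_size_gb  # on top of your normal disk use

print(f"Plan for roughly {min_ram_gb} GB of RAM and about {extra_disk_gb} GB of additional disk space.")
```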
This quick install is for a single Windows PC use case (without Docker) and for when you need to use OllaGen1 to generate your own datasets. I assume you have an nvidia GPU installed.
- Go to TensorRT-LLM for Windows and follow the Quick Start section to install TensorRT-LLM and the prerequisites.
- If you plan to use OllaGen1 with a local LLM, go to Llama_Index for TensorRT-LLM and follow the instructions to install Llama_Index and prepare models for TensorRT-LLM.
- If you plan to use OllaGen1 with OpenAI, please follow OpenAI's instructions to add the API key to your system environment. You will also need to change the `llm_framework` param in OllaGen1's params.json to `openai`.
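Before running the generator against OpenAI, it can help to confirm the key is actually visible to Python. A tiny sketch, assuming the standard `OPENAI_API_KEY` variable name used by OpenAI's client:

```python
import os

# The OpenAI client reads the key from the OPENAI_API_KEY environment variable.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set -- add it to your system environment first.")
print("OpenAI API key found in the environment.")
```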
You can verify the key prerequisites with the following commands:
- Python: `python -V`
- nvidia CUDA 12: `nvcc -V`
- Microsoft MPI: `mpiexec -help`
The following instructions are mainly for the Docker use case.
If you are using Windows, you need to install WSL. The Windows Subsystem for Linux (WSL) is a compatibility layer introduced by Microsoft that enables users to run Linux binary executables natively on Windows 10 and Windows Server 2019 and later versions. WSL provides a Linux-compatible kernel interface developed by Microsoft, which can then run a Linux distribution on top of it. See here for information on how to install it. In this setup, we use Debian Linux. You can verify Linux was installed by executing
wsl -l -v
You enter WSL by executing the command `wsl` from a Windows command-line window.
Please disregard this step if you are using a Linux system.
The NVIDIA Container Toolkit is a powerful set of tools that allows users to build and run GPU-accelerated Docker containers. It leverages NVIDIA GPUs to enable the deployment of containers that require access to NVIDIA graphics processing units for computing tasks. This toolkit is particularly useful for applications in data science, machine learning, and deep learning, where GPU resources are critical for processing large datasets and performing complex computations efficiently. Installation instructions are here.
Please disregard if your computer does not have a GPU.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
- Install Docker Desktop and Ollama with these instructions.
Please go to OllaGen1 subfolder and follow the instructions to generate the evaluation datasets.