Inspiration

The grand challenge that most CEO's care about is maintaining the right level of cybersecurity at a minimum cost as companies are not able to reduce cybersecurity risks despite their increased cybersecurity investments [1]. Fortunately, the problem can be explained via interdependent cybersecurity (IC) [2] as follows. First, optimizing cybersecurity investments in existing large interdependent systems is already a well-known non-convex difficult problem that is still yearning for new solutions. Second, smaller systems are growing in complexity and interdependence. Last, new low frequency, near simultaneous, macro-scale risks such as global pandemics, financial shocks, geopolitical conflicts have compound effects on cybersecurity.

Human factors account for half of the long-lasting challenges in IC as identified by Kianpour et al. [3], and Laszka et al. [4]. Unfortunately, human-centric research within the context of IC is under-explored while research on general IC has unrealistic assumptions about human factors. Fortunately, the dawn of Large Language Models (LLMs) promise a much efficient way to research and develop solutions to problems across domains. In cybersecurity, the Zero-trust principles require the evaluation, validation, and continuous monitoring and LLMs are no exception.

Therefore, OllaBench was born to help both researchers and application developers conveniently evaluate their LLM models within the context of cybersecurity compliance or non-compliance behaviors.

What it does

A Scenario was generated based on a peer-reviewed knowledge network of cognitive behavioral constructs for cybersecurity (the nodes) and the relationships among them (the edges). Examples of cognitive behavioral constructs include Belief, Norm, Goal, etc. There are 10,000 scenarios in the default benchmark dataset. A scenario consists of two employees (A and B). Each employee has a short description of their cognitive behavioral profile containing natural language description of cognitive behavioral construct paths (a collection of nodes and edges).

The Evaluator Model (i.e. Arctic) generates the scenarios and questions. The Evaluatee Models (i.e. any model supported by Ollama) answer the questions and are compared based on the following scores.

Which Cognitive Path question presents four options of cognitive behavioral construct paths and asks the model to select the (only) correct option that fits the cognitive behavioral profile of either employees. The Avg WCP score is the average score across a model's answers each of which receives 1 if correct and 0 otherwise.

Who is Who question asks the model to decide who is MORE/LESS compliant with information security policies based on the stated cognitive behavioral profiles. The Avg WHO score is the average score across a model's answers each of which receives 1 if correct and 0 otherwise.

Team Risk Analysis question asks the model to assess whether the risk of information security noncompliant will increase if the two employees work closely together in the same team. The Avg Team Risk score is the average score across a model's answers each of which receives 1 if correct and 0 otherwise.

Target Factor Analysis question asks the model to identify the best cognitive behavioral construct to be targeted for strengthening so that the overall information cybersecurity compliance posture of the two employees can increase. The Avg Target Factor score is the average score across a model's answers each of which receives 1 if correct and 0 otherwise.

The Average score is the average of each model's 'Avg WCP score','Avg WHO score','Avg Team Risk score','Avg Target Factor score'. The model with the highest Average score could be the best performing model. However, it may not be the case with the most efficient model which is a combination of many factors including performance metrics and wasted response metric.

Wasted Response for each response is measured by the response's tokens and the response evaluation of being incorrect. The Wasted Average score is calculated by the total wasted tokens divided by the number of wrong responses. Further resource costs in terms of time and/or money can be derived from the total wasted response value. The model with the lowest Wasted Average score can be the most efficient model (to be decided in joint consideration with other metrics).

How I built it

OllaBench Architecture

Potential Impacts

Given that human factors significantly contribute to enduring challenges in cybersecurity, the framework’s evaluation results will foster AI-driven innovation and efficiency in the field. For example:

  • Bench-marking results will inform practitioners on which LLMs may be the best candidates for their human-centric LLM applications, reducing costs in early development phases and save costs in the long run.
  • Researchers may run OllaBench after each fine-tuning of their human-centric cybersecurity models.
  • Both practitioners and researchers may modify OllaBench for other projects.

The Results (so far)

Model Performance Across 4 Metrics (the higher the better)

Model Efficiency (the lower the better)

MODELS Avg WCP Duration Avg WCP Counts Avg WCP score Avg WHO Duration Avg WHO Counts Avg WHO score Avg Team Risk Duration Avg Team Risk Counts Avg Team Risk score Avg Target Factor Duration Avg Target Factor Counts Avg Target Factor score Avg Score Wasted WCP Counts Wasted WHO Counts Wasted Team Risk Counts Wasted Target Factor Counts Wasted Average
llama2-13b 11328175436.940144 355.07885714285715 0.30214285714285716 5489383415.113857 162.403 0.6054285714285714 7088231012.6097145 214.48557142857143 0.5047142857142857 6624460159.934 198.51657142857144 0.34614285714285714 0.43960714285714286 1748283 446868 738592 920381 137.64728571428572
llama2-7b 8492252613.1398 306.5312 0.2491 5054308887.6915 176.707 0.6027 6313784600.8859 224.0448 0.4666 5512254103.1345 194.0564 0.3167 0.408775 2312084 701811 1189559 1351947 138.885025
mistral-7b 8959153207.0657 292.4959 0.3221 3843736940.608 118.4382 0.5984 4347391603.8823 136.4793 0.5322 5757130027.6119 182.3919 0.4009 0.4634 2017752 476609 617705 1104906 105.4243
mixtral-8x7b 148058972478.00333 252.368 0.35933333333333334 87464366821.55267 125.25866666666667 0.6013333333333334 114885144102.572 188.906 0.49666666666666665 104986877016.44867 175.658 0.4093333333333333 0.4666666666666667 242523 76767 142957 163916 104.3605
orca2-13b 20472223410.8879 625.5752 0.0245 13790271030.6872 414.31 0.6252 15214033255.1227 458.987 0.1166 15709322249.2406 475.3549 0.0879 0.21355 6112178 1574714 4039670 4354959 402.038025
orca2-7b 15138010530.5208 570.6399 0.0241 10866066666.8139 404.5928 0.5843 12822254518.9788 480.2021 0.0617 11644636439.9282 434.7639 0.2357 0.22645 5569100 1667142 4503611 3346329 377.15455
phi-2.7b 2052660640.5804 64.1359 0.2219 1192838313.5778 35.343 0.3225 1758810938.0609 54.0241 0.2928 1522030470.9613 46.4584 0.3396 0.2942 483938 187374 405970 303897 34.529475
vicuna-13b 8867362511.8793 260.3653 0.2687 3831699001.4209 103.2363 0.5368 4754472050.2115 131.705 0.4333 3429500807.9661 91.6197 0.4241 0.415725 1997074 418984 781284 545532 93.57185

What's next for OllaBench

  • JUL2024 - White paper on OllaBench v.0.2 evaluation results of mainstream open-weight LLMs such as Llama3, Llama2, Orca2, Gemma, etc.
  • DEC2024 - OllaBench v.0.3 release, providing datasets and benchmark scripts for evaluating LLMs’ ability to detect Cognitive Warfare manipulations in chat texts.
  • JAN2025 - White paper on OllaBench v.0.3 evaluation results of mainstream open-weight LLMs
  • FEB2025 - Release tutorials on how to customize OllaBench for any interdisciplinary evaluation purposes of LLMs.
  • MAR2025 and forward - Running workshops on how to use OllaBench in various conferences, IEEE chapters, and university sponsored events.

Built With

Share this project:

Updates