For my capstone project in the HLTMS program, I aim to develop a Controlled Natural Language (CNL), together with the necessary tools and databases, for Structured Threat Information eXpression (STIX) descriptions
in the Cyber Threat Intelligence (CTI) domain. The project is named ‘STIX-D’, and its language will be a subset of Attempto Controlled English (ACE).
Phase 1 of the project started during the Summer 2024 academic session. It includes three components, each corresponding to a class project:
- LING 508. An application to extract descriptions from STIX objects and parse the description texts into documents, sentences, and words, forming a corpus.
- INFO 579. A relational database to hold the STIX-D Corpus.
- INFO 523. Data mining in the STIX-D project.
This module focuses on developing an application that extracts descriptions from STIX objects and parses the description texts into documents, sentences, and words to form a corpus.
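The extraction and parsing steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names (`extract_descriptions`, `split_sentences`, `split_words`, `build_corpus`), the sample bundle, and the regex-based splitting are all assumptions made for the example. A real corpus builder would likely need a more robust sentence splitter and tokenizer.

```python
import json
import re

# Hypothetical sample input, trimmed to the fields this sketch needs.
# Real input would be a full STIX 2.1 bundle, e.g. an ATT&CK collection.
SAMPLE_BUNDLE = json.dumps({
    "type": "bundle",
    "id": "bundle--00000000-0000-4000-8000-000000000000",
    "objects": [
        {
            "type": "malware",
            "id": "malware--00000000-0000-4000-8000-000000000001",
            "name": "ExampleRAT",
            "description": "ExampleRAT is a remote access tool. It collects credentials.",
        },
        {
            # Objects without a description are skipped.
            "type": "relationship",
            "id": "relationship--00000000-0000-4000-8000-000000000002",
        },
    ],
})


def extract_descriptions(bundle_json):
    """Return (object id, description) pairs for objects that carry a description."""
    bundle = json.loads(bundle_json)
    return [
        (obj["id"], obj["description"])
        for obj in bundle.get("objects", [])
        if obj.get("description")
    ]


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Naive tokenizer: runs of letters, digits, hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9'-]+", sentence)


def build_corpus(bundle_json):
    """Nest documents -> sentences -> words as plain dictionaries."""
    corpus = []
    for stix_id, description in extract_descriptions(bundle_json):
        sentences = [
            {"text": s, "words": split_words(s)}
            for s in split_sentences(description)
        ]
        corpus.append({"stix_id": stix_id, "sentences": sentences})
    return corpus


corpus = build_corpus(SAMPLE_BUNDLE)
```

Here each STIX object with a description becomes one document, which keeps the link back to the source object through its STIX identifier.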
Below are the functional components implemented so far for the STIX-D Corpus Builder:
- `clex_importer.py` handles importing the ACE Common Lexicon (Clex) into the STIX-D Corpus Database, parsing the content, and saving lexical entries into the database.
- `lexicon_manager.py` manages the processing of lexical entries and the creation of lexicon objects in the database.
- `sent_manager.py` manages the processing of sentences and the creation of sentence objects in the database.
- `doc_scrapper.py` manages fetching and processing HTML documents, converting them to markdown.
- `doc_manager.py` manages the processing of documents and the creation of document objects in the database.
- `stix_importer.py` handles importing STIX objects into the STIX-D Corpus Database, parsing the content, and saving descriptions into the database.
- `repository.py` defines the abstract repository interface.
- `mysql_repository.py` provides the MySQL database implementation of that interface for Create, Read, Update, and Delete (CRUD) operations.
- `gen_uuid.py` and `gen_clex_uuid.py` handle generating UUIDs for STIX objects and Clex entries, respectively.
- Unit tests are located in the `tests` directory and use `pytest` to verify the functionality of the various components.
- `app.py` contains a simple Flask API for interacting with the STIX-D Corpus Database.
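The split between `repository.py` (abstract interface) and `mysql_repository.py` (concrete implementation) follows the repository pattern. The sketch below shows the general shape of that pattern under stated assumptions: the class names, method names, and the single `document` table are hypothetical, and the standard-library `sqlite3` module stands in for MySQL so that the example runs without a database server. It is not the project's actual schema or API.

```python
import sqlite3
from abc import ABC, abstractmethod


class Repository(ABC):
    """Abstract CRUD interface (method names are hypothetical)."""

    @abstractmethod
    def create(self, doc_id, text): ...

    @abstractmethod
    def read(self, doc_id): ...

    @abstractmethod
    def update(self, doc_id, text): ...

    @abstractmethod
    def delete(self, doc_id): ...


class SqliteRepository(Repository):
    """Concrete implementation; SQLite is a stand-in for MySQL in this sketch."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS document ("
            "id TEXT PRIMARY KEY, text TEXT NOT NULL)"
        )

    def create(self, doc_id, text):
        # 'with self.conn' commits on success and rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT INTO document (id, text) VALUES (?, ?)", (doc_id, text)
            )

    def read(self, doc_id):
        row = self.conn.execute(
            "SELECT text FROM document WHERE id = ?", (doc_id,)
        ).fetchone()
        return row[0] if row else None

    def update(self, doc_id, text):
        with self.conn:
            cur = self.conn.execute(
                "UPDATE document SET text = ? WHERE id = ?", (text, doc_id)
            )
        return cur.rowcount  # number of rows changed

    def delete(self, doc_id):
        with self.conn:
            cur = self.conn.execute(
                "DELETE FROM document WHERE id = ?", (doc_id,)
            )
        return cur.rowcount  # number of rows removed


# Usage: callers depend only on the Repository interface.
repo = SqliteRepository()
repo.create("doc-1", "ExampleRAT is a remote access tool.")
repo.update("doc-1", "ExampleRAT collects credentials.")
stored = repo.read("doc-1")
```

Because the managers would depend only on the abstract interface, a different backend could be swapped in, for example an in-memory one for unit tests, without changing the calling code.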
| No. | Repository | Description |
|---|---|---|
| 1. | APE | ACE Parser Engine (APE) |
| 2. | attack-stix-data | MITRE ATT&CK dataset represented in STIX 2.1 JSON collections. |
| 3. | Clex | ACE Common Lexicon |
| 4. | cti-pattern-validator | A software tool for checking the syntax of the Cyber Threat Intelligence (CTI) STIX Pattern expressions |
| 5. | cti-python-stix2 | Python APIs for serializing and de-serializing STIX2 JSON content, along with higher-level APIs for common tasks, including data markings, versioning, and for resolving STIX IDs across multiple data sources. |
| 6. | cti-stix2-json-schemas | JSON schemas and examples for STIX 2 |
| 7. | cti-stix-validator | The STIX Validator checks that STIX JSON content conforms to the requirements specified in the STIX 2.1 specification. |