diff --git a/README.md b/README.md
index 1b60cea..df6617e 100644
--- a/README.md
+++ b/README.md
@@ -121,6 +121,7 @@ In July 2023, we started populating a [Github repository](https://github.com/dan
 ## Version History
+ * 2024-01-24 v1.3.0: BREAKING: Milestone release with indexing improvements for PicHash and MinHash. To ensure full backward compatibility, recalculation of all hashes is recommended. Check this [migration guide](https://github.com/danielplohmann/mcrit/blob/main/docs/migration-v1.3.0.md).
  * 2024-01-23 v1.2.26: Pinning lief to 0.13.2 in order to ensure that the pinned SMDA remains compatible.
  * 2024-01-09 v1.2.25: Ensure that we can deliver system status regardless of whether there is a `db_state` and `db_timestamp` or not.
  * 2024-01-05 v1.2.24: Now supporting "query" argument in CLI, as well as compact MatchingResults (without function match info) to reduce file footprint.
@@ -132,19 +133,19 @@ In July 2023, we started populating a [Github repository](https://github.com/dan
  * 2023-12-05 v1.2.15: Added convenience functionality to Job objects, version number aligned with mcritweb.
  * 2023-11-24 v1.2.11: SMDA pinned to version 1.12.7 before we upgrade SMDA and introduce a database migration to recalculate pic + picblock hashes with the improved generalization.
  * 2023-11-17 v1.2.10: Added ability to set an authorization token for the server via header field: `apitoken`; added ability to filter by job groups; added ability to fail orphaned jobs.
- * 2023-10-17 v1.2.8: Minor fix in job groups.
- * 2023-10-16 v1.2.6: Summarized queue statistics, refined Job classification.
- * 2023-10-13 v1.2.4: Exposed Queue/Job Deletion to REST interface, improved query speed for various queue lookups via indexing and parameterized mongodb queries.
- * 2023-10-13 v1.2.3: Workers will now de-register from in-progress jobs in case they crash (THX to @yankovs for the code template).
- * 2023-10-03 v1.2.2: MatchingResult filtering for min/max num samples (incl. fix).
- * 2023-10-02 v1.2.0: Milestone release for Virus Bulletin 2023.
- * 2023-09-18 v1.1.7: Bugfix: Tasking matching with 0 bands now deactivates minhash matching as it was supposed to be before. Also matching job progress percentage fixed.
- * 2023-09-15 v1.1.6: Bugfix in BlockMatching, convenience functionality for interacting with Job objects.
- * 2023-09-14 v1.1.5: Deactivated gunicorn as default WSGI handler for the time being due to issues with non-returning calls when handling compute-heavy calls.
- * 2023-09-14 v1.1.4: BUGFIX: Added `requirements.txt` to `data_files` in `setup.py` to ensure it's available for the package.
- * 2023-09-13 v1.1.3: Extracted some performance critical constants into parameters configurable in MinHashConfig and StorageConfig, fixed progress reporting for batched matching, BUGFIX: usage of GunicornConfig to proper dataclass.
- * 2023-09-13 v1.1.1: Streamlined requirements / setup, excluded `gunicorn` for Windows (THX to @yankovs!!).
- * 2023-09-12 v1.1.0: For Linux deployments, MCRIT now uses `gunicorn` instead of `waitress` as WSGI server because of [much better performance](https://github.com/danielplohmann/mcrit/pull/39). As gunicorn needs its own config, this required bumping the minor versions (THX to @yankovs!!).
+ * 2023-10-17 v1.2.8: Minor fix in job groups.
+ * 2023-10-16 v1.2.6: Summarized queue statistics, refined Job classification.
+ * 2023-10-13 v1.2.4: Exposed Queue/Job Deletion to REST interface, improved query speed for various queue lookups via indexing and parameterized mongodb queries.
+ * 2023-10-13 v1.2.3: Workers will now de-register from in-progress jobs in case they crash (THX to @yankovs for the code template).
+ * 2023-10-03 v1.2.2: MatchingResult filtering for min/max num samples (incl. fix).
+ * 2023-10-02 v1.2.0: Milestone release for Virus Bulletin 2023.
+ * 2023-09-18 v1.1.7: Bugfix: Tasking matching with 0 bands now deactivates minhash matching as it was supposed to be before. Also matching job progress percentage fixed.
+ * 2023-09-15 v1.1.6: Bugfix in BlockMatching, convenience functionality for interacting with Job objects.
+ * 2023-09-14 v1.1.5: Deactivated gunicorn as default WSGI handler for the time being due to issues with non-returning calls when handling compute-heavy calls.
+ * 2023-09-14 v1.1.4: BUGFIX: Added `requirements.txt` to `data_files` in `setup.py` to ensure it's available for the package.
+ * 2023-09-13 v1.1.3: Extracted some performance critical constants into parameters configurable in MinHashConfig and StorageConfig, fixed progress reporting for batched matching, BUGFIX: usage of GunicornConfig to proper dataclass.
+ * 2023-09-13 v1.1.1: Streamlined requirements / setup, excluded `gunicorn` for Windows (THX to @yankovs!!).
+ * 2023-09-12 v1.1.0: For Linux deployments, MCRIT now uses `gunicorn` instead of `waitress` as WSGI server because of [much better performance](https://github.com/danielplohmann/mcrit/pull/39). As gunicorn needs its own config, this required bumping the minor versions (THX to @yankovs!!).
  * 2023-09-08 v1.0.21: All methods of McritClient now forward apitokens/usernames to the backend.
  * 2023-09-05 v1.0.20: Use two-complement to represent addresses in SampleEntry, FunctionEntry when storing in MongoDB to address BSON limitations (THX to @yankovs).
  * 2023-09-05 v1.0.19: Statistics are now using the internal counters that had been created a while ago (THX to @yankovs).
@@ -154,13 +155,13 @@ In July 2023, we started populating a [Github repository](https://github.com/dan
  * 2023-08-23 v1.0.12: Added the ability to rebuild the minhash bands used for indexing.
  * 2023-08-22 v1.0.11: Fixed a bug where when importing bulk data, the `function_name` was not also added as a `function_label`.
  * 2023-08-11 v1.0.10: Fixed a bug where when importing bulk data, the function_id would not be adjusted prior to adding MinHashes to bands, possibly leading to non-existing function_ids.
- * 2023-08-02 v1.0.9: IDA plugin can now filter by block size and minhash score, optimized layout and user experience (THX for the feedback to @r0ny123!!)
- * 2023-07-28 v1.0.8: IDA plugin can now display colored graphs for remote functions and do queries for PicBlockHashes (for basic blocks) for the currently viewed function.
- * 2023-06-06 v1.0.7: Extended filtering capabilities on MatchingResult.
- * 2023-06-02 v1.0.6: IDA plugin can now task matching jobs, show their results and batch import labels. Harmonization of MatchingResult.
- * 2023-05-22 v1.0.3: More robustness for path verification when using MCRIT CLI on Malpedia repo folder.
- * 2023-05-12 v1.0.1: Some progress on label import for the IDA plugin. Reflected API extension of MCRITweb in McritClient.
- * 2023-04-10 v1.0.0: Milestone release for Botconf 2023.
+ * 2023-08-02 v1.0.9: IDA plugin can now filter by block size and minhash score, optimized layout and user experience (THX for the feedback to @r0ny123!!)
+ * 2023-07-28 v1.0.8: IDA plugin can now display colored graphs for remote functions and do queries for PicBlockHashes (for basic blocks) for the currently viewed function.
+ * 2023-06-06 v1.0.7: Extended filtering capabilities on MatchingResult.
+ * 2023-06-02 v1.0.6: IDA plugin can now task matching jobs, show their results and batch import labels. Harmonization of MatchingResult.
+ * 2023-05-22 v1.0.3: More robustness for path verification when using MCRIT CLI on Malpedia repo folder.
+ * 2023-05-12 v1.0.1: Some progress on label import for the IDA plugin. Reflected API extension of MCRITweb in McritClient.
+ * 2023-04-10 v1.0.0: Milestone release for Botconf 2023.
  * 2023-04-10 v0.25.0: IDA plugin can now do function queries for the currently viewed function.
  * 2023-03-24 v0.24.2: McritClient can forward username/apitoken, addJsonReport is now forwardable.
  * 2023-03-21 v0.24.0: FunctionEntries now can store additional FunctionLabelEntries, along submitting user/date.
diff --git a/docs/migration-v1.3.0.md b/docs/migration-v1.3.0.md
new file mode 100644
index 0000000..fb3d50e
--- /dev/null
+++ b/docs/migration-v1.3.0.md
@@ -0,0 +1,40 @@
+# MCRIT Migration Guide for v1.3.0
+
+With the MCRIT v1.3.0 release, we address several issues noticed with SMDA over the last months.
+In particular, we noticed, for example, that not all addresses were [properly masked](https://github.com/danielplohmann/smda/issues/37), which caused functions that should be PicHash-identical to have different hashes and thus to be missed during this matching phase.
+Additionally, some of you experienced log output about [unhandled instructions](https://github.com/danielplohmann/smda/issues/48) during mnemonic escaping, which also in rare cases broke [opcode bytes](https://github.com/danielplohmann/smda/issues/46).
+Larger Delphi binaries could furthermore stall batch processing, as there were [issues](https://github.com/danielplohmann/smda/issues/44) in parsing internal structures.
+
+All of these have been fixed, but some of this comes at the price of potential incompatibility with calculated PicHashes and MinHashes in your databases.
+To simplify the migration and especially avoid having to reprocess any binary content, we have introduced specific migration functions in the MinHashIndex that will help to modernize all content to the new SMDA version.
+
+## Triggering the Database Migration
+
+After updating to the latest requirements, you should have SMDA v1.3.11 or higher available:
+
+```bash
+$ python -m pip install -r requirements.txt
+...
+$ python -m pip freeze | grep smda
+smda==1.3.11
+```
+
+You can now do one of the following:
+
+* use curl to queue the recalculation jobs for PicHash and MinHash:
+```bash
+$ curl http://127.0.0.1:8000/recalculate_pichashes
+$ curl http://127.0.0.1:8000/recalculate_minhashes
+```
+
+* use the McritClient to queue the recalculation jobs for PicHash and MinHash:
+```python
+>>> from mcrit.client.McritClient import McritClient
+>>> c = McritClient()
+>>> c.recalculatePicHashes()
+>>> c.recalculateMinHashes()
+```
+* use the McritWeb front-end to trigger the matching jobs
+-> this will be implemented asap and then be available to admin users in the server section.
+
+Note that these jobs may run for an extensive amount of time depending on the number of functions indexed in your database.
\ No newline at end of file
diff --git a/mcrit/Worker.py b/mcrit/Worker.py
index c555993..e02c81c 100644
--- a/mcrit/Worker.py
+++ b/mcrit/Worker.py
@@ -190,8 +190,8 @@ def recalculatePicHashes(self, progress_reporter=NoProgressReporter()):
     # Reports PROGRESS
     @Remote(progress=True)
     def recalculateMinHashes(self, progress_reporter=NoProgressReporter()):
-        return self._storage.recalculateAllMinHashes(progress_reporter=progress_reporter)
-
+        self._storage.deleteAllMinHashes(progress_reporter=progress_reporter)
+        return self.updateMinHashes(None, progress_reporter=progress_reporter)
     # Reports PROGRESS
     @Remote(progress=True)
diff --git a/mcrit/config/McritConfig.py b/mcrit/config/McritConfig.py
index fa0b315..ea65304 100644
--- a/mcrit/config/McritConfig.py
+++ b/mcrit/config/McritConfig.py
@@ -10,7 +10,7 @@ class McritConfig(object):
     # NOTE to self: always change this in setup.py as well!
-    VERSION = "1.2.26"
+    VERSION = "1.3.0"
     # basic pathing info
     CONFIG_FILE_PATH = str(os.path.abspath(__file__))
     PROJECT_ROOT = str(os.path.abspath(os.sep.join([CONFIG_FILE_PATH, "..", ".."])))
diff --git a/mcrit/storage/MongoDbStorage.py b/mcrit/storage/MongoDbStorage.py
index 5120ef4..1aa5173 100644
--- a/mcrit/storage/MongoDbStorage.py
+++ b/mcrit/storage/MongoDbStorage.py
@@ -1102,6 +1102,20 @@ def rebuildMinhashBandIndex(self, progress_reporter=None):
             progress_reporter.step()
         return {"minhash_functions_indexed": minhash_functions}
+    def deleteAllMinHashes(self, progress_reporter=None):
+        # delete all minhashes
+        self._getDb().functions.update_many({}, {"$set": {"minhash": ""}})
+        # reset bands
+        collections = []
+        for band_id in range(self._storage_config.STORAGE_NUM_BANDS):
+            collections.append("band_%d" % band_id)
+        for c in collections:
+            self._getDb()[c].drop()
+            col = self._getDb()[c]
+            self._getDb()[c].create_index("band_hash")
+        LOGGER.info("Dropped all Minhashes and created a fresh banding index.")
+        return
+
     def recalculateAllPicHashes(self, progress_reporter=None):
         # get current SMDA version
         smda_config = SmdaConfig()
diff --git a/mcrit/storage/StorageInterface.py b/mcrit/storage/StorageInterface.py
index 45b9e9b..3f45184 100644
--- a/mcrit/storage/StorageInterface.py
+++ b/mcrit/storage/StorageInterface.py
@@ -650,11 +650,10 @@ def recalculateAllPicHashes(self) -> int:
         """
         raise NotImplementedError
-    def recalculateAllMinHashes(self) -> int:
-        """ Process all FunctionEntries and use this SMDA version and MCRIT config to recalculate and update the MinHashes
-        In the end, call rebuildMinhashBandIndex
+    def deleteAllMinHashes(self) -> int:
+        """ drop every minhash in all function_entries as a preparation for a full rebuild
         Returns:
-            the number of minhashes indexed
+            the number of minhashes dropped
         """
         raise NotImplementedError
diff --git a/setup.py b/setup.py
index c9285c1..9bcbf98 100644
--- a/setup.py
+++ b/setup.py
@@ -7,7 +7,7 @@ setup(
     name='mcrit',
-    version="1.2.26",
+    version="1.3.0",
     description='MCRIT is a framework created for simplified application of the MinHash algorithm to code similarity.',
     long_description_content_type="text/markdown",
     long_description=README,
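
Review note on `deleteAllMinHashes`: the `StorageInterface` docstring in this diff promises "the number of minhashes dropped", while the `MongoDbStorage` version ends in a bare `return`. The sketch below shows one way the count could be returned, relying on the `modified_count` attribute of the result that pymongo's `update_many` hands back. It is not MCRIT code: `delete_all_minhashes`, `band_collection_names`, `FakeCollection`, and `FakeDb` are hypothetical names, and the fake classes only mimic the three collection calls used here so the example runs without a MongoDB instance.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


def band_collection_names(num_bands: int) -> List[str]:
    """Per-band collection names, following the "band_%d" scheme from the diff."""
    return ["band_%d" % band_id for band_id in range(num_bands)]


def delete_all_minhashes(db: Any, num_bands: int) -> int:
    """Blank all stored minhashes, recreate empty band collections, return the count.

    `db` is assumed to behave like a pymongo Database: `db.functions` and
    `db[name]` are collections offering update_many / drop / create_index.
    """
    result = db.functions.update_many({}, {"$set": {"minhash": ""}})
    for name in band_collection_names(num_bands):
        db[name].drop()                      # discard the old band index
        db[name].create_index("band_hash")   # recreate it empty, with its index
    # Documents whose minhash was already "" are not counted as modified.
    return result.modified_count


class FakeCollection:
    """Hypothetical in-memory stand-in, only for running this sketch."""

    def __init__(self, docs: Optional[List[Dict[str, Any]]] = None) -> None:
        self.docs = [dict(doc) for doc in (docs or [])]
        self.indexes: List[str] = []

    def update_many(self, _filter: Dict[str, Any], update: Dict[str, Any]) -> Any:
        modified = 0
        for doc in self.docs:
            changed = False
            for key, value in update["$set"].items():
                if doc.get(key) != value:
                    doc[key] = value
                    changed = True
            if changed:
                modified += 1
        return SimpleNamespace(modified_count=modified)

    def drop(self) -> None:
        self.docs = []
        self.indexes = []

    def create_index(self, field: str) -> None:
        self.indexes.append(field)


class FakeDb:
    """Hypothetical database stand-in exposing `.functions` and `db[name]`."""

    def __init__(self, functions: List[Dict[str, Any]]) -> None:
        self.functions = FakeCollection(functions)
        self._collections: Dict[str, FakeCollection] = {}

    def __getitem__(self, name: str) -> FakeCollection:
        return self._collections.setdefault(name, FakeCollection())


db = FakeDb([
    {"function_id": 1, "minhash": "ab12"},
    {"function_id": 2, "minhash": ""},
])
db["band_0"].docs.append({"band_hash": 123, "function_ids": [1]})
dropped = delete_all_minhashes(db, num_bands=2)
print(dropped)
```

With a real pymongo `Database` handle the function body should apply unchanged, since `update_many`, `drop`, and `create_index` are regular `Collection` methods. Whether returning the count fits how the worker reports job results is left open here.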