Feature-85: Code Review Optimizations #86

Merged 13 commits on Jan 25, 2024
36 changes: 18 additions & 18 deletions README.md
@@ -18,7 +18,7 @@ Documentation is a work in progress, and can be found on the [Metacat repository

## HashStore Overview

HashStore is a content-addressable file management system that utilizes the content identifier of an object to address files. The system stores both objects, references (refs) and metadata in its respective directories and provides an API for interacting with the store. HashStore storage classes (like `FileHashStore`) must implement the HashStore interface to ensure the expected usage of HashStore.
HashStore is a content-addressable file management system that utilizes the content identifier of an object to address files. The system stores objects, references (refs) and metadata in their respective directories and provides an API for interacting with the store. HashStore storage classes (like `FileHashStore`) must implement the HashStore interface to ensure the expected usage of HashStore.

###### Public API Methods
- store_object
@@ -32,7 +32,7 @@ HashStore is a content-addressable file management system that utilizes the cont
- delete_metadata
- get_hex_digest

For details, please see the HashStore interface (HashStore.java)
For details, please see the HashStore interface (hashstore.py)


###### How do I create a HashStore?
@@ -68,7 +68,7 @@ my_store = factory.get_hashstore(module_name, class_name, properties)
pid = "j.tao.1700.1"
object = "/path/to/your/object.data"
object_metadata = my_store.store_object(pid, object)
object_cid = object_metadata.id
object_cid = object_metadata.cid

# Store metadata (.../[hashstore_path]/metadata/)
# By default, storing metadata will use the given properties namespace `format_id`
@@ -89,17 +89,17 @@ metadata_cid = my_store.store_metadata(pid, metadata, format_id)

In HashStore, objects are first saved as temporary files while their content identifiers are calculated. Once the default hash algorithm list and their hashes are generated, objects are stored in their permanent location using the store's algorithm's corresponding hash value, the store depth and the store width. Lastly, reference files are created for the object so that they can be found and retrieved given an identifier (ex. persistent identifier (pid)). Note: Objects are also stored once and only once.
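
As a rough illustration of how a store depth and width can turn a hash value into a permanent location, the sketch below splits a hex digest into `depth` directory tokens of `width` characters each and uses the remainder as the file name. The helper name `shard_path` and the exact splitting rule are assumptions made for this example; HashStore's actual implementation may differ.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(hex_digest: str, depth: int = 3, width: int = 2) -> PurePosixPath:
    """Split a hex digest into `depth` directories of `width` characters each.

    Hypothetical helper: the remainder of the digest becomes the file name.
    """
    if len(hex_digest) <= depth * width:
        raise ValueError("digest is too short for the requested depth and width")
    tokens = [hex_digest[i * width:(i + 1) * width] for i in range(depth)]
    remainder = hex_digest[depth * width:]
    return PurePosixPath(*tokens, remainder)


# Compute a SHA-256 content identifier for some bytes and derive its relative path
cid = hashlib.sha256(b"example object content").hexdigest()
relative_path = shard_path(cid)
print(relative_path)
```

Because the path is derived only from the content's hash, identical content always resolves to the same location, which is consistent with objects being stored once and only once.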

By calling the various interface methods for `store_object`, the calling app/client can validate, store and tag an object simultaneously if the relevant data is available. In the absence of an identfiier (ex. persistent identifier (pid)), `store_object` can be called to solely store an object. The client is then expected to call `verify_object` when the relevant metadata is available to confirm that the object has been stored as expected. If the object is determined to be invalid (via `verify_object`), it will be deleted. Lastly, to finalize this process of storing an object (to make the object discoverable), the client calls `tag_object`. In summary, there are two expected paths to store an object:
By calling the various interface methods for `store_object`, the calling app/client can validate, store and tag an object simultaneously if the relevant data is available. In the absence of an identifier (ex. persistent identifier (pid)), `store_object` can be called to solely store an object. The client is then expected to call `verify_object` when the relevant metadata is available to confirm that the object has been stored as expected. If the object is determined to be invalid (via `verify_object`), the client is expected to delete the object directly. Lastly, to finalize this process of storing an object (to make the object discoverable), the client calls `tag_object`. In summary, there are two expected paths to store an object:

```py
# All-in-one process which stores, validates and tags an object
objectMetadata objInfo = store_object(InputStream, pid, additionalAlgorithm, checksum, checksumAlgorithm, objSize)
obj_metadata = store_object(stream, pid, additional_algo, checksum, checksum_algo, objSize)

# Manual Process
# Store object
obj_metadata = store_object(InputStream)
obj_metadata = store_object(stream)
# Validate object, throws exceptions if there is a mismatch and deletes the associated file
verify_object(objInfo, checksum, checksumAlgorithn, objSize)
verify_object(obj_metadata, checksum, checksum_algo, objSize)
# Tag object, makes the object discoverable (find, retrieve, delete)
tag_object(pid, cid)
```
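
The validation step in the manual process can be pictured as recomputing the size and checksum of the stored file and comparing them with the expected values. The following is a minimal sketch using only the standard library; `verify_file` and its parameters are hypothetical and do not reflect the signature or behavior of HashStore's `verify_object`.

```python
import hashlib
import os
import tempfile


def verify_file(path, expected_checksum, checksum_algo, expected_size):
    """Raise ValueError if the file's size or checksum does not match.

    Hypothetical helper for illustration; not part of the HashStore API.
    """
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        raise ValueError(f"size mismatch: expected {expected_size}, got {actual_size}")
    hasher = hashlib.new(checksum_algo)
    with open(path, "rb") as stream:
        # Read in chunks so large objects do not need to fit in memory
        for chunk in iter(lambda: stream.read(8192), b""):
            hasher.update(chunk)
    if hasher.hexdigest() != expected_checksum.lower():
        raise ValueError("checksum mismatch")


# Usage: write a temporary file, then verify it against known values
data = b"example object content"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
try:
    verify_file(tmp.name, hashlib.sha256(data).hexdigest(), "sha256", len(data))
    verified = True
finally:
    os.remove(tmp.name)
```

In this sketch a mismatch only raises an exception; whether the invalid file is then deleted is left to the caller.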
@@ -108,8 +108,8 @@ tag_object(pid, cid)
- To retrieve an object, call the Public API method `retrieve_object` which opens a stream to the object if it exists.

**How do I find an object or check that it exists if I have the pid?**
- To find the location of the object, call the Public API method `find_object` which will return the content identifier (cid) of the object.
- This cid can then be used to locate the object on disk by following HashStore's store configuration.
- To check if an object exists, call the Public API method `find_object` which will return the content identifier (cid) of the object if it exists.
- If desired, this cid can then be used to locate the object on disk by following HashStore's store configuration.

**How do I delete an object if I have the pid?**
- To delete an object, call the Public API method `delete_object` which will delete the object and its associated references and reference files where relevant.
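
The relationship between storing, tagging, finding and deleting can be modeled with a toy in-memory store. This is only a sketch under stated assumptions: the `TinyStore` class, its dictionaries, and the rule that an object is removed only when no other pid references it are all invented for illustration and are not HashStore's implementation.

```python
import hashlib


class TinyStore:
    """Toy in-memory content-addressable store (illustration only)."""

    def __init__(self):
        self.objects = {}  # cid -> bytes
        self.refs = {}     # pid -> cid

    def store_object(self, data: bytes, pid=None) -> str:
        cid = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same cid, so it is kept only once
        self.objects.setdefault(cid, data)
        if pid is not None:
            self.tag_object(pid, cid)
        return cid

    def tag_object(self, pid: str, cid: str) -> None:
        if cid not in self.objects:
            raise KeyError(f"no object stored for cid {cid}")
        self.refs[pid] = cid

    def find_object(self, pid: str) -> str:
        return self.refs[pid]  # raises KeyError if the pid was never tagged

    def delete_object(self, pid: str) -> None:
        cid = self.refs.pop(pid)
        # Assumption: remove the object only when no other pid references it
        if cid not in self.refs.values():
            del self.objects[cid]


store = TinyStore()
cid = store.store_object(b"example object content")  # store only
store.tag_object("j.tao.1700.1", cid)                 # make it discoverable
found = store.find_object("j.tao.1700.1")
```

Here `found` equals `cid`, mirroring how a pid lookup yields the content identifier that addresses the object.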
@@ -152,7 +152,7 @@ These reference files are implemented in HashStore underneath the hood with no e

###### What does HashStore look like?

```
```shell
# Example layout in HashStore with a single file stored along with its metadata and reference files.
# This uses a store depth of 3, with a width of 2 and "SHA-256" as its default store algorithm
## Notes:
@@ -200,28 +200,28 @@ How to use HashStore client (command line app)
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="http://www.ns.test/v1"

# Get the checksum of a data object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -getchecksum -pid=content_identifier -algo=SHA-256
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -getchecksum -pid=persistent_identifier -algo=SHA-256

# Find an object (returns the content identifier)
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -findobject -pid=content_identifier
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -findobject -pid=persistent_identifier

# Store a data object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storeobject -pid=content_identifier -path=/path/to/object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storeobject -pid=persistent_identifier -path=/path/to/object

# Store a metadata object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storemetadata -pid=content_identifier -path=/path/to/metadata/object -formatid=http://ns.dataone.org/service/types/v2.0
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storemetadata -pid=persistent_identifier -path=/path/to/metadata/object -formatid=http://ns.dataone.org/service/types/v2.0

# Retrieve a data object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrieveobject -pid=content_identifier
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrieveobject -pid=persistent_identifier

# Retrieve a metadata object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrievemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrievemetadata -pid=persistent_identifier -formatid=http://ns.dataone.org/service/types/v2.0

# Delete a data object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deleteobject -pid=content_identifier
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deleteobject -pid=persistent_identifier

# Delete a metadata file
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deletemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deletemetadata -pid=persistent_identifier -formatid=http://ns.dataone.org/service/types/v2.0
```

## License