Skip to content

Commit

Permalink
Merge pull request #78 from DataONEorg/feature-71-client-updates
Browse files Browse the repository at this point in the history
Feature-71: Client Updates
  • Loading branch information
doulikecookiedough authored Dec 29, 2023
2 parents 5acc793 + f704389 commit 1d4c820
Show file tree
Hide file tree
Showing 4 changed files with 149 additions and 35 deletions.
156 changes: 134 additions & 22 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,20 +16,33 @@ DataONE in general, and HashStore in particular, are open source, community proj

Documentation is a work in progress, and can be found on the [Metacat repository](https://github.com/NCEAS/metacat/blob/feature-1436-storage-and-indexing/docs/user/metacat/source/storage-subsystem.rst#physical-file-layout) as part of the storage redesign planning. Future updates will include documentation here as the package matures.

## Development build
## HashStore Overview

HashStore is a python package, and built using the [Python Poetry](https://python-poetry.org) build tool.
HashStore is a content-addressable file management system that utilizes the content identifier of an object to address files. The system stores both objects, references (refs) and metadata in its respective directories and provides an API for interacting with the store. HashStore storage classes (like `FileHashStore`) must implement the HashStore interface to ensure the expected usage of HashStore.

To install `hashstore` locally, create a virtual environment for python 3.9+,
install poetry, and then install or build the package with `poetry install` or `poetry build`, respectively.
###### Public API Methods
- store_object
- verify_object
- tag_object
- find_object
- store_metadata
- retrieve_object
- retrieve_metadata
- delete_object
- delete_metadata
- get_hex_digest

For details, please see the HashStore interface (HashStore.java)

To run tests, navigate to the root directory and run `pytest -s`. The test suite contains tests that
take a longer time to run (relating to the storage of large files) - to execute all tests, run
`pytest --run-slow`. To see detailed

## Usage Example
###### How do I create a HashStore?

To view more details about the Public API - see 'hashstore.py` interface documentation
To create or interact with a HashStore, instantiate a HashStore object with the following set of properties:
- store_path
- store_depth
- store_width
- store_algorithm
- store_metadata_namespace

```py
from hashstore import HashStoreFactory
Expand All @@ -54,49 +67,148 @@ my_store = factory.get_hashstore(module_name, class_name, properties)
# Store objects (.../[hashstore_path]/objects/)
pid = "j.tao.1700.1"
object = "/path/to/your/object.data"
hash_address = my_store.store_object(pid, object)
object_cid = hash_address.id
object_metadata = my_store.store_object(pid, object)
object_cid = object_metadata.id

# Store metadata (.../[hashstore_path]/metadata/)
# By default, storing metadata will use the given properties namespace `format_id`
pid = "j.tao.1700.1"
sysmeta = "/path/to/your/sysmeta/document.xml"
metadata_cid = my_store.store_metadata(pid, sysmeta)
```

If you want to store other types of metadata, add an additional `format_id`.
```py
# If you want to store other types of metadata, add an additional `format_id`.
pid = "j.tao.1700.1"
metadata = "/path/to/your/metadata/document.json"
format_id = "http://custom.metadata.com/json/type/v1.0"
metadata_cid = my_store.store_metadata(pid, metadata, format_id)

# ...
```

###### Working with objects (store, retrieve, delete)

In HashStore, objects are first saved as temporary files while their content identifiers are calculated. Once the default hash algorithm list and their hashes are generated, objects are stored in their permanent location using the store's algorithm's corresponding hash value, the store depth and the store width. Lastly, reference files are created for the object so that they can be found and retrieved given an identifier (ex. persistent identifier (pid)). Note: Objects are also stored once and only once.

By calling the various interface methods for `store_object`, the calling app/client can validate, store and tag an object simultaneously if the relevant data is available. In the absence of an identfiier (ex. persistent identifier (pid)), `store_object` can be called to solely store an object. The client is then expected to call `verify_object` when the relevant metadata is available to confirm that the object has been stored as expected. And to finalize the process (to make the object discoverable), the client calls `tag_object``. In summary, there are two expected paths to store an object:

```py
# All-in-one process which stores, validates and tags an object
objectMetadata objInfo = store_object(InputStream, pid, additionalAlgorithm, checksum, checksumAlgorithm, objSize)

# Manual Process
# Store object
obj_metadata = store_object(InputStream)
# Validate object, throws exceptions if there is a mismatch and deletes the associated file
verify_object(objInfo, checksum, checksumAlgorithn, objSize)
# Tag object, makes the object discoverable (find, retrieve, delete)
tag_object(pid, cid)
```

**How do I retrieve an object if I have the pid?**
- To retrieve an object, call the Public API method `retrieve_object` which opens a stream to the object if it exists.

**How do I find an object or check that it exists if I have the pid?**
- To find the location of the object, call the Public API method `find_object` which will return the content identifier (cid) of the object.
- This cid can then be used to locate the object on disk by following HashStore's store configuration.

**How do I delete an object if I have the pid?**
- To delete an object, call the Public API method `delete_object` which will delete the object and its associated references and reference files where relevant.
- Note, `delete_object` and `tag_object` calls are synchronized on their content identifier values so that the shared reference files are not unintentionally modified concurrently. An object that is in the process of being deleted should not be tagged, and vice versa. These calls have been implemented to occur sequentially to improve clarity in the event of an unexpected conflict or issue.


###### Working with metadata (store, retrieve, delete)

HashStore's '/metadata' directory holds all metadata for objects stored in HashStore. To differentiate between metadata documents for a given object, HashStore includes the 'format_id' (format or namespace of the metadata) when generating the address of the metadata document to store (the hash of the 'pid' + 'format_id'). By default, calling `store_metadata` will use HashStore's default metadata namespace as the 'format_id' when storing metadata. Should the calling app wish to store multiple metadata files about an object, the client app is expected to provide a 'format_id' that represents an object format for the metadata type (ex. `store_metadata(stream, pid, format_id)`).

**How do I retrieve a metadata file?**
- To find a metadata object, call the Public API method `retrieve_metadata` which returns a stream to the metadata file that's been stored with the default metadata namespace if it exists.
- If there are multiple metadata objects, a 'format_id' must be specified when calling `retrieve_metadata` (ex. `retrieve_metadata(pid, format_id)`)

**How do I delete a metadata file?**
- Like `retrieve_metadata`, call the Public API method `delete_metadata` which will delete the metadata object associated with the given pid.
- If there are multiple metadata objects, a 'format_id' must be specified when calling `delete_metadata` to ensure the expected metadata object is deleted.


###### What are HashStore reference files?

HashStore assumes that every object to store has a respective identifier. This identifier is then used when storing, retrieving and deleting an object. In order to facilitate this process, we create two types of reference files:
- pid (persistent identifier) reference files
- cid (content identifier) reference files

These reference files are implemented in HashStore underneath the hood with no expectation for modification from the calling app/client. The one and only exception to this process when the calling client/app does not have an identifier, and solely stores an objects raw bytes in HashStore (calling `store_object(stream)`).

**'pid' Reference Files**
- Pid (persistent identifier) reference files are created when storing an object with an identifier.
- Pid reference files are located in HashStores '/refs/pid' directory
- If an identifier is not available at the time of storing an object, the calling app/client must create this association between a pid and the object it represents by calling `tag_object` separately.
- Each pid reference file contains a string that represents the content identifier of the object it references
- Like how objects are stored once and only once, there is also only one pid reference file for each object.

**'cid' Reference Files**
- Cid (content identifier) reference files are created at the same time as pid reference files when storing an object with an identifier.
- Cid reference files are located in HashStore's '/refs/cid' directory
- A cid reference file is a list of all the pids that reference a cid, delimited by a new line ("\n") character


###### What does HashStore look like?

```
# Example layout in HashStore with a single file stored along with its metadata and reference files.
# This uses a store depth of 3, with a width of 2 and "SHA-256" as its default store algorithm
## Notes:
## - Objects are stored using their content identifier as the file address
## - The reference file for each pid contains a single cid
## - The reference file for each cid contains multiple pids each on its own line
.../metacat/hashstore/
└─ objects
└─ /d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
└─ metadata
└─ /15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04
└─ refs
└─ pid/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e
└─ cid/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
hashstore.yaml
```

## Development build

HashStore is a python package, and built using the [Python Poetry](https://python-poetry.org) build tool.

To install `hashstore` locally, create a virtual environment for python 3.9+,
install poetry, and then install or build the package with `poetry install` or `poetry build`, respectively.

To run tests, navigate to the root directory and run `pytest -s`. The test suite contains tests that
take a longer time to run (relating to the storage of large files) - to execute all tests, run
`pytest --run-slow`. To see detailed

## HashStore Client

How to use HashStore client (command line app)
```sh
# Step 1: Create a HashStore
$ python './src/hashstore/client.py' /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="http://www.ns.test/v1"
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -chs -dp=3 -wp=2 -ap=SHA-256 -nsp="http://www.ns.test/v1"

# Get the checksum of a data object
$ python './src/hashstore/client.py' /path/to/store/ -getchecksum -pid=content_identifier -algo=SHA-256
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -getchecksum -pid=content_identifier -algo=SHA-256

# Store a data object
$ python './src/hashstore/client.py' /path/to/store/ -storeobject -pid=content_identifier -path=/path/to/object
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storeobject -pid=content_identifier -path=/path/to/object

# Store a metadata object
$ python './src/hashstore/client.py' /path/to/store/ -storemetadata -pid=content_identifier -path=/path/to/metadata/object -formatid=http://ns.dataone.org/service/types/v2.0
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -storemetadata -pid=content_identifier -path=/path/to/metadata/object -formatid=http://ns.dataone.org/service/types/v2.0

# Retrieve a data object
$ python './src/hashstore/client.py' /path/to/store/ -retrieveobject -pid=content_identifier
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrieveobject -pid=content_identifier

# Retrieve a metadata object
$ python './src/hashstore/client.py' /path/to/store/ -retrievemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -retrievemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0

# Delete a data object
$ python './src/hashstore/client.py' /path/to/store/ -deleteobject -pid=content_identifier
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deleteobject -pid=content_identifier

# Delete a metadata file
$ python './src/hashstore/client.py' /path/to/store/ -deletemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0
$ python './src/hashstore/hashstoreclient.py' /path/to/store/ -deletemetadata -pid=content_identifier -formatid=http://ns.dataone.org/service/types/v2.0
```

## License
Expand Down
10 changes: 6 additions & 4 deletions src/hashstore/filehashstore.py
Original file line number Diff line number Diff line change
Expand Up @@ -1857,10 +1857,12 @@ def delete(self, entity, file):

try:
os.remove(realpath)
except OSError:
pass
else:
self._remove_empty(os.path.dirname(realpath))
except OSError as err:
exception_string = (
f"FileHashStore - delete(): Unexpected {err=}, {type(err)=}"
)
logging.error(exception_string)
raise err

def _remove_empty(self, subpath):
"""Successively remove all empty folders starting with `subpath` and
Expand Down
File renamed without changes.
18 changes: 9 additions & 9 deletions tests/test_hashstore_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
import sys
import os
from pathlib import Path
from hashstore import client
from hashstore import hashstoreclient


def test_create_hashstore(tmp_path):
Expand All @@ -29,7 +29,7 @@ def test_create_hashstore(tmp_path):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

hashstore_yaml = Path(client_test_store + "/hashstore.yaml")
hashstore_object_path = Path(client_test_store + "/objects")
Expand Down Expand Up @@ -66,7 +66,7 @@ def test_get_checksum(capsys, store, pids):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

capsystext = capsys.readouterr().out
expected_output = (
Expand Down Expand Up @@ -99,7 +99,7 @@ def test_store_object(store, pids):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

assert store.exists("objects", pids[pid][store.algorithm])

Expand Down Expand Up @@ -131,7 +131,7 @@ def test_store_metadata(store, pids):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

assert store.exists("metadata", pids[pid]["metadata_cid"])

Expand Down Expand Up @@ -159,7 +159,7 @@ def test_retrieve_objects(capsys, pids, store):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

object_stream = store.retrieve_object(pid)
object_content = (
Expand Down Expand Up @@ -201,7 +201,7 @@ def test_retrieve_metadata(capsys, pids, store):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

metadata_stream = store.retrieve_metadata(pid, namespace)
metadata_content = (
Expand Down Expand Up @@ -239,7 +239,7 @@ def test_delete_objects(pids, store):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

assert not store.exists("objects", pids[pid][store.algorithm])

Expand Down Expand Up @@ -271,6 +271,6 @@ def test_delete_metadata(pids, store):
sys.path.append(client_directory)
# Manually change sys args to simulate command line arguments
sys.argv = chs_args
client.main()
hashstoreclient.main()

assert not store.exists("metadata", pids[pid]["metadata_cid"])

0 comments on commit 1d4c820

Please sign in to comment.