Refactor 'hashstore' module to use reStructuredText for sphinx-autodoc
doulikecookiedough committed Dec 29, 2023
1 parent f5bf47e commit 97ea4e8
Showing 1 changed file with 147 additions and 146 deletions.
293 changes: 147 additions & 146 deletions src/hashstore/hashstore.py
@@ -25,220 +25,218 @@ def store_object(
checksum_algorithm,
expected_object_size,
):
"""The `store_object` method is responsible for the atomic storage of objects to
disk using a given stream. Upon successful storage, the method returns a ObjectMetadata
object containing relevant file information, such as the file's id (which can be
used to locate the object on disk), the file's size, and a hex digest dict of algorithms
and checksums. Storing an object with `store_object` also tags an object (creating
references) which allow the object to be discoverable.
`store_object` also ensures that an object is stored only once by synchronizing multiple
calls and rejecting calls to store duplicate objects. Note, calling `store_object` without
a pid is a possibility, but should only store the object without tagging the object.
It is then the caller's responsibility to finalize the process by calling `tag_object`
after veriftying the correct object is stored.
The file's id is determined by calculating the object's content identifier based on
the store's default algorithm, which is also used as the permanent address of the file.
The file's identifier is then sharded using the store's configured depth and width,
delimited by '/' and concatenated to produce the final permanent address
and is stored in the `/store_directory/objects/` directory.
By default, the hex digest map includes the following hash algorithms:
md5, sha1, sha256, sha384, sha512 - which are the most commonly used algorithms in
dataset submissions to DataONE and the Arctic Data Center. If an additional algorithm
is provided, the `store_object` method checks if it is supported and adds it to the
hex digests dict along with its corresponding hex digest. An algorithm is considered
"supported" if it is recognized as a valid hash algorithm in the `hashlib` library.
Similarly, if a file size and/or checksum & checksum_algorithm value are provided,
`store_object` validates the object to ensure it matches the given arguments
before moving the file to its permanent address.
Args:
pid (string): Authority-based identifier.
data (mixed): String or path to object.
additional_algorithm (string): Additional hex digest to include.
checksum (string): Checksum to validate against.
checksum_algorithm (string): Algorithm of supplied checksum.
expected_object_size (int): Size of object to verify
Returns:
object_metadata (ObjectMetadata): Object that contains the permanent address,
file size and hex digest dictionary.
"""Atomic storage of objects to disk using a given stream.
The `store_object` method ensures atomic storage of objects to disk. Upon successful
storage, it returns an ObjectMetadata object containing relevant file information,
such as the file's id (used to locate the object on disk), the file's size, and a hex digest
dictionary of algorithms and checksums. The method also tags the object, creating references
for discoverability.
`store_object` ensures that an object is stored only once by synchronizing multiple calls
and rejecting attempts to store duplicate objects. If called without a pid, it stores the
object without tagging, and it becomes the caller's responsibility to finalize the process
by calling `tag_object` after verifying the correct object is stored.
The file's id is determined by calculating the object's content identifier based on the
store's default algorithm, which is also the permanent address of the file. The file's
identifier is then sharded using the store's configured depth and width, delimited by '/',
and concatenated to produce the final permanent address. The object is stored at this
address under the `/store_directory/objects/` directory.
By default, the hex digest map includes common hash algorithms (md5, sha1, sha256, sha384,
sha512). If an additional algorithm is provided, the method checks if it is supported and
adds it to the hex digests dictionary along with its corresponding hex digest. An algorithm
is considered "supported" if it is recognized as a valid hash algorithm in the `hashlib`
library.
If file size and/or checksum & checksum_algorithm values are provided, `store_object`
validates the object to ensure it matches the given arguments before moving the file to
its permanent address.
:param str pid: Authority-based identifier.
:param mixed data: String or path to the object.
:param str additional_algorithm: Additional hex digest to include.
:param str checksum: Checksum to validate against.
:param str checksum_algorithm: Algorithm of the supplied checksum.
:param int expected_object_size: Size of the object to verify.
:return: ObjectMetadata - Object containing the permanent address, file size, and
hex digest dictionary.
"""
raise NotImplementedError()

@abstractmethod
def tag_object(self, pid, cid):
"""The `tag_object` method creates references that allow objects stored in HashStore
to be discoverable. Retrieving, deleting or calculating a hex digest of an object is
based on a pid argument; and to proceed, we must be able to find the object associated
with the pid.
"""Create references to make objects discoverable in HashStore.
Args:
pid (string): Authority-based or persistent identifier of object
cid (string): Content identifier of object
The `tag_object` method enables operations such as retrieving, deleting, or calculating
a hex digest based on the provided pid argument. To perform these actions, it's crucial
to locate the object associated with the given pid.
Returns:
boolean: `True` upon successful tagging.
:param str pid: Authority-based or persistent identifier of the object.
:param str cid: Content identifier of the object.
:return: bool - `True` upon successful tagging.
"""
raise NotImplementedError()

@abstractmethod
def verify_object(
self, object_metadata, checksum, checksum_algorithm, expected_file_size
):
"""Confirms that an object_metadata's content is equal to the given values.
"""Confirm that the content of an ObjectMetadata matches the given values.
The `verify_object` method verifies that the content in the provided `object_metadata`
matches the specified values.
:param ObjectMetadata object_metadata: ObjectMetadata object.
:param str checksum: Value of the checksum.
:param str checksum_algorithm: Algorithm of the checksum.
:param int expected_file_size: Size of the temporary file.
Args:
object_metadata (ObjectMetadata): object_metadata object
checksum (string): Value of checksum
checksum_algorithm (string): Algorithm of checksum
expected_file_size (int): Size of the tmp file
:return: None
"""
raise NotImplementedError()

@abstractmethod
def find_object(self, pid):
"""The `find_object` method checks whether an object referenced by a pid exists
and returns the content identifier.
"""Check if an object referenced by a pid exists and retrieve its content identifier.
Args:
pid (string): Authority-based or persistent identifier of object
The `find_object` method validates the existence of an object based on the provided
pid and returns the associated content identifier.
Returns:
cid (string): Content identifier of the object
:param str pid: Authority-based or persistent identifier of the object.
:return: str - Content identifier of the object.
"""
raise NotImplementedError()

@abstractmethod
def store_metadata(self, pid, metadata, format_id):
"""The `store_metadata` method is responsible for adding and/or updating metadata
(ex. `sysmeta`) to disk using a given path/stream, a persistent identifier `pid`
and a metadata `format_id`. The metadata object's permanent address, which is
determined by calculating the SHA-256 hex digest of the provided `pid` + `format_id`.
Upon successful storage of metadata, `store_metadata` returns a string that
represents the file's permanent address. Lastly, the metadata objects are stored
in parallel to objects in the `/store_directory/metadata/` directory.
Args:
pid (string): Authority-based identifier.
format_id (string): Metadata format
metadata (mixed): String or path to metadata document.
Returns:
metadata_cid (string): Address of the metadata document.
"""Add or update metadata, such as `sysmeta`, to disk using the given path/stream.
The `store_metadata` method uses a persistent identifier `pid` and a metadata `format_id`
to determine the permanent address of the metadata object. The permanent address is
calculated by obtaining the SHA-256 hex digest of the concatenation of `pid` & `format_id`.
Upon successful storage of metadata, the method returns a string representing the file's
permanent address. Metadata objects are stored in parallel to objects in the
`/store_directory/metadata/` directory.
:param str pid: Authority-based identifier.
:param mixed metadata: String or path to the metadata document.
:param str format_id: Metadata format.
:return: str - Address of the metadata document.
"""
raise NotImplementedError()

@abstractmethod
def retrieve_object(self, pid):
"""The `retrieve_object` method retrieves an object from disk using a given
persistent identifier (pid). If the object exists, the method will open and return
a buffered object stream ready to read from.
"""Retrieve an object from disk using a persistent identifier (pid).
The `retrieve_object` method opens and returns a buffered object stream ready for reading
if the object associated with the provided `pid` exists on disk.
Args:
pid (string): Authority-based identifier.
:param str pid: Authority-based identifier.
Returns:
obj_stream (io.BufferedReader): A buffered stream of a data object.
:return: io.BufferedReader - Buffered stream of the data object.
"""
raise NotImplementedError()

@abstractmethod
def retrieve_metadata(self, pid, format_id):
"""The 'retrieve_metadata' method retrieves the metadata object from disk using
a given persistent identifier (pid) and metadata namespace (format_id).
If the object exists (determined by calculating the metadata object's permanent
address using the SHA-256 hash of the given pid+format_id), the method will open
and return a buffered metadata stream ready to read from.
Args:
pid (string): Authority-based identifier
format_id (string): Metadata format
Returns:
metadata_stream (io.BufferedReader): A buffered stream of a metadata object.
"""Retrieve the metadata object from disk using a persistent identifier (pid)
and metadata namespace (format_id).
The `retrieve_metadata` method calculates the metadata object's permanent address
by hashing the concatenation of the given `pid` and `format_id`. If the object
exists, the method opens and returns a buffered metadata stream ready for reading.
:param str pid: Authority-based identifier.
:param str format_id: Metadata format.
:return: io.BufferedReader - Buffered stream of the metadata object.
"""
raise NotImplementedError()

@abstractmethod
def delete_object(self, pid):
"""The 'delete_object' method deletes an object permanently from disk using a
given persistent identifier.
"""Delete an object permanently from disk using a persistent identifier (pid).
The `delete_object` method removes the object associated with the provided `pid` from
disk, resulting in the permanent deletion of the object.
Args:
pid (string): Authority-based identifier.
:param str pid: Authority-based identifier.
Returns:
boolean: `True` upon successful deletion.
:return: bool - `True` upon successful deletion.
"""
raise NotImplementedError()

@abstractmethod
def delete_metadata(self, pid, format_id):
"""The 'delete_metadata' method deletes a metadata document permanently
from disk using a given persistent identifier and format_id.
"""Delete a metadata document permanently from disk using a persistent identifier (pid)
and metadata namespace (format_id).
Args:
pid (string): Authority-based identifier
format_id (string): Metadata format
The `delete_metadata` method removes the metadata document associated with the provided
`pid` and `format_id` from disk, resulting in its permanent deletion.
Returns:
boolean: `True` upon successful deletion.
:param str pid: Authority-based identifier.
:param str format_id: Metadata format.
:return: bool - `True` upon successful deletion.
"""
raise NotImplementedError()

@abstractmethod
def get_hex_digest(self, pid, algorithm):
"""The 'get_hex_digest' method calculates the hex digest of an object that exists
"""Calculate the hex digest of an object in HashStore.
The `get_hex_digest` method calculates the hex digest of an object that exists
in HashStore using a given persistent identifier and hash algorithm.
Args:
pid (string): Authority-based identifier.
algorithm (string): Algorithm of hex digest to generate.
:param str pid: Authority-based identifier.
:param str algorithm: Algorithm of hex digest to generate.
Returns:
hex_digest (string): Hex digest of the object.
:return: str - Hex digest of the object.
"""
raise NotImplementedError()


class HashStoreFactory:
"""A factory class for creating `HashStore`-like objects (classes
that implement the 'HashStore' abstract methods)
"""A factory class for creating `HashStore`-like objects.
This factory class provides a method to retrieve a `HashStore` object
based on a given module (ex. "hashstore.filehashstore.filehashstore")
and class name (ex. "FileHashStore").
The `HashStoreFactory` class serves as a factory for creating `HashStore`-like objects,
which are classes that implement the 'HashStore' abstract methods.
This factory class provides a method to retrieve a `HashStore` object based on a given module
(e.g., "hashstore.filehashstore.filehashstore") and class name (e.g., "FileHashStore").
"""

@staticmethod
def get_hashstore(module_name, class_name, properties=None):
"""Get a `HashStore`-like object based on the specified `module_name` and `class_name`.
Args:
module_name (str): Name of package (ex. "hashstore.filehashstore") \n
class_name (str): Name of class in the given module (ex. "FileHashStore") \n
properties (dict, optional): Desired HashStore properties, if 'None', default values
will be used. \n
Example Properties Dictionary:
{
"store_path": "var/metacat",\n
"store_depth": 3,\n
"store_width": 2,\n
"store_algorithm": "sha256",\n
"store_sysmeta_namespace": "http://ns.dataone.org/service/types/v2.0"\n
}
Returns:
HashStore: A hash store object based on the given `module_name` and `class_name`
Raises:
ModuleNotFoundError: If module is not found
AttributeError: If class does not exist within the module
The `get_hashstore` method retrieves a `HashStore`-like object based on the provided
`module_name` and `class_name`, with optional custom properties.
:param str module_name: Name of the package (e.g., "hashstore.filehashstore").
:param str class_name: Name of the class in the given module (e.g., "FileHashStore").
:param dict properties: Desired HashStore properties (optional). If `None`, default values
will be used. Example Properties Dictionary:
{
"store_path": "var/metacat",
"store_depth": 3,
"store_width": 2,
"store_algorithm": "sha256",
"store_sysmeta_namespace": "http://ns.dataone.org/service/types/v2.0"
}
:return: HashStore - A hash store object based on the given `module_name` and `class_name`.
:raises ModuleNotFoundError: If the module is not found.
:raises AttributeError: If the class does not exist within the module.
"""
# Validate module
if importlib.util.find_spec(module_name) is None:
@@ -259,11 +257,14 @@ def get_hashstore(module_name, class_name, properties=None):
class ObjectMetadata(namedtuple("ObjectMetadata", ["id", "obj_size", "hex_digests"])):
"""Represents metadata associated with an object.
Attributes:
id (str): A unique identifier for the object (Hash ID, hex digest).
obj_size (bytes): The size of the object in bytes.
hex_digests (list, optional): A list of hex digests to validate objects
(md5, sha1, sha256, sha384, sha512)
The `ObjectMetadata` class represents metadata associated with an object,
including a unique identifier (`id`), the size of the object in bytes (`obj_size`),
and an optional list of hex digests (`hex_digests`) to validate objects.
:param str id: A unique identifier for the object (Hash ID, hex digest).
:param bytes obj_size: The size of the object in bytes.
:param list hex_digests: A list of hex digests to validate objects
(md5, sha1, sha256, sha384, sha512) (optional).
"""

# Default value to prevent dangerous default value
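
The sharding step described in the `store_object` docstring (hash the content, then split the hex digest by the store's depth and width and join with '/') can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names `shard` and `object_address` are hypothetical, and the defaults of depth 3, width 2, and sha256 are borrowed from the example properties dictionary in the `get_hashstore` docstring.

```python
import hashlib


def shard(hex_digest, depth=3, width=2):
    """Split a hex digest into `depth` tokens of `width` characters each,
    followed by whatever remains of the digest (hypothetical helper)."""
    tokens = [hex_digest[i * width:(i + 1) * width] for i in range(depth)]
    remainder = hex_digest[depth * width:]
    if remainder:
        tokens.append(remainder)
    return tokens


def object_address(data, algorithm="sha256", depth=3, width=2):
    """Return a '/'-delimited relative address for `data` (bytes).

    Assumption: the content identifier is the hex digest of the raw bytes
    computed with the store's default algorithm.
    """
    cid = hashlib.new(algorithm, data).hexdigest()
    return "/".join(shard(cid, depth, width))
```

With these defaults, `object_address(b"")` yields three two-character directory components followed by the remaining 58 characters of the SHA-256 digest; joining the components back together reproduces the content identifier.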

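The `store_metadata` and `retrieve_metadata` docstrings say a metadata document's permanent address is the SHA-256 hex digest of the concatenation of `pid` and `format_id`. A minimal sketch of that calculation is below. The function name `metadata_address` is hypothetical, and UTF-8 encoding of the concatenated string is an assumption, since the docstrings do not state an encoding.

```python
import hashlib


def metadata_address(pid, format_id):
    """Return the SHA-256 hex digest of `pid` + `format_id`.

    Assumption: the two strings are concatenated with no separator and
    encoded as UTF-8 before hashing.
    """
    return hashlib.sha256((pid + format_id).encode("utf-8")).hexdigest()
```

Because the address depends on both values, the same `pid` stored under two different `format_id` namespaces maps to two distinct metadata documents.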