Discussion: duplication of LOBs between representations #4

Open
Laurira opened this issue Apr 12, 2021 · 2 comments
Labels
enhancement Issues that are an enhancement needed to be evaluated and action decided question Issue is a question which will be answered and replied to. Might lead to new issues with actions.

Comments

@Laurira
Collaborator

Laurira commented Apr 12, 2021

This issue may be impossible to solve but maybe someone has the same problem.

Our situation:

  • We usually create a SIARD from the full database (representation 1). This is the main snapshot of the whole database. It contains all elements: schemas, tables, columns, users, procedures, triggers, views, related files, keys, etc.
  • SIARD from views (representation 2). This snapshot is taken only from views. We usually select views that do not contain any restricted data, so they can be made public and usable by everyone. This representation contains no relations.
  • Database engine native dump (representation 3). For example, if the database ran on the Oracle engine, this representation can be read using Oracle-specific tools. This representation is a backup in case we discover that the SIARD is somehow not usable.
  • Documentation
  • Logs and other data created during the archiving process. NAE adds the log of the transfer process as a document in a separate folder of the archive (i.e. the folder/dossier that holds all relevant documentation about the archive/fonds and the activities done with it).

All this data can easily be added to a CITS SIARD package, but there is a big problem with duplication. Most databases contain links to external files, and the volume of external files is usually measured in terabytes. When we create the full SIARD (rep 1), these external files are downloaded along with the database, so the first representation is the biggest and contains all the external files as well.
The problem is that rep 2 and the Oracle native dump refer to those same files.
It would be more efficient if the folder with external files were shared by all representations.
Could this be handled in SIARD and/or CITS SIARD in the future?

@Laurira Laurira added the question Issue is a question which will be answered and replied to. Might lead to new issues with actions. label Apr 12, 2021
@Laurira Laurira changed the title discussion: duplication of LOBs between representations Discussion: duplication of LOBs between representations Apr 12, 2021
@jmaferreira

Dear Lauri,

Representations are supposed to be complete. Why don't you solve the issue by delegating the deduplication feature to the storage layer (i.e. hardware)? Alternatively, the repository should also be able to handle the deduplication of files for the entire archival storage.
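
To make the repository-level option a bit more concrete, here is a minimal sketch of how duplicated files across representations could be detected by content checksum. It is only an illustration under assumptions: the function names (`sha256_of`, `find_duplicates`, `redundant_bytes`) and the idea of scanning a package folder are hypothetical and not part of SIARD, CITS SIARD, or any existing tool.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(package_root):
    """Group all files under package_root by content digest.

    Only groups with more than one file are returned, i.e. files that
    are stored more than once (for example in several representations).
    """
    by_digest = defaultdict(list)
    for folder, _subdirs, filenames in os.walk(package_root):
        for name in filenames:
            path = os.path.join(folder, name)
            by_digest[sha256_of(path)].append(path)
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}


def redundant_bytes(duplicates):
    """Bytes that would be saved if each duplicate group were stored once."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in duplicates.values()
    )
```

Calling `find_duplicates` on a package folder returns a mapping from digest to the paths sharing that content, and `redundant_bytes` gives a rough estimate of the storage a deduplicating layer could reclaim. What to do with the duplicates (hard links, references, or storage-level deduplication) is deliberately left open here.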

I would love to hear others' opinions on this...

@PhillipAasvangTommerholt

I am not sure that representations are supposed to be complete.
If you use a migration strategy, there might be a small proportion of files which cannot be migrated. So if you have an "original" submission representation and a preservation representation, you might have more files in the original representation than in the preservation one.

I think delegating the deduplication feature to the storage layer might be a good solution, but I also think it would be a good idea to describe this scenario in the specification, or in its guidelines as a best practice.

@PhillipAasvangTommerholt PhillipAasvangTommerholt added the enhancement Issues that are an enhancement needed to be evaluated and action decided label Aug 5, 2024