Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove filename from storageFiles #55

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mrtcode
Copy link
Member

@mrtcode mrtcode commented May 25, 2017

'filename' column should be removed from storageDownloadLog and storageUploadLog too, because they depend on storageFiles table.

For those tables a default 'filename' value must be set.
storageFiles
storageDownloadLog
storageUploadLog

}
else {
$response = Zotero_Storage::downloadFile($info, $downloadDir, $realFilename);
}
if ($response) {
if ($zip) {
$success = self::extractZip($downloadDir . $info['filename'], $dir);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we create a random dir, I think there is no difference what filename the zip file has.

@mrtcode
Copy link
Member Author

mrtcode commented May 31, 2017

We need a script to point all storageFileItems files that have the same hash, to the same (lowest storageFileID) row in storageFiles.

There are a few options how it could be implemented.

Option 1:
Get unique lowest storageFileIDs with hash:
SELECT MIN(storageFileID) AS id, hash FROM storageFiles GROUP BY hash
And loop through all shards and update:
UPDATE storageFileItems JOIN itemAttachments USING(itemID) SET storageFileID=123 WHERE storageHash = 'hash'
But this is quite inefficient because there are many shards and the file can exists only on a few of them.

Option 2:
We can use the new storageFileLibraries table.
Firstly get storageFileID, libraryID, hash. Then get the minimum storageFileID for that hash. And then update each library's storageFileItems table where storageFileID not equal to the lowest storageFileID that we have for that hash.

Option 3:
Iterate through each row in storageFileItems on every shard, get storageHash from itemAttachements, use that hash to query storageFiles for MIN(storageFileID). Update if needed.


There is an unused 'getFileItems' function that is unimplemented: link. Now when we have storageFileLibraries table this function would be much easier to implement if needed. Or we should remove it if no longer needed.

@@ -531,8 +574,8 @@ public static function getFileItems($hash, $filename, $zip) {


public static function addFile(Zotero_StorageFileInfo $info) {
$sql = "INSERT INTO storageFiles (hash, filename, size, zip) VALUES (?,?,?,?)";
return Zotero_DB::query($sql, array($info->hash, $info->filename, $info->size, (int) $info->zip));
$sql = "INSERT INTO storageFiles (hash, `size`, zip) VALUES (?,?,?)";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the backticks? size isn't a reserved word, as far as I know.

if ($e->getAwsErrorCode() == 'NoSuchKey' || $e->getAwsErrorCode() == 'NotFound') {
$contents = $result->get('Contents');
if (!$contents || count($contents) < 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the possible values for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns all keys that are prefixed with a given hash. In our case:

hash
hash/file1
hash/c/file1

@mrtcode
Copy link
Member Author

mrtcode commented May 29, 2018

We should be able to eliminate zip variable too. At least from prefixing S3 keys. listObjects just needs a hash to get the key, whether it's bucket/hash, bucket/hash/file1 or bucket/hash/c/file1.

…eld; remove 'getFileByHash' remove unnecessary backticks; remove any 'storageFiles' filtering on 'zip' or 'filename' fields
@mrtcode
Copy link
Member Author

mrtcode commented May 30, 2018

When uploading a file the original code was copying 'hash/filename1' to 'hash' on S3 if 'hash' doesn't exist. duplicateFile function was doing this. I think it's better to simplify storage logic, and just depend on listObjects to get the legacy key. If we want to save S3 queries on more popular items, we can move legacy files to new keys from external scripts like zfs-purge.

@mrtcode
Copy link
Member Author

mrtcode commented May 30, 2018

I think we should completely eliminate filename and zip fields from storageFiles because this table should only be responsible for file storage and nothing else. We should fully separate storageFiles from user specific file parameters like filename or zip, because they can be different for each user. User specific parameters like filename or zip should come from specific items in shards and not from storageFiles.

@mrtcode
Copy link
Member Author

mrtcode commented May 30, 2018

This filename and zip removal from storageFiles is quite complicated, so we can probably just create a new PR with minimal changes required for zfs-purge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants