Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creator as non-classic data object #161

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

abaevbog
Copy link
Contributor

@abaevbog abaevbog commented Jul 21, 2023

  • Removed ClassicDataObject as a parent class of Zotero_Creators
  • Removed all load()-ing of Zotero_Creators. An object is constructed with data right away.
  • One join in loadCreators to load all creators on GET requests, instead of sequential $item->load().
  • Some minor refactoring to try to avoid extra queries
  • One bulk-insert query for creators instead of multiple $creator->save() calls on create or update

@abaevbog abaevbog requested a review from dstillman July 21, 2023 23:41
@abaevbog
Copy link
Contributor Author

One of the consequences of removing libraryID from the creators table is that the items from the same shard essentially "share" creators across libraries. So if 2 items are created with the same creator's information, we have 1 creator entry, and 2 itemCreators entries linking this creator to items.

I suppose this is a good thing, since we don't have to store duplicate data about authors across libraries. But something to account for is how to deal with the deletion of obsolete creators. On every delete of an item, we can check if its creator belongs to any other items, and delete them if not. Or, we can run a cron job every 30 minutes or so to fetch and delete creators that are not linked to any items.

I think the second option is better, since the first option requires performing extra queries on every single DELETE Api call, which puts extra load on the dataserver

@dstillman
Copy link
Member

Sorry, should've noted this earlier — we should just do what we do for itemData and inline the values in itemCreators. That'll mean a bit of data duplication, but it'll be faster, and we really don't want to have to purge these in an event.

(The empty itemDataValueHash column in itemData is a holdover from many years ago when we had a separate itemDataValues table like the client.)

`fieldMode` tinyint(1) unsigned DEFAULT NULL,
PRIMARY KEY (`creatorID`),
KEY `name` (`lastName`(7),`firstName`(6))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be a PHP script like the others that does a proper migration of existing data. It should set the shard to down, wait 20 seconds, do the migration, and bring the shard up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I only kept it as .sql until now for convenience.
I'll have it wrapped into the same algorithm as this update, except with state=down.

misc/master.sql Outdated
@@ -184,6 +184,7 @@ CREATE TABLE `libraries` (
`lastUpdated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`version` int(10) unsigned NOT NULL DEFAULT '0',
`shardID` smallint(5) unsigned NOT NULL,
`hasData` TINYINT( 1 ) NOT NULL DEFAULT '0',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add this, but it shouldn't be in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I push just this one line to master?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@abaevbog
Copy link
Contributor Author

abaevbog commented Jul 24, 2023

inline the values in itemCreators

Got it. So then, we completely drop the creators table, and the itemCreators will become something like this:

CREATE TABLE itemCreators (
itemID bigint unsigned NOT NULL,
firstName varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
lastName varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
fieldMode tinyint(1) unsigned DEFAULT NULL,
creatorTypeID smallint(5) unsigned NOT NULL,
orderIndex smallint(5) unsigned NOT NULL,
PRIMARY KEY (itemID,orderIndex),
KEY creatorTypeID (creatorTypeID),
KEY name (lastName(7),firstName(6))
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

With a many-to-one relationship with the items table

@abaevbog
Copy link
Contributor Author

In my last commit, I got rid of creators table, moved all data storage about creators into the itemCreators table, and removed some code dealing with creators that we no longer need. For example, we no longer need the for-loop when we add creators, so I grouped creator insertion into one query. I think there are a few other things can simplify now but that's a start.

php cript to update the database
added copies of files before changes from this PR with old_ prefix
zotero_autoload checks the shard and uses old files for shards that were not migrated yet
@abaevbog
Copy link
Contributor Author

Added a php script to move data from creators and itemCreators into the new itemCreators table on a shard-by-shard basis.

To make it possible to migrate a few shards at a time, I added the php models without changes from this PR with an old_ prefix. Then, in ApiController.php, a global variable $shardID is set - based on it and a hardcoded array of shardIDs, zotero_autoload in header.php decides to use old_* files for shards that were not updated yet or usual files for updated shards. I had no problems running tests or using the API after migrating one shard out of 2 locally.

@abaevbog
Copy link
Contributor Author

abaevbog commented Jul 31, 2023

Brief summary of changes up to this point:

  • Added migration script to remove creators and tags tables and migrate to schema where all data is stored in itemCreators and itemTags.
  • In header.php, a temporary global variable shards is used to specify which shards have been migrated. Based on this, there is some conditional behavior in ItemController.php and TagController.php, depending on if the current shard was migrated or not.
  • When a method from any affected model is called, auto_loader checks current shard against migrated shards and selects either model/...inc.php - model with recent changes or model/old_...inc.php - model using old schema.
  • Both for creators and tags, load(), init(), and save() functions are removed. All data loading happens in the item model via loadCreators and loadTags directly into objects via constructor, and data saving/updating is done via static bulkInsert and bulkDelete.
  • loadTags, loadCreators, bulkCreate and bulkInsert batch multiple queries into one, so fetching or inserting 10 tags is 1 large transaction instead of 10.
  • changed property is set in loadTags, loadCreators - there is a check if the payload is different from existing tags/creators or not.
  • Zotero_Tag and Zotero_Creator properties now match the SQL columns, so there is some refactoring (e.g. in Cite.inc.php) to account for that.
  • All code for Tag and Creator models that is not used is removed. I thought it's easier to see changes that way.
  • Other minor changes, mainly in SQL joins, to work with new schema

I ran some benchmarking on a shard post-upgrade vs a shard before upgrade with profiler on check the performance of adding 5000 tags and then 5000 creators.

Before: tags ~ 30 sec, creators ~ 30 sec. ~20,000 small DB queries that took 3-7 seconds.
After: tags ~ 40 sec, creators ~ 20 sec. ~30 larger DB queries that took 0.4-0.6 seconds.

So the database queries are better (since I cut down the # of queries a lot) but there is some inefficiency elsewhere that slows things down.

Copy link
Member

@dstillman dstillman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial notes

Zotero_Admin_DB::query("ALTER TABLE `itemTags` DROP CONSTRAINT `itemTags_ibfk_2`;", false, $shardID);

// Create new itemTags table
Zotero_Admin_DB::query("CREATE TABLE `itemTagsNew` ( `tagID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL, `type` tinyint(1) unsigned NOT NULL DEFAULT '0', `version` int(10) unsigned NOT NULL DEFAULT '1', PRIMARY KEY (`tagID`, `itemID`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why did you leave version here? Because of ?since= support for tags? I think we can just take that out — I suspect it's not used, or is used rarely enough that we can give a brief warning that it's going away. It's only mentioned as an aside in the Syncing documentation, and I'm not seeing any recent usage of it.

  2. I'd say the same with tagID as with creatorID above, though I'm having second thoughts there. We could decide that the data savings are worth keeping a creators table with a creatorID and libraryID (but still no dates, key, or version). We don't do that for item data values and we won't do it for creators, but tags are much more likely to be assigned to many items, so the extra data usage could be significant. We would have to deal with purging when a tag was removed or an item was deleted, though, so it might not be worth it. Maybe need to run some queries there on real data to see how much data we would save.

  3. There's no index on the tag here. We'll have to check the queries used for tag-based API searches to see if it can even use an index or if it will only use the libraryID and scan for tags. But it'd be the same with the name index on itemCreators — we either need it for both or for neither. (Before, in separate tables, we had to do name-based lookups to actually get the creatorID/tagID, so the indexes were definitely used.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, there is a test that calls /tags?newer=, so I assumed we need to keep it around. If we don't want it - that's great, I'll remove it from here and from my mocha tests PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yeah, some context from real world data would certainly be helpful. Maybe if we could see the output of select count(*) from itemTags and divide it by select count(*) from tags, we'd get an idea of roughly many itemTag relationships there are for each tag. And then if we multiply that by select SUM(length(name)) from tags, we'd have a rough estimate of how much extra memory duplicating tag names would cause.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Queries related to tags that were affected are:
  • In /items?tag= where we select the itemIDs that fit the tag filter SELECT itemID FROM items JOIN itemTags USING (itemID) WHERE LOWER(itemTags.name) IN (?)

  • In tags/?tag= where we select tagIDs that fit the query before fetching their actual data and constructing objects SELECT SQL_CALC_FOUND_ROWS DISTINCT tagID FROM itemTags JOIN items USING (itemID) WHERE libraryID=? AND namein (?) ANDtype in (?);

Currently, these do a full table scan, which is not good. There was a UNIQUE KEY uniqueTags (libraryID, name, type) in the old creators table that was used for such queries. We can add KEY nameType (name, type) to current itemTags and these queries will use that key. I checked manually with explain sql statement and it seems to work.

With creators, the name key indeed does not do anything now because the creators never fetched by themselves and are only loaded as a part of an item via SELECT * FROM itemCreators WHERE itemID=?.

To sum up, I guess we should add KEY nameType (name, type) to itemTags and remove name from creators?

@@ -330,7 +330,7 @@ CREATE TABLE `syncDeleteLogIDs` (

CREATE TABLE `syncDeleteLogKeys` (
`libraryID` int(10) unsigned NOT NULL,
`objectType` enum('collection','creator','item','relation','search','setting','tag','tagName') NOT NULL,
`objectType` enum('collection','item','relation','search','setting','tag','tagName') NOT NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tag should be removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need all the schema changes here

// Creators
echo "Migrating creators\n";
// Drop foreign key constraint
Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_1`;", false, $shardID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No semicolons needed in SQL from PHP (throughout)

@@ -74,7 +89,7 @@ function zotero_autoload($className) {
return;
}
}

$GLOBALS['updatedShards'] = [1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be empty in the committed code

public static function bulkDelete($libraryID, $itemID, $creatorOrdersArray) {
$placeholders = implode(', ', array_fill(0, sizeOf($creatorOrdersArray), '?'));
$sql = "DELETE FROM itemCreators WHERE itemID=? AND orderIndex IN ($placeholders)";
Zotero_DB::query($sql, array_merge([$itemID],$creatorOrdersArray), Zotero_Shards::getByLibraryID($libraryID));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after comma

$names = array_map(function($tag) {
return $tag->name;
}, $tags);
$existingTagsSql = "SELECT t.tagID, t.name, MAX(t.version) as `version` from itemTags t JOIN items i USING (itemID) WHERE libraryID = ? AND name IN ($placeholders) GROUP BY tagID, `name`;";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • House style: Capitalized acronyms in variables: existingTagsSqlexistingTagsSQL
  • House style: Capitalized table aliases in SQL: itemTags titemTags IT, items iitems I
  • Remove extra space after comma
  • Only need backticks for reserved words (e.g., groups)

// Use a real tag name, in case case differs
if (!$title) {
$title = "Items of Tag ‘" . $tag->name . "’";
if (in_array($GLOBALS['shardID'], $GLOBALS['updatedShards']) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to have this same check everywhere. You can just do it once in header.inc.php and define() a constant that's available globally (like, e.g., Z_ENV_BASE_PATH).

Zotero_Admin_DB::query("ALTER TABLE `itemCreators` DROP CONSTRAINT `itemCreators_ibfk_2`;", false, $shardID);

// Create new itemCreators table
Zotero_Admin_DB::query("CREATE TABLE `itemCreatorsNew` ( `creatorID` BIGINT UNSIGNED NOT NULL, `itemID` BIGINT UNSIGNED NOT NULL, `firstName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `lastName` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, `fieldMode` tinyint(1) UNSIGNED DEFAULT NULL, `creatorTypeID` smallint(5) UNSIGNED NOT NULL, `orderIndex` smallint(5) UNSIGNED NOT NULL, PRIMARY KEY (`creatorID`, `itemID`), KEY `creatorTypeID` (`creatorTypeID`), KEY `name` (`lastName`(7),`firstName`(6)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;", false, $shardID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I led you astray originally, and didn't clarify later. I originally said there'd still be creatorID in creators, but since we're inlining the data in itemCreators, there's no reason for there to be a creatorID at all — it's not serving any purpose. And the primary key should be on itemID, orderIndex, since that's what's unique. (It was actually wrong to begin with — it should've been on itemID, orderIndex before like in the client, not itemID, creatorID, orderIndex.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants