Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotate sample application schemas #169

Merged
merged 2 commits into from
May 28, 2023
Merged

Annotate sample application schemas #169

merged 2 commits into from
May 28, 2023

Conversation

benkilimnik
Copy link
Member

@benkilimnik benkilimnik commented May 8, 2023

Adds data ownership annotations to the schemas of a number of sample applications and modifies them for compatibility with k9db. Unit tests may be added for them in the future.

Progress:

  • commento
  • ghchat
  • instagram
  • schnack
  • hotcrp
  • socify (requires rewriting into SQL)
  • mouthful

Large schemas deprioritized, moved to branch annotate-prestashop-opencart

  • opencart
  • prestashop

@rpaul48
Copy link
Collaborator

rpaul48 commented May 9, 2023

Things noticed on review (will update as I go through all the schemas):

commento

  • Can we specify default values? (e.g. domains table)
  • Can tables have composite primary keys? (e.g. moderators table)
    • Can email column be made a foreign key otherwise?
  • Is domain the PK we want for the views table?

ghchat

  • *Variable ownership example with groups
  • Should group_info have id column, which should also be the PK?
  • Looks like many columns originally of type int(11) are here either text or int; not sure if this should be standardized
  • If desired, could specify anonymization rules on deletion of private_msgs

instagram

  • groups1 table has pk id rather than group_id in original
  • posts table has pk id rather than post_id in original
  • filter column commented out in posts table?
  • Is the ownership in the favourites table correct?
  • should follow_system have a foreign key for follow_to?
  • *Potentially interesting accessorship in group_members
  • messages table might need: FOREIGN KEY (con_id) REFERENCES conversations(con_id)
  • Should the giver and receiver of a recommendation (recommendations table) be owners?
  • shares table might need: FOREIGN KEY (post_id) REFERENCES posts(id)

hotcrp

  • In PaperStorage table:
    • should the PK be paperStorageId and there be a FK for paperId that references the Paper table?
    • should there be FKs for filterType and originalStorageId?
  • Is the PK for the Capability table correct?
  • In the MailLog table, sounds like paperIds won't work (since plural) as an FK into Paper
  • Why was an id column added to the PaperComment table? Seems like commentId should be the PK
  • *Potentially interesting ON GET anon case with PaperReview table, ON DEL in PaperReviewRefused

socify

  • In activities table, do we need an FK for recipient_id?
  • Does authentications need an OWNED_BY annotation (for user_id?)?
  • badges_sashes table might need FK for notified_user column
  • PK for friendly_id_slugs might be sluggable_id instead
  • Missing PK for target_id in merit_actions table?
  • Missing PK for related_change_id in merit_activity_logs table?
  • Missing PK for photo_album_id in photos table?
  • Missing PK for voter_id in votes table?

mouthful

  • Missing PK for ReplyTo column in Comment table?

@benkilimnik
Copy link
Member Author

Thanks @rpaul48

I've responded to your questions & feedback below.

commento

  • Can we specify default values? (e.g. domains table)
    • currently unsupported by k9db. schema-annot/README.md describes some of the changes made to the schemas due to unsupported features in k9db.
  • Can tables have composite primary keys? (e.g. moderators table)
    • multi-column primary keys unsupported by k9db
    • Can email column be made a foreign key otherwise?
      • in which table? Where would it point?
  • Is domain the PK we want for the views table?
    • You're right, this doesn't really make sense. I've added an id column. A primary key was not provided in the original -- depending on the application, some of these tables probably autogenerate PK columns.

ghchat

  • *Variable ownership example with groups
  • Should group_info have id column, which should also be the PK?
    • Yeah, I think you are right. I've changed it.
  • Looks like many columns originally of type int(11) are here either text or int; not sure if this should be standardized
    • Good find, though I don't think the column types really matter to illustrate the privacy compliance in k9db.
  • If desired, could specify anonymization rules on deletion of private_msgs
    • Good idea, I've added some.

instagram

  • groups1 table has pk id rather than group_id in original
    • This column was renamed because of a conflict with a reserved keyword group which causes k9db to crash.
  • posts table has pk id rather than post_id in original
    • I may have renamed the PK column. Not hugely important I think.
  • filter column commented out in posts table?
    • reserved keyword in k9db
  • Is the ownership in the favourites table correct?
    • I think this is a follow situation. fav_by is the person who initiated the action (owner), while user is the person being favourited (they shouldn't get access or ownership).
  • should follow_system have a foreign key for follow_to?
    • I think you're right. I've added a FK.
  • *Potentially interesting accessorship in group_members
  • messages table might need: FOREIGN KEY (con_id) REFERENCES conversations(con_id)
    • Nice find. I've added it.
  • Should the giver and receiver of a recommendation (recommendations table) be owners?
    • Good point. I think recommend_to should simply be a FK. I've changed it.
  • shares table might need: FOREIGN KEY (post_id) REFERENCES posts(id)
    • Yup, thanks. I've added it.

hotcrp

  • In PaperStorage table:
    • should the PK be paperStorageId and there be a FK for paperId that references the Paper table?
      • Paper table already points to PaperStorage via paperStorageId. Would have to move the CREATE TABLE for Paper above PaperStorage to have an FK pointing to Paper from PaperStorage
    • should there be FKs for filterType and originalStorageId?
  • Is the PK for the Capability table correct?
    • The PK salt was provided in the original
  • In the MailLog table, sounds like paperIds won't work (since plural) as an FK into Paper
    • yeah you're right. Not really sure how to handle that
  • Why was an id column added to the PaperComment table? Seems like commentId should be the PK
    • The original PK was a (paperId, commentId) but K9db doesn't support multi-col PKs currently. Hence I added a new id column bc neither columns are unique on their own.
  • *Potentially interesting ON GET anon case with PaperReview table, ON DEL in PaperReviewRefused

socify

  • In activities table, do we need an FK for recipient_id?
    • Yes, you're right. Added.
  • Does authentications need an OWNED_BY annotation (for user_id?)?
    • Yeah, I think that's right. Added.
  • badges_sashes table might need FK for notified_user column
    • Yup.
  • PK for friendly_id_slugs might be sluggable_id instead
    • Yes.
  • Missing PK for target_id in merit_actions table?
    • Hm. Not sure if that would be unique. Added new id col.
  • Missing PK for related_change_id in merit_activity_logs table?
    • fixed.
  • Missing PK for photo_album_id in photos table?
    • Added new col since photo_album_id may not be unique.
  • Missing PK for voter_id in votes table?
    • Multi-col PKs unsupported, but i've added a new id col.

mouthful

  • Missing PK for ReplyTo column in Comment table?
    • multi-col PKs unsupported.

@artemagvanian
Copy link
Member

I've taken a look at instagram schema, might do some more if I have time: here are a couple of things that came up:

  • bkmrk_by in bookmarks might be a FK (possibly, to the users table?);
  • similarly -- block_by in blocks might be a FK;
  • in notifications, FOREIGN KEY (user) REFERENCES users(id) seems to be incorrect, given that in the data dump user is always 0;
  • when there is a link between two users through some table, we might need ACCESSED_BYannotations (this is probably off-topic, since you have probably only considered data ownership):
    • in follow_system, follow_to might need an ACCESSED_BYannotation;
    • in shares, share_to might need an ACCESSED_BYannotation;
    • in recommend, recommend_to might need an ACCESSED_BYannotation;
    • in view, view_to might need an ACCESSED_BYannotation;
    • in comments, post_idmight need anACCESSED_BY`annotation;

@benkilimnik
Copy link
Member Author

Thanks @artemagvanian! Some comments below:

  • bkmrk_by in bookmarks might be a FK (possibly, to the users table?);
    • Nice find, added!
  • similarly -- block_by in blocks might be a FK;
    • Added
  • in notifications, FOREIGN KEY (user) REFERENCES users(id) seems to be incorrect, given that in the data dump user is always 0;
    • I don't think it's always zero in the dump though? e.g. user=14 in (503, 24, 28, 0, 0, 'recommend', 14, '1525003631635', 'read'),
  • when there is a link between two users through some table, we might need ACCESSED_BYannotations (this is probably off-topic, since you have probably only considered data ownership):
    • I thought about this and I think that the user being followed, shared to, viewed, commented on etc. should not have access and only the person who initiated the action should have data ownership. Happy to discuss this further though.

@KinanBab
Copy link
Collaborator

Things look good to me. Thanks @rpaul48 and @artemagvanian for the follow up.

I think the way the follow_system is annotated now is reasonable. I understand where Artem is coming from but:

  1. While related to the user being followed, that the row in the follow_system is created based solely on actions by the person initiating the follow. Generally, the right of access applies to data the application has about a user, i.e. data that was given to the application (or created in the application) by active actions of that user, or data the user did not actively create but was derived implicitly from their data (e.g. an application scraping my social media posts would have data about me, even though it is data that they acquired via scraping that I had no involvement in). I think it is a bit of stretch to say that the follow_system record is about the person being followed in that sense.
  2. With that said, I do not think that giving users ACCESS_BY rights is unreasonable. I am simply saying it is not required for compliance. To an extent, there is no right or wrong answer here, we just need a policy that is reasonable, and there could be indeed multiple reasonable policies. I lean more towards simplicity in this case.

TLDR: lets keep follow_system as it is right now.

I am going to take a deeper look. Thanks everyone for you help.

@arjeyaraj
Copy link
Collaborator

Apologies for taking a while to get to this, had two comments about ghchat and hotcrp:

ghchat

I think the addition of the group_info(id) column makes sense, but I think some of the foreign keys need to be changed to reflect the new column:

  • group_msg(to_group_id) -> group_info(id), should the col type of group_msg(to_group_id) be changed to int to match group_info(id)
  • group_user_relation(to_group_id) -> group_info(to_group_id), I think we can change the dependent column to be group_info(id) and then update the col type as in group_msg (might fix the access fail, since I think the schema refers to a non PK col right now)

hotcrp

This is less specific to the annotation, but more of a general question for if/how K9DB handles a type of pattern. From what I understand from how HotCRP works, the PaperConflict table encodes different relationships between contactInfo and paperId based on the value of conflictType (e.g. a value for co-author, another value for institutional relationship, etc.). I think depending on the relationship, the data ownership/accessorship pattern might be different, like we would want to extract all papers that a contactInfo has co-authored, but not papers that they might have an institutional conflict with. I think it make more sense if we remove the OWNED_BY annotation on the column in the meanwhile (to prevent returning too many results)?

@benkilimnik
Copy link
Member Author

Thank you @arjeyaraj!

ghchat: Great finds. I've made the changes you suggested. The FOREIGN KEY (to_group_id) ACCESSES group_info(id) still fails unfortunately.

hotcrp: Good point. I've removed the OWNED_BY on contactId in PaperConflict and added an issue #172. Let's discuss this ownership pattern at the meeting.

FOREIGN KEY (requestedBy) ACCESSED_BY ContactInfo(contactId),
-- author should see review, but not who reviewed it
-- only thing author should see is content of review, and when submitted (dates)
ON GET contact_id ANON (review_token, reviewType, reviewRound, requestedBy, reviewNotified)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look right to me.

  1. There is no contact_id column.
  2. The contact_id is the reviewer right? We should be anonymizing for the authors of the paper.

Maybe you meant ON GET paperId ANON (....)? As in, if the authors of the paper request their data, do not show them private info about the review? (Also please make sure to anonymize the identity of the reviewer and who requested the review etc).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, that should be paperId not contact_id. Fixed in recent commit.

@artemagvanian
Copy link
Member

Just took a look at the other schemas, here are more comments:

commento

  • I am a tiny bit unsure about whether it was intentional to have moderators being both sharded to owners and a data subject. Maybe it is worth taking another look at.

ghchat

  • Maybe anonymization rules on deletion should be specified in user_user_relation too, similarly to private_msgs.

socify

  • In votes, voter_id might be a FK to users.
  • In photos, photo_album_id might be a FK to photo_albums.
  • It seems to me that the result of the EXPLAIN COMPLIANCE call needs to be updated to match the updated annotations (might apply to other schemas as well).

hotcrp

  • @KinanBab has already mentioned the possibly incorrect ON GET contact_id ANON annotation.
  • I am not sure whether I understand the semantics of PaperReviewRefused correctly, but having the ON GET requestedBy ANON ... annotation feels a bit redundant. If someone has requested a review, shouldn't they be able to see the details of the reviewer?
  • I am a bit suspicious about nothing pointing to ReviewRequest even though it has a DATA_SUBJECT annotation. Could it be that some FKs in PaperReviewRefused or PaperReview should actually point to ReviewRequest? It would be nice if someone more familiar with horcrp took another look.

@benkilimnik
Copy link
Member Author

benkilimnik commented May 22, 2023

Thanks @artemagvanian and @KinanBab !

commento

  • You're right, this is odd. I think it's because the FK was automatically annotated by k9db. I have used REFERENCES ONLY to fix the sharding behavior.

ghchat

  • Added!

socify

  • In votes, voter_id might be a FK to users.
    • Yup, added
  • In photos, photo_album_id might be a FK to photo_albums.
    • Absolutely, added it.
  • It seems to me that the result of the EXPLAIN COMPLIANCE call needs to be updated to match the updated annotations (might apply to other schemas as well).
    • Should be updated now.

hotcrp

  • @KinanBab has already mentioned the possibly incorrect ON GET contact_id ANON annotation.
    • Should be fixed now.
  • I am not sure whether I understand the semantics of PaperReviewRefused correctly, but having the ON GET requestedBy ANON ... annotation feels a bit redundant. If someone has requested a review, shouldn't they be able to see the details of the reviewer?
    • I think the idea is that only the creator of the review contactId should own it, while the others requestedBy and refusedBy should be able to access it, so when accessors to PaperReviewRefused request their data, PII fields like email, firstName, and lastName should be anonymized.
  • I am a bit suspicious about nothing pointing to ReviewRequest even though it has a DATA_SUBJECT annotation. Could it be that some FKs in PaperReviewRefused or PaperReview should actually point to ReviewRequest? It would be nice if someone more familiar with horcrp took another look.
    • hm, I am not sure if there should be FKs to ReviewRequest. @KinanBab what do you think?

@KinanBab
Copy link
Collaborator

For Hotcrp:

  • I agree with contactId owning the PaperReviewRefused. Also note that contactId is not null while refusedBy is null.
  • I think ON GET (requestedBy) ANON (...) is reasonable. I agree that it feels a little redundant, but putting it in would not cause any harm w.r.t compliance. On the other hand, if we remove this, we will be taking on a risk in case there is some subtle policy issue with allowing the requester to see the review. I do not feel comfortable doing that myself, but perhaps I would have done it if we were hotcrp's original developers and we understood all the nuances of the application.
  • I think ReviewRequest being isolated like this is reasonable. The reason we made it a data subject is that the request may be directed at someone who does not have a hotcrp account (i.e. is not a contactId). Such a person still has rights. However, whenever this person takes any action on the request, the application forces them to create an account (e.g. they have to create an account in order to accept the request and provide a review, they also have to create an account to reject the review because contactId in PaperReviewRefused is not null).

@KinanBab
Copy link
Collaborator

Some general followups:

  1. @artemagvanian discovered a small visual bug in EXPLAIN COMPLIANCE that makes the output look a bit wrong (it ends up confusing source tables for the next tables in a transitive chain).
  2. We are making progress independently on supporting self-directed ACCESSED_BY and ACCESSES.

Both these things will be done and merged either Tuesday or Wednesday. My plan is to merge this immediately after.

@benkilimnik It would be great if you can do the following after the above is done and merged:

  1. Put back the self accessed_by annotation for comments in hotcrp (was there some other application that also had this behavior?)
  2. Reproduce the EXPLAIN COMPLIANCE output in the README after making all the pending changes to the schemas and after merging in the visual bug fix.

I will ping you on slack when the above is done and merged so that you can do the final touches. I can merge this immediately after.

@KinanBab
Copy link
Collaborator

Sorry @benkilimnik I meant comments in commento not in hotcrp.

* Modify sqlengine/create.cc to correctly handle ACCESSED_BY self references
* Throw an error if self reference is OWNED_BY, OWNS, or ACCESSES
* Add tests to ensure self reference works correctly, including anonymization
* Fix visual bug in EXPLAIN COMPLIANCE
* Decrement user count in gdpr_forget.cc
@KinanBab KinanBab force-pushed the annotate-schemas branch 2 times, most recently from 9456d27 to 3fb00c1 Compare May 23, 2023 23:39
* Commento
* ghchat
* hotcrp
* instagram
* schnack
* socify
* mouthful

This includes adding missing FKs for many of the schemas that did not have any FKs, extracting schema from non-sql code, and normalizing some schemas that were not in normal form.

Add a README with a description of interesting scenarios and output of EXPLAIN COMPLIANCE.
Copy link
Collaborator

@KinanBab KinanBab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good! Approved

@KinanBab KinanBab merged commit c83dd78 into master May 28, 2023
@KinanBab KinanBab deleted the annotate-schemas branch May 28, 2023 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants