DIP-267 Profile discoverability #267

Closed
wesbiggs opened this issue Dec 14, 2023 · 11 comments

@wesbiggs
Member

wesbiggs commented Dec 14, 2023

Should Profiles be discoverable without the aid of a content indexer? A Profile link type could be specified as User Data that references a URL and hash. In effect, we would be moving Profile announcements from offline batches to User Data.

Pros:

  • Better alignment with usage patterns expected in social media.
  • Ability for applications to navigate to profile without the aid of other services.

Cons:

  • Higher storage cost for consensus system.

Rejected design option:

  • Putting user-generated "speech" content directly in a consensus system is an anti-goal of DSNP for legal risk reasons as well as technical storage requirement concerns.

Adjacent concerns:

  • The ability to derive a current URL for a Profile could be used by applications to activate Mention tags without outside knowledge; cf. DIP-266 Activity Content "Mention" alignment #266
  • Mastodon uses Person instead of Profile, though most of the fields have overlapping semantics.

Linked PR:

@shannonwells
Collaborator

Separating out Profile links, instead of requiring them to be part of batched announcements, carries a slightly to much greater legal risk, simply because far more URLs would be stored on the consensus system. It would increase opportunities for corrupted or corrupting (i.e. changed after the fact) URLs. Content indexers would ignore these URLs, and the system would potentially lose many "pairs of eyes" on potentially problematic content.

@wesbiggs
Member Author

wesbiggs commented Mar 8, 2024

I agree that open URLs in consensus system storage are a risk. Let's consider using CIDs only (à la batch publications).

Latency in finding profile documents on an external distributed file store (e.g. IPFS) could be mitigated by allowing providers to give hints in the form of URL templates that identify their preferred gateways. (For discussion.)
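
One way to picture the gateway-hint idea: a provider publishes URL templates and a client substitutes the profile CID into each. The sketch below is illustrative only. The `{cid}` placeholder syntax, the function name, and the example gateway hosts are all assumptions, since the template format is left open for discussion here.

```python
from typing import List

CID_PLACEHOLDER = "{cid}"  # assumed placeholder syntax; not defined in this thread


def candidate_urls(cid: str, templates: List[str]) -> List[str]:
    """Expand each provider-supplied template into a fetchable URL for the CID.

    Templates without the placeholder are skipped, so a hint cannot point
    at a URL that is unrelated to the CID.
    """
    return [t.replace(CID_PLACEHOLDER, cid) for t in templates if CID_PLACEHOLDER in t]


# Hypothetical provider hints (example hosts only, placeholder CID).
hints = [
    "https://gateway.example/ipfs/{cid}",
    "https://{cid}.ipfs.provider.example/",
    "https://static.example/not-a-template",
]
urls = candidate_urls("bafyexamplecid", hints)
```

Because only the CID would live on the consensus system, a client that distrusts the hints could still fetch the same content from any other gateway and check it against the CID.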

@wilwade
Member

wilwade commented Mar 13, 2024

An expansion on some comments I made at the Community Call last week.

Layers

There are several layers here, and I want to lay them out again for clarity around this discussion.

  1. DSNP Id (aka the user account identifier)
  2. Discovery of the existence of a user's profile
  3. Discovery of the user's current profile document location
  4. Retrieval of the user's current profile document
  5. Validation of the user's current profile document
  6. Parsing of the user's current profile document
  7. Attributes on the user's current profile document
    • Public Existence, Public Content (currently the only type)
    • Public Existence, Content Protected (There could also be existence protected, but they wouldn't be in the profile document)
    • Additional/Conditional: Attributes that only show under some circumstances or secondary profile that overrides the "default" in some situations or such.
  8. Retrieval of references from user's current profile document
    • Avatar, etc...

Layers and this Discussion

The only layers being discussed here are really 2-3:

  2. Discovery of the existence of a user's profile
    • Current: Discovered by exhaustive search of all published Profiles
    • Proposed by @wesbiggs: Discovered by a direct User Data query to the consensus system
  3. Discovery of the user's current profile document location
    • Current: Most recent profile that returns
    • Proposed by @wesbiggs: Resolved URL or none
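
The difference between the current and proposed approaches to layers 2-3 can be sketched with in-memory stand-ins. This is a simplified illustration, not DSNP or Frequency code: the `ProfileAnnouncement` shape, the function names, and the placeholder CIDs are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ProfileAnnouncement:
    """Hypothetical minimal view of a published Profile announcement."""
    user_id: int
    cid: str
    block_number: int


def find_profile_by_scan(announcements: Iterable[ProfileAnnouncement],
                         user_id: int) -> Optional[str]:
    """Current approach: scan every published Profile, keep the newest match."""
    latest = None
    for a in announcements:
        if a.user_id == user_id and (latest is None or a.block_number > latest.block_number):
            latest = a
    return latest.cid if latest else None


def find_profile_by_user_data(user_data: Dict[int, str], user_id: int) -> Optional[str]:
    """Proposed approach: a single keyed lookup in User Data."""
    return user_data.get(user_id)


announcements = [
    ProfileAnnouncement(1, "cid-old", 10),
    ProfileAnnouncement(2, "cid-other", 11),
    ProfileAnnouncement(1, "cid-new", 12),
]
user_data = {1: "cid-new", 2: "cid-other"}
```

The scan has to read the whole announcement history (in practice, via an indexer), while the keyed lookup answers "profile or none" directly.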

Constraints

As discussed above, the primary reason for not choosing direct User Data in the initial design is the cost of storage and churn of the additional data.

  • Assuming an IPFS CID, that is approximately 40 bytes per user. For 100 million users, that's about 4 GB. (Likely more, as the overhead isn't exact.)
  • With 100 million users, assuming only 10% new profiles each year and 50% churn, that's 60 million transactions per year, or ~2 per second. On a blockchain with a 6-second block time, that's about 12 transactions per block used just by profile changes.

The current system is better, assuming that some of that churn can be batched, but it does have a longer-term growth issue.
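
The back-of-the-envelope numbers above can be reproduced with a short calculation. This sketch assumes "50% churn" means half of the user base updates a profile once per year, and that every new or changed profile costs exactly one transaction; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def profile_storage_bytes(users: int, bytes_per_user: int = 40) -> int:
    """Raw storage for one profile pointer per user, ignoring overhead."""
    return users * bytes_per_user


def profile_updates_per_year(users: int, new_pct: int = 10, churn_pct: int = 50) -> int:
    """Transactions per year: new profiles plus updated ones (assumed one tx each)."""
    return users * (new_pct + churn_pct) // 100


def updates_per_block(updates_per_year: int, block_time_s: float = 6.0) -> float:
    """Average profile transactions landing in each block."""
    return updates_per_year / SECONDS_PER_YEAR * block_time_s


users = 100_000_000
storage = profile_storage_bytes(users)      # 4,000,000,000 bytes, about 4 GB
yearly = profile_updates_per_year(users)    # 60,000,000 transactions
per_second = yearly / SECONDS_PER_YEAR      # roughly 1.9
per_block = updates_per_block(yearly)       # roughly 11.4
```

The unrounded figures (about 1.9 per second and 11.4 per block) agree with the rounded "~2 per second" and "12 per block" estimates.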

@wesbiggs
Member Author

wesbiggs commented Apr 4, 2024

To capture a side discussion: the number of updates required for profiles to be linked from User Data is estimated to be far fewer than the number of social graph updates. In that light, this proposal doesn't change the order of magnitude of consensus system changes significantly.

@wilwade
Member

wilwade commented May 2, 2024

Notes from DSNP Spec Community Call 2024-05-02

  • Trend to move things into User Data from Announcement Data. We need to profile the data usage on the implementation side
    • Various options for implementation, but it makes sense from the perspective of keeping the spec separate.
  • What about private profile data?
  • What about alternate profile expressions, such as into a group?
  • Loss of the ability to know recently changed profiles
    • CID changes at least tell you about a change given you have a cached version

@wesbiggs
Member Author

I'll let those closer to implementations look at optimal storage solutions, but compared to graph storage and updates I don't think profile CIDs would add significant requirements.

There's a separate discussion to be had about the different types of profile data as noted in the spec call comment. I think the goal at the moment would be to cover the existing (public Activity Content) profile definition, but provide a structure to enable future profile-linked formats via the file type enumeration.

The notion of context-based profiles is a deep one and gets into the notion of identity expression via personas. DSNP currently lacks a two-level structure for this, so it is assumed that a DSNP User Id represents a single persona. Structurally, we might consider a way for a user (e.g. a human) to participate in various social networking activities with their choice of personas (and/or participate anonymously in some activities). For separation of personas to be effective, we would like to create a system that (through cryptographic techniques, say) did not enable easy correlation of multiple personas belonging to an individual by a third party. For this reason I think alternate profile expressions are out of scope for this work item, but I'd be open to further discussion.

Change visibility: DSNP requires systems to generate and emit State Change Records whenever User Data changes. While systems are not required to make historical data available, observers can still build their own history set and watch for changes that impact their caches.
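
As a rough sketch of how an observer might use State Change Records to keep a profile cache fresh: the class below remembers the last CID seen per user and flags a user as stale when a different CID arrives. The record shape (a user id plus the new CID), the class, and the method names are assumptions for illustration; the thread does not define them.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ProfileCacheObserver:
    """Tracks the last known profile CID per user and which users need a refetch."""
    known_cids: Dict[int, str] = field(default_factory=dict)
    stale_users: Set[int] = field(default_factory=set)

    def on_state_change(self, user_id: int, new_cid: str) -> bool:
        """Handle one (hypothetical) state change record.

        Returns True when the CID differs from the cached one, meaning the
        cached profile document should be fetched again.
        """
        changed = self.known_cids.get(user_id) != new_cid
        if changed:
            self.known_cids[user_id] = new_cid
            self.stale_users.add(user_id)
        return changed

    def mark_refreshed(self, user_id: int) -> None:
        """Call after the new profile document has been fetched and cached."""
        self.stale_users.discard(user_id)


observer = ProfileCacheObserver()
first = observer.on_state_change(42, "cid-v1")   # new user: needs a fetch
observer.mark_refreshed(42)
repeat = observer.on_state_change(42, "cid-v1")  # same CID: nothing to do
update = observer.on_state_change(42, "cid-v2")  # changed CID: stale again
```

An observer built this way only learns about changes it was online to see, which matches the point that systems are not required to serve historical data.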

@wilwade
Member

wilwade commented May 14, 2024

@wesbiggs from a Frequency implementation view, usually we build Frequency Schemas to be versioned via the schema identifier as that is needed to parse the particular data.

If we did this, I could see an on-chain data structure something like the one below, with the DSNP Profile version inside the data structure instead of outside, as most are/should be.

{
  type: "record",
  name: "UserProfilePointer",
  namespace: "org.dsnp",
  fields: [
    {
      name: "cid",
      type: "bytes",
    },
    {
      name: "version",
      type: "int",
    },
  ],
}

It would be stored in the User Data store (Stateful Storage), with a schemas-repo-style deploy config:

{
      model: profile,
      modelType: "AvroBinary",
      payloadLocation: "Paginated",
      settings: [],
      dsnpVersion: "1.x",
}

It wouldn't handle the persona issue, which one could argue should be handled at the metadata level. But even if there were a persona setup in the generalized profile (instead of in a group's settings or such), I think keeping that data in one file (with links to media such as profile pictures) still makes sense, because the profile is quite small even with several duplicated pieces of data.

Side Note: I could also see the on-chain metadata including a length value for the expected IPFS file as Frequency has for the IPFS Payload Location structure as well. This informs consumers of the data before they attempt to download extremely large files that might not actually be real profiles.
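
A consumer-side use of the declared length might look like the following sketch. The size cap, its value, and the function names are assumptions; nothing here is taken from Frequency's actual IPFS Payload Location handling.

```python
MAX_PROFILE_BYTES = 64 * 1024  # hypothetical cap chosen by the consuming application


def should_download(declared_length: int, max_bytes: int = MAX_PROFILE_BYTES) -> bool:
    """Decide, before any network I/O, whether the declared size is plausible."""
    return 0 < declared_length <= max_bytes


def matches_declared_length(declared_length: int, body: bytes) -> bool:
    """After download, reject bodies whose size differs from the on-chain value."""
    return len(body) == declared_length


small_ok = should_download(2_048)
huge_ok = should_download(5_000_000_000)
body_ok = matches_declared_length(5, b"hello")
body_bad = matches_declared_length(5, b"hello, world")
```

A streaming client could also stop reading once it has received more than the declared number of bytes, rather than checking only at the end.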

@wesbiggs
Member Author

I'm not following, why is the version needed within the Avro?

I think including the expected byte length is a useful addition.

@wilwade
Member

wilwade commented May 15, 2024

> I'm not following, why is the version needed within the Avro?

It isn't required, but it is an optimization for the search.

Let's imagine a future where there are 10 non-backward compatible versions. In this case, each query to Frequency to look for a profile could (assuming you wished to support all 10) require up to 10 queries to the chain to discover the profile. On Frequency the map for Stateful storage is User Id, Schema Id. (Then page/item depending).

By shifting the version from effectively the Schema Id layer into the data layer, it allows the content version to shift as long as the metadata doesn't.
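
The query-count argument can be made concrete with a toy model of a store keyed by (User Id, Schema Id). Everything here is a stand-in: the dictionary replaces chain queries, and the schema ids, record contents, and function names are invented for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

Store = Dict[Tuple[int, int], dict]  # (user_id, schema_id) -> stored record


def lookup_version_per_schema(store: Store, user_id: int,
                              schema_ids: Iterable[int]) -> Tuple[Optional[dict], int]:
    """Version lives in the Schema Id: try each supported schema in turn."""
    queries = 0
    for schema_id in schema_ids:
        queries += 1
        record = store.get((user_id, schema_id))
        if record is not None:
            return record, queries
    return None, queries


def lookup_version_in_data(store: Store, user_id: int,
                           schema_id: int) -> Tuple[Optional[dict], int]:
    """Version lives inside the record: one schema id, one query."""
    return store.get((user_id, schema_id)), 1


# Ten hypothetical, non-backward-compatible schema ids (101..110).
supported = list(range(101, 111))
per_schema_store: Store = {(7, 110): {"cid": "cid-a"}}
in_data_store: Store = {(7, 200): {"cid": "cid-a", "version": 10}}

_, worst_case_queries = lookup_version_per_schema(per_schema_store, 7, supported)
_, missing_queries = lookup_version_per_schema(per_schema_store, 8, supported)
record, single_query = lookup_version_in_data(in_data_store, 7, 200)
```

With the version in the schema id, a user without any profile always costs the full ten queries; with the version in the data, both hits and misses cost one.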

@wesbiggs
Member Author

wesbiggs commented Jun 3, 2024

I was envisioning that the spec itself might add or replace profile document types over time, hence the inclusion of the type enum value for each link record. For example, the DSNP community might start with the existing Activity Content Profile JSON doc, but later decide that something like a Solid WebID Profile is important to enable for various use cases. The user data payload could include both or either of these.

I think if we do this we keep it open for future evolution of types while not suffering version-related issues, as the core data (cid, type, and size) remains consistent regardless of the target file type.
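
A minimal sketch of the idea that the core data (cid, type, and size) stays the same across document types: a client filters link records by the types it understands. The field names, the numeric type values, and the second document type are illustrative assumptions, not definitions from the spec.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Illustrative type values only.
TYPE_ACTIVITY_CONTENT_PROFILE = 1
TYPE_HYPOTHETICAL_OTHER_PROFILE = 2


@dataclass(frozen=True)
class ProfileLink:
    """Core data for one link record: stable regardless of target file type."""
    type: int
    cid: str
    length: int


def pick_link(links: Sequence[ProfileLink],
              preferred_types: Sequence[int]) -> Optional[ProfileLink]:
    """Return the first link whose type the client supports, in preference order."""
    for wanted in preferred_types:
        for link in links:
            if link.type == wanted:
                return link
    return None


links: List[ProfileLink] = [
    ProfileLink(TYPE_HYPOTHETICAL_OTHER_PROFILE, "cid-other", 900),
    ProfileLink(TYPE_ACTIVITY_CONTENT_PROFILE, "cid-ac", 1200),
]
chosen = pick_link(links, [TYPE_ACTIVITY_CONTENT_PROFILE])
none_supported = pick_link(links, [99])
```

A client that only understands one document type ignores the others without needing to parse them, so new types could be added without breaking older clients.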

@wesbiggs changed the title from "Discussion: Profile discoverability" to "DIP-267 Profile discoverability" on Jun 6, 2024
wesbiggs added a commit that referenced this issue Jun 17, 2024
Problem
=======
#267 discusses the advantages of DSNP profile documents being
discoverable based on a user's identifier, rather than being temporal
data that must be externally tracked and indexed to be useful.

Solution
========
This proposal adds a new User Data type, `profileResources`, which is
currently defined for the Activity Content profile document type, but is
extensible to other document types that may need to be user-centric in
the future.
A profile resource is simply a link type (as an integer) and a CID as a
string plus byte length.
We define link type 1 for AC profiles.

This change assumes we are versioning for 1.3, but does not include
versioning updates, which are in other open PRs at this time.

Change summary:
---------------
- [ ] ~~Updated Spec Versions~~
- [x] Added definition for ProfileResource Avro type
- [x] Added entry in User Data page for `profileResource`
- [x] Updated navigation; added ProfileResource, moved Profile
announcement to "Migrated" section

---------

Co-authored-by: Wes Biggs <wes.biggs@amplica.io>
@wesbiggs
Member Author

wesbiggs commented Jul 8, 2024

This functionality has been integrated for DSNP 1.3.

@wesbiggs closed this as completed on Jul 8, 2024