
create custom replicate function instead of corestore.replicate #56

Closed
sethvincent opened this issue Jan 9, 2023 · 7 comments

@sethvincent
Contributor

sethvincent commented Jan 9, 2023

corestore.replicate: https://github.com/holepunchto/corestore/blob/master/index.js#L271

Instead of looping through all the cores in the corestore, this function would add only the cores that are available to replicate, as determined by the sync stage.

from @gmaclennan:

I think this should be implemented as a finite state machine, so that when it is in the "notAuthorized" state, the ondiscoverykey method will only add a core to the replication stream if it is an auth store core. It should probably queue up the keys of other cores that appear in the stream and then add them once it transitions to an "authorized" state. We would need to also manually manage adding newly opened cores to the replication stream.
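A minimal sketch of that state machine in plain JavaScript (`isAuthCore` and `addToStream` are hypothetical stand-ins for the real corestore and replication-stream integration):

```javascript
// Toy finite state machine gating core replication on authorization.
// `isAuthCore` and `addToStream` are hypothetical helpers standing in for
// the real corestore / replication-stream wiring.
class ReplicationFSM {
  constructor({ isAuthCore, addToStream }) {
    this.state = 'notAuthorized'
    this.pending = new Set() // discovery keys seen before authorization
    this.isAuthCore = isAuthCore
    this.addToStream = addToStream
  }

  // Called whenever a discovery key appears in the replication stream.
  ondiscoverykey(discoveryKey) {
    if (this.isAuthCore(discoveryKey) || this.state === 'authorized') {
      this.addToStream(discoveryKey)
    } else {
      // Queue non-auth cores until the peer is authorized.
      this.pending.add(discoveryKey)
    }
  }

  // Called once the auth store confirms the remote peer.
  authorize() {
    this.state = 'authorized'
    for (const key of this.pending) this.addToStream(key)
    this.pending.clear()
  }
}
```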

@tomasciccola
Contributor

I was wondering about this and its relation to staged sync.
If a user gets added to a project, should they eagerly get every asset (data, blobs) in that project? I think data (like observations) should be synced eagerly, but if blobs (like images) are referenced by something like an observation, maybe they could be downloaded lazily?
Like, if you open an observation that has an image attached, it would download at that moment. Thinking out loud, maybe that's not a great idea, since an internet connection may not always be available, so a user may consciously want to download everything at once and forget about it.

@sethvincent
Contributor Author

Yeah I think it'll depend on both the data type and the device. Something like:

mobile device

  1. authstore: load all data
  2. datastore: load all data if authstore authorizes device
  3. blobstore/hyperdrive
     4. load all metadata from the hyperbee
     5. only load data prefixed with /thumbnail or /preview, not the /original images from other peers

laptop

  1. authstore: load all data
  2. datastore: load all data if authstore authorizes device
  3. blobstore/hyperdrive
     4. load all metadata from the hyperbee
     5. load all data including /original images

The main difference is just that the desktop app will get all of the /original, /thumbnail, and /preview images, while the mobile app will get only /thumbnail and /preview.
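The per-device policy above could be captured in a small filter function (a sketch; `shouldDownloadBlob` and the `deviceType` parameter are hypothetical names, not existing API):

```javascript
// Decide whether a blob path should be downloaded on a given device type.
// Mirrors the policy above: desktop takes everything, mobile skips /original.
function shouldDownloadBlob(deviceType, path) {
  if (deviceType === 'desktop') return true
  // mobile: only thumbnails and previews from other peers
  return path.startsWith('/thumbnail') || path.startsWith('/preview')
}
```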

Sidenote: we'll need to think about how blobstore will handle file types other than images.

@gmaclennan
Member

Re. downloading lazily, our users are almost entirely offline, so sync needs to replicate everything eagerly. When we implement selective sync (e.g. offloading originals and in some cases previews when we know they are backed up elsewhere, e.g. online) then we might want to support lazy sync of the original/preview if the device is offline. Otherwise we would just show the scaled-up thumbnail/preview.

In terms of how replication would work, I think we should create sparse replication streams. This will allow us to connect to clients and update hypercore lengths, so that we can provide users with feedback about how much data there is to sync. We can then use the core.download() method to start downloading data.
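To illustrate why updated lengths give us progress feedback: once we know the remote core length, remaining work is just a comparison against the blocks we already hold (a toy calculation; the real values would come from hypercore's `length` and its bitfield, not a `Set`):

```javascript
// Toy sync-progress calculation: given the remote core length and the set of
// block indexes we already have locally, report how many blocks remain.
function syncProgress(remoteLength, haveBlocks) {
  let have = 0
  for (const i of haveBlocks) if (i < remoteLength) have++
  return { total: remoteLength, have, remaining: remoteLength - have }
}
```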

Sidenote: we'll need to think about how blobstore will handle file types other than images.

Yes, I've been thinking about that. For a video a "thumbnail" can be a frame from the video, but not sure if we have the processing power (or tech resources) to choose a thumbnail intelligently, so maybe just the first frame? The preview I was thinking could be a collection of jpegs of every X frames. To do this we would probably need ffmeg to be working on mobile, or we push the generation of this to the client (which we do for image resizing) and lean on an existing React Native library (if there is one) to do this.

Will we want to namespace the blob core by type to facilitate downloading? e.g. /video/original, /video/thumbnail, since hyperdrive does make it easy to download a folder. But maybe, given that selective sync is going to make things more complicated anyway, we will just use the hyperbee metadata for this, e.g.

  1. Download the entire hyperbee (db core) via drive.downloadRange({ start: 0, end: -1 })
  2. Read the hyperbee, decide what data we want (based on path prefix (thumbnail, preview, original), media type, owner, and potentially location), and get all the corresponding ranges in the blob core
  3. Call drive.downloadRange() again for all the ranges we want (this is essentially what drive.download(folder) does behind the scenes anyway, it just gives us more control).
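Step 2 could look roughly like this (a sketch over hypothetical metadata entries; the real hyperbee values encode blob offsets differently):

```javascript
// Select blob-core ranges to download from hyperbee metadata entries.
// Each entry is assumed (hypothetically) to look like:
//   { path, blocks: { start, end } }
function selectRanges(entries, wantedPrefixes) {
  return entries
    .filter((e) => wantedPrefixes.some((p) => e.path.startsWith(p)))
    .map((e) => ({ start: e.blocks.start, end: e.blocks.end }))
}
```

Each returned range would then be passed to `drive.downloadRange()` in step 3.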

Anyway, that is a side-note, conversation to be continued in blob store / selective sync conversation.

@gmaclennan
Member

In terms of the logic for this sync, with peer A connecting to peer B, from the perspective of peer A:

  1. Peer B starts off in state of "unauthorized".
  2. Peer A eagerly adds all the auth store cores it knows about to the replication stream.
  3. As peer B adds cores to the replication stream, ondiscoverykey on peer A will fire and look up the discovery key:
    i. Refers to an auth core already in the replication stream -> do nothing.
    ii. Refers to a known data or blob core -> if B is authorized, add to stream; if not, do nothing.
    iii. "A" does not know the core referred to by the discovery key -> request the corresponding public key from B and add to stream.
  4. Peer A downloads all data in auth store cores, validates that Peer B is authorized (peer B state -> "authorized").
  5. Peer A adds all known cores to the replication stream.
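The lookup in step 3 could be sketched as a single decision function (all helper names here are hypothetical stand-ins for the real core manager):

```javascript
// Decide what to do when peer B introduces a discovery key (step 3 above).
// `known` maps discovery keys to { namespace }; `addToStream` and
// `requestKey` are hypothetical stand-ins for the real integration.
function onDiscoveryKey(key, { known, peerAuthorized, addToStream, requestKey }) {
  const core = known.get(key)
  if (core && core.namespace === 'auth') return // (i) already replicating
  if (core) {
    if (peerAuthorized) addToStream(key) // (ii) known data/blob core
    return
  }
  requestKey(key) // (iii) unknown core: ask B for the public key
}
```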

Our core manager will also need an add-core event for when new cores are added (e.g. from replication with peer C). The replication manager for peer A-B replication would listen on this and if B is authorized, add it to the stream with B.

@gmaclennan
Member

Just thinking about it a bit more, authorization state is one of unknown, authorized, or unauthorized. I am assuming that for unauthorized we would close the stream and blocklist the peer (to avoid re-connecting)?

Our auth store will need authorized and unauthorized events, I think: peer B might become "unauthorized" after initially being "authorized", because peer A syncs with peer C, which contains a statement saying that peer B is removed. So the replication manager for peer A-B replication should listen on the unauthorized event and close the replication stream (and blocklist?), even if it had already started to replicate. If there were a way to turn off uploading per-peer in a replication stream, it would be useful, when a peer is or becomes "unauthorized", to only turn off uploading rather than close the stream, so that peer A can still download all the data it might be missing from peer B.

Failing implementation of holepunchto/hypercore#305, we could always "hack" this by piping through a transform stream that filters out all upload messages when an uploading state is set to false. It could be costly in terms of needing to parse messages, but we would not need to parse the entire message, just get chunks (via the length prefix) and read the first bytes to get the message type. It feels a bit fragile though, so holepunchto/hypercore#305 would be better.

@gmaclennan
Member

I have implemented this in #61. coreManager.replicate() expects a noise stream, and returns an instance of ReplicationStateMachine that can be used to control replication:

const rsm = coreManager.replicate(noiseStream)

Initially, rsm.state.enabledNamespaces.values() is ['auth']. Only auth cores will replicate in the replication stream.

You can start replicating other namespaces with rsm.enableNamespace(namespace) where namespace can be data, blobIndex or blob.

Unfortunately there is no rsm.disableNamespace(namespace) yet, because I can see no easy way to remove a core from the replication stream without closing the stream itself.

I thought this was better than the core manager having a concept of "authorized" and "unauthorized"; instead it just gives the option to control which namespaces are replicating, and the authstore or a "replication manager" can decide which namespaces will replicate.
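A toy model of that contract (not the #61 implementation, just its observable behaviour: auth replicates by default, other namespaces opt in, and there is no way to opt back out):

```javascript
// Minimal model of the ReplicationStateMachine contract described above.
class ReplicationStateMachine {
  constructor() {
    this.state = { enabledNamespaces: new Set(['auth']) }
  }

  // Start replicating cores in a namespace ('data', 'blobIndex', 'blob').
  enableNamespace(namespace) {
    this.state.enabledNamespaces.add(namespace)
  }

  // Would a core in this namespace currently be added to the stream?
  shouldReplicate(core) {
    return this.state.enabledNamespaces.has(core.namespace)
  }
}
```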

@gmaclennan
Member

I think this is OK to close, @sethvincent, since this is all now implemented in #61 and #52? Re-open if you think it needs to be.
