
The Design of a Collection

quartzjer edited this page Jul 8, 2011 · 3 revisions

A Collection has only two high-level duties:

  1. Collect data of a common type from all other services in the locker.
  2. "Clean up" the data: merge, match, collate, de-duplicate, augment, etc., to bring it all as close as possible to a single common schema. The details will vary widely from data type to data type, especially the augmentation of the data.

Collection Axioms/Guidelines

  1. A Collection is always "installed", i.e. when a locker comes to life, any collections that aren't already installed are installed before startup is considered complete.
  2. A Collection is not responsible for providing access to its data; that happens via the query API.
  3. A Collection should store field-specific source information, i.e. where the value of each field of an object originated.
  4. A Collection should NOT contain specific logic for mapping the schema of a subtype object into the schema of the common type. For example, a (collection-level) contact object should contain a "fullname" field, while contact/facebook objects contain a "full_name" field; the mapping between the two should happen only in the abstract, by asking the connector for its schema mapping. If non-common fields are needed for merging and matching, that information should be kept in dedicated, collection-only fields (i.e. fields not returned via the query API). For example, Foursquare exposes very useful profile info about the Facebook and Twitter accounts attached to each of its contacts. This information is not captured in the "common fields" of the contacts collection, but it can be captured in "_matching.facebook_id" and "_matching.twitter_id" fields, which can then be used to merge and match new Foursquare contact data (or future Facebook contact data).
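
As a rough illustration of guidelines 3 and 4, here is a minimal sketch of mapping "in the abstract". It is not the locker's actual code: the shape of the schema map, the `toCommon` helper, the `_sources` field name, and every raw field name other than "full_name" are assumptions made for the example.

```javascript
// Hypothetical schema maps, as a connector might supply them when asked.
// Only the "full_name" -> "fullname" pair comes from the text above; the
// other raw field names are invented for illustration.
const schemaMaps = {
  "contact/facebook": {
    common: { full_name: "fullname" },
    matching: { id: "facebook_id" }
  },
  "contact/foursquare": {
    common: { name: "fullname" },
    matching: { facebook: "facebook_id", twitter: "twitter_id" }
  }
};

// Convert a subtype object into a collection-level object using only the
// supplied map, so no connector-specific logic lives in the collection.
function toCommon(type, raw, map) {
  const out = { _sources: {}, _matching: {} };
  for (const [from, to] of Object.entries(map.common)) {
    if (raw[from] !== undefined) {
      out[to] = raw[from];
      out._sources[to] = type; // field-specific source information
    }
  }
  for (const [from, to] of Object.entries(map.matching)) {
    if (raw[from] !== undefined) {
      out._matching[to] = String(raw[from]); // collection-only matching info
    }
  }
  return out;
}

// Usage: a Foursquare contact carrying its attached Facebook/Twitter accounts.
const fsq = toCommon(
  "contact/foursquare",
  { name: "Jane Doe", facebook: 1234, twitter: "janedoe" },
  schemaMaps["contact/foursquare"]
);
const fb = toCommon(
  "contact/facebook",
  { full_name: "Jane Doe", id: 1234 },
  schemaMaps["contact/facebook"]
);
```

With a layout like this, the two objects end up sharing a "_matching.facebook_id" value, which is the hook a later merge step could use without knowing anything about either connector.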

Implementation

The Bits

A Collection has several parts (using Contacts as an example):

  • contacts.collection - This is the manifest file for the contacts collection. Information about manifest files can be found in the App Manifest wiki page.
  • contacts.js - This is where the collection starts; it creates a web service to receive events and scheduled update callbacks from core.
  • dataStore.js - Contains all logic for merging, matching, collating, and de-duplicating contact data and pushing that into the database. The merge logic and the database writes, two seemingly separate concerns, live in the same file because the database itself is used heavily for comparisons.
  • sync.js - Contains flow logic for retrieving data from appropriate connectors (via their internal REST APIs) if events have been missed.
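
To make the merging and matching role of dataStore.js more concrete, here is a minimal sketch of a merge step built on "_matching" fields. It follows the guidelines rather than the current code, and several things are assumptions: the in-memory `store` array stands in for the database, and the `findMatch` and `upsert` helpers and the `_sources` field name are hypothetical.

```javascript
// In-memory stand-in for the collection's database (assumption: the real
// dataStore.js runs these comparisons against a database).
const store = [];

// Return the first stored contact that shares any "_matching" value with
// the incoming contact, or undefined if there is none.
function findMatch(incoming) {
  return store.find((existing) =>
    Object.keys(incoming._matching).some(
      (key) =>
        existing._matching[key] !== undefined &&
        existing._matching[key] === incoming._matching[key]
    )
  );
}

// Insert a new contact, or merge it into a matching one: fill in fields
// that are missing, record which source supplied each of them, and union
// the matching keys (values already stored win on conflict).
function upsert(incoming) {
  const existing = findMatch(incoming);
  if (!existing) {
    store.push(incoming);
    return incoming;
  }
  for (const [field, value] of Object.entries(incoming)) {
    if (field.startsWith("_")) continue; // bookkeeping fields handled below
    if (existing[field] === undefined) {
      existing[field] = value;
      existing._sources[field] = incoming._sources[field];
    }
  }
  existing._matching = { ...incoming._matching, ...existing._matching };
  return existing;
}

// Usage: a Foursquare contact arrives first, then the same person from Facebook.
upsert({
  fullname: "Jane Doe",
  _sources: { fullname: "contact/foursquare" },
  _matching: { facebook_id: "1234", twitter_id: "janedoe" }
});
const merged = upsert({
  fullname: "Jane D.",
  email: "jane@example.com",
  _sources: { fullname: "contact/facebook", email: "contact/facebook" },
  _matching: { facebook_id: "1234" }
});
```

Keeping the first value for a field is only one possible policy; a real implementation might prefer the most recent or most trusted source, which the per-field source information would make possible.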

The Flow

The flow is not worth laying out until a refactor is performed. It starts in contacts.js and goes from there; most of the action happens in dataStore.js.

State of the Code

Currently, the contacts collection does not adhere to the design pattern and axioms laid out above. :) It contains a whole host of connector-specific merging code and needs to be refactored to work with schema maps and common "_matching" fields. As a result, the distribution of code and logic is currently rather arbitrary, more a product of a lack of upfront design than anything else. It needs lots of refactoring. Again, :)

Future Direction

Currently, collections are read-only, and don't properly process delete (or in some cases, update) events.

Open Questions + "Blind Spots"

  • Should a collection's data be writeable directly from an authorized service (e.g. a merged contacts app that writes back manual merges and allows users to edit data), or should additional data be stored with that service?
  • Should a collection maintain a historical data store similar to that of a connector?