The main purpose of this project is to index different objects from GitHub and expose them for OnlyDust marketplace. The following GitHub objects are indexed:
- Users
- Repositories
- Issues
- Pull Requests
- Code reviews
- Commits
The project is built following the hexagonal architecture.
The database is built as a data lake, organized into 4 different schemas:
- indexer: This is the main DB schema, used to store core data, like job statuses
- indexer_raw: This schema is used to store
raw
data from the GitHub API - indexer_exp: This schema is used to store the
exposed
data for the marketplace - indexer_outbox: This schema is used to store the outgoing events
The clean
data is represented in-memory as Java objects, not persisted for the time being.
The Indexer
purpose is to read data from a RawStorageReader
and build the clean
model.
This allows any indexer to be decorated to add new features, like caching.
The Exposer
is an Indexer
decorator that read the clean
model and write the exposition
model.
There are currently 2 cache-related decorators implemented:
CacheReadRawStorageReaderDecorator
: This cache is used to cache theRawStorageReader
readsCacheWriteRawStorageReaderDecorator
: This cache is used to store in the cache the data read from theRawStorageReader
Several indexers are implemented, each one with a given purpose:
UserIndexer
: Indexes GitHub users, including its social accountsRepoIndexer
: Indexes GitHub repositories, including its languagesFullRepoIndexer
: Indexes GitHub repositories including the issues and pull requestsIssueIndexer
: Indexes GitHub issues, including the assigneesPullRequestIndexer
: Indexes GitHub pull requests, including the reviews and commits
Each indexer is decorated into several flavors:
cacheOnly
: This flavors only reads the raw storage from the cache, used mainly for refresh purposes.live
: This flavor reads the raw storage from the cache, if not found, reads from the GitHub API and store the result in the cache. This is used for the main indexing processes.diff
: This flavor works like thelive
flavor but only reads incremental changes from the GitHub API.
The project contains a webhook that listens to GitHubApp events. When an event is received, the relevant indexers are triggered to update the data.
The project contains a job scheduler that triggers the indexers to update the data on a regular basis, in case of missed events.
When a user registers in the marketplace, a request is sent to this repo to index the user. For this purpose, all public events related to the user are fetched and indexed.
2 sources of public events are used:
- Github Archives: We fetch the public events using the public BigQuery dataset from user account creation date to the day before the current date.
- Github API: We fetch the current day events from the GitHub API.
In order to avoid hitting the GitHub API rate limit, the public events indexing process works as follows:
- Public events are fetched and stored in the
indexer_raw
schema. raw
inner objects(like repos, pull requests, issues) are read from the events and stored in theindexer_raw
schema.- Some mandatory objects are indexed using
live
indexers.- For example, in the case of a pull request event, the repo and its owner are indexed using the
live
indexers.
- For example, in the case of a pull request event, the repo and its owner are indexed using the
- The relevant indexers are then triggered to expose the data.
cacheOnly
indexers are used to avoid fetching the GitHub API.
A Job
is run on a regular basis to fetch all users public events and index them.
The user languages are computed by the marketplace based on the file extensions of the commits the user has made. However, the modified files are not part of the GitHub public events. A dedicated call to the GitHub API, per commit, must be made to fetch them.
In order to avoid hitting the GitHub API rate limit, a sampling mechanism is used:
TODO: Add the sampling mechanism description