marketplace-indexer

The main purpose of this project is to index different objects from GitHub and expose them for OnlyDust marketplace. The following GitHub objects are indexed:

Users
Repositories
Issues
Pull Requests
Code reviews
Commits

How does it work?

The project is built following the hexagonal architecture.

The database is built as a data lake, organized into 4 different schemas:

indexer: This is the main DB schema, used to store core data, like job statuses
indexer_raw: This schema is used to store raw data from the GitHub API
indexer_exp: This schema is used to store the exposed data for the marketplace
indexer_outbox: This schema is used to store the outgoing events

The clean data is represented in-memory as Java objects, not persisted for the time being.

The Indexer purpose is to read data from a RawStorageReader and build the clean model. This allows any indexer to be decorated to add new features, like caching.

The Exposer is an Indexer decorator that read the clean model and write the exposition model.

There are currently 2 cache-related decorators implemented:

CacheReadRawStorageReaderDecorator: This cache is used to cache the RawStorageReader reads
CacheWriteRawStorageReaderDecorator: This cache is used to store in the cache the data read from the RawStorageReader

Several indexers are implemented, each one with a given purpose:

UserIndexer: Indexes GitHub users, including its social accounts
RepoIndexer: Indexes GitHub repositories, including its languages
FullRepoIndexer: Indexes GitHub repositories including the issues and pull requests
IssueIndexer: Indexes GitHub issues, including the assignees
PullRequestIndexer: Indexes GitHub pull requests, including the reviews and commits

Each indexer is decorated into several flavors:

cacheOnly: This flavors only reads the raw storage from the cache, used mainly for refresh purposes.
live: This flavor reads the raw storage from the cache, if not found, reads from the GitHub API and store the result in the cache. This is used for the main indexing processes.
diff: This flavor works like the live flavor but only reads incremental changes from the GitHub API.

GithubApp events

The project contains a webhook that listens to GitHubApp events. When an event is received, the relevant indexers are triggered to update the data.

Jobs

The project contains a job scheduler that triggers the indexers to update the data on a regular basis, in case of missed events.

GitHub public events

When a user registers in the marketplace, a request is sent to this repo to index the user. For this purpose, all public events related to the user are fetched and indexed.

2 sources of public events are used:

Github Archives: We fetch the public events using the public BigQuery dataset from user account creation date to the day before the current date.
Github API: We fetch the current day events from the GitHub API.

In order to avoid hitting the GitHub API rate limit, the public events indexing process works as follows:

Public events are fetched and stored in the indexer_raw schema.
raw inner objects(like repos, pull requests, issues) are read from the events and stored in the indexer_raw schema.
Some mandatory objects are indexed using live indexers.
- For example, in the case of a pull request event, the repo and its owner are indexed using the live indexers.
The relevant indexers are then triggered to expose the data. cacheOnly indexers are used to avoid fetching the GitHub API.

A Job is run on a regular basis to fetch all users public events and index them.

User languages indexing

The user languages are computed by the marketplace based on the file extensions of the commits the user has made. However, the modified files are not part of the GitHub public events. A dedicated call to the GitHub API, per commit, must be made to fetch them.

In order to avoid hitting the GitHub API rate limit, a sampling mechanism is used:

TODO: Add the sampling mechanism description

Name		Name	Last commit message	Last commit date
Latest commit History 344 Commits
.github		.github
.mvn/wrapper		.mvn/wrapper
application		application
aws		aws
bootstrap		bootstrap
coverage		coverage
doc		doc
domain		domain
infrastructure		infrastructure
scripts/initdb.d		scripts/initdb.d
.editorconfig		.editorconfig
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
codecov.yml		codecov.yml
docker-compose.yaml		docker-compose.yaml
docker-start.sh		docker-start.sh
lombok.config		lombok.config
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml
settings.xml		settings.xml
system.properties		system.properties
temp		temp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

marketplace-indexer

How does it work?

GithubApp events

Jobs

GitHub public events

User languages indexing

About

Releases

Packages

Contributors 5

Languages

onlydustxyz/marketplace-indexer

Folders and files

Latest commit

History

Repository files navigation

marketplace-indexer

How does it work?

GithubApp events

Jobs

GitHub public events

User languages indexing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages