layout | title | permalink |
---|---|---|
page | Contributing | /contributing/ |
Contributions to the Bioregistry are welcomed and encouraged. Thank you for considering participating.
All contributors, maintainers, and participants of the Bioregistry project are expected to follow our Code of Conduct.
There are several ways to request a new prefix in the Bioregistry:
- Fill out the new prefix request form on our GitHub issue tracker with as much information about the resource as possible (e.g., name, homepage, format URL pattern, example identifier, pattern). Don't worry if you don't understand everything; our Review Team will guide you through the process.
- Add an entry yourself by editing the Bioregistry in GitHub through the web browser. As the Bioregistry grows, in-browser editing is becoming less practical, so this might not always be possible.
- Make a pull request directly to the upstream repository biopragmatics/bioregistry.
- Get in touch with us informally on Twitter @bioregistry
A prefix can be requested by anyone, even if it is for a resource they do not themselves maintain. A main goal of the Bioregistry is to be a detailed, descriptive resource, and expertise is welcome from anywhere. Ideally, the requester should provide contact information for the person primarily responsible for the resource or include them in the discussion on GitHub directly. In many cases, it's much easier for that person to provide certain metadata that's required to go with a given prefix.
- New prefixes are allowed to contain letters [a-z], numbers [0-9], and a single dot `.` if a subspace is requested. More discussion on subspacing policy can be found in biopragmatics#133 and biopragmatics#65.
- New prefixes must start with a letter.
- New prefixes must be at least two characters. Ideally, prefixes should be three or more characters for legibility.
- Subspaces must start with a letter.
- Subspaces must be at least two characters. Ideally, subspaces should be three or more characters for legibility.
- New prefixes must be lowercase. However, lexical variants can be stored as synonyms for reference (e.g., FBbt).
- New prefixes must validate against the following regular expression:
^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+)?$
- New prefixes must pass all metadata checks, which are canonically defined by the quality assurance workflow.
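The lexical rules in this list can be checked mechanically. The following is a minimal sketch, not the project's own validation code: the function name `is_valid_new_prefix` is hypothetical, and the pattern assumes the subspace part is optional, which is how the rules above read. It does not cover the metadata checks performed by the quality assurance workflow.

```python
import re

# Assumption: the subspace group is optional, mirroring the rules listed above
# (at most one dot; each part starts with a letter and has two or more characters).
PREFIX_PATTERN = re.compile(r"^[a-z][a-z0-9]+(\.[a-z][a-z0-9]+)?$")


def is_valid_new_prefix(prefix: str) -> bool:
    """Return True if the prefix satisfies the lexical rules for new prefixes."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


# Candidates drawn from the examples in this document, plus a few bad ones
for candidate in ["chebi", "hgnc.gene", "FBbt", "ec-code", "a", "mesh.2012"]:
    print(candidate, is_valid_new_prefix(candidate))
```

Only `chebi` and `hgnc.gene` pass here: `FBbt` is not lowercase, `ec-code` contains a hyphen, `a` is too short, and the subspace in `mesh.2012` does not start with a letter.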
Unfortunately, these requirements cannot be applied retroactively and cannot be trivially applied to automatically imported prefixes. In some cases, historical prefixes can be modified to follow these requirements. For example, Identifiers.org's `ec-code` was renamed to `eccode` while maintaining `ec-code` as a synonym.
Original discussion about minimum prefix requirements can be found at biopragmatics#158.
- Do not include titles (e.g., Dr.) in contact information for the requester, reviewer, or contact for a resource.
- Prefixes should be chosen in a way that minimizes confusion. For example, a prefix should correspond to the name of the resource it is minted for; most commonly, people use acronyms.
- Multiple prefixes will not be issued for multiple versions of a resource (e.g., the fact that there are mesh.2012 and mesh.2013 prefixes registered in Identifiers.org was a huge mistake and causes massive confusion).
- Prefixes must not be too generic or common entity types like gene or chemical. Reviewers will use their best judgment since it's hard to list all possible generic entity types. For example, gene would be bad while hgnc.gene would be better.
- Subspacing should not be used unnecessarily, i.e., when a nomenclature only has one entity type. For example, chebi.chemical would be bad while chebi would be better.
- Prefixes should not end in "O" for "Ontology", "T" for "Terminology", or any letters denoting related words about vocabularies.
- New prefixes should not end with "ID" as a way to signify that the prefix is used for identifiers, like in `doid` for the Disease Ontology or `caid` for ClinGen Canonical Allele identifier.
These policies were developed in parallel with the OBO Foundry policy on choosing a prefix (i.e., IDSPACE) at http://obofoundry.org/id-policy.html.
Collisions have proven to be rather infrequent between high-quality resources, but they do happen. The Bioregistry has the following policy for handling collisions:
- New prefixes must not collide with any canonical prefixes, preferred prefixes, synonyms, or normalized variants thereof. See biopragmatics#359 for an example of a prefix request that duplicated the synonyms of an existing prefix and how it was resolved.
- New prefixes should not collide with any prefixes in external registries, even if they are not explicitly imported in the Bioregistry. In these cases, a thoughtful discussion should take place explaining why the prefix is being reused (e.g., it has been parked by an inactive or low-quality resource in BioPortal).
- If a new contributor wants to register a prefix that is already present in the Bioregistry, then precedence will be given to the already existing prefix and the contributor will be asked to choose a different prefix.
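To illustrate what a collision with "normalized variants" might look like, here is a rough sketch. The `normalize` rule (lowercasing and dropping a few separator characters) and the toy registry are assumptions made for illustration; they are not the Bioregistry's actual normalization logic or data model.

```python
def normalize(prefix: str) -> str:
    """Hypothetical normalization: lowercase and drop common separators."""
    return "".join(ch for ch in prefix.lower() if ch not in ".-_/ ")


def find_collisions(candidate: str, registry: dict[str, list[str]]) -> list[str]:
    """Return existing prefixes whose name or synonyms collide with the candidate."""
    target = normalize(candidate)
    collisions = []
    for prefix, synonyms in registry.items():
        names = [prefix, *synonyms]
        if any(normalize(name) == target for name in names):
            collisions.append(prefix)
    return collisions


# Toy registry mapping each prefix to its synonyms; real records carry far more metadata
toy_registry = {
    "eccode": ["ec-code"],
    "fbbt": ["FBbt"],
    "geo": [],
}

print(find_collisions("EC_CODE", toy_registry))
print(find_collisions("newthing", toy_registry))
```

A request for `EC_CODE` would be flagged because it normalizes to the same string as the existing `eccode` record, while `newthing` collides with nothing in the toy registry.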
Prefixes have rarely collided in practice. In one example, two maintained resources, Gene Expression Omnibus and Geographical Entity Ontology, collided on using `geo` when Geographical Entity Ontology was added to the OBO Foundry. This was resolved in biopragmatics#67 after deciding to change the prefix used in Geographical Entity Ontology because the Gene Expression Omnibus was both much older and better known. This particular case motivated the OBO Foundry to update its ontology registration guidelines to require checking for conflicts with existing Bioregistry records in OBOFoundry/OBOFoundry.github.io#1519. Another example is the disease class annotation (legacy classification from the hard fork of the Disease Ontology that later became MONDO) and Dublin Core, where one is subjectively more important than the other.
If you would like to submit more than 5 prefixes at once, you can fill out the bulk prefix request template spreadsheet and submit it in an issue. The template contains several examples; please review them, then delete them before submission. Please number all of the rows in sequential order in the first column (`request_id`). The first columns of the template are all required, even if some of the examples don't have an entry there for historic reasons. All optional fields are marked as such.
Anyone is welcome to submit a bulk prefix request, but ideally submitters have substantial working knowledge of the Bioregistry and its requirements. Reviewing issues in a bulk request is much less ergonomic and more time-consuming than in individual prefix requests, which each get their own discussion thread, pull request, and CI/CD runs. Submitters of bulk prefix requests that contain many issues may be asked to re-submit as individual prefix requests.
A prefix and its corresponding semantic space are substantiated when it's provable that a semantic space exists by one or more of the following:
- There's a public place where you can get the entire list of terms. Ontologies (e.g., Gene Ontology) and databases (e.g., HGNC) usually make this pretty straightforward by offering download links for the ontology or full database.
- There's a working, public URI format string that lets you get HTML, JSON, RDF, or some other kind of information artifact for a given local unique identifier (LUI). For example, OMIM.PS doesn't have a way to get a full list of terms, but if you have a given local unique identifier, you can use it with the URI format string to retrieve some information about the entity corresponding to that LUI.
A potential, less strict third criterion for substantiation could be when references to entities in a semantic space (i.e., in the form of CURIEs) can be found in public resources or ontologies that are external to the resource in which the prefix/semantic space is defined. This is more common for historical prefixes (e.g., OpenCyc references appear quite frequently, but this resource was taken down more than a decade ago) and is less applicable to new prefix requests. Therefore, this relaxed criterion will not be considered sufficient for substantiation.
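The second criterion above relies on combining a URI format string with a local unique identifier. The sketch below shows one way that substitution could work. It assumes the format string marks the identifier's position with a `$1` placeholder; the `expand_uri` helper and the `example.org` URL are hypothetical. Checking that the resulting URI actually resolves would need a network request, which is left out here.

```python
def expand_uri(uri_format: str, local_unique_identifier: str) -> str:
    """Substitute a local unique identifier into a URI format string."""
    placeholder = "$1"  # assumed placeholder convention for this sketch
    if placeholder not in uri_format:
        raise ValueError(f"URI format string lacks the {placeholder} placeholder")
    return uri_format.replace(placeholder, local_unique_identifier)


# Hypothetical format string, not a real resource
example_format = "https://example.org/entity/$1"
print(expand_uri(example_format, "0000001"))
```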
We define prefix parking as a special case of a prefix request in which the corresponding resource/semantic space for the prefix does not yet exist or is currently under development (and by definition, is not yet substantiated). The Bioregistry does not explicitly discourage prefix parking, but new prefix requests qualifying as prefix parking require additional guidelines, partially motivated by the difficulty of the discussion on biopragmatics#359.
- While it's not typically under the purview of the Bioregistry Review Team to judge the utility of a prefix nor comment on its corresponding design decisions (e.g., choice of local unique identifier scheme, regular expression pattern, URI format string), submitters seeking to park a prefix must both actively and publicly seek out and seriously consider suggestions and advice from the Bioregistry Review Team with regard to these matters (e.g., in the issue corresponding to a new prefix request). Submissions unable/unwilling to follow these guidelines may be dismissed and asked to re-submit after their prefix has been substantiated.
- Submissions to park a prefix must include a primary contact person for the resource that is available for public discussion on GitHub. Even though this is likely the same as the submitter, it is important that this person can be contacted. If they are unresponsive within two weeks of contact regarding the parked prefix, then the parked prefix is subject to removal.
- Parked prefixes that are not substantiated within three months are subject to removal. In the case that someone else wants to use that prefix, the fact that the parked prefix has not been substantiated will, by definition, result in no impact or confusion that would normally result from the removal of a prefix. It is the responsibility of the submitter/primary contact person for the prefix to inform the Bioregistry Review Team of the updates and/or to submit the updates to their prefix record themselves that demonstrate it has been substantiated.
- It's not the job of the Bioregistry to support parking prefixes for semantic spaces that will not be public or that won't be used in other public resources - these can be rejected without further discussion.
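The two time limits in these guidelines can be expressed as a small date check. This is only an illustrative sketch: the function name and arguments are hypothetical, "three months" is approximated as 90 days, and actual removal decisions rest with the Bioregistry Review Team rather than with any script.

```python
from datetime import date, timedelta
from typing import Optional

CONTACT_RESPONSE_WINDOW = timedelta(weeks=2)
# Assumption: "three months" is approximated as 90 days for this sketch
SUBSTANTIATION_WINDOW = timedelta(days=90)


def is_subject_to_removal(
    parked_on: date,
    today: date,
    substantiated: bool,
    unanswered_contact_on: Optional[date] = None,
) -> bool:
    """Return True if a parked prefix has exceeded either time limit."""
    if substantiated:
        return False  # a substantiated prefix is no longer parked
    if today - parked_on > SUBSTANTIATION_WINDOW:
        return True
    if unanswered_contact_on is not None:
        return today - unanswered_contact_on > CONTACT_RESPONSE_WINDOW
    return False


print(is_subject_to_removal(date(2024, 1, 1), date(2024, 5, 1), substantiated=False))
```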
Original discussion about prefix parking can be found at biopragmatics#365.
Review of new prefix requests is handled by the Bioregistry Review Team, whose membership and conduct are described in the Bioregistry's Project Governance.
There are several ways to update a prefix's record in the Bioregistry:
- Fill out one of the issue templates on our GitHub issue tracker with the requested information. If no template exists for your update, feel free to fill out a blank issue. We will help make the update and attribute you properly.
- Edit an entry yourself by editing the Bioregistry in GitHub through the web browser. As the Bioregistry grows, in-browser editing is becoming less practical, so this might not always be possible.
- Make a pull request directly to the upstream repository biopragmatics/bioregistry.
- Get in touch with us informally on Twitter @bioregistry
A prefix's record can be edited by anyone, even if it is for a resource they do not themselves maintain. A main goal of the Bioregistry is to be a detailed, descriptive resource - expertise is welcome from anywhere. In many cases, editing an existing prefix's record is useful to override incorrect information from integrated repositories, such as Identifiers.org.
Typically, prefixes should not be changed since the Bioregistry acts as an archive of all usages, even deprecated ones. As an alternative, someone wishing to change a prefix can do the following:
- Mark the old prefix as deprecated
- Include in the `comment` field information about when and why the prefix was deprecated
- Add a `has_canonical` relationship from the old prefix's record to the new prefix
As with all edits, these are subject to review by the Bioregistry Review Team. Extra care should be given with this kind of edit.
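As a rough picture of what a record might look like after the deprecation steps above, consider the sketch below. The prefixes `oldexample` and `newexample` are made up, and the `deprecated` key is an assumed field name; only `comment` and `has_canonical` are named in this guide, so consult existing records for the exact layout.

```python
import json

# Hypothetical record for a prefix being replaced; the layout is an assumption
old_record = {
    "name": "Example Resource",
    "deprecated": True,  # assumed field name for marking deprecation
    "comment": "Deprecated in 2023 because the resource adopted the prefix newexample.",
    "has_canonical": "newexample",  # points from the old record to the new prefix
}

print(json.dumps({"oldexample": old_record}, indent=2))
```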
Review of edits to existing records is handled by the Bioregistry Review Team, whose membership and conduct are described in the Bioregistry's Project Governance.
Typically, prefixes should not be removed from the Bioregistry, even if they correspond to subsumed, abandoned, or dead resources, because it is also a historical archive and reference for anyone who might run into legacy prefixes in legacy resources.
Review of removals of existing records is handled by the Bioregistry Review Team, whose membership and conduct are described in the Bioregistry's Project Governance.
New registries can be added by anyone, similarly to prefixes, but considerably more curation is required.
See the source `metaregistry.json` file for inspiration. Entries in this file should follow the schema defined by the `Registry` pydantic model class. See also the corresponding entry in the Bioregistry's JSON schema.
While not strictly required, it's also useful for each registry to add a corresponding getter script and aligner class in `bioregistry.external` and `bioregistry.align`, respectively. See examples there, or get in touch on the issue tracker for help.
This project uses the GitHub Flow model for code contributions. Follow these steps:
1. Create a fork of the upstream repository at biopragmatics/bioregistry on your GitHub account (or in one of your organizations)
2. Clone your fork with `git clone https://github.com/<your namespace here>/bioregistry.git`
3. Make and commit changes to your fork with `git commit`
4. Push changes to your fork with `git push`
5. Repeat steps 3 and 4 as needed
6. Submit a pull request back to the upstream repository
The Bioregistry uses squash merges to group all related commits in a given pull request into a single commit upon acceptance and merge into the main branch. This has several benefits:
- Keeps the commit history on the main branch focused on high-level narrative
- Enables people to make lots of small commits without worrying about muddying up the commit history
- Commits correspond 1-to-1 with pull requests
This project encourages the use of optional static typing. It uses `mypy` as a type checker and `sphinx_autodoc_typehints` to automatically generate documentation based on type hints. You can check if your code passes `mypy` with `tox -e mypy`.
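As a small illustration of the kind of annotations a type checker can verify, here is a hypothetical helper (not part of the Bioregistry code base) with a fully typed signature:

```python
from typing import Optional


def get_subspace(prefix: str) -> Optional[str]:
    """Return the part of a prefix after the dot, or None if there is no subspace."""
    _, dot, subspace = prefix.partition(".")
    return subspace if dot else None


print(get_subspace("hgnc.gene"))
print(get_subspace("chebi"))
```

The `Optional[str]` return type makes the "no subspace" case explicit, so a type checker can flag callers that forget to handle `None`.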
This project uses `black` to automatically enforce a consistent code style. You can apply `black` and other pre-configured linters with `tox -e lint`.
This project uses `flake8` and several plugins for additional checks of documentation style, security issues, good variable nomenclature, and more (see `tox.ini` for a list of flake8 plugins). You can check if your code passes `flake8` with `tox -e flake8`.
Each of these checks is run on each commit using GitHub Actions as a continuous integration service. Passing all of them is required for accepting a contribution. If you're unsure how to address the feedback from one of these tools, please say so either in the description of your pull request or in a comment, and we will help you.
Python's builtin `print()` should not be used (except when writing to files); this is checked by the `flake8-print` plugin to `flake8`. If you're in a command line setting or `main()` function for a module, you can use `click.echo()`. Otherwise, you can use the builtin `logging` module by adding `logger = logging.getLogger(__name__)` below the imports at the top of your file.
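A minimal sketch of the logging pattern described above follows. The `count_lowercase` function is a hypothetical example, and `click.echo()` is left out so that the snippet has no third-party dependencies.

```python
import logging

# Module-level logger, placed below the imports as recommended above
logger = logging.getLogger(__name__)


def count_lowercase(prefixes: list[str]) -> int:
    """Count the prefixes that are already lowercase."""
    count = sum(1 for prefix in prefixes if prefix == prefix.lower())
    # Report through the logger instead of calling print()
    logger.info("%d of %d prefixes are lowercase", count, len(prefixes))
    return count


if __name__ == "__main__":
    # Configure logging once at the entry point so the message is shown
    logging.basicConfig(level=logging.INFO)
    count_lowercase(["chebi", "FBbt", "hgnc.gene"])
```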
All public functions (i.e., not starting with an underscore `_`) must be documented using the sphinx documentation format. The `darglint` plugin to `flake8` reports on functions that are not fully documented.
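For reference, a docstring in the sphinx format might look like the following sketch. The `split_curie` function is a hypothetical example rather than part of the Bioregistry API; the point is the `:param:`, `:returns:`, and `:raises:` fields, which together document the full signature.

```python
def split_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE into its prefix and local unique identifier.

    :param curie: A compact URI such as ``"chebi:1234"``
    :returns: A pair of the prefix and the local unique identifier
    :raises ValueError: If the string does not contain a colon
    """
    prefix, colon, identifier = curie.partition(":")
    if not colon:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, identifier


print(split_curie("chebi:1234"))
```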
This project uses `sphinx` to automatically build documentation into a narrative structure. You can check that the documentation properly builds with `tox -e docs-test`.
Functions in this repository should be unit tested. These can either be written using the `unittest` framework in the `tests/` directory or as embedded doctests. You can check that the unit tests pass with `tox -e py` and that the doctests pass with `tox -e doctests`. These tests are required to pass for accepting a contribution.
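The sketch below shows both testing styles side by side on a hypothetical helper, `normalize_prefix`, which is not part of the Bioregistry code base. The doctest lives in the docstring, while the `unittest` test case would normally sit in its own module under `tests/`.

```python
import unittest


def normalize_prefix(prefix: str) -> str:
    """Lowercase a prefix and strip surrounding whitespace.

    >>> normalize_prefix(" FBbt ")
    'fbbt'
    """
    return prefix.strip().lower()


class TestNormalizePrefix(unittest.TestCase):
    """Tests for the hypothetical normalize_prefix() helper."""

    def test_lowercases(self) -> None:
        """Check that uppercase letters are lowered."""
        self.assertEqual("fbbt", normalize_prefix("FBbt"))

    def test_strips_whitespace(self) -> None:
        """Check that surrounding whitespace is removed."""
        self.assertEqual("chebi", normalize_prefix("  chebi\n"))
```

Outside of tox, a test case like this can be run with `python -m unittest`, and the embedded example with `python -m doctest` on the file that contains it.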
If other code is updated before your contribution gets merged, you might need to resolve conflicts against the main branch. After cloning, you should add the upstream repository with
$ git remote add biopragmatics https://github.com/biopragmatics/bioregistry.git
Then, you can merge upstream code into your branch. You can also use the GitHub UI to do this by following this tutorial.
This project aims to support all versions of Python that have not passed their end-of-life dates. After end-of-life, the version will be removed from the Trove classifiers in the `setup.cfg` and from the GitHub Actions testing configuration.
See https://endoflife.date/python for a timeline of Python release and end-of-life dates.
Review of code contributions is handled by the Bioregistry Core Development Team, whose membership and conduct are described in the Bioregistry's Project Governance.
As the Bioregistry matures, new fields may be added and more strict curation guidelines may be imposed (both from a philosophical and technical perspective). When imposing new rules, reasonable efforts should be made to backfill existing records. Alternatively, existing prefixes can be "grandfathered" in to less strict requirements.