Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include the generated files into Python library #384

Merged
merged 30 commits into from
Jul 4, 2024

Conversation

ielis
Copy link
Contributor

@ielis ielis commented Apr 4, 2024

The PR improves the code of phenopackets - the library for working with phenopackets in Python.

Current state of things

phenopackets consists of the Python bindings for Phenopacket Schema building blocks. The library allows creating phenopackets programatically, as well as JSON and protobuf I/O.

All building blocks are exported from the top level package:

from phenopackets import OntologyClass, Individual, Phenopacket, Gene

or

import phenopackets as pp

oc = pp.OntologyClass()
i = pp.Individual()
p = pp.Phenopacket()

The issue

We do not "namespace" the building blocks. For instance, to use VRS Gene, we do:

from phenopackets import Gene

This is not good, because Gene is here to stay and we cannot add another Gene to the Schema without breaking changes. On top of this, we cannot work with >1 Schema versions (unlike Java).

The proposal

The PR updates the deployment script to maintain the hierarchy set by the proto files. For the time being (open to discussion), we keep the v2.0.2 elements at the top-level, in order to not break existing code. So, the code below will still work:

from phenopackets import OntologyClass, Individual, Phenopacket, Gene

However, we need to discourage the users from using this from now on and this is the new way to import the building blocks:

from phenopackets.schema.v2.base_pb2 import OntologyClass
from phenopackets.schema.v2.individual_pb2 import Individual
from phenopackets.schema.v2.phenopackets_pb2 import Phenopacket
from ga4gh.vrs.v1.vrs_pb2 import Gene

These imports are longer, but they provide several benefits

  • the package structure follows the Phenopacket Schema proto files more closely. I think this is good, since the proto files are the source of truth.
  • the package structure now allows working with v1.0.0 elements, or the future iterations. The following should work after the PR is merged:
    from phenopackets.schema.v1.phenopackets_pb2 import Phenopacket
  • the deployment is simpler. The library follows the proto files with close to none human intervention required.

@julesjacobsen @cmungall @pnrobinson please have a look and let me know your thoughs. I hope the PR is not too big, we tried to document well. I am not asking @iimpulse for review since we worked on this together.

@ielis ielis marked this pull request as ready for review April 10, 2024 02:00
@cmungall
Copy link
Member

This is pretty non-idiomatic python

  • in general you don't distribute top level namespaces alongside your own. Occasionally when vendoring is really necessary it's made a sub-namespace
  • versioning isn't typically done at the module level

I would say if we are truly reusing a separate package's concept of gene, then let's coordinate with them, have them release a vrs package to pypi, and just reuse that. If we truly need to vary this then use standard vendoring patterns.

I don't understand the need for versioning at the module/sub-package level. Why not just use standard python versioning? Poetry has really good support for this, we use this elsewhere in Monarch. Just have releases 1.x.x. and 2.x.x on PyPI, and at some point 3.0.0rc1 and then 3.0.0, etc. Follow standard semantic versioning.

One exception to this is when breaking changes are introduced, e.g. with pydantic v2 there is a compatibility layer pydantic.v1 which allows for more graceful transition.

A more radical approach would be to have ga4gh at the top level, have subpackages for ga4gh.phenopackets,vrs,... I don't think ga4gh is coherent enough yet though, so I think your better having phenopackets at the top level.

I can see the point about aligning with the protobuf, but is the protobuf ideally organized? Perhaps this could be reorganized after v2? (and perhaps use something less restrictive than protobuf...)

@ielis
Copy link
Contributor Author

ielis commented Apr 11, 2024

Hi @cmungall

yes, these are good points.

in general you don't distribute top level namespaces alongside your own. Occasionally when vendoring is really necessary it's made a sub-namespace

I assume you are referring to the ga4gh.vrs.v1.vrs_pb2 part. Unfortunately, I am not sure we can get around this without breaking changes due to limitations of the code generated by the protobuf compiler. The code generated for phenopacket blocks only works if the VRS parts are importable from ga4gh.vrs.v1.vrs_pb2. This is due to project setup:

image

So yes, we would need VRS people to release Java and Python artifacts. I am not sure, however, I would be heard there. Perhaps @cmungall, @julesjacobsen or @pnrobinson can convince them that they must deploy versions to PyPi if they want their code to be used?
As of now, the generated code does not clash with the existing VRS packages on PyPi, we tested that with @iimpulse.

I don't understand the need for versioning at the module/sub-package level. Why not just use standard python versioning? Poetry has really good support for this, we use this elsewhere in Monarch. Just have releases 1.x.x. and 2.x.x on PyPI, and at some point 3.0.0rc1 and then 3.0.0, etc. Follow standard semantic versioning.

Two things. First, we already have this structure in the repo from some reason (see the screenshot above for v1 and v2), so now the Python code would follow the protobuf files more closely. Second, at some point, we will want to remap version x to version x+1 and having more than one version of the data model can simplify the mapping code, e.g. as we do in phenopacket-tools.
I think it may be good to e.g. keep the code for the latest major version and the previous one, e.g. v2 and v1. An alternative would be to dump v1 to e.g JSON, and write parser/converter that maps JSON to v2.
The aim here is to have strategy for schema evolution, and it is open for discussion.

A more radical approach would be to have ga4gh at the top level, have subpackages for ga4gh.phenopackets,vrs,... I don't think ga4gh is coherent enough yet though, so I think your better having phenopackets at the top level.

I can see the point about aligning with the protobuf, but is the protobuf ideally organized? Perhaps this could be reorganized after v2? (and perhaps use something less restrictive than protobuf...)

Protobuf indeed has some quirks and I would be happy to participate in discussions regarding what is the best for v3. However, until we get there, I think there are ways how to add the enhancements we need without breaking changes.


In overall, I think your points regarding the namespaces, versions, and protobuf are good. Do you think they should be addressed within this PR or addressing them later would be sufficient?

@pnrobinson
Copy link
Collaborator

To add to the above, I would support exploring a linkML approach to a version 2.1 or 3 of the schema. There a many moving parts that constrain the current version to have some of the oddities it has above, but these are implementation details that extremely few users will care about, and software should hide them for everybody but developers.
It would be great @cmungall to find some time to discuss if this is a possibility.

@monicacecilia
Copy link
Member

@ahwagner 👀 👆

@monicacecilia
Copy link
Member

I'd like to add that all planning and provisions for a potential v3 of Phenopackets must originate, assess requirements, be discussed, and receive approval in the context of GA4GH ClinPheno. @ielis, if you aren't already, let's ensure you onboard to GA4GH and join the ClinPheno WS.

@pnrobinson @julesjacobsen @cmungall @ahwagner @mcourtot -- 👀 👆

@ielis
Copy link
Contributor Author

ielis commented Apr 22, 2024

Hi @monicacecilia it is understood that the updates must be discussed in the manner you outlined. This PR makes no backward incompatible changes - everything that worked in v2.0.2 or v2.0.0 will continue to work, in line with semantic versioning guidelines.

@cmungall
Copy link
Member

I don't have any strong feelings on what should be done in the short term. It seems we have to do some unusual things to make up for limitations of protobuf. I can be of more use when we start planning on v3 and do things in a more conventional way, without protobuf constraints.

@ahwagner
Copy link
Contributor

Thanks for the tag @monicacecilia.

@ielis, FWIW, the VRS implementation in Phenopackets carries a lot of unusual artifacts by virtue of being "round-tripped" from JSON Schema (VRS native) -> protobuf (Phenopackets native) -> JSON Schema.

The VRS team does maintain both JSON Schema and Python implementations for VRS, I've linked the VRS v1.2 specs that were used with Phenopackets v2.

@cmungall
Copy link
Member

Thanks @ahwagner!

Maybe this is known already but it looks like there are significant deviations

E.g. if we look at the SequenceInterval in this phenopackets example:

https://github.com/phenopackets/phenopacket-tools/blob/fe5346ad17df12fbd8c0f5e7d72646f455e286a8/phenopacket-tools-cli/src/examples/phenopackets/retinoblastoma.json#L216-L235

And compare with VRS SequenceInterval:

https://github.com/ga4gh/vrs/blob/76542a903b913110e67811885a8958625bbc3aae/schema/vrs.yaml#L301-L326

@ielis
Copy link
Contributor Author

ielis commented May 29, 2024

Hello @ahwagner

thank you very much for your comment and for pointing out the Python implementation.

Unfortunately, the data classes of the VRS Python implementation most likely cannot be imported/reused in Phenopacket Schema, and we must refer to their protobuf definitions vrsatile.proto and vrs.proto. The protobuf definitions are namespaced at org.ga4gh.vrs.v1 (see here), but in case of Python, the package directive is ignored and protobuf compiler generates the Python bindings based on the folder structure. In case of Phenopacket Schema, this will lead to imports such as:

# Import a Phenopacket Schema element
from phenopackets.schema.v2.phenopackets_pb2 import Phenopacket

# Import a VRS element
from ga4gh.vrs.v1.vrs_pb2 import Gene

The issue with above is that the bindings are generated into two top-level packages: phenopackets and ga4gh. Deploying phenopackets package to PyPi looks bad since it includes another namespace ga4gh, which is not "owned" by Phenopacket Schema.
This could be fixed by deploying the vrsatile.proto and vrs.proto to PyPi separately, and importing in PS. The absence of such artifact is the source of my frustration. However, I understand that VRS is not bound to Protobuf, so it is not a real priority.

This PR proposes to move the VRS stuff within PS, to result in imports like this (Python):

from phenopackets.schema.v2.phenopackets_pb2 import Phenopacket
from phenopackets.vrs.v1.vrs_pb2 import Gene

and all the other languages will stay the same.

@ahwagner
Copy link
Contributor

ahwagner commented May 30, 2024

However, I understand that VRS is not bound to Protobuf, so it is not a real priority.

Yes–the protobuf implementation was there to help the Phenopackets team implement VRS under Protobuf, but we don't support protobuf as part of the GA4GH Standard or reference implementation precisely to avoid the deviations that we have observed in the Phenopackets project, caused by translating the Protobuf'd VRS schema back to JSON Schema.

My recommendation is as it was at the time this decision was made by the Phenopackets team: Phenopackets in JSON Schema should use the GA4GH standard representation of VRS, not an alternate schema derived from VRS-protobuf and sharing the VRS name. Relying on the standard representation would also enable reuse of Python libraries for VRS.

Assuming that using the standard VRS JSON Schema is off the table, then I think you are unfortunately left with publishing the phenopackets-flavored VRS as part of the phenopackets package, as you suggest. Ideally this could be distinguished from the VRS standard in the documentation in some way.

@pnrobinson
Copy link
Collaborator

It looks like there is no remaining controversy, I will merge this so we can move forward unless somebody protests in the next week. There are various issues with advantages and disadvantages, but I think we will leave major changes for a future version 3.0.

Copy link
Contributor

@ahwagner ahwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only request is that the documentation reflects this is not the standard JSON representation of VRS, otherwise I have no concerns. Would we be able to add that to the scope of this PR?

@pnrobinson
Copy link
Collaborator

@ahwagner
Copy link
Contributor

I think it is as simple as adding a warning admonition that uses text derived from the phenopacket-tools paper, where this discrepancy was introduced:

When introduced in Phenopackets v2, a protobuf version of VRS (github.com/ga4gh/vrs-protobuf) was derived from the source VRS representation in JSON schema and used for Phenopackets. In Phenopackets-tools, we translate this derived representation back to JSON schema, resulting in a message structure that is losslessly transformable but syntactically distinct from the native VRS JSON schema. For users that wish to express Phenopackets with a native VRS representation, we have implemented a function in phenopacket-tools that converts VRS objects in Phenopackets into JSON objects that correspond to the native VRS schema. If users desire to interact with databases or software that use the VRS schema, this function can be used to generate conformant JSON code.

@pnrobinson
Copy link
Collaborator

Thanks, everybody, I think we are all set!

@pnrobinson pnrobinson merged commit 804a217 into phenopackets:master Jul 4, 2024
6 of 7 checks passed
@ielis ielis deleted the python/include-generated-files branch July 4, 2024 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants