Health Sector Git Repo Topic Ontology #35

Open
SamHollings opened this issue Sep 7, 2023 · 8 comments

SamHollings commented Sep 7, 2023

Health Sector Git Repo Topic Ontology

!!! tip "TLDR"
- Apply topics to each of your published repos following the ontology described below
- Focus initially on topics related to technique and domain - these are what people are usually most interested in
- Then you can add even more value by adding other topics.
- There is a website which scans github for NHS github repositories and displays them by topic - making it easier to find useful code

??? question "Why should we care?"
- Applying topics to your repos will make it much easier for you and others to find and reuse useful bits of code
- Using a common ontology will make the topics more useful - we will all be speaking the same language

??? success "Pre-requisites"
* Some information on what someone might need to be familiar with before they can use this page

|Pre-requisite | Importance | Note |
|--------------|------------|------|
|**None!**||Anyone can do this, though you need to [have published some code on github already](https://nhsdigital.github.io/rap-community-of-practice/implementing_RAP/how-to-publish-your-code-in-the-open/)|

A key aim of RAP is not only to automate our pipelines but also to re-use useful code in other work. This relies on us publishing the code as publicly as possible, and then making it easy to find these useful bits of code. Topics in GitHub can help with this, but we will get the most benefit from them by using a common topic vocabulary to describe our GitHub code repos.

The topic ontology described in this guide will ensure our code can be searched by:

  • language and tech used
  • what methods were used
  • whether the code is recent or old (and if it is still updated)
  • what kinds of data the code was used with and where it came from

!!! warning
## The Differences between "topics" and "tags"
In GitHub, tags and topics are different:
- Topics are labels applied to whole repos which describe them, like keywords. Each repo can have up to twenty, and GitHub is good at searching and sorting results by topic.
- Tags are labels applied to specific commits within a git repo, and it's how releases are made, e.g. v0.1.0 might be a tag applied to a specific commit locking in that this commit is Version 0.1.0.
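
Because topics are repository metadata rather than part of the git history, they can be changed without making a commit. As a minimal sketch, the snippet below builds (but does not send) a request to GitHub's REST endpoint that replaces all topics on a repo, `PUT /repos/{owner}/{repo}/topics`. The owner, repo name and token are placeholders, and the helper name `build_topics_request` is hypothetical, used only for illustration.

```python
import json
import urllib.request

MAX_TOPICS = 20  # per-repo limit mentioned above


def build_topics_request(owner, repo, topics, token):
    """Build a PUT request that would replace all topics on a repo."""
    unique_topics = sorted(set(topics))
    if len(unique_topics) > MAX_TOPICS:
        raise ValueError(f"a repo can have at most {MAX_TOPICS} topics")
    body = json.dumps({"names": unique_topics}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/topics",
        data=body,
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values, for illustration only
request = build_topics_request(
    "example-org", "example-repo", ["python", "pandas", "record-linkage"], "<token>"
)
# Sending it would be: urllib.request.urlopen(request)
```

Note that this endpoint replaces the full topic list, so any existing topics you want to keep need to be included in the request.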

Topics

Our aim with topics is to allow people to find code which might be useful to them, so they can reuse it. With this in mind, they usually want to know what kind of data the code was used on, in which language, whether it used compatible data structures (e.g. pandas or pyspark) and how recently it was made or updated (people are less trusting of ancient, dead code).

When applying topics to your code:

  • we suggest starting with the priority 1 categories below, i.e. Domain Area and Technique, as these are what people tend to be most concerned with.
  • stick to the topics suggested below - this will ensure we get the most benefit out of them. If there are too many different topics, they become meaningless. If there are important ones missing, raise an issue against this GitHub repo with your suggestion for new topics.

| Priority | Category | Description | Example topics |
|----------|----------|-------------|----------------|
| 1 | Domain Area / Datasets / Data source | People will want to know what data these techniques have been applied to, if any. This might inspire them to do something similar, or highlight areas for collaboration. | `secondary-care`, `primary-care`, `hospital-episode-statistics`, `gpdpr`, `civil-registration-of-deaths`, `gdppr`, `artificial` (perhaps if it was using artificial data) |
| 1 | Technique | People will want to know what kinds of data processing, analyses, etc. were done. This might be quite broad, as it should cover the sorts of reusable code chunks people might want to look at. | `clustering`, `forecasting`, `classification`, `regression`, `statistical-disclosure-control`, `deduplication`, `entity-resolution`, `record-linkage`, `summarisation`, `data-cleansing`, `data-validation`, `hyperparameter-tuning`, `artificial-data-generation`, etc. |
| 2 | Technology | If I want to re-use someone's Python or R code, and they made it using a different data structure to me, that might cause problems, hence it's important to describe them. | `dplyr`, `sparklyr`, `pandas`, `pyspark`, `polars`, `sqlalchemy`, `sqlalchemy-orm`, `numpy`, `sklearn`, `tensorflow`, `pytorch`, `scipy`, etc. |
| 2 | Language | People often want to know if the code is using a language they know/use. GitHub can sometimes correctly identify the language used in the repo, but it can struggle if you have a lot of documentation or use certain languages (such as SQL). | `python`, `r`, `sql` |
| 2 | Maturity | People might want to know if a codebase is made to a high standard, or by people who are just starting out. | `baseline-rap`, `silver-rap`, `gold-rap` |
| 2 | Opt-out of re-use | A topic for those people who want to publish their code, but make it clear that it is not optimised for re-use. | `not-optimised-for-reuse` |
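
As a rough illustration of how the table above could be applied, the sketch below checks a repo's topics against the ontology: it flags topics outside the suggested vocabulary and reports a missing priority 1 category. The `ONTOLOGY` dictionary, its category keys and the `check_topics` helper are hypothetical names for this example, and the sets only contain the topics listed above (the "etc." entries mean the real lists would be longer).

```python
# Example topics copied from the table above; not an exhaustive vocabulary
ONTOLOGY = {
    "domain": {
        "secondary-care", "primary-care", "hospital-episode-statistics",
        "gpdpr", "civil-registration-of-deaths", "gdppr", "artificial",
    },
    "technique": {
        "clustering", "forecasting", "classification", "regression",
        "statistical-disclosure-control", "deduplication",
        "entity-resolution", "record-linkage", "summarisation",
        "data-cleansing", "data-validation", "hyperparameter-tuning",
        "artificial-data-generation",
    },
    "technology": {
        "dplyr", "sparklyr", "pandas", "pyspark", "polars", "sqlalchemy",
        "sqlalchemy-orm", "numpy", "sklearn", "tensorflow", "pytorch", "scipy",
    },
    "language": {"python", "r", "sql"},
    "maturity": {"baseline-rap", "silver-rap", "gold-rap"},
    "opt-out": {"not-optimised-for-reuse"},
}
PRIORITY_ONE = ("domain", "technique")
MAX_TOPICS = 20  # GitHub's per-repo limit


def check_topics(topics):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if len(topics) > MAX_TOPICS:
        problems.append(f"more than {MAX_TOPICS} topics")
    known = set().union(*ONTOLOGY.values())
    for topic in topics:
        if topic not in known:
            problems.append(f"'{topic}' is not in the ontology")
    for category in PRIORITY_ONE:
        # Each priority 1 category should be represented by at least one topic
        if not ONTOLOGY[category] & set(topics):
            problems.append(f"no priority 1 '{category}' topic")
    return problems


good = check_topics(
    ["python", "pandas", "record-linkage", "hospital-episode-statistics"]
)
bad = check_topics(["python", "my-cool-topic"])
```

Here `good` comes back empty, while `bad` reports the unknown topic and the two missing priority 1 categories.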

Using topics to find useful repos (and code)

You can search for repos by topic within github using the search bar (e.g., as seen here, with tips on github search syntax here) or you can use this helpful website which gathers the repos and topics from the various NHS organisations on GitHub.
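
The same kind of search can be scripted. As a minimal sketch, the function below builds a URL for GitHub's repository search endpoint using the `topic:` and `org:` search qualifiers. The organisation name is a placeholder and `build_topic_search_url` is a hypothetical helper for this example; the request itself is not sent.

```python
from urllib.parse import urlencode


def build_topic_search_url(topics, org=None):
    """Build a GitHub repository-search URL filtered by topic(s)."""
    qualifiers = [f"topic:{topic}" for topic in topics]
    if org:
        # Optionally restrict results to one organisation
        qualifiers.append(f"org:{org}")
    query = " ".join(qualifiers)
    return "https://api.github.com/search/repositories?" + urlencode({"q": query})


# Placeholder organisation, for illustration only
url = build_topic_search_url(["record-linkage", "pyspark"], org="example-org")
```

The same query string (`topic:record-linkage topic:pyspark org:example-org`) can also be typed directly into the GitHub search bar.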

@SamHollings
Author

feedback from Jonny: Consider removing the Meta-tag - we will already filter by organisation anyways, and the maturity tag basically fulfils the same role.

@JRPearson500
Collaborator

Suggestion to have an opt-out (black list) topic rather than a white-list meta tag

@JonathanHope42

Suggestion to have an opt-out (black list) topic rather than a white-list meta tag

maybe "not-optimised-for-reuse"

@JRPearson500
Collaborator

Technique and Domain suggested as the key areas that need topics mandating.

@lilianavalles

I'd find useful the release status, WIP, done and ready, active (continuously being improved), and inactive (WIP but with no plans to keep working on it).

@GiuliaMantovani1

Some of these might be an overkill, but just for consideration...
Specific types of algorithms? For example, in Data Linkage you could have FS but also other types of Bayesian algorithms.
Database used?
Development tools? In an ideal world this would not impact how the code is written, but sometimes it does... can probably be understood from level of RAP?

@SamHollings
Author

@lilianavalles

I'd find useful the release status, WIP, done and ready, active (continuously being improved), and inactive (WIP but with no plans to keep working on it).

Some interesting suggestions. A question I have... would they help you find useful code? For example, you can see if a repo is still "active" by whether it has been updated recently, so perhaps a "topic" for this doesn't add so much (though it would make it slightly faster to see). They would indicate the code was potentially higher quality... but I suppose so would the "gold, silver etc." RAP topics...

@GiuliaMantovani1

Some of these might be an overkill, but just for consideration... Specific types of algorithms? For example, in Data Linkage you could have FS but also other types of Bayesian algorithms. Database used? Development tools? In an ideal world this would not impact how the code is written, but sometimes it does... can probably be understood from level of RAP?

I think specific types of algorithms might be good, but I wonder if by making the topics too granular we reduce their usefulness? It's difficult to know where to set the threshold though - so if you think it would benefit people to add those in, we can try it. I think database used, e.g. databricks, chroma, postgres, SQLserver, probably is important, under technology, as it really will affect how the functions are written if they're useful to you. Development tools, do you mean jupyter, Vscode, etc.? I think if they are very... proprietary, such as Databricks, and ipynb notebooks more generally, might be good to add as a "notebook" topic. But Vscode / Pycharm, probably shouldn't change how the code is... I'd assume!

@SamHollings
Author

I'm going to move this to the RAP website and then we can continue to develop it there.
