Commit

Merge branch 'dev'
nsheff committed Aug 14, 2024
2 parents 897a4ac + 8b2c39a commit 4046257
Showing 14 changed files with 128 additions and 29 deletions.
8 changes: 4 additions & 4 deletions docs/citations.md
@@ -7,15 +7,15 @@ hide:

# How to cite PEPkit

Thanks for citing us! If you use the PEP tools, or concepts in your research, here are papers you can cite.
Thanks for citing us! If you use the PEP tools or concepts in your research, please cite us!

| If you use... | Please cite ... |
|---------------|-----------------|
| PEP specification | Sheffield et al. (2021) *GigaScience* |
| `eido` | Sheffield et al. (2021) *GigaScience* |
| `geofetch` | Sheffield et al. (2021) *GigaScience*; Khoroshevskyi et al. (2023) |
| `looper` | Sheffield et al. (2021) *GigaScience* |
| `PEPhub` | Sheffield et al. (2021) *GigaScience*; LeRoy et al. (2023) *bioRxiv* |
| `PEPhub` | Sheffield et al. (2021) *GigaScience*; LeRoy et al. (2024) *GigaScience* |
| `peppy` | Sheffield et al. (2021) *GigaScience* |
| `pypiper` | Sheffield et al. (2021) *GigaScience* |
| `pipestat` | Sheffield et al. (2021) *GigaScience* |
@@ -37,8 +37,8 @@ Thanks for citing us! If you use the PEP tools, or concepts in your research, he

<span class="authors">NJ LeRoy, O Khoroshevskyi, A O’Brien, R Stepień, A Arslan, NC Sheffield.</span><br/>
<span class="paper-title">PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata</span>
<br><i>bioRxiv</i> (2023)
<span class="doi">DOI: <a href="http://dx.doi.org/10.1101/2023.08.15.551388">10.1101/2023.08.15.551388</a></span>
<br><i>GigaScience</i> (2024)
<span class="doi">DOI: <a href="http://dx.doi.org/10.1093/gigascience/giae033">10.1093/gigascience/giae033</a></span>



13 changes: 7 additions & 6 deletions docs/pephub/README.md
@@ -7,7 +7,7 @@



PEPhub is an open-source database, web interface, and API for sharing, retrieving, and validating sample metadata. PEPhub consists of:
PEPhub is an open-source database, web interface, and API for sharing, retrieving, and validating metadata. PEPhub consists of:

- **Public user interface**: <a href="https://pephub.databio.org/" target="_blank">https://pephub.databio.org/</a>
- **API**: <a href="https://pephub-api.databio.org/api/v1/docs" target="_blank">https://pephub-api.databio.org/api/v1/docs</a>
@@ -16,17 +16,18 @@ PEPhub is an open-source database, web interface, and API for sharing, retrievin
## Features at-a-glance


- **Validation**. PEPhub validates sample metadata with [eido](../eido/README.md). Users can specify a schema to which the PEP should adhere. All schemas are available on the official website: [https://schema.databio.org/](https://schema.databio.org/). Schemas are particularly useful before running pipelines, as validation provides essential information about PEP compatibility with specific pipelines and highlights any errors in the PEP structure.
- **Validation**. Users specify a schema for a project (using an extended [JSON Schema](https://json-schema.org/)), and use it to validate sample metadata.

- **Semantic search**. PEPhub has semantic search functionality based on cutting-edge semantic machine learning. Information from each PEP is encoded using a sentence transformer and stored in a fast vector database. The PEPhub search interface then provides an extremely fast and powerful semantic search of sample metadata.
- **Semantic search**. The PEPhub search interface provides an extremely fast and powerful semantic search of sample metadata. It is built using machine learning (sentence transformers), with the resulting embeddings stored in a fast vector database.

- **Authorization**. PEPhub has a robust user authorization system to allow users to submit and edit their own PEPs. Users authenticate via GitHub, and then may upload, modify, and delete PEPs, and star projects. You can also set projects as private to restrict access. PEPhub also provides group-level permissions using GitHub organization membership, providing organizational namespaces that correspond to GitHub organizations to make it possible to collaborate on PEPs.
- **Authorization**. PEPhub has a robust user authorization system to allow users to submit and edit their own metadata. Users authenticate via GitHub. Data may be either public or private, and can be restricted to individual or group-level permissions using GitHub organizations.

- **Group PEPs using a PEP of PEPs (POP)**. A PEP of PEPs, or simply a POP, is a specific type of PEP in which each row is itself a PEP. Essentially, a POP is a structure to group PEPs, allowing users to organize projects. This allows PEPs related to a specific topic to be consolidated, streamlining organization and accessibility.

- **Re-processing of GEO metadata**. The public PEPhub instance [geo namespace](https://pephub.databio.org/geo) holds metadata from nearly 99% of the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/). PEPhub is updated weekly using [GEOfetch](../geofetch/README.md) to produce standardized PEP sample tables, providing a convenient API interface to GEO metadata.

- **PEPHubClient (phc)**. PEPhubClient is a command-line tool and Python API, which allows users to authenticate with PEPhub, download and upload public or private projects. For more information, see the [PEPHubClient documentation](developer/pephubclient/README.md).
- **Command-line client**. [PEPHubClient](developer/pephubclient/README.md) provides a command-line tool and Python API for authenticating with PEPhub and for downloading or uploading public or private projects.

- **Group PEPs using a PEP of PEPs (POP)**. A PEP of PEPs, or simply a POP, is a type of PEP in which each row is itself a PEP. POPs allow users to organize projects into groups.

## Next steps

Binary file added docs/pephub/img/history-interface.png
Binary file added docs/pephub/img/menu-edit.png
Binary file added docs/pephub/img/menu-filter.png
Binary file added docs/pephub/img/menu-history.png
Binary file added docs/pephub/img/select-view.png
Binary file added docs/pephub/img/validation-notice.png
27 changes: 23 additions & 4 deletions docs/pephub/user/geo.md
@@ -1,8 +1,27 @@

# Accessing GEO data through PEPhub

Moreover, users can download all projects as a tar file from the GEO namespace using the link available on the GEO namespace page. PEPhub doesn't store actual files in the database. Because of this, if you want to download files, there are two options:
The Gene Expression Omnibus (GEO) is a major source of biological sample data and metadata. However, accessing the metadata from GEO is challenging. PEPhub provides API-oriented access to processed tabular metadata from GEO.

- Use links to the files that are stored in the project sample table.
- Use geofetch on a local machine to download these files.
Example: `geofetch -i GSE95654 --processed`, where `--processed` indicates that you want to download processed data, not SRA. More information about geofetch can be found on the official [GEOfetch](https://geofetch.databio.org/en/latest/) website.


## Finding GEO data on PEPhub

There are several ways to find GEO metadata on PEPhub:

1. You can browse or search GEO repositories from the [GEO namespace](https://pephub.databio.org/geo).
2. You can use the main PEPhub search interface.
3. You can also just use the URL directly, of the form: `https://pephub.databio.org/geo/{gse_accession}` (with `gse` lowercase). For example: <https://pephub.databio.org/geo/gse211892>
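
The URL pattern in option 3 can be sketched in a few lines of Python. This is only an illustration: the helper name `geo_pep_url` and its `base` parameter are assumptions made for this example and are not part of PEPhub or PEPHubClient.

```python
def geo_pep_url(gse_accession: str, base: str = "https://pephub.databio.org/geo") -> str:
    """Build the PEPhub URL for a GEO series accession (hypothetical helper)."""
    # The URL form expects the accession in lowercase, e.g. 'gse211892'.
    accession = gse_accession.strip().lower()
    if not accession.startswith("gse"):
        raise ValueError(f"expected a GSE accession, got {gse_accession!r}")
    return f"{base}/{accession}"


print(geo_pep_url("GSE211892"))
```

Lowercasing inside the helper means an accession copied straight from GEO (where it is usually uppercase) still produces a valid PEPhub URL.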

## Always up-to-date

PEPhub runs a weekly update that keeps its GEO namespace in sync with GEO, so you can be sure you're getting the latest metadata. You can think of PEPhub as a convenient mirror of GEO metadata. We use [geofetch](../../geofetch/README.md) to download any updated files and process them into a more compact PEP sample table, which we then store in PEPhub.

## Download all processed data from GEO

If you want to do a metadata analysis project that uses *all* the metadata from GEO, we also provide a tar archive. Just find the *Download* link on the [GEO namespace page](https://pephub.databio.org/geo). This will provide processed PEPs of all GEO projects.

If you are looking for the *raw* GEO metadata (not already processed into a PEP), then PEPhub can't really help; we process the data into PEP and discard the raw files, which are large. For most use cases, the processed PEP is a more convenient form. If you really need the raw SOFT files, there are two options:

- Use links to the files that are stored in the project sample table to download the data directly.
- Use geofetch yourself on a local machine to download these files. Example: `geofetch -i GSE95654 --processed`, where `--processed` indicates that you want to download processed data, not SRA. More information can be found in the [geofetch documentation](../../geofetch/README.md).
24 changes: 10 additions & 14 deletions docs/pephub/user/getting-started.md
@@ -7,42 +7,38 @@ Just like GitHub allows you to share and edit projects that you are tracking wit
- **Upload** your metadata to a central database
- **Edit** your metadata in a web interface
- **Share** your metadata with collaborators
- **Validate** your metadata using a schema
- **Access** and update metadata programmatically through an API, from Python or R.

## What is a PEP?

Portable Encapsulated Projects (PEPs) are a standard format for biological sample metadata. A PEP is simply a **yaml** + **csv** file (or just csv) -- the CSV file is a sample table, while the YAML file provides project-level metadata and sample modifiers. PEPs are a common input for running workflows using tools such as [Snakemake](https://snakemake.readthedocs.io/en/stable/), [Common Workflow Language](https://www.commonwl.org/), [Looper](http://pep.databio.org/looper), and other workflow systems. For more details, read the [PEP specification](http://pep.databio.org/spec/simple-example).
Portable Encapsulated Projects (PEPs) are a standard format for biological sample metadata. A PEP is simply a **csv** file representing a sample table, plus an *optional* **YAML** file for project-level metadata and sample modifiers. For more details, read the [PEP specification](http://pep.databio.org/spec/simple-example).
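
To make the "just a CSV" idea concrete, here is a minimal sketch of a sample table read with Python's standard `csv` module. The column names (`sample_name`, `protocol`, `file`) and the sample values are illustrative assumptions; in practice you would probably load a PEP with a dedicated tool rather than parsing it by hand.

```python
import csv
import io

# A tiny, made-up sample table: one row per sample, one column per attribute.
sample_table_csv = """sample_name,protocol,file
frog_1,RNA-seq,data/frog1.fastq.gz
frog_2,RNA-seq,data/frog2.fastq.gz
"""

# Each row becomes a dict keyed by the column headers.
samples = list(csv.DictReader(io.StringIO(sample_table_csv)))
for sample in samples:
    print(sample["sample_name"], sample["protocol"])
```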

## How PEPhub and PEPs work together
PEPhub gives you a platform to store and collaborate on your PEPs. This makes it easier to work together for large or small teams. Instead of relying on local files that you send back-and-forth, PEPhub provides a centralized interface and API that simplifies sharing and collaboration.

PEPhub gives you a central location to store and collaborate on your PEPs. This makes it easier to work together for large or small teams. Instead of relying on local files that you send back-and-forth, PEPhub provides a centralized interface and API that simplifies sharing and collaboration.
## Your first PEP on PEPhub

In short, PEPs are the standard format for biological metadata, and PEPhub is the platform that allows you to store, edit, and share these PEPs. Through the API, one can easily access and retrieve PEPs for use in workflows and analyses.
### Logging in

## Logging into PEPhub

PEPhub accounts are linked to GitHub. This allows us to leverage GitHub's OAuth system for secure authentication and namespacing. Once you have a [GitHub account](https://github.com/signup), you can log in to PEPhub. Just click the "Login" button in the top right corner of the [PEPhub home page](https://pephub.databio.org).
You log in to PEPhub using your [GitHub account](https://github.com/signup). Just click the "Log in" button in the top right corner of the [PEPhub home page](https://pephub.databio.org).

![PEPhub login button](../img/login.png)

You will be redirected to GitHub to authorize the PEPhub application. You are now logged in and can upload your first PEP! There are two main ways to add a PEP to your PEPhub namespace: you can either [upload a PEP directly](#uploading-a-pep), or you can [create a new PEP from scratch](#creating-a-new-pep-from-scratch) using the web interface. This guide will walk you through both methods.

## Uploading a PEP
### Uploading a PEP

Navigate to your PEPhub namespace (`https://pephub.databio.org/{github username}`) and click the "Add" button in the top right. Click the "Upload PEP" tab. You will be prompted to select a PEP file from your local machine. Fill in the details about your PEP and then either drag files to the drop zone or click the drop zone to select files from your computer. Click "Submit" to add the PEP to your namespace.

## Creating a new PEP from scratch
### Creating a new PEP from scratch

Navigate to your PEPhub namespace (`https://pephub.databio.org/{github username}`) and click the "Add" button in the top right. Click the "Blank PEP" tab. Again, fill in the details about your PEP and then you can start filling in the sample table. Click "Submit" to add the PEP to your namespace.

![Submission form for a new PEP](../img/add-pep-form.png)

## Ready to edit your PEP

Once you have uploaded or created a PEP, you can now start using it in pipelines!

## Editing a PEP

Editing a PEP is easy; just make changes in the table and click `Save` when you are finished.
Once you have uploaded or created a PEP, you can edit it. Editing a PEP is easy; just make changes in the table and click `Save` when you are finished.

## Sharing your PEP

5 changes: 5 additions & 0 deletions docs/pephub/user/semantic-search.md
@@ -0,0 +1,5 @@
# PEPhub's semantic search

PEPhub's main search box (accessible from the [home page](https://pephub.databio.org/)) provides a powerful semantic search.

When a user provides a natural-language search query, PEPhub transforms the query in real time using the same sentence transformer that encodes the PEPs, then queries the Qdrant API to retrieve the most semantically similar PEP vectors. Qdrant identifies similar PEPs by calculating nearest neighbors in vector space. PEPhub then returns the results to the client with their associated description and registry path.

PEPhub’s search engine uses a semantic approach, which provides several advantages. First, the system returns results with similar meaning whether or not they include the terms of the original query. Second, it is tolerant of misspellings and is not limited to any ontology or taxonomy. Finally, because each PEP is represented as a vector, we can use high-speed nearest-neighbor algorithms to identify relevant PEPs, making the search very fast. This method scales to millions of PEPs, and the speed is limited only by network speeds. Users may also tune results with limits, offsets, and relevance score cutoffs.
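
The nearest-neighbor idea can be illustrated with a small, self-contained sketch. It is not PEPhub's implementation: it uses a brute-force cosine-similarity scan instead of Qdrant, and the vectors, registry paths, and the `search` function with its `limit`, `offset`, and `score_cutoff` parameters are all made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, limit=10, offset=0, score_cutoff=0.0):
    """Rank every indexed vector by similarity to the query (brute force)."""
    scored = [(cosine_similarity(query_vec, vec), path) for path, vec in index.items()]
    # Drop weak matches, then sort best-first and apply paging.
    scored = [item for item in scored if item[0] >= score_cutoff]
    scored.sort(reverse=True)
    return scored[offset:offset + limit]


# Hypothetical 3-dimensional embeddings; real sentence embeddings have hundreds of dimensions.
index = {
    "demo/rnaseq:default": [0.9, 0.1, 0.0],
    "demo/chipseq:default": [0.1, 0.9, 0.1],
    "demo/atacseq:default": [0.2, 0.8, 0.3],
}

results = search([1.0, 0.0, 0.0], index, limit=2, score_cutoff=0.5)
for score, path in results:
    print(f"{path}\t{score:.3f}")
```

A vector database replaces the full scan with an index designed for nearest-neighbor lookups, which is what keeps the search fast as the number of PEPs grows.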
31 changes: 31 additions & 0 deletions docs/pephub/user/validation.md
@@ -0,0 +1,31 @@
# How to validate sample metadata

PEPhub validates sample metadata with [eido](../../eido/README.md). Schemas can be added and edited on PEPhub directly.

Schemas are particularly useful before running pipelines, as validation provides essential information about PEP compatibility with specific pipelines and highlights any errors in the PEP structure.
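
As a rough illustration of the kind of check a schema expresses, the sketch below reports samples that are missing required attributes. It is a toy, not eido and not a JSON Schema validator: the function name `missing_required`, the attribute names, and the sample values are assumptions made for this example.

```python
def missing_required(samples, required):
    """Return {row_index: [missing attribute names]} for incomplete samples."""
    problems = {}
    for i, sample in enumerate(samples):
        # An attribute counts as missing if the key is absent or the value is empty.
        missing = [attr for attr in required if not sample.get(attr)]
        if missing:
            problems[i] = missing
    return problems


# Hypothetical sample table rows, as dictionaries.
samples = [
    {"sample_name": "frog_1", "genome": "hg38"},
    {"sample_name": "frog_2", "genome": ""},
]

problems = missing_required(samples, required=["sample_name", "genome"])
print(problems)
```

A pipeline that needs a `genome` attribute for every sample would fail on the second row, and catching that before the run starts is the point of validating first.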

There are two ways to use the interface to validate PEPs: from the main PEP interface, or from the universal validator.

## Validating a PEP from the main PEP interface

If you're editing a PEP, it's convenient to be able to validate it from the same interface. First, assign a schema to the PEP, and then validation will happen automatically, whenever you save the project.

### Assign a schema to a PEP

From the main table view, use the *Edit* menu to access the properties for a PEP:

![alt text](../img/menu-edit.png)

In this interface, you can select a schema for this PEP.

### Validating

Once a schema is assigned you'll see the validation results:

![alt text](../img/validation-notice.png)

If you click on this notice, you'll see more detailed information about what in the table is causing the validation to fail. This will allow you to validate metadata in real time, as you work on the table.

## Using the universal validator

Alternatively, for a more flexible approach, you can use the [Universal Validator](https://pephub.databio.org/validate). This provides a 2-step interface where you first provide a PEP, either by selecting one from PEPhub or by uploading it, and then a schema, which can be either selected from PEPhub, uploaded, or pasted.
30 changes: 30 additions & 0 deletions docs/pephub/user/version-control.md
@@ -0,0 +1,30 @@
# How to version control metadata with PEPhub

PEPhub supports table versioning through two features: 1) history; and 2) tags.

## History

PEPhub automatically records a history of your files whenever changes are made. Any time you click "save", an entry is added to your history. You can view the history of table edits by selecting the `History` option from the `More` menu.

![alt text](../img/menu-history.png)

Selecting this option will bring up the *History Interface*, which will provide buttons allowing you to view or delete entries from your history table. If you choose the `View` button for an entry, it will show you the PEP at that point in history. It also opens a new interface that will allow you to click `Restore` to overwrite your current PEP with the historical version you are currently viewing, or you can `Download` the table as it was at that point in history.

![alt text](../img/history-interface.png)

In PEPhub, old versions are kept automatically, and they are referenced by date. PEPhub does not automatically assign version numbers or other identifiers; the only way to identify the old versions is by timestamp.


### History retention policy

**Old versions of sample tables are kept for 30 days.** Once a history entry is more than 30 days old, it will be automatically purged. If you want to keep an old version for longer, then you will need to manually tag the version, thereby forking it into a new repository.

## Tags

The other versioning feature offered by PEPhub is to use tags. PEPhub tags are unique identifiers of repositories. Every repository has a tag. By default, the tag is simply *default*. The registry path of each PEP takes the form of:

```
{namespace}/{repository}:{tag}
```

For example, `nsheff/my_new_pep:v1` would be the `my_new_pep` repository in my user namespace (`nsheff`), and `v1` is the tag. You can use tags to version your own PEPs. When you're ready to declare a version, just fork the current PEP into a new PEP and name the version tag accordingly.
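
A registry path of this form is easy to split into its parts. The sketch below is an illustration only: `RegistryPath` and `parse_registry_path` are hypothetical names, and falling back to `default` when the tag is omitted is an assumption based on the default tag described above.

```python
from typing import NamedTuple


class RegistryPath(NamedTuple):
    namespace: str
    repository: str
    tag: str


def parse_registry_path(path: str) -> RegistryPath:
    """Split '{namespace}/{repository}:{tag}' into its three parts."""
    project, _, tag = path.partition(":")
    namespace, sep, repository = project.partition("/")
    if not sep or not namespace or not repository:
        raise ValueError(f"expected '{{namespace}}/{{repository}}:{{tag}}', got {path!r}")
    # Assumption: a path without an explicit tag refers to the 'default' tag.
    return RegistryPath(namespace, repository, tag or "default")


print(parse_registry_path("nsheff/my_new_pep:v1"))
```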
19 changes: 18 additions & 1 deletion docs/pephub/user/views.md
@@ -1,4 +1,21 @@
# How to use PEPhub views

*Documentation pending*
## What are views?

Large tables (*e.g.* >5,000 rows) can be unwieldy in PEPhub. It can be hard to find the elements you're looking for. To address this, PEPhub provides the *Views* feature. Views provide a way to look at a subset of a large table (basically, a filtered table).

## How to create a view

To create a new view, click the *Down Arrow* to access the filter menu, and set up a filter. This will change the table to display a subset of the rows.

![alt text](../img/menu-filter.png)

Then, you can use the View Settings menu (gear icon next to the view selector) to open the Views interface.

![alt text](../img/select-view.png)

This will allow you to save the view. You can then select it any time from the views menu.

## Read-only limitation

Views are currently read-only; you will not be able to make edits to the table while viewing a subset. We hope to remove this restriction in the future.
