Merge pull request #51 from SCAI-BIO/docs-tsne-visualization
Extend README with t-SNE visualization
tiadams authored Nov 12, 2024
2 parents 0e4cbcd + 331f16e commit d401801
Showing 3 changed files with 44 additions and 2 deletions.
30 changes: 28 additions & 2 deletions README.md
@@ -15,7 +15,7 @@ pip install datastew
### Harmonizing excel/csv resources

You can directly import common data models, terminology sources or data dictionaries for harmonization directly from a
-csv, tsv or excel file. An example how to match two seperate variable descriptions is shown in
+csv, tsv or excel file. An example how to match two separate variable descriptions is shown in
[datastew/scripts/mapping_excel_example.py](datastew/scripts/mapping_excel_example.py):

```python
@@ -44,7 +44,7 @@ embedding_model = GPT4Adapter(key="your_api_key")
df = map_dictionary_to_dictionary(source, target, embedding_model=embedding_model)
```
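
The diff does not show how `map_dictionary_to_dictionary` picks its matches. As a rough illustration only, assuming each description is embedded as a vector and the closest target is chosen by cosine similarity (an assumption, not confirmed by this commit), the idea can be sketched with toy vectors. The names `cosine_similarity`, `best_matches`, and the example variables are hypothetical and not part of datastew:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_matches(source, target):
    """For each source variable, return (closest target variable, similarity)."""
    result = {}
    for s_name, s_vec in source.items():
        t_name = max(target, key=lambda name: cosine_similarity(s_vec, target[name]))
        result[s_name] = (t_name, cosine_similarity(s_vec, target[t_name]))
    return result


# Toy 3-dimensional vectors standing in for real description embeddings
# (hypothetical values; a language model would produce far longer vectors).
source_embeddings = {"sbp": [0.9, 0.1, 0.0], "age_yrs": [0.0, 0.2, 0.9]}
target_embeddings = {"systolic_bp": [1.0, 0.0, 0.1], "age": [0.1, 0.1, 1.0]}

matches = best_matches(source_embeddings, target_embeddings)
for variable, (match, similarity) in matches.items():
    print(f"{variable} -> {match} (similarity {similarity:.3f})")
```

With real data, the vectors would come from an embedding adapter such as the one passed as `embedding_model` above; the matching step itself stays this simple in the sketch.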

-You can also retrieve embeddings from data dictionaries and visualize them in form of an interactive scatterplot to
+You can also retrieve embeddings from data dictionaries and visualize them in form of an interactive scatter plot to
explore sematic neighborhoods:

```python
@@ -101,3 +101,29 @@ Similarity: 0.20031612264852067 -> Concept ID: 73211009 : Hypertension (disorder
You can also import data from file sources (csv, tsv, xlsx) or from a public API like OLS. An example script to
download & compute embeddings for SNOMED from ebi OLS can be found in
[datastew/scripts/ols_snomed_retrieval.py](datastew/scripts/ols_snomed_retrieval.py).

### Embedding visualization

You can visualize the embedding space of multiple data dictionary sources with t-SNE plots utilizing different
language models. An example how to generate a t-sne plot is shown in
[datastew/scripts/tsne_visualization.py](datastew/scripts/tsne_visualization.py):

```python
from datastew.embedding import MPNetAdapter
from datastew.process.parsing import DataDictionarySource
from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your excel sheet
data_dictionary_source_1 = DataDictionarySource(
    "source1.xlsx", variable_field="var", description_field="desc"
)
data_dictionary_source_2 = DataDictionarySource(
    "source2.xlsx", variable_field="var", description_field="desc"
)

mpnet_adapter = MPNetAdapter()
plot_embeddings(
    [data_dictionary_source_1, data_dictionary_source_2], embedding_model=mpnet_adapter
)
```
![t-SNE plot](./docs/tsne_plot.png)
16 changes: 16 additions & 0 deletions datastew/scripts/tsne_visualization.py
@@ -0,0 +1,16 @@
from datastew.embedding import MPNetAdapter
from datastew.process.parsing import DataDictionarySource
from datastew.visualisation import plot_embeddings

# Variable and description refer to the corresponding column names in your excel sheet
data_dictionary_source_1 = DataDictionarySource(
    "source1.xlsx", variable_field="var", description_field="desc"
)
data_dictionary_source_2 = DataDictionarySource(
    "source2.xlsx", variable_field="var", description_field="desc"
)

mpnet_adapter = MPNetAdapter()
plot_embeddings(
    [data_dictionary_source_1, data_dictionary_source_2], embedding_model=mpnet_adapter
)
Binary file added docs/tsne_plot.png