RepoAnalysis/RepoSnipy

RepoSnipy 🐍🔫

Neural search engine for discovering semantically similar Python repositories on GitHub.

Demo

Searching an indexed repository:

Search Indexed Repo Demo

About

RepoSnipy is a neural search engine built with Streamlit and DocArray. Given a public Python repository hosted on GitHub, it finds popular repositories that are semantically similar to it.

It uses the RepoSim pipeline to create embeddings for Python repositories. We have created a vector dataset (stored as a DocArray index) of over 9700 GitHub Python repositories that had a license and more than 300 stars as of 20 May 2023.
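To illustrate the idea of searching an embedding index, here is a minimal sketch of a nearest-neighbour lookup by cosine similarity. It is not RepoSnipy's implementation and does not use the DocArray API: the `most_similar` helper, the repository names, and the three-dimensional vectors are all hypothetical, and a plain dictionary stands in for the real index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, index, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy index: repository name -> embedding (assumed values).
index = {
    "owner/web-framework": [0.9, 0.1, 0.0],
    "owner/http-client": [0.8, 0.3, 0.1],
    "owner/plotting-lib": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the repository being queried.
results = most_similar([1.0, 0.2, 0.0], index, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A linear scan like this is enough to show the ranking step; a real index over thousands of repositories would normally delegate the search to the vector store.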

Running Locally

Download the repository and install the required packages:

git clone https://github.com/RepoAnalysis/RepoSnipy
cd RepoSnipy
pip install -r requirements.txt

Then run the app on your local machine using:

streamlit run app.py

Evaluation

The evaluation script enumerates every pair of repositories in the dataset and computes the cosine similarity between their embeddings. For each pair it also checks whether the two repositories share at least one topic (ignoring python and python3). We then compare the two, using the ROC AUC score to measure how well the cosine similarity predicts topic overlap. The resulting dataframe, containing the cosine similarity and topic similarity for every pair, can be downloaded from here; it covers both the code-embedding and docstring-embedding evaluations. The ROC AUC score is around 0.84 for code embeddings and around 0.81 for docstring embeddings.
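The procedure described above can be sketched in a few lines. This is an illustrative reconstruction, not the project's evaluation script: the `evaluate` and `roc_auc` helpers, the record layout, and the toy repositories are assumptions. It also assumes the shared-topic flag serves as the label and the cosine similarity as the score. ROC AUC is computed here directly from its definition: the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counting half.

```python
import math
from itertools import combinations

# Topics excluded from the overlap check, as described in the text.
IGNORED_TOPICS = {"python", "python3"}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def evaluate(repos):
    """Score every repository pair and return the ROC AUC."""
    labels, scores = [], []
    for (_, a), (_, b) in combinations(repos.items(), 2):
        scores.append(cosine_similarity(a["embedding"], b["embedding"]))
        shared = (set(a["topics"]) & set(b["topics"])) - IGNORED_TOPICS
        labels.append(bool(shared))
    return roc_auc(labels, scores)


# Hypothetical toy dataset (names, embeddings, and topics are made up).
toy_repos = {
    "a/web": {"embedding": [1.0, 0.0], "topics": ["python", "web"]},
    "b/http": {"embedding": [0.9, 0.1], "topics": ["web", "http"]},
    "c/plot": {"embedding": [0.0, 1.0], "topics": ["python", "plotting"]},
    "d/charts": {"embedding": [0.1, 0.9], "topics": ["plotting", "python3"]},
}

auc = evaluate(toy_repos)
print(f"ROC AUC: {auc:.2f}")
```

The pairwise AUC loop is quadratic in the number of pairs, so it only suits small examples. For a dataset of this size, a rank-based implementation such as scikit-learn's `roc_auc_score` would be the practical choice.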

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

The model and the fine-tuning dataset used:
