GitHub - jacobmarks/semantic-document-search-plugin: Semantically search through OCR text blocks with Qdrant, Sentence Transformers, and FiftyOne!

Semantic Document Search Plugin

This Python plugin lets you semantically search through the text blocks in your dataset that were extracted with Optical Character Recognition (OCR).

The index is stored in Qdrant, and the text is embedded with the GTE-base model, loaded through Hugging Face's Sentence Transformers library.

Usage

You will need to have text blocks in your dataset. You can generate them with the PyTesseract OCR plugin.

Create a vector index for your text blocks with the create_semantic_document_index operator. You can then use the semantically_search_documents operator to search through your text blocks.

If you have multiple detections fields with text blocks, you can create one index per field. Each index is stored in Qdrant under the collection name <dataset_name>_sds_<field_name>. When you use the semantically_search_documents operator, you can specify which index to use.
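To make the naming pattern concrete, here is a minimal sketch of how a collection name could be assembled. The helper name `sds_collection_name` and the dataset and field names are hypothetical, and the plugin may normalize names differently before sending them to Qdrant.

```python
def sds_collection_name(dataset_name: str, field_name: str) -> str:
    """Build a collection name following the <dataset_name>_sds_<field_name> pattern.

    Assumption: plain string concatenation; the plugin itself may sanitize names.
    """
    return f"{dataset_name}_sds_{field_name}"


# Hypothetical dataset with two detections fields that hold OCR text.
collection_names = [
    sds_collection_name("invoices", field)
    for field in ("ocr_blocks", "ocr_words")
]
print(collection_names)
```

With one collection per field, each index can be rebuilt or dropped without touching the others.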

Watch On Youtube

Installation

Download the plugin with the following command:

fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin

You will also need the Sentence Transformers library and the Qdrant Python client, which you can install with:

fiftyone plugins requirements @jacobmarks/semantic_document_search --install

You will also need to have a Qdrant instance running. You can do this with Docker once you have your Docker daemon running:

docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant
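Before creating an index, it can help to confirm that the container is reachable. The sketch below uses only the standard library and queries Qdrant's REST endpoint for listing collections on port 6333 (the REST port published by the command above). The function names `qdrant_url` and `qdrant_is_up` are hypothetical helpers, not part of the plugin.

```python
import http.client
import urllib.request


def qdrant_url(host: str = "localhost", port: int = 6333, path: str = "/collections") -> str:
    """Build the URL of a Qdrant REST endpoint."""
    return f"http://{host}:{port}{path}"


def qdrant_is_up(host: str = "localhost", port: int = 6333, timeout: float = 2.0) -> bool:
    """Return True if the Qdrant REST API answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(qdrant_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all count as "not up".
        return False


if __name__ == "__main__":
    print("Qdrant reachable:", qdrant_is_up())
```

If this prints `False`, check that the Docker daemon is running and that port 6333 is not already in use.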

Using with PyTesseract OCR Plugin

This semantic search plugin is in many ways analogous to the keyword search plugin, and is likewise designed to be used with the PyTesseract OCR plugin.

You can install the PyTesseract OCR plugin with the following command:

fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin

Operators

`create_semantic_document_index`

Description: Create a Qdrant index for the specified text field within a detections label field.
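As a rough illustration of what "the specified text field within a detections label field" refers to, the sketch below flattens samples into one record per text block. It uses plain dictionaries rather than FiftyOne objects, and the function name, record layout, and field names are all assumptions made for illustration; the operator's actual implementation may differ.

```python
def collect_text_blocks(samples, label_field, text_field):
    """Gather one record per detection that carries non-empty text.

    Assumption: each sample is a dict with an "id" and a list of detection
    dicts under `label_field`; each detection has an "id" and may have text
    under `text_field`.
    """
    records = []
    for sample in samples:
        for detection in sample.get(label_field, []):
            text = (detection.get(text_field) or "").strip()
            if text:  # skip detections without usable text
                records.append(
                    {
                        "sample_id": sample["id"],
                        "label_id": detection["id"],
                        "text": text,
                    }
                )
    return records


# Hypothetical data: two samples, one detection has no text.
samples = [
    {"id": "s1", "ocr_blocks": [{"id": "d1", "text": "Invoice total: 42"}]},
    {"id": "s2", "ocr_blocks": [{"id": "d2", "text": "  "}, {"id": "d3", "text": "Due date"}]},
]
records = collect_text_blocks(samples, "ocr_blocks", "text")
```

Keeping the label ID alongside each text block is what would allow a search hit to be mapped back to an individual label.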

`semantically_search_documents`

Description: Semantically search for text in your dataset. Only labels matching your query will be shown.

You can specify the number of results to return, and the threshold for the similarity score.
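The interaction between the two settings can be sketched in plain Python: score every indexed vector against the query, drop scores below the threshold, then keep the top results. This is a toy illustration with hand-written vectors, not the plugin's code. Cosine similarity is an assumption here, the names `search`, `num_results`, and `threshold` are hypothetical, and in the plugin the scoring is delegated to Qdrant.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, indexed, num_results=10, threshold=0.0):
    """Return up to `num_results` (label_id, score) pairs with score >= threshold."""
    scored = [
        (label_id, cosine_similarity(query_vec, vec))
        for label_id, vec in indexed.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return kept[:num_results]


# Toy 2-D "embeddings" keyed by label ID (real embeddings have many more dimensions).
indexed = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = search([1.0, 0.0], indexed, num_results=2, threshold=0.5)
```

Raising the threshold trades recall for precision, while the result count caps how many labels remain visible regardless of their scores.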
