Part of GPT-RAG
You can provision the infrastructure and deploy the whole solution using the GPT-RAG template, as instructed at: https://aka.ms/gpt-rag.
Eventually, you may want to make some adjustments to the data ingestion code and redeploy the component.
To redeploy only the ingestion component (after the initial deployment of the solution), you will need:
- Azure Developer CLI (azd)
- PowerShell (Windows only)
- Git
- Python 3.11
Then clone this repository and run the following commands from the gpt-rag-ingestion directory:
```
azd auth login
azd env refresh
azd deploy
```
Note: when running `azd env refresh`, use the same environment name, subscription, and region used in the initial provisioning of the infrastructure.
How can I test the data ingestion component locally in VS Code?
To use version 4.0 of Document Intelligence, add the property `DOCINT_API_VERSION` with the value `2024-07-31-preview` to the function app settings. Check that this version is supported in the region where the service was created; more information can be found at this link. If the property is not defined (the default behavior), version `2023-07-31` (3.1) will be used.
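As a sketch of how this setting takes effect, the API version can be read from the app settings with a fallback to 3.1. The variable handling below is illustrative only, not the exact code used by the function app:

```python
import os

# Illustrative only: pick the Document Intelligence API version from the app settings,
# falling back to the 3.1 GA version when DOCINT_API_VERSION is not defined.
DOCINT_API_VERSION = os.getenv("DOCINT_API_VERSION", "2023-07-31")

# Opting in to 4.0 means setting DOCINT_API_VERSION=2024-07-31-preview in the function app.
use_docint_v4 = DOCINT_API_VERSION == "2024-07-31-preview"
```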
The `document_chunking` function is responsible for breaking documents down into smaller pieces known as chunks.
When a document is submitted, the system identifies its file extension and selects the appropriate chunker to divide it into chunks, each tailored to the specific file type.
- For `.pdf` files, the system leverages the DocAnalysisChunker to analyze the document using the Document Intelligence API. This analysis extracts structured elements, such as tables and sections, and converts them into Markdown format. LangChain splitters are then applied to segment the content based on sections. If the Document Intelligence API 4.0 is enabled, `.docx` and `.pptx` files are also processed using this chunker.
- For image files such as `.bmp`, `.png`, `.jpeg`, and `.tiff`, the DocAnalysisChunker is also employed. This chunker includes Optical Character Recognition (OCR) to extract text from the images before chunking.
- For specialized formats, different chunkers are used: `.vtt` files (video transcriptions) are handled by the TranscriptionChunker, which chunks content by time codes; `.xlsx` files (spreadsheets) are processed by the SpreadsheetChunker, chunking by rows or sheets; and `.nl2sql` files are processed by the NL2SQLChunker, which handles JSON content containing natural language questions and their corresponding SQL queries. Click here to see a sample .nl2sql file.
- For text-based files like `.txt`, `.md`, `.json`, and `.csv`, the system uses the LangChainChunker, which applies LangChain splitters to divide the content based on logical separators such as paragraphs or sections.
This flow ensures that each document is processed with the chunker best suited for its format, leading to efficient and accurate chunking tailored to the specific file type.
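As an illustration of this selection logic, here is a minimal sketch. The function name, signature, and mapping are assumptions for illustration; the actual dispatch code in the repository may differ:

```python
# Illustrative sketch only; the real chunker selection in the repository may differ.
def select_chunker(filename: str, docint_v4_enabled: bool = False) -> str:
    """Return the name of the chunker that would handle the given file."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in {"pdf", "bmp", "png", "jpeg", "tiff"}:
        return "DocAnalysisChunker"       # Document Intelligence analysis / OCR
    if ext in {"docx", "pptx"} and docint_v4_enabled:
        return "DocAnalysisChunker"       # requires Document Intelligence API 4.0
    if ext == "vtt":
        return "TranscriptionChunker"     # chunks transcriptions by time codes
    if ext == "xlsx":
        return "SpreadsheetChunker"       # chunks by rows or sheets
    if ext == "nl2sql":
        return "NL2SQLChunker"            # question/SQL pairs in JSON
    return "LangChainChunker"             # txt, md, json, csv and other text-based files

# Example: select_chunker("report.pdf") == "DocAnalysisChunker"
```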
Important: the choice of chunker is determined by the file format, following the guidelines provided above.
The chunking process is flexible and can be customized. You can modify the existing chunkers or create new ones to suit your specific data processing needs, allowing for a more tailored and efficient processing pipeline.
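For instance, a new chunker only needs to turn raw content into chunks. The sketch below is a hypothetical example; the class name and interface are assumptions and do not reflect the repository's actual base class:

```python
class FixedSizeTextChunker:
    """Hypothetical custom chunker that splits plain text into fixed-size chunks."""

    def __init__(self, max_chars: int = 2000):
        self.max_chars = max_chars

    def get_chunks(self, content: str) -> list[str]:
        # Split the raw text into consecutive windows of at most max_chars characters.
        return [content[i:i + self.max_chars] for i in range(0, len(content), self.max_chars)]
```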
The formats supported by each chunker are listed below. The decision of which chunker is used for a given format follows the rules described above.
DocAnalysisChunker:

Extension | Doc Int API Version |
---|---|
pdf | 3.1, 4.0 |
bmp | 3.1, 4.0 |
jpeg | 3.1, 4.0 |
png | 3.1, 4.0 |
tiff | 3.1, 4.0 |
xlsx | 4.0 |
docx | 4.0 |
pptx | 4.0 |
LangChainChunker:

Extension | Format |
---|---|
md | Markdown document |
txt | Plain text file |
html | HTML document |
shtml | Server-side HTML document |
htm | HTML document |
py | Python script |
json | JSON data file |
csv | Comma-separated values file |
xml | XML data file |
TranscriptionChunker:

Extension | Format |
---|---|
vtt | Video transcription |
SpreadsheetChunker:

Extension | Format |
---|---|
xlsx | Spreadsheet |
NL2SQLChunker:

Extension | Description |
---|---|
nl2sql | JSON files containing natural language questions and corresponding SQL queries |
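For illustration only, the JSON content of an `.nl2sql` file could look like the snippet below. The field names are assumptions, not the official schema; refer to the sample file mentioned above for the actual format:

```python
import json

# Hypothetical .nl2sql content: a natural language question paired with its SQL query.
sample = {
    "question": "How many orders were placed last month?",
    "sql": "SELECT COUNT(*) FROM orders WHERE order_date >= DATEADD(month, -1, GETDATE())"
}

# Write it out as a .nl2sql file for ingestion.
with open("orders.nl2sql", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2)
```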
We appreciate your interest in contributing to this project! Please refer to the CONTRIBUTING.md page for detailed guidelines on how to contribute, including information about the Contributor License Agreement (CLA), code of conduct, and the process for submitting pull requests.
Thank you for your support and contributions!
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.