Content Processing

This page gives additional details about how content processing is handled in the solution, including the workflow steps and how to use your own data.

Workflow

  1. Document upload
    Documents are added to blob storage. Processing is triggered when a file is checked in.

  2. Text extraction, context extraction (image)
    The appropriate processing pipeline is selected based on file type.

  3. Summarization
    LLM summarization of the extracted content.

  4. Keyword and entity extraction
    Keywords are extracted from the full document through an LLM prompt. If the document is too large, keywords are extracted from the summary instead.

  5. Text chunking from text extraction results
    Chunk size is aligned with the input size of the embedding model.

  6. Vectorization
    Embeddings are created from the chunked text using the text-embedding-3-large model.

  7. Save results to Azure AI Search index
    Data is added to the index, including vectorized fields, text chunks, keywords, and entity-specific metadata.
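
Steps 2, 4, and 5 of the workflow can be sketched roughly as follows. This is a minimal illustration, not the solution's actual code: the function names, the extension-to-pipeline mapping, and the numeric limits (`MAX_KEYWORD_INPUT_CHARS`, `chunk_size`) are assumptions made for the example, and chunk size is counted in whitespace-separated words rather than model tokens.

```python
from pathlib import Path

# Hypothetical mapping: image files go to context extraction, everything
# else to text extraction. The solution's real pipelines may differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

# Assumed size limit, for illustration only.
MAX_KEYWORD_INPUT_CHARS = 20_000


def select_pipeline(filename: str) -> str:
    """Step 2: pick a processing pipeline based on file type."""
    ext = Path(filename).suffix.lower()
    return "context_extraction" if ext in IMAGE_EXTENSIONS else "text_extraction"


def keyword_source(full_text: str, summary: str) -> str:
    """Step 4: use the full document unless it is too large, then the summary."""
    return full_text if len(full_text) <= MAX_KEYWORD_INPUT_CHARS else summary


def chunk_text(text: str, chunk_size: int = 200) -> list[str]:
    """Step 5: split text into chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
```

In a real deployment the chunk size would likely be measured with the embedding model's tokenizer, so that every chunk fits within the model's input limit.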

Customizing With Your Own Documents

There are two methods to use your own data in this solution. It takes roughly 10-15 minutes for a file to be processed and show up in the index and in results on the web app.

  1. Web App - UI Uploading
    You can upload the files you would like processed through the user interface. These files are uploaded to blob storage, processed, and added to the Azure AI Search index. File uploads are limited to 500MB and restricted to the following file formats: Office files, TXT, PDF, TIFF, JPG, PNG.

  2. Bulk File Processing
    You can also add files in bulk directly to blob storage, since the web app saves uploaded files there as well. This is the ideal way to upload a large number of documents or files that are large in size.
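
As a rough pre-check before choosing between the two methods, the web app limits could be expressed like this. It is only a sketch: `can_upload_via_web_app` is a hypothetical helper, the 500MB limit is assumed to mean binary megabytes, and the Office extensions listed are an assumption because this page does not enumerate them.

```python
from pathlib import Path

# Assumed interpretation of the 500MB limit (500 * 1024 * 1024 bytes).
MAX_UPLOAD_BYTES = 500 * 1024 * 1024

# "Office files" is not enumerated in this document; the first row is an assumption.
ALLOWED_EXTENSIONS = {
    ".docx", ".xlsx", ".pptx",
    ".txt", ".pdf",
    ".tif", ".tiff", ".jpg", ".jpeg", ".png",
}


def can_upload_via_web_app(filename: str, size_bytes: int) -> bool:
    """Return True if the file fits the web app's format and size limits."""
    ext = Path(filename).suffix.lower()
    return ext in ALLOWED_EXTENSIONS and size_bytes <= MAX_UPLOAD_BYTES
```

Files in a supported format that fail only the size check are natural candidates for bulk file processing.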

Modifying Processing Prompts

Prompt-based processing is used for context extraction, summarization, and keyword/entity extraction. Modifying a prompt changes what is extracted in the related workflow step.

You can find the prompt configuration text files for summarization and keyword/entity extraction in this folder:

\App\kernel-memory\service\Core\Prompts\

Changing the context extraction prompt requires recompiling the code. You can modify the prompt on line 56 of this code file:

\App\kernel-memory\service\Core\DataFormats\Image\ImageContextDecoder.cs