Content Processing

This document describes how content processing is handled in the solution, including the workflow steps and how to use your own data.

Workflow

Document upload
Documents are added to blob storage. Processing is triggered when a file is checked in.
Text extraction, context extraction (image)
An appropriate processing pipeline is selected based on file type.
Summarization
LLM summarization of the extracted content.
Keyword and entity extraction
Keywords are extracted from the full document through an LLM prompt. If the document is too large, keywords are extracted from the summary instead.
Text chunking from text extraction results
Chunk size is aligned with the size limit of the embedding model.
Vectorization
Creation of embeddings from the chunked text using the text-embedding-3-large model.
Save results to Azure AI Search index
Data is added to the index, including vectorized fields, text chunks, keywords, and entity-specific metadata.
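
The keyword fallback and chunking steps above can be illustrated with a minimal Python sketch. This is not the solution's actual implementation. The word-based token estimate, the limit values passed in, and the function names `estimate_tokens`, `choose_keyword_source`, and `chunk_text` are all assumptions made for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def choose_keyword_source(document: str, summary: str, max_tokens: int) -> str:
    """Return the full document if it fits the hypothetical limit, else the summary."""
    return document if estimate_tokens(document) <= max_tokens else summary


def chunk_text(text: str, max_tokens: int) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens words each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```

A real pipeline would count tokens with the embedding model's own tokenizer rather than by words. The sketch only shows the shape of the two decisions: which text feeds keyword extraction, and how chunk size is capped.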

Customizing With Your Own Documents

There are two methods to use your own data in this solution. It takes roughly 10-15 minutes for a file to be processed and show up in the index and in results on the web app.

Web App - UI Uploading
You can upload files that you would like processed through the user interface. These files are uploaded to blob storage, processed, and added to the Azure AI Search index. File uploads are limited to 500MB and restricted to the following file formats: Office files, TXT, PDF, TIFF, JPG, PNG.
Bulk File Processing
You can process files in bulk by adding them directly to blob storage, since the web app also saves uploaded files there. This is the ideal way to upload a large number of documents or files that are large in size.
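
The web app upload limits described above could be checked before uploading with a small sketch like the following. It is an illustration only, not the web app's validation code. The function name `can_upload_via_ui`, the choice of binary megabytes for the 500MB limit, and the specific extensions listed (in particular which extensions count as "Office files") are assumptions.

```python
from pathlib import Path

# 500MB upload limit from the text; binary megabytes are an assumption.
MAX_UPLOAD_BYTES = 500 * 1024 * 1024

# Hypothetical extension list; the Office extensions shown are an assumption.
ALLOWED_EXTENSIONS = {
    ".docx", ".xlsx", ".pptx",
    ".txt", ".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png",
}


def can_upload_via_ui(filename: str, size_bytes: int) -> bool:
    """Return True if the file passes the size and format checks sketched here."""
    extension = Path(filename).suffix.lower()  # compare case-insensitively
    return extension in ALLOWED_EXTENSIONS and 0 <= size_bytes <= MAX_UPLOAD_BYTES
```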

Modifying Processing Prompts

Prompt-based processing is used for context extraction, summarization, and keyword/entity extraction. Modifying a prompt changes what is extracted in the related workflow step.

You can find the prompt configuration text files for summarization and keyword/entity extraction in this folder:

\App\kernel-memory\service\Core\Prompts\

Changing the context extraction prompt requires recompiling the code. You can modify the prompt in this code file on line 56:

\App\kernel-memory\service\Core\DataFormats\Image\ImageContextDecoder.cs
