Update Prompt Tuning docs (#1057)

* Update Prompt Tuning docs
* Semver
microsoft · Aug 29, 2024 · e023882
1 parent d13aec5
commit e023882

Showing 5 changed files with 45 additions and 14 deletions.
diff --git a/.semversioner/next-release/patch-20240829213842840703.json b/.semversioner/next-release/patch-20240829213842840703.json
@@ -0,0 +1,4 @@
+{
+  "type": "patch",
+  "description": "Update Prompt Tuning docs"
+}
diff --git a/docsite/_includes/page.njk b/docsite/_includes/page.njk
@@ -107,10 +107,10 @@ title: GraphRAG
                     {{link_to("/posts/prompt_tuning/overview/", "Prompt Tuning")}}
                     <ul>
                         <li>
-                            {{link_to("/posts/prompt_tuning/auto_prompt_tuning/", "Automatic Templating")}}
+                            {{link_to("/posts/prompt_tuning/auto_prompt_tuning/", "Auto Tuning")}}
                         </li>
                         <li>
-                            {{link_to("/posts/prompt_tuning/manual_prompt_tuning/", "Manual Prompt Tuning")}}
+                            {{link_to("/posts/prompt_tuning/manual_prompt_tuning/", "Manual Tuning")}}
                         </li>
                     </ul>
                   </li>

diff --git a/docsite/img/auto-tune-diagram.png b/docsite/img/auto-tune-diagram.png
diff --git a/docsite/posts/prompt_tuning/auto_prompt_tuning.md b/docsite/posts/prompt_tuning/auto_prompt_tuning.md
@@ -6,13 +6,20 @@ tags: [post, tuning]
 date: 2024-06-13
 ---
 
-GraphRAG provides the ability to create domain adaptive templates for the generation of the knowledge graph. This step is optional, though it is highly encouraged to run it as it will yield better results when executing an Index Run.
+GraphRAG provides the ability to create domain adapted prompts for the generation of the knowledge graph. This step is optional, though it is highly encouraged to run it as it will yield better results when executing an Index Run.
 
-The templates are generated by loading the inputs, splitting them into chunks (text units) and then running a series of LLM invocations and template substitutions to generate the final prompts. We suggest using the default values provided by the script, but in this page you'll find the detail of each in case you want to further explore and tweak the template generation algorithm.
+These are generated by loading the inputs, splitting them into chunks (text units) and then running a series of LLM invocations and template substitutions to generate the final prompts. We suggest using the default values provided by the script, but in this page you'll find the detail of each in case you want to further explore and tweak the prompt tuning algorithm.
+
+<p align="center">
+<img src="../../img/auto-tune-diagram.png" alt="Figure 1: Auto Tuning Conceptual Diagram." width="450" align="center" />
+</p>
+<p align="center">
+Figure 1: Auto Tuning Conceptual Diagram.
+</p>
 
 ## Prerequisites
 
-Before running the automatic template generation make sure you have already initialized your workspace with the `graphrag.index --init` command. This will create the necessary configuration files and the default prompts. Refer to the [Init Documentation](/posts/config/init) for more information about the initialization process.
+Before running auto tuning make sure you have already initialized your workspace with the `graphrag.index --init` command. This will create the necessary configuration files and the default prompts. Refer to the [Init Documentation](/posts/config/init) for more information about the initialization process.
 
 ## Usage
 
@@ -30,7 +37,7 @@ python -m graphrag.prompt_tune [--root ROOT] [--domain DOMAIN]  [--method METHOD
 
 - `--domain` (optional): The domain related to your input data, such as 'space science', 'microbiology', or 'environmental news'. If left empty, the domain will be inferred from the input data.
 
-- `--method` (optional): The method to select documents. Options are all, random, or top. Default is random.
+- `--method` (optional): The method to select documents. Options are all, random, auto or top. Default is random.
 
 - `--limit` (optional): The limit of text units to load when using random or top selection. Default is 15.
 
@@ -40,14 +47,20 @@ python -m graphrag.prompt_tune [--root ROOT] [--domain DOMAIN]  [--method METHOD
 
 - `--chunk-size` (optional): The size in tokens to use for generating text units from input documents. Default is 200.
 
+- `--n-subset-max` (optional): The number of text chunks to embed when using auto selection method. Default is 300.
+
+- `--k` (optional): The number of documents to select when using auto selection method. Default is 15.
+
+- `--min-examples-required` (optional): The minimum number of examples required for entity extraction prompts. Default is 2.
+
 - `--no-entity-types` (optional): Use untyped entity extraction generation. We recommend using this when your data covers a lot of topics or it is highly randomized.
 
 - `--output` (optional): The folder to save the generated prompts. Default is "prompts".
 
 ## Example Usage
 
 ```bash
-python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --domain "environmental news" --method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --no-entity-types --output /path/to/output
+python -m graphrag.prompt_tune --root /path/to/project --config /path/to/settings.yaml --domain "environmental news" --method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 --no-entity-types --output /path/to/output
 ```
 
 or, with minimal configuration (suggested):
@@ -58,19 +71,33 @@ python -m graphrag.prompt_tune --root /path/to/project --config /path/to/setting
 
 ## Document Selection Methods
 
-The auto template feature ingests the input data and then divides it into text units the size of the chunk size parameter.
-After that, it uses one of the following selection methods to pick a sample to work with for template generation:
+The auto tuning feature ingests the input data and then divides it into text units the size of the chunk size parameter.
+After that, it uses one of the following selection methods to pick a sample to work with for prompt generation:
 
 - `random`: Select text units randomly. This is the default and recommended option.
 - `top`: Select the head n text units.
 - `all`: Use all text units for the generation. Use only with small datasets; this option is not usually recommended.
+- `auto`: Embed text units in a lower-dimensional space and select the k nearest neighbors to the centroid. This is useful when you have a large dataset and want to select a representative sample.
 
 ## Modify Env Vars
 
-After running auto-templating, you should modify the following environment variables (or config variables) to pick up the new prompts on your index run. Note: Please make sure to update the correct path to the generated prompts, in this example we are using the default "prompts" path.
+After running auto tuning, you should modify the following environment variables (or config variables) to pick up the new prompts on your index run. Note: Please make sure to update the correct path to the generated prompts, in this example we are using the default "prompts" path.
 
 - `GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE` = "prompts/entity_extraction.txt"
 
 - `GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE` = "prompts/community_report.txt"
 
 - `GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE` = "prompts/summarize_descriptions.txt"
+
+or in your yaml config file:
+
+```yaml
+entity_extraction:
+  prompt: "prompts/entity_extraction.txt"
+
+summarize_descriptions:
+  prompt: "prompts/summarize_descriptions.txt"
+
+community_reports:
+  prompt: "prompts/community_report.txt"
+```
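
Note on the selection methods documented in `auto_prompt_tuning.md` above: the four options (`random`, `top`, `all`, `auto`) can be illustrated with a small sketch. This is not GraphRAG's implementation. The function names, the plain-list embeddings, and the use of Euclidean distance are all assumptions made for illustration, and the dimensionality reduction and `--n-subset-max` sub-sampling that the docs mention for `auto` are left out.

```python
import math
import random
from typing import List, Optional, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_text_units(
    units: Sequence[str],
    method: str = "random",
    limit: int = 15,
    k: int = 15,
    embeddings: Optional[Sequence[Sequence[float]]] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper mirroring the documented --method options."""
    if method == "all":
        # Use every text unit; only sensible for small datasets.
        return list(units)
    if method == "top":
        # Take the first `limit` text units.
        return list(units[:limit])
    if method == "random":
        # Sample up to `limit` text units without replacement.
        rng = random.Random(seed)
        return rng.sample(list(units), min(limit, len(units)))
    if method == "auto":
        # Assumption: one precomputed embedding per text unit.
        if embeddings is None or len(embeddings) != len(units):
            raise ValueError("auto selection needs one embedding per text unit")
        center = centroid(embeddings)
        # Rank units by distance to the centroid and keep the k closest.
        ranked = sorted(
            range(len(units)),
            key=lambda i: math.dist(embeddings[i], center),
        )
        return [units[i] for i in ranked[:k]]
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    units = ["a", "b", "c", "d"]
    embeddings = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [1.0, 0.0]]
    print(select_text_units(units, "auto", k=2, embeddings=embeddings))
```

In this toy data the outlier `"c"` pulls the centroid away from the origin but is itself the farthest point, so it is never selected. That is the sense in which a nearest-to-centroid sample is "representative".
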
diff --git a/docsite/posts/prompt_tuning/overview.md b/docsite/posts/prompt_tuning/overview.md
@@ -17,10 +17,10 @@ The default prompts are the simplest way to get started with the GraphRAG system
 - [Claim Extraction](http://github.com/microsoft/graphrag/blob/main/graphrag/index/graph/extractors/claims/prompts.py)
 - [Community Reports](http://github.com/microsoft/graphrag/blob/main/graphrag/index/graph/extractors/community_reports/prompts.py)
 
-## Auto Templating
+## Auto Tuning
 
-Auto Templating leverages your input data and LLM interactions to create domain adaptive templates for the generation of the knowledge graph. It is highly encouraged to run it as it will yield better results when executing an Index Run. For more details about how to use it, please refer to the [Auto Templating](/posts/prompt_tuning/auto_prompt_tuning) documentation.
+Auto Tuning leverages your input data and LLM interactions to create domain adapted prompts for the generation of the knowledge graph. It is highly encouraged to run it as it will yield better results when executing an Index Run. For more details about how to use it, please refer to the [Auto Tuning](/posts/prompt_tuning/auto_prompt_tuning) documentation.
 
-## Manual Configuration
+## Manual Tuning
 
-Manual configuration is an advanced use-case. Most users will want to use the Auto Templating feature instead. Details about how to use manual configuration are available in the [Manual Prompt Configuration](/posts/prompt_tuning/manual_prompt_tuning) documentation.
+Manual tuning is an advanced use-case. Most users will want to use the Auto Tuning feature instead. Details about how to use manual configuration are available in the [Manual Tuning](/posts/prompt_tuning/manual_prompt_tuning) documentation.