Describe the bug
There are two issues with the outputs of create_final_community_reports.parquet and create_final_communities.parquet:
Extra Row in Community Reports:
create_final_community_reports.parquet contains one more row than create_final_communities.parquet. The discrepancy is likely caused by the hierarchical Leiden algorithm, where isolated nodes are not assigned to any community and are placed in a dummy community (-1). This dummy community appears to have been unintentionally included when generating create_final_community_reports.parquet, resulting in an unnecessary and possibly incorrect community report.
Inconsistent Community Representation:
create_final_communities.parquet identifies communities with titles like "Community xxx", while create_final_community_reports.parquet uses community headers with IDs. This inconsistency makes it difficult to align information between the two files (a quick inspection is sketched below).
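A quick way to see both problems is to load the two parquet files side by side. The following is a minimal pandas sketch; the column names ("title" in the communities file, "community" in the reports file) and the output paths are assumptions and may differ between GraphRAG versions.

import pandas as pd

# Minimal sketch: inspect both outputs side by side.
# Column names ("title", "community") are assumptions, not a confirmed schema.
communities = pd.read_parquet("output/create_final_communities.parquet")
reports = pd.read_parquet("output/create_final_community_reports.parquet")

print(len(communities), len(reports))   # reports has one extra row
print(communities["title"].head())      # titles like "Community 0", "Community 1", ...
print(reports["community"].head())      # community IDs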
Steps to reproduce
Run the process that generates the create_final_community_reports.parquet and create_final_communities.parquet files.
Compare the number of rows in create_final_community_reports.parquet and create_final_communities.parquet.
Observe the dummy community (-1) row in create_final_community_reports.parquet (a scripted version of these checks is sketched after the steps).
Check the naming conventions for communities in both files and note the inconsistency.
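A rough sketch of the checks in steps 2 and 3, again assuming the reports file exposes the community ID in a "community" column:

import pandas as pd

communities = pd.read_parquet("output/create_final_communities.parquet")
reports = pd.read_parquet("output/create_final_community_reports.parquet")

# Step 2: the reports file has one row more than the communities file.
print(len(reports) - len(communities))  # 1 when the bug is present

# Step 3: the extra row belongs to the dummy community (-1) used for isolated nodes.
dummy = reports[reports["community"].astype(int) == -1]
print(dummy.head())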
Expected Behavior
The dummy community (-1) should not be included when generating create_final_community_reports.parquet. This would prevent the creation of unnecessary or incorrect community reports.
Community representation should be consistent across both files. Either:
Use titles like "Community xxx" consistently in both files, or
Use community headers with IDs in both files.
This consistency would ensure clarity and facilitate easy correlation between the two files.
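For illustration, the alignment these two fixes would enable might look like the following sketch: drop the dummy community and join on a single numeric ID. The "Community xxx" parsing and the column names are assumptions for the example, not the actual GraphRAG schema.

import pandas as pd

communities = pd.read_parquet("output/create_final_communities.parquet")
reports = pd.read_parquet("output/create_final_community_reports.parquet")

# Expected behavior 1: no report should exist for the dummy community (-1).
reports["community"] = reports["community"].astype(int)
reports = reports[reports["community"] != -1]

# Expected behavior 2: one consistent key across both files.
# Here the "Community xxx" title is normalized to a numeric ID for the join.
communities["community"] = (
    communities["title"].str.extract(r"Community (\d+)", expand=False).astype(int)
)

merged = communities.merge(reports, on="community", how="outer", indicator=True)
print(merged["_merge"].value_counts())  # every row should be "both" after the fix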
GraphRAG Config Used
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini-2024-07-18
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output\lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  raw_entities: false
  top_level_nodes: false
  embeddings: true
  transient: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
Logs and screenshots
Additional Information
GraphRAG Version: 0.5.0
Operating System: Windows 10
Python Version: 3.12.7
Related Issues: