[Bug]: Inconsistencies and Extra Row in Community Reports #1422

IT-Bill · 2024-11-19T04:12:17Z

Describe the bug

There are two issues with the outputs of create_final_community_reports.parquet and create_final_communities.parquet:

1. Extra Row in Community Reports:
The create_final_community_reports.parquet contains one more row than create_final_communities.parquet. This discrepancy is likely caused by the hierarchical Leiden algorithm: isolated nodes are not assigned to any community and are instead placed in a dummy community (-1). This dummy community appears to be included unintentionally when create_final_community_reports.parquet is generated, which produces an unnecessary and possibly incorrect community report.
2. Inconsistent Community Representation:
The create_final_communities.parquet identifies communities by titles like "Community xxx", while create_final_community_reports.parquet identifies them by community headers with IDs. This inconsistency makes it difficult to align the information between the two files.

Steps to reproduce

1. Run the indexing process that generates the create_final_community_reports.parquet and create_final_communities.parquet files.
2. Compare the number of rows in create_final_community_reports.parquet and create_final_communities.parquet.
3. Observe the dummy community (-1) in create_final_community_reports.parquet.
4. Check the naming conventions for communities in both files and note the inconsistency.
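
The comparison in the steps above can be sketched as a small helper. This is only an illustration, not part of GraphRAG: the function name diff_community_ids is hypothetical, and the sample values stand in for the ID columns of the two parquet files. IDs are compared as strings because the two files may store them with different types.

```python
from typing import Hashable, Iterable


def diff_community_ids(
    community_ids: Iterable[Hashable],
    report_ids: Iterable[Hashable],
) -> dict[str, list[str]]:
    """Return the IDs that appear in only one of the two tables.

    Hypothetical helper: IDs are normalized to strings before comparison.
    """
    communities = {str(c) for c in community_ids}
    reports = {str(r) for r in report_ids}
    return {
        "only_in_reports": sorted(reports - communities),
        "only_in_communities": sorted(communities - reports),
    }


# Made-up sample IDs standing in for the two parquet files.
result = diff_community_ids([0, 1, 2], ["-1", "0", "1", "2"])
print(result)
# {'only_in_reports': ['-1'], 'only_in_communities': []}
```

With pandas installed, the inputs could come from pd.read_parquet(...) on each file, passing the relevant ID column. Which column holds the community ID in each file is an assumption to verify against the actual output.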

Expected Behavior

The dummy community (-1) should not be included when generating create_final_community_reports.parquet. This would prevent the creation of unnecessary or incorrect community reports.
Community representation should be consistent across both files. Either:
- Use titles like "Community xxx" consistently in both files, or
- Use community headers with IDs in both files.

This consistency would ensure clarity and facilitate easy correlation between the two files.
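
As a rough sketch of the expected behavior, the snippet below drops the dummy community and attaches a "Community xxx" style title to each remaining report row. It is not the GraphRAG implementation: the function normalize_reports, the row layout, the "community" and "community_title" keys, and the exact title format are all assumptions made for illustration.

```python
from typing import Any

DUMMY_COMMUNITY_ID = -1  # assumption: the placeholder ID described in this issue


def normalize_reports(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop the dummy community and add a title matching the communities file.

    Hypothetical helper; "community" and "community_title" are assumed keys.
    """
    cleaned = []
    for row in rows:
        community_id = int(row["community"])
        if community_id == DUMMY_COMMUNITY_ID:
            continue  # skip the report generated for unassigned nodes
        cleaned.append(
            {
                **row,
                "community": community_id,
                "community_title": f"Community {community_id}",
            }
        )
    return cleaned


# Made-up rows standing in for create_final_community_reports.parquet.
sample = [
    {"community": "-1", "summary": "report for unassigned nodes"},
    {"community": "3", "summary": "report for community 3"},
]
print(normalize_reports(sample))
```

Filtering at report-generation time would avoid the extra LLM call for the dummy community; the post-hoc filter here only shows the intended result.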

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini-2024-07-18
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output\lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  raw_entities: false
  top_level_nodes: false
  embeddings: true
  transient: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"

Logs and screenshots

Additional Information

GraphRAG Version: 0.5.0
Operating System: Windows 10
Python Version: 3.12.7
Related Issues:

IT-Bill added the bug (Something isn't working) and triage (Default label assignment, indicates new issue needs reviewed by a maintainer) labels on Nov 19, 2024.