[Bug]: Inconsistencies and Extra Row in Community Reports #1422

Open
IT-Bill opened this issue Nov 19, 2024 · 0 comments
Labels
bug (Something isn't working), triage (Default label assignment, indicates new issue needs reviewed by a maintainer)

IT-Bill commented Nov 19, 2024

Describe the bug

There are two issues with the outputs of create_final_community_reports.parquet and create_final_communities.parquet:

  1. Extra Row in Community Reports:
    create_final_community_reports.parquet contains one more row than create_final_communities.parquet. The likely cause is the hierarchical Leiden algorithm: isolated nodes are not assigned to any community and are instead placed in a dummy community (-1). This dummy community appears to be included unintentionally when create_final_community_reports.parquet is generated, which produces an unnecessary and possibly incorrect community report.

  2. Inconsistent Community Representation:
    create_final_communities.parquet identifies communities by titles like "Community xxx", while create_final_community_reports.parquet identifies them by community headers with IDs. This inconsistency makes it difficult to align the information between the two files.
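
To illustrate the alignment problem, here is a minimal sketch of how the two representations could be reconciled by hand. It assumes titles have exactly the form "Community <id>" and that reports carry a numeric community ID; the function names and the sample values are hypothetical and not part of GraphRAG.

```python
import re

# Assumption: titles in create_final_communities.parquet look like "Community 12".
TITLE_PATTERN = re.compile(r"^Community (-?\d+)$")


def community_id_from_title(title):
    """Return the integer id embedded in a 'Community <id>' title."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        raise ValueError(f"unexpected community title: {title!r}")
    return int(match.group(1))


def align_by_id(community_titles, report_ids):
    """Map each report id to its matching title, or None when no title matches."""
    title_by_id = {community_id_from_title(t): t for t in community_titles}
    return {int(r): title_by_id.get(int(r)) for r in report_ids}


# Hypothetical sample values: two real communities plus the dummy community.
aligned = align_by_id(["Community 0", "Community 1"], [-1, 0, 1])
print(aligned)
```

A report ID that maps to None has no counterpart in the communities file, which is how the dummy community would show up under these assumptions.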

Steps to reproduce

  1. Run the indexing pipeline (for example, `graphrag index`) with the config below so that it generates create_final_community_reports.parquet and create_final_communities.parquet.
  2. Compare the number of rows in create_final_community_reports.parquet and create_final_communities.parquet.
  3. Observe the dummy community (-1) in create_final_community_reports.parquet.
  4. Check the naming conventions for communities in both files and note the inconsistency.
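
Steps 2 and 3 can be checked with a small script. The following is a rough sketch using only the standard library; it takes the community IDs from each file as plain sequences. The function name and the sample IDs are hypothetical, and which column holds the ID in each file is an assumption to verify against your own output.

```python
def compare_community_ids(community_ids, report_ids):
    """Compare the community ids found in the two output tables."""
    # Normalize to strings so that 0 and "0" are treated as the same id.
    comm = [str(c) for c in community_ids]
    rep = [str(r) for r in report_ids]
    return {
        "row_diff": len(rep) - len(comm),
        "only_in_reports": sorted(set(rep) - set(comm)),
        "only_in_communities": sorted(set(comm) - set(rep)),
    }


# Hypothetical sample ids mirroring the reported symptom.
community_ids = [0, 1, 2]
report_ids = [-1, 0, 1, 2]
result = compare_community_ids(community_ids, report_ids)
print(result)
```

With pandas installed, the ID sequences could be read from the real files via `pd.read_parquet(path)[column]`, where `column` is whichever column holds the community ID in that file.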

Expected Behavior

  1. The dummy community (-1) should not be included when generating create_final_community_reports.parquet. This would prevent the creation of unnecessary or incorrect community reports.
  2. Community representation should be consistent across both files. Either:
    • Use titles like "Community xxx" consistently in both files, or
    • Use community headers with IDs in both files.

This consistency would ensure clarity and facilitate easy correlation between the two files.
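
As a sketch of the first expectation, the dummy community could be filtered out before reports are generated. This is not GraphRAG's actual implementation; the row layout, the `community` key, and the function name are assumptions made for illustration.

```python
# Assumption: -1 marks the dummy community that collects unassigned nodes.
DUMMY_COMMUNITY_ID = -1


def drop_dummy_community(rows, key="community"):
    """Return only the rows that belong to a real community."""
    return [row for row in rows if int(row[key]) != DUMMY_COMMUNITY_ID]


# Hypothetical node-to-community rows; ids may arrive as ints or strings.
rows = [
    {"community": -1, "node": "isolated"},
    {"community": 0, "node": "a"},
    {"community": "1", "node": "b"},
]
kept = drop_dummy_community(rows)
print(kept)
```

Filtering at this point would keep the report count equal to the community count, provided nothing else adds or drops rows later in the pipeline.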

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini-2024-07-18
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb
    db_uri: 'output\lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # or blob
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # or blob
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  raw_entities: false
  top_level_nodes: false
  embeddings: true
  transient: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"

Logs and screenshots

[Three images attached to the original issue; not reproduced here]

Additional Information

  • GraphRAG Version: 0.5.0
  • Operating System: Windows 10
  • Python Version: 3.12.7
  • Related Issues:

IT-Bill added the bug and triage labels on Nov 19, 2024