
STaRK-Prime answers wrong? #9

Open
LacombeLouis opened this issue Jul 3, 2024 · 4 comments

@LacombeLouis

LacombeLouis commented Jul 3, 2024

While exploring the STaRK-Prime dataset, I looked into a few questions (human-generated ones specifically) and discovered a couple of answers that seem strange: the answer to the question is the topic entity itself.

For example, take question index 47 of the human-generated STaRK-Prime dataset: "What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?". The answer ID is 61686, and the name of node 61686 is "2,3',4,4',5-pentachlorobiphenyl", which is already mentioned in the question. I see the same type of result for question index 62.

Is this the expected behavior? If so, could you explain why? I would have expected the answers to differ from the topic entity (especially in the human-generated set).

You can reproduce this by running the following code:

from stark_qa import load_qa, load_skb

dataset_name = 'prime'

qa_dataset = load_qa(dataset_name, human_generated_eval=True)
idx_split = qa_dataset.get_idx_split()

skb = load_skb(dataset_name, download_processed=False, root='.')

qa_dataset[47]
# Output
("What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?",
 47,
 [61686],
 None)

print(skb.get_doc_info(61686, add_rel=True))
# Output
- name: 2,3',4,4',5-pentachlorobiphenyl
- type: exposure
- source: CTD
- relations:
  parent-child: {exposure: (2,2',3',4,4',5-hexachlorobiphenyl, 2,4,4',5-tetrachlorobiphenyl, Endocrine Disruptors, Environmental Pollutants, Pesticides, Polychlorinated Biphenyls, 2,2',3,3',4,4',5-heptachlorobiphenyl, 2,3,3',4,4',5-hexachlorobiphenyl, 2,4,5,2',4',5'-hexachlorobiphenyl, Hydrocarbons, Chlorinated, Organic Chemicals, Thyroxine, Triiodothyronine),}
  interacts_with: {gene/protein: (TSHB, SERPINA7),biological_process: (thyroid hormone metabolic process, cognition, regulation of thyroid-stimulating hormone secretion, production of molecular mediator of immune response, regulation of bone mineralization, hypermethylation of CpG island, male meiosis chromosome separation),}
  linked_to: {disease: (osteoporosis, metabolic syndrome X, non-Hodgkin lymphoma, respiratory tract infectious disease, fatty liver disease, colorectal neoplasm),}
@Wuyxin
Collaborator

Wuyxin commented Jul 4, 2024

Hi, thanks for reporting the issue!

This issue, as you mentioned, only exists in the human-generated dataset and is not expected. The reason for such mislabeling is that one or two participants did not write their queries as we intended.

By my estimation, the number of such queries should be small, but feel free to let me know if there are other problematic ones besides queries 47 (in the code example) and 62.

We are checking the human-generated dataset for STaRK-Prime again. Future versions of our human-generated datasets will have such queries removed. I will post here once this is updated. Thanks.
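In the meantime, a minimal workaround sketch for excluding the queries flagged so far (assuming indices 47 and 62 from this thread, and that each dataset item's second field is the query index, as in the outputs above):

from stark_qa import load_qa

# Query indices flagged in this thread; extend as more are identified
# (illustrative list, not an official exclusion list).
known_bad_indices = {47, 62}

qa_dataset = load_qa('prime', human_generated_eval=True)

# Keep only items whose query index is not in the flagged set.
filtered = [item for item in qa_dataset if item[1] not in known_bad_indices]
print('Remaining queries:', len(filtered))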

@LacombeLouis
Author

Here is the code that I used (very simple):

from stark_qa import load_qa, load_skb
from stark_qa.tools.process_text import normalize_answer

dataset_name = 'prime'

# Load the retrieval dataset
qa_dataset = load_qa(dataset_name, human_generated_eval=True)

# Load the semi-structured knowledge base
skb = load_skb(dataset_name, download_processed=False, root='.')

def check_word_in_text(word, text):
    return word in text

def check_similarity_question_answer(question, list_answers, show=False):
    question_ = normalize_answer(question)
    for answer in list_answers:
        answer_ = normalize_answer(answer)
        if check_word_in_text(answer_, question_):
            if show:
                print('Answer:', answer)
                print('Question:', question)
                print("-"*10)
            return True
    return False


def check_questions(qa_dataset, max_number_answers=5, show=False):
    exclude_questions = []
    for item in qa_dataset:
        question_ = item[0]
        list_answer_ = item[2]

        list_answer_names_ = []
        for answer_ in list_answer_:
            list_answer_names_.append(skb[int(answer_)].name)

        # Check whether one of the answer names appears in the question text
        if check_similarity_question_answer(question_, list_answer_names_, show=show):
            print('Question index:', item[1])
            exclude_questions.append(item[1])

    exclude_questions = list(set(exclude_questions))
    if show:
        print('Number of questions to exclude:', len(exclude_questions))

    qa_dataset_filtered = []
    for item in qa_dataset:
        if item[1] not in exclude_questions:
            qa_dataset_filtered.append(item)
    return qa_dataset_filtered

filtered_questions = check_questions(qa_dataset, show=True)

# Output
Answer: mixed mucinous and nonmucinous bronchioloalveolar adenocarcinoma
Question: mixed mucinous and nonmucinous bronchioloalveolar adenocarcinoma is a subtype of what disease?
----------
Question index: 1
Answer: MTND5P11
Question: Is MTND5P11 expressed in any part of the brain?
----------
Question index: 27
Answer: 2,3',4,4',5-pentachlorobiphenyl
Question: What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?
----------
Question index: 47
Answer: HIF3A
Question: The protein encoded by HIF3A is associated with negative regulation of what?
----------
Question index: 62
Answer: Protein repair
Question: Complex machine learning methods like alpha fold could help scientists study protein repair and which other pathways?
----------
Question index: 82


Number of questions to exclude: 5

@alexlorenzo

Same issue for:
Question id: 7
"My friend has been prescribed Tasonermin, what diseases might they have?"
Answer: "Cancer"

But the source (DrugBank) also lists "sarcoma".
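A sketch for inspecting the ground-truth answer set of this query (assuming the same load_qa/load_skb setup as above, and that positional index 7 corresponds to question id 7):

from stark_qa import load_qa, load_skb

qa_dataset = load_qa('prime', human_generated_eval=True)
skb = load_skb('prime', download_processed=False, root='.')

question, q_id, answer_ids, _ = qa_dataset[7]
print(question)
for node_id in answer_ids:
    # Look up each answer node's name in the knowledge base.
    print(node_id, skb[int(node_id)].name)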

@Wuyxin
Collaborator

Wuyxin commented Jul 11, 2024

Thanks for mentioning this too! There could be some missing answers (entities that should be part of the answer set but were not included) when we constructed the ground truth, because the LLMs used for validation could misclassify them as not satisfying the query. We did a study to estimate this in Section 2.4, 4) Filtering Additional Answers.

Re the previous comment: thanks for the code! It is helpful; we are checking the other questions as well.
