
STaRK-Prime answers wrong? #9

Open
LacombeLouis opened this issue Jul 3, 2024 · 4 comments

@LacombeLouis

LacombeLouis commented Jul 3, 2024

While exploring the STaRK-Prime dataset, I looked into a few questions (human-generated ones specifically) and discovered a couple of answers that seem strange: the answer to the question is the topic entity itself.

For example, take question index 47 of the human-generated STaRK-Prime dataset: "What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?". The answer ID is 61686, and the name of node 61686 is "2,3',4,4',5-pentachlorobiphenyl", which is already mentioned in the question. I see the same type of result for question index 62.

Is this the expected behavior? If so, could you explain why? I would have expected the answers to differ from the topic entity (especially in the human-generated set).

You can reproduce this by running the following code:

from stark_qa import load_qa, load_skb

dataset_name = 'prime'

qa_dataset = load_qa(dataset_name, human_generated_eval=True)
idx_split = qa_dataset.get_idx_split()

skb = load_skb(dataset_name, download_processed=False, root='.')

qa_dataset[47]
# Output
("What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?",
 47,
 [61686],
 None)

print(skb.get_doc_info(61686, add_rel=True))
# Output
- name: 2,3',4,4',5-pentachlorobiphenyl
- type: exposure
- source: CTD
- relations:
  parent-child: {exposure: (2,2',3',4,4',5-hexachlorobiphenyl, 2,4,4',5-tetrachlorobiphenyl, Endocrine Disruptors, Environmental Pollutants, Pesticides, Polychlorinated Biphenyls, 2,2',3,3',4,4',5-heptachlorobiphenyl, 2,3,3',4,4',5-hexachlorobiphenyl, 2,4,5,2',4',5'-hexachlorobiphenyl, Hydrocarbons, Chlorinated, Organic Chemicals, Thyroxine, Triiodothyronine),}
  interacts_with: {gene/protein: (TSHB, SERPINA7),biological_process: (thyroid hormone metabolic process, cognition, regulation of thyroid-stimulating hormone secretion, production of molecular mediator of immune response, regulation of bone mineralization, hypermethylation of CpG island, male meiosis chromosome separation),}
  linked_to: {disease: (osteoporosis, metabolic syndrome X, non-Hodgkin lymphoma, respiratory tract infectious disease, fatty liver disease, colorectal neoplasm),}
@Wuyxin
Collaborator

Wuyxin commented Jul 4, 2024

Hi, thanks for reporting the issue!

This issue, as you mentioned, only exists in the human-generated dataset and is not expected. The reason for such mislabeling is that one or two participants did not write their queries as we intended.

By my estimation, the number of such queries should be small, but feel free to let me know if there are other problematic ones besides queries 47 (in the code example) and 62.

We are checking the human-generated dataset for STaRK-Prime again. Future versions of our human-generated datasets will have such queries removed. I will post here once this is updated. Thanks.
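In the meantime, a minimal workaround sketch for excluding the queries flagged so far (assuming indices 47 and 62 from this thread, and that each dataset item's second field is the query index, as in the outputs above):

from stark_qa import load_qa

# Query indices flagged in this thread; extend as more are identified
# (illustrative list, not an official exclusion list).
known_bad_indices = {47, 62}

qa_dataset = load_qa('prime', human_generated_eval=True)

# Keep only items whose query index is not in the flagged set.
filtered = [item for item in qa_dataset if item[1] not in known_bad_indices]
print('Remaining queries:', len(filtered))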

@LacombeLouis
Author

Here is the code that I used (very simple):

from stark_qa import load_qa, load_skb
from stark_qa.tools.process_text import normalize_answer

dataset_name = 'prime'

# Load the retrieval dataset
qa_dataset = load_qa(dataset_name, human_generated_eval=True)

# Load the semi-structured knowledge base
skb = load_skb(dataset_name, download_processed=False, root='.')

def check_word_in_text(word, text):
    return word in text

def check_similarity_question_answer(question, list_answers, show=False):
    question_ = normalize_answer(question)
    for answer in list_answers:
        answer_ = normalize_answer(answer)
        if check_word_in_text(answer_, question_):
            if show:
                print('Answer:', answer)
                print('Question:', question)
                print("-"*10)
            return True
    return False


def check_questions(qa_dataset, max_number_answers=5, show=False):
    exclude_questions = []
    for item in qa_dataset:
        question_ = item[0]
        list_answer_ = item[2]

        list_answer_names_ = []
        for answer_ in list_answer_:
            list_answer_names_.append(skb[int(answer_)].name)

        # Check whether one of the answer names appears in the question text
        if check_similarity_question_answer(question_, list_answer_names_, show=show):
            print('Question index:', item[1])
            exclude_questions.append(item[1])

    exclude_questions = list(set(exclude_questions))
    if show:
        print('Number of questions to exclude:', len(exclude_questions))

    qa_dataset_filtered = []
    for item in qa_dataset:
        if item[1] not in exclude_questions:
            qa_dataset_filtered.append(item)
    return qa_dataset_filtered

filtered_questions = check_questions(qa_dataset, show=True)

# Output
Answer: mixed mucinous and nonmucinous bronchioloalveolar adenocarcinoma
Question: mixed mucinous and nonmucinous bronchioloalveolar adenocarcinoma is a subtype of what disease?
----------
Question index: 1
Answer: MTND5P11
Question: Is MTND5P11 expressed in any part of the brain?
----------
Question index: 27
Answer: 2,3',4,4',5-pentachlorobiphenyl
Question: What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?
----------
Question index: 47
Answer: HIF3A
Question: The protein encoded by HIF3A is associated with negative regulation of what?
----------
Question index: 62
Answer: Protein repair
Question: Complex machine learning methods like alpha fold could help scientists study protein repair and which other pathways?
----------
Question index: 82


Number of questions to exclude: 5

@alexlorenzo

Same issue for:
Question id: 7
"My friend has been prescribed Tasonermin, what diseases might they have?"
Answer: "Cancer"

But the source (DrugBank) also lists "sarcoma".
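A sketch for inspecting the ground-truth answer set of this query (assuming the same load_qa/load_skb setup as above, and that positional index 7 corresponds to question id 7):

from stark_qa import load_qa, load_skb

qa_dataset = load_qa('prime', human_generated_eval=True)
skb = load_skb('prime', download_processed=False, root='.')

question, q_id, answer_ids, _ = qa_dataset[7]
print(question)
for node_id in answer_ids:
    # Look up each answer node's name in the knowledge base.
    print(node_id, skb[int(node_id)].name)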

@Wuyxin
Collaborator

Wuyxin commented Jul 11, 2024

Thanks for mentioning this too! There could be some missing answers (entities that should be part of the answer set but were not included) when we constructed the ground truth, because the LLMs used for validation could misclassify them as not satisfying the query. We did a study to estimate this in Section 2.4, 4) Filtering Additional Answers.

Re the previous comment: thanks for the code! It is helpful; we are checking the other questions as well.
