Testing the reproducibility of the hybrid approach #2

Soudeh-Jahanshahi · 2023-11-27T14:32:51Z

The script "xml_translate.py" has a bug when processing 32 annotated XML files!

defect_list = [27817193, 28240519, 28244787, 28438127, 28670879, 28707850, 28749127, 28749635, 28843255, 29095577, 29099159, 29116736, 29132205, 29172291, 29206099, 29220461, 29235983, 29283531, 29373899, 29374411, 29388757, 29451968, 29481028, 29533587, 29616530, 29630142, 29644823, 29688353, 29688370, 29693981, 29716180, 29801411]

For the PMIDs in this list, the single TSV file is not generated correctly: the code splits the title and abstract of each record across different lines.
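One way to locate such records is to count the tab-separated fields on each physical line of the TSV. The following is only a minimal sketch: the helper name `find_split_rows`, the three-column layout (pmid, title, abstract), and the sample text are assumptions for illustration, not taken from xml_translate.py.

```python
def find_split_rows(tsv_text, expected_columns):
    """Return 1-based line numbers whose tab-separated field count is wrong."""
    bad_lines = []
    for line_number, line in enumerate(tsv_text.splitlines(), start=1):
        if len(line.split("\t")) != expected_columns:
            bad_lines.append(line_number)
    return bad_lines


# Hypothetical sample: the second record is broken across two lines.
sample = (
    "pmid\ttitle\tabstract\n"
    "1\tA title\tAn abstract\n"
    "2\tA broken\n"
    "title\tAn abstract\n"
)
bad = find_split_rows(sample, expected_columns=3)
print(bad)
```

Note that this check treats every physical line as a record, so it would also flag fields that are deliberately quoted and contain a line break.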

rohitharavinder · 2023-11-30T18:09:36Z

The formatting of the TSV seems to be disrupted due to certain special characters present in the XML files listed above. To handle these characters within the TSV file, we utilized the "quotechar" parameter.

The relevant code can be found at line 335 in the xml_translate.py script, where the XML is converted into a TSV format using the following functionality:

publications_df.to_csv(output_file, sep="\t", index=False, quotechar="`")

Certain special characters, such as Î, Â±, ≥, %, and a few more, are part of the text. The '%' symbol, while generally a regular symbol, can cause a formatting issue in specific cases where there is no whitespace between the number and the '%' symbol, e.g., 20% vs. 20 %.
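To illustrate how the "quotechar" parameter behaves, here is a small sketch with pandas. It assumes, hypothetically, that one of the problematic abstracts contains an embedded line break; the thread does not confirm that this is the actual cause, and the record below is made up.

```python
import io

import pandas as pd

# Hypothetical record: the abstract contains a line break and a "%" sign.
publications_df = pd.DataFrame(
    {
        "pmid": [12345],
        "title": ["Example title"],
        "abstract": ["First line with 20%\nsecond line"],
    }
)

# Same call as in the script, but writing to an in-memory buffer.
buffer = io.StringIO()
publications_df.to_csv(buffer, sep="\t", index=False, quotechar="`")
raw_tsv = buffer.getvalue()

# A reader that splits on newlines sees the record spread over two lines.
physical_lines = raw_tsv.strip().splitlines()

# Parsing with the same quotechar restores a single row.
restored = pd.read_csv(io.StringIO(raw_tsv), sep="\t", quotechar="`")
```

Under this assumption the file looks broken when it is read line by line, but a parser that is given the same quotechar recovers one row per PMID.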

Nothing to be fixed.

ljgarcia · 2024-04-10T12:26:08Z

@Soudeh-Jahanshahi this is marked as nothing to fix, but how does the bug that originated this issue affect the approaches you are working on? Does it have an effect, or was it just something that you observed and that got your attention? Please clarify, thanks.

Soudeh-Jahanshahi · 2024-04-10T14:11:34Z

@ljgarcia: These 32 annotated XML files do not contribute to the post-processing approach. Specifically, if they are part of the input data, their tokens only contribute to building the Word2Vec model; during post-annotation, the presence of MeSH terms in the corresponding documents is ignored. However, compared with the size of the entire dataset, ignoring these documents in post-processing would have only a negligible impact on the final evaluation results.

ljgarcia assigned Soudeh-Jahanshahi Nov 29, 2023

ljgarcia closed this as completed Apr 10, 2024

ljgarcia reopened this Apr 10, 2024
