Testing the reproducibility of the hybrid approach #2

Open
Soudeh-Jahanshahi opened this issue Nov 27, 2023 · 3 comments
@Soudeh-Jahanshahi
Contributor

Soudeh-Jahanshahi commented Nov 27, 2023

The script "xml_translate.py" has a bug when processing 32 of the annotated XML files.

defect_list = [27817193, 28240519, 28244787, 28438127, 28670879, 28707850, 28749127, 28749635, 28843255, 29095577, 29099159, 29116736, 29132205, 29172291, 29206099, 29220461, 29235983, 29283531, 29373899, 29374411, 29388757, 29451968, 29481028, 29533587, 29616530, 29630142, 29644823, 29688353, 29688370, 29693981, 29716180, 29801411]

For the PMIDs in this list, the single TSV file is not generated correctly: the title and abstract of each record are split across different lines.
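A quick way to locate such records is to compare the number of tab-separated fields on each physical line with the number in the header. The sketch below is only an illustration: the function name find_broken_lines, the column names, and the sample data are hypothetical and are not taken from xml_translate.py.

```python
def find_broken_lines(tsv_text, sep="\t"):
    """Return (line_number, field_count) for every physical line whose
    number of fields differs from the header's field count."""
    lines = tsv_text.split("\n")
    expected = len(lines[0].split(sep))
    broken = []
    for number, line in enumerate(lines[1:], start=2):
        if not line:
            continue  # skip empty lines such as the trailing one
        count = len(line.split(sep))
        if count != expected:
            broken.append((number, count))
    return broken


# Hypothetical sample: the abstract of the second record spills onto line 4.
sample = (
    "pmid\ttitle\tabstract\n"
    "111\tFirst title\tFirst abstract\n"
    "222\tSecond title\tSecond abstract, part one\n"
    "part two of the abstract\n"
)

print(find_broken_lines(sample))
```

A line flagged here is not necessarily corrupt: it can also be the continuation of a quoted field, so the result is best read as a list of candidates to inspect.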

@rohitharavinder
Contributor

rohitharavinder commented Nov 30, 2023

The TSV formatting appears to be disrupted by certain special characters in the XML files listed above. To handle these characters within the TSV file, we use the "quotechar" parameter.

The relevant code is at line 335 of the xml_translate.py script, where the XML is converted to TSV with the following call:

publications_df.to_csv(output_file, sep="\t", index=False, quotechar="`")

Certain special characters, such as Î, ±, ≥, %, and a few more, are part of the text. Although '%' is normally an ordinary character, it can cause a formatting issue in specific cases where there is no whitespace between the number and the '%' symbol, e.g., 20% vs. 20 %.

Nothing to be fixed.
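To illustrate why the reader has to use the same quote character as the writer, here is a minimal sketch. It uses Python's standard csv module as a stand-in for the pandas call, and it assumes that the split is caused by a line break inside a field, which this thread does not confirm. The records and column names are hypothetical.

```python
import csv
import io

QUOTE = "`"  # same quote character as in the to_csv call above

# Hypothetical records; the second abstract contains a line break.
rows = [
    ["pmid", "title", "abstract"],
    ["111", "First title", "Growth rose by 20%"],
    ["222", "Second title", "line one\nline two"],
]

# Write the records as TSV; fields containing a line break get quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quotechar=QUOTE, lineterminator="\n")
writer.writerows(rows)
tsv_text = buffer.getvalue()


def read_rows(text, quotechar):
    """Parse TSV text with the given quote character."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quotechar=quotechar))


matched = read_rows(tsv_text, QUOTE)   # reader agrees with the writer
mismatched = read_rows(tsv_text, '"')  # reader uses the default quote character

print(len(matched), len(mismatched))
```

With the matching quote character the second record comes back as one row with its line break intact; with the default one it is read as two rows. pandas.read_csv also accepts a quotechar argument, so the same consideration would apply when the TSV is loaded with pandas.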

@ljgarcia
Contributor

@Soudeh-Jahanshahi this is marked as nothing to fix, but how does the bug that originated this issue affect the approaches you are working on? Does it have an effect, or was it just something you observed that caught your attention? Please clarify, thanks.

@ljgarcia ljgarcia reopened this Apr 10, 2024
@Soudeh-Jahanshahi
Contributor Author

Soudeh-Jahanshahi commented Apr 10, 2024

@ljgarcia: These 32 annotated XML files do not contribute to the post-processing approach. Specifically, if they are part of the input data, their tokens contribute only to building the Word2Vec model; during post-annotation, the MeSH terms present in the corresponding documents are ignored. However, relative to the size of the entire dataset, ignoring these documents in post-processing would have only a negligible impact on the final evaluation results.
