Author info missing when analyzing papers published in PubPub #6

Open
stefanofasciani opened this issue Dec 2, 2022 · 4 comments

@stefanofasciani
Collaborator

The analyzer fails to retrieve author information for papers published in PubPub, and therefore all location-based analysis fails.

The current PubPub publishing interface and the NIME author guidelines do not guarantee consistent and complete author information in the NIME proceedings. Indeed, looking at the 2021 papers, there are papers with author names only, papers with author names and affiliations, and papers in which all the "traditional" information is provided (name, affiliation, email).

Author information is partially hidden in PubPub (you have to click the "show details" button at the top right). However, the analyzer downloads the XML directly from PubPub, and this includes only the author names; the additional information visible under "show details" is missing.

A possible workaround is to download the PDF generated by PubPub and proceed as for pre-2021 papers (using Grobid to process the PDF files). However, the PDFs from PubPub are malformed (although this can be fixed in the analyzer script).

At the time of opening this issue, the NIME paper bibtex file still does not include the 2022 papers. At some point, the organizers of the 2022 conference downloaded the LaTeX files from PubPub and used them to build paper PDF files in the traditional column format.

Before making any modification to the proceedings analyzer, it is important to understand what the current and future publishing format for NIME papers will be. If PDF papers come back at some point (perhaps also for 2021), we can simply scrap the current handling of PubPub (or perhaps handle 2021 manually as an exception).

@stefanofasciani stefanofasciani added the enhancement New feature or request label Dec 2, 2022
@stefanofasciani stefanofasciani changed the title Problem with papers published in PubPub Author info missing when analyzing papers published in PubPub Dec 2, 2022
@stefanofasciani
Collaborator Author

It is also worth considering that NIME 2023 will not use PubPub.
[attached image]

@jacksongoode
Owner

jacksongoode commented Dec 2, 2022

Wow, this would be a very significant change... and one that would hinder projects like this one. It seems a lot of the feedback has been about the editing process and the PDF rendering, which I feel are complaints about style and traditional processes? But the decision has been made, I assume? This shouldn't have any major impact on this project now, but I was hoping a digital/structured solution would enable immediate, non-machine-learning parsing in the future.

@stefanofasciani
Collaborator Author

The 2022 proceedings have been added to the NIME bibtex file. Although the proceedings are still stored in PubPub, the 2022 bibtex entries are different (the URL field contains the DOI and no longer the string "pubpub"). This can be easily fixed by changing line 277 of pa_extract.py to "if 'pubpub' in pub['url'] or 'doi.org' in pub['url']:". However, we will still suffer from the same problem (i.e. we cannot fetch author information).
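
A note on the condition for line 277: in Python, a non-empty string literal on the left of `or` is always truthy, so each substring test has to be written out in full. The following is a minimal sketch of such a check, not the actual code in pa_extract.py; the helper name `is_pubpub_entry` and the example URLs are hypothetical.

```python
def is_pubpub_entry(url: str) -> bool:
    """Return True if a bibtex URL mentions PubPub or is a DOI link."""
    # Each marker is tested against the URL separately. Writing
    # `'pubpub' or 'doi.org' in url` would evaluate the literal 'pubpub'
    # first and therefore be truthy for every URL.
    return any(marker in url for marker in ("pubpub", "doi.org"))


# Hypothetical URLs, for illustration only.
print(is_pubpub_entry("https://nime.pubpub.org/pub/example"))                # True
print(is_pubpub_entry("https://doi.org/10.0000/example"))                    # True
print(is_pubpub_entry("https://www.nime.org/proceedings/2020/example.pdf"))  # False
```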

Since PubPub may no longer be used in the future, we can opt to treat 2021 and 2022 as "exceptions" and manually download and store the PDFs in the repository.

@stefanofasciani
Collaborator Author

Furthermore, the following code to download the XML from PubPub (in pa_load.py) no longer works:

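                if pub['puppub'] and '.xml' not in url:
                    # The landing page embeds a link to the JATS XML; extract it and fetch that file
                    url = re.search(r'jats","url":"(.*?\.xml)', r.text).group(1)
                    r = session.get(url)
                # Save whatever was downloaded (XML or PDF) to disk
                with open(dl_path + fn, 'wb') as f:
                    f.write(r.content)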

In particular, it seems that PubPub blocks the download attempt after recognizing that there is no human with a browser on the other side. Indeed, the downloaded XML does not include any paper-related information, only the following (plus some other content I did not check):

    <div id="challenge-body-text" class="core-msg spacer">
        assets.pubpub.org needs to review the security of your connection before proceeding.
    </div>
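
One way to avoid silently saving such a page as if it were paper XML would be to check the response body for the challenge text before writing it to disk. This is only a sketch under assumptions: the function name `is_challenge_page` is hypothetical, and the markers are taken from the snippet above, so they may not cover every variant of the challenge page.

```python
# Byte strings taken from the challenge page shown above (assumed to be
# stable enough to act as markers).
CHALLENGE_MARKERS = (
    b"challenge-body-text",
    b"needs to review the security of your connection",
)


def is_challenge_page(content: bytes) -> bool:
    """Return True if the downloaded bytes look like a bot-challenge page."""
    return any(marker in content for marker in CHALLENGE_MARKERS)


blocked = b'<div id="challenge-body-text" class="core-msg spacer"></div>'
print(is_challenge_page(blocked))                 # True
print(is_challenge_page(b"<article></article>"))  # False
```

In the download code, a check like this could run on `r.content` so that a blocked response is logged or skipped instead of being written to `dl_path`.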

However, for 2022, traditional two-column PDF papers have been generated by the organizers (PubPub --> LaTeX --> PDF) and are somewhat hidden here: https://www.nime.org/proceedings/2022/115.pdf (the last part of the path is the "pdf" field in the bibtex).
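
As a rough sketch, the PDF location could be rebuilt from a bibtex entry as follows. This assumes that the entry is a dict with `year` and `pdf` keys and that every paper follows the same path pattern as the example URL above; the helper name `nime_pdf_url` is hypothetical, and the pattern has only been observed for the one link given here.

```python
NIME_PROCEEDINGS_BASE = "https://www.nime.org/proceedings"


def nime_pdf_url(pub: dict) -> str:
    """Build the PDF URL from the 'year' and 'pdf' fields of a bibtex entry."""
    # Assumption: the directory is named after the conference year and the
    # file name is exactly the value of the bibtex "pdf" field.
    return f"{NIME_PROCEEDINGS_BASE}/{pub['year']}/{pub['pdf']}"


entry = {"year": "2022", "pdf": "115.pdf"}
print(nime_pdf_url(entry))  # https://www.nime.org/proceedings/2022/115.pdf
```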
