Empty affiliation strings from PMCID #1

Open
trangdata opened this issue Aug 24, 2020 · 1 comment

Comments

@trangdata

I'm trying to extract affiliation information for a given PMCID. For example, for PMC6986235, I tried the following:

from lxml.etree import tostring

art = get_frontmatter_etree_via_api('PMC6986235')
print(tostring(art, encoding='unicode'))

Part of the output contains the affiliation of the corresponding author:

<aff id="A1">Georgetown University, Department of Oncology and Lombardi
Comprehensive Cancer Center, Washington, DC, 20007.</aff>

However, when I tried

extract_authors_from_article(art)

the affiliations list is empty for every author:

[{'pmcid': 'PMC6986235',
  'position': 1,
  'fore_name': 'Ziling',
  'last_name': 'Fan',
  'corresponding': 0,
  'reverse_position': 3,
  'affiliations': []},
 {'pmcid': 'PMC6986235',
  'position': 2,
  'fore_name': 'Yuan',
  'last_name': 'Zhou',
  'corresponding': 0,
  'reverse_position': 2,
  'affiliations': []},
 {'pmcid': 'PMC6986235',
  'position': 3,
  'fore_name': 'Habtom W.',
  'last_name': 'Ressom',
  'corresponding': 0,
  'reverse_position': 1,
  'affiliations': []}]

It is possible that we can't extract this information because of the way journals deposited the metadata. I just wanted to make sure that there is not a better alternative than skipping these articles entirely.

@dhimmel
Owner

dhimmel commented Aug 28, 2020

I think the problem is that there is an affiliation with the id A1, but none of the authors is linked to it. If you have the author frontmatter XML handy, we could confirm this.
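One way to confirm this would be to list the affiliation ids that no author points to. The following is only a rough sketch, not code from this project: the XML fragment is a hypothetical example modeled on the output above, the function name is made up, and it assumes un-namespaced JATS-style tags where authors reference affiliations through <xref ref-type="aff" rid="...">. It uses the standard library's ElementTree, although the same calls should also work on an lxml tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (not the real record): one <aff id="A1"> and
# authors that carry no <xref ref-type="aff"> pointing at it.
SAMPLE = """
<article-meta>
  <contrib-group>
    <contrib contrib-type="author">
      <name><surname>Fan</surname><given-names>Ziling</given-names></name>
    </contrib>
    <contrib contrib-type="author">
      <name><surname>Zhou</surname><given-names>Yuan</given-names></name>
    </contrib>
  </contrib-group>
  <aff id="A1">Georgetown University, Department of Oncology and Lombardi
Comprehensive Cancer Center, Washington, DC, 20007.</aff>
</article-meta>
"""


def find_unlinked_affiliation_ids(root):
    """Return sorted ids of <aff> elements that no <contrib> references."""
    aff_ids = {aff.get("id") for aff in root.iter("aff") if aff.get("id")}
    linked = set()
    for contrib in root.iter("contrib"):
        for xref in contrib.iter("xref"):
            if xref.get("ref-type") == "aff":
                # A single rid attribute may hold several space-separated ids.
                linked.update((xref.get("rid") or "").split())
    return sorted(aff_ids - linked)


root = ET.fromstring(SAMPLE)
print(find_unlinked_affiliation_ids(root))
```

If this returns every affiliation id for an article, the empty affiliations lists come from missing links in the deposited metadata rather than from a parsing bug.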

The only workaround I see is to assume that a single affiliation that is not linked to any author applies to all authors. I'm not sure how many articles this affects. If it affects many, perhaps this is something we could implement.

However, it would stop working if there were multiple affiliations, since we then couldn't match affiliation to author.

If there's only a single author and many affiliations, we could assume that all affiliations apply to that author, although there may be situations where this backfires. I don't know.
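As a rough sketch of what such a fallback might look like (nothing here exists in the package; the function name and the plain list of affiliation strings are assumptions, and the author dicts only mimic the 'affiliations' key from the output above):

```python
def fill_unlinked_affiliations(authors, affiliations):
    """Hypothetical fallback for articles whose affiliations are not linked.

    authors: list of dicts with an 'affiliations' list, as in the output above.
    affiliations: list of affiliation strings found in the article.
    Returns a new list of author dicts; the input is not modified.
    """
    # If any author already has a linked affiliation, trust the metadata.
    if any(author["affiliations"] for author in authors):
        return authors
    # Unambiguous cases: a single affiliation shared by all authors,
    # or a single author who gets all affiliations.
    if len(affiliations) == 1 or len(authors) == 1:
        return [{**author, "affiliations": list(affiliations)} for author in authors]
    # Multiple authors and multiple affiliations: no way to match them up.
    return authors


authors = [
    {"last_name": "Fan", "affiliations": []},
    {"last_name": "Zhou", "affiliations": []},
]
filled = fill_unlinked_affiliations(authors, ["Georgetown University"])
print([author["affiliations"] for author in filled])
```

With several authors and several unlinked affiliations the authors are returned unchanged, which matches the limitation described above.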
