Create feature identification API #320

Open
paulalbert1 opened this issue Mar 13, 2019 · 2 comments
Labels
Disambiguation Profile: Creation of disambiguation profile

Comments

@paulalbert1
Contributor

paulalbert1 commented Mar 13, 2019

Create a disambiguation profile API

Mockup

Screen Shot 2020-04-02 at 11 57 34 AM

Business need

Certain profiled users have features that aren't properly captured in an identity source. This is especially true for users on our periphery such as alumni and residents. For example, ReCiter doesn't perform well for Harold Varmus, because we have no record in a source system for his time at NIH. We would like to give users the opportunity to assert additional characteristics of themselves and have those assertions used to improve accuracy.

With an API, we can display features in an external interface and ask users to accept or reject these features.

Note that this API would need to run against the output of the feature generator. It would also require a query against the Identity table. Building the feedback interface belongs in a separate issue.

Process

1. Compute and score the suggested articles for a given person.

2. From the articles in the Analysis output whose score is greater than the minimum threshold for storing articles and which have not been rejected, identify any of the following features.

a. Institutional affiliations (Scopus Author ID and label)
b. Organizational units (label)
- Look up org units in the ScienceMetrixJournalDepartmentCategory.primaryDepartment field by matching one of the following against the affiliation:
  - "Department of " + [primaryDepartment]
  - "Division of " + [primaryDepartment]
  - "Dept of " + [primaryDepartment]
  - any of the other patterns defined in org unit matching
- Look for patterns from org unit synonyms
- Some org units in ScienceMetrixJournalDepartmentCategory are substrings of each other. Match to the longest unit if possible.

c. Aliases of target author

  • Sanitize names using standard function
  • Identify cases where targetAuthor=TRUE and the name:
    • is not listed among the existing aliases or the primary name.
    • does not have a first name that is a single initial, or 2-3 initials in all capitals.
  • Dedupe substrings. Prefer longer versions.

d. Aliases of non-target authors

  • Sanitize names using standard function
  • Identify cases where targetAuthor=FALSE and the name:
    • is not listed among the existing aliases or the primary name.
    • does not have a first name that is a single initial, or 2-3 initials in all capitals.
  • Dedupe substrings. Prefer longer versions.
  • CWID shouldn’t be required for importing additional relationships

e. Email address(es)

f. ORCID identifier(s)
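
Steps 2b through 2d could be sketched roughly as follows. This is an illustration only, not ReCiter code: the function names, the short prefix list, and the reading of "dedupe substrings" (drop a name whose first name is contained in a longer candidate's first name with the same last name) are all assumptions. Names are assumed to be already sanitized and already filtered on the targetAuthor flag.

```python
import re

# Only the three prefixes spelled out above; the "other patterns defined in
# org unit matching" are not reproduced here.
PREFIXES = ("Department of ", "Division of ", "Dept of ")


def match_org_unit(affiliation, departments):
    """Return the longest department whose prefixed form occurs in the
    affiliation string, or None if nothing matches (hypothetical helper)."""
    text = affiliation.lower()
    hits = [d for d in departments
            if any((p + d).lower() in text for p in PREFIXES)]
    return max(hits, key=len) if hits else None


def is_initials_only(first_name):
    """True for a single initial ("J", "J.") or 2-3 capital initials ("JP")."""
    letters = re.sub(r"[^A-Za-z]", "", first_name)
    return len(letters) == 1 or (2 <= len(letters) <= 3 and letters.isupper())


def new_aliases(author_names, known_names):
    """Filter (first, last) tuples down to candidate aliases.

    author_names: names seen on articles, pre-filtered by targetAuthor.
    known_names:  the primary name plus aliases already on record.
    """
    known = {(f.lower(), l.lower()) for f, l in known_names}
    candidates = {(f, l) for f, l in author_names
                  if (f.lower(), l.lower()) not in known
                  and not is_initials_only(f)}
    kept = []
    # Visit longer first names first so that shorter substrings get dropped.
    for f, l in sorted(candidates, key=lambda n: (-len(n[0]), n)):
        covered = any(l.lower() == kl.lower() and f.lower() in kf.lower()
                      for kf, kl in kept)
        if not covered:
            kept.append((f, l))
    return kept
```

With these assumptions, an affiliation containing "Department of Microbiology and Immunology" resolves to "Microbiology and Immunology" rather than the shorter "Microbiology".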

3. Compute a score for each feature

  • Generally speaking, we want to prioritize features associated with high scoring articles.
  • We create arrays for each of the above, e.g., {emails: palbert324234@gmail.com, paul2323@aol.com...}
  • Count the number of each, e.g., for palbert324234@gmail.com, n=3
  • For each distinct feature, multiply the count by the average score of the candidate articles associated with that feature, with the average raised to some constant N, e.g., 3 x ((8.1 + 12.2 + 13.1)/3)^N
  • N is averageArticleScore-DisambiguationExponent. Default is 1.5.
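
A minimal sketch of the scoring in step 3, following the worked example (count times the average article score raised to N). The function name and the input shape, a list of (feature value, article score) pairs, are assumptions made for illustration.

```python
from collections import defaultdict


def score_features(observations, exponent=1.5):
    """Score each distinct feature value.

    observations: iterable of (feature_value, article_score) pairs, one pair
                  per article on which the feature was seen.
    exponent:     stands in for averageArticleScore-DisambiguationExponent.
    Assumes article scores are positive; a negative average combined with a
    fractional exponent would not yield a real number.
    """
    scores_by_feature = defaultdict(list)
    for value, article_score in observations:
        scores_by_feature[value].append(article_score)

    return {
        value: len(scores) * (sum(scores) / len(scores)) ** exponent
        for value, scores in scores_by_feature.items()
    }


# The example from the text: one email seen on three articles.
example = score_features([
    ("palbert324234@gmail.com", 8.1),
    ("palbert324234@gmail.com", 12.2),
    ("palbert324234@gmail.com", 13.1),
])
```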

4. Determine status of each feature

  • Options are:
    • assertedInSystemOfRecord (i.e., the feature is already present in the Identity table)
    • accepted
    • rejected
    • null
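
One way the status in step 4 might be resolved is sketched below. The precedence (a match in the Identity table wins over stored user feedback) is an assumption, since the issue does not specify an order, and `feature_status` with its arguments is a hypothetical helper.

```python
def feature_status(value, identity_values, feedback):
    """Return the status for a single feature value.

    identity_values: set of values already present in the Identity table.
    feedback:        dict mapping a value to "accepted" or "rejected".
    Returns None (serialized as null) when nothing is known about the value.
    """
    if value in identity_values:
        return "assertedInSystemOfRecord"
    return feedback.get(value)
```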

5. Output features

{  
   "uid":"paa2013",
   "emails":[  
      {  
         "email":"palbert324234@gmail.com",
         "status":"accepted",
         "score":53.2
      },
      {  
         "email":"paul2323@aol.com",
         "status":"assertedInSystemOfRecord",
         "score":31.7
      }
   ],
   "targetAuthorAliases":[  
      {  
         "firstName":"Paul",
         "lastName":"Shmalbert",
         "status":"null",
         "score":32.1
      }
   ],
   "nonTargetAuthorAliases":[  
      {  
         "firstName":"Joe",
         "lastName":"Schmoe",
         "status":"assertedInSystemOfRecord",
         "score":32.1
      }
   ],
   "organizationalUnits":[  
      {  
         "organizationLabel":"Physiology",
         "status":"rejected",
         "score":53.2
      },
      {  
         "organizationLabel":"Biophysics",
         "status":"assertedInSystemOfRecord",
         "score":31.7
      }
   ],
   "institutionalAffiliations":[  
      {  
         "targetAuthorInstitutionalAffiliationArticleScopusLabel":"Weill Cornell Medical College",
         "targetAuthorInstitutionalAffiliationArticleScopusAffiliationId":60007997,
         "status":"null",
         "score":32.1
      }
   ],
   "orcidIdentifiers":[  
      {  
         "orcid":"http://orcid.org/0000-0003-3115-4777",
         "status":"null",
         "score":32.1
      }
   ]
}
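
As a rough illustration of how an external interface might consume this output, the sketch below pulls out the features that still need a user decision, highest score first. The function name is hypothetical, and it treats both a JSON null and the string "null" (as written in the sample above) as undecided.

```python
import json


def pending_features(response_json):
    """Return (feature_type, item) pairs with no decision yet, by score."""
    profile = json.loads(response_json)
    pending = []
    for feature_type, items in profile.items():
        if not isinstance(items, list):
            continue  # skip scalar fields such as "uid"
        for item in items:
            if item.get("status") in (None, "null"):
                pending.append((feature_type, item))
    return sorted(pending, key=lambda pair: pair[1]["score"], reverse=True)
```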

Notes

See here for some up-to-date thoughts on possible features.

@sarbajitdutta
Contributor

Possible use case using this https://aws.amazon.com/personalize/

@paulalbert1 paulalbert1 changed the title Create disambiguation profile API Create feature identification profile API Apr 2, 2020
@paulalbert1 paulalbert1 changed the title Create feature identification profile API Create feature identification API Apr 2, 2020
@paulalbert1
Contributor Author

paulalbert1 commented Apr 18, 2020

Some examples where this would be helpful.
Screen Shot 2020-04-18 at 7 16 26 AM

For Harold Varmus, co-author = Peter Vogt.

Screen Shot 2021-07-23 at 12 42 09 PM

Institutional affiliation = UCSF and department = Microbiology & Immunology

Screen Shot 2021-07-23 at 12 42 48 PM
