EuPMC returns a different number of results by API than by UI #95

Open
tarrow opened this issue Apr 26, 2016 · 6 comments

@tarrow
Contributor

tarrow commented Apr 26, 2016

No description provided.

@tarrow tarrow self-assigned this Apr 26, 2016
@blahah
Member

blahah commented Apr 26, 2016

Yup, the UI search seems pretty broken to me at the moment. A search for an article title often returns 20 other articles before the one with the exact name.

@petermr
Member

petermr commented Apr 26, 2016

Ah - the EPMC UI is giving massive false positives? That makes sense. The API seems to give fewer hits.

We should probably have a filter that checks whether the paper actually contains the search phrase or words. If it doesn't, maybe we have to filter it out? Or does Lucene do concept searches?
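
A minimal sketch of the kind of filter suggested above, in Python. Everything here is an assumption for illustration: the function names, the `text` field on each paper record, and the choice of plain literal word matching. It is not how getpapers or EuPMC work, and a literal check like this would also discard hits that a Lucene index matched through stemming or synonyms.

```python
import re


def contains_query_terms(text, query, phrase=False):
    """Return True if `text` literally contains the query.

    phrase=True  -> the query words must appear as one contiguous phrase
    phrase=False -> every query word must appear somewhere, in any order
    """
    # Lowercase and reduce both sides to plain word tokens.
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    words = re.findall(r"\w+", query.lower())
    if not words:
        return False
    if phrase:
        # Pad with spaces so only whole words can match.
        return f" {' '.join(words)} " in f" {normalized} "
    tokens = set(normalized.split())
    return all(word in tokens for word in words)


def filter_hits(papers, query, phrase=False):
    """Keep only papers whose (hypothetical) `text` field contains the query."""
    return [p for p in papers
            if contains_query_terms(p.get("text", ""), query, phrase)]


papers = [
    {"id": "a", "text": "Ebola and malaria co-infection in West Africa"},
    {"id": "b", "text": "Dengue outbreak surveillance"},
]
kept = filter_hits(papers, "malaria ebola")
```

With the sample records, `kept` contains only paper `a`; passing `phrase=True` with `"ebola malaria"` would drop it as well, because the two words are not adjacent in the text.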

@blahah
Member

blahah commented Apr 26, 2016

Lucene can do lots of different kinds of searches - it depends on which indexers have been set up. For example, it can do normal NLP processing and match shared stems, or it can resolve synonyms, and so on. I think we should, for now, trust the results from eupmc. However, my yunodb (https://github.com/blahah/yunodb) is built for doing this kind of refined search on the client side.

@blahah
Member

blahah commented Apr 26, 2016

Note that iterative filtering is on the general to-do list and will be in science fair's miner.

@tarrow
Contributor Author

tarrow commented Apr 29, 2016

I'm now struggling to replicate this issue. See the table below: only the queries that return a very large number of results deviate by more than ±1. Perhaps they've fixed a bug.

Something to keep an eye on is that the UI groups the whole search with () before appending OPEN_ACCESS:Y. Perhaps there are some cases (I haven't yet found them) where we also need to do this.
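
To make the grouping point concrete, here is a small Python sketch. The helper name `with_open_access` and its `group` flag are made up for illustration; only the `OPEN_ACCESS:Y` filter and the idea of wrapping the user's query in `()` come from the observation above.

```python
def with_open_access(query, group=True):
    """Append the OPEN_ACCESS:Y filter to a user query.

    group=True wraps the whole user query in parentheses first,
    which is what the UI appears to do.
    """
    query = query.strip()
    if group:
        query = f"({query})"
    return f"{query} AND OPEN_ACCESS:Y"


grouped = with_open_access("ebola OR malaria")
ungrouped = with_open_access("ebola OR malaria", group=False)
```

`grouped` is `(ebola OR malaria) AND OPEN_ACCESS:Y`, while `ungrouped` is `ebola OR malaria AND OPEN_ACCESS:Y`. Depending on how the search backend resolves operator precedence, the ungrouped form might apply the open-access filter to the last term only, which is the sort of case where the two hit counts could diverge.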

| Literal query typed or passed to API | hitCount from API | Hits from EuropePMC.org |
| --- | --- | --- |
| "malaria" | 132592 | 134242 |
| malaria | 132800 | 134242 |
| ebola | 9869 | 9869 |
| malaria AND ebola | 1096 | 1097 |
| ebola AND OPEN_ACCESS:Y | 2933 | 2933 |
| ebola malaria AND OPEN_ACCESS:Y | 559 | 560 |
| "ebola malaria" AND OPEN_ACCESS:Y | 8 | 8 |
| "ebola malaria" | 10 | 10 |
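
For anyone who wants to re-check the deviations as the counts change, the comparison can be sketched in Python as follows. The numbers are copied from the table above; the function name and the ±1 tolerance parameter are just illustrative choices.

```python
# (query, hitCount from API, hits from EuropePMC.org), copied from the table.
rows = [
    ('"malaria"', 132592, 134242),
    ('malaria', 132800, 134242),
    ('ebola', 9869, 9869),
    ('malaria AND ebola', 1096, 1097),
    ('ebola AND OPEN_ACCESS:Y', 2933, 2933),
    ('ebola malaria AND OPEN_ACCESS:Y', 559, 560),
    ('"ebola malaria" AND OPEN_ACCESS:Y', 8, 8),
    ('"ebola malaria"', 10, 10),
]


def deviating_queries(rows, tolerance=1):
    """Return (query, ui - api) for rows whose counts differ by more than `tolerance`."""
    return [(query, ui - api)
            for query, api, ui in rows
            if abs(ui - api) > tolerance]


outliers = deviating_queries(rows)
for query, diff in outliers:
    print(f"{query}: UI reports {diff} more hits than the API")
```

Only the two single-term malaria queries exceed the tolerance, with the UI reporting 1650 and 1442 more hits than the API, each a little over 1% of the UI count.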

@sedimentation-fault

@blahah is right - trust the API. The online interface seems to be "enriched" with extra results. While this may increase recall, it definitely decreases precision.

See my comment in #140 for the whole story.


4 participants