"Malformed response from arXiv API - no data in feed" woes... #179

sedimentation-fault opened this issue Aug 27, 2019
I have been having a hard time getting my queries to complete lately - they run into near-infinite loops of messages like:

Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
...

The queries actually return far fewer than 50000 results, the supposed limit of arXiv's API - they lie anywhere between 3000 and 12000 results. Here is an example:

category='math.AG'; start_date='20170101'; end_date='20190827'; getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug

In this issue (not strictly a 'bug') I document my attempts to get past those showstoppers. Here's what I did:

Set page size to 1000

I experimented with page sizes from 200 to 2000:

  • At 200, it takes ages to get all 10000+ results, and the much larger number of queries needed to fetch them all raises the risk of entering the above-mentioned infinite loop of death.
  • At 2000, you get many responses that contain far fewer than 2000 results - yet the feed is not completely empty, so this is currently not detected. See #177 ("arXiv API feed contains less data than page size - but getpapers starts new query with the next start parameter") for a description of this bug and a solution.
  • At 500, it still takes too long to get them all.
  • At 1000, you get more results at once, you finish faster, you send fewer queries - and the risk of entering the infinite loop of death is no higher than with just 500. Plus: you don't automatically get just 200 results back, as seems to be the case with 2000...

I thus settled on a page size of 1000 in getpapers/lib/arxiv.js:

arxiv.pagesize = 1000
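
For reference, here is a minimal sketch (illustrative only, not the actual getpapers internals - the function name and constants are my own) of how a page size of 1000 maps onto the arXiv API's documented start and max_results parameters:

// Illustrative sketch, not getpapers code: page through the arXiv API
// in chunks of PAGE_SIZE using the start/max_results parameters.
const PAGE_SIZE = 1000
const BASE = 'http://export.arxiv.org/api/query'

async function* pages(searchQuery) {
  for (let start = 0; ; start += PAGE_SIZE) {
    const url = BASE + '?search_query=' + encodeURIComponent(searchQuery) +
                '&start=' + start + '&max_results=' + PAGE_SIZE
    const xml = await (await fetch(url)).text()  // Atom XML feed
    if (!xml.includes('<entry>')) break          // "no data in feed" - stop (or retry)
    yield xml
  }
}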

Set a higher delay between retries

I experimented with various delays too: the default 3 seconds hammer the API far too fast, while 30 seconds is too much sleeping. 15 or 20 seconds seem to be O.K., so I have set

arxiv.page_delay = 20000

in getpapers/lib/arxiv.js
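
To show what that delay buys you, here is a rough sketch (the function and variable names are my own, not getpapers') of a retry loop that sleeps page_delay milliseconds before re-requesting a page whose feed came back without data:

// Rough sketch, names are illustrative: back off PAGE_DELAY ms between retries
// instead of hammering the API every 3 seconds.
const PAGE_DELAY = 20000  // 20 seconds, as settled on above
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function fetchWithRetry(fetchPage, retries = 5) {
  for (let attempt = 0; attempt < retries; attempt++) {
    const feed = await fetchPage()
    if (feed && feed.includes('<entry>')) return feed  // got data, done
    await sleep(PAGE_DELAY)                            // wait before trying again
  }
  throw new Error('Malformed response from arXiv API - no data in feed')
}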

Do not urlencode the whole query URL, only the parts that need it

See #178 for this.
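
The gist of it, in a sketch of my own (not the actual patch in #178): encode only the search_query value, never the full URL, so the URL's own reserved characters stay intact:

// Sketch of the idea only - see #178 for the real change.
const query = 'cat:math.AG AND lastUpdatedDate:[20170101* TO 20190827*]'

// Wrong: encoding the whole URL also mangles '?', '=' and '&'
const wrong = encodeURIComponent('http://export.arxiv.org/api/query?search_query=' + query)

// Right: encode just the parameter value
const right = 'http://export.arxiv.org/api/query?search_query=' + encodeURIComponent(query)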

Correct the bug where the results feed is not empty - but not full either...

See #177 for details.
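
Roughly, the idea (my own paraphrase of #177, not the literal diff) is to advance the start offset only by the number of entries actually received, so a short-but-non-empty feed does not leave a gap:

// Paraphrase of the fix described in #177, not the actual patch.
function nextStart(currentStart, entriesReceived, pageSize) {
  if (entriesReceived === 0) return currentStart                         // empty feed: retry the same page
  if (entriesReceived < pageSize) return currentStart + entriesReceived  // short page: don't skip results
  return currentStart + pageSize                                         // full page: move on as usual
}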

Last but not least... (I will repeat myself on this): do yourself a favour and spoof your User Agent in getpapers/lib/config.js:

config.userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'
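
If you want to check that the spoofed agent actually goes out with every request, the header ends up looking something like this (a sketch using Node's built-in fetch, not the getpapers request code, which wires config.userAgent into its own request layer):

// Sketch only - illustrates sending the spoofed User-Agent header.
const userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'

async function fetchFeed(url) {
  const res = await fetch(url, { headers: { 'User-Agent': userAgent } })
  return res.text()
}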

With the above changes in place, things have been getting better for me - and I hope the same for you too! :-)
