arXiv.org PDFs denying access to User-Agent "getpapers/(TDM Crawler contact@contentmine.org)" #167

Open
larsgw opened this issue Oct 18, 2017 · 9 comments

Comments

@larsgw

larsgw commented Oct 18, 2017

See #166 (comment). Note that User-Agent: getpapers/TDM seems to be working (for me) again (for now).

@tarrow
Contributor

tarrow commented Oct 19, 2017

We have clearly made a mistake here. I imagine that we're hammering them too hard, not leaving a delay between requests, etc.

Probably the answer is to

  1. fix it so we're playing by their rules
  2. release a new version with the new version number in the UserAgent
  3. let them know that we've fixed it in the new version (should they keep blocking the old one?)

Also: did we get an email at contact@contentmine.org?
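
Step 2 above could be sketched roughly as follows. This is a minimal illustration in Python, not the getpapers code itself: the function name `build_user_agent`, the placement of the version after the slash, and the version string `0.0.0-example` are all assumptions made for the example. Only the `getpapers` name and the `TDM Crawler contact@contentmine.org` comment come from the blocked header.

```python
def build_user_agent(version: str,
                     contact: str = "contact@contentmine.org",
                     tool: str = "getpapers") -> str:
    """Return a User-Agent string that carries the release version.

    Hypothetical helper: the exact layout of the header is an assumption,
    chosen so that a server can tell fixed releases from old ones.
    """
    if not version:
        raise ValueError("version must be a non-empty string")
    return f"{tool}/{version} (TDM Crawler {contact})"


# "0.0.0-example" is a placeholder, not a real getpapers release number.
print(build_user_agent("0.0.0-example"))
```

With the version in the header, arXiv could in principle keep blocking the old string while allowing the fixed release.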

@sedimentation-fault

...or simply spoof the UserAgent header with some innocent string and move on. :-)
A list of the most common UA strings can be found at:
https://techblog.willshouse.com/2012/01/03/most-common-user-agents/

@rossmounce
Member

I also bumped into this issue just now. A pity...

@merkys

merkys commented Sep 24, 2018

Same problem here. It's important to keep the arXiv downloader working, so I suggest:

  1. fix it so we're playing by their rules

For starters, crawl delays of at least 15 seconds must be introduced.

  3. let them know that we've fixed it in the new version (should they keep blocking the old one?)

Yes and yes. It is a bit strange that arXiv discourages automated access to /api, but this is probably (?) a bug.
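
The crawl delay suggested above might look something like this. It is a sketch in Python under stated assumptions, not the getpapers implementation: the `Throttle` class and its `wait` method are hypothetical names, and the 15-second default is simply the figure proposed in this comment, not a value confirmed by arXiv. The clock and sleep functions are injectable so the logic can be checked without actually waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive requests (hypothetical helper)."""

    def __init__(self,
                 min_delay: float = 15.0,  # seconds; the figure suggested in this thread
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request, if any

    def wait(self) -> float:
        """Pause until min_delay has elapsed since the last call.

        Returns the number of seconds slept (0.0 if no pause was needed).
        """
        pause = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            pause = max(0.0, self.min_delay - elapsed)
            if pause > 0:
                self._sleep(pause)
        self._last = self._clock()
        return pause


# Intended use: call throttle.wait() immediately before each download, e.g.
#   throttle = Throttle()
#   for url in urls:
#       throttle.wait()
#       fetch(url)   # fetch is a placeholder for whatever performs the request
```

The first call returns immediately; each later call sleeps only for whatever part of the 15 seconds has not already passed.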

@petermr
Member

petermr commented Sep 25, 2018

OK I will write to Paul.

@rossmounce
Member

@petermr actually, you might be better off emailing the lead software architect at arXiv (Erick Peirson). I've found him to be quite helpful & communicative: brp53@cornell.edu https://erickpeirson.github.io/

@petermr
Member

petermr commented Sep 25, 2018

Thanks Ross.

@sdruskat

sdruskat commented Aug 9, 2019

Just to reiterate: arXiv will block your IP (or your employer's) if you use the getpapers API as is to try to download PDFs.

May I suggest switching off the PDF download for now until a refactoring to conform with the current API guidelines is in place?

@petermr
Member

petermr commented Aug 9, 2019 via email


7 participants