arXiv.org PDFs denying access to User-Agent "getpapers/(TDM Crawler contact@contentmine.org)" #167

larsgw · 2017-10-18T14:26:58Z

See #166 (comment). Note that User-Agent: getpapers/TDM seems to be working (for me) again (for now).

tarrow · 2017-10-19T08:04:00Z

We have clearly made a mistake here. I imagine that we're hammering them too hard, not following a delay between requests, etc.

Probably the answer is to:

- fix it so we're playing by their rules
- release a new version with the new version number in the UserAgent
- let them know that we've fixed it in the new version (should they keep blocking the old one?)
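
The second step above could look roughly like the following sketch. The helper names `build_user_agent` and `build_headers` and the version string `0.0.0` are placeholders invented for illustration, and Python is used only as a neutral example language; the contact address is the one from the issue title.

```python
def build_user_agent(tool: str, version: str, contact: str) -> str:
    """Return a User-Agent that names the tool, its version and a contact."""
    return f"{tool}/{version} (TDM Crawler; {contact})"


def build_headers(version: str) -> dict:
    # Hypothetical helper: headers a client could send with every request.
    return {
        "User-Agent": build_user_agent(
            "getpapers", version, "contact@contentmine.org"
        )
    }


if __name__ == "__main__":
    # "0.0.0" is a placeholder, not a real getpapers release number.
    print(build_headers("0.0.0")["User-Agent"])
```

Putting the version in the string would let a server operator block only the misbehaving release instead of every version of the tool.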

Also: did we get an email to contact@contentmine.org?

sedimentation-fault · 2017-12-17T18:40:02Z

...or simply spoof the UserAgent header with some innocent string and move on. :-)
A list of the most common UA strings can be found at:
https://techblog.willshouse.com/2012/01/03/most-common-user-agents/

rossmounce · 2018-08-22T21:03:07Z

I also bumped into this issue just now. A pity...

merkys · 2018-09-24T14:52:24Z

Same problem here. It's important to keep the arXiv downloader working, so I suggest:

> fix it so we're playing by their rules

For starters, crawl delays of at least 15 seconds must be introduced.
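
A minimal sketch of such a delay, assuming a single sequential downloader. The class name `Throttle` and its parameters are invented for illustration; only the 15-second figure comes from the comment above.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=15.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each download would space requests at least 15 seconds apart; the first call returns without sleeping.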

> let them know that we've fixed it in the new version (should they keep blocking the old one?)

Yes and yes. It is a bit strange that arXiv discourages automated access to /api, but this is probably (?) a bug.
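
If the discouragement mentioned above comes from robots.txt rules (an assumption; the comment does not say), Python's standard `urllib.robotparser` can report what a given User-Agent may fetch and which crawl delay is requested. The rules below are a made-up sample, not a copy of arXiv's file, and `example.org` and the version `0.0.0` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up sample rules for illustration only; NOT arXiv's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 15
Disallow: /api
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
agent = "getpapers/0.0.0"  # placeholder version

# Ask whether this agent may fetch two hypothetical URLs, and how long to wait.
api_allowed = rules.can_fetch(agent, "https://example.org/api/query")
abs_allowed = rules.can_fetch(agent, "https://example.org/abs/1234")
delay = rules.crawl_delay(agent)
print(api_allowed, abs_allowed, delay)
```

A crawler could run a check like this once at start-up and use the reported delay as its minimum interval between requests.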

petermr · 2018-09-25T09:41:35Z

OK I will write to Paul.

rossmounce · 2018-09-25T14:41:50Z

@petermr actually, you might be better off emailing the lead software architect at arXiv (Erick Peirson). I've found him to be quite helpful & communicative: brp53@cornell.edu https://erickpeirson.github.io/

petermr · 2018-09-25T19:20:41Z

Thanks Ross.

sdruskat · 2019-08-09T09:39:54Z

Just to reiterate: arXiv will block your IP (or your employer's) if you use the getpapers API as is to try to download PDFs.

May I suggest switching off the PDF download for now until a refactoring to conform with the current API guidelines is in place?

petermr · 2019-08-09T19:35:05Z

Thank you. I'll add a comment

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

sedimentation-fault mentioned this issue Aug 27, 2019

"Malformed response from arXiv API - no data in feed" woes... #179

Open
