Corrupt record handling #53

Open
Ariel225 opened this issue Feb 4, 2021 · 2 comments
Ariel225 commented Feb 4, 2021

Certain records seem to cause a crash. We have narrowed it down to this query, which should retrieve all records submitted in a one-minute period of 22:16 to 22:17 on January 24, 2018.

dfy<-arxiv_search(query = "submittedDate:[201801242216 TO 201801242217]", limit = 15000, batchsize=2000)

which returns an error of:

> Error in attr(results, "search_info") <- search_attributes(query, id_list,  : 
>   attempt to set an attribute on NULL
> 

We can isolate the record, which appears to be this one:
https://arxiv.org/abs/1610.04266

If we search by title instead, the same error appears:
dfy<-arxiv_search(query = "ti:Fourfolds", limit = 1200, batchsize=300)
We therefore think the record itself may be corrupt (e.g., it may contain a hidden, unintentional column delimiter).

A similar error occurs on this single-date range, though we have not isolated the individual record causing the error:
dfy<-arxiv_search(query = "submittedDate:[201612030000 TO 201612040000]", limit = 15000, batchsize=2000)
Does the query need to be modified? Can the query auto-skip corrupt records? Should arxiv be notified?
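One way to isolate the offending window for the second date range is to bisect it. The following is only a rough sketch: `narrow_window`, `make_query`, and the `fails` predicate are hypothetical names, not part of the aRxiv package, and `fails(from, to)` is assumed to return TRUE when a search over that window errors. A stub predicate is used so the sketch runs without network access.

```r
# Hypothetical helper: shrink a failing time window by repeated halving.
# `fails(from, to)` is an assumed user-supplied predicate (not an aRxiv function).
narrow_window <- function(from, to, fails, min_width = 60) {
  while (as.numeric(difftime(to, from, units = "secs")) > min_width) {
    mid <- from + as.numeric(difftime(to, from, units = "secs")) / 2
    if (fails(from, mid)) {
      to <- mid
    } else if (fails(mid, to)) {
      from <- mid
    } else {
      break  # the failure only shows up on the combined window
    }
  }
  list(from = from, to = to)
}

# Build a query string in the same format used in this report.
make_query <- function(from, to) {
  fmt <- function(t) format(t, "%Y%m%d%H%M", tz = "UTC")
  paste0("submittedDate:[", fmt(from), " TO ", fmt(to), "]")
}

# Stub for illustration: pretend a single made-up timestamp triggers the failure.
bad <- as.POSIXct("2016-12-03 13:37:00", tz = "UTC")
stub_fails <- function(from, to) from <= bad && bad <= to

w <- narrow_window(as.POSIXct("2016-12-03 00:00:00", tz = "UTC"),
                   as.POSIXct("2016-12-04 00:00:00", tz = "UTC"),
                   stub_fails)
print(make_query(w$from, w$to))
```

In real use, `fails` would wrap the actual search call in `tryCatch` and return TRUE on error. Note that this only helps if the failure is tied to specific records rather than to the number of results requested.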

kbroman (Member) commented Feb 4, 2021

Thanks for your very clear bug report! I'll look into the details. I see that arxiv_search(query="ti:Fourfolds", limit=100) works but arxiv_search(query="ti:Fourfolds", limit=101) gives the error.

I'll follow both of your suggestions: trap such errors better and also report the problem to arxiv, if there's a problem either with the record or with their API.
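Until such errors are trapped inside the package, a caller could guard the call on their side. This is a minimal sketch, not the package's fix: `safe_search` is a hypothetical wrapper, and the search function is passed in as an argument (here a stub) so the example runs without the aRxiv package or network access.

```r
# Hypothetical wrapper; `search_fn` stands in for a function such as
# arxiv_search and is supplied by the caller.
safe_search <- function(search_fn, ...) {
  tryCatch(
    search_fn(...),
    error = function(e) {
      message("search failed: ", conditionMessage(e))
      NULL  # signal failure to the caller instead of aborting
    }
  )
}

# Stub that fails with the same message as in this report (illustration only).
failing_stub <- function(query, ...) {
  stop("attempt to set an attribute on NULL")
}

res <- safe_search(failing_stub, query = "ti:Fourfolds",
                   limit = 78, batchsize = 50)
print(is.null(res))
```

Returning NULL keeps a long loop over many queries alive; the caller can then log or retry the failed query with different `limit` or `batchsize` values.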

kbroman added the bug label Feb 4, 2021

kbroman (Member) commented Feb 4, 2021

Okay, I get it. For this search, you get proper results if limit <= 77, but if limit >= 78, it returns NULL. If batchsize < limit and you're in this latter case, you get the error about assigning attributes to NULL.

 > dim(result <- arxiv_search(query="ti:Fourfolds", limit=77))
 [1] 77 15
 > dim(result <- arxiv_search(query="ti:Fourfolds", limit=78))
 [1]  0 15
 > dim(result <- arxiv_search(query="ti:Fourfolds", limit=78, batchsize=50))
 retrieved batch 1
 Error in attr(results, "search_info") <- search_attributes(query, id_list,  :
   attempt to set an attribute on NULL
