nrow(arxiv_search()) is unpredictable #54

Open
vgherard opened this issue Jul 17, 2021 · 2 comments

vgherard commented Jul 17, 2021

Hi and thanks for this very nice package (it made my day!).

I'm trying to scrape the last, say, 15k papers from the hep-ph category, with:

res <- arxiv_search(
    "cat:hep-ph",
    limit = 15000,
    batchsize = 1000,
    sort_by = "submitted", ascending = FALSE
)

However, the number of rows in the returned data frame varies from call to call (usually it is around 10k, but once I got only 1k). I would love to provide a reproducible example, but could not come up with one.

I'm not sure whether this is due to aRxiv or arXiv 😃 Have you ever noticed something similar? Might it be related to your comments on #14?

Thanks,
Valerio

Member

kbroman commented Jul 17, 2021

I think the arXiv works best for searches that return a smaller number of documents, so if you're looking for a reproducible example, maybe focus on the results for a particular year.

res <- arxiv_search(
        "cat:hep-ph AND submittedDate:[1992 TO 1993]",
        limit=1000,
        batchsize=1000,
        sort_by="submitted",
        ascending=FALSE)

I don't know for sure, but I expect you're occasionally getting a connection error part-way through and getting truncated results.
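One way to catch a truncated result is to compare the number of rows with what arxiv_count() reports for the same query, and to rerun the search on a mismatch. The following is only a sketch under that assumption: search_checked and its max_tries, wait, search_fn and count_fn arguments are hypothetical names, not part of aRxiv.

```r
# Hypothetical helper (not part of aRxiv): rerun a search until the number of
# rows matches the count reported by the API, or give up after max_tries.
# search_fn and count_fn default to the aRxiv functions; they are arguments
# only so that the logic can be tried without network access.
search_checked <- function(query, limit, max_tries = 3, wait = 5,
                           search_fn = aRxiv::arxiv_search,
                           count_fn = aRxiv::arxiv_count, ...) {
  # The API may report more matches than we asked for, so cap at limit.
  expected <- min(count_fn(query), limit)
  res <- NULL
  for (i in seq_len(max_tries)) {
    res <- search_fn(query, limit = limit, ...)
    # NROW() returns 0 for NULL, so an empty result is treated as truncated.
    if (NROW(res) >= expected) return(res)
    message("try ", i, ": got ", NROW(res), " of ", expected, " rows")
    Sys.sleep(wait)
  }
  warning("giving up after ", max_tries, " tries; results may be truncated")
  res
}
```

A call such as search_checked("cat:hep-ph AND submittedDate:[1992 TO 1993]", limit = 1000, batchsize = 1000) would then either return the full set or warn that it could not.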

@vgherard
Author

Thanks for your help, I see. I found several similar issues on the arXiv API Google group, so I guess the problem (if any) is on their side. I'll either try to limit my queries or switch to the OAI-PMH interface.

Just for the record (not a reprex, just an instance of what I was saying above):

library(aRxiv)


query <- "cat:hep-ph AND submittedDate:[2020 TO 2021]"

res <- arxiv_search(
    query,
    limit = 10000,
    batchsize = 1000,
    sort_by = "submitted", ascending = FALSE
)
#> retrieved batch 1
#> retrieved batch 2
#> retrieved batch 3
#> retrieved batch 4
#> retrieved batch 5
#> retrieved batch 6

c(nrow(res), arxiv_count(query))
#> [1] 4100 6930

res <- arxiv_search(
    query,
    limit = 10000,
    batchsize = 1000,
    sort_by = "submitted", ascending = FALSE
)
#> retrieved batch 1
#> retrieved batch 2
#> retrieved batch 3
#> retrieved batch 4

c(nrow(res), arxiv_count(query))
#> [1] 3000 6930

Created on 2021-07-18 by the reprex package (v2.0.0)
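Limiting each query, as mentioned above, could look roughly like this: one search per year, with the pieces bound together afterwards. This is a sketch, not a tested recipe. year_queries and fetch_by_year are hypothetical helpers, and whether adjacent submittedDate ranges overlap at the boundary is an assumption that has not been checked, which is why the result is deduplicated on the id column.

```r
# Hypothetical helper: build one query string per year, following the
# submittedDate:[YYYY TO YYYY] form used earlier in this thread.
year_queries <- function(base, years) {
  sprintf("%s AND submittedDate:[%d TO %d]", base, years, years + 1L)
}

# Hypothetical helper: run one search per year and combine the results.
# search_fn defaults to aRxiv::arxiv_search; extra arguments such as limit
# and batchsize are passed through to it.
fetch_by_year <- function(base, years, search_fn = aRxiv::arxiv_search, ...) {
  parts <- lapply(year_queries(base, years), function(q) search_fn(q, ...))
  res <- do.call(rbind, parts)
  if (is.null(res)) return(res)
  # Drop papers returned by more than one window (assumed possible).
  res[!duplicated(res$id), , drop = FALSE]
}
```

For example, fetch_by_year("cat:hep-ph", 2020:2021, limit = 5000, batchsize = 1000) would issue two smaller searches instead of one large one. Each piece can still come back truncated, so comparing it with arxiv_count() for the same yearly query remains worthwhile.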
