`nrow(arxiv_search())` is unpredictable #54

vgherard · 2021-07-17T15:27:25Z

Hi and thanks for this very nice package (it made my day!).

I'm trying to scrape the last, say, 15k papers from the hep-ph category, with:

res <- arxiv_search(
	"cat:hep-ph",
	limit = 15000,
	batchsize = 1000,
	sort_by = "submitted", ascending = FALSE
	)

However, the number of rows in the returned data frame varies from run to run (usually it is around 10k, but once I got only 1k). I would love to provide a reproducible example but could not come up with one.

I'm not sure whether this is due to aRxiv or arXiv 😃 Have you ever noticed something similar? Might it have something to do with your comments on #14?

Thanks,
Valerio


kbroman · 2021-07-17T18:44:17Z

I think the arXiv API works best for searches that return a smaller number of documents, so if you're looking for a reproducible example, maybe focus on the results for a particular year.

res <- arxiv_search(
        "cat:hep-ph AND submittedDate:[1992 TO 1993]",
        limit=1000,
        batchsize=1000,
        sort_by="submitted",
        ascending=FALSE)

I don't know for sure, but I expect you're occasionally getting a connection error part-way through and getting truncated results.
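If truncated batches are indeed the cause (which is not confirmed here), one possible workaround is to request each batch separately through the `start` and `limit` arguments of `arxiv_search()` and retry any batch that comes back short. The sketch below is only an illustration: `fetch_with_retry`, `max_tries`, `wait` and `fetch` are hypothetical names, and `fetch` is a parameter only so that the logic can be tried with a stand-in function.

```r
# Fetch `total` results in batches, retrying batches that fail or come back short.
# Hypothetical helper; `fetch` defaults to aRxiv::arxiv_search.
fetch_with_retry <- function(query, total, batchsize = 1000, max_tries = 3,
                             wait = 3, fetch = aRxiv::arxiv_search) {
  if (total <= 0) return(NULL)
  batches <- list()
  for (start in seq(0, total - 1, by = batchsize)) {
    want <- min(batchsize, total - start)   # the last batch may be smaller
    batch <- NULL
    for (attempt in seq_len(max_tries)) {
      # Treat an error like an empty batch so that it is retried
      batch <- tryCatch(
        fetch(query, start = start, limit = want, batchsize = want,
              sort_by = "submitted", ascending = FALSE),
        error = function(e) NULL
      )
      if (NROW(batch) == want) break
      Sys.sleep(wait)                       # pause before the next attempt
    }
    if (NROW(batch) < want) {
      warning("batch starting at ", start, " is incomplete: got ",
              NROW(batch), " of ", want, " rows")
    }
    if (NROW(batch) > 0) batches[[length(batches) + 1]] <- batch
  }
  do.call(rbind, batches)
}
```

With the package loaded, a call might look like `fetch_with_retry(query, total = arxiv_count(query))`, so the number of rows requested matches what the API reports for the query.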

vgherard · 2021-07-18T07:15:07Z

Thanks for your help, I see. I found several similar issues on the arXiv API Google group, so I guess the problem (if any) is on their side. I'll either try to limit my queries or switch to the OAI-PMH interface.
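One way to limit the queries is to split a year into monthly `submittedDate` windows, so that each search returns far fewer documents. The helper below is a minimal sketch: `monthly_queries` is a hypothetical name, and the `YYYYMMDDHHMM` bounds assume the date format described in the arXiv API manual.

```r
# Build one query string per calendar month of `year` (hypothetical helper).
monthly_queries <- function(base, year) {
  first <- seq(as.Date(sprintf("%d-01-01", year)), by = "month", length.out = 12)
  # Last day of each month = first day of the following month minus one day
  last <- seq(first[2], by = "month", length.out = 12) - 1
  sprintf("%s AND submittedDate:[%s0000 TO %s2359]",
          base, format(first, "%Y%m%d"), format(last, "%Y%m%d"))
}

# Possible usage (needs network access, so it is left commented out):
# library(aRxiv)
# parts <- lapply(monthly_queries("cat:hep-ph", 2020), arxiv_search,
#                 limit = 2000, batchsize = 1000)
# res <- do.call(rbind, parts)
```

Comparing `nrow()` of each part with `arxiv_count()` for the same monthly query would show which windows, if any, came back short.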

Just for the record (not a reprex, just an instance of what I was saying above):

library(aRxiv)


query <- "cat:hep-ph AND submittedDate:[2020 TO 2021]"

res <- arxiv_search(
    query,
    limit = 10000,
    batchsize = 1000,
    sort_by = "submitted", ascending = FALSE
)
#> retrieved batch 1
#> retrieved batch 2
#> retrieved batch 3
#> retrieved batch 4
#> retrieved batch 5
#> retrieved batch 6

c(nrow(res), arxiv_count(query))
#> [1] 4100 6930

res <- arxiv_search(
    query,
    limit = 10000,
    batchsize = 1000,
    sort_by = "submitted", ascending = FALSE
)
#> retrieved batch 1
#> retrieved batch 2
#> retrieved batch 3
#> retrieved batch 4

c(nrow(res), arxiv_count(query))
#> [1] 3000 6930

Created on 2021-07-18 by the reprex package (v2.0.0)

nicholasmfraser mentioned this issue Aug 20, 2021

update 2020-08-20 for preprints posted up until 2020-08-15 nicholasmfraser/covid19_preprints#22

Merged
