Add compatibility with raw vectors of PDF data (useful when reading files from, for example, aws.s3) #126

Open: 2 commits to be merged into base: main

2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -41,4 +41,4 @@ Suggests:
 testthat
 SystemRequirements: Java (>= 7.0)
 VignetteBuilder: knitr
-RoxygenNote: 6.0.1
+RoxygenNote: 7.1.1
2 changes: 1 addition & 1 deletion R/extract_metadata.R
@@ -1,6 +1,6 @@
 #' @title extract_metadata
 #' @description Extract metadata from a file
-#' @param file A character string specifying the path or URL to a PDF file.
+#' @param file A character string specifying the path or URL to a PDF file, or raw vector with pdf data.
 #' @param password Optionally, a character string containing a user password to access a secured PDF.
 #' @param copy Specifies whether the original local file(s) should be copied to
 #' \code{tempdir()} before processing. \code{FALSE} by default. The argument is
2 changes: 1 addition & 1 deletion R/extract_tables.R
@@ -1,6 +1,6 @@
 #' @title extract_tables
 #' @description Extract tables from a file
-#' @param file A character string specifying the path or URL to a PDF file.
+#' @param file A character string specifying the path or URL to a PDF file, or raw vector with pdf data.
 #' @param pages An optional integer vector specifying pages to extract from.
 #' @param area An optional list, of length equal to the number of pages specified, where each entry contains a four-element numeric vector of coordinates (top,left,bottom,right) containing the table for the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages. Only specify \code{area} xor \code{columns}.
 #' @param columns An optional list, of length equal to the number of pages specified, where each entry contains a numeric vector of horizontal (x) coordinates separating columns of data for the corresponding page. As a convenience, a list of length 1 can be used to specify the same columns for all (specified) pages. Only specify \code{area} xor \code{columns}.
2 changes: 1 addition & 1 deletion R/extract_text.R
@@ -1,6 +1,6 @@
 #' @title extract_text
 #' @description Extract text from a file
-#' @param file A character string specifying the path or URL to a PDF file.
+#' @param file A character string specifying the path or URL to a PDF file, or raw vector with pdf data.
 #' @param pages An optional integer vector specifying pages to extract from.
 #' @param area An optional list, of length equal to the number of pages specified, where each entry contains a four-element numeric vector of coordinates (top,left,bottom,right) containing the table for the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages.
 #' @param password Optionally, a character string containing a user password to access a secured PDF.
2 changes: 1 addition & 1 deletion R/get_page_dims.R
@@ -1,7 +1,7 @@
 #' @rdname get_page_dims
 #' @title Page length and dimensions
 #' @description Get Page Length and Dimensions
-#' @param file A character string specifying the path or URL to a PDF file.
+#' @param file A character string specifying the path or URL to a PDF file, or raw vector with pdf data.
 #' @param pages An optional integer vector specifying pages to extract from.
 #' @param doc Optionally,, in lieu of \code{file}, an rJava reference to a PDDocument Java object.
 #' @param password Optionally, a character string containing a user password to access a secured PDF.
24 changes: 15 additions & 9 deletions R/utils.R
@@ -19,17 +19,23 @@ localize_file <- function(path, copy = FALSE, quiet = TRUE) {
 path
 }

-load_doc <- function(file, password = NULL, copy = FALSE) {
+load_doc <- function(file = NULL, password = NULL, copy = FALSE) {
+pdfDocument <- new(J("org.apache.pdfbox.pdmodel.PDDocument"))
+if(typeof(file) != "raw"){
 localfile <- localize_file(path = file, copy = copy)
-pdfDocument <- new(J("org.apache.pdfbox.pdmodel.PDDocument"))
 fileInputStream <- new(J("java.io.FileInputStream"), name <- localfile)
-if (is.null(password)) {
-doc <- pdfDocument$load(input = fileInputStream)
-} else {
-doc <- pdfDocument$load(input = fileInputStream, password = password)
-}
-pdfDocument$close()
-doc
 }
+else {
+fileInputStream <- new(J("java.io.ByteArrayInputStream"), buf = rJava::.jbyte(file))
+}
+
+if (is.null(password)) {
+doc <- pdfDocument$load(input = fileInputStream)
+} else {
+doc <- pdfDocument$load(input = fileInputStream, password = password)
+}
+pdfDocument$close()
+doc
+}

 make_pages <- function(pages, oe) {
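
For readers following the `load_doc()` change, the branch on `typeof(file)` can be tried out in base R without Java. This is only a rough sketch: `describe_pdf_input()` is a hypothetical helper that is not part of this PR or the package, and the tiny file written below is a stand-in that only carries a PDF-style header, not a valid document. It assumes that a download (for example from S3) ends up as a raw vector similar to what `readBin()` returns here.

```r
# Sketch only: mirrors the raw-vs-path branch in load_doc() without rJava.
# describe_pdf_input() is a hypothetical helper, not part of the package.
describe_pdf_input <- function(file) {
  if (typeof(file) != "raw") {
    # Character input: treated as a path or URL, as before the change.
    "path"
  } else {
    # Raw input: the PDF bytes are already in memory.
    "raw"
  }
}

# Write a stand-in file; only the header bytes matter for this sketch.
tmp <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n%%EOF\n"), tmp)

# Read the whole file into a raw vector.
pdf_raw <- readBin(tmp, what = "raw", n = file.info(tmp)$size)

describe_pdf_input(tmp)      # "path"
describe_pdf_input(pdf_raw)  # "raw"
rawToChar(pdf_raw[1:4])      # "%PDF"
```

With the change in this PR, a raw vector like `pdf_raw` would take the `ByteArrayInputStream` branch, while a character path would still go through `localize_file()`.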
9 changes: 7 additions & 2 deletions man/extract_areas.Rd
2 changes: 1 addition & 1 deletion man/extract_metadata.Rd
21 changes: 15 additions & 6 deletions man/extract_tables.Rd
12 changes: 9 additions & 3 deletions man/extract_text.Rd
2 changes: 1 addition & 1 deletion man/get_page_dims.Rd
11 changes: 9 additions & 2 deletions man/make_thumbnails.Rd

Some generated files are not rendered by default.