Skip to content

Commit

Permalink
Merge branch 'dev'
Browse files Browse the repository at this point in the history
  • Loading branch information
ablaette committed Jun 26, 2021
2 parents b82337b + 9382ce8 commit d1eff50
Show file tree
Hide file tree
Showing 53 changed files with 57,402 additions and 299 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/R-CMD-check.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,12 @@ on:
branches:
- main
- master
- dev
pull_request:
branches:
- main
- master
- dev

name: R-CMD-check

Expand Down
2 changes: 0 additions & 2 deletions CRAN-RELEASE

This file was deleted.

8 changes: 5 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
Package: RcppCWB
Type: Package
Title: 'Rcpp' Bindings for the 'Corpus Workbench' ('CWB')
Version: 0.3.2
Date: 2021-02-03
Version: 0.4.0
Date: 2021-06-25
Author: Andreas Blaette [aut, cre],
Bernard Desgraupes [aut],
Sylvain Loiseau [aut],
Expand Down Expand Up @@ -52,7 +52,8 @@ LinkingTo: Rcpp
Biarch: true
URL: https://github.com/PolMine/RcppCWB
BugReports: https://github.com/PolMine/RcppCWB/issues
RoxygenNote: 6.1.1
RoxygenNote: 7.1.1
Roxygen: list(markdown = TRUE)
Collate:
'RcppCWB_package.R'
'cl.R'
Expand All @@ -66,3 +67,4 @@ Collate:
'region_matrix.R'
'misc.R'
'zzz.R'
'xml.R'
5 changes: 5 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@ export(check_s_attribute)
export(check_strucs)
export(cl_charset_name)
export(cl_delete_corpus)
export(cl_struc_values)
export(corpus_data_dir)
export(cqp_dump_subcorpus)
export(cqp_get_registry)
export(cqp_initialize)
Expand All @@ -22,6 +24,7 @@ export(cqp_query)
export(cqp_reset_registry)
export(cqp_subcorpus_size)
export(cwb_compress_rdx)
export(cwb_encode)
export(cwb_huffcode)
export(cwb_makeall)
export(get_cbow_matrix)
Expand All @@ -31,6 +34,8 @@ export(get_region_matrix)
export(get_tmp_registry)
export(region_matrix_to_count_matrix)
export(region_matrix_to_ids)
export(s_attr_is_descendent)
export(s_attr_is_sibling)
export(s_attribute_decode)
export(use_tmp_registry)
exportPattern("^[[:alpha:]]+")
Expand Down
22 changes: 22 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,25 @@
# RcppCWB 0.4.0

## New Features

* Encode XML (vrt file format) with new function `cwb_encode()` that exposes functionality of cwb-encode CWB utility.
* Functions `cl_cpos2lbound()` and `cl_cpos2rbound()` will now accept an integer vector with length > 1 as argument `cpos` and return a vector with the same length. Useful to speed up iterated queries for left and right boundaries of regions (#19).
* A new function `cl_struc_values()` exposes the corresponding C function of the Corpus Library (CL). The previous implicit assumption that all structural attributes have values can thus be tested. Intended to work with annotations of sentences and paragraphs, i.e. common structural attributes that do usually not have values.
* A new function `corpus_data_dir()` will derive the data directory from the internal C representation of a corpus.
* New function `s_attr_regions()` will derive regions defined by a structural attribute from the *.rng file. Fastest option for large corpora.
* New functions `s_attr_is_sibling()` and `s_attr_is_descendent()` test the sibling/descendent relationship of structural attributes.


## Minor Improvements

* Function `check_corpus()` now includes checks whether the registry provided (argument `registry`) is identical with the registry defined internally by CQP. The registry is reset if directories are not identical.
* Minor adjustments of configure script for aarch64, adding -fPIC to CFLAGS so that this flag will be used when Linux default configuration is used as fallback.
* The implementation of the `s_attribute_decode()` method was incomplete for method "Rcpp". This alternative to the "pure R" approach is now implemented (#2).
* The unused file 'setpaths.R' has been removed from the tools directory (#10).
* The argument `method` previously setting "wininet" in ./tools/winlibs.R is omitted to avoid the warning "the 'wininet' method is deprecated for http:// and https:// URLs" on Windows.
* The configure script will print the libdirs derived using pcre-config and link against libintl on macOS by default.


# RcppCWB 0.3.2

* If RcppCWB is compiled on macOS, the package configure script checks the architecture of the machine and ensures that (if glib-2.0 is not yet present) a version of glib-2.0 compiled for Apple Silicon/the M1 chip is loaded in case an amd64 architecture is detected.
Expand Down
12 changes: 12 additions & 0 deletions R/RcppExports.R
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,14 @@
.Call(`_RcppCWB_check_corpus`, corpus)
}

.cl_struc_values <- function(corpus, s_attribute, registry) {
.Call(`_RcppCWB__cl_struc_values`, corpus, s_attribute, registry)
}

.corpus_data_dir <- function(corpus, registry) {
.Call(`_RcppCWB__corpus_data_dir`, corpus, registry)
}

.decode_s_attribute <- function(corpus, s_attribute, registry) {
.Call(`_RcppCWB_decode_s_attribute`, corpus, s_attribute, registry)
}
Expand Down Expand Up @@ -149,3 +157,7 @@
.Call(`_RcppCWB_cwb_compress_rdx`, x, registry_dir, p_attribute)
}

.cwb_encode <- function(regfile, data_dir, vrt_dir, p_attributes, s_attributes_anno, s_attributes_noanno) {
.Call(`_RcppCWB_cwb_encode`, regfile, data_dir, vrt_dir, p_attributes, s_attributes_anno, s_attributes_noanno)
}

20 changes: 16 additions & 4 deletions R/checks.R
Original file line number Diff line number Diff line change
Expand Up @@ -31,14 +31,26 @@ check_registry <- function(registry){
#' @rdname checks
#' @export check_corpus
check_corpus <- function(corpus, registry){
if (length(corpus) != 1)

if (length(corpus) != 1L)
stop("corpus needs to be a vector of length 1")

if (!is.character(corpus))
stop("corpus needs to be a character vector")
if (!cqp_is_initialized()) cqp_initialize()
if (.check_corpus(toupper(corpus)) == 0)
# if (!tolower(corpus) %in% list.files(registry))

registry <- normalizePath(path.expand(registry))
if (isFALSE(dir.exists(registry)))
stop(sprintf("Registry directory '%s' does not exist.", registry))

if (isFALSE(cqp_is_initialized())) cqp_initialize(registry = registry)
if (cqp_get_registry() != registry){
warning(sprintf("Resetting registry directory from '%s' to '%s'", cqp_get_registry(), registry))
cqp_reset_registry(registry = registry)
}

if (.check_corpus(toupper(corpus)) == 0L)
stop(sprintf("corpus %s is not available (check whether there is a typo)", sQuote(corpus)))

return( TRUE )
}

Expand Down
39 changes: 38 additions & 1 deletion R/cl.R
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ cl_lexicon_size <- function(corpus, p_attribute, registry = Sys.getenv("CORPUS_R
#'
#' @param corpus name of a CWB corpus (upper case)
#' @param s_attribute name of structural attribute (character vector)
#' @param cpos corpus positions (integer vector)
#' @param cpos An \code{integer} vector with corpus positions.
#' @param struc a struc identifying a region
#' @param registry path to the registry directory, defaults to the value of the
#' environment variable CORPUS_REGISTRY
Expand Down Expand Up @@ -303,3 +303,40 @@ cl_charset_name <- function(corpus, registry = Sys.getenv("CORPUS_REGISTRY")){
.cl_charset_name(corpus = corpus, registry = registry)
}

#' Check whether structural attribute has values
#'
#' Structural attributes do not necessarily have values, structural attributes
#' (such as annotations of sentences or paragraphs) may just define regions of
#' corpus positions. Use this function to test whether an attribute has values.
#'
#' @param corpus Corpus ID, a length-one `character` vector.
#' @param s_attribute Structural attribute to check, a length-one `character` vector.
#' @param registry The registry directory of the corpus.
#' @return `TRUE` if the attribute has values and `FALSE` if not. `NA` if the structural
#' attribute is not available.
#' @export cl_struc_values
#' @examples
#' cl_struc_values("REUTERS", "places") # TRUE - attribute has values
#' cl_struc_values("REUTERS", "date") # NA - attribute does not exist
cl_struc_values <- function(corpus, s_attribute, registry = Sys.getenv("CORPUS_REGISTRY")){
check_corpus(corpus = corpus, registry = registry)
registry <- normalizePath(path.expand(registry))
i <- .cl_struc_values(corpus = corpus, s_attribute = s_attribute, registry = registry)
if (i == 1L) TRUE else if (i == 0L) FALSE else if (i < 0L) as.integer(NA)
}

#' Get data directory of a corpus
#'
#' Extract the data directory from the intenal C representation of the content
#' of the registry file for a corpus.
#' @param corpus A length-one `character` vector with the corpus ID.
#' @param registry A length-one `character` vector with the registry directory.
#' @return A length-one `character` vector stating the data directory.
#' @export corpus_data_dir
#' @examples
#' corpus_data_dir("REUTERS")
corpus_data_dir <- function(corpus, registry = Sys.getenv("CORPUS_REGISTRY")){
check_corpus(corpus = corpus, registry = registry)
registry <- normalizePath(path.expand(registry))
.corpus_data_dir(corpus = corpus, registry = registry)
}
67 changes: 66 additions & 1 deletion R/cwb.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,6 @@
#' wrappers will always perform a specific indexing/compression step on one
#' positional attribute, and produce all components.
#'
#' @rdname cwb_utils
#' @param corpus name of a CWB corpus (upper case)
#' @param p_attribute name p-attribute
#' @param registry path to the registry directory, defaults to the value of the
Expand Down Expand Up @@ -79,3 +78,69 @@ cwb_huffcode <- function(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGI
cwb_compress_rdx <- function(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY")){
.cwb_compress_rdx(x = corpus, p_attribute = p_attribute, registry_dir = registry)
}

#' @param p_attributes Positional attributes (p-attributes) to be declared.
#' @param data_dir The data directory where `cwb_encode` will put the binary
#' files of the indexed corpus.
#' @param vrt_dir Directory with input corpus files (verticalised format / file
#' ending *.vrt).
#' @param s_attributes A `list` of named `character` vectors to declare
#' structural attributes that shall be encoded. The names of the list are the
#' XML elements present in the corpus. Character vectors making up the list
#' declare the attributes that include the metadata of regions. To declare a
#' structural attribute without annotations, provide a zero-length character
#' vector using `character()` - see examples.
#' @rdname cwb_utils
#' @export cwb_encode
#' @examples
#' \dontrun{
#' data_dir <- file.path(tempdir(), "tmp_data_dir")
#' dir.create(data_dir)
#'
#' cwb_encode(
#' corpus = "BTMIN",
#' registry = Sys.getenv("CORPUS_REGISTRY"),
#' vrt_dir = system.file(package = "RcppCWB", "extdata", "vrt"),
#' data_dir = data_dir,
#' p_attributes = c("word", "pos", "lemma"),
#' s_attributes = list(
#' plenary_protocol = c(
#' "lp", "protocol_no", "date", "year", "birthday", "version",
#' "url", "filetype"
#' ),
#' speaker = c(
#' "id", "type", "lp", "protocol_no", "date", "year", "ai_no", "ai_id",
#' "ai_type", "who", "name", "parliamentary_group", "party", "role"
#' ),
#' p = character()
#' )
#' )
#'
#' unlink(data_dir)
#' unlink(file.path(Sys.getenv("CORPUS_REGISTRY"), "BTMIN"))
#' }
cwb_encode <- function(corpus, registry = Sys.getenv("CORPUS_REGISTRY"), data_dir, vrt_dir, p_attributes = c("word", "pos", "lemma"), s_attributes){

s_attributes_noanno <- unlist(lapply(
names(s_attributes),
function(s_attr) if (length(s_attributes[[s_attr]]) == 0L) s_attr else character()
))

for (s_attr in s_attributes_noanno) s_attributes[[s_attr]] <- NULL

s_attributes_anno <- unname(
sapply(
names(s_attributes),
function(s_attr) paste(s_attr, ":", 0L, "+", paste(s_attributes[[s_attr]], collapse = "+"), sep = "")
)
)

.cwb_encode(
regfile = file.path(registry, tolower(corpus)),
data_dir = data_dir,
vrt_dir = vrt_dir,
p_attributes = p_attributes,
s_attributes_anno = s_attributes_anno,
s_attributes_noanno = s_attributes_noanno
)
}
90 changes: 72 additions & 18 deletions R/decode.R
Original file line number Diff line number Diff line change
Expand Up @@ -7,18 +7,20 @@
#' the directory specified by \code{data_dir}. An implementation using Rcpp will use the
#' registry file for \code{corpus} to find the data directory.
#'
#' @param corpus a CWB corpus
#' @param s_attribute a structural attribute
#' @param data_dir data directory where binary files for corpus are stored
#' @param encoding encoding of the values ("latin-1" or "utf-8")
#' @param registry registry directory
#' @param method character vector, whether to use "R" or "Rcpp" implementation
#' @return A \code{data.frame} with three columns. Column \code{cpos_left} are the start
#' corpus positions of a structural annotation, \code{cpos_right} the end corpus positions.
#' Column \code{value} is the value of the annotation.
#' @param corpus A CWB corpus (ID in upper case).
#' @param s_attribute A structural attribute (length 1 `character` vector).
#' @param data_dir The data directory where the binary files of the corpus are
#' stored.
#' @param encoding Encoding of the values ("latin-1" or "utf-8")
#' @param registry The CWB registry directory.
#' @param method A length-one `character` vector, whether to use "R" or "Rcpp"
#' implementation for decoding structural attribute.
#' @return A \code{data.frame} with three columns. Column \code{cpos_left} are
#' the start corpus positions of a structural annotation, \code{cpos_right}
#' the end corpus positions. Column \code{value} is the value of the
#' annotation.
#' @export s_attribute_decode
#' @rdname s_attribute_decode
#' @return a character vector
#' @examples
#' registry <- if (!check_pkg_registry_files()) use_tmp_registry() else get_pkg_registry()
#' Sys.setenv(CORPUS_REGISTRY = registry)
Expand All @@ -27,8 +29,21 @@
#' b <- s_attribute_decode(
#' data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
#' s_attribute = "places", method = "R"
#' )
#' )
#'
#' # Using Rcpp wrappers for CWB C code
#' b <- s_attribute_decode(
#' corpus = "REUTERS",
#' data_dir = system.file(package = "RcppCWB", "extdata", "cwb", "indexed_corpora", "reuters"),
#' s_attribute = "places",
#' method = "Rcpp"
#' )
s_attribute_decode <- function(corpus, data_dir, s_attribute, encoding = NULL, registry = Sys.getenv("CORPUS_REGISTRY"), method = c("R", "Rcpp")){

if (!is.character(method)) stop("Argument 'method' needs to be a character vector.")
if (length(method) != 1L) stop("Argument 'method' needs to be a length 1 vector.")
if (!method %in% c("Rcpp", "R")) stop("Argument 'method' needs to be either 'R' or 'Rcpp'.")

if (method == "R"){

if (missing(data_dir)) stop("data_dir needs to be specified to use R method")
Expand Down Expand Up @@ -64,19 +79,58 @@ s_attribute_decode <- function(corpus, data_dir, s_attribute, encoding = NULL, r
check_registry(registry = registry)
check_corpus(corpus = corpus, registry = registry)
check_s_attribute(corpus = corpus, registry = registry, s_attribute = s_attribute)

values <- .decode_s_attribute(corpus = corpus, s_attribute = s_attribute, registry = registry)
warning("region matrix can not yet be generated with Rcpp method")
region_matrix <- NULL

s_attr_size <- cl_attribute_size(
corpus = corpus,
attribute = s_attribute,
attribute_type = "s",
registry = registry
)

region_matrix <- get_region_matrix(
corpus = corpus,
s_attribute = s_attribute,
strucs = 0L:(s_attr_size - 1L),
registry = registry
)

df <- data.frame(
cpos_left = NA,
cpos_right = NA,
cpos_left = region_matrix[,1],
cpos_right = region_matrix[,2],
value = values,
stringsAsFactors = FALSE
)

} else {
stop("method needs to be either 'R' or 'Rcpp'")
}

df
}


#' Get regions defined by a structural attribute
#'
#' Get all regions defined by a structural attribute. Unlike
#' `get_region_matrix()` that returns a region matrix for a defined subset of
#' strucs, all regions are returned. As it is the fastest option, the function
#' reads the binary *.rng file for the structural attribute directly. The corpus
#' library (CL) is not used in this case.
#'
#' @param corpus A length-one `character` vector with a corpus ID.
#' @param s_attr A length-one `character` vector stating a structural attribute.
#' @param registry A length-one `character` vector stating the registry
#' directory (defaults to CORPUS_REGISTRY environment variable).
#' @param data_dir The data directory of the corpus.
#' @return A two-colum `matrix` with the regions defined by the structural
#' attribute: Column 1 defines left corpus positions and column 2 right corpus
#' positions of regions.
#' @examples
#' s_attr_regions("REUTERS", s_attr = "id")
s_attr_regions <- function(corpus, s_attr, registry = Sys.getenv("CORPUS_REGISTRY"), data_dir = corpus_data_dir(corpus = corpus, registry = registry)){
rng_file <- file.path(data_dir, paste(s_attr, "rng", sep = "."))
rng_file_size <- file.info(rng_file)[["size"]]
con <- file(rng_file, open = "rb")
rng <- readBin(con, what = integer(), size = 4L, n = rng_file_size / 4L, endian = "big")
close(con)
matrix(rng, ncol = 2L, byrow = TRUE)
}
Loading

0 comments on commit d1eff50

Please sign in to comment.