v0.5.0 full merge
ablaette committed Feb 1, 2022
2 parents 699e3f9 + 33357b9 commit f78610e
Showing 927 changed files with 106,779 additions and 123,487 deletions.
2 changes: 1 addition & 1 deletion .Rbuildignore
@@ -1,5 +1,5 @@
^windows$
^prep$
^patch$
^cran-comments\.md$
^paper.md$
^data-raw$
53 changes: 30 additions & 23 deletions .github/workflows/R-CMD-check.yaml
@@ -26,7 +26,7 @@ jobs:
matrix:
config:
- {os: windows-latest, r: 'release'}
# - {os: windows-2022, r: 'devel'}
- {os: windows-2022, r: 'devel'}
- {os: macOS-latest, r: 'release'}
- {os: ubuntu-20.04, r: 'release', rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest"}
- {os: ubuntu-20.04, r: 'devel', rspm: "https://packagemanager.rstudio.com/cran/__linux__/focal/latest"}
@@ -39,32 +39,31 @@ jobs:
steps:
- uses: actions/checkout@v2

- name: Setup R
- uses: r-lib/actions/setup-r@v2
if: matrix.config.os != 'windows-2022'
uses: r-lib/actions/setup-r@v1
with:
r-version: ${{ matrix.config.r }}
http-user-agent: ${{ matrix.config.http-user-agent }}
use-public-rspm: true

# - name: Setup R (Windows UCRT)
# if: matrix.config.os == 'windows-2022'
# uses: kalibera/ucrt3/actions/r-install@main
- name: Setup R (Windows UCRT)
if: matrix.config.os == 'windows-2022'
uses: kalibera/ucrt3/actions/r-install@main

# - name: Install UCRT toolchain
# if: matrix.config.os == 'windows-2022'
# uses: kalibera/ucrt3/actions/toolchain-install@main
# with:
# # base ... toolchain has the compilers and libraries to build R and recommended packages
# # full ... additional libraries to build CRAN packages
# # none ... no toolchain is needed (no native code)
# toolchain-type: full

- name: Check package
- name: Install UCRT toolchain
if: matrix.config.os == 'windows-2022'
uses: kalibera/R-actions/pkg-check@master
uses: kalibera/ucrt3/actions/toolchain-install@main
with:
# base ... toolchain has the compilers and libraries to build R and recommended packages
# full ... additional libraries to build CRAN packages
# none ... no toolchain is needed (no native code)
toolchain-type: full

- name: Setup pandoc
if: matrix.config.os != 'windows-2022'
uses: r-lib/actions/setup-pandoc@v1
- uses: r-lib/actions/setup-pandoc@v2

- uses: r-lib/actions/setup-r-dependencies@v2
with:
extra-packages: rcmdcheck

- name: Query dependencies
run: |
@@ -91,7 +90,6 @@ jobs:
done < <(Rscript -e 'writeLines(remotes::system_requirements("ubuntu", "20.04"))')
- name: Install dependencies
if: matrix.config.os != 'windows-2022'
run: |
remotes::install_deps(dependencies = TRUE)
remotes::install_cran("rcmdcheck")
@@ -101,20 +99,28 @@ jobs:
shell: Rscript {0}

- name: Check
if: matrix.config.os != 'windows-2022'
env:
_R_CHECK_CRAN_INCOMING_REMOTE_: false
run: |
options(crayon.enabled = TRUE)
rcmdcheck::rcmdcheck(args = c("--no-manual", "--as-cran"), error_on = "warning", check_dir = "check")
shell: Rscript {0}


- name: Check package (UCRT)
if: matrix.config.os == 'windows-2022'
env:
_R_INSTALL_TIME_PATCHES_: no
TZ: UTC
uses: kalibera/R-actions/pkg-check@master

- name: Build Windows binary package
if: matrix.os == 'windows-latest'
run: pkgbuild::build(binary = TRUE, dest_path = Sys.getenv("GITHUB_WORKSPACE"))
shell: Rscript {0}

- name: Upload Windows binary
if: matrix.os == 'windows-latest'
if: matrix.config.os == 'windows-latest'
uses: actions/upload-artifact@v2
with:
name: RcppCWB-Windows-binary
@@ -141,5 +147,6 @@ jobs:
path: check

- name: Test coverage
if: matrix.os != 'windows-latest'
run: covr::codecov()
shell: Rscript {0}
2 changes: 2 additions & 0 deletions CRAN-RELEASE
@@ -0,0 +1,2 @@
This package was submitted to CRAN on 2022-01-31.
Once it is accepted, delete this file and tag the release (commit 6ed0e4c).
1 change: 1 addition & 0 deletions NAMESPACE
@@ -28,6 +28,7 @@ export(cwb_compress_rdx)
export(cwb_encode)
export(cwb_huffcode)
export(cwb_makeall)
export(cwb_version)
export(get_cbow_matrix)
export(get_count_vector)
export(get_pkg_registry)
2 changes: 1 addition & 1 deletion NEWS.md
@@ -179,7 +179,7 @@ are now exported and documented
# RcppCWB 0.2.2

* Compiling RcppCWB on unix-like systems (macOS, Linux) will work now without
the presence of glib (on Windows, the dependency persists).
the presence of glib (on Windows, the dependency persists).#
* The presence of the bison parser is not required any more. The package includes
the C source generated by the bison parser along with the original input files.
* Functionality to generate CWB-indexed corpora and to generate and manipulate
64 changes: 34 additions & 30 deletions R/RcppExports.R
@@ -1,6 +1,34 @@
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

.decode_s_attribute <- function(corpus, s_attribute, registry) {
.Call(`_RcppCWB_decode_s_attribute`, corpus, s_attribute, registry)
}

.get_count_vector <- function(corpus, p_attribute, registry) {
.Call(`_RcppCWB_get_count_vector`, corpus, p_attribute, registry)
}

.get_region_matrix <- function(corpus, s_attribute, strucs, registry) {
.Call(`_RcppCWB_get_region_matrix`, corpus, s_attribute, strucs, registry)
}

.get_cbow_matrix <- function(corpus, p_attribute, registry, matrix, window) {
.Call(`_RcppCWB_get_cbow_matrix`, corpus, p_attribute, registry, matrix, window)
}

.region_matrix_to_ids <- function(corpus, p_attribute, registry, matrix) {
.Call(`_RcppCWB_region_matrix_to_ids`, corpus, p_attribute, registry, matrix)
}

.ids_to_count_matrix <- function(ids) {
.Call(`_RcppCWB_ids_to_count_matrix`, ids)
}

.region_matrix_to_count_matrix <- function(corpus, p_attribute, registry, matrix) {
.Call(`_RcppCWB_region_matrix_to_count_matrix`, corpus, p_attribute, registry, matrix)
}

.cl_attribute_size <- function(corpus, attribute, attribute_type, registry) {
.Call(`_RcppCWB__cl_attribute_size`, corpus, attribute, attribute_type, registry)
}
@@ -121,34 +149,6 @@
.Call(`_RcppCWB__corpus_data_dir`, corpus, registry)
}

.decode_s_attribute <- function(corpus, s_attribute, registry) {
.Call(`_RcppCWB_decode_s_attribute`, corpus, s_attribute, registry)
}

.get_count_vector <- function(corpus, p_attribute, registry) {
.Call(`_RcppCWB_get_count_vector`, corpus, p_attribute, registry)
}

.get_region_matrix <- function(corpus, s_attribute, strucs, registry) {
.Call(`_RcppCWB_get_region_matrix`, corpus, s_attribute, strucs, registry)
}

.get_cbow_matrix <- function(corpus, p_attribute, registry, matrix, window) {
.Call(`_RcppCWB_get_cbow_matrix`, corpus, p_attribute, registry, matrix, window)
}

.region_matrix_to_ids <- function(corpus, p_attribute, registry, matrix) {
.Call(`_RcppCWB_region_matrix_to_ids`, corpus, p_attribute, registry, matrix)
}

.ids_to_count_matrix <- function(ids) {
.Call(`_RcppCWB_ids_to_count_matrix`, ids)
}

.region_matrix_to_count_matrix <- function(corpus, p_attribute, registry, matrix) {
.Call(`_RcppCWB_region_matrix_to_count_matrix`, corpus, p_attribute, registry, matrix)
}

.cwb_makeall <- function(x, registry_dir, p_attribute) {
.Call(`_RcppCWB_cwb_makeall`, x, registry_dir, p_attribute)
}
@@ -161,7 +161,11 @@
.Call(`_RcppCWB_cwb_compress_rdx`, x, registry_dir, p_attribute)
}

.cwb_encode <- function(regfile, data_dir, vrt_dir, p_attributes, s_attributes_anno, s_attributes_noanno) {
.Call(`_RcppCWB_cwb_encode`, regfile, data_dir, vrt_dir, p_attributes, s_attributes_anno, s_attributes_noanno)
.cwb_encode <- function(regfile, data_dir, vrt_dir, encoding, p_attributes, s_attributes_anno, s_attributes_noanno) {
.Call(`_RcppCWB_cwb_encode`, regfile, data_dir, vrt_dir, encoding, p_attributes, s_attributes_anno, s_attributes_noanno)
}

.cwb_version <- function() {
.Call(`_RcppCWB_cwb_version`)
}

14 changes: 13 additions & 1 deletion R/cwb.R
@@ -97,6 +97,7 @@ cwb_compress_rdx <- function(corpus, p_attribute, registry = Sys.getenv("CORPUS_
#' files of the indexed corpus.
#' @param vrt_dir Directory with input corpus files (verticalised format / file
#' ending *.vrt).
#' @param encoding The encoding of the files to be encoded, defaults to "utf8".
#' @param s_attributes A `list` of named `character` vectors to declare
#' structural attributes that shall be encoded. The names of the list are the
#' XML elements present in the corpus. Character vectors making up the list
@@ -132,7 +133,7 @@ cwb_compress_rdx <- function(corpus, p_attribute, registry = Sys.getenv("CORPUS_
#' unlink(data_dir)
#' unlink(file.path(Sys.getenv("CORPUS_REGISTRY"), "btmin"))
#' }
cwb_encode <- function(corpus, registry = Sys.getenv("CORPUS_REGISTRY"), data_dir, vrt_dir, p_attributes = c("word", "pos", "lemma"), s_attributes){
cwb_encode <- function(corpus, registry = Sys.getenv("CORPUS_REGISTRY"), data_dir, vrt_dir, encoding = "utf8", p_attributes = c("word", "pos", "lemma"), s_attributes){

s_attributes_noanno <- unlist(lapply(
names(s_attributes),
@@ -152,8 +153,19 @@ cwb_encode <- function(corpus, registry = Sys.getenv("CORPUS_REGISTRY"), data_di
regfile = file.path(registry, tolower(corpus)),
data_dir = data_dir,
vrt_dir = vrt_dir,
encoding = encoding,
p_attributes = p_attributes,
s_attributes_anno = s_attributes_anno,
s_attributes_noanno = s_attributes_noanno
)
}

#' Get CWB version
#'
#' Get the CWB version used and available when compiling the source code.
#'
#' @export
#' @return A `numeric_version` object.
#' @examples
#' cwb_version()
cwb_version <- function() as.numeric_version(.cwb_version())
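
Because `cwb_version()` returns a `numeric_version` object, callers can compare it against a minimum version without parsing strings. A minimal sketch in base R, assuming a stand-in value rather than a call into RcppCWB (the string "3.4.33" is borrowed from the `-DCWB_VERSION` flag in the compiler log further down and is used here only for illustration):

```r
# Stand-in for the value cwb_version() would return; hypothetical,
# chosen only so the example runs without RcppCWB installed.
cwb_v <- as.numeric_version("3.4.33")

# numeric_version objects compare component by component.
newer_than_old_cwb <- cwb_v >= "3.4.14"   # TRUE
at_least_patch_100 <- cwb_v >= "3.4.100"  # FALSE, because 33 < 100

# A plain string comparison gets the second case wrong, because it
# compares character by character ("3" sorts after "1").
string_says <- "3.4.33" >= "3.4.100"      # TRUE

print(c(newer_than_old_cwb, at_least_patch_100, string_says))
```

The second comparison is the reason for wrapping the raw string in `as.numeric_version()` rather than returning a character vector.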
27 changes: 24 additions & 3 deletions cran-comments.md
@@ -4,19 +4,40 @@ The RcppCWB package is a wrapper for the Corpus Workbench (CWB). Previous versio
included the C code of an increasingly outdated version of CWB (CWB v3.4.14).
This version re-aligns RcppCWB with upstream CWB development (CWB v3.4.44).

Previous message: This version (v0.4.4) solves issues with setting and re-setting paths. I am confident that this release will solve an issue with the 'cwbtools' package (r-devel-windows-x86_64-new-UL and r-devel-windows-x86_64-new-TK).
It uses automated patches (see code in ./patch) to make the CWB fit into this R
package. It will now be painless to include the most recent (patched)
CWB code in this package.

The most important immediate purpose of this release is to integrate
patches developed by Tomas Kalibera, bringing it up to the requirements of the
UCRT toolchain. It is not necessary to apply patches at CRAN
(installation with _R_INSTALL_TIME_PATCHES_=no).

On this occasion, I would like to gratefully acknowledge the incredibly helpful
guidance and support of Tomas Kalibera, and the support and advice offered by Brian
Ripley.


## Test environments

* Standard checks with R-hub
* CI checks with GitHub Actions (Windows/macOS/Ubuntu)
* local macOS R 4.1.1
* local macOS R 4.1.2 (both x86_64 and arm64)
* local Windows machine (both R 4.1.2 and R 4.2)


## R CMD check results

I see a NOTE concerning package size on CRAN machines (5.2 MB on r-release-macos-x86_64). I hope this is still tolerable.
Compatibility with the UCRT toolchain has been the primary concern of this
release. I applied patches to remove all compiler warnings I saw. Yet there is
one remaining 'stringop-overflow' compiler warning indicating a fairly
unlikely scenario that I find hard to address at this time.

gcc -c -o cdaccess.o -O2 -Wall -D__MINGW__ -DEMULATE_SETENV -DCOMPILE_DATE=\""Sat Jan 29 07:12:07 WEST 2022"\" -DCWB_VERSION=\"3.4.33\" -IC:/rtools42/x86_64-w64-mingw32.static.posix/include/glib-2.0 -IC:/rtools42/x86_64-w64-mingw32.static.posix/lib/glib-2.0/include -I/rtools42/x86_64-w64-mingw32.static.posix/include/glib-2.0 -DPCRE_STATIC cdaccess.c
cdaccess.c: In function 'cl_read_stream':
cdaccess.c:982:5: warning: 'memcpy' specified bound between 18446744065119617024 and 18446744073709551612 exceeds maximum object size 9223372036854775807 [-Wstringop-overflow=]
| memcpy(buffer, ps->base + ps->nr_items, items_to_read * sizeof(int));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
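
The two bounds in the warning can be reproduced arithmetically. This is a reading of the numbers, not something the log states: it assumes `items_to_read` is a signed 32-bit `int`, `sizeof(int)` is 4, and the product is converted to a 64-bit unsigned `size_t`. Under those assumptions a negative count between $-2^{31}$ and $-1$ wraps around to exactly the reported range:

$$
\begin{aligned}
2^{64} - 2^{31}\cdot 4 &= 18446744073709551616 - 8589934592 = 18446744065119617024 \\
2^{64} - 1\cdot 4 &= 18446744073709551616 - 4 = 18446744073709551612 \\
2^{63} - 1 &= 9223372036854775807 \quad \text{(maximum object size)}
\end{aligned}
$$

So the compiler appears to be flagging the case where the item count is negative, which matches the description of it as a fairly unlikely scenario.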


## Downstream dependencies