
RE2R-back-on-CRAN

An R package that benchmarks regular expression functions.

To install this package, follow these steps carefully:

Note: you need to have RStudio installed on your machine; follow the instructions here to download it if you don't already have it.

TEST NO 1 (create a package with a vignette containing the new timing figures):

1- Install Rtools: click here, then type the following code in the R console:

writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron")

Now restart R, run this code, and check the output:

Sys.which("make")
## "C:\\rtools40\\usr\\bin\\make.exe"

2- Install devtools:

install.packages("devtools")
library(devtools)
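
Optionally, as an extra sanity check (not part of the original steps), devtools can confirm that the build toolchain is usable:

devtools::has_devel()   # errors if no working compiler toolchain is found, otherwise returns TRUE invisibly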

3- Install the regular expression functions:

install.packages("microbenchmark")
install.packages("ggplot2")
install.packages("directlabels")
install.packages("stringi")
install_github("qinwf/re2r", build_vignettes = T)

4- Install this package, evilRegex (RE2R-back-on-CRAN):

install_github("Mark-Nawar/evilRegex/evilRegex/", build_vignettes = T)

5- To browse the vignette of the package, where the timing figures are:

browseVignettes("evilRegex")

THIS IS THE CONTENT OF THE VIGNETTE. If you have already installed the package, do not proceed; use step 5 instead and you will be redirected to an HTML page.


Regular expressions help us extract text patterns from large texts. Below is an example of extracting dates from a 5 MB text file related to homicides in the United States; you can find the text file on Kaggle. You can see a sample of the text structure below:

homicides <- readLines("homicides.txt")

homicides[1]

[1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"


I can extract the data from this text using the following pattern, Pattern <- c("<dd>[F|f]ound(.*?)</dd>"), and apply the same pattern using different packages.

     ICU  = stringi::stri_match(x, regex="<dd>[F|f]ound(.*?)</dd>")   # stringi ----> ICU (exponential complexity for large evil inputs)
     PCRE = regexpr("<dd>[F|f]ound(.*?)</dd>", x, perl=TRUE)          # regexpr with perl=TRUE ----> PCRE (exponential complexity for large evil inputs)
     TRE  = regexpr("<dd>[F|f]ound(.*?)</dd>", x, perl=FALSE)         # regexpr with perl=FALSE ----> TRE (polynomial complexity)
     RE2  = re2r::re2_match(x, "<dd>[F|f]ound(?P<Date>.*?)</dd>")     # re2r package ----> RE2 (polynomial complexity)
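
As a quick illustration (a minimal sketch, assuming x holds the first homicide record shown above), the matched substring can be pulled out of the regexpr() result with regmatches():

x <- homicides[1]
m <- regexpr("<dd>[F|f]ound(.*?)</dd>", x, perl=TRUE)
regmatches(x, m)   # e.g. "<dd>Found on January 1, 2007</dd>" for the sample record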

To further understand the various packages, check this link.

Now we will compare the performance of the 4 functions to see which one is fastest at extracting the date from the homicides text file. Note that this pattern is not evil (in terms of complexity it is relatively simple). Let's benchmark it and see!

The timing figures come from the following benchmark:

homicides <- readLines("homicides.txt")
max.N <- 25
times.list <- list()
for(N in 1:max.N){
  cat(sprintf("subject/pattern size %4d / %4d\n", N, max.N))
  x <- paste(homicides[1:N], collapse=" ")
  N.times <- microbenchmark::microbenchmark(
    ICU  = stringi::stri_match(x, regex="<dd>[F|f]ound(.*?)</dd>"),
    PCRE = regexpr("<dd>[F|f]ound(.*?)</dd>", x, perl=TRUE),
    TRE  = regexpr("<dd>[F|f]ound(.*?)</dd>", x, perl=FALSE),
    RE2  = re2r::re2_match(x, "<dd>[F|f]ound(?P<Date>.*?)</dd>"),
    times=10)
  times.list[[N]] <- data.frame(N, N.times)
}
times <- do.call(rbind, times.list)
save(times, file="times.RData")
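
The ggplot2 and directlabels packages installed earlier can then draw the timing figures from the saved timings. The snippet below is a sketch of one way to do it, not necessarily the exact vignette code:

library(ggplot2)
library(directlabels)
load("times.RData")                      # restores the 'times' data frame saved above
# microbenchmark reports time in nanoseconds, so divide by 1e9 to get seconds
gg <- ggplot(times, aes(N, time/1e9, colour=expr)) +
  geom_point(alpha=0.3) +
  stat_summary(fun=median, geom="line") +
  labs(x="number of homicide records pasted into the subject", y="seconds")
direct.label(gg, "last.points")          # label each engine at the end of its curve
# gg + scale_y_log10() gives the log-scale version of the figure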

The microbenchmark package gives us this wonderful summary:

N.times
Unit: microseconds
 expr   min    lq   mean median    uq   max neval
  ICU  41.2  45.3  56.11  52.25  58.9  96.2    10
 PCRE  89.0  90.7  94.91  93.10  96.8 113.3    10
  TRE  21.8  24.3  29.20  29.95  31.7  39.2    10
  RE2 118.4 130.6 149.33 138.60 153.0 241.1    10

Note that the numbers are in microseconds, so on a linear time scale we can treat all 4 methods as taking essentially 0 seconds. But how will the results change when the number of possible comparisons greatly increases (an EVIL regular expression)?

[Figures: figure-complexity-linear-good and figure-complexity-log-good, timings for the non-evil pattern on linear and log scales]

As you can see, on the linear scale they all took almost 0 seconds.


Let's now try an evil pattern that will show the real power of polynomial complexity:

Now let's try out the function evilRegex() with pattern <- "^(a+)+$"

subject <- paste(rep(c("a","X"), c(N,1)), collapse=""), where subject = "aX" (N=1), "aaX" (N=2), "aaaX" (N=3), and so on

evilRegex::evilRegex()
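
For reference, here is a minimal sketch of an equivalent benchmark for a single value of N; it illustrates what evilRegex() measures rather than reproducing its exact body:

# "^(a+)+$" can never match "aaa...aX", which forces heavy backtracking
# in the ICU and PCRE engines as N grows.
pattern <- "^(a+)+$"
N <- 20
subject <- paste(rep(c("a","X"), c(N,1)), collapse="")
N.times <- microbenchmark::microbenchmark(
  ICU  = stringi::stri_match(subject, regex=pattern),
  PCRE = regexpr(pattern, subject, perl=TRUE),
  TRE  = regexpr(pattern, subject, perl=FALSE),
  RE2  = re2r::re2_match(subject, pattern),
  times=10)
N.times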

Timings from microbenchmark:

N.times
Unit: microseconds
 expr       min        lq       mean     median        uq       max neval
  ICU 2134799.1 2147828.3 2193954.75 2174281.80 2191353.4 2384468.1    10
 PCRE  222484.6  225298.7  234283.99  232554.05  240938.2  249156.0    10
  TRE      11.4      28.4      32.25      31.65      38.8      46.7    10
  RE2      58.6      67.4     104.57     111.95     127.8     148.0    10

TRE might be slightly better than RE2, but it doesn't support named capture, and that makes RE2 much more useful. Can you spot the big difference between TRE/RE2 and ICU/PCRE? It will become clearer with the following graphs:

[Figures: figure-complexity-linear and figure-complexity-log, timings for the evil pattern on linear and log scales]
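
To illustrate the named-capture point above (a sketch; the exact column layout of re2_match() output may differ between re2r versions), the (?P<Date>...) group from the earlier homicide pattern can be addressed by name:

m <- re2r::re2_match(x, "<dd>[F|f]ound(?P<Date>.*?)</dd>")
m[, "Date"]   # the captured text, accessible via the group name "Date"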


TEST NO 2 (Medium): write some R code that tests Unicode support in each of the 3 currently available regex engines (PCRE, TRE, ICU).

The following code tests Unicode support in the three engines. I am Egyptian, so I will test the phrase "I love Egypt" in 8 different languages and try to find the word "Egypt".

# Each pair is a sentence meaning "I love Egypt", followed by the word for "Egypt" in that language.
testspace<-c(
porto<-'Eu amo o Egito',
portoP<- 'Egito',
hindi<-'मुझे मिस्र से प्यार है',
hindiP<-'मिस्र',
Russian<-'Я люблю египет',
RussianP<-'египет',
swed<-'Jag älskar Egypten',
swedP<-'Egypten',
turkish<-'Mısır\'ı seviyorum',
turkishP<-'Misir',
german<-'Ich liebe Ägypten',
germanP<-'Ägypten',
greek<-'Λατρεύω την Αίγυπτο',
greekP<-'Αίγυπτο',
arabic<-'انا احب مصر',
arabicP<-'مصر' )


# Expected match start position and match length for each language pair.
results_to_be_checked<-c(
    10,5,
    6,5,
    9,6,
    12,7,
    1,5,
    11,7,
    13,7,
    9,3
)
# Expected ICU match text (after enc2native()) for each language pair;
# the "" entries are padding so the indices line up with testspace.
results_to_be_checked_ICU<-c(
    "Egito","",
    "<U+092E><U+093F><U+0938><U+094D><U+0930>","",
    "<U+0435><U+0433><U+0438><U+043F><U+0435><U+0442>","",
    "Egypten","",
    "Misir","",
    "Ägypten","",
    "<U+0391><U+03AF><U+03B3><U+03C5>pt<U+03BF>","",
    "<U+0645><U+0635><U+0631>"
)
test_PCRE<-'testing PCRE for 8 languages'
test_PCRE
counter<-0
for(i in seq(1,16,2)){
    PCRE= regexpr(enc2utf8(testspace[i+1]), enc2utf8(testspace[i]), perl=TRUE)
    if(PCRE[1]==results_to_be_checked[i] & attr(PCRE, "match.length")==results_to_be_checked[i+1])
        counter = counter + 1
}
sprintf("%d tests passsed out of 8 ", counter)


test_TRE<-'testing TRE for 8 languages'
test_TRE
counter<-0
for(i in seq(1,16,2)){
    TRE= regexpr(enc2utf8(testspace[i+1]), enc2utf8(testspace[i]), perl=FALSE)
    if(TRE[1]==results_to_be_checked[i] & attr(TRE, "match.length")==results_to_be_checked[i+1])
        counter = counter + 1
}
sprintf("%d tests passsed out of 8 ", counter)


test_ICU<-'testing ICU for 8 languages'
test_ICU
counter<-0
for(i in seq(1,16,2)){
    ICU= stringi::stri_match( enc2utf8(testspace[i]), regex=enc2utf8(testspace[i+1]))
    if(enc2native(ICU[1,1]) == results_to_be_checked_ICU[i])
        counter = counter + 1
}
sprintf("%d tests passsed out of 8 ", counter)

OUTPUT :

[1] "testing PCRE for 8 languages"
[2] "8 tests passsed out of 8 "
[3] "testing TRE for 8 languages"
[4] "8 tests passsed out of 8 "
[5] "testing ICU for 8 languages"
[6] "8 tests passsed out of 8 "
