Merge pull request #28 from ismaelgutier/work_dog

updating README and vignettes
ismaelgutier · Nov 22, 2024 · a7f02d9
2 parents 607c2e2 + f6515a3
commit a7f02d9

Showing 14 changed files with 312 additions and 502 deletions.
diff --git a/.Rhistory b/.Rhistory
@@ -1,6 +1,3 @@
-ungroup() %>%
-select(ID, task, item_ID, item, response = Response, RA, Attempt, accessed)
-# For IGC_long_phon
 IGC_long_phon <- IGC_long_phon %>%
 mutate(Attempt = as.numeric(Attempt)) %>%  # Convert Attempt to numeric
 group_by(task) %>%                     # Group by task only
@@ -510,3 +507,6 @@ dplyr::filter(!stringr::str_detect(task_type, "nonword")) %>%
 dplyr::arrange(ID)
 document()
 install()
+document()
+install()
+print(333333)
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: sunflower
 Type: Package
 Title: Managing Multiple Responses, Computing Formal Quality Measures, and Classifying Language Production Errors
-Version: 0.16.11
+Version: 0.22.11
 Author: Gutiérrez-Cordero, I [aut][cre](<https://orcid.org/0000-0003-1508-4203>)
 Authors@R: person(
                given = "Ismael",

diff --git a/README.Rmd b/README.Rmd
@@ -19,7 +19,7 @@ knitr::opts_chunk$set(
 
 <!-- badges start -->
 
-![](https://img.shields.io/badge/sunflower-v._0.16.11-orange?style=flat&link=https%3A%2F%2Fgithub.com%2Fismaelgutier%2Fsunflower) [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) ![](https://img.shields.io/badge/Language-grey?style=flat&logo=R&color=grey&link=https%3A%2F%2Fwww.r-project.org%2F)
+![](https://img.shields.io/badge/sunflower-v._0.22.11-orange?style=flat&link=https%3A%2F%2Fgithub.com%2Fismaelgutier%2Fsunflower) [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) ![](https://img.shields.io/badge/Language-grey?style=flat&logo=R&color=grey&link=https%3A%2F%2Fwww.r-project.org%2F)
 
 <!-- badges end -->
 
@@ -202,7 +202,7 @@ errors_classified %>%
 
 *sunflower* allows for the classification of production errors once some indexes related to responses to a stimulus have been obtained and contextualized based on whether they come from repeated attempts or single productions. This process involves three steps.
 
-First, a lexicality check of the response is performed using the `lexicality_check()` function, which involves determining whether the response is a real word. To do this, the package searches for the response in a database such as *BuscaPalabras* ([BPal](https://www.uv.es/~mperea/Davis_Perea_in_press.pdf)) and compares its frequency with the target word to determine if it is a real word based on whether it has a higher frequency or not when the parameter `criterion = "database"` is set. Alternatively, the response can be checked against a dictionary (*sunflower* searches for responses among entries from the *Real Academia Española*, [RAE](https://www.rae.es/)) when the parameter `criterion = "dictionary"` is used.
+First, a lexicality check of the response is performed using the `check_lexicality()` function, which involves determining whether the response is a real word. To do this, the package searches for the response in a database such as *BuscaPalabras* ([BPal](https://www.uv.es/~mperea/Davis_Perea_in_press.pdf)) and compares its frequency with the target word to determine if it is a real word based on whether it has a higher frequency or not when the parameter `criterion = "database"` is set. Alternatively, the response can be checked against a dictionary (*sunflower* searches for responses among entries from the *Real Academia Española*, [RAE](https://www.rae.es/)) when the parameter `criterion = "dictionary"` is used.
 
 Next, similarity measures between the targets and the responses are obtained using various algorithms within the `get_formal_similarity()` function. Finally, the cosine similarity between the two productions is computed if possible using the `get_semantic_similarity()` function, based on an NLP model. In our case, the parameter `model = m_w2v` refers to a binary file containing a Spanish Billion Words embeddings corpus created using the word2vec algorithm. This file is included in the zip file (for more information, see the markdown in the vignettes) located within the <a href="https://osf.io/mfcvb" style="color: purple;">dependency-bundle zip</a>, which can be found in our supplementary [OSF repository mirror](https://osf.io/akuxv/).
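
For readers unfamiliar with the measure, the cosine similarity mentioned above can be sketched in base R. This is only an illustration of the formula, not the code inside `get_semantic_similarity()`; the helper name `cosine_similarity()` and the toy vectors are hypothetical and do not come from the word2vec model.

```r
# Cosine similarity between two numeric vectors (base R only).
# Hypothetical helper for illustration; not part of sunflower.
cosine_similarity <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == length(b))
  norm_a <- sqrt(sum(a^2))
  norm_b <- sqrt(sum(b^2))
  # A zero vector has no direction, so the similarity is undefined.
  if (norm_a == 0 || norm_b == 0) return(NA_real_)
  sum(a * b) / (norm_a * norm_b)
}

# Toy 3-dimensional "embeddings" (made-up values).
emb_item     <- c(1, 2, 3)
emb_response <- c(2, 4, 6)   # same direction as emb_item
emb_other    <- c(3, 0, -1)  # orthogonal to emb_item

sim_parallel   <- cosine_similarity(emb_item, emb_response)  # 1, up to floating-point error
sim_orthogonal <- cosine_similarity(emb_item, emb_other)     # 0
```

Real embeddings have hundreds of dimensions, and a word missing from the model has no vector at all, which is presumably why the similarity is computed only "if possible".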
 

diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@
 
 <!-- badges start -->
 
-![](https://img.shields.io/badge/sunflower-v._0.16.11-orange?style=flat&link=https%3A%2F%2Fgithub.com%2Fismaelgutier%2Fsunflower)
+![](https://img.shields.io/badge/sunflower-v._0.22.11-orange?style=flat&link=https%3A%2F%2Fgithub.com%2Fismaelgutier%2Fsunflower)
 [![License: GPL
 v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
 ![](https://img.shields.io/badge/Language-grey?style=flat&logo=R&color=grey&link=https%3A%2F%2Fwww.r-project.org%2F)
@@ -75,7 +75,7 @@ formal_metrics_computed = df_to_formal_metrics %>%
                           response_col = "response",
                           attempt_col = "attempt",
                           group_cols = c("ID", "item_ID"))
-#> The function get_formal_similarity() took 2.52 seconds to be executed
+#> The function get_formal_similarity() took 3.44 seconds to be executed
 ```
 
 Display some of the results from the formal quality analysis.
@@ -137,9 +137,9 @@ errors_classified = df_to_classify %>%
   get_semantic_similarity(item_col = "item", response_col = "response", model = m_w2v) %>%
   classify_errors(response_col = "response", item_col = "item",
                   access_col = "accessed", RA_col = "RA", also_classify_RAs = T)
-#> The function check_lexicality() took 0.49 seconds to be executed
+#> The function check_lexicality() took 0.54 seconds to be executed
 #> The function get_formal_similarity() took 0.68 seconds to be executed
-#> The function get_semantic_similarity() took 0.75 seconds to be executed
+#> The function get_semantic_similarity() took 0.73 seconds to be executed
 #> The function classify_errors() took 0.80 seconds to be executed
 ```
 
@@ -165,7 +165,7 @@ contextualized based on whether they come from repeated attempts or
 single productions. This process involves three steps.
 
 First, a lexicality check of the response is performed using the
-`lexicality_check()` function, which involves determining whether the
+`check_lexicality()` function, which involves determining whether the
 response is a real word. To do this, the package searches for the
 response in a database such as *BuscaPalabras*
 ([BPal](https://www.uv.es/~mperea/Davis_Perea_in_press.pdf)) and

diff --git a/vignettes/functioning_example.Rmd b/vignettes/functioning_example.Rmd
@@ -0,0 +1,172 @@
+---
+title: "Data Analysis Workflow using the Sunflower Package"
+author: "Ismael Gutiérrez-Cordero"
+date: "`r Sys.Date()`"
+output: 
+  html_document:
+    toc: true
+    toc_float: true
+    number_sections: true
+    theme: cerulean
+---
+
+```{r setup, include=FALSE}
+
+knitr::opts_chunk$set(echo = TRUE)
+
+```
+
+In this vignette, we present a practical example of using the *sunflower* package to work with datasets in which a single response column holds multiple answers. We demonstrate how to convert the dataset into a long format to obtain formal similarity metrics. Additionally, we illustrate how to perform error classification based on classical criteria found in the literature (e.g., [Dell et al., 1997](https://doi.org/10.1037/0033-295x.104.4.801); [Gold & Kertesz, 2001](https://doi.org/10.1006/brln.2000.2441); see also [García-Orza et al., 2020](https://doi.org/10.1016/j.cortex.2020.03.020)).
+
+# Environment Setup
+
+```{r}
+# Clear the workspace and unload all packages
+#rm(list = ls())
+#invisible(lapply(paste("package:", names(sessionInfo()$otherPkgs), sep = ""),
+#                 detach, character.only = TRUE, unload = TRUE))
+
+# Install and load `devtools` package
+if (!requireNamespace("devtools", quietly = TRUE)) {
+  install.packages("devtools")
+}
+library(devtools)
+
+# Install RTools on Windows (if applicable)
+# Visit: https://cran.rstudio.com/bin/windows/Rtools/ for installation.
+# Probably not needed on macOS or Linux (untested: the author does not use those systems)
+```
+
+# Install and Load Required Packages
+
+```{r}
+# Install additional packages if needed
+possible_dependencies <- c("tidyverse", "htmlTable", "knitr", "kableExtra")
+for (pkg in possible_dependencies) {
+  if (!requireNamespace(pkg, quietly = TRUE)) {
+    install.packages(pkg)
+  }
+}
+
+# Install the `xfun` package (if necessary)
+if (!requireNamespace("xfun", quietly = TRUE)) {
+  install.packages("xfun", type = "source")
+}
+
+# Install the `sunflower` package from GitHub
+if (!requireNamespace("sunflower", quietly = TRUE)) {
+  devtools::install_github("ismaelgutier/sunflower")
+}
+
+# Load required libraries
+library(dplyr)      # provides %>% in case sunflower does not re-export it
+library(sunflower)
+```
+
+# Step 1
+## Load and Wrangle Data
+
+```{r}
+# Load dataset
+dataframe0 <- sunflower::IGC_sample
+
+# Separate responses
+dataframe1 <- dataframe0 %>% 
+  sunflower::separate_responses(col_name = "response",
+                                separate_with = ", ")
+
+# Extract attempts and clean blank spaces
+dataframe2 <- dataframe1 %>% 
+  sunflower::get_attempts(first_production = attempt_1, 
+                          drop_blank_spaces = TRUE)
+```
+
+# Step 2
+## Formal Similarity Analysis
+
+```{r}
+# Calculate formal similarity
+dataframe3 <- dataframe2 %>%
+    sunflower::get_formal_similarity(item_col = "item",
+                                     response_col = "response",
+                                     attempt_col = "attempt",
+                                     group_cols = c("task_item_ID"))
+```
+
+# Step 2.1
+## Positional Accuracy
+
+```{r}
+# Calculate positional accuracy
+dataframe3.1 <- dataframe3 %>% 
+    sunflower::positional_accuracy(item_col = "item", 
+                                   response_col = "response",
+                                   match_col = "adj_strict_match_pos")
+```
+
+# Step 3
+## Lexicality Check
+
+```{r}
+# Check lexicality
+dataframe4 <- dataframe3 %>%
+    sunflower::check_lexicality(item_col = "item",
+                                response_col = "response",
+                                criterion = "dictionary")
+```
+
+## Semantic Similarity Analysis
+
+```{r}
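+# Load a pre-trained word2vec model. file.choose() opens an interactive file dialog,
+# so replace it with the path to the model file when knitting non-interactively.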
+model <- word2vec::read.word2vec(file = file.choose(), normalize = FALSE)
+
+# Calculate semantic similarity
+dataframe5 <- dataframe4 %>%
+    sunflower::get_semantic_similarity(item_col = "item",
+                                       response_col = "response",
+                                       model = model)
+```
+
+## Error Classification
+
+### Classify Errors Considering Retrieval Attempts (RAs)
+
+```{r}
+dataframe6a <- dataframe5 %>%
+  dplyr::select(-correct) %>% # drop the old `correct` column (rename it instead if you want to keep it)
+  dplyr::mutate(accessed = ifelse(item == response, 1, 0)) %>%
+  sunflower::classify_errors(access_col = "accessed", 
+                  RA_col = "RA",
+                  response_col = "response", 
+                  item_col = "item",
+                  also_classify_RAs = TRUE,
+                  cosine_limit_value = 0.46)
+knitr::kable(dataframe6a) %>%
+  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
+  kableExtra::scroll_box(width = "120%", height = "500px")
+```
+
+### Classify Errors Without Considering RAs
+
+```{r}
+dataframe6b <- dataframe5 %>%
+  dplyr::select(-correct) %>% # drop the old `correct` column (rename it instead if you want to keep it)
+  dplyr::mutate(accessed = ifelse(item == response, 1, 0)) %>%
+  sunflower::classify_errors_regular(access_col = "accessed", 
+                          response_col = "response", 
+                          item_col = "item",
+                          cosine_limit_value = 0.46)
+knitr::kable(dataframe6b) %>%
+  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
+  kableExtra::scroll_box(width = "120%", height = "500px")
+
+```
+
+# Conclusion
+
+This R Markdown document provides a complete workflow for analyzing data using the sunflower package, incorporating data wrangling, similarity metrics, and error classification.
+
+# R Session Info
+```{r}
+sessionInfo()
+```
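
The wide-to-long reshaping that the vignette performs in Step 1 can be approximated in base R as follows. This is a rough sketch under stated assumptions, not the implementation of `separate_responses()` or `get_attempts()`: the toy data frame and its values are invented, and only the `", "` separator and the `ifelse(item == response, 1, 0)` flag are taken from the vignette.

```r
# Toy data: one row per item, with several attempts packed into `response`.
# (Invented values for illustration only.)
wide <- data.frame(
  item     = c("casa", "perro"),
  response = c("cata, casa", "perro"),
  stringsAsFactors = FALSE
)

# Split each response string on ", " (the separator used in the vignette).
parts <- strsplit(wide$response, ", ", fixed = TRUE)

# One row per attempt: repeat each item once per attempt and number the attempts.
long <- data.frame(
  item     = rep(wide$item, lengths(parts)),
  attempt  = sequence(lengths(parts)),
  response = unlist(parts),
  stringsAsFactors = FALSE
)

# Flag attempts that match the target, mirroring ifelse(item == response, 1, 0).
long$accessed <- ifelse(long$item == long$response, 1, 0)
long
```

Each attempt ends up on its own row, so per-attempt metrics can be computed with ordinary grouped operations.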
diff --git a/vignettes/sepex_presentation/f1_initial.png b/vignettes/sepex_presentation/f1_initial.png
diff --git a/vignettes/sepex_presentation/f1_second.png b/vignettes/sepex_presentation/f1_second.png
diff --git a/vignettes/sepex_presentation/f2.png b/vignettes/sepex_presentation/f2.png
diff --git a/vignettes/sepex_presentation/f2b.png b/vignettes/sepex_presentation/f2b.png
diff --git a/vignettes/sepex_presentation/f3.png b/vignettes/sepex_presentation/f3.png
diff --git a/vignettes/sepex_presentation/pa.png b/vignettes/sepex_presentation/pa.png