Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dates are not correctly derived (SAS vs R) #142

Closed
adchan11 opened this issue May 25, 2023 · 10 comments · Fixed by #145
Closed

Dates are not correctly derived (SAS vs R) #142

adchan11 opened this issue May 25, 2023 · 10 comments · Fixed by #145
Assignees
Labels
bug Something isn't working programming

Comments

@adchan11
Copy link
Collaborator

adchan11 commented May 25, 2023

What happened?

In R,

 Dates are stored internally as the number of days since 1970-01-01 while times are stored internally as the number of seconds since 1970-01-01``

but in SAS:

The date counter started
at zero on January 1, 1960. Any day before 1/1/1960 is a negative number, and any day
after that is a positive number. Every day at midnight, the date counter is increased by one.
The time counter runs from zero (at midnight) to 86,399.9999, when it resets to zero. The last
counter is the datetime counter. This is the number of seconds since midnight, January 1,
1960.

Sources:
https://support.sas.com/publishing/pubcat/chaps/59411.pdf
https://bookdown.org/rdpeng/rprogdatascience/dates-and-times.html

Therefore, we see a 10 year difference in dates after running the xportr functions:

image

Session Information

No response

Reproducible Example

No response

@adchan11 adchan11 added bug Something isn't working programming labels May 25, 2023
@cpiraux
Copy link
Collaborator

cpiraux commented May 26, 2023

@adchan11 Do you know which specific xportr function is causing the issue?

When I run only xportr_write, the conversion from R to SAS date is correctly handled by haven.

advs in R
image

> xportr_write(advs, "adam/advs.xpt")

advs in SAS
image

@cpiraux
Copy link
Collaborator

cpiraux commented May 26, 2023

After reviewing the screenshot of your R df, it appears that the variable's type might not being recognized as a date and is therefore not being correctly converted.

I think it is crucial to ensure accurate conversion of dates. We should verify that numeric date, time, and datetime values, variables ending with -DT, -TM, or -DTM, have the correct R type (if it is the cause of the issue) for successful conversion. Thank you for highlighting this issue.

@adchan11
Copy link
Collaborator Author

adchan11 commented May 26, 2023

Thanks @cpiraux. Here's example R code which demonstrates the issue:

library(dplyr)
library(metatools)
library(xportr)
library(metacore)
library(haven)

##################
#Create metacore
#################

doc <- read_all_sheets(metacore_example("mock_spec.xlsx"))

ds_spec <- spec_type_to_ds_spec(doc, 
                                cols = c("dataset" = "Name", 
                                         "structure" = "Data Structure",  
                                         "label" = "Label"))

var_spec <- spec_type_to_var_spec(doc, cols = c("variable" = "Variable Name",
                                                "length" = "[L|l]ength",
                                                "label" = "[L|l]abel",
                                                "type" = "[T|t]ype",
                                                "dataset" = "[D|d]ataset|[D|d]omain",
                                                "format" = "Format"))

var_spec <- var_spec %>% 
  mutate(format = if_else(str_detect(format, "\\."), format, ""))

value_spec <- spec_type_to_value_spec(doc, cols = c("dataset" = "VLM Name|Domain",
                                                    "variable" = "VLM Name|Variable Name",
                                                    "origin" = "[O|o]rigin",
                                                    "type" = "[T|t]ype",
                                                    "code_id" = "Controlled Term",
                                                    "where" = "Parameter Code",
                                                    "derivation_id" = "Method",
                                                    "predecessor" = "Method"),
                                      where_sep_sheet = FALSE)

derivation <- spec_type_to_derivations(doc, cols = c("derivation_id" = "Name",
                                                     "derivation" = "[D|d]efinition|[D|d]escription"), 
                                       var_cols = c("dataset" = "Domain Name",
                                                    "variable" = "Variable Name|VLM",
                                                    "origin" = "[O|o]rigin",
                                                    "predecessor" = "Comment",
                                                    "comment" = "Comment")) 

ds_vars<- spec_type_to_ds_vars(doc, cols = c("dataset" = "Domain",
                                             "variable" = "[V|v]ariable [N|n]ame",
                                             "order" = "[V|v]ariable [O|o]rder",
                                             "keep" = "[M|m]andatory"),
                               key_seq_cols = c("dataset" = "Domain Name",
                                                "key_seq" = "Key"),
                               sheet = "[V|v]ar|Domains")

codelist <- spec_type_to_codelist(doc, codelist_cols = c("code_id" = "Codelist Code",
                                                         "name" = "Codelist Name",
                                                         "code" = "Coded Value",
                                                         "decode" = "Decoded Value"),
                                  simplify = TRUE,
                                  dict_cols = NULL)

value_spec <- tibble::tribble(
  ~dataset, ~variable, ~type, ~sig_dig, ~origin, ~code_id, ~where, ~derivation_id,
  "ADSL", 'USUBJID', 'text', NA, 'Predecessor', '1', NA, '1',
  "ADSL", 'DMDTC', 'text', NA, 'Predecessor', NA, NA, NA,
  "ADSL", 'RFICDT', 'integer', NA, 'Derived', NA, NA, NA,
  "ADSL", 'RFICDTM', 'integer', NA, 'Derived', NA, NA, NA
)

var_spec <- tibble::tribble(
  ~variable, ~type, ~length, ~label, ~format, ~common,
  'USUBJID', 'text', 30, 'Unique Subject Identifier', NA, NA,
  'DMDTC', 'text', 25, 'Date/Time of Collection', NA, NA, 
  'RFICDT', 'integer', 8, 'Date of Informed Consent', 'DATE9.', NA, 
  'RFICDTM', 'integer', 8, 'Datetime of Informed Consent', 'Datetime20.', NA
)

ds_vars <- tibble::tribble(
  ~dataset, ~variable, ~key_seq, ~order, ~keep, ~core, ~supp_flag,
  "ADSL", 'USUBJID', 1, 1, FALSE, NA, NA, 
  "ADSL", 'DMDTC', NA, 2, FALSE, NA, NA, 
  "ADSL", 'RFICDT', NA, 3, FALSE, NA, NA, 
  "ADSL", 'RFICDTM', NA, 4, FALSE, NA, NA
)


metacore <- metacore(ds_spec, ds_vars, var_spec, value_spec,
                     derivation, codelist)

#################
#Data
#################

adsl_original <- tibble::tribble(
  ~USUBJID, ~DMDTC, ~RFICDT, ~RFICDTM,
  "test1", "2017-03-30", "2017-03-30", "2017-03-30",
  "test2", "2017-01-08", "2017-01-08", "2017-01-08"
)

adsl_original$RFICDT <- as.Date(adsl_original$RFICDT)

adsl_original$RFICDTM <- as.POSIXct(adsl_original$RFICDTM)

adsl_xpt <- adsl_original %>%
  # before converting to XPT set NA values back to blank
  convert_na_to_blanks() %>%
  xportr_type(metacore) %>% # Coerce variable type to match spec - stop if any mismatch
  xportr_length(metacore) %>% # Assigns SAS length from a variable level metadata
  xportr_label(metacore) %>% # Assigns variable label from metacore specifications
  xportr_format(metacore) %>% # Assigns variable format from metacore specifications
  xportr_df_label(metacore)%>% # Assigns dataset label from metacore specifications
  
  xportr_write("adsl.xpt") 

adsl_xpt_written <- haven::read_xpt('adsl.xpt')

you can see the differences of the final adsl_xpt_written df vs adsl_original

Note adsl_original is mock data of a dataset that I have internally.

@cpiraux
Copy link
Collaborator

cpiraux commented May 27, 2023

Thanks for your example code.

After I ran xport_type, the dates are coerced to dbl and are not correctly converted to SAS date
image

SAS:
image

I also created the xpt without running xport_type and date are correctly converted in SAS

R:
image

SAS:
image

My suggestion:

  • Inform the user when numeric date does not have a ‘date’ type
  • Not coerced date to dbl

@adchan11
Copy link
Collaborator Author

Thanks for taking a look into this.

I thought dates and datetimes are supposed to be numeric in SAS:
https://libguides.library.kent.edu/SAS/DatesTime#:~:text=Dates%2C%20times%2C%20and%20date-times%20are%20commonly%20used%20variable,dataset%2C%20SAS%20%22internally%22%20records%20datetime%20variables%20as%20integers.

so shouldn't RFICDT and RFICDTM supposed to be coerced to numeric in order to apply the formats of Date9. and Datetime20.?

@kaz462
Copy link
Collaborator

kaz462 commented May 30, 2023

Thank you both for the examples and discussion.

  • As @cpiraux mentioned above, the date conversion between R and SAS is correctly handled by haven. While the R to SAS conversion only triggers when the variable in R has the appropriate type/class, e.g., Date, POSIXct.
  • There are some issues in xportr_type() -
    1. in the following example, CHARCTER column remains as date after applying type character
    2. character is not an allowable value per Define-XML Data Type Considerations table from section 4.2.1 Page 20, should we throw a message when spec contains data type values out of this table?
    3. DATE, CHAR2DT, POSIXct2DT columns are not converted to date after applying type date from metacore, because xportr_type() converts the variable into character if the spec type is one of the values from getOption("xportr.character_types"): "character" "char" "text" "date" "posixct" "posixt"
    4. CHAR, TEXT, INTEGER are converted as expected. For INTEGER, should xportr_type be enhanced when dealing with date variables but data type = integer? e.g., if data type = integer and a date format is populated in the spec, we shouldn't do as.numeric() for this column:

      xportr/R/type.R

      Line 101 in 231e959

      else .df[[i]] <<- as.numeric(.df[[i]])
metacore <- data.frame(
  dataset = "test",
  variable = c("SUBJID", "CHARCTER", "CHAR", "TEXT", "DATE", "INTEGER", "CHAR2DT", "POSIXct2DT"),
  type = c("numeric", "character",   "char", "text", "date",   "integer", "date", "date"),
  format = c(NA, "yymmdd10.",NA, NA, NA,  NA, NA, NA)
)

.df <- tibble(
  SUBJID = c("1", "2", "3"),
  CHARCTER = as.Date(0:2, origin = "1960-01-01"),
  CHAR = as.Date(0:2, origin = "1960-01-01"),
  TEXT = as.Date(0:2, origin = "1960-01-01"),
  DATE = as.Date(0:2, origin = "1960-01-01"),
  INTEGER = as.Date(0:2, origin = "1960-01-01"),
  CHAR2DT = "1960-01-01",
  POSIXct2DT = as.POSIXct("1960-01-01")
) 

image

@cpiraux
Copy link
Collaborator

cpiraux commented May 30, 2023

I added some feedback:

  • There are some issues in xportr_type() -

    1. in the following example, CHARCTER column remains as date after applying type character
    2. character is not an allowable value per Define-XML Data Type Considerations table from section 4.2.1 Page 20, should we throw a message when spec contains data type values out of this table?

Specs file might not follow the define xml data type. Some post-processing between specs and final define.xml are possible. I suggest keeping numeric and character in Type to avoid being too restrictive.

  1. DATE, CHAR2DT, POSIXct2DT columns are not converted to date after applying type date from metacore, because xportr_type() converts the variable into character if the spec type is one of the values from getOption("xportr.character_types"): "character" "char" "text" "date" "posixct" "posixt"

Date type in metadata is character type in SAS. Date type is use for --DTC SDTM variables. More information can be found in CDISC Define-XML Specification Version 2.1 page 29.

  1. CHAR, TEXT, INTEGER are converted as expected. For INTEGER, should xportr_type be enhanced when dealing with date variables but data type = integer? e.g., if data type = integer and a date format is populated in the spec, we shouldn't do as.numeric() for this column:

    xportr/R/type.R

    Line 101 in 231e959

    else .df[[i]] <<- as.numeric(.df[[i]])

Per CDISC ADaM IG v1.2, variables with numeric date, datetime or time should have the following suffix, -- DT, --DTM or –TM. They should also be formatted to be human-readable. The numeric dates have an integer/numeric metadata type and should have the format populated.
We can either use the suffix DT, DTM and TM to detect date variables or check if the format is populated with a date format.

@cpiraux
Copy link
Collaborator

cpiraux commented May 30, 2023

I have created a diagram to have an overview of the different types:

image

@adchan11 adchan11 self-assigned this May 30, 2023
@kaz462
Copy link
Collaborator

kaz462 commented May 31, 2023

Based on @cpiraux 's comments, the above items ii and iii can be left as they are, items i and iv need further updates
Thanks, @cpiraux for the detailed comments and nice diagram!

@bms63
Copy link
Collaborator

bms63 commented Jun 3, 2023

I would like to see more diagrams like this in xportr help pages. I must admit this stuff gets really murky and being able to look at this diagram I was able to quickly understand the issue!! @cpiraux thank you for taking the time to create this!

@bms63 bms63 linked a pull request Jun 15, 2023 that will close this issue
14 tasks
bms63 added a commit that referenced this issue Jun 15, 2023
bms63 added a commit that referenced this issue Jun 15, 2023
Closes #142 issue_142_updated to account for DT, DTM, TM variables
@bms63 bms63 closed this as completed Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working programming
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants