Import of SAS .xpt files into R. Generation of a Data Definition Table (DDT) in SAS.
Data checks regarding values in data base vs. format definitions/value labels.
The goal of this SAS program is to import SAS .xpt files by generating automatically an executable .R script file that generates R data frames - step by step - with variable factors, levels, labels and R date/times.
For importing of .xpt files two steps must be followed:
- Run SAS program sasxpt_r.sas in SAS
- Run impsas_xpt.R (generated by the SAS program) in R or RStudio
The results are:
- A Data Definition Table (SAS Output)
- An .R script file that contains R statements/functions to import .xpt files in R data frames including variable factors, levels, labels and R date/times
SAS on Windows, R or RStudio
Modify the SAS program sasxpt_r.sas
Enter manually:
- Directory where SAS .xpt files are stored (e.g. C:\Temp\R)
filename dir_list pipe 'dir "C:\Temp\R" /a:-d-h-s /b';
- R working directory (the same directory as in 1. but with slash /)
%let RSETWD=C:/Temp/R/;
- Create and specify a Subdirectory where SAS .xpt files will be converted temporary to .sas7bdat files
libname tmpsas "C:\Temp\R\SAS_TEMP";
- Specify the location of the SAS program with proc format statements (alternatively copy SAS format catalog to 'work.formats')
%include "C:\Temp\R\Formats01.sas";
Afterwards:
-> Run program sasxpt_r.sas in SAS
-> Open impsas_xpt.R (generated by the SAS program) in R or RStudio
-> Set encoding to UTF-8 (e.g. RStudio -> File -> Reopen with Encoding -> UTF-8) and run the script file
Following SAS output (example) will be automatically generated by the SAS program:
Member Column Column
Name Name Column Label Type Format Decode
DM STUDYID Study Identifier char $13.
DOMAIN Domain Abbreviation char $2.
USUBJID Unique Subject Identifier char $20.
SUBJID Subject Identifier for the Study char $3.
RFSTDTC Subject Reference Start Date/Time char $16.
RFENDTC Subject Reference End Date/Time char $10.
RFXSTDTC Date/Time of First Study Treatment char $16.
RFXENDTC Date/Time of Last Study Treatment char $16.
RFICDTC Date/Time of Informed Consent char $10.
RFPENDTC Date/Time of End of Participation char $10.
DTHDTC Date/Time of Death char $10.
DTHFL Subject Death Flag char $YES. Y = Yes
SITEID Study Site Identifier char $2.
INVNAM Investigator Name char $28.
AGE Age num BEST8.
AGEU Age Units char $5.
SEX Sex char $SEX. F = Female, M = Male
RACE Race char $41.
ARMCD Planned Arm Code char $20.
ARM Description of Planned Arm char $11.
TESTDTM Datetime of (Test) dtim DATETIME16.
TESTDT Date of (Test) date DATE9.
TESTTM Time of (Test) time TIME6.
a.s.o.
Following .R script file (example) will be automatically generated by the SAS program:
#### Set R working directory
setwd("C:/Temp/R/")
#### Use install.packages("foreign"), install.packages("Hmisc"), install.packages("chron") if necessary
library(foreign)
library(Hmisc)
library(chron)
#### Read all .xpt files from working directory. Derive date/times. Set factor, levels, labels.
DM <- read.xport("DM.xpt")
DS <- read.xport("DS.xpt")
TA <- read.xport("TA.xpt")
# a.s.o.;
DM$TESTDTM <- as.POSIXct(DM$TESTDTM, tz = "GMT", origin = "1960-01-01")
DM$TESTDT <- as.Date(DM$TESTDT, origin = "1960-01-01")
DM$TESTTM <- as.POSIXct(DM$TESTTM, tz = "GMT", origin = "0001-01-01")
DM$TESTTM <- times(format(DM$TESTTM, "%H:%M:%S")) # chron times class
# a.s.o.;
label(DM$STUDYID) <- "Study Identifier"
label(DM$DOMAIN) <- "Domain Abbreviation"
label(DM$USUBJID) <- "Unique Subject Identifier"
label(DM$SUBJID) <- "Subject Identifier for the Study"
label(DM$RFSTDTC) <- "Subject Reference Start Date/Time"
label(DM$RFENDTC) <- "Subject Reference End Date/Time"
label(DM$RFXSTDTC) <- "Date/Time of First Study Treatment"
label(DM$RFXENDTC) <- "Date/Time of Last Study Treatment"
label(DM$RFICDTC) <- "Date/Time of Informed Consent"
label(DM$RFPENDTC) <- "Date/Time of End of Participation"
label(DM$DTHDTC) <- "Date/Time of Death"
DM$DTHFL <- factor(DM$DTHFL, c("Y"), exclude = "")
levels(DM$DTHFL) <- c("Yes")
label(DM$DTHFL) <- "Subject Death Flag"
label(DM$SITEID) <- "Study Site Identifier"
label(DM$INVNAM) <- "Investigator Name"
label(DM$AGE) <- "Age"
label(DM$AGEU) <- "Age Units"
DM$SEX <- factor(DM$SEX, c("F", "M"), exclude = "")
levels(DM$SEX) <- c("Female", "Male")
label(DM$SEX) <- "Sex"
label(DM$RACE) <- "Race"
label(DM$ARMCD) <- "Planned Arm Code"
label(DM$ARM) <- "Description of Planned Arm"
label(DM$ACTARMCD) <- "Actual Arm Code"
label(DM$TESTDTM) <- "Datetime of (Test)"
label(DM$TESTDT) <- "Date of (Test)"
label(DM$TESTTM) <- "Time of (Test)"
# a.s.o.;
After running this .R script file in R or RStudio -> look for warnings (there should not be any) -> finished
Note: Beside the R base functions (as.POSIXct, as.Date, factor, levels and format) only the functions foreign::read.xport, Hmisc::label and chron::times will be used for generating R data frames.
Note: There is an alternative - very much shorter - method to import SAS .xpt files into R, e.g. using the function Hmisc::sasxport.get:
library(foreign)
# Formats exported in SAS with 'proc format cntlout=format; run; quit;'
form <- read.xport("format.xpt")
library(Hmisc)
# Convert .xpt files to R data frames with formats, levels, labels and R date/times;
DM <- sasxport.get("DM.xpt", formats = form, lowernames = FALSE, allow = "_")
Test platform: SAS 9.4 windows 7 64-bit, R 3.3.2
See test data in .../testsas
Only "simple" user-defined SAS formats (numeric or character) can be used, e.g.:
********** Definition of SAS Formats in library 'work';
proc format;
value $NY (notsorted)
"N"="No"
"Y"="Yes"
"U"="Unknown";
valee $YES "Y"="Yes";
value $SEX "F"="Female"
"M"="Male";
value AESEV 1="MILD"
2="MODERATE"
3="SEVERE";
run;
quit;
Note that the notsorted option is very useful if the rank order should be maintained (instead of an alphabetic order), here No, Yes, Unknown
Formats like e.g. 30 - 50 = 'Between 30 and 50', or picture formats cannot be used.
SAS XPORT files should be created by the COPY Procedure in SAS Version 5, Version 6 and higher of the SAS Software.
In case that the .xpt files cannot be loaded, modify the statements in macro %ImpDset for loading the .xpt files (just as you usually do):
********** Read .xpt Files and Copy to .sas7bdat files (library 'tmpsas');
%macro ImpDset(XPTFILE);
%local File;
%let File = &RSETWD.&XPTFILE&..xpt;
;/********** Library File */
libname xportin xport "&File";
;/********** .xpt File to .sas7bdat File */
proc copy in=xportin out=tmpsas;
select &XPTFILE;
run;
libname xportin clear;
%mend ImpDset;
Or, for creating .xpt files use code similar to:
%let RSETWD=C:/Temp/R/;
%macro WriteXPT(INDB);
;/********** Library for output must be defined */
libname xportout xport "&RSETWD.&INDB..xpt";
;/********** Export to x port File */
proc copy in=work out=xportout memtype=data;
select &INDB;
run;
libname xportout clear;
%mend WriteXPT;
%WriteXPT(DM);
or:
%let SASWD=C:\Temp\R\;
%macro WriteXPT(INDB, MYLABEL=);
;/********** Set Dataset Label to File Name as default*/
%if &MYLABEL= %then %let MYLABEL="&INDB";
;/********** Library for output must be defined */
libname xportout xport "&SASWD&INDB..xpt";
;/********** Export to x port File */
data xportout.&INDB (label=&MYLABEL);
set &INDB;
run;
libname xportout clear;
%mend WriteXPT;
%WriteXPT(DS, MYLABEL="Disposition");
When you use date or datetime formats other than 'DATE', 'DDMMYY', 'MMDDYY', 'YYMMDD', 'E8601DT', 'EURDFD' , 'TIME', 'HHMM', 'DATETIME', 'DATEAMPM', 'EURDFDT' extend the SAS code in:
********** Identify Dates (extend code, if neccessary);
if index(upcase(FORMAT), 'DATE') gt 0 then TYP="date";
if index(upcase(FORMAT), 'DDMMYY') gt 0 then TYP="date";
if index(upcase(FORMAT), 'MMDDYY') gt 0 then TYP="date";
if index(upcase(FORMAT), 'YYMMDD') gt 0 then TYP="date";
if index(upcase(FORMAT), 'E8601DT') gt 0 then TYP="date";
if index(upcase(FORMAT), 'EURDFD') gt 0 then TYP="date"; * can be overwritten with EURDFDT (dtim);
********** Identify Time Formats;
if index(upcase(FORMAT), 'TIME') gt 0 then TYP="time";
if index(upcase(FORMAT), 'HHMM') gt 0 then TYP="time";
********** Identify DateTime Formats;
if index(upcase(FORMAT), 'DATETIME') gt 0 then TYP="dtim";
if index(upcase(FORMAT), 'DATEAMPM') gt 0 then TYP="dtim";
if index(upcase(FORMAT), 'EURDFDT') gt 0 then TYP="dtim";
Times will be converted 1st to POSIXct with origin = '0001-01-01'. Afterwards the function chron::times will be applied to convert the date/times to a variable with a times class (w/o date part). Check if this is desired (otherwise change or delete code in the SAS program, e.g. if you prefer time variables in POSIXct format).
********** Derive Time. Note: Prior to the time the date 0001-01-01 is assigned (POSIXct format);
if TYP="time" then do;
RCODE=trim(R_NAME) || " <- as.POSIXct(" || strip(R_NAME)
|| ", tz = " || quote('GMT') || ", origin = " || quote("0001-01-01") || ")";
output;
********** Delete code below or remove *times*, if you prefer POSIXct format for time variables;
RCODE=trim(R_NAME) || " <- times(format(" || strip(R_NAME)
|| ", " || quote(strip('%H:%M:%S')) || ")) # chron times class";
output;
end;
Times will be displayed as HH:MM:SS (w/o milliseconds, if any)
Recommendation: All SAS variables should have "labels". If SAS labels are missing the labels will be set (by the SAS program) to <- "" in R.
Consider also to assign 'formats' to all variables, e.g. $20., BEST12., 8., date9., datetime22.3, a.s.o., depending on the type of the variables.
There may be some values in the SAS database - where a user-defined format is assigned - but with no corresponding format definition/value label, e.g. when $SEX in 'proc format' is defined as "F", "M" but in the database also "U" (for Unknown) is present. In this case the value "U" will be set to 'NA' in R, w/o any warning.
Consider extension of definitions in SAS proc format (highly recommended!).
On the other hand, there may be SAS formats/value labels defined, but the value in the database is missing, e.g. $SEX is defined as "F", "M" and "U", but the value "U" (for Unknown) is missing in the database.
Check, if changes are necessary. Such categorization can be intended e.g. for tabulation of categories with missing values, here category Unknown with N=0.
Include or run the SAS program sasformat_chk.sas directly after sasxpt_r.sas. This results in an output of all Critical Errors and Non-Critical Discrepancies (if any).
This SAS program allows a quick transformation of many SAS .xpt files to R data frames. To implement the program only a few parameters - namely the directory where the xpt files are stored and the location of the user-defined SAS formats - have to be entered and in two steps R data frames were generated automatically - step by step - with variable factors, levels, labels and R date/times. The program provides also a Data Definition Table to see the variable properties and the decodes of the formats. Finally, a 2nd SAS program can be run to check if there are values in the database w/o corresponding value labels/format definitions (and vice versa).
Beside the R base functions only the functions foreign::read.xport, Hmisc::label and chron::times will be used for generating R data frames.