Testing rsdmx with Canada Statistics #47

Closed
eblondel opened this issue Mar 23, 2015 · 10 comments

@eblondel
Member

Testing rsdmx with Canada Statistics. Examples are provided here.

These tests aim to support a request sent to the rsdmx mailing list, and to identify and fix potential issues in the code.

Note: Canada Statistics is a useful use case for rsdmx. It shows that not all data providers offer an SDMX web-service API, and that many SDMX resources come as downloaded files, hence the value of rsdmx being able to read local SDMX files.

@eblondel eblondel self-assigned this Mar 23, 2015
@eblondel eblondel added this to the 0.4 milestone Mar 23, 2015
@eblondel
Member Author

The Canada Statistics portal provides a download facility in SDMX-ML format. The download is a zip file containing two XML files: one with the GenericData dataset, the other with the DataStructure.
Two (minor) issues were identified with the current code:

  • Datasets are correctly converted into R sdmx objects, but an error is raised when using the as.data.frame method to convert the dataset into a data.frame (very minor bug, to be fixed ASAP)
  • An error is raised when trying to convert the DataStructure XML file into an R sdmx object

@eblondel eblondel added the bug label Mar 23, 2015
@nordicgnome

Hello Emmanuel;

I installed the package with update.packages("rsdmx")
I ran the code that you suggested:
library(rsdmx)
setwd(location of file)
sdmx <- readSDMX("GenericAbPop.xml", isURL = FALSE)
I did this the first time from within RStudio and it completely locked up the system. I thought this was an anomaly because I was running an emerge -auv @world (Gentoo, HP DV6 quad-core laptop) at the same time, and that I had possibly overloaded the system. So I ran it again in RStudio without a mess of other processes, and the system locked up again.

So I rebooted, ran it from the command line without any windowing system running (I normally run KDE 4.14.3) and left it overnight. I came back this morning and it had returned to the command prompt with the error message "Killed". How do I turn on more comprehensive error messaging?
GenericAbPop.xml is 6.8GB and StructureAbPop.xml is 123.6KiB

Jan

@eblondel
Member Author

Hello, about the bugs I highlighted above: I've solved the first one (it was a minor bug), but I still need to commit the fix to the code repository (I work on this on a voluntary basis, so I have to do it after work). With this, reading the data as a data.frame will be operational.
Once that is done, I will share an example.

After that, I will look closely at the DataStructure issue.

That said, the datasets provided by Canada Statistics are big files. Parsing them logically takes a lot of time (there is still room to improve performance), and above all it requires memory. rsdmx currently relies on XPath to read the XML file, which means the whole XML tree is loaded into R, and memory use doubles once you convert it to a data.frame.
I've investigated an important enhancement in which rsdmx would support a SAX parser: the XML would not be loaded into R, while the object-oriented rsdmx model, mapped to the SDMX standard model, would be maintained.
This SAX method would be especially needed for reading big datasets (to avoid memory issues). The enhancement is in preparation, but requires sponsoring / funding given the amount of work. See #36
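
To illustrate the difference, here is a minimal sketch of event-based (SAX) parsing using xmlEventParse from the XML package. It is not rsdmx functionality: the toy document, the count_obs helper and the un-prefixed Obs element name are assumptions made for the example. The point is only that the handler sees one element at a time, so no XML tree is built in memory.

```r
library(XML)  # assumption: the XML package is installed

# Toy document standing in for a large SDMX-ML file (hypothetical content)
doc <- paste0(
  "<DataSet>",
  "<Series><Obs value='1'/><Obs value='2'/></Series>",
  "<Series><Obs value='3'/></Series>",
  "</DataSet>"
)

# Hypothetical helper: count <Obs> elements without building the XML tree
count_obs <- function(xml_text) {
  n <- 0L
  handlers <- list(
    # called once per opening tag as the parser streams through the input
    startElement = function(name, attrs, ...) {
      if (name == "Obs") n <<- n + 1L
    }
  )
  xmlEventParse(xml_text, handlers = handlers, asText = TRUE)
  n
}

n_obs <- count_obs(doc)
```

For a file on disk, the same handlers can be passed with a file path and without asText = TRUE. Real SDMX-ML files typically use namespace prefixes on element names, so the name test would need adjusting.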

By the way, i will also test against huge datasets.

eblondel added a commit that referenced this issue Mar 24, 2015
@eblondel
Member Author

@nordicgnome I've pushed the first bug fix (dealing with the dataset).
For testing, you will need to install rsdmx from GitHub. Please follow the instructions in the wiki.

The sample code is as follows:

require(rsdmx)
sdmx <- readSDMX("myfile.xml", isURL = FALSE)
sdmx.df <- as.data.frame(sdmx)

I've tested it on a smaller dataset (a file of ~50 MB): it works, but it takes about 20 min for a dataset of more than 127,000 records. I will open a separate ticket to investigate performance gains (processing time). Canada Statistics datasets will be a good test case.

Once I have a bit more time, I will look into the second fix. In the meantime, your feedback is welcome.

eblondel added a commit that referenced this issue Mar 24, 2015
@eblondel
Member Author

@nordicgnome the second minor bug has been fixed. DataStructureDefinition files from Canada Statistics are now properly read in R. For an example, you can follow the one provided in the wiki, except that you will need to use isURL = FALSE in readSDMX.

Note that following these fixes, I've opened two tickets that I will investigate further: one dealing with codelist content and encoding (see #48), and, as mentioned above, one about as.data.frame performance (see #49).

Your feedback is welcome.

@ghawkins-ott

ghawkins-ott commented Dec 13, 2017

I'm having an issue with DataStructures and I'm not sure what is going on. I'm using the most current version of rsdmx (0.5-10) and I can't read the StatsCan structure data.

I'm downloading this file: http://www12.statcan.gc.ca/nhs-enm/2011/dp-pd/dt-td/OpenDataDownload.cfm?PID=105470

Then I drop it into my RStudio Server and follow the instructions on the wiki (e.g. sdmx <- readSDMX(sdmx_files[2], isURL = FALSE)). When I then try to read that into a data.frame, I get the following error:

Error in as.data.frame.default(sdmx) :
cannot coerce class "structure("SDMXDataStructureDefinition", package = "rsdmx")" to a data.frame

Thoughts? I can read the data file but I don't get any of the codes mapped in that case.

@eblondel
Member Author

@ghawkins-ott an SDMX DataStructureDefinition can't be read as a data.frame because it's a complex object (meaning it includes several subparts, such as codelists and concepts, that can then be read individually as data.frames).

To extract codelists and concepts from the DSD, and read them as data.frame you can look at DSD example in https://github.com/opensdmx/rsdmx/wiki#sdmx-datastructuredefinition-dsd
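
A minimal sketch of that extraction for a locally downloaded DSD file, based on the pattern shown in the wiki. The dsd_parts helper and its arguments are hypothetical names introduced for illustration; the codelist identifier has to be one that actually exists in your DSD.

```r
# Hypothetical helper: read a local DSD file and return two of its
# subparts as data.frames (assumes the rsdmx package is installed)
dsd_parts <- function(path, codelist_id) {
  dsd <- rsdmx::readSDMX(path, isURL = FALSE)
  list(
    # all concepts declared in the DSD
    concepts = as.data.frame(methods::slot(dsd, "concepts")),
    # a single codelist, selected by its identifier
    codelist = as.data.frame(methods::slot(dsd, "codelists"),
                             codelistId = codelist_id)
  )
}
```

It would be called as dsd_parts("path/to/structure.xml", "SOME_CODELIST_ID"), where both values are placeholders.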

@ghawkins-ott

@eblondel Thank you! I see the codelists now. I am still struggling with the concept of how to apply them to the data file. For example, I'd like a data frame that would display the code value, (eg: "Female" instead of "2" in the Sex column)...

Sorry, I'm fairly new to this.

@eblondel
Member Author

eblondel commented Jan 10, 2018

@ghawkins-ott No need to apologize. What you need (code labels instead of code values) is supported by rsdmx in a very easy way, by associating the corresponding DSD (data structure definition) with the dataset. For SDMX files downloaded manually (without a proper SDMX web-service), which is the case for Canada Statistics, there is one line of code to write to associate the DSD with the dataset, using the function setDSD. The code below should save you some time:

require(rsdmx)

# read DSD
dsd <- readSDMX("Structure_99-010-X2011027.xml", isURL = FALSE)

# read dataset
data <- readSDMX("Generic_99-010-X2011027.xml", isURL = FALSE)

# associate the DSD to the dataset
data <- setDSD(data, dsd)

# because you associated the DSD, you can now apply labels = TRUE
df <- as.data.frame(data, labels = TRUE)

Hope this helps
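
Conceptually, getting "Female" instead of "2" comes down to looking up each code in its codelist. The base R sketch below shows that lookup on made-up data; it is not the rsdmx implementation, and the column names (id, label, Sex, obsValue) are assumptions for the example.

```r
# Hypothetical codelist, shaped like a codelist read as a data.frame
cl_sex <- data.frame(
  id = c("1", "2"),
  label = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical dataset holding coded values
df <- data.frame(
  Sex = c("2", "1", "2"),
  obsValue = c(10, 12, 7),
  stringsAsFactors = FALSE
)

# find each code's position in the codelist, then take the matching label
df$Sex_label <- cl_sex$label[match(df$Sex, cl_sex$id)]
```

Codes that are missing from the codelist come out as NA, which makes gaps in the codelist easy to spot.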

@ghawkins-ott

@eblondel Perfect, thank you so much!


3 participants