From 28673b74d66cdea68d9caa6d04b62746bd8fa56c Mon Sep 17 00:00:00 2001
From: Yao-Yuan Mao
Date: Sat, 13 Apr 2024 11:25:31 -0600
Subject: [PATCH 1/4] Update CONTRIBUTING.md

---
 CONTRIBUTING.md | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f2b27830..7afe6bdc 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,5 +1,21 @@
 # Contributing to GCRCatalogs
 
+## Preparing catalog files
+
+Consider the following when you prepare the catalog files that you plan to add to `GCRCatalogs`.
+
+- File format: We strongly recoommend that you store the file in the Apache Parquet format.
+  Both astropy and pandas support reading and writing Parquet files.
+- File partition: For large data sets, the files should be partitioned to enable parallel access.
+  Most commonly we partition the data by sky areas, but the choice of course would depend on the specific data content.
+- Data schema: Make sure the schema (including column names, types, and units) are in the same schema
+  that the users should be use (that is, no further transformation of column names, types, and units would be needed).
+
+Once you have your data files ready, the data files should be copied to a specific location on NERSC
+that all DESC members can access.
+You can do that by opening an issue at https://github.com/LSSTDESC/desc-help/issues.
+After your data files are copied, note the location as you will need it when specifying the catalog config (see below).
+
 ## Preparing a catalog reader
 
 If you are writing a new reader, please see this [guide](https://github.com/yymao/generic-catalog-reader#usage)
@@ -7,7 +23,9 @@ for an overview and an example of a minimal reader.
 The guide will explain that your reader
 must be a subclass of the `BaseGenericCatalog` parent class and that
 you will need to supply a minimum of 3 methods to specify how to read in the underlying file.
-You can also supply a translation dictionary between the native quantities in your
+The best practice is to ensure the schema (including column names, types, and units) of your data files
+is identical to what you expect the users will be using.
+However, if really needed, you can also supply a translation dictionary between the native quantities in your
 catalog and the quantities that are presented to the user via the `GCRCatalogs` interface.
 You may want to look at existing readers in this repository as additional examples.
 
@@ -24,6 +42,7 @@ Each yaml config file should specify the reader class to use and all input argum
 For example, if the reader class asks for `catalog_root_dir` as an input argument to specify
 the location of the catalog files, you need to include `catalog_root_dir` as a keyword in the
 corresponding yaml config file, and set it to the correct location.
+
 All keywords in the yaml config file will be passed to the reader class.
 
 Below is a list of required, recommended, or reserved keywords that may appear in a yaml config file.
@@ -41,6 +60,15 @@ subclass_name: .
 
 See the "Reserved Keywords" section below for more information on these keywords.
 
+### Location keywords
+
+Tyically, the location of the file (or the directory where the files are stored) is specified by one of the following keywords:
+`base_dir`, `catalog_root_dir`, `filename` (there are a few other possiblities or historic reasons).
+You should use the keyword that is consistent with what is implemented in the reader.
+
+When specifying the path for the location keyword, the path should always start with `^/`, where `^` represents the
+top-level of the shared directory. You can find what `^` will be translated to [here](https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/site_config/site_rootdir.yaml).
+
 ### Recommended keywords
 
 ```yaml

From 715d0cf58670e7e7ce2af96ed5d869fc49cfe54c Mon Sep 17 00:00:00 2001
From: Yao-Yuan Mao
Date: Mon, 15 Apr 2024 09:23:59 -0600
Subject: [PATCH 2/4] Update CONTRIBUTING.md

---
 CONTRIBUTING.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 7afe6bdc..5b98e46b 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -4,12 +4,13 @@
 
 Consider the following when you prepare the catalog files that you plan to add to `GCRCatalogs`.
 
-- File format: We strongly recoommend that you store the file in the Apache Parquet format.
+- File format: While GCRCatalogs can support any file format,
+  we strongly recoommend that the files are stored in the Apache Parquet format.
   Both astropy and pandas support reading and writing Parquet files.
 - File partition: For large data sets, the files should be partitioned to enable parallel access.
   Most commonly we partition the data by sky areas, but the choice of course would depend on the specific data content.
 - Data schema: Make sure the schema (including column names, types, and units) are in the same schema
-  that the users should be use (that is, no further transformation of column names, types, and units would be needed).
+  that the users should be using (that is, no further transformation of column names, types, and units would be needed).
 
 Once you have your data files ready, the data files should be copied to a specific location on NERSC
 that all DESC members can access.

From d9bb9e8f71cbf90ed642d26abfa3bd7c3a2043ea Mon Sep 17 00:00:00 2001
From: Yao-Yuan Mao
Date: Mon, 15 Apr 2024 09:25:35 -0600
Subject: [PATCH 3/4] Update CONTRIBUTING.md

---
 CONTRIBUTING.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 5b98e46b..f5b04b39 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -64,7 +64,7 @@ See the "Reserved Keywords" section below for more information on these keywords
 ### Location keywords
 
 Tyically, the location of the file (or the directory where the files are stored) is specified by one of the following keywords:
-`base_dir`, `catalog_root_dir`, `filename` (there are a few other possiblities or historic reasons).
+`base_dir`, `catalog_root_dir`, `filename` (there are a few other possiblities for historic reasons).
 You should use the keyword that is consistent with what is implemented in the reader.
 
 When specifying the path for the location keyword, the path should always start with `^/`, where `^` represents the

From 6d8a4f611a88c348915fa89780c4dffcffdb266b Mon Sep 17 00:00:00 2001
From: Yao-Yuan Mao
Date: Mon, 15 Apr 2024 09:26:59 -0600
Subject: [PATCH 4/4] Update CONTRIBUTING.md

---
 CONTRIBUTING.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f5b04b39..a55617f5 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -68,7 +68,8 @@ Tyically, the location of the file (or the directory where the files are stored)
 You should use the keyword that is consistent with what is implemented in the reader.
 
 When specifying the path for the location keyword, the path should always start with `^/`, where `^` represents the
-top-level of the shared directory. You can find what `^` will be translated to [here](https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/site_config/site_rootdir.yaml).
+top level of the shared directory. You can find what `^` will be translated to in
+[`site_rootdir.yaml`](https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/site_config/site_rootdir.yaml).
 
 ### Recommended keywords
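
Reviewer note on the `^/` convention documented in this series: the Python sketch below shows one way a leading `^/` could be expanded against a site root directory. It is an illustration only, not the GCRCatalogs implementation. The function name `resolve_catalog_path`, the root directory `/shared/root`, and the catalog path `my_catalog/v1` are all hypothetical; the actual value of `^` is whatever `site_rootdir.yaml` defines for the site.

```python
from pathlib import PurePosixPath


def resolve_catalog_path(path: str, root_dir: str) -> str:
    """Expand a leading `^/` into a site root directory (hypothetical helper)."""
    if path.startswith("^/"):
        # Join the part after `^/` onto the site root directory.
        return str(PurePosixPath(root_dir) / path[2:])
    # In this sketch, paths without the `^/` prefix are returned unchanged.
    return path


# Hypothetical root directory and catalog location, for illustration only.
print(resolve_catalog_path("^/my_catalog/v1", "/shared/root"))
```

In a yaml config, the location keyword used by the reader (for example `catalog_root_dir`) would then hold the `^/`-prefixed form, so the same config can be used wherever the root directory differs.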