CORE: Put the main epilog in a .txt file and track it
AbdouSeck committed Dec 23, 2023
1 parent bf54afb commit 71a14da
Showing 2 changed files with 184 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -150,6 +150,7 @@ dead_letters/
!requirements_dev.txt
!simeon/scripts/data
!simeon/scripts/data/*.csv
!simeon/scripts/data/*.txt

# Simeon configs
*.cfg
183 changes: 183 additions & 0 deletions simeon/scripts/data/simeon_epilog.txt
@@ -0,0 +1,183 @@
RETURN CODES:
simeon returns either 0 or 1, depending on whether an error was encountered.
If any error is encountered with any of the subcommands, 1 is returned.
For simeon list and simeon download, if nothing is listed or downloaded, then 1 is returned.
For simeon split and simeon push, if nothing ends up being split or pushed, then 1 is returned.
For simeon report, if any error is encountered while running the queries, then 1 is returned.
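
These return codes make simeon easy to chain in shell scripts. A minimal sketch (the run_step helper is a hypothetical name of ours; simeon itself is assumed to be on your PATH):

```shell
# Minimal sketch: stop a pipeline step when a simeon command returns 1.
# "run_step" is a hypothetical helper, not part of simeon.
run_step() {
    "$@" || { echo "step failed: $*" >&2; return 1; }
}

# Example usage (commented out; requires simeon and edX credentials):
# run_step simeon download -s edx -o mitx -f sql -L -d data/ || exit 1
```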

SETUP and CONFIGURATIONS:
simeon is a glorified set of downloader and uploader scripts. Much of the downloading and uploading it does assumes that your AWS credentials are
configured properly and that you have a service account file for GCP services available on your machine. If the latter is missing, you may have to
authenticate to GCP services through the SDK. However, both we and Google recommend using service accounts.

Every downloaded file is decrypted either during the download process or while it is split by the simeon split command. This tool therefore assumes
that you have installed and configured gpg so it can decrypt files from edX.
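
Getting gpg ready typically looks like the following sketch (the key file name is a placeholder for whatever key material edX issued to your data czar):

```shell
# Import the data czar's private key so gpg can decrypt edX files
# (the file name below is a placeholder)
gpg --import data-czar-private-key.gpg
# Confirm the secret key is now available
gpg --list-secret-keys
```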

The following steps may be useful to someone just getting started with the edX data package:

1. Credentials from edX

o Reach out to edX to get your data czar credentials

o Configure both AWS and gpg, so your credentials can access the S3 buckets and your gpg key can decrypt the files there

2. Setup a GCP project

o Create a GCP project

o Set up a BigQuery workspace

o Create a GCS bucket

o Create a service account and download the associated file

o Give the service account Admin Role access to both the BigQuery project and the GCS bucket

If the above steps are carried out successfully, then you should be able to use simeon without any issues.
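
The GCP half of those steps can be sketched with the gcloud and gsutil CLIs. The project ID, bucket name, and service account name below are placeholders, and the roles you grant may be narrower than Admin in your setup:

```shell
# Sketch of step 2 using the gcloud/gsutil CLIs (all names are placeholders)
gcloud projects create my-simeon-project
gsutil mb -p my-simeon-project gs://my-simeon-bucket
gcloud iam service-accounts create simeon-sa --project my-simeon-project
gcloud iam service-accounts keys create service_account_file.json \
    --iam-account simeon-sa@my-simeon-project.iam.gserviceaccount.com
# Grant the account admin access to BigQuery and to the bucket
gcloud projects add-iam-policy-binding my-simeon-project \
    --member serviceAccount:simeon-sa@my-simeon-project.iam.gserviceaccount.com \
    --role roles/bigquery.admin
gsutil iam ch \
    serviceAccount:simeon-sa@my-simeon-project.iam.gserviceaccount.com:roles/storage.admin \
    gs://my-simeon-bucket
```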

However, if you have taken care of the above steps but are still unable to get simeon to work, please open an issue.

Further, simeon can parse INI-formatted configuration files. By default, it looks for config files in the user's home directory and in the current
working directory of the running process. The base names searched are simeon.cfg, .simeon.cfg, simeon.ini, and .simeon.ini.
You can also point simeon at a config file with the global option --config-file (or -C), giving it the path to the file with the corresponding configurations.

The following is a sample file content:

# Default section for things like the organization whose data package is processed
# You can also set a default site as one of the following: edx, edge, patches
[DEFAULT]
site = edx
org = yourorganizationx
clistings_file = /path/to/file/with/course_ids

# Section related to Google Cloud (project, bucket, service account)
[GCP]
project = your-gcp-project-id
bucket = your-gcs-bucket
service_account_file = /path/to/a/service_account_file.json
wait_for_loads = True
geo_table = your-gcp-project.geocode_latest.geoip
youtube_table = your-gcp-project.videos.youtube
youtube_token = your-YouTube-API-token

# Section related to the AWS credentials needed to download data from S3
[AWS]
aws_cred_file = ~/.aws/credentials
profile_name = default

The options in the config file(s) should match the optional arguments of the CLI tool. For instance, the --service-account-file, --project and
--bucket options can be provided under the GCP section of the config file as service_account_file, project and bucket, respectively. Similarly, the
--site and --org options can be provided under the DEFAULT section as site and org, respectively.


EXAMPLES:
List files
simeon can list files on S3 for your organization based on criteria like file type (sql or log or email), time intervals (begin and end dates),
and site (edx or edge or patches).
# List the latest SQL data dump
simeon list -s edx -o mitx -f sql -L
# List the latest email data dump
simeon list -s edx -o mitx -f email -L
# List the latest tracking log file
simeon list -s edx -o mitx -f log -L

Download and split files
simeon can download, decrypt and split up files into folders belonging to specific courses.

o Example 1: Download, split and push SQL bundles to both GCS and BigQuery

# Download the latest SQL data dump
simeon download -s edx -o mitx -f sql -L -d data/

# Download SQL bundles dumped any time since 2021-01-01 and
# extract the contents for course ID MITx/12.3x/1T2021.
# Place the downloaded files in data/ and the output of the split in data/SQL
simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql -b 2021-01-01 -d data -S -D data/SQL/

# Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021
simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/SQL/MITx__12_3x__1T2021

# Push the files to BigQuery and wait for the jobs to finish
# Using -s or --use-storage tells BigQuery to extract the files
# to be loaded from Google Cloud Storage.
# So, use the option when you've already called simeon push gcs
simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/SQL/MITx__12_3x__1T2021

o Example 2: Download, split and push tracking logs to both GCS and BigQuery

# Download the latest tracking log file
simeon download -s edx -o mitx -f log -L -d data/

# Download tracking logs dumped any time since 2021-01-01
# and extract the contents for course ID MITx/12.3x/1T2021
# Place the downloaded files in data/ and the output of the split in data/TRACKING_LOGS
simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/

# Push to GCS the split up tracking log files inside
# data/TRACKING_LOGS/MITx__12_3x__1T2021
simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021

# Push the files to BigQuery and wait for the jobs to finish
# Using -s or --use-storage tells BigQuery to extract the files
# to be loaded from Google Cloud Storage.
# So, use the option when you've already called simeon push gcs
simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021

o If you have already downloaded SQL bundles or tracking log files, you can use simeon split to split them up.

Make secondary/aggregated tables
simeon can generate secondary tables based on already loaded data. Call simeon report --help for the expected positional and optional arguments.

o Example: Make person_course for course ID MITx/12.3x/1T2021

# Make a person course table for course ID MITx/12.3x/1T2021
# Provide the -g option to give a geolocation BigQuery table
# to fill the ip-to-location details in the generated person course table
COURSE=MITx/12.3x/1T2021
simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE}


NOTES:
1. Please note that SQL bundles are quite large when split up, so consider using the -c or --courses option when invoking simeon download -S or
simeon split to make sure that you limit the splitting to a set of course IDs. The `--clistings-file` option is an alternative to `--courses`.
It expects a text file with one course ID per line.
If those options are not used, simeon may fail to complete the split operation
due to exhausted system resources (storage, specifically).
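
As an illustration, a clistings file is just a plain-text list, one course ID per line (the course IDs and file name below are made up):

```shell
# Create a clistings file: one course ID per line (IDs here are examples)
printf '%s\n' 'MITx/12.3x/1T2021' 'MITx/6.002x/2T2021' > course_ids.txt
cat course_ids.txt
# Then limit a split to those courses (commented out; requires simeon and data):
# simeon split --file-type sql --clistings-file course_ids.txt data/*.zip
```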

2. simeon download with file types log and email will both download and decrypt the files matching the given criteria. If those operations succeed,
then the encrypted files are deleted by default, so that you don't exhaust storage resources. If you wish to keep those files, you can always use
the --keep-encrypted option that comes with simeon download and simeon split. SQL bundles are only downloaded (not decrypted);
their decryption happens during a split operation.

3. Unless there is an unhandled exception (which should be reported as a bug), simeon should, by default, print to the standard output both information
and errors encountered while processing your files. You can capture those logs in a file by using the global option --log-file and providing
a destination file for the logs.

4. When using multi-argument options like --tables or --courses, avoid placing them right before the expected positional arguments.
This helps the CLI parser not confuse your positional arguments with table names (in the case of --tables) or course IDs (when --courses is used).

5. Splitting tracking logs is a resource-intensive process. The routine that splits the logs generates a file for each course ID encountered. If your
logs contain more course IDs than the number of operating system file descriptors the running process may open, then simeon will put away records
it cannot save to disk for a second pass. Putting away the records uses more memory than normally required. The second pass only
requires one file descriptor at a time, so it should be safe in terms of file descriptor limits. To help simeon avoid a second pass, you
may increase the file descriptor limit of processes from your shell by running something like ulimit -n 2000 before calling simeon split on Unix
machines. For Windows users, you may have to dig into the Windows Registry for a corresponding setting. This tells your OS kernel to allow
processes to open up to 2000 file handles.
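
On a Unix machine, checking and raising the limit looks like this (2000 is only an example figure; raising the soft limit above the hard limit will fail):

```shell
# Show the current soft limit on open file descriptors
ulimit -n
# Raise it for this shell and its children (fails if above the hard limit)
ulimit -n 2000 2>/dev/null || echo "could not raise the limit" >&2
ulimit -n
```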

6. Care must be taken when using simeon split and simeon push to make sure that the number of positional arguments passed does not lead to the
invoked command exceeding the maximum command-line length allowed for arguments in a command. To avoid errors along those lines, please consider
passing the positional arguments as UNIX glob patterns. For instance, simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz' tells simeon to
expand the given glob pattern, instead of relying on the shell to do it.
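
The difference is easy to see in a shell (the directory layout below is fabricated for the demonstration):

```shell
# Fabricated layout for demonstration
mkdir -p data/TRACKING-LOGS/MITx__12_3x__1T2021
touch data/TRACKING-LOGS/MITx__12_3x__1T2021/a.log.gz \
      data/TRACKING-LOGS/MITx__12_3x__1T2021/b.log.gz
# Unquoted: the shell expands the glob into one argument per file,
# which can exceed the command-line length limit for large datasets
echo data/TRACKING-LOGS/*/*.log.gz
# Quoted: a single pattern argument is passed, and simeon expands it itself
echo 'data/TRACKING-LOGS/*/*.log.gz'
```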

7. The report subcommand relies on the presence of SQL query files to parse and send to BigQuery to execute. Any errors arising from executing the parsed
queries will be shown to the end user through the given log stream. While the simeon tool ships with query files for most secondary/reporting tables
that are based on the edx2bigquery tool, an end user should be able to point simeon to a different location with SQL query files by using
the --query-dir option that comes with simeon report. Additionally, these query files can contain jinja2 templated SQL code.
Any mentioned variables within these templated queries can be passed to simeon report by using the --extra-args option and passing key-value pair items
in the format var1=value1,var2=value2,var3=value3,...,var_n=value_n. Further, these key-value pair items can also be typed by using the format
var1:i=value1,var2:s=value2,var3:f=value3,...,var_n:s=value_n. In this format, the type is appended to the key, separated by a colon.
The only supported scalar types, so far, are s for str, i for int, and f for float. If any conversion errors occur during value parsing,
then those are shown to the end user, and the query won't get executed. Finally, if you wish to pass an array or list to the template,
you will need to repeat a key multiple times. For instance, if you want to pass a list named mylist containing the integers 1, 2, and 3,
you could write something like --extra-args mylist:i=1,mylist:i=2,mylist:i=3. This means that you'll have a Python list named
mylist within your template, and it should contain [1, 2, 3]. You can also pass a JSON file whose top-level objects are parsed as variables. Use a leading @ when passing a JSON file.
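
For instance, a templated query file and a matching invocation might look like the following sketch (the file name, table names, and variable names are all invented for the example):

```shell
# Write a hypothetical jinja2-templated query into a custom query directory
mkdir -p my_queries
cat > my_queries/example_query.sql <<'EOF'
SELECT user_id, course_id
FROM `{{ project }}.{{ dataset }}.person_course`
WHERE grade_bucket IN ({{ mylist | join(', ') }})
EOF
# Invocation sketch (commented out; requires simeon and BigQuery access):
# simeon report --query-dir my_queries \
#     --extra-args project:s=my-project,dataset:s=latest,mylist:i=1,mylist:i=2,mylist:i=3 ...
```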
