Skip to content

Commit

Permalink
WL#15746: Encryption squashed commits
Browse files Browse the repository at this point in the history
This is a combination of 5 commits.

This is the 1st commit message:

WL#15746: TLS Enhancements for HeatWave-AutoML & Dask Comm. Upgrade

Problem:
--------
- HeatWave-AutoML communication was unauthenticated, unauthorized,
  and unencrypted.
- Dask communication utilized TCP, not aligning with FedRamp
  guidelines.

Solution:
---------
- Introduced TLS and mTLS in HeatWave-AutoML's plugin and driver for
  authentication, authorization, and encryption.
- Applied TLS to Dask to ensure authentication, encryption, and
  authorization.

Dask Authorization (OCID-based):
--------------------------------
1. For each DBsystem:
    - MySQL node sends OCIDs of authorized nodes to the head driver
      via:
        a. rapid_net_nodes
        b. rapid_net_allowed_ocids (older API, mainly for MTR tests)
    - Scenarios:
        a. All OCIDs provided: Dask authorizes.
        b. Any OCID absent: ML call fails with message.
2. During Dask worker registration to the Dask scheduler, a script is
    dispatched to the Dask worker for execution, retrieving the worker's
    OCID for authorization purposes.
    - If the OCID isn't approved, the connection is denied, terminating
      the worker and causing the ML query to fail.
3. For every Dask worker (both as listener and connector), an OCID-
    based authorization is performed post SSL/TLS connection handshake.
    The process compares the OCID from the peer's certificate against
    the allowed_ocids received from the HeatWave-AutoML MySQL plugin.

HWAML Plugin Changes:
---------------------
- Sourced certificate data and SSL setup from disk, incorporating
  SSL/TLS for HWAML.
- Reused "keystore" variable to specify disk location for
  certificate retrieval.
- Certificates and keys expected in PKCS12 format.
- Introduced "have_ml_encryption" variable (default=0).
    > Acts as a switch to explicitly deactivate HWAML network
      encryption, akin to "disable_net_encryption" affecting
      network encryption for HeatWave. Set to 1 to enable.
- Introduced a customized verifier function for verify_callback to
  be set in SSL_CTX_set_verify and used in the handshake process
  of SSL/TLS. The customized verifier function will perform
  instance id (OCID) based authorization on the plugin side during
  standard SSL/TLS handshake process.
- CRL (Certificate Revocation List) checks are also conducted if CRL
  Distribution Points are present and accessible in the provided
  certificate.

HWAML Driver Changes & OCID-based Authorization:
------------------------------------------------
- Introduced "enable_encryption" (default=0).
    > Set to 1 to enable encryption.
- When receiving a new connection request and encryption is on, the
  driver performs OCID-based self-checking, comparing OCID retrieved
  from its own instance principal with the OCID in the
  provided certificate on disk.
- The driver compares OCID from "mysql_compute_id" and extracted OCID
  from mTLS certificate during connection.
- Introduced "cert_dir" argument for certificate directory
  specification.
- Expected files: cert_chain.pem, certificate.pem, private_key.pem.
    > OCID should be in the userID (UID) or CN field of the
      certificate.pem subject.
- CRL (Certificate Revocation List) checks are also conducted post
  handshake, if CRL Distribution Points are present and accessible in
  the provided certificate, alongside OCID authorization.

Encryption Behavior:
--------------------
- If encryption is deactivated on both plugin and driver side, HWAML
  will work without encryption as it was before this commit.

Enabling Encryption:
--------------------
- By default, "have_ml_encryption" and "enable_encryption" are set to 0
    > Encryption is disabled by default.
- For the HWAML plugin:
    > "have_ml_encryption" set to 1 (default is 0).
    > Specify the .pfx file's path using the "keystore".
- For the HWAML Driver:
    > "enable_encryption" set to 1 (default is 0)
    > Specify "mysql_instance_id" and "cert_dir".

Testing:
--------
- MTR has been modified for the encryption setup.
    > Runs with encryption if "OCI_INSTANCE_ID" is set to a valid
      value.
- On OCI (when "OLRAPID_KEYSTORE" is not set):
    > Certificates and keys are generated; PEMs for driver and PKCS12
      for plugin.
- On AWS (when "OLRAPID_KEYSTORE" is set as the path to PKCS12
  keystore files):
    > PEM files are extracted from the provided PKCS12 and used for
      the driver. The plugin uses the provided PKCS12 keystore file.

Change-Id: I553ca135241e03484db6debbe186e6d34d582bf4

This is the commit message #2:

WL#15746 - Adding ML encryption support to BM

Enabling ML encryption on Baumeister:
- Certificates are generated on MySQLd during initialization
- Needed certicates for workers are packaged and sent to worker nodes
- Workers use packaged files to generate their certificates
- Arguments are added to driver.py invoke
- Keystore path is added to mysql config

Change-Id: I11a5cc5926488ff4fbf91bb6c10a091358db7dc9

This is the commit message #3:

WL#15746: Enhanced CRL Daemon Checker

Issue
=====
The previous design assumed a plain HTTPS link for the CRL distribution
point, accessible to all. This assumption no longer holds, as public
accessibility for CRL distribution contradicts OCI guidelines. Now, the
CRL distribution point in certificates provided by the control plane is
expected to be protected by OCI Instance Principal Authentication.
However, using this authentication method introduces a delay of several
seconds, which is impractical for HeatWave-AutoML.

Solution
========
The CRL fetching code now uses OCI Instance Principal Authentication.
To mitigate performance issues, the CRL checking process has been
redesigned. Instead of enforcing CRL checks per connection in MySQL
Plugin and HeatWave-AutoML Driver communications, a daemon thread in
HeatWave-AutoML Driver, Dask scheduler, and Dask Worker now periodically
fetches and verifies the CRL against all active connections. This
separation minimizes performance impacts. Consequently, MySQL Plugin's
CRL checks have been removed, as checks in the Driver, Scheduler, and
Worker sufficiently cover all cluster nodes.

Changes
=======
- Implemented CRL checker as a daemon thread in Driver, Scheduler, and
  Worker.
- Each connection/socket has an associated CRL checker.
- CRL checks occur periodically at set intervals.
- Skips CRL check if the CRL is temporarily unavailable.
- Failing a CRL check results in the associated connection/socket being
  closed. On the Driver, a stop event is triggered (akin to CTRL-C).

Change-Id: Id998cfe9e15d9236291b0ae420d65c2197837966

This is the commit message #4:

WL#15746: Fix Dask workers being shutdown without releasing address

Issue
=====
Dask workers getting shutting but not releasing the address used
properly sometimes.

Solution
========
Reverted some changes in heatwave_cluster.py in dask worker shutdown
function. Hopefully this will fix the address issue

Change-Id: I5a6749b5a25b0ccb73ba7369e545bc010da1b84f

This is the commit message #5:

WL#15746: Implement Dask Worker Join Timeout for Head Node

Issue:
======
In the cluster_shutdown method, the join operation on the head node's
worker process lacked a timeout. This led to potential indefinite
waiting and eventual hanging of the head node.

Solution:
=========
A timeout has been introduced for the worker process join on the head
node. Unlike non-head nodes, which rely on worker join to complete Dask
tasks and cannot have a timeout, the head node can safely implement
this. Now, if the worker process on the head node fails to shut down
post-join, indicating a timeout, it will be manually terminated. This
ensures proper release of associated resources and prevents hanging of
the head node.

Additional Change:
==================
Added Cert Rotation Guard for DASK clusters. This feature initiates on
the first plugin-driver connection when the DASK cluster is off,
recording the certificate's expiry date. During driver idle times,
it checks the current cert's expiry against this date. If it detects a
change, indicating a certificate rotation, it shuts down the DASK
cluster. The cluster restarts on the next connection request, ensuring
the use of the latest certificate.

Change-Id: Ie63a2e2b7664e05e1622d8bd6503663e13fa73cb
  • Loading branch information
Jingxiao Cai committed May 22, 2024
1 parent 9a7091b commit 95a49ee
Showing 1 changed file with 27 additions and 0 deletions.
27 changes: 27 additions & 0 deletions mysql-test/mysql-test-run.pl
Original file line number Diff line number Diff line change
Expand Up @@ -701,6 +701,17 @@ sub main {
exit(0);
}

if ($secondary_engine_support) {
# Enable mTLS in Heatwave-AutoML by setting up certificates and keys.
# - Paths are pre-set in the environment for accurate server initialization.
# - Certificates and keys are generated after the Python virtual environment
# setup.
my $oci_instance_id = $ENV{'OCI_INSTANCE_ID'} || "";
if ($oci_instance_id) {
$ENV{'ML_CERTIFICATES'} = "$::opt_vardir/" . "hwaml_cert_files/";
}
}

initialize_servers();

# Run unit tests in parallel with the same number of workers as
Expand Down Expand Up @@ -758,6 +769,22 @@ sub main {
secondary_engine_offload_count_report_init();
# Create virtual environment
find_ml_driver($bindir);

# Generate mTLS certificates & keys for Heatwave-AutoML post Python
# virtual environment setup and path pre-setting (referenced earlier).
my $oci_instance_id = $ENV{'OCI_INSTANCE_ID'} || "";
my $aws_key_store = $ENV{'OLRAPID_KEYSTORE'} || "";
if ($oci_instance_id) {
if($aws_key_store)
{
extract_certs_from_pfx($aws_key_store, $ENV{'ML_CERTIFICATES'});
}
else
{
create_cert_keys($ENV{'ML_CERTIFICATES'});
}
}

reserve_secondary_ports();
}

Expand Down

0 comments on commit 95a49ee

Please sign in to comment.