Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dagster-aws] add Pipes cloudwatch message reader #23353

Merged
merged 2 commits into from
Aug 8, 2024

Conversation

danielgafni
Copy link
Contributor

@danielgafni danielgafni commented Jul 31, 2024

Summary & Motivation

resolve #23056

Adds PipesCloudWatchMessageReader which can be used with different AWS services.

It's a solid default MessageReader since various AWS services already emit CloudWatch logs by default.

Right now it's reading the full CloudWatch log stream provided by the user and routing it to Dagster's stdout. That's the minimal functionality needed by Glue Pipes.

In the future, we can think about changing this behavior, mainly:

  • optionally, filtering non-dagster logs in the CloudWatch query
  • optionally, route logs to stderr

It's work mentioning that Glue routes both stdout and stderr to /output log group. Thus, we only need to read a single stream (with the job run id) from this group in order to receive all necessary information.

How I Tested These Changes

  • [x]: add moto tests with emulating CloudWatch logs

@danielgafni danielgafni force-pushed the pipes-cloudwatch-message-reader branch from 747dce0 to 18aafe3 Compare July 31, 2024 22:06
@graphite-app graphite-app bot added the area: docs Related to documentation in general label Jul 31, 2024
Copy link
Contributor Author

danielgafni commented Jul 31, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @danielgafni and the rest of your teammates on Graphite Graphite

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great overall. Needs more tests and a more flesh-out test plan.

Can you also provide a more thorough "summary and motivation" section about why this and the tradeoffs involved. If memory serves there are a real cost tradeoffs here and it would be good to understand that and document them.

@danielgafni
Copy link
Contributor Author

Side note: I discovered that moto + S3 tests fail if no credentials are configured. This is surprising a for a mocking software. Took me a while to figure this out on a new clean VM...

Found this issue here getmoto/moto#1941) describing similar situations.

Apparently, it's necessary to have at least some credentials configured.

Should we provide testing credentails for all our pytest tests automatically?

@danielgafni danielgafni force-pushed the pipes-cloudwatch-message-reader branch from 2e31171 to d31a46d Compare August 5, 2024 19:26
Copy link

github-actions bot commented Aug 5, 2024

Deploy preview for dagit-core-storybook ready!

✅ Preview
https://dagit-core-storybook-8ld7qz9sy-elementl.vercel.app
https://pipes-cloudwatch-message-reader.core-storybook.dagster-docs.io

Built with commit 2e31171.
This pull request is being automatically deployed with vercel-action

@danielgafni
Copy link
Contributor Author

@schrockn hey, please take a look at moto tests.

I had to manually create CloudWatch logs (since moto doesn't actually run the Glue job and thus doesnt create any logs) in the fake Glue client.

I'm planning to move examples/docs changes into a separate PR tomorrow.

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some minor inline comments.

My big question here is what will appear in Dagster UI? We are now getting all the cloudwatch logs. Are they all going to show up in stdout? Where does stderr emitted from a glue job end up?

Paint me a picture of the end-to-end logging experience in Dagster in this case.

@danielgafni
Copy link
Contributor Author

@schrockn As I suspected, AWS Glue routes both stdout and stderr to /output CloudWatch log group. The /error sterams in CloudWatch only contain Spark logs.

The following code:

with open_dagster_pipes(
        params_loader=PipesCliArgsParamsLoader(),
        context_loader=PipesS3ContextLoader(client=boto3.client("s3"))
    ) as pipes:
    pipes.log.info("Hello from external process!")
    pipes.report_asset_materialization(
            metadata={
                "some_metric": {"raw_value": 0, "type": "int"}
            },
            data_version="alpha",
        )
    
    print("hello from stdout")
    print("hello from stderr", file=sys.stderr)

produces this:

image

image

The line "hello from stderr" which is explicitly printed to stderr still ends up in the /output LogStream.

I don' think we can (or should) do anything about it. This is just how Glue is.

Copy link
Member

schrockn commented Aug 6, 2024

@schrockn As I suspected, AWS Glue routes both stdout and stderr to /output CloudWatch log group. The /error sterams in CloudWatch only contain Spark logs.

The following code:

with open_dagster_pipes(
        params_loader=PipesCliArgsParamsLoader(),
        context_loader=PipesS3ContextLoader(client=boto3.client("s3"))
    ) as pipes:
    pipes.log.info("Hello from external process!")
    pipes.report_asset_materialization(
            metadata={
                "some_metric": {"raw_value": 0, "type": "int"}
            },
            data_version="alpha",
        )
    
    print("hello from stdout")
    print("hello from stderr", file=sys.stderr)

produces this:

image

image

The line "hello from stderr" which is explicitly printed to stderr still ends up in the /output LogStream.

I don' think we can (or should) do anything about it. This is just how Glue is.

Yup 100% agree. Can you document this behavior in both the PR summary and as a docblock?

Copy link
Member

schrockn commented Aug 6, 2024

In terms of eliminating the glue-specific wrapper reader/writer classes, can you do a seperate PR stacked on this with colton included so we can discuss it separately?

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request for docs

@danielgafni danielgafni force-pushed the pipes-cloudwatch-message-reader branch from 89ec8ce to 3cf6d49 Compare August 7, 2024 15:39
@danielgafni
Copy link
Contributor Author

I believe this PR should be ready

@cmpadden cmpadden self-requested a review August 7, 2024 16:19
@danielgafni danielgafni force-pushed the pipes-cloudwatch-message-reader branch 2 times, most recently from af7fce7 to afe07d1 Compare August 8, 2024 09:01
@danielgafni danielgafni force-pushed the pipes-cloudwatch-message-reader branch from afe07d1 to b9a5c95 Compare August 8, 2024 14:04
danielgafni added a commit that referenced this pull request Aug 8, 2024
## Summary & Motivation

Deleting dummy Pipes classes from AWS Pipes. These classes didn't
provide any functionality and their introduction was questionable from
the start.

Context:
-
#23353 (comment)
- #22968

I decided to keep `PipesGlueLambdaEventContextInjector`, because unlike
the the other classes, it's actually used for a unique purpose:
injecting variables into Lambda event input (it might be a bit confusing
because it inherits from `PipesEnvContextInjector` but doesn't actually
do anything with environment variables).

@schrockn let me know if you think if it makes sense 

## How I Tested These Changes

Nothing was really using these classes
## Summary & Motivation

Deleting dummy Pipes classes from AWS Pipes. These classes didn't
provide any functionality and their introduction was questionable from
the start.

Context:
-
#23353 (comment)
- #22968

I decided to keep `PipesGlueLambdaEventContextInjector`, because unlike
the the other classes, it's actually used for a unique purpose:
injecting variables into Lambda event input (it might be a bit confusing
because it inherits from `PipesEnvContextInjector` but doesn't actually
do anything with environment variables).

@schrockn let me know if you think if it makes sense 

## How I Tested These Changes

Nothing was really using these classes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: docs Related to documentation in general
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement PipesCloudWatchMessageReader
2 participants