
Added spark streaming custom receiver for pulsar #296

Merged: 2 commits merged on Apr 4, 2017

Conversation

yush1ga (Contributor) commented Mar 17, 2017

Motivation

In recent years, more and more people have become interested in Apache Spark for machine learning and similar workloads, and there is demand for streaming data from a message queue into Apache Spark.

Modifications

Added a Spark Streaming custom receiver for Pulsar, along with a test.

Result

Pulsar can be used as a data source for Apache Spark.
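As a quick illustration of the result, a usage sketch for wiring the receiver into a Spark Streaming job (the constructor signature, package names, and the service URL/topic/subscription values here are assumptions for illustration, not taken from this PR):

```java
// Hypothetical usage sketch -- constructor argument order and package names are assumed.
SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName("pulsar-spark-demo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(5));

ClientConfiguration clientConf = new ClientConfiguration();
ConsumerConfiguration consumerConf = new ConsumerConfiguration();

JavaReceiverInputDStream<byte[]> messages = jssc.receiverStream(
        new SparkStreamingPulsarReceiver(
                clientConf, consumerConf,
                "pulsar://localhost:6650",                   // service URL (assumed)
                "persistent://sample/standalone/ns1/topic1", // topic (assumed)
                "spark-subscription"));                      // subscription name (assumed)

// Count received messages per micro-batch.
messages.foreachRDD(rdd -> System.out.println("Received " + rdd.count() + " messages"));

jssc.start();
jssc.awaitTermination();
```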

pulsarClient = PulsarClient.create(url, clientConfiguration);
consumerConfiguration.setMessageListener(new MessageListener() {
    public void received(Consumer consumer, Message msg) {
        store(new String(msg.getData()));
Contributor:

Can we avoid the copy and pass the buffer instead?

Does store() throw any exception? If so, we should also handle it, right?

Contributor Author:

I think if we change from extends Receiver<String> to extends Receiver<byte[]>, we will be able to avoid the copy.
Does that look good?
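A minimal sketch of what the byte[]-based listener might look like, under the assumption that the class declaration changes to extends Receiver<byte[]> as suggested:

```java
// Sketch: with Receiver<byte[]>, the listener can store the raw payload directly,
// avoiding the new String(msg.getData()) decode-and-copy.
consumerConfiguration.setMessageListener((consumer, msg) -> {
    store(msg.getData());
});
```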

}

private static final Logger log = LoggerFactory.getLogger(SparkStreamingPulsarReceiver.class);
}
Contributor:

We should also update the documentation.

Contributor Author:

OK. I'll do that.

});
pulsarClient.subscribe(topic, subscription, consumerConfiguration);
} catch (PulsarClientException e) {
restart(e.getMessage(), e);
Contributor:

Is the exception logged by restart()? If not, it's better to log it here.

Contributor:

There are situations where client failures persist for a reasonably long time, so you might want to think about resource consumption in that case, given that restart() will recurse into onStart().

  1. You create a new PulsarClient on each call, turning the previous one into garbage.
  2. You create a new MessageListener implementation on each call, with similar garbage.

I'd set those up in the constructor while leaving the call to subscribe() to onStart().

Contributor Author:

@saandrews
I added logging.

@msb-at-yahoo
I reimplemented it that way.


public void onStop() {
try {
if (pulsarClient != null) {
Contributor:

Is this transition valid: onStart -> onStop -> onStart?

Contributor Author:

onStart and onStop are called only once: onStart -> onStop.

Contributor:

OK. The reason I asked is that it's now a bit asymmetric after the change I suggested to onStart(): onStop() invalidates the invariants that onStart() relies upon.

yush1ga (Contributor Author) commented Mar 22, 2017:

Sorry, I confirmed that when restart() is called, onStop() -> onStart() is invoked.
Since we create the client in the constructor, the receiver cannot start subscribing again when restarting.
I think we should create the client in onStart().
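A sketch of the lifecycle this implies, assembled from the snippets in this thread (the close() handling in onStop() is an assumption):

```java
public void onStart() {
    try {
        // Create the client here rather than in the constructor, so that
        // restart() -> onStop() -> onStart() can rebuild everything from scratch.
        pulsarClient = PulsarClient.create(url, clientConfiguration);
        pulsarClient.subscribe(topic, subscription, consumerConfiguration);
    } catch (PulsarClientException e) {
        log.error("Failed to start subscription : {}", e.getMessage());
        restart("Restart a consumer");
    }
}

public void onStop() {
    try {
        // Tear down the client so a subsequent onStart() starts clean.
        if (pulsarClient != null) {
            pulsarClient.close();
            pulsarClient = null;
        }
    } catch (PulsarClientException e) {
        log.error("Failed to close client : {}", e.getMessage());
    }
}
```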

yush1ga (Contributor Author) commented Mar 21, 2017:

I added documentation.

private String topic;
private String subscription;

public SparkStreamingPulsarReceiver(
Contributor:

Please format the files with the Eclipse formatter profile that can be found at src/formatter.xml

Contributor Author:

I applied the format.

this.topic = topic;
this.subscription = subscription;
consumerConfiguration.setMessageListener((consumer, msg) -> {
    store(new String(msg.getData()));
Contributor:

What happens if store() fails?

If you want to retry the operation, you should then block the listener thread until the store() succeeds.

Contributor Author:

I couldn't find any case where store() fails in the official documentation or elsewhere.
I think Spark Streaming is mainly used for computing statistics and similar workloads, where some data loss is acceptable.
What do you think?

Contributor:

I think there are two options in case the store() call fails with an exception:

  1. Don't ack the message and rely on the ack timeout to replay the message some time later (e.g. 1 min).
  2. Just ack the message anyway and move on, basically ignoring the exception.

I prefer option 1 by default.

Contributor Author:

@merlimat

I chose option 1.

  • An ack timeout is set if one is not already configured
  • Added a try block around store()
  • When store() fails, the consumer doesn't ack
  • The message will be redelivered based on the ack timeout in consumerConfiguration
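Put together, the listener for option 1 might look like the following sketch (the exact ConsumerConfiguration ack-timeout accessors and the default timeout value are assumptions):

```java
// Ensure an ack timeout is configured so unacked messages are redelivered.
// getAckTimeoutMillis()/setAckTimeout(...) names are assumed, not from this PR.
if (consumerConfiguration.getAckTimeoutMillis() == 0) {
    consumerConfiguration.setAckTimeout(60, TimeUnit.SECONDS);
}

consumerConfiguration.setMessageListener((consumer, msg) -> {
    try {
        store(msg.getData());
        // Ack only after store() succeeds (option 1 above).
        consumer.acknowledgeAsync(msg);
    } catch (Exception e) {
        // No ack: the broker redelivers the message after the ack timeout.
        log.error("Failed to store a message : {}", e.getMessage());
    }
});
```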

pulsarClient.subscribe(topic, subscription, consumerConfiguration);
} catch (PulsarClientException e) {
log.error("Failed to start subscription : {}", e.getMessage());
restart("Restart a consumer");
Contributor:

Wouldn't it be possible to just propagate the exception at this point?

Contributor Author:

PulsarClientException is a checked exception, so it has to be caught here.
If we tried to propagate it as follows, onStart() would no longer override the superclass method:

public void onStart() throws PulsarClientException { // signature no longer overrides Receiver#onStart()
    ...
    catch (PulsarClientException e) {
        throw new PulsarClientException("error!");
    }
}

this.topic = topic;
this.subscription = subscription;
consumerConfiguration.setMessageListener((consumer, msg) -> {
    store(new String(msg.getData()));
Contributor:

Besides the concerns from @saandrews regarding the data copy, won't using String prevent Spark users from consuming binary messages?

Contributor Author:

Maybe that's right.
The custom receiver may have to extend Receiver<byte[]> so users can consume binary messages.
I will implement it that way.

@merlimat merlimat added this to the 1.17 milestone Mar 21, 2017
@merlimat merlimat added the type/feature The PR added a new feature or issue requested a new feature label Mar 21, 2017
@yush1ga yush1ga force-pushed the pulsar-spark branch 6 times, most recently from 5b6879c to 547fb8a Compare March 28, 2017 05:01
merlimat (Contributor) left a comment:

👍

@merlimat merlimat modified the milestones: 1.17, 1.18 Mar 31, 2017
@merlimat merlimat merged commit b1df955 into apache:master Apr 4, 2017
@yush1ga yush1ga deleted the pulsar-spark branch April 18, 2017 04:13
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
Fix apache#290 
This PR supports continuous offsets for KoP.
Since this PR disables the original design of mapping between Pulsar `MessageId` and Kafka `offset`, I just ignored some unit tests based on the original design.
Maybe we can raise another issue to track how we deal with these ignored tests.
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
Fixes apache#312

These tests were ignored temporarily just because they rely on the outdated methods that convert between MessageId and Kafka offset. So this PR fixes these tests and deletes these outdated methods.

The exception is testBrokerRespectsPartitionsOrderAndSizeLimits, a broken test that is easily affected by some parameters. apache#287 and apache#246 are tracking it.