Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flakiness in fixed batch size integration test #387

Merged
merged 6 commits into from
Feb 7, 2018

Conversation

Corey-Zumar
Copy link
Contributor

Fixes #386

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/927/
Test FAILed.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/928/
Test FAILed.

Copy link
Contributor

@dcrankshaw dcrankshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/936/
Test FAILed.

@dcrankshaw
Copy link
Contributor

Hmm @Corey-Zumar looks like this maybe doesn't fix the flakiness https://amplab.cs.berkeley.edu/jenkins/job/Clipper-PRB/936/console

@Corey-Zumar
Copy link
Contributor Author

@dcrankshaw Based on testing with AWS, it appears that the container crashes when queried. No stack trace is produced. The issue appears to be associated with the serialization and deserialization of functions that call time.sleep or Timer.Start().

For now, in place of time.sleep(), it seems reasonable to perform a higher-latency computation before returning the batch sizes as outputs. Do you agree?

@dcrankshaw
Copy link
Contributor

Hmm weird. You could just hardcode a new testing docker container that has a sleep in it (similar to the noop container). That seems like a potentially cleaner and more controlled solution.

@Corey-Zumar
Copy link
Contributor Author

Good call. Will do that.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/956/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/957/
Test FAILed.

@Corey-Zumar
Copy link
Contributor Author

Corey-Zumar commented Feb 6, 2018

@dcrankshaw The container-based approach doesn't fare any better. I'm still encountering the same traceless crash. Will need to debug.


COPY containers/python/python_closure_container.py containers/python/python_closure_container_entry.sh /container/


CMD ["/container/python_closure_container_entry.sh"]
ENTRYPOINT ["/container/python_closure_container_entry.sh"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this from CMD to ENTRYPOINT fixes the crash associated with the python call to time.sleep. I'll have to explore why this is the case. CMD executes the python script within /bin/sh -c directive, while ENTRYPOINT executes the python script in some other fashion; I'd imagine that this accounts for the difference.

@@ -3,12 +3,11 @@ FROM clipper/py-rpc:${CODE_VERSION}

COPY clipper_admin/clipper_admin/python_container_conda_deps.txt /lib/
RUN conda config --set ssl_verify no \
&& conda install -y -c anaconda cloudpickle=0.5.2 \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installing this first fixes an installation error.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/960/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/961/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/962/
Test PASSed.

@dcrankshaw dcrankshaw merged commit ec777a2 into ucbrise:develop Feb 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants