Add PcapDataset #303
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here. The page above also covers what to do if you already signed the CLA (individual and corporate signers). ℹ️ Googlers: Go here for more info.
@ivelin Thanks for the PR! Overall it looks impressive. Only a couple of changes are needed to support DatasetV2.
The lint is processed by `.travis/lint.python.sh`, and the fixes needed are just tab/space changes.
Also, could you sign the CLA, as noted by the Google bot?
```cpp
for (size_t i = 0; i < out_tensors->size(); i++) {
  Tensor tensor = (*out_tensors)[i].Slice(0, *record_read);
  (*out_tensors)[i] = std::move(tensor);
}
```
Slicing the over-reserved memory in the Tensor is handled automatically by the parent class:

io/tensorflow_io/core/kernels/dataset_ops.h, lines 283 to 286 in 7e5fcbc:

```cpp
for (size_t i = 0; i < chunk_tensors.size(); i++) {
  // In case record_read is smaller than record_to_read, cut a slice
  Tensor tensor = chunk_tensors[i].Slice(0, record_read);
  out_tensors->emplace_back(std::move(tensor));
}
```

Since `PcapInput` is a subclass of `FileInput`, I think lines 187-190 could be deleted.
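The reserve-then-slice pattern the parent class applies can be sketched in plain Python (an illustrative analogue only; `read_into_buffer` is a hypothetical helper, and Python lists stand in for Tensors):

```python
def read_into_buffer(source, records_to_read):
    """Over-reserve a buffer, fill what we can, then slice off the unused tail.

    This mirrors Tensor.Slice(0, record_read) in dataset_ops.h: the buffer is
    allocated for records_to_read entries, but the source may run out earlier.
    """
    buffer = [None] * records_to_read  # over-reserved, like the pre-allocated Tensor
    record_read = 0
    for record in source:
        buffer[record_read] = record
        record_read += 1
        if record_read == records_to_read:
            break
    # In case record_read is smaller than records_to_read, cut a slice.
    return buffer[:record_read]
```

For example, a source with only 2 records and `records_to_read=4` yields a 2-element result rather than a buffer padded with `None`.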
```cpp
int64 record_count = 0;
double packet_timestamp;
string packet_data_buffer;
TF_RETURN_IF_ERROR(state.get()->ReadRecord(packet_timestamp, &packet_data_buffer, record_count));
```
The returned status might be OutOfRange, which needs to be converted to `Status::OK()` with `(*record_read) == 0` so that the parent class can properly identify the end of the sequence. For that, I think the following will work:

```cpp
Status status = state.get()->ReadRecord(packet_timestamp, &packet_data_buffer, record_count);
if (!(status.ok() || errors::IsOutOfRange(status))) {
  return status;
}
```
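The status conversion described above can be sketched in Python, where `EOFError` plays the role of an OutOfRange status (a hedged analogue of the pattern, not tensorflow-io code; the `reader` callable is hypothetical):

```python
def read_record_checked(reader):
    """Call reader(); treat end-of-input as success with zero records read.

    Mirrors converting an OutOfRange Status into Status::OK() with
    (*record_read) == 0 so the caller can detect the end of the sequence.
    """
    try:
        data = reader()          # hypothetical reader callable
        return 1, data           # one record read, "OK" status
    except EOFError:             # analogous to errors::IsOutOfRange(status)
        return 0, None           # still "OK", but zero records: end of sequence
```

Any other exception would propagate, matching the C++ snippet where a status that is neither OK nor OutOfRange is returned to the caller.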
```python
@property
def output_types(self):
  return tuple([dtypes.float64, dtypes.string])
```
We are moving toward DatasetV2. See PR #268 for some background on DatasetV2.
DatasetV2 is actually quite a bit simpler than DatasetV1 (which is invoked when `tensorflow.compat.v1` is used). Below is a code snippet that matches DatasetV2. You can delete everything else in this file and it should work:

```python
import tensorflow as tf
from tensorflow_io.core.python.ops import data_ops as data_ops
from tensorflow_io import _load_library
pcap_ops = _load_library('_pcap_ops.so')

class PcapDataset(data_ops.Dataset):
  """A pcap Dataset. Pcap is a popular file format for capturing network packets."""

  def __init__(self, filenames, batch=None):
    """Create a pcap Reader.

    Args:
      filenames: A `tf.string` tensor containing one or more filenames.
    """
    batch = 0 if batch is None else batch
    dtypes = [tf.float64, tf.string]
    shapes = [
        tf.TensorShape([]), tf.TensorShape([])] if batch == 0 else [
            tf.TensorShape([None]), tf.TensorShape([None])]
    super(PcapDataset, self).__init__(
        pcap_ops.pcap_dataset,
        pcap_ops.pcap_input(filenames),
        batch, dtypes, shapes)
```
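The batch-dependent shape selection in the constructor can be illustrated standalone (plain nested lists stand in for `tf.TensorShape`; `output_shapes_for_batch` is a hypothetical helper for illustration, not part of the snippet above):

```python
def output_shapes_for_batch(batch=None):
    """Mirror the constructor's shape logic.

    batch == 0 (or None) means scalar per-record outputs; any other value
    means the leading dimension is the unknown batch size, written as None.
    """
    batch = 0 if batch is None else batch
    if batch == 0:
        return [[], []]          # scalar timestamp, scalar packet data
    return [[None], [None]]      # batched: unknown leading dimension
```

So `output_shapes_for_batch()` gives `[[], []]` while `output_shapes_for_batch(32)` gives `[[None], [None]]`, matching the two branches of the conditional in `__init__`.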
tests/test_pcap.py (Outdated)
```python
if __name__ == "__main__":
  test.main()
```
Also as part of the effort to move to DatasetV2, we want the test to focus on the TF 2.0 environment. In TF 2.0, the biggest difference is that eager execution is enabled by default, so there is no need for `tf.Session` anymore.
Since we still want to support TF 1.x, we test TF 1.x in eager execution (non-default). The following brings in eager execution in TF 1.x, and is compatible with TF 2.0 in any case:

```python
if not (hasattr(tf, "version") and tf.version.VERSION.startswith("2.")):
  tf.compat.v1.enable_eager_execution()
```

With eager execution enabled, the iteration of the dataset can be simplified to:

```python
for v in dataset:
  print(v)
```

and the OutOfRange exception is handled automatically.
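The simplification can be seen with an ordinary Python iterable, where `StopIteration` plays the role that `OutOfRangeError` plays in graph mode (illustrative only; `records` is made-up data, not from the test pcap file):

```python
records = [(1084443427.31, b"pkt0"), (1084443427.32, b"pkt1")]

# Graph-mode style: pull items explicitly and catch the end-of-sequence error.
it = iter(records)
collected = []
while True:
    try:
        collected.append(next(it))
    except StopIteration:  # analogous to tf.errors.OutOfRangeError
        break

# Eager style: the for-loop handles the end of the sequence automatically.
collected_eager = []
for v in records:
    collected_eager.append(v)

assert collected == collected_eager
```

Both loops visit the same records; the eager-style loop just lets the iteration protocol absorb the end-of-sequence signal.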
Overall, here is the test case:

```python
"""
Test PcapDataset
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import tensorflow as tf
if not (hasattr(tf, "version") and tf.version.VERSION.startswith("2.")):
  tf.compat.v1.enable_eager_execution()
from tensorflow import test  # needed for test.main() below
import tensorflow_io.pcap as pcap_io  # pylint: disable=wrong-import-position

def test_pcap_input():
  """test_pcap_input"""
  print("Testing PcapDataset")
  pcap_filename = os.path.join(
      os.path.dirname(os.path.abspath(__file__)), "test_pcap", "http.pcap")
  file_url = "file://" + pcap_filename
  url_filenames = [file_url]
  dataset = pcap_io.PcapDataset(url_filenames, batch=1)

  packets_total = 0
  for v in dataset:
    (packet_timestamp, packet_data) = v
    if packets_total == 0:
      assert packet_timestamp.numpy()[0] == 1084443427.311224  # we know this is the correct value in the test pcap file
      assert len(packet_data.numpy()[0]) == 62  # we know this is the correct packet data buffer length in the test pcap file
    packets_total += 1
  assert packets_total == 43  # we know this is the correct number of packets in the test pcap file

if __name__ == "__main__":
  test.main()
```
Additional notes:

1. We use `.travis/lint.python.sh` to lint python code:

   ```shell
   curl -o .pylint -sSL https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/ci_build/pylintrc
   find . -name \*.py | xargs pylint --rcfile=.pylint
   ```

2. TF's eager execution does not play well with non-eager mode within the same process. For that reason, our CI will invoke the tests twice: once for files ending with `test_xxx_eager.py` and once for files not ending with `_eager.py`. With pcap supporting DatasetV2, the file name could be changed to `tests/test_pcap_eager.py`.
CLAs look good, thanks! ℹ️ Googlers: Go here for more info.
python lint
Changes applied and committed. In my local environment there seems to be a v1 vs v2 python lib mixup. I am trying to trace the root of this new error in the test case:
Updated tf to 2.0.beta and rebuilt tf io from source. Didn't help.
@ivelin I think your version of tensorflow is likely the issue: DatasetV2 requires tensorflow 1.14.0+ or 2.0.0+. You can install it with `pip install`, then reconfigure to allow bazel to pick up the changes in configuration.
Also, you may want to change the test file name. The reason is that TF's eager execution does not play well with non-eager mode within the same process; our CI therefore invokes the tests twice, once for files ending with `test_xxx_eager.py` and once for files not ending with `_eager.py`. Changing the name to `tests/test_pcap_eager.py` should help.
The PR looks good to me and all tests passed. Thanks for the contribution! 👍 🎉
* Initial push for pcap support in tensorflow io. Still untested.
* updates to finalize the initial version of pcap dataset
* fixed syntax errors
* added debug info
* added more debug info. Still working on crash in pcap_input.cc
* working version with debug data
* working version for pcap dataset
* updated git email address
* email address
* email
* applied Yong's recommended changes for TF v2 python lint
* renamed test_pcap.py to test_pcap_eager.py lint pcap/pcap_ops.py
@yongtang this is a pull request for a PcapDataset.
C++ code builds and python unit test passes.
Hopefully the code is in compliance with best practices for TF IO. Looking forward to your comments.