-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PR to merge feature/flow-aggregator with master #1671
PR to merge feature/flow-aggregator with master #1671
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1671 +/- ##
==========================================
- Coverage 63.31% 61.45% -1.87%
==========================================
Files 170 184 +14
Lines 14250 15781 +1531
==========================================
+ Hits 9023 9698 +675
- Misses 4292 5050 +758
- Partials 935 1033 +98
Flags with carried forward coverage won't be shown. Click here to find out more.
|
f6e19dc
to
0ae5164
Compare
0ae5164
to
9daeca9
Compare
/test-all |
c95ec9c
to
381fda3
Compare
381fda3
to
ea1e55c
Compare
test/e2e/fixtures.go
Outdated
if net.ParseIP(ipStr).To4() == nil { | ||
ipStr = fmt.Sprintf("[%s]", ipStr) | ||
isIPv6 = true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confused by this change, first because of the comment at the beginning of the function:
// TODO: remove hardcoding to IPv4 after flow aggregator supports IPv6
isIPv6 := false
but also because of ipStr := ipfixCollectorIP.ipv4.String()
, which would imply that ipStr
is an IPv4 address
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes did not realize the comment is there. Yes there is a bug. I tried to carry over the existing v6 code on master, but missed a line. Will fix.
test/e2e/flowaggregator_test.go
Outdated
// InterNodeFlows tests the case, where Pods are deployed on different Nodes | ||
// and their flow information is exported as IPFIX flow records. | ||
t.Run("InterNodeFlows", func(t *testing.T) { | ||
// In worker nodes, Flow Exporter on Antrea Agent is not able to resolve | ||
// the DNS name of Flow Aggregator Service. It is flaky depending on the | ||
// underlay connectivity in the Kind cluster. Please note that CoreDNS Pods | ||
// are deployed on master node because of connectivity issues between kube-apiserver | ||
// and worker nodes. | ||
skipIfProviderIs(t, "kind", "DNS resolution of Flow Aggregator Service is flaky on worker nodes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that you are only skipping inter-node tests (InterNodeFlows
and RemoteServiceAccess
). Given that I am more inclined to think that this is another instance of issue #897?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happened with AntreaPolicy enabled case too whose tunnel is vxlan, so not sure if that is the root cause.
test/e2e/flowaggregator_test.go
Outdated
} | ||
return strings.Contains(collectorOutput, srcIP) && strings.Contains(collectorOutput, dstIP), nil | ||
}) | ||
assert.NoError(t, err, "IPFIX collector did not receive expected number of data records and timed out.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be require.NoError
? If we don't receive the data records, should be bother with the validation after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense.
@@ -6,7 +6,7 @@ function usage() { | |||
--prometheus Deploy Prometheus service to scrape metrics from Antrea Agents and Controllers | |||
--flow-collector Provide the IPFIX flow collector address to collect the flows from the Flow Aggregator service | |||
It should be given in the format IP:port:proto. Example: 192.168.1.100:4739:udp | |||
Please note that with this option we deploy the Flow Aggregator service along with the Antrea daemonset." | |||
Please note that with this option we deploy the Flow Aggregator service along with the Antrea." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that with this option we deploy the Flow Aggregator service along with the Antrea." | |
Please note that with this option we deploy the Flow Aggregator Service along with Antrea." |
test/e2e/framework.go
Outdated
if err = data.moveDeploymentToSpecificNode(flowAggregatorDeployment, flowAggregatorNamespace, masterNodeName()); err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right way to go:
- if Netpol test suite randomly fails when run in Kind and with tunnel type set to Geneve #897 is the root cause of the issue, then we should just skip the affected tests until this issue is resolved, while making sure that all the tests are run in Jenkins
- if there is an extra DNS issue, on top of that, then we can deploy the Aggregator first and access the service through the ClusterIP in Kind tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the second point is a good suggestion. Let me quickly try that out instead of skipping.
ea1e55c
to
972b873
Compare
972b873
to
3ab073a
Compare
/test-all |
3ab073a
to
92b6976
Compare
083d3c1
to
2079720
Compare
2079720
to
fc7ef0c
Compare
/test-all |
0fc44eb
to
73caee1
Compare
/test-e2e |
.github/workflows/build.yml
Outdated
- name: Build flow-aggregator Docker image | ||
run: make flow-aggregator-ubuntu | ||
- name: Push flow-aggregator Docker image to registry | ||
# Will remove the feature/flow-aggregator branch later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stale comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
# TODO: Create path to two image artifacts when uploading artifacts, so multiple | ||
# artifacts can be downloaded at once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand this comment: do you mean bundle both images in a single artifact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at this: https://github.com/actions/download-artifact#download-all-artifacts
They still seem to be different artifacts.
ci/jenkins/test-vmc.sh
Outdated
VERSION="$CLUSTER" DOCKER_REGISTRY="${DOCKER_REGISTRY}" make flow-aggregator-ubuntu && break | ||
done | ||
cd ci/jenkins | ||
|
||
if [ "$?" -ne "0" ]; then | ||
echo "=== Antrea Image build failed ===" | ||
echo "=== Antrea Image and Flow Aggregator Image build failed ===" | ||
exit 1 | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this code so I think we need @lzhecheng to review. Specifically I am unsure why we try to build the images twice. Also it seems that the [ "$?" -ne "0" ]
test would apply to the cd
command (last command run), not the image build.
Finally, maybe the error message should be "=== Antrea Image or Flow Aggregator Image build failed ==="
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we just compile once as a workaround? I added some code for that. Earlier code did not work. Will see if new code succeeds.
ci/kind/test-e2e-kind.sh
Outdated
for image in "${COMMON_IMAGES_LIST[@]}"; do | ||
docker pull $image | ||
done | ||
if $coverage; then | ||
manifest_args="$manifest_args --coverage" | ||
COMMON_IMAGES_LIST+=("antrea/antrea-ubuntu-coverage:latest") | ||
COMMON_IMAGES_LIST+=("antrea/antrea-ubuntu-coverage:latest" "antrea/flow-aggregator:latest") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? we don't have a different image for coverage (for flow-aggregator)
"projects.registry.vmware.com/antrea/flow-aggregator:latest" should be added unconditionally to the COMMON_IMAGES_LIST
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of adding coverage, but not super familar with all components involving in it so gone for a simpler route of using the direct image considering time. Should be taken up as TODO.. hope that is ok.
In generate_manifest_flow_aggregator.sh, for kind environments I always use antrea/flow-aggregator:latest. Even with this change should "projects.registry.vmware.com/antrea/flow-aggregator:latest" be added always? I have done that but just curious on why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you misunderstood me - there is no need for antrea/flow-aggregator:latest
here, this image is not used by any test
so projects.registry.vmware.com/antrea/flow-aggregator:latest
should always be added, antrea/flow-aggregator:latest
should never be added. All yaml manifests and tools should reference projects.registry.vmware.com/antrea/flow-aggregator:latest
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only time antrea/flow-aggregator:latest
is used is when pushing to dockerhub
isIPv6 := false | ||
if err := data.setupLogDirectoryForTest(tb.Name()); err != nil { | ||
tb.Errorf("Error creating logs directory '%s': %v", data.logsDirForTestCase, err) | ||
if err := testData.setupLogDirectoryForTest(tb.Name()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idea for later: this should just call setupTest
instead of duplicating the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes added in above TODO
test/e2e/framework.go
Outdated
} | ||
|
||
func (data *TestData) mutateFlowAggregatorConfigMap(ipfixCollector string) error { | ||
// get flow aggregator config map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a very useful comment, that's the name of the function
test/e2e/fixtures.go
Outdated
tb.Logf("Applying flow aggregator YAML with ipfix collector address: %s:%s:tcp", ipStr, ipfixCollectorPort) | ||
faClusterIP, err := testData.deployFlowAggregator(fmt.Sprintf("%s:%s:tcp", ipStr, ipfixCollectorPort)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tb.Logf("Applying flow aggregator YAML with ipfix collector address: %s:%s:tcp", ipStr, ipfixCollectorPort) | |
faClusterIP, err := testData.deployFlowAggregator(fmt.Sprintf("%s:%s:tcp", ipStr, ipfixCollectorPort)) | |
ipfixCollectorAddr := fmt.Sprintf("%s:%s:tcp", ipStr, ipfixCollectorPort) | |
tb.Logf("Applying flow aggregator YAML with ipfix collector address '%s'", ipfixCollectorAddr) | |
faClusterIP, err := testData.deployFlowAggregator(ipfixCollectorAddr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/e2e/fixtures.go
Outdated
tb.Logf("Deploying flow exporter with collector address: %s:%s:tcp", faClusterIP, ipfixCollectorPort) | ||
if err = testData.deployAntreaFlowExporter(fmt.Sprintf("%s:%s:tcp", faClusterIP, ipfixCollectorPort)); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same suggestion as above
test/e2e/flowaggregator_test.go
Outdated
// TODO: remove this limitation after flow aggregator supports IPv6 | ||
skipIfIPv6Cluster(t) | ||
skipIfNotIPv4Cluster(t) | ||
skipIfProviderIs(t, "remote", "This test is not yet supported in jenkins e2e runs.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jenkins is using the "vagrant" provider apparently, as discussed offline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
test/e2e/fixtures.go
Outdated
if err := data.deployAntreaFlowExporter(fmt.Sprintf("%s:%s:tcp", ipStr, ipfixCollectorPort)); err != nil { | ||
return data, err, isIPv6 | ||
tb.Logf("Deploying flow exporter with collector address: %s:%s:tcp", faClusterIP, ipfixCollectorPort) | ||
if err = testData.deployAntreaFlowExporter(fmt.Sprintf("%s:%s:tcp", faClusterIP, ipfixCollectorPort)); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should only use the ClusterIP on Kind (since apparently there are DNS issue, which should be explained in a comment). On Jenkins we should use the default (DNS path), since this is what users will use by default and it's better to test for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Thought of doing this later once everything runs perfectly on jenkins. Added it now.
2f301d8
to
a4df2e5
Compare
- Add the build system, scripts for manifest generation and corresponding workflow changes for Flow Aggregator. - The main logic implementation of the flow aggregator that uses the go-ipfix library v0.4.2 with required unit tests. - Agent side changes in Flow Exporter for integration with Flow Aggregator using DNS name resolution. - Add e2e tests for flow aggregator and remove flow exporter tests. Co-authored-by: dyongming@vmware.com Co-authored-by: zyiou@vmware.com Co-authored-by: stati@vmware.com
a4df2e5
to
7cb2e24
Compare
/test-e2e |
@@ -244,6 +244,20 @@ function setup_cluster() { | |||
fi | |||
} | |||
|
|||
function copy_image { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, it is good to have a copy_image
function to make deliver_antrea
clean. However, it confuses me that this function doesn't include judgement for "Coverage" while "centos-7" is included. I think these 2 parameters should be treated equally. So one parameter IP
is enough for this function.
Could you please explain why you design like this? I just woke up and maybe my mind is not clear enough...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lzhecheng you make a good point. If the tests don't pass and we need to amend this PR, we will change this function as well as per your suggestion. If the tests do pass however, I suggest we merge this in order to make the release, and @srikartati can take as an action item to improve the function implementation. Does that work for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"coverage" is used in deliver_antrea (L328-L333). @antoninbas identified a bug with image name when coverage is not true; it should be projects.registry.vmware.com and not docker.io. Therefore we considered 4 args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lzhecheng is wondering if we can reduce the number of parameters by accessing $COVERAGE
directly in the function implementation. However, thinking more about this, I think it would make the function implementation dependent on which image we are pushing (antrea or flow-aggregator), which is probably not great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can improve CI after this PR is merged. I fully understand the pain to re-run all tests...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antoninbas sorry I didn't get your point. I think all logic of the loop in deliver_antrea
will be in copy_image
. And flow-aggregator image always gets delivered, right? "centos-7" and "coverage" take care of the original logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea was to have a function that support uploading any image when invoked, rather than a function that takes care of uploading all images at once in a single invocation. With your proposal, won't we need to duplicate logic inside of the function for the antrea image and for the flow-aggregator image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, now I get your point. Yes, you are right. Such function is the one needed and I just try to move as much as logic to other functions to make deliver_antrea
clean.
Ideally, we can put a copy_some_images
in deliver_antrea
and copy_image
in copy_some_images
but it is unnecessary in our case.
/test-conformance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- Add the build system, scripts for manifest generation and corresponding workflow changes for Flow Aggregator. - The main logic implementation of the flow aggregator that uses the go-ipfix library v0.4.2 with required unit tests. - Agent side changes in Flow Exporter for integration with Flow Aggregator using DNS name resolution. - Add e2e tests for flow aggregator and remove flow exporter tests. Co-authored-by: dyongming@vmware.com Co-authored-by: zyiou@vmware.com Co-authored-by: stati@vmware.com
No description provided.