Switch from ovn-nbctl, ovn-sbctl, and ovs-vsctl to using ovsdbapp #136
Conversation
Force-pushed from dbee725 to de03381.
Force-pushed from 95fc6c4 to 282f44e.
That looks very promising. Thanks, @putnopvut! I didn't look closely at the code, but I have one high-level comment after looking at the CI logs. Another high-level comment is that with this change we no longer have a live stream of the test progress. That is unfortunate, but it looks like ansible doesn't support streaming of the command output, even though this feature has long been requested (ansible/proposals#92). It's not a huge problem, I guess.
It sounds good. I'll have to look into this to find out what this would entail. My guess is that you get verbose logging from ovsdbapp by changing the level of the logger to DEBUG. The only downside is that this will also give you low-level output from paramiko and other libraries. I'll see what I can work out.
Yes, I mentioned this in commit e8704a5 in this series. I was disappointed to find that ansible cannot stream output in real time. It appears this has been proposed several times in the ansible project, and it is a difficult problem to fix, mainly because it becomes the responsibility of all ansible submodules to support this, rather than a single global top-level change.
It looks like ovsdbapp is using named loggers, so something like this should probably work:
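A minimal sketch of that approach, assuming the standard Python logging API (the fully qualified logger name shown here is the one reported later in this thread):

```python
import logging

# Set up a handler at a quiet default level, then turn on DEBUG only for
# ovsdbapp's transaction logger, so paramiko and other libraries stay quiet.
logging.basicConfig(level=logging.INFO)
logging.getLogger('ovsdbapp.backend.ovs_idl.transaction').setLevel(logging.DEBUG)
```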
Looking at the ovn-tester playbook, I'm not sure what the actual value is of running the test itself from ansible. It only provides an ssh connection, but we still need to call docker manually. Maybe we can keep the configuration/startup in ansible, but call the actual test directly? E.g.:
This way we can still see the log in real time and even store it locally with tee as we do now.
Despite the fact that the containerized tester is intended to
communicate directly with the DBs on the test nodes, there is still a
need for the tester to ssh to the hosts. This is because we still need
to be able to do non-DB operations, such as creating network namespaces.
This is not entirely true. The tester doesn't need access to physical hosts in order to create namespaces inside containers. It just needs a direct SSH connection to all containers to do that.
As such, the tester needs to be given an SSH private key that can be
used to communicate with the hosts. This is configured in the physical
deployment configuration. Perhaps a future improvement could eliminate
all SSH usage in ovn-tester, but for now it's still necessary.
Perhaps, we just need to go "all in" with the containerization and avoid impersonating the orchestrator.
I spent some time with ansible and came up with the following playbook: igsilya@816f670 . This playbook is able to bring up all the containers we need and wire them up for passwordless SSH from the tester container.
Having that, we should be able to remove the usage of ./ovn_cluster.sh and the process monitor script from the ovn-tester code. For all the other operations we should not need access to physical hosts, IIUC. ovn-tester can be converted to establish direct SSH connections to all containers and not mess with docker and ugly 'invoke_shell' calls.
What do you think?
It's quite a bit of work cleaning up ovn-tester, and we'll also need a good solution to avoid duplicating defaults, e.g. get the test config in the orchestrator, apply the defaults for all the variables that are not specified, and pass a full yml with no omitted values to the tester, so that we don't decide what the default value for mgmt_net is, for example, in two different places.
That said, it might make sense to do that as a separate PR, or we can integrate that work into the current PR to avoid touching the same code twice.
OK, this seems easy enough to at least follow up on. I can test it out and see what sort of output we end up getting.
This sounds reasonable to me. I'll give it a try and see how it works out.
I like the idea of starting all containers before executing the test. That way, IP address management for the containers can all be orchestrated prior to running the test. It would make for some less hack-y IP address calculations in general, I think. I also really like the idea of setting up SSH connectivity in the playbook before running the test. It allows you to generate the SSH keys as part of the playbook, so there is no need to create SSH keys beforehand, and there is no need to tell the physical deployment where the SSH key is. The bigger question is whether SSHing directly to containers is actually a net positive. The number of SSH connections from the tester container would jump from a relatively low number (40 at most?) to the number of nodes under test. So for a 120-node test, the tester container would now potentially have 120 SSH connections open. I actually have similar concerns about the connections from the tester to the OVS database running on each container.
I think it should not be a problem unless we run out of open file descriptors, but that should not be a big problem either; we can always change the limit in the tester container. Our databases have sustained SSL/TLS connections to all ovn-controllers, which is the same number, and they handle that just fine.
It took some work, but I got this working finally. The problem is that the logging name is not just "transaction" as I would expect, but rather "ovsdbapp.backend.ovs_idl.transaction". I will include this in the next version of the PR. I'll wait until I have everything else implemented first.
This worked as expected. It actually made things ever so slightly faster since I didn't have to compress, copy, and then uncompress the ovn-tester log file.
Force-pushed from 282f44e to de24f1e.
I just pushed a new version of the series. The differences are:
What I have not addressed is the idea of replacing SSHing to the hosts with SSHing to the containers directly. IMO, this would probably work better as a follow-up PR.
Force-pushed from de24f1e to be164b0.
I'm aware of the failing check. The issue is that the
I thought the issue was that EDIT: I should note that in my working copy
Force-pushed from be164b0 to b70e796.
Thanks for the PR @putnopvut this is great! I didn't do a full review yet so I just left some notes from when I was trying it out on our scale setup.
Force-pushed from 9efb620 to eb028f0.
Commit fb170c8 adds gc debugging. I strongly suggest removing this commit before merging. However, while we are using this PR to do higher-scale tests, it may be useful to have this while debugging the latency spikes we see. The reason I suggest removing this is because the garbage collector is EXTREMELY NOISY. It easily doubles or triples the size of
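For reference, Python's collector debugging is typically switched on with something along these lines (a generic sketch, not necessarily what commit fb170c8 does):

```python
import gc

# Print collection statistics to stderr on every garbage collection pass;
# this is what makes the resulting logs grow so quickly.
gc.set_debug(gc.DEBUG_STATS)
```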
Current status: This is not ready to be merged for two reasons:
Force-pushed from eb028f0 to fd5429f.
I fixed many of the issues pointed out in the PR. The remaining issues are:
Edit: I forgot to mention that the reason I pushed new commits is that I want to ensure the changes I made to the Cirrus CI are working as intended.
@igsilya I have good news and bad news about the odd latencies. The good news is that there is no bug in
This is likely a side effect of my suggestion to actually add and delete ports in the startup phase of the cluster-density test that was implemented here: #140.
First, this actually makes the process-monitor run correctly on the ovn-tester container. Second, it updates the process-monitor script to use the full p.cmdline() if we couldn't find a match using just p.name().
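A rough sketch of that matching fallback, assuming the process monitor is built on psutil (the function name and structure here are illustrative, not the actual script):

```python
import psutil


def find_matching_procs(pattern):
    """Match on p.name() first, falling back to the full p.cmdline() for
    processes whose name alone (e.g. 'python3') is not distinctive."""
    matches = []
    for p in psutil.process_iter(['name', 'cmdline']):
        try:
            if pattern in (p.info['name'] or ''):
                matches.append(p)
            elif pattern in ' '.join(p.info['cmdline'] or []):
                matches.append(p)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return matches
```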
Latest push updates the README and gets
We don't wait for them to be up so it's possible that the port gets added and removed from the NB before ovn-controller has the chance to claim them. That will mess up the 'ovn-installed' latency calculation as we expect all LSPs to be claimed eventually.

Reported-at: #136 (comment)
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
I don't think we need the re-tagging in any of the ansible playbooks. "docker pull" should be enough. But it's ok if you prefer to do this as a follow up, thanks!
I had a closer look at all changes and they mostly look good to me. I left a few comments though. I think the only important comments are about the resulting html files and the load balancer related transactions.
Thanks!
```sh
local test_file=$1

# The tester gets the first IP address in the configured node_net.
node_net=$(${ovn_fmn_get} ${test_file} cluster node_net --default=192.16.0.0/16)
```
Nit: now we hardcode 192.16.0.0/16 in two places: here and in ovn_tester.py. Do you think there's an easy way to just define this default in a single place?
I was thinking about this, and there are a few ways that could work.
1. Use an environment variable to define default values and ensure this same environment variable is used both in the orchestrator and the tester container.
2. Define default values in some file, and refer to this file from do.sh and ovn-tester.
3. Write a python module similar to ovn-fake-multinode-tools/get-config-value.py, put it in ovn-tester, and define defaults in that module. Then do.sh and ovn-tester could both call into this module to get configuration values.
4. Use a template value in ovn_tester.py that can be filled in by do.sh at install-time.
Of these, I think 1 is clearly the worst. 2 is simple but also a bit clumsy. For me, it's a toss-up whether 3 or 4 is better. I think I would lean more towards option 3 myself. What do you think?
Not sure if that is what you meant in 4, but one thing I had in mind previously was:
- Make ovn-tester fail if at least one variable is not defined in the yml file, i.e. no defaults.
- Make do.sh generate the full yml from a test scenario yml filling all the undefined values with defaults.
- Run ovn-tester with a generated fully filled yml.
This will ensure that we do not have defaults defined in multiple places. It will also be easier to track down undefined/misconfigured values if ovn-tester just fails instead of proceeding with unexpected config values.
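As an illustration of that flow, a minimal sketch of the orchestrator-side default filling (the defaults table, helper, and file names are hypothetical):

```python
import yaml

# Hypothetical defaults table; in this scheme the orchestrator side would be
# the single owner of these values.
DEFAULTS = {
    'cluster': {
        'node_net': '192.16.0.0/16',
    },
}


def fill_defaults(config, defaults):
    """Recursively add every key that the scenario config leaves unset."""
    for key, value in defaults.items():
        if key not in config:
            config[key] = value
        elif isinstance(value, dict) and isinstance(config[key], dict):
            fill_defaults(config[key], value)
    return config


# Read the user scenario and write the fully expanded yml that is passed
# on to ovn-tester (file names are illustrative).
with open('scenario.yml') as f:
    config = yaml.safe_load(f) or {}
with open('scenario-full.yml', 'w') as f:
    yaml.safe_dump(fill_defaults(config, DEFAULTS), f)
```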
Sounds interesting! It also sounds like something we can do as a follow up in my opinion. Shall we open an issue to track this?
@igsilya Sorry for the ambiguity on number 4. What I meant was that ovn_tester.py would have something like this in it:

```python
DEFAULT_NODE_NET = {{ default_node_net }}
node_net = netaddr.IPNetwork(cluster_args.get('node_net', DEFAULT_NODE_NET))
```

We might also rename it to ovn_tester.py.in to make it clear that it can't be run as-is. Then, when running do.sh install, the do.sh code will populate {{ default_node_net }} with the proper default. This way, the default is only defined in do.sh.
It's not my favorite suggestion, but it's something I thought of.
@putnopvut Hmm, I'm not sure I like the idea of an ovn_tester.py.in. I think I'd prefer @igsilya's suggestion of do.sh filling in all required input values in the yml we finally pass to ovn-tester.
Yes, that's why, of the 4 suggestions I came up with, 3 was the one I preferred most. Mine and @igsilya's are more-or-less the same, except that in his, do.sh controls the defaults, and in mine, ovn-tester controls the defaults.
I see, your option "3" is also fine, thanks!
I've actually implemented @igsilya's idea at https://github.com/putnopvut/ovn-heater/tree/default_dumper . I was hoping to be able to open a PR here, but github does not allow using a PR as the basis for another PR. Therefore, when I created the PR, it was way larger and more confusing than necessary. I'll hold off and post that PR once this one is merged.
Previously, the physical deployment was copied during installation, but since it is feasible to want to change the physical deployment between test runs, this is moved to runtime instead.
This isn't necessarily more correct, but it makes the code more consistent.
This is taken care of implicitly at a later stage.
As of the latest update, the only outstanding comment is #136 (comment). Should I attempt to do this as part of this PR or save it for a follow-up?
A follow-up sounds good to me; let's try to get this PR in. It looks almost ready to me. I'll have another look tomorrow.
OK nice. As of this moment, this PR has 36 commits associated with it because of the many individual commits I made to address PR issues. Is it OK to merge these as-is, or should I squash them into a more reasonable series of commits?
I think the current series of commits is reasonable enough. I'd maybe squash "Address PR comments.", "Address some more PR comments:" and "Fix pep8 issues." into the appropriate commits. The rest is fine by me, thanks!
I mangled the commits a bit to include the review incremental updates in the correct commits. Apart from that I didn't change anything. I went ahead and pushed this to the main branch. Thanks!
As the title says, this series of commits switches the vast majority of calls to *ctl utilities to using ovsdbapp, a python library that wraps the python OVSDB IDL provided by OVS. It would be apt to compare ovsdbapp to utilities like ovn-nbctl, since it provides shortcut methods for modifying the database.
This marks a change where, instead of having to ssh into the hosts, the ovn-tester code now needs to be able to connect directly to the OVSDB servers running in the containers. The best way to do this is to containerize ovn-tester and bridge that container to the rest of the containers. In principle, this is incredibly simple, but in order for things to go smoothly, there was a lot of extra work that had to be done. Even if this PR is denied, I would argue that containerizing the tester has merits on its own beyond the ability to use an IDL client.
Unfortunately, this change does not eliminate SSH usage entirely. We still use SSH + docker exec for a few things:
- An ovs-appctl command in order to make the database listen for incoming connections. If enable_ssl is true, then we also have to issue an ovs-vsctl command to populate the SSL record.
- The ext_cmds configured in test scenarios.

The database transactions are translated 1-to-1 in this PR. In other words, if we performed a single action in an ovn-nbctl command, for instance, we are now creating a transaction that does one command. However, since this change introduces first-class Transaction objects, it allows for easily combining multiple commands into a transaction.

Regarding performance, I have done tests of ovn-low-scale.yml and ocp-20-density-heavy.yml on a single machine. My findings are that ovsdbapp performs similarly to the previous method, especially when using the C JSON parser (ovn-org/ovn-fake-multinode#69 makes it so that OVS built by ovn-fake-multinode will automatically use the C JSON parser). However, it is possible that certain operations within ovsdbapp may be optimized further.

Picking up new optimizations is one reason that we are pulling from the master branch of ovsdbapp when installing it in the tester container. The other reason is that, at the time of this PR, there is a bug fix not yet in any release that allows for more graceful operation when a RAFT leader election takes place.
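For illustration, a minimal sketch of the ovsdbapp usage pattern described above: a single shortcut command executed on its own, and several commands batched into one Transaction (the connection string and the switch/port names are made up for the example):

```python
from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.schema.ovn_northbound import impl_idl

# Connect to the OVN Northbound database (address is illustrative).
idl = connection.OvsdbIdl.from_server('tcp:127.0.0.1:6641', 'OVN_Northbound')
nb = impl_idl.OvnNbApiIdlImpl(connection.Connection(idl=idl, timeout=60))

# 1-to-1 translation of a single ovn-nbctl action: one command per transaction.
nb.ls_add('sw0').execute(check_error=True)

# First-class Transaction objects make it easy to batch several commands.
with nb.transaction(check_error=True) as txn:
    txn.add(nb.lsp_add('sw0', 'sw0-port1'))
    txn.add(nb.lsp_add('sw0', 'sw0-port2'))
```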