Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sensors health reporting improvements #13025

Closed

Conversation

dusan19
Copy link
Contributor

@dusan19 dusan19 commented Sep 25, 2019

Main issues:

  1. Gyro, accel, mag and baro health is evaluated by different modules (sensors, commander) and each of them set the health bitmask inconsistently, furthermore each of them overrides the health reported by the other.
  2. Preflight checks do not consider sensor failures reported by the sensors module

Related bugs:

  1. Mag inconsistency check fail when there is only one mag enabled, due to a late publishing of a disabled mag
  2. Disabled mag flagged as used in ecl data validator due to not skipping a late published disabled mag
  3. Commander: bug when reading all available subsystem_info
  4. CAL_*_EN and CAL_*_ID parameters are handled inconsistently

More detailed explanation and problem examples:

  1. Sensors health bitmask is set in commander::status.onboard_control_sensors_health. It is set by preflight tests function, and by the health reported on the subsystem_info topic by sensors module.
    Preflight checks set the bit in the health bitmask that matches the sensor instance. For example SUBSYSTEM_TYPE_GYRO bit is set for gyro 0, and SUBSYSTEM_TYPE_GYRO2 for gyro 1. On the other hand sensors module sets the bits considering the number of healthy sensors. Which means that in case of a gyro 0 failure, commander would set SUBSYSTEM_TYPE_GYRO to false, and sensors would set SUBSYSTEM_TYPE_GYRO2 to false. That behavior is not consistent.
    Solution: sensors module should set the health of a specific sensor according to its index and not to the number of healthy sensors. This way we know which sensor failed by looking at the vehicle_status message. Especially useful when using external sensors, to be able to know which one to replace.
    Furthermore, the imu, mag and baro health should be evaluated in the sensors module and reported over subsystem_info topic. Also preflight metrics and relevant flags should be reported by the sensors module, and be used in the commanders preflight checks. Preflight checks shouldn't subscribe to the topics reported by the imu, mag and baro drivers but only to the relevant topics from the sensors module.

  2. Preflight checks do not consider health reported by sensors module (no data, stale data, timeout, high error count, high error density).
    I could reproduce the following scenario: I set the required number of healthy sensors to 2 instead of default 1, enabled 2 mags, and had stale data on the second mag. The checks passed even though the second mag was broken.
    Solution: use the health reported by the sensors module together with the preflight metrics and health flags from the sensor_preflight topic.

  3. Preflight checks before used the max_mandatory_*_count constant to determine if the sensor was required. So by setting this to 2 would mean that 2 sensors were required to pass the checks, but not any two, it had to be the first two, (instances 0 and 1), so in my case mag 1 broke, so healthy mags 0 and 2 wouldn't allow arming. This is changed, such that each sensor has a prameter that specifies how many healthy sensors of that type have to be healthy to allow flying. The sensors are not ordered by their index. If the param is set to two, any two sensors of that type that are healthy would allow arming. Although primary sensor has to be healthy as a separate checks.

  4. voted_sensors_update::parameters_update() sets the enabled flags for each sensor using the CAL_*i_EN parameter.
    CAL_*_EN and CAL_*_ID params are handled inconsistently in parameters_update(). Enabled parameters are handled such that CAL_MAGi_EN refers to the sensor on uorb instance i.
    The ID param is searched for, which means that if uorb instance i has the same id as param CAL_MAGj_ID then CAL_MAGj calibration will be set for driver i. But still CAL_MAGi_EN will enable driver i.
    Fix is made such that enabled is set for the driver with the ID that's set with CAL_*_ID param.

Other fixes and improvements:

  • Commander runs preflight tests every 2 seconds, if the vehicle is unarmed. This way we have up to date health checks, and flight readiness, which can be reported over mavlink to the base station such that the user is informed about the system state and failures all the time and not only when arming is requested. Or be used otherwise.

  • I encountered a bug where I had 1 mag enabled but got mag sensors inconsistency failure. This happened because mag 2 was disabled but still had priority set to 100 instead of zero and was used in the inconsistency checks. This happens when the mag data is published after the parameter update in the voted_senors_update has already finished, and the enabled flag was not set to false for this sensor. When the data is available on the topic for the first time the mag is given the priority and after that the enabled flag is set to false, but the data is still pushed to the ecl data validator.
    Solution: when the parameter update is run after the first publication of the late mag, set the priority to 0 when setting the enabled flag to false, and skip pushing the data to the validator so the validator doesn't set the used flag to true for this mag

@dusan19
Copy link
Contributor Author

dusan19 commented Sep 25, 2019

note: changing the vehicle status health flag field from uint32 to uint64 variable, requires the same changes in the mavlink message since mavlink messages just copies the value over form uorb to mavlink msg. Although at the moment there is exactly 32 enum values defined so it shouldnt be a problem

@dagar
Copy link
Member

dagar commented Oct 4, 2019

There's quite a lot to digest here and some of this will need careful/deliberate testing. It might ultimately be easier to bring it in separate chunks where we can exhaustively test a single thing at a time.

  1. Preflight checks before used the max_mandatory_*_count constant to determine if the sensor was required. So by setting this to 2 would mean that 2 sensors were required to pass the checks, but not any two, it had to be the first two, (instances 0 and 1), so in my case mag 1 broke, so healthy mags 0 and 2 wouldn't allow arming. This is changed, such that each sensor has a prameter that specifies how many healthy sensors of that type have to be healthy to allow flying. The sensors are not ordered by their index. If the param is set to two, any two sensors of that type that are healthy would allow arming. Although primary sensor has to be healthy as a separate checks.

Why does it make sense for this to be user configurable? Shouldn't it be information provided by each board?

@dagar
Copy link
Member

dagar commented Oct 4, 2019

note: changing the vehicle status health flag field from uint32 to uint64 variable, requires the same changes in the mavlink message since mavlink messages just copies the value over form uorb to mavlink msg. Although at the moment there is exactly 32 enum values defined so it shouldnt be a problem

Unfortunately we can't change the existing fields in mavlink. This would have to be new fields at the end or a new message entirely.

@dagar
Copy link
Member

dagar commented Oct 4, 2019

Preflight checks set the bit in the health bitmask that matches the sensor instance. For example SUBSYSTEM_TYPE_GYRO bit is set for gyro 0, and SUBSYSTEM_TYPE_GYRO2 for gyro 1. On the other hand sensors module sets the bits considering the number of healthy sensors. Which means that in case of a gyro 0 failure, commander would set SUBSYSTEM_TYPE_GYRO to false, and sensors would set SUBSYSTEM_TYPE_GYRO2 to false. That behavior is not consistent.
Solution: sensors module should set the health of a specific sensor according to its index and not to the number of healthy sensors. This way we know which sensor failed by looking at the vehicle_status message. Especially useful when using external sensors, to be able to know which one to replace.

One thought here to consider is if we should be leaning towards using device ids whenever possible instead of indexes. These are ultimately much easier to decode back to the particular sensor, bus, calibration, etc.

Furthermore, the imu, mag and baro health should be evaluated in the sensors module and reported over subsystem_info topic. Also preflight metrics and relevant flags should be reported by the sensors module, and be used in the commanders preflight checks. Preflight checks shouldn't subscribe to the topics reported by the imu, mag and baro drivers but only to the relevant topics from the sensors module.

I'm not really a fan of the subsystem_info topic. I think we can do better than Present/Enabled/Ok, and it's not very helpful for debugging logs. A higher level sensor status message (maybe by type like sensor_accel_status) could be interesting. Something that helps you reconstruct the valuable data equivalent to sensors status output. Sensors available, priority, healthy, timed out, etc. This message could publish constantly, but only on changes. This could always be logged.

@dusan19
Copy link
Contributor Author

dusan19 commented Oct 7, 2019

@dagar thanks for the review. I would be interested in working on this if it is something that would end up being merged. It is essential for our application to have real time sensors health reporting and flight readiness status.
It would also be beneficial for us not to have such a huge diff from PX4 master that is hard to rebase later. So, if these fixes are beneficial for PX4 as well, I could spend a little more time on doing it another way, in which case we would have to discuss first in more detail what is exactly the right solution to this in the sense of architecture and logic changes.

@julianoes
Copy link
Contributor

@dusan19 I think it would be great if you could make a pull request just for the two bugs that you found and mentioned.

note: changing the vehicle status health flag field from uint32 to uint64 variable, requires the same changes in the mavlink message since mavlink messages just copies the value over form uorb to mavlink msg. Although at the moment there is exactly 32 enum values defined so it shouldnt be a problem

As @dagar already answered, unfortunately this can't be changed. This means that this message is limited to 32 different sensors (or things). Right now, the spec already lists 25 options which means that we don't have a lot of room for whatever will come up in the future. This, to me, means that we need to come up with a new message (or better mechanism) to report preflight requirements and errors, sensor status, etc..
There is an events interface proposal by @bkueng to tackle this. It would be great if you could have a look at it and comment on it.

@dusan19
Copy link
Contributor Author

dusan19 commented Oct 8, 2019

@dusan19 I think it would be great if you could make a pull request just for the two bugs that you found and mentioned.

Bugfixes in:
#13119
and
#13121

@julianoes
Copy link
Contributor

Great, thanks @dusan19!

@@ -1739,14 +1739,21 @@ Commander::run()

/* update subsystem info which arrives from outside of commander*/
do {
if (subsys_sub.updated()) {
if ( (updated=subsys_sub.updated()) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An assignment in a condition rings a bell for me.
what about a while loop since the condition is anyways at the beginning?

Copy link
Contributor Author

@dusan19 dusan19 Oct 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was fixed in a separate PR #13121

Also there is one more PR that was a result of the discussion in this PR and resolves a bug mentioned here (#13119)

By the way this assignment in a condition works properly, also tested here.

@stale
Copy link

stale bot commented Jan 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the stale label Jan 16, 2020
@dagar
Copy link
Member

dagar commented Sep 7, 2020

FYI I haven't forgotten about this.

For each sensor type I'm introducing a sensors status msg (eg sensors_status_imu) that will contain the voting status, errors, inconsistency check, etc. First example is in #15681.

We're also getting pretty close to having multi-EKF support merged (#14650), where we'll have a lot more data to work with (eg estimated biases for all sensors).

@stale stale bot removed the stale label Sep 7, 2020
@stale
Copy link

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the stale label Dec 25, 2020
@LorenzMeier
Copy link
Member

Thanks for the contribution! Given how much the state estimation framework has moved, I think taking a fresh pass is simpler than solving the many merge conflicts here.

My apologies for this sitting in the review queue for so long - we're currently working to make this a much more timely process.

@dagar
Copy link
Member

dagar commented Jan 12, 2021

FYI quite a lot of this has been addressed over the past year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants