Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPI: support for ISR and work_queue driven transfers #11302

Closed
wants to merge 11 commits into from

Conversation

dakejahl
Copy link
Contributor

Added an _is_locked variable to the SPI class which is a simple bit mask where the bit position corresponds to the SPI bus number. When running SPI::transfer from an ISR, LockMode is selected as LOCK_NONE (because you cannot pend on a sem in interrupt context). I have added a check for the _is_locked flag for the given bus when LockMode is none, thus preventing an ISR driven SPI::transfer from clobbering an already active transfer taking place in workqueue/task/etc.

@dakejahl dakejahl requested a review from dagar January 25, 2019 20:43
@dakejahl
Copy link
Contributor Author

Next steps, test this with two IMUs (6500/9250) running on the same SPI bus (SPI1). Schedule one using the new px4_workq and the other with the current hrt_call_every. Look at spi_isr_deferred perf counter to see if the ISR driven transfer is being occasionally aborted.

@dakejahl
Copy link
Contributor Author

dakejahl commented Jan 25, 2019

Merged into #11261 and tested like so

	if(_whoami == MPU_WHOAMI_6500) {
		ScheduleOnInterval(_call_interval - MPU9250_TIMER_REDUCTION, 10000);
	} else {

		hrt_call_every(&_call,
			       1000,
			       _call_interval - MPU9250_TIMER_REDUCTION,
			       (hrt_callout)&MPU9250::measure_trampoline, this);
	}

This appears to be working correctly
selection_018

And I am seeing perf counts for deferred ISR driven transfers
selection_017
which match the bad transfers perf counter from a failed SPI::transfer
selection_020

@dagar if you want to test it yourself
https://github.com/PX4/Firmware/tree/spi_locking_workq_and_isr

Copy link
Member

@davids5 davids5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dakejahl I am a tad bit uncomfortable with the removal of result, but I guess it can be added back if we extent the upstream nuttx SPI api.

Please verify the 2 comments raised

src/lib/drivers/device/nuttx/SPI.cpp Outdated Show resolved Hide resolved
src/lib/drivers/device/nuttx/SPI.cpp Outdated Show resolved Hide resolved
@dakejahl
Copy link
Contributor Author

I removed result because _transfer was always returning OK. And yeah nuttx SPI exchange has no return value.

@dakejahl
Copy link
Contributor Author

dakejahl commented Jan 27, 2019

I need to do one more test on this Monday, I had left out applying the bus number to the bitmask when checking if the bus was locked. This shouldn't have had any affect with the way the system runs currently, I just want to sanity check it.

@dakejahl
Copy link
Contributor Author

Okay I think this is done now. I fixed an issue where I wasn't checking the bit position in the _is_locked bit mask. So @dagar the ISR driven cycle is actually deferring much less than what was apparent previously (previously I was just looking at if anyone was using any SPI bus, i.e activity on SPI2).
selection_022

This actually isn't relevant though because the default LOCK_MODE is LOCK_PREEMPTION anyways (enters critical section for transfer). However I changed it to LOCK_THREADS for the purpose of demonstrating this _is_locked mechanism is functional.

Bench tested on a Teal with 2 IMUs(6500/9250) on SPI1, one running from the new wq and the other from an hrt_call_every

src/lib/drivers/device/nuttx/SPI.cpp Outdated Show resolved Hide resolved

void SPI::lock(struct spi_dev_s *dev)
{
_is_locked |= (1 << _device_id.devid_s.bus);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dakejahl this needs to be atomic. It's a shared state potentially written to by multiple threads.
You can use the API in #11328.

Also this assumes that all SPI instances for a given bus run on the same thread. Can you add this as comment here to make the assumption explicit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI once we're sure there's no longer a mix of PX4 SPI drivers running out of HRT and threads we can drop all of this and lean on the underlying nuttx SPI_LOCK (semaphore).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant to emphasize was that this only exists for the case of a potential HRT and thread SPI conflict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be atomic. It's a shared state potentially written to by multiple threads.
You can use the API in #11328.

Cool! I'll add that in as soon as #11328 is merged.

Copy link
Contributor Author

@dakejahl dakejahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this assumes that all SPI instances for a given bus run on the same thread.

Could you explain please? I don't think it does.

@bkueng
Copy link
Member

bkueng commented Jan 30, 2019

Could you explain please? I don't think it does.

The implementation is not correct if this does not hold. This is what can go wrong:

  • 2 threads enter SPI::lock with the same bus id.
  • both set _is_locked
  • one of the threads (let's say thread1) enters the critical section, the other one (thread2) blocks
  • thread1 leaves the CS and clears the bit in _is_locked.
  • thread2 enters the CS, while the bit in _is_locked is cleared.

@dagar
Copy link
Member

dagar commented Jan 30, 2019

@dakejahl #11328 has merged.

@dakejahl dakejahl force-pushed the pr-spi_correct_locking branch from 0abc38c to b7dce1c Compare January 30, 2019 20:57
@dakejahl
Copy link
Contributor Author

dakejahl commented Jan 30, 2019

@bkueng Thanks for the explanation! I've updated the code. Lock free atomic operations are a new concept to me, please let me know if this doesn't look right.

edit: Oh, ignore me. I need to check if the lock bit is already set before setting, otherwise it can potentially be "set" twice.

@davids5
Copy link
Member

davids5 commented Jan 30, 2019

@dakejahl @dagar - Is it true that in the end (all done), there is only 1 thread per bus and the possibility of hrt interruption.?

If so:There is a 1:1 of bus:semaphore so this reduce to

interrupt context && dev->semaphore count == 0 -> exit

So the loser is always running in interrupt context. Therefore the test does not need protection as it is non interruptible - no nesting or HI priority IRQ) So dev->semaphore count == 0 => Exit. the sem_wait is already wrapped in a DI, EI and is safe.

Is the problem a shared bus? Barro and Nuttx driver on same bus?

@dagar
Copy link
Member

dagar commented Jan 30, 2019

I believe the only reason that wasn't done is accessing the somewhat hidden spidev semaphore.

@dakejahl
Copy link
Contributor Author

dakejahl commented Jan 31, 2019

Yeah I wasn't sure how to get at that semaphore. btw thanks everyone for reviewing, lots of little nuances I wasn't aware of with this.

Copy link
Member

@bkueng bkueng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check if the lock bit is already set before setting, otherwise it can potentially be "set" twice.

That's fine, if the at-most-one-thread-per-bus assumption holds (same as @davids5's question: 'Is it true that in the end (all done), there is only 1 thread per bus and the possibility of hrt interruption.?').

src/lib/drivers/device/nuttx/SPI.cpp Outdated Show resolved Hide resolved
@davids5
Copy link
Member

davids5 commented Jan 31, 2019

Given the " at-most-one-thread-per-bus assumption" has a complication we need to resolve to make it so.

A spi bus on some platforms can have a device that is accessed from a device driver in nuttx in the foreground. The params module can use a mtd derivative such as at25 eeprom or a ramtron FRAM and should have only non hrt threads running it. This needs to be true because the OS is used for the reader-writer serization (we should check with an test build with PX4_VERIFY_ASSERT(!up_interrupt_context()) in that driver)!

FMUv4 has on SPI2 a ms5611 barometer and the FRAM. The barometer data sheet states effectively do not wiggle SPI pins doing conversions. Hece this complication that I am not 100% solves the noise issues.

The only way this worked is because A) the ms5611 driver's bus transactions are on a work queue (ISR CAN NOT use OS functions that wait (sem_wait)) and B) it is not on a bus that has hrt threaded devices.

We need to catch a potential misuse: PX4_SPI_BUS_BARO == PX4_SPI_BUS_RAMTRON and some hrt device also on bus.

Which is the same problem the lock is needed for:

The default is locking_mode(LOCK_PREEMPTION), right-red path - this locks the hrt threads out.
the baro uses locking_mode(LOCK_THREADS)` the blue path which shares the dev->smaphore with the Nuttx Driver

  1. hrt takes the left-red path - assume it owns the bus ONLY because of 2
  2. The default non hrt (threads init or wq) take the right-red path - this locks the hrt threads out
  3. baro (and the OS's drivers in spirit) takes right-blue path

The root problem is 1 can trounce on 3.

image

We should look at moving param module's IO to the SPIx bus thread (and enforce this architecturally) if we do that then everything will be on ONE bus thread and the we can then drop the sem_wait and use only the atomic for isr protection.

We should also change the parameter to not be written during flight. That will fully solve the barro noise issue. If we had adequate holdup time and a power fail detection we could commit them with an orderly shutdown as can be done with SW reboot. But that is a whole other issue.

Co-Authored-By: dakejahl <37091262+dakejahl@users.noreply.github.com>
@dakejahl
Copy link
Contributor Author

dakejahl commented Jan 31, 2019

Given the " at-most-one-thread-per-bus assumption" has a complication we need to resolve to make it so.

I think this makes sense architecturally and we should enforce it and document it.

We should also change the parameter to not be written during flight.

You are referring to _locking_mode? I agree.

That will fully solve the barro noise issue. If we had adequate holdup time and a power fail detection we could commit them with an orderly shutdown as can be done with SW reboot. But that is a whole other issue.

I think the way the ms5611 collects is:

  1. Schedules the measurement on the hp_wq -> spi::transfer (tells ms5611 to measure)
  2. Reschedules the collection phase on the hp_wq for some time in the future. Meanwhile ms5611 is performing a measurement, the SPI bus is not locked, and param module can use the bus.

I think we would need to keep the SPIx (spi2 on fmu-v4) bus locked for the entirety of the measure/collect of the ms5611 if the condition is "don't wiggle SPI while doing a conversion".

Yeah?

@davids5
Copy link
Member

davids5 commented Jan 31, 2019

We should also change the parameter to not be written during flight.

You are referring to _locking_mode? I agree

No param save to the mtd.

I think the way the ms5611 collects is:

Schedules the measurement on the hp_wq -> spi::transfer (tells ms5611 to measure)
Reschedules the collection phase on the hp_wq for some time in the future. Meanwhile ms5611 is performing a measurement, the SPI bus is not locked, and param module can use the bus.

I think we would need to keep the SPIx (spi2 on fmu-v4) bus locked for the entirety of the measure/collect of the ms5611 if the condition is "don't wiggle SPI while doing a conversion".

I agree but only on HW where it matters and it can be done and that is case by case.

@dakejahl
Copy link
Contributor Author

dakejahl commented Feb 2, 2019

Okay, maybe we revisit this once #11261 is merged? I added the comment for the one thread per SPI bus assumption.

@dagar dagar mentioned this pull request Mar 2, 2019
2 tasks
@dagar
Copy link
Member

dagar commented Mar 2, 2019

Another option to consider is to simply prevent the SPI transfer from an interrupt. Once #11571 is merged the only concern is 3rd party drivers rebased on newer PX4. I think it would be acceptable to fail at runtime with an error (or even assert) if it points to instructions that show how trivial it is to migrate.

@dakejahl dakejahl closed this Jun 21, 2019
@dakejahl dakejahl deleted the pr-spi_correct_locking branch July 8, 2019 04:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants