Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

soundwire: bus: fix io error when processing alert event #1589

Conversation

RanderWang
Copy link

When audio system is going into suspend state, plug and unplug
headphone a few times, then codec driver will report io error
when doing jack detection. At this time codec is just suspended,
so io registers can't be accessed.

This patch wakes up codec when processing slave alert interrupt
triggered by jack or other events, so that io registers can be
accessed.

@RanderWang
Copy link
Author

RanderWang commented Dec 4, 2019

@YvonneYang2 hi, can you help to validate this PR on soundwire-latest on comet lake for PV? please remove RP1580 first. Thanks!

@YvonneYang2
Copy link

@RanderWang
Function test is done, jack detection works fine and no other regression found with #1589 . Stress test is still going on. Will check it later.

Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry @RanderWang I think we need more details here, see below.

@@ -880,20 +880,26 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)

sdw_modify_slave_status(slave, SDW_SLAVE_ALERT);

ret = pm_runtime_get_sync(&slave->dev);
if (ret < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the usual && ret != -EACCESS

this is needed in case the device doesn't support pm_runtime

@@ -1013,6 +1019,10 @@ static int sdw_handle_slave_alerts(struct sdw_slave *slave)
if (count == SDW_READ_INTR_CLEAR_RETRY)
dev_warn(slave->bus->dev, "Reached MAX_RETRY on alert read\n");

io_err:
pm_runtime_mark_last_busy(&slave->dev);
pm_runtime_put(&slave->dev);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message doesn't really describe the issue and I am having a hard time convincing myself.

An alert can only come when the bus is operational and the device is enumerated with a non-zero device number.
So in theory read/write access to registers should not be an issue.

It would really help if you could describe in more details what the root cause is, so that we can justify this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart For rt711, the suspend function only change regmap to cache only. And in this case, slave cant access its register when processing jack event. how about to get pm up when dispatching irq event to slave only ? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RanderWang I see your point but help me understand when the Slave device would be pm_runtime suspended when you get an interrupt. is this in the case where the master is active but the slave suspends? Is this in the case where you do clock stop with the DSP active? Is this in the case where you do clock stop and a complete re-enumeration.

I trust that this is real, but since I didn't do any experiment on my side I can only rely on your explanation to reconstruct the idea. Thanks!

Copy link
Author

@RanderWang RanderWang Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RanderWang I see your point but help me understand when the Slave device would be pm_runtime suspended when you get an interrupt. is this in the case where the master is active but the slave suspends? Is this in the case where you do clock stop with the DSP active?
Is this in the case where you do clock stop and a complete re-enumeration.

if DSP is active, so device is in D0ix, driver doesn't need a complete re-enumeration and clock stop can works as the flow defined in spec. If device is in D3 and we still need to enable clock stop, slave may send unattached or alert event (actually alert event is rare and it should be incorrect for slave lost sync with master, and I got it only on one device) to master when plugging a headphone. I got Io timeout when alert event was sent to master. This pm sync can fix this io timeout.

Copy link
Member

@plbossart plbossart Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, not getting your points. There are multiple devices, the PCI, master and Slave ones.

There is also a difference between the wake in clock stop, and alerts that can only happen when the device is attached and with a non-zero Device Number.

If the PCI device is in D3, the wake from the Slave will result in a bus reset and re-enumeration. The Slave will become re-attached.

So overall I just don't have any idea of when this IO error happens? Can you try to write a sequence (maybe with bullet points to that the transitions are clear), that may be more helpful than sentences with and/or statements.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, not getting your points. There are multiple devices, the PCI, master and Slave ones.

There is also a difference between the wake in clock stop, and alerts that can only happen when the device is attached and with a non-zero Device Number.

If the PCI device is in D3, the wake from the Slave will result in a bus reset and re-enumeration. The Slave will become re-attached.

So overall I just don't have any idea of when this IO error happens? Can you try to write a sequence (maybe with bullet points to that the transitions are clear), that may be more helpful than sentences with and/or statements.

yes, it is good idea to make it more clear
if the PCI device is in D3 and clock stop mode 0 is enabled
(1) plug a headphone, the alert event sent by slave is captured by glue hw, not master
(2) glue hw will wake up pci device and power on master, then initialize the master
(3) master enable clk and bus data to communicate with slave
(4) slave found the synchronization is not correct, and send alert event or unattached event to master. here I don't know why slave send alert event, and at this time IO transfer timed out when accessing slave register (sdw_read(slave, SDW_SCP_INT1)). And I also checked slave status by cdns_readl(cdns, CDNS_MCP_SLAVE_STATE) when IO transfer timed out for alert event, the status showed slave was unattached! So I think if master is powered off, we need to reset bus before communicating with slave.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is very odd here.
I reproduced this error when the PCI device is active, the master is active and the slave device is suspended
That's not at all what your description is.
I took your code but I'd really like to understand this better.

Copy link
Author

@RanderWang RanderWang Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart , I mentioned in my comments: it is rare, and last week it also could be reproduced on one device at @bardliao. He tested my branch https://github.com/RanderWang/linux/tree/my_clk_stop_jack_work and also got such issue.

I could easily reproduced it with the branch in this PR #1498 : in this branch, I didn't reset bus and clear slave status to unattached.

please check my comments and log. #1498 (comment)

please check line 307 in the log: master got a alert event
https://github.com/thesofproject/linux/files/3874045/sdw_io_timeout.log

plbossart and others added 7 commits December 4, 2019 16:08
When a Slave device becomes synchronized with the bus, it may report
its presence in PING frames, as well as optionally asserting an
in-band PREQ signal.

The bus driver will detect a new Device0, start the enumeration
process and assign it a non-zero device number. The SoundWire
enumeration provides an arbitration to deal with multiple Slaves
reporting ATTACHED at the same time. The bus driver will also invoke
the driver .probe() callback associated with this device. The probe()
depends on the Linux device core, which handles the match operations
and may result in modules being loaded.

Once the non-zero device number is programmed, the Slave will report
its new status in PING frames and the Master hardware will typically
report this status change with an interrupt. At this point, the
.update_status() callback of the codec driver will be invoked (usually
from an interrupt thread or workqueue scheduled from the interrupt
thread).

The first race condition which can happen is between the .probe(),
which allocates the resources, and .update_status() where
initializations are typically handled. The .probe() is only called
once during the initial boot, while .update_status() will be called
for every bus hardware reset and if the Slave device loses
synchronization (an unlikely event but with non-zero probability).

The time difference between the end of the enumeration process and a
change of status reported by the hardware may be as small as one
SoundWire PING frame. The scheduling of the interrupt thread, which
invokes .update_status() is not deterministic, but can be small enough
to create a race condition. With a 48 kHz frame rate and ideal
scheduling cases, the .probe() may be pre-empted within double-digit
microseconds.

Since there is no guarantee that the .probe() completes by the time
.update_status() is invoked as a result of an interrupt, it's not
unusual for the .update_status() to rely on data structures that have
not been allocated yet, leading to kernel oopses.

This patch adds a probe_complete utility, which is used in the
sdw_update_slave_status() routine. The codec driver does not need to
do anything and can safely assume all resources are allocated in its
update_status() callback.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
When the Master starts the bus (be it during the initial boot or
system resume), it usually performs a HardReset to make sure
electrical levels are correct, then enables the control channel.

While the PM framework guarantees that the Slave devices will only
become 'active' once the Master completes the bus initialization,
there is still a risk of a race condition: the Slave enumeration is
handled in a separate interrupt thread triggered by hardware status
changes, so the Slave device may not be ready to accept commands when
the Slave driver tries to access the registers and restore settings in
its resume or pm_runtime_resume callbacks. In those cases, any
read/write commands from/to the Slave device will result in a timeout.

This patch adds an enumeration_complete structure. When the bus is
goes through a HardReset sequence and restarted, the Slave will be
marked as UNATTACHED, which will result in a call to
init_completion().

When the Slave reports its presence during PING frames as a non-zero
Device, the Master hardware will issue an interrupt and the bus driver
will invoke complete(). The order between init_completion()/complete()
is predictable since this is a Master-initiated transition.

The Slave driver may use wait_for_completion() in its resume callback.
When regmap is used, the Slave driver will typically set its regmap in
cache-only mode on suspend, then on resume block on
wait_for_completion(&enumeration_complete) to guarantee it is safe to
start read/write transactions. It may then exit the cache-only mode
and use a regmap_sync to restore settings. All these steps are
optional, their use completely depends on the Slave device
capabilities and how the Slave driver is implemented.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Slave drivers may have different ways of handling their settings, with
or without regmap.

During the integration of codec drivers, done in partnership between
Intel and Realtek, it became desirable to implement a predictable
order between low-level initializations performed in .update_status()
(invoked by an interrupt thread) and the settings restored in the
resume steps (invoked by the PM core).

This patch builds on the previous solution to wait for the Slave
device to be fully enumerated. The complete() in this case is signaled
not before the .update_status() is called, but after .update_status()
returns. Without this patch, the settings were not properly restored,
leading to timing-dependent 'no sound after resume' or 'no headset
detected after resume' bug reports.

Depending on how initialization is handled, a Slave device driver may
wait for enumeration_complete, or for initialization_complete, both
are valid synchronization points. They are initialized at the same
time, they only differ on when complete() is invoked.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…nces

The Slave device initialization can be split in 4 different cases:

1. Master-initiated hardware reset, system suspend-resume and
pm_runtime based on clock-stop mode1. To avoid timeouts and a bad
audio experience, the Slave device resume operations need to wait for
the Slave device to be re-enumerated and its settings restored.

2. Exit from clock-stop mode0. In this case, the Slave device is
required to remain enumerated and its context preserved while the
clock is stopped, so no re-initialization or wait_for_completion() is
necessary.

3. Slave-initiated pm_runtime D3 transition. With the parent child
relationship, it is possible that a Slave device becomes 'suspended'
while its parent is still 'active' with the bus clock still
toggling. In this case, during the pm_runtime resume operation, there
is no need to wait for any settings to be restored.

4. Slave reset (sync loss or implementation-defined). In that case the
bus remains operational and the Slave device will be re-initialized
when it becomes ATTACHED again.

In previous patches, we suggested the use of wait_for_completion() to
deal with the case #1, but case thesofproject#2 and thesofproject#3 do not need any wait.

To account for those differences, this patch adds an unattach_request
field. The field is explicitly set by the Master for the case #1, and
if non-zero the Slave device shall wait on resume. In all other cases,
the Slave resume operations can proceed without wait.

The only request tracked so far is Master HardReset, but the request
is declared as a bit mask for future extensions (if needed). The
definition for this value is added in bus.h and does not need to be
exposed in sdw.h

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
The current interfaces between ASoC and SoundWire are limited by the
platform_device infrastructure to an init() and exit() (mapped to the
platform driver.probe and .remove)

To help with the platform detection, machine driver selection and
management of power dependencies between DSP and SoundWire IP, the
ASoC side requires:

a) an ACPI scan helper, to report if any devices are exposed in the
DSDT tables, and if any links are disabled by the BIOS.

b) a probe helper that allocates the resources without actually
starting the bus.

c) a startup helper which does start the bus when all power
dependencies are settled.

d) an exit helper to free all resources

e) an interrupt_enable/disable helper, typically invoked after the
startup helper but also used in suspend routines.

This patch moves all required interfaces to sdw_intel.h, mainly to
allow SoundWire and ASoC parts to be merged separately once the header
files are shared between trees.

To avoid compilation issues, the conflicts in intel_init.c are blindly
removed. This would in theory prevent the code from working, but since
there are no users of the Intel Soundwire driver this has no
impact. Functionality will be restored when the removal of platform
devices is complete.

Support for SoundWire + SOF builds will only be provided once all the
required pieces are upstream.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…erations

The SoundWire DAIs for Intel platform are created in
drivers/soundwire/intel.c, while the communication with the Intel DSP
is all controlled in soc/sof/intel

When the DAI status changes, a callback is used to bridge the gap
between the two subsystems.

The naming of the existing 'config_stream' callback does not map well
with any of ALSA/ASoC concepts. This patch renames it as
'params_stream' to be more self-explanatory.

A new 'free_stream' callback is added in case any resources allocated
in the 'params_stream' stage need to be released. In the SOF
implementation, this is used in the hw_free case to release the DMA
channels over IPC.

These two callbacks now rely on structures which expose the link_id
and alh_stream_id (required by the firmware IPC), instead of a list of
parameters. The 'void *' definitions are changed to use explicit
types, as suggested on alsa-devel during earlier reviews.

Signed-off-by: Rander Wang <rander.wang@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
The existing use of 6 handlers is problematic in MSI mode. Update
headers so that all shared interrupts can be handled with a single
handler.

Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
bardliao and others added 4 commits December 6, 2019 15:13
…read

In MSI mode, the use of separate handlers and threads for the Intel
IPC, stream and SoundWire shared interrupt leads to timeouts and lost
interrupts.

The solution is to merge all interrupt handling across all links with
a single thread function. The use of a linked list enables this thread
function to walk through all contexts and figure out which link needs
attention.

Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
In ClockStop mode, the PCI device will be notified of a wake, which
will be handled from an interrupt thread.

Signed-off-by: Rander Wang <rander.wang@intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Some of the Intel SoundWire SHIM registers contain fields for
different links. Without protection, the master drivers for the
different links will access these shared registers, leading to invalid
configurations and timeouts (specifically when changing CPA/SPA
power-related registers and polling for the changes to be applied).

A mutex is added to make sure all rmw access to those registers are
serialized.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Due to power rail dependencies, the SoundWire Master driver cannot
make decisions on its own when entering pm runtime suspend.

Add quirk mask for each link, so that the SOF parent driver can inform
the SoundWire master driver of the desired behavior:
a) leave clock on
b) power-off instead of clock stop
c) power-off if all devices cannot generate wakes
d) force bus reset on clock restart

Note that for now the interface with the SOF driver relies on a single
mask for all links. If needed, the interface might be modified at a
later point to provide more freedom. The code at the lower level does
not assume any commonality between links.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
plbossart and others added 10 commits December 6, 2019 15:34
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Another Dell laptop uses SOF with CML platform

Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
We will free params when we goto out in sdw_compute_port_params(). But
we can't free params if it is not allocated successfully.

Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Fix cppcheck warning:

drivers/soundwire/cadence_master.c:992:9: style: Variable 'offset' is
assigned a value that is never used. [unreadVariable]
 offset += stream->num_out;
        ^

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…nto integration/soundwire-latest

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…ntegration/soundwire-latest

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…egration/soundwire-latest

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…o integration/soundwire-latest

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…rs' into integration/soundwire-latest

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
…into integration/soundwire-latest

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
There are two types of io error when processing alert event.
The first one is: master receives an alert event for jack event
and invokes implement_def function in slave to check jack status.
At this time codec is just suspended, so io registers can't be
accessed. Another one is: when waking up from clock stop state
and bus needs a complete re-enumeration for synchronization issue
, slave register can't be accessed.

This patch wakes up codec and wait for initialization complete
when processing slave alert event, so that io registers can be
accessed.

Signed-off-by: Rander Wang <rander.wang@intel.com>
@RanderWang
Copy link
Author

update my comments based on the feedback from @ranj063

@plbossart
Copy link
Member

this patch is already merged on integration/soundwire-debug-fixes
However we need to spend more time on this and retest

@plbossart plbossart closed this Dec 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants