Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible error in timer.c once in 49 days / on INT_MAX overflow #1016

Closed
MarekNovakACRIOS opened this issue Dec 22, 2020 · 11 comments
Closed

Comments

@MarekNovakACRIOS
Copy link
Contributor

Currently, in TimerGetElapsedTime(), there is this code:

TimerTime_t TimerGetElapsedTime( TimerTime_t past )
{
    if ( past == 0 )
    {
        return 0;
    }
   etc...

Which basically means - whenever the timestamp has a value of 0, return no time elapsed....
So when a timer is started and the 0xFFFFFFFF/INT_MAX boundary is hit, then the timer can be by accident / not willingly stopped!

This may result, if the system uses 1ms tick and 32bit variable for storing tick (quite common I would say), in a situation where some timers can be stopped after 49 days of running.... This may look like a little problem, but IOT devices easily are designed to run for years and if the user application uses the timer.h/timer.c API for timing of his own events in the system, it can make him very unhappy........

My workaround is to do something like this:

/*!
 * \brief converts time in ticks to time in ms
 *
 * \param[IN] time in timer ticks
 * \retval returns time in milliseconds
 */
uint32_t RtcTick2Ms( uint32_t tick )
{

	if(tick == 0)
	{
		/* Never ever return value 0! or the timer could be accidentally stopped */
		return ( uint32_t ) 1;
	}
	else
	{
		return ( uint32_t ) tick;
	}
}

Which is uggly and can cause a 1ms jitter during timing once in 49 days, which is more acceptable than random stopping of started timers. If there is no other way how to mark a timer as "stopped" I would suggest this my workaround to be somehow made part of the "generic" layer to avoid a timestamp of 0 to be accepted from the porting layers. I however DONT like my workaround, but it is MUCH better than random stopping of timers....

Opinions are welcome.

@MarekNovakACRIOS MarekNovakACRIOS changed the title Possible error in timer.c once in 49 days Possible error in timer.c once in 49 days / on INT_MAX overflow Dec 22, 2020
@alexrayne
Copy link

obvious way, imho, provide TimerTime_t that not overlap (int64).
It is essential demand to time type - definitely point to right point in past or future. if timer overlaps - then, it is not a time but some kind of surrogate. this surrogate should be named some other. lora stack timer point to operate with timeouts, it is not same as time - timeout is a distance. Also may have sence a time difference - it is like distance, but can be signed.

@MarekNovakACRIOS
Copy link
Contributor Author

I agree 100%, but currently TimerTime_t uses 32bit integer.... unfortunatelly.

@djaeckle
Copy link

Hi all,

thanks for the reports. We tried to reproduce the issue on our side by returning a 0 in the function RtcTick2Ms multiple times during runtime. But the node is not stopping its operation.

Could you provide some more detailed information about what is happening exactly? Or provide an easy way to reproduce the issue? At the moment, i do not understand the relation between the two functions TimerGetElapsedTime and RtcTick2Ms .

@altishchenko
Copy link

Hello everyone!
I am using this library for more than three years now and looking through the discussion got a bit curious.
@MarekNovakACRIOS Marek, you are saying this:

So when a timer is started and the 0xFFFFFFFF/INT_MAX boundary is hit, then the timer can be by accident / not willingly stopped!

But, really, I don't see how! This function, TimerGetElapsedTime() is used in:

  • LoRaMac.c: to stop RxWindowTimer2 for ClassB devices (matter of seconds, it will stop running on its own if left there alone anyway).
  • RegionCommon.c: to update band/channel usage statistics
  • radio.c: To perform carrier sense timeout, which means that returning 0 once will not hurt anyone (well, there is a potential of a dead loop here, but very improbable).

This is it. This function is not used for anything else! What timer will stop? What process will hang?

@MarekNovakACRIOS
Copy link
Contributor Author

Hi all,

I will try to get back in the context, where this was an issue for me... but to see when this becomes an issue, please check function LoRaMacHandleResponseTimeout() in LoRaMac.c. The idea is, that the timer value of 0 is used to mark a stopped timer - if you however, by chance, get your starting time to be 0 (which happens after wrapping around 0xFFFFFFFF once in 4.3 bilion ticks)((but it can possibly happen!)), you get your timestamp say, that it is actually a stopped one, even if this is not an intention. That is why I simply never return 0 as a tick value - this way I am ok with using the tick value of 0 to mark a stopped timer AND it cannot happen for me, that a timer started at 0xFFFFFFFF+1 simply is considered stopped. Which may cause bad things happen.

You also refer to some points, where timer.h is used - in my app I use timer.h to time/plan also some events other than LoRaWAN related things - this is how I found out by checking the code..... I don't think that this is an illegal usage of timer.h - I am trying to have just one timer queue of events / timestamp library in my app to be let's say optimal from code&data usage point of view.
Let me know if it makes more sense now.... or if I am being wrong somewhere in some way. IMHO - less probable a bug of such matter can happen, more of issue it is, as it is very hard to reproduce..... as you say.

Anyway, thanks for checking this issue after 5 months!

@altishchenko
Copy link

altishchenko commented May 28, 2021

Hi Marek,
As for the TimerGetElapsedTime() and its usage in LoRaMacHandleResponseTimeout(), this is not an issue there as I can see, because the only consequence of its incorrect behaviour will be an attempt of one more retransmission of the confirmed server downlink acknowledgement for class B and C devices, if it did happen at all.
I don't think, that value 0 of the past is meant to be a stopped timer or, properly speaking, something that has no past. I may be mistaken here, but I don't get that impression from the code and value 0 is not specifically documented on the timing variables.
Yes, I agree, that consequences of overlap may have strange and unpredictable effects on the code, but fixing RtcTick2Ms() to always return non-zero is not a solution in this case (or TimerGetCurrentTime() really). I tested this overlap in our codebase a couple of years ago and didn't get anything abnormal by that time. (I was also keen on changing uint32_t into uint64_t and found it to be a waste of memory, cpu resources and time spent on changing all possible places where TimerTime_t can be accidentally used in the uint32_t variable).

I don't think that this is an illegal usage of timer.h - I am trying to have just one timer queue of events / timestamp library in my app to be let's say optimal from code&data usage point of view.

It is, really, a proper usage - doing this way ensures that your code stays in sync with the stack and the library time-wise. Only you have to consider the effects of these functions behaviour (implied, intentional or just traditional) on your own use and mitigate that some way or the other. In your position, if you hit this problem really, I'd use a wrapper function that will explicitly make sure that overlap really happened. Long time ago we had to put such or similar controls in the RtcXXX function block, because for whatever unknown reason our RTC on STM32L0 sometimes (!) returns a value that is a few ticks less then the previous reading - this was something to track down, I am telling you.

Anyway, thanks for checking this issue after 5 months!

Well, that is my usual cycle of considering the stack upgrade in our codebase :) - once in half a year. We deploy our products in hundreds mostly and we have to be dead sure of the code stability - our customers are usually extremely far from the civilization and usual transport.

P.S.: For the historical background of past == 0 and its reasoning, have a look at the PRs #600, #602 and #610

@MedadRufus
Copy link

Interesting discussion. I think some of these functions could benefit from unit testing. The timer overflow and other questionable behaviour can easily be tested with certainty. The issue is solely a logic issue, independent of hardware so it should be easy to test off target.

Any plans to integrate unit testing into the Loramac node codebase?

@mluis1
Copy link
Contributor

mluis1 commented May 28, 2021

We do have unit tests in our internal repository git server.
At the moment it is not in our plans to make the unit tests public.

@mluis1
Copy link
Contributor

mluis1 commented May 28, 2021

If I understand well the reported issue the problem lies in detecting if a virtual timer is running or not.

The virtual timer provides an API to query if a given timer is started or not.

/*!
* \brief Checks if the provided timer is running
*
* \param [IN] obj Structure containing the timer object parameters
*
* \retval status returns the timer activity status [true: Started,
* false: Stopped]
*/
bool TimerIsStarted( TimerEvent_t *obj );

@mluis1
Copy link
Contributor

mluis1 commented Sep 30, 2021

As there is no more activity on this issue.
Can we close it?

@mluis1
Copy link
Contributor

mluis1 commented Oct 14, 2021

No feedback we close the issue.

@mluis1 mluis1 closed this as completed Oct 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants