Commit
abc63d8d825eb56fdcd7e01bf8824915c8780e18
by Kirill Smelkov
trx_toolkit/clck_gen.py: Fix clock generator not to accumulate timing error
CLCKGen currently works as follows:
    sleep(ctr_interval)
    some work
    sleep(ctr_interval)
    some work
    sleep(ctr_interval)
    some work
    ...
The intent here is to do some work at timestamps that are multiples of
ctr_interval; however, the implementation does not match the intent, because
1) sleep(ctr_interval) is not guaranteed by the OS to be exact, so there
will always be some jitter in the time actually slept, without any
guarantee that the error will fluctuate around zero without accumulating.
2) "some work" takes some time to run, and that time is added again and
again to the current time when the next sleep(ctr_interval) starts. As a
result, even if the sleep implementation were ideal, the n'th sleep
would start not at
t₀ + n·ctr_interval
but instead at
t₀ + n·ctr_interval + Σ1..n t(work_i)
where the trailing Σ term keeps growing as the timing error, which can
be seen e.g. as the increasing trend of received GSM clock jitter in
https://osmocom.org/issues/4658#note-10 .
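
As an illustration of point 2, here is a minimal sketch (not the CLCKGen
code) that simulates the naive loop with an ideal sleep. The function name
naive_schedule, the constant work duration, and the numbers used are all
assumptions made for the example; times are integer microseconds so the
arithmetic is exact.

```python
def naive_schedule(n_ticks, interval_us, work_us):
    """Simulate the sleep/work/sleep/work loop with an ideal sleep.

    Returns the time (µs since t0) at which each tick's work starts.
    A constant work duration is an assumption made for illustration.
    """
    now = 0
    starts = []
    for _ in range(n_ticks):
        now += interval_us   # ideal sleep(ctr_interval): no OS jitter at all
        starts.append(now)   # this tick's work starts here
        now += work_us       # the work delays the start of the next sleep
    return starts


# Illustrative numbers only: a 4615 µs interval and 100 µs of work per tick.
INTERVAL_US = 4615
starts = naive_schedule(n_ticks=200, interval_us=INTERVAL_US, work_us=100)

# Error of tick k (1-based) against its intended timestamp k * interval.
drift = [t - (k + 1) * INTERVAL_US for k, t in enumerate(starts)]
print(drift[0], drift[99], drift[199])
```

Even with a perfect sleep, the error in this model grows linearly with the
number of ticks already processed, which is the accumulating Σ term above.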
The thinko in the clock generator logic is not very visible if "some
work" takes only a little time or is done infrequently. That was
actually the case before fake_trx added tx queueing in 6e1c82d2
(trx_toolkit/transceiver.py: implement the transmit burst queue) because
before that commit the work was only "send IND CLOCK data every ~ 100th
tick". However, after 6e1c82d2 the work was changed to do a linear scan of
the tx queue at every tick, which amplified error accumulation
and highlighted the problem.
As a result, tx queueing in fake_trx was disabled in d4ed09df (Revert
"trx_toolkit/transceiver.py: implement the transmit burst queue") with
the rationale most likely being, as https://osmocom.org/issues/4658#note-10 says,
    Unfortunately, Python is not fast enough to handle the queues in time.
    Despite the relatively low CPU usage, fake_trx.py fails to scheduler
    everything during one TDMA frame period. This causes some of our TTCN-3
    test cases to fail.
    ...
    Most likely, the problem is that Python's threading.Event is not
    accurate enough. Running with SCHED_RR does not change anything.
However, with the above analysis we can see that it is the logic in
CLCKGen that needs fixing, not threading.Event. For reference,
threading.Event indeed used a dumb timeout implementation on Python2:
https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/threading.py#L597-L615
https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/threading.py#L343-L369
but on Python3 it essentially uses plain Lock.acquire(timeout) which,
under the hood, uses PyThread_acquire_lock_timed - a plain wrapper over
sem_timedwait:
https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Lib/threading.py#L330-L331
https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Modules/_threadmodule.c#L75-L100
https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Python/thread_pthread.h#L480-L491
so at least with py3 there should be no question about threading.Event.
-> Fix timing error accumulation by reworking the clock generator loop
to compensate for the observed jitter, caused by OS noise and the work
taking time, by adjusting the to-sleep δt each tick accordingly.
This is generally good for correctness and will allow us to reinstate
tx queueing in fake_trx.
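
One common way to get this kind of compensation is to derive each δt from an
absolute deadline t₀ + k·ctr_interval instead of sleeping a fixed amount.
The sketch below illustrates that idea and is not necessarily how the patch
implements it; run_ticks, FakeTime, and the injected clock/sleep parameters
are hypothetical names introduced for the example. The fake clock uses
integer microseconds and oversleeps by a fixed 50 µs, so the outcome is
deterministic.

```python
import time


def run_ticks(n_ticks, interval, work, clock=time.monotonic, sleep=time.sleep):
    """Call work() n_ticks times, aiming tick k at t0 + k*interval.

    The sleep duration is recomputed from the absolute deadline each tick,
    so time spent in work() and past oversleeping is subtracted from the
    next sleep instead of piling up.
    """
    t0 = clock()
    for k in range(1, n_ticks + 1):
        dt = (t0 + k * interval) - clock()  # time left until this deadline
        if dt > 0:
            sleep(dt)
        work()


class FakeTime:
    """Deterministic stand-in for the OS clock (integer microseconds)."""

    def __init__(self, oversleep_us):
        self.now = 0
        self.oversleep_us = oversleep_us

    def clock(self):
        return self.now

    def sleep(self, dt):
        # Every sleep returns late by a fixed amount, modelling OS noise.
        self.now += dt + self.oversleep_us


INTERVAL_US = 4615
ft = FakeTime(oversleep_us=50)
starts = []


def work():
    starts.append(ft.now)  # record when this tick's work starts
    ft.now += 100          # the work itself takes 100 µs


run_ticks(200, INTERVAL_US, work, clock=ft.clock, sleep=ft.sleep)
errors = [t - (k + 1) * INTERVAL_US for k, t in enumerate(starts)]
print(min(errors), max(errors))
```

In this model every tick is late only by the single oversleep of its own
sleep call; neither the work time nor earlier oversleeps carry over to later
ticks.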
Without the fix, the added test fails as follows:
FAIL: test_no_timing_error_accumulated (test_clck_gen.CLCKGen_Test.test_no_timing_error_accumulated)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/kirr/src/osmocom/bb/src/target/trx_toolkit/test_clck_gen.py", line 60, in test_no_timing_error_accumulated
self.assertTrue((ntick+1)*clck.ctr_interval > δT, "tick #%d: time overrun by %dµs total" %
AssertionError: False is not true : tick #200: time overrun by 572478µs total
Change-Id: I928801422c9af80c368261f617b91d7ecfedbabf
Related: OS#4658, OS#6672