Changes

Summary

trx_toolkit/clck_gen.py: Fix clock generator not to accumulate timing
Commit abc63d8d825eb56fdcd7e01bf8824915c8780e18 by Kirill Smelkov
trx_toolkit/clck_gen.py: Fix clock generator not to accumulate timing error

CLCKGen currently works as follows:

	sleep(ctr_interval)
	some work
	sleep(ctr_interval)
	some work
	sleep(ctr_interval)
	some work
	...

The intent here is to do some work at timestamps that are multiples of
ctr_interval. However, the implementation does not match the intent, because:

1) the OS does not guarantee that sleep(ctr_interval) sleeps for exactly
   ctr_interval, so there will always be some jitter in the time actually
   slept, with no guarantee that the error fluctuates around zero
   instead of accumulating.

2) "some work" takes some time to run, and that time is added again and
   again to the time at which the next sleep(ctr_interval) starts. As a
   result, even if the sleep implementation were ideal, the n'th sleep
   would start not at

	t₀ + n·ctr_interval

   but instead at

	t₀ + n·ctr_interval + Σ1..n t(work_i)

   where the trailing Σ term keeps growing as accumulated timing error.
   This can be seen, for example, as the increasing trend of received GSM
   clock jitter in https://osmocom.org/issues/4658#note-10 .
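
The Σ term in point 2 can be illustrated with a small simulation. This is
only a sketch and not code from trx_toolkit: the function name naive_drift
and its parameters are hypothetical, time is simulated rather than measured,
sleep() is assumed to be ideal, and every "some work" step is assumed to
take the same constant time.

```python
def naive_drift(n, ctr_interval, work_time):
    """Drift of the naive loop after n ticks, in seconds.

    Simulated time only: sleep() is assumed ideal and every
    "some work" step is assumed to take exactly work_time.
    """
    t = 0.0                      # elapsed time since t0
    for _ in range(n):
        t += ctr_interval        # sleep(ctr_interval), ideal
        t += work_time           # "some work"
    return t - n * ctr_interval  # the accumulated sum term
```

For example, naive_drift(200, 0.004615, 0.001) is about 0.2 seconds: 200
ticks with 1 ms of work each put the loop roughly 200 ms behind schedule,
whatever the value of ctr_interval.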

The thinko in the clock generator logic is hardly visible if "some
work" takes only a little time or is done infrequently. That was
actually the case before fake_trx added tx queueing in 6e1c82d2
(trx_toolkit/transceiver.py: implement the transmit burst queue), because
before that commit "some work" was only "send IND CLOCK data every ~100th
tick". After 6e1c82d2, however, the work was changed to do a linear scan
of the tx queue at every tick, which amplified the error accumulation
and highlighted the problem.

As a result, tx queueing in fake_trx was disabled in d4ed09df (Revert
"trx_toolkit/transceiver.py: implement the transmit burst queue"), with
the rationale most likely being, as https://osmocom.org/issues/4658#note-10 says,

    Unfortunately, Python is not fast enough to handle the queues in time.
    Despite the relatively low CPU usage, fake_trx.py fails to scheduler
    everything during one TDMA frame period. This causes some of our TTCN-3
    test cases to fail.

    ...

    Most likely, the problem is that Python's threading.Event is not
    accurate enough. Running with SCHED_RR does not change anything.

However, the above analysis shows that it is the logic in CLCKGen that
needs fixing, not threading.Event . For reference, threading.Event did
indeed use a dumb timeout implementation on Python2:

    https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/threading.py#L597-L615
    https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/threading.py#L343-L369

but on Python3 it essentially uses plain Lock.acquire(timeout) which,
under the hood, uses PyThread_acquire_lock_timed - a plain wrapper over
sem_timedwait:

    https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Lib/threading.py#L330-L331
    https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Modules/_threadmodule.c#L75-L100
    https://github.com/python/cpython/blob/v3.11.9-9-g1b0e63c81b5/Python/thread_pthread.h#L480-L491

so at least with py3 there should be no question about threading.Event .

-> Fix timing error accumulation by reworking the clock generator loop
   to compensate for the observed jitter, caused by OS noise and by the
   work taking time, by adjusting the to-sleep δt at each tick accordingly.

   This is generally good for correctness and will allow us to reinstate
   tx queueing in fake_trx.
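
One common way to get this kind of compensation is to sleep until an
absolute deadline t₀ + k·ctr_interval instead of sleeping for a fixed
interval. The sketch below illustrates that idea only; it is not the
actual patch to clck_gen.py. The name run_ticks and the injectable now
and sleep parameters are hypothetical, and plain time.sleep is assumed as
the default instead of the threading.Event that CLCKGen uses.

```python
import time

def run_ticks(n, interval, work, now=time.monotonic, sleep=time.sleep):
    """Call work() n times, aiming at t0 + k*interval for k = 1..n.

    Returns the time of each tick relative to t0.
    """
    t0 = now()
    tick_times = []
    for k in range(1, n + 1):
        deadline = t0 + k * interval  # absolute target, not "now + interval"
        dt = deadline - now()         # to-sleep time, shrunk by work and jitter
        if dt > 0:
            sleep(dt)                 # if already late, skip sleeping
        tick_times.append(now() - t0)
        work()
    return tick_times
```

Because every deadline is computed from t₀, an oversleep or a slow work()
call shortens the following sleep instead of shifting all later ticks.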

Without the fix, the added test fails as follows:

    FAIL: test_no_timing_error_accumulated (test_clck_gen.CLCKGen_Test.test_no_timing_error_accumulated)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/kirr/src/osmocom/bb/src/target/trx_toolkit/test_clck_gen.py", line 60, in test_no_timing_error_accumulated
        self.assertTrue((ntick+1)*clck.ctr_interval > δT, "tick #%d: time overrun  by %dµs total" %
    AssertionError: False is not true : tick #200: time overrun  by 572478µs total

Change-Id: I928801422c9af80c368261f617b91d7ecfedbabf
Related: OS#4658, OS#6672
src/target/trx_toolkit/clck_gen.py
src/target/trx_toolkit/test_clck_gen.py