Experiments with precise timing in iOS

iOS is by no means a realtime operating system, but I’m aware that NSTimer and NSObject’s performSelector:withObject:afterDelay: mechanism aren’t particularly accurate, and I was curious to see whether I could do better.

Hands up, backing away

Disclaimer: I am not at all an expert in realtime programming, or Mach, or iOS-device optimisation, so this is pretty much a fumble in the dark. I won’t be at all offended if anyone wishes to shoot me down and offer a more sensible solution — in fact, please do! Until then, watch as I stumble on…

Also note that there are often ways to eliminate the need for precise timing of this nature, by architecting code appropriately — when it comes to audio, for example, CoreAudio provides a very accurate time base in render callbacks. For things like metronomes or audio synthesizers, it’s always better to establish a starting time, and use the difference between the current time and the starting time in order to determine state, rather than using a timer to advance the state. Still, sometimes, you just need a timer…
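To make the "starting time plus elapsed time" idea concrete, here is a minimal sketch in C. The function names and the beats-per-minute parameter are illustrative assumptions, not anything taken from Loopy; the point is only that the state is derived from the clock rather than advanced by a timer.

```c
#include <stdint.h>

/* Hypothetical helper: which beat are we on, given only the start time,
   the current time (both in seconds) and the tempo?
   Assumes beats_per_minute is positive. */
int64_t beat_index(double start_time, double now, double beats_per_minute) {
    if (now <= start_time || beats_per_minute <= 0.0) return 0;
    double seconds_per_beat = 60.0 / beats_per_minute;
    return (int64_t)((now - start_time) / seconds_per_beat);
}

/* Seconds remaining until the next beat boundary. */
double time_until_next_beat(double start_time, double now, double beats_per_minute) {
    double seconds_per_beat = 60.0 / beats_per_minute;
    int64_t next = beat_index(start_time, now, beats_per_minute) + 1;
    return start_time + (double)next * seconds_per_beat - now;
}
```

Because every answer is computed from the start time, a late wake-up shifts one reading but never accumulates into drift.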

What the blazes?

So, I’m working on an update to Loopy, which uses a shared clock object to synchronise tracks and a variety of events (like user interface updates or timed track manipulations). A tester noted that the mute/unmute quantisation feature that I’ve recently implemented, which will mute or unmute a loop at its starting point (rather than whenever you tap it), tends to overshoot a little, resulting in a small part of the beginning of the loop being audible.

Of course, there are other solutions to this particular problem (like stopping or starting playback from the audio render callback, and using Core Audio’s timestamps for exact timing), but I use timers in other places outside Core Audio’s domain, which makes Core Audio’s timing mechanism unavailable, and I wanted to see how accurate I could get the timing.

Our friend, mach_wait_until

I read in several places mention of the Mach API utility mach_wait_until (from mach/mach_time.h), which is very low-level and supposedly fairly accurate. So, based on that lead, I put together an Objective-C singleton class that launches a high-priority thread, and uses said thread to schedule events.

An NSArray of events is maintained, and a scheduleAction:target:inTimeInterval: routine creates and adds events to this array, then pokes the thread.

The thread grabs the next event in sequence, then uses mach_wait_until to sleep until the time of the next event arrives, then performs the specified action on the target. It’s kinda a DIY NSRunLoop.
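One detail the description skips over is that mach_wait_until takes a deadline in Mach absolute-time ticks, not seconds. The sketch below shows that conversion on its own, assuming the numer/denom pair reported by mach_timebase_info (where ticks * numer / denom gives nanoseconds). The function names are hypothetical, and the timebase is passed in as plain parameters so the arithmetic can be read without the Mach headers.

```c
#include <stdint.h>

/* Convert a delay in seconds to Mach absolute-time ticks.
   numer/denom are the values mach_timebase_info() reports
   (ticks * numer / denom = nanoseconds); they are parameters here
   so the arithmetic stands on its own. */
uint64_t seconds_to_ticks(double seconds, uint32_t numer, uint32_t denom) {
    double nanos = seconds * 1e9;
    return (uint64_t)(nanos * (double)denom / (double)numer);
}

/* Absolute deadline for an event scheduled `delay` seconds after now_ticks
   (now_ticks would come from mach_absolute_time()). */
uint64_t deadline_ticks(uint64_t now_ticks, double delay,
                        uint32_t numer, uint32_t denom) {
    return now_ticks + seconds_to_ticks(delay, numer, denom);
}
```

On a device, the timebase would be fetched once with mach_timebase_info, and the resulting deadline is the value handed to mach_wait_until.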

Here’s a comparison between this technique, and just using performSelector:withObject:afterDelay: (which schedules a timer on the NSRunLoop), observed while performing various scheduled events within Loopy running on my iPhone 4 with the debugger, and derived by comparing the time of event execution with the event’s scheduled time:

Mechanism	Average discrepancy	Minimum discrepancy	Maximum discrepancy
NSRunLoop	16.9ms	0.25ms	153.7ms
TPPreciseTimer	5.5ms	0.033ms	72.0ms
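For anyone wanting to reproduce figures like those in the table, the bookkeeping could look something like the following sketch. It assumes the discrepancy is the absolute difference between scheduled and actual fire time; the struct and function names are made up for illustration.

```c
#include <stddef.h>

typedef struct {
    double average;
    double minimum;
    double maximum;
} discrepancy_stats;

/* scheduled[i] and actual[i] are the planned and observed fire times
   (in seconds) of event i. */
discrepancy_stats measure_discrepancy(const double *scheduled,
                                      const double *actual, size_t count) {
    discrepancy_stats s = {0.0, 0.0, 0.0};
    if (count == 0) return s;
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        double d = actual[i] - scheduled[i];
        if (d < 0.0) d = -d;                  /* early and late both count */
        if (i == 0 || d < s.minimum) s.minimum = d;
        if (i == 0 || d > s.maximum) s.maximum = d;
        sum += d;
    }
    s.average = sum / (double)count;
    return s;
}
```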

That was attempt number 1: This seems to give us about 11.4ms better accuracy on average (three times more accurate).

Not bad, but it turns out mach_wait_until isn’t really that accurate, particularly if there’s a bunch of other stuff going on in other threads.

Spinning, for fun and profit

For my second attempt, the thread performs a mach_wait_until until just before the event is due, then performs a spin lock until the time arrives, using mach_absolute_time to compare the current time with the target time.

This gave further improved results — here’s that table again, but with the new scheme added, with a few different spin lock times:

Mechanism	Average discrepancy	Minimum discrepancy	Maximum discrepancy
NSRunLoop	16.9ms	0.25ms	153.7ms
TPPreciseTimer (original)	5.5ms	0.033ms	72.0ms
TPPreciseTimer (10ms spinlock)	6.0ms	0.002ms	76.5ms
TPPreciseTimer (100ms spinlock)	3.7ms	0.002ms	44.8ms
TPPreciseTimer (200ms spinlock)	2.91ms	0.002ms	74.1ms

It appears that the more stuff there is going on in other threads, the more likely the mach_wait_until call is to overshoot. So, the more time allotted to the spin lock, the more leeway there is to absorb a mach_wait_until that wakes late. Of course, that’s at the cost of making the CPU twiddle its thumbs for the duration.
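The sleep-then-spin scheme from this section can be sketched in portable C as below. This is not the TPPreciseTimer code: clock_gettime(CLOCK_MONOTONIC) stands in for mach_absolute_time and nanosleep stands in for mach_wait_until, and the function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds; a stand-in for mach_absolute_time(). */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Sleep until spin_ns before target_ns, then busy-wait the remainder.
   Returns the clock reading at which the wait ended. */
uint64_t wait_until_precise(uint64_t target_ns, uint64_t spin_ns) {
    uint64_t now = now_ns();
    if (target_ns > now + spin_ns) {
        uint64_t coarse = target_ns - spin_ns - now;
        struct timespec ts;
        ts.tv_sec = (time_t)(coarse / 1000000000ull);
        ts.tv_nsec = (long)(coarse % 1000000000ull);
        nanosleep(&ts, NULL);   /* coarse wait; may wake late (or early) */
    }
    while ((now = now_ns()) < target_ns) {
        /* spin: burns CPU, but does not depend on being woken up */
    }
    return now;
}
```

A caller would compute the deadline once, for example `wait_until_precise(now_ns() + 20000000ull, 5000000ull)` for a 20ms delay with a 5ms spin window, and fire the event when the call returns. If the coarse sleep overshoots the whole spin window, the loop exits immediately and the event is simply late, which matches the behaviour described above.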

Better than a punch in the knee

The results weren’t quite as fantastic as I’d hoped — still within the same order of magnitude, that’s for sure — but the average case for the 200ms spinlock approach is 14ms, or 5.8 times, more accurate than the traditional approach, and the minimum case is dramatically better.

You know, I think if I’d been aware of the results in advance, I might not have bothered, but I’ll stick with my hard-won 14ms now that I’m here (that’s 617 audio samples at 44.1kHz, I’ll have you know).

If anyone’s curious about the implementation (or wants to take a stab at doing better), here it is, along with a wildly simplistic commandline test app: TPPreciseTimer.zip

Now to get back to some real work.

Addendum: GCD follow-up

Chris in the comments below suggested trying a GCD-based approach, using dispatch_after. Curious, I rigged it up, and these are the stats, collected the same way as above, added to the prior table:

Mechanism	Average discrepancy	Minimum discrepancy	Maximum discrepancy
NSRunLoop	16.9ms	0.25ms	153.7ms
TPPreciseTimer (original)	5.5ms	0.033ms	72.0ms
TPPreciseTimer (10ms spinlock)	6.0ms	0.002ms	76.5ms
TPPreciseTimer (100ms spinlock)	3.7ms	0.002ms	44.8ms
TPPreciseTimer (200ms spinlock)	2.91ms	0.002ms	74.1ms
dispatch_after (main queue)	14.8ms	0.16ms	161.2ms
dispatch_after (dedicated queue)	19.2ms	0.1ms	174.9ms
dispatch_after (dedicated queue + 100ms spinlock)	22.4ms	0.002ms	306.8ms

So, they appear pretty much the same as the NSRunLoop stats.

One Comment

Chris Stawarz
September 7, 2011 at 6:32 pm

Thanks for the post. This is interesting stuff!

Personally, I’d like to see how a version based on Grand Central Dispatch (e.g. using dispatch_after to schedule the tasks) performs versus the mach_wait_until approach. Presumably, GCD threads spend less time sleeping than application-created threads (as long as they’re fed a steady diet of tasks), so perhaps you’d see fewer tasks that are late because the OS didn’t wake the worker thread in a timely manner.
1. Michael Tyson
  September 7, 2011 at 6:34 pm
  
  Cheers, Chris!
  
  That’s a very interesting idea – I hadn’t thought to try that. I’ll give it a whirl sometime! Thanks =)
  1. Michael Tyson
    September 7, 2011 at 6:59 pm
    
    Just gave it a try – it looks to be about the same as NSRunLoop, unless I’ve done something wonky!
John McLaughlin
September 7, 2011 at 10:51 pm

When you used the GCD approach did you run it on the main thread? I remember vaguely that the run loop for the main thread was integrated with GCD which would explain the same timing but I wonder if that applies to non-main threads.

Anyway nice write up.

-John
1. Michael Tyson
  September 7, 2011 at 11:03 pm
  
  Hey John – cheers! I tried both, using the main thread’s queue and with a dedicated queue.
Adam Jansch
September 7, 2011 at 10:55 pm

I’ve found the AV Foundation class AVMutableComposition (played through an instance of AVPlayer) to be very well timed and so much easier to use than NSTimers, when its AVAssets have their AVURLAssetPreferPreciseDurationAndTimingKey set to YES. The problem is that the whole system is fairly self-contained, so it may not integrate effectively for your needs.
Rich E
October 3, 2011 at 7:22 pm

Hi Michael,

I enjoyed your post and thanks for sharing the code.

I was wondering if you’ve heard anything about the ability to use the Mach API in iOS apps. A recent response on the Core Audio mailing list suggested that Apple considers it part of their ‘System Programming Interface’ and because of that apps can’t use it. I don’t know how much validity there is to this, but I suppose it is something to ask about before getting an app banned.

Cheers,
Rich
1. Michael Tyson
  October 7, 2011 at 6:57 pm
  
  Hey Rich,
  
  I can confirm that Apple had no problem with my use of mach in Loopy – in fact, I’m surprised that there are people that think it would be a problem. It’s part of their public API, so I can’t imagine there’d be any problems!
  
  Cheers =)
  Michael
  
  Ps. (Oops! I’m sorry I missed your comment here, I dunno what happened!)
  1. Rich E
    October 7, 2011 at 6:58 pm
    
    Nice, thanks for the reassurance. :) Yea a few people had mentioned mach is part of the “SPI”, so ‘forbidden to touch’ because it is the most likely to change, but I agree that is bs, they are perfectly fine with changing the low level code and leaving it up to app developers to update.
    
    Those results about Mach are really enticing, I’ll be giving it a try. Music apps really need latencies < ~6ms to be convincing and NSTimer just doesn’t cut it. Thanks again for taking the time to post that..
    
    Cheers,
    Rich
    1. Michael Tyson
      October 7, 2011 at 6:59 pm
      
      Yeah, I think if Apple didn’t want us using it, they wouldn’t include the headers!
      
      By the way, as a little followup that I haven’t yet mentioned in the blog entry, I’ve since changed my strategy from that one I mentioned in the article: I wanted buffer-level accuracy (0.005s, in my case), and I got it by performing the timer stuff from within Core Audio’s render/input thread, and using the provided AudioTimeStamps. It’s more efficient, too, and it’s spot-on. You still need mach to calculate the timestamps when scheduling events, but the actual firing happens from within the core audio thread, and it’s awesome.
      
      Cheers,
      Michael
      1. Rich E
        October 7, 2011 at 6:59 pm
        
        We’ve actually been taking a similar approach in a project I develop for, libpd, but we are running into some tough patches and are thinking of doing things differently. The problem is that if you call out to other threads from the real-time audio thread, you have little option but to allocate memory, mind autorelease pools, spin locks, etc, which could all cause you to fall behind and drop audio packets.
        
        Did you get around having to allocate memory from the audio thread in your new method?
        
        The next approach is to write to a circular buffer from the audio thread, then use a second high priority thread to read from it, dispatching UI updates back on the main thread.
      2. Michael Tyson
        October 7, 2011 at 7:13 pm
        
        Ah, yes, I ran across that problem too – there are two solutions I’m using:
        
        The things that I can perform without holding up the run loop (or otherwise causing problems), I do straight away. Things like when a track is scheduled to start recording, I prepare the track for recording (setting a few flags, resetting counters and buffers, that kind of thing).
        
        The things that would hold up the run loop, I perform on the main loop using dispatch_async. So, with the record trigger example, I perform the record start notifications from the main loop this way.
        
        That way, I get immediate response where I need it (down to the buffer level), and I can avoid holding things up using dispatch.
        
        The one minor caveat is that, very, very occasionally, dispatch takes a tiny bit longer than I want, but it’s not a big problem, and I’m satisfied.
        
        As for memory allocation – yeah, that’s been a problem for me in the past. Since then, I’m using – as you say – an offline thread and a ring buffer. I noticed that even using an NSCondition to synchronise was causing problems, by the way – now, I’m using a spin lock (with a tiny sleep to avoid thrashing) in the offline thread.
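For readers following this exchange, the ring buffer hand-off between the audio thread and an offline thread might be sketched as below. This is a generic single-producer, single-consumer queue written with C11 atomics, not TPCircularBuffer and not Loopy's code; the type name, the capacity and the use of a bare integer as the "event" are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EVENT_CAPACITY 64u   /* must be a power of two */

typedef struct {
    uint32_t slots[EVENT_CAPACITY];
    atomic_uint head;   /* next slot to write; stored only by the producer */
    atomic_uint tail;   /* next slot to read; stored only by the consumer */
} event_ring;

/* Producer side (e.g. the audio thread): no locks, no allocation. */
bool event_ring_push(event_ring *r, uint32_t event) {
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == EVENT_CAPACITY) return false;   /* full: drop it */
    r->slots[head % EVENT_CAPACITY] = event;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (the offline thread). */
bool event_ring_pop(event_ring *r, uint32_t *event) {
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;                    /* empty */
    *event = r->slots[tail % EVENT_CAPACITY];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The offline thread would loop on event_ring_pop, sleeping briefly whenever it returns false, which is roughly the "spin lock with a tiny sleep" arrangement described above. The queue is only safe with exactly one pushing thread and one popping thread.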
      3. Michael Tyson
        October 7, 2011 at 7:20 pm
        
        Oh, this is marginally off-topic, but I haven’t had a chance to boast about my solution: In Loopy, I couldn’t think of any other way in my render thread but to hold a lock that protects the track audio and data structures. In the latest update, I’ve built in a mechanism to protect against lock contention and missing the deadline: In every pass of the render loop, I store the next buffer or two of samples in a dedicated ring buffer, then, if the lock is contended, I pull the samples out of the ring buffer instead, so that it doesn’t miss a beat. I was well chuffed when I came up with that ;-)
      4. Rich E
        October 7, 2011 at 8:10 pm
        
        I wanted to mention that dispatch_async() also allocates memory, in order to make sure its block stays on the heap. Some people have noted however, that the compiler is good at reusing this allocation across multiple invocations. That has been discovered only through profiling and isn’t documented.
        
        Whether this is a problem for apps or just something that core audio developers balk about, I can’t yet say.
      5. Michael Tyson
        October 7, 2011 at 8:13 pm
        
        Yeah, that doesn’t surprise me! How much impact it makes I think depends on how it’s used. I wish I could remember the results of my own timing experiments, but I can’t – I think it was somewhere around 0.001, worst case, or something like that. If you do it a lot, then there’s a problem, but one call here and there probably won’t cause problems – it’s what I’m doing in Loopy, and I haven’t seen any stuttering.
        
        I suppose the alternative is to build one’s own event loop using something like a ring buffer to store events, and an offline thread with a spin lock to synchronise, but I haven’t bothered going down that route yet, as I haven’t seen any performance issues.
Rich E
October 7, 2011 at 8:11 pm

Also, I think your approach for the lock problem that you mentioned is novel in your situation, as you have ample audio at your disposal to set aside ahead of time.
1. Michael Tyson
  October 7, 2011 at 8:13 pm
  
  Oh yes, quite possibly =)
Hari Karam Singh
January 25, 2012 at 7:30 pm

Thanks for this. It’s saved me a ton of time as it’s exactly what I was about to test…

You say: “…performing the timer stuff from within Core Audio’s render/input thread, and using the provided AudioTimeStamps…”

Would you mind elaborating a tiny bit on how one hooks into CoreAudio’s thread and uses its timing mechanism? I’d be even more grateful!
1. Michael Tyson
  January 26, 2012 at 11:17 am
  
  Hey Hari – sure: Basically, the idea is to use a render callback (added to your audio unit with AudioUnitAddRenderNotify, to have it happen in an ‘output’ context, or if you want your time relative to recording events (an ‘input’ context), use the kAudioOutputUnitProperty_SetInputCallback property) to do the timing check and to fire off events. The output vs input context decision is based on whether you want the events to fire at a time that corresponds to the moment in the near past when the audio was received by the mic (input context), or a time in the near future that corresponds to the moment the buffer is played out the speaker.
  
  Once in the callback, you check your event fire time against the host time given to you – inTimeStamp->mHostTime, which is the timestamp that corresponds to the moment in the near future that the current buffer hits the speaker (or, in a recording context, the moment that the first sound wave of the buffer hit the microphone, in the past, I think). One note: If you’re doing this from a ‘render notify’ callback, make sure you check *ioActionFlags & kAudioUnitRenderAction_PreRender is true – the callback is called once before the buffer is filled, and once after, so you probably only want it to happen at the start.
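The timing check described in this reply boils down to comparing an event's fire time against the span of host time covered by the current buffer. The sketch below shows just that comparison with plain integers standing in for the Core Audio types; the function name, the ticks_per_frame parameter and the per-frame offset are assumptions added for illustration (the reply itself only needs buffer-level accuracy).

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether an event falls inside the buffer being rendered.
   buffer_start:    host time of the buffer's first frame (the role that
                    inTimeStamp->mHostTime plays in a render callback)
   frames:          number of frames in this buffer
   ticks_per_frame: host-time ticks per audio frame (assumed precomputed)
   On success, *frame_offset is the frame at which the event lands. */
bool event_due_in_buffer(uint64_t event_time, uint64_t buffer_start,
                         uint32_t frames, double ticks_per_frame,
                         uint32_t *frame_offset) {
    uint64_t buffer_ticks = (uint64_t)((double)frames * ticks_per_frame);
    if (event_time >= buffer_start + buffer_ticks) return false; /* later buffer */
    if (event_time <= buffer_start) {
        *frame_offset = 0;            /* already overdue: fire straight away */
        return true;
    }
    uint32_t offset =
        (uint32_t)((double)(event_time - buffer_start) / ticks_per_frame);
    if (offset >= frames) offset = frames - 1;   /* guard against rounding */
    *frame_offset = offset;
    return true;
}
```

In a real render-notify callback this check would run only on the pre-render pass, as noted above, and would be applied to each pending event in turn.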
Hari Karam Singh
January 29, 2012 at 12:23 pm

Thanks Michael. I’ve realised shortly after reading this blog post that OpenAL is NOT the lowest level API into iOS audio, so I’m having to do (yet another) iphone API crash course and rewrite my audio engine! I follow what you’re saying after having been briefed by the apple docs. Thanks for the help. What latency is Loopy running at out of curiosity?

Another quick question if you don’t mind: You mentioned your ring buffer solution to locking issues. How are you checking that a lock is contended? How do you know the check is accurate by the time you do the next step? Have I understood correctly that this implies you are writing to 2 buffers with every disk read (and reading from the non-locked one)?

Beautiful graphics in Loopy btw…
1. Michael Tyson
  January 30, 2012 at 7:21 pm
  
  Ah, yep, that’s how it goes =) Nevermind, the more you know, the better an iOS developer you’ll be! Loopy’s pass-through latency is apparently something like 21ms at the moment.
  
  The ring buffer thing: With my implementation (TPCircularBuffer), there’s no need to lock, so no mutex overhead. I’m afraid I’m not quite sure that I follow the rest of your question though – I think you’re referring to my discussion with Rich above, about the issues with using dispatch_async and similar primitives from a realtime Core Audio thread (which incur delays). How does that relate to the writing of buffers?
2. Michael Tyson
  January 30, 2012 at 7:22 pm
  
  …Oh, and thanks! =)
Hari Karam Singh
January 31, 2012 at 2:20 pm

Thanks again: I’ve just discovered OSAtomic operations. I can feel my powers growing by the minute! ;) I just love this day and age where you can learn these things in a moment rather than spending days with a thick book and lots of trial and error!

My comments were partly because I hadn’t fully understood the ring buffer’s internals but also due to the phrasing you used…

” if the lock is contended, I pull the samples out of the ring buffer instead”

…which I took to mean that you used the ring buffer as a fallback after testing a lock contention with another data structure (something that I’ve since discovered are called “try locks”). I was wondering about the syntax to test for a lock contention and branch, rather than wait, if it’s contended. Ross Bencina recommends it in his article as a viable alternative to locks.

Thanks again for saving me endless time. I’ve got my audio up and running with Audio Units (and TPCircularBuffer) and even have pitch scaling working (though changing pitch dynamically still needs a few kinks ironed out).
1. Michael Tyson
  January 31, 2012 at 3:44 pm
  
  Oh, right! Yeah, that’s something I’m doing in Loopy. It’s as you say – you try the lock, instead of just locking and waiting. Calling convention varies between the different APIs; NSLock has tryLock, pthread_mutex has pthread_mutex_trylock.
  
  Note that it’s not an alternative to locks – as it’s an operation upon a lock, and you still undergo the overhead of testing the lock – but it is an alternative to locking (waiting on a lock).
  
  Glad to help!
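As a rough illustration of the try-lock pattern discussed in this exchange, here is a small C sketch using pthread_mutex_trylock. It is a simplification rather than Loopy's mechanism: the struct, the four-sample arrays and the idea of keeping a single fallback copy (instead of a ring buffer of upcoming samples) are assumptions chosen to keep the example short.

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical shared state guarded by a mutex, plus a fallback copy
   that the reader can use without taking the lock. */
typedef struct {
    pthread_mutex_t lock;
    float live[4];       /* protected by lock */
    float fallback[4];   /* touched only by the reading thread */
} guarded_samples;

/* Copy samples into out without ever blocking.
   Returns true if the live data was read, false if the fallback was used. */
bool read_samples_nonblocking(guarded_samples *g, float *out) {
    if (pthread_mutex_trylock(&g->lock) == 0) {
        memcpy(out, g->live, sizeof g->live);
        memcpy(g->fallback, g->live, sizeof g->live);  /* refresh fallback */
        pthread_mutex_unlock(&g->lock);
        return true;
    }
    memcpy(out, g->fallback, sizeof g->fallback);  /* contended: don't wait */
    return false;
}
```

pthread_mutex_trylock returns 0 when the lock was acquired and a non-zero error code (EBUSY) when another holder has it, so the branch costs one attempt and never waits.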
Russ Maschmeyer
April 2, 2012 at 5:34 pm

I’m a newb figuring out how to make a sequencer app, which obviously requires precise time firing of a set of audio files. My current strategy is to use Audio Queue to do the playback. Do Audio Queues have access to the same audio time stamp features you mention here from Audio Units? If so, is there any documentation on how to use them? If not, would you recommend I switch my playback strategy to Audio Units or can I mix and match?

Thanks! This article has given me a good place to start!
Matthew Henry
August 13, 2012 at 3:59 pm

While this isn’t directly related to iOS audio programming, I’ve just spent some time trying to get some super precise timers on OSX that interacted with a standard NSRunLoop and have had luck with a very simple solution with the following results when firing at 1000Hz:

Non-precise timers (standard NSTimers in the main thread’s run loop) – 4.8% CPU.
Tick delta sample (times in ms):

0.9630150162
0.9985059733
1.0376019927
0.8482109988
1.0273949883
2.9999309918
0.9935940034
1.1132439831
0.9216809995
1.0553119937
0.9980620234
0.9257520142

Precise timers (again, in the main run loop) – 7.0% CPU.
Tick delta sample (times in ms):

1.0001820046
1.0000350012
1.0001679766
1.0001530172
1.0002279887
1.0302450100
1.0000499897
1.0008479876
1.0101079824
0.9993049898
1.0001519986
1.0030260019

All I did to get the better timings was simply run the timer with an interval that was half of what I actually wanted and then implement a waiting solution that slept for progressively less and less time:

    while (glglGetTime() < requestedTime) {
        int64_t useconds = (requestedTime - glglGetTime()) * 0.25e6;
        if (useconds > 0) usleep(useconds);
    }

where glglGetTime is a wrapper on mach_absolute_time that returns the time in seconds (as a double) and requestedTime is the time that the timer’s callback needs to run at.

While these results are fairly promising for the 1000Hz situation, they’re even more amazing when running at 60Hz (no, this isn’t to drive a render loop – I know to use a CVDisplayLink for that), where NSTimer would give me deltas ranging from 17 to 14ms while the precise timing gave me deltas of 16.666ms every time (often with even more sixes). In the 60Hz case, NSTimer uses 0.9% CPU and the precise timing uses 1.5%.
1. Michael Tyson
  August 13, 2012 at 4:20 pm
  
  Just a comment – beware of using the main thread to do this kind of stuff; it may interfere with various interaction stuff, and vice versa. It’s probably better to do this on a thread (possibly a high-priority thread, if accuracy is important), where you’re less likely to be interrupted by main thread stuff.
  1. Matthew Henry
    August 13, 2012 at 5:16 pm
    
    Point taken. This was just a quick test to try and get a very precise timer with as simple a solution as possible. I only need accuracy in the millisecond range (this timer’s the backing for part of a C-level API that I’m intending on using for 2D games), so I was fairly blown away by how good the results were by simply fiddling with the wait period before firing the timer.
    
    Your timer’s still 2-3x more accurate than mine, but given that this is a 2-3x difference being measured in the tens of microseconds, I’m pretty satisfied by that outcome for such a simple solution.
    1. Michael Tyson
      August 14, 2012 at 10:12 am
      
      Cool, fair enough =)
Cody
August 26, 2012 at 6:57 am

Have you done any tests on dispatch_source_timer?
http://developer.apple.com/library/mac/#documentation/General/Conceptual/ConcurrencyProgrammingGuide/GCDWorkQueues/GCDWorkQueues.html
1. Michael Tyson
  August 26, 2012 at 12:41 pm
  
  I can’t say I have, nope – although at a guess, I would say it probably behaves in the same way as the NSRunLoop stuff.
Richard
November 20, 2012 at 1:57 pm

Just to say we’re starting to run a lot of timing critical tests on iPads and Android tablets. Take a look at our website and testing hardware.

http://www.blackboxtoolkit.com/

Also we’re writing up an academic paper. Guess it’s OK to quote you?
1. Michael Tyson
  November 20, 2012 at 1:59 pm
  
  Sure, fire away
Mark Pauley
November 20, 2012 at 8:01 pm
If you want really precise timing, you may wish to take a look at spawning a real-time thread. You can do this with pthread_setschedparam* or thread_policy_set**.

The main problem you’re running into is that the scheduler isn’t giving your thread enough time on the CPU when the system becomes busy. Using the set_realtime example given in the kernel programming guide below, you should be able to promote a timing thread to real-timeness. Be aware that you should really make sure you don’t blow your computation period estimate. Do no i/o on this thread. Memory and CPU bound stuff only. You can use mach semaphores to signal other threads, but don’t take locks and don’t make any untrusted system calls.
* http://developer.apple.com/library/ios/#documentation/System/Conceptual/ManPages_iPhoneOS/man3/pthread_setschedparam.3.html
** https://developer.apple.com/library/mac/#documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html ( #import )
1. Michael Tyson
  November 20, 2012 at 8:08 pm
  
  Actually, last time I looked, this wasn’t possible on iOS. Apple have removed the realtime mode on iOS – the best we can do is a high-priority thread, but there’re no realtime guarantees there, of course.
  
  When it comes to audio, of course, this isn’t a problem – the Core Audio thread is realtime, and timing accurate to whatever the hardware buffer duration is can be achieved easily.
Niko
February 25, 2013 at 11:28 pm

I have been really pleased with the Loopy app but I am glad you are working on the issue of delayed response. I have noticed that when pressing record in another app using Audiobus, there is a very slight lag. When it comes to a looper, it has to be an immediate response or it’s just not going to be as awesome as this app could be. It needs improvement to be a really great pro tool. Can’t wait for the next update. The reverse function will be stellar. A few suggestions: a ring modulator would be cool. A choice of a larger control panel for Audiobus when using another app. And for a silly one, it would be great to have that nice chill blue available for the iPad version. Thanks for the creative tools.
1. Michael Tyson
  February 26, 2013 at 12:14 am
  
  Thanks for the comments, Niko – but please post these on the Loopy forum, forum.loopyapp.com.
  1. Niko
    February 26, 2013 at 12:58 am
    
    I read the complete article so I wanted to comment and also, even though it wasn’t in the technical scope, I thought it was relevant to the issue and still an issue with your product that I paid for. But yea, if I ever do feel the need to comment again, I will use the forum.
