Where’s NSRegularExpression?
September 15th, 2007
I wrote CocoaICU about a year ago when I made the following observations:
- ICU has great regular expression support
- ICU is installed on Mac OS X
- Foundation lacks support for regexes.
CocoaICU is simply a light layer of Objective-C code that encapsulates ICU.
It’s remarkable that Cocoa still lacks regular expression support. I am not a Leopard user or tester, but my educated hunch is that Leopard will not add support for regexes, which means this problem will probably persist in the Mac developer community for at least another few years. Why is this such a bad thing?
- Inconsistent regex semantics across apps: Regex syntax leaks through to users of applications. Conventions for character classes and semantics for referencing capture groups differ across regex libraries. The result is that users must understand the particular regex semantics of each application they use. It would be nice if we all just agreed to use whatever Apple wanted (probably ICU).
- A confusing variety of regex libraries for developers: In the absence of a generally accepted regex implementation, a variety of third-party libraries have sprung up to fill the void. If you absolutely need regexes in your app, the otherwise simple problem of writing regular expressions has become the much harder problem of shopping for a regex library that suits your needs: What are the differences in features between libraries? How well-tested is library X? In addition, regex users are now fragmented across several libraries, so the benefits of a large userbase (bug reports, feature requests, etc.) are spread thin. Developers who use a third-party library are also responsible for incorporating updates to the library into their app.
- It’s embarrassing for Apple: Cocoa is the only framework I can think of that lacks support for regexes.
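To make the first point above concrete, here is a small sketch of how capture-group references differ between libraries. Python's `re` module is used only as a convenient stand-in for "some other regex library"; the sample date string and variable names are made up for illustration. ICU replacement text refers to groups as `$1`, `$2`, and so on, while Python's `re` expects a backslash form and treats the dollar sign as ordinary text.

```python
import re

# Hypothetical sample input: reorder a year-month-day date.
text = "2007-09-15"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Python's re module refers to capture groups with a backslash.
backslash_style = re.sub(pattern, r"\3/\2/\1", text)

# The dollar-sign form used in ICU-style replacement text is just
# literal characters as far as Python's re module is concerned.
dollar_style = re.sub(pattern, "$3/$2/$1", text)

print(backslash_style)
print(dollar_style)
```

A user who learned the dollar-sign convention in one app and types it into a find-and-replace field backed by a different library gets the literal template back instead of the reordered date.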
I know Apple understands the importance of adding regular expression support to Foundation. What mystified me until a few days ago was why Foundation still lacks regexes.
Apple is apparently waiting for at least one big improvement to ICU before adding regex support for NSStrings. A recent post in a Cocoa-Dev thread by an Apple employee mentions the missing ICU functionality that prevents Apple from adding regex support. I hesitate to claim that I completely understand the issue and how it affects Apple, but I think the following describes the situation:
- ICU currently only works with UTF16 strings stored in a buffer
- NSStrings are not stored as UTF16 strings and character data is not necessarily stored in one continuous buffer
- The underlying characters in an NSString must therefore be copied to a buffer as UTF16 strings before matching can occur using ICU
- It would be nice to not have to do this potentially large convert/copy of an NSString to do matching
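The convert/copy step described in the list above can be sketched roughly as follows. This is not how NSString or ICU is actually implemented. It is a toy model in Python, assuming a string whose characters live in several separate UTF-8 chunks (split on character boundaries); the `chunks` list and the `flatten_to_utf16` helper are hypothetical names invented for this example.

```python
# Hypothetical stand-in for a string whose storage is neither UTF-16
# nor a single buffer: three separate UTF-8 chunks.
chunks = [s.encode("utf-8") for s in ("Where\u2019s ", "NSRegular", "Expression?")]

def flatten_to_utf16(pieces):
    """Convert every piece to UTF-16 and copy it into one new buffer."""
    buffer = bytearray()
    for piece in pieces:
        # Each piece is decoded and re-encoded, then appended: the
        # whole string gets touched and copied before any matching.
        buffer += piece.decode("utf-8").encode("utf-16-le")
    return bytes(buffer)

utf16 = flatten_to_utf16(chunks)

original_size = sum(len(piece) for piece in chunks)
print("original storage:", original_size, "bytes in", len(chunks), "chunks")
print("UTF-16 copy:", len(utf16), "bytes in one buffer")
```

The extra buffer grows with the length of the string, so for a large document the copy alone can cost a significant amount of memory and time before the regex engine examines a single character.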
Apple could add regex support by simply doing the inefficient convert/copy of strings, which would basically replicate what CocoaICU does. It’s clear that Apple is not going to do this (they would have done it by now). I think I agree with Apple’s stance on this issue; users of Foundation should be able to trust that something as simple as regular expression matching behaves well. An implementation that could consume lots of memory or perform poorly is worse than not including the functionality at all.