Next: Namespaces, Previous: Extended streams, Up: Features
Regular expressions, or "regexes", are a sophisticated way to efficiently match patterns of text. If you are unfamiliar with regular expressions in general, see 20.5 Syntax of Regular Expressions in GNU Emacs Manual for an introductory guide.
GNU Smalltalk supports regular expressions in the core image with methods
on String.
The GNU Smalltalk regular expression library is derived from GNU libc,
with modifications made originally for Ruby to support Perl-like syntax.
It will always use its included library, and never the ones installed on
your system; this may change in the future in backwards-compatible ways.
Regular expressions are currently 8-bit clean, meaning they can
work with any ordinary String, but do not support full Unicode, even
when package I18N is loaded.
Broadly speaking, these regexes support Perl 5 syntax; register groups ‘()’ and repetition ‘{}’ must not be given with backslashes, and their counterpart literal characters should. For example, ‘\{{1,3}’ matches ‘{’, ‘{{’, ‘{{{’; correspondingly, ‘(a)(\()’ matches ‘a(’, with ‘a’ and ‘(’ as the first and second register groups respectively. GNU Smalltalk also supports the regex modifiers ‘imsx’, as in Perl. You can’t put regex modifiers like ‘im’ after Smalltalk strings to specify them, because they aren’t part of Smalltalk syntax. Instead, use the inline modifier syntax. For example, ‘(?is:abc.)’ is equivalent to ‘[Aa][Bb][Cc](?:.|\n)’.
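The escaping and inline-modifier rules above can be tried out with a short sketch like the following. The sample subject strings are made up for illustration, and the snippet assumes the #=~ operator and the RegexResults accessors #match and #at: behave as described later in this section.

```smalltalk
| r |
"Unescaped parentheses are register groups; \( is a literal parenthesis."
r := 'call a( now' =~ '(a)(\()'.
(r at: 1) printNl.      "first register group, the text a"
(r at: 2) printNl.      "second register group, the text ("

"\{ is a literal brace; the unescaped {1,3} is a repetition count."
('x{{y' =~ '\{{1,3}') match printNl.

"Inline modifier: match abc regardless of case."
('Hello ABC' =~ '(?i:abc)') match printNl.
```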
In most cases, you should specify regular expressions as ordinary
strings. GNU Smalltalk always caches compiled regexes, and uses a special
high-efficiency caching when looking up literal strings (i.e. most
regexes), to hide the compiled Regex objects from most code.
For special cases where this caching is not good enough, simply send
#asRegex to a string to retrieve a compiled form, which
works in all places in the public API where you would specify a regex
string. You should always rely on the cache until you have demonstrated
that using Regex objects makes a noticeable performance difference in
your code.
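As a minimal sketch of the compiled form, the following compiles a pattern once with #asRegex and reuses it across several strings. The sample lines and variable names are invented for illustration; whether this is actually faster than relying on the cache depends on your workload and should be measured.

```smalltalk
| digits lines hits |
"Compile the pattern once; the Regex object can be passed wherever
 a regex string is accepted."
digits := '\d+' asRegex.

lines := OrderedCollection new.
lines add: 'port 8080'; add: 'no numbers here'; add: 'retry 3 times'.

"Keep only the lines that contain at least one run of digits."
hits := lines select: [:line | (line =~ digits) matched].
hits size printNl.
```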
Smalltalk strings only have one escape, the ‘'’ given by ‘''’, so backslashes used in regular expression strings will be understood as backslashes, and a literal backslash can be given directly with ‘\\’.
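A few illustrative cases of these quoting rules follow. The subject strings are hypothetical, and the snippet assumes the #=~ operator and the #matched and #match accessors described later in this section.

```smalltalk
"The backslash reaches the regex engine untouched, so \d+ means
 one or more digits."
('room 42' =~ '\d+') match printNl.

"A doubled quote inside a string literal stands for one quote character,
 so this pattern is the two characters quote and s."
('it''s' =~ '''s') matched printNl.

"The regex \\ matches one literal backslash in the subject."
('C:\temp' =~ '\\') matched printNl.
```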
The methods on the compiled Regex object are private to this interface. As a public interface, GNU Smalltalk provides methods on String, in the category ‘regex’. There are several methods for matching, replacing, pattern expansion, iterating over matches, and other useful things.
The fundamental operator is #searchRegex:, usually written as
#=~, reminiscent of Perl syntax. This method will always
return a RegexResults, which you can query for whether
the regex matched, the location Interval and contents of the match and
any register groups as a collection, and other features. For example,
here is a simple configuration file line parser:
    | file config |
    config := LookupTable new.
    file := (File name: 'myapp.conf') readStream.
    file linesDo: [:line |
        (line =~ '(\w+)\s*=\s*((?: ?\w+)+)') ifMatched: [:match |
            config at: (match at: 1) put: (match at: 2)]].
    file close.
    config printNl.
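To make the queries on RegexResults more concrete, here is a small sketch that inspects a single result without touching the file system. The subject string is made up, and the accessors used (#matched, #match, #from, #to, #at: and #ifMatched:ifNotMatched:) are assumed to be available on RegexResults as in current GNU Smalltalk releases.

```smalltalk
| result |
result := 'timeout = 30' =~ '(\w+)\s*=\s*(\w+)'.

result matched printNl.     "whether the regex matched at all"
result match printNl.       "the text of the whole match"
result from printNl.        "index of the first matched character"
result to printNl.          "index of the last matched character"
(result at: 1) printNl.     "first register group"
(result at: 2) printNl.     "second register group"

"A failed search still answers a RegexResults, which can be tested."
('no digits here' =~ '\d+')
    ifMatched: [:m | m match printNl]
    ifNotMatched: [:m | 'no match' printNl].
```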
As with Perl, =~ will scan the entire string and answer the
leftmost match if any is to be found, consuming as many characters as
possible from that position. You can anchor the search with variant
messages like #matchRegex:, or of course ^ and $
with their usual semantics if you prefer.
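The difference between an unanchored search and an anchored one can be sketched as follows. The subject strings are invented, and the snippet assumes that #matchRegex: answers a Boolean telling whether the whole receiver matches the pattern, which is how current GNU Smalltalk releases define it.

```smalltalk
| r |
"Unanchored: the leftmost match wins and is as long as possible."
r := 'xx aaa a' =~ 'a+'.
r match printNl.        "the run of three a characters"
r from printNl.         "it starts at index 4"

"Anchored with a variant message: the whole string must match."
('abc123' matchRegex: '\w+') printNl.
('abc 123' matchRegex: '\w+') printNl.

"Anchored inside the pattern with ^ and $."
('abc 123' =~ '^\d+') matched printNl.
('abc 123' =~ '\d+$') match printNl.
```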
Do not modify a string while you still need a RegexResults object obtained from it to remain valid, because changes to the matched text may propagate to the RegexResults object.
Analogously to the Perl s operator, GNU Smalltalk provides
#replacingRegex:with:. Unlike Perl, GNU Smalltalk employs the pattern
expansion syntax of the #% message here. For example,
'The ratio is 16/9.' replacingRegex: '(\d+)/(\d+)' with: '$%1\over%2$'
answers 'The ratio is $16\over9$.'. In place of the g
modifier, use the #replacingAllRegex:with: message instead.
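The following sketch contrasts the two replacement messages on made-up input; only the first match is replaced by #replacingRegex:with:, while #replacingAllRegex:with: replaces every match. The %1 and %2 placeholders refer to register groups, as in the example above.

```smalltalk
| date |
date := '2024-01-15'.

"Only the first dash is replaced."
(date replacingRegex: '-' with: '/') printNl.

"Every dash is replaced."
(date replacingAllRegex: '-' with: '/') printNl.

"Register groups can be reordered in the replacement text."
('Doe, Jane' replacingRegex: '(\w+), (\w+)' with: '%2 %1') printNl.
```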
One other interesting String message is #onOccurrencesOfRegex:do:, which
invokes its second argument, a block, on every successful match found in the
receiver. Internally, every search will start at the end of the previous
successful match. For example, this will print all the words in a stream:
stream contents onOccurrencesOfRegex: '\w+' do: [:each | each match printNl]
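A self-contained variant of the same idea, using a literal string instead of a stream and collecting the words rather than printing them, might look like this. The sample sentence and the variable name are chosen for illustration only.

```smalltalk
| words |
words := OrderedCollection new.

"The block receives a RegexResults for each successive match."
'the quick brown fox' onOccurrencesOfRegex: '\w+' do: [:each |
    words add: each match].

words size printNl.
words printNl.
```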