6.13.1 Regexp Functions

By default, Guile supports POSIX extended regular expressions. That means that the characters ‘(’, ‘)’, ‘+’ and ‘?’ are special, and must be escaped if you wish to match the literal characters. It also means that there is no support for “non-greedy” variants of ‘*’, ‘+’ or ‘?’.

This regular expression interface was modeled after that implemented by SCSH, the Scheme Shell. It is intended to be upwardly compatible with SCSH regular expressions.

Zero bytes (#\nul) cannot be used in regex patterns or input strings, since the underlying C functions treat a zero byte as the end of the string. If a pattern or input string contains a zero byte, an error is thrown.

Internally, patterns and input strings are converted to the current locale’s encoding, and then passed to the C library’s regular expression routines (see Regular Expressions in The GNU C Library Reference Manual). The returned match structures always point to characters in the strings, not to individual bytes, even in the case of multi-byte encodings. This ensures that the match structures are correct when performing matching with characters that have a multi-byte representation in the locale encoding. Note, however, that using characters which cannot be represented in the locale encoding can lead to surprising results.

Scheme Procedure: string-match pattern str [start]

Compile the string pattern into a regular expression and compare it with str. The optional numeric argument start specifies the position in str at which to begin matching.

string-match returns a match structure which describes what, if anything, was matched by the regular expression. See Match Structures. If str does not match pattern at all, string-match returns #f.

Two examples follow. In the first, the pattern matches the four digits in the target string. In the second, the pattern matches nothing, so string-match returns #f.

(string-match "[0-9][0-9][0-9][0-9]" "blah2002")
⇒ #("blah2002" (4 . 8))

(string-match "[A-Za-z]" "123456")
⇒ #f
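Since string-match can return either a match structure or #f, callers usually test the result before using it. The following is a minimal sketch of that, which also shows the optional start argument. The helper name first-number is made up for this illustration; string-match and match:substring come from the (ice-9 regex) module.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: return the first run of digits in STR at or
;; after position START, or #f when there is none.
(define* (first-number str #:optional (start 0))
  (let ((m (string-match "[0-9]+" str start)))
    (and m (match:substring m))))

(first-number "a1 b22")      ; ⇒ "1"
(first-number "a1 b22" 2)    ; ⇒ "22"
(first-number "no digits")   ; ⇒ #f
```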

Each time string-match is called, it must compile its pattern argument into a regular expression structure. This operation is expensive, which makes string-match inefficient if the same regular expression is used several times (for example, in a loop). For better performance, you can compile a regular expression in advance and then match strings against the compiled regexp.
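As a sketch of that advice, the pattern below is compiled once with make-regexp and then reused for every string in a list. The names digits-rx and strings-with-digits are illustrative only.

```scheme
;; Compile the pattern once, outside the loop.
(define digits-rx (make-regexp "[0-9]+"))

;; Hypothetical helper: keep only the strings that contain a digit.
;; regexp-exec returns a match structure (true) or #f.
(define (strings-with-digits strings)
  (filter (lambda (s) (regexp-exec digits-rx s)) strings))

(strings-with-digits '("blah2002" "abc" "x1"))
;; ⇒ ("blah2002" "x1")
```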

Scheme Procedure: make-regexp pat flag…
C Function: scm_make_regexp (pat, flaglst)

Compile the regular expression described by pat, and return the compiled regexp structure. If pat does not describe a legal regular expression, make-regexp throws a regular-expression-syntax error.
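When the pattern comes from user input, the error can be caught by its key. This is a minimal sketch; try-make-regexp is a hypothetical helper, and the malformed pattern is just one example of input that the C library rejects.

```scheme
;; Hypothetical helper: compile PAT, or return #f if it is malformed.
(define (try-make-regexp pat)
  (catch 'regular-expression-syntax
    (lambda () (make-regexp pat))
    (lambda (key . args) #f)))

(regexp? (try-make-regexp "[0-9]+"))   ; ⇒ #t
(try-make-regexp "[0-9")               ; ⇒ #f, unbalanced bracket
```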

The flag arguments change the behavior of the compiled regular expression. The following values may be supplied:

Variable: regexp/icase

Consider uppercase and lowercase letters to be the same when matching.

Variable: regexp/newline

If a newline appears in the target string, then permit the ‘^’ and ‘$’ operators to match immediately after or immediately before the newline, respectively. Also, the ‘.’ and ‘[^...]’ operators will never match a newline character. The intent of this flag is to treat the target string as a buffer containing many lines of text, and the regular expression as a pattern that may match a single one of those lines.
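A short illustration of the difference, using a three-line target string. The names text and line-rx are chosen only for this example.

```scheme
(use-modules (ice-9 regex))

(define text "alpha\nbeta\ngamma")

;; Without the flag, ‘^’ matches only at the start of the whole string.
(string-match "^b.*$" text)
;; ⇒ #f

;; With regexp/newline, ‘^’ and ‘$’ also match at line boundaries,
;; and ‘.’ stops at the newline.
(define line-rx (make-regexp "^b.*$" regexp/newline))
(match:substring (regexp-exec line-rx text))
;; ⇒ "beta"
```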

Variable: regexp/basic

Compile a basic (“obsolete”) regexp instead of the extended (“modern”) regexps that are the default. Basic regexps do not consider ‘|’, ‘+’ or ‘?’ to be special characters, and require the ‘{...}’ and ‘(...)’ metacharacters to be backslash-escaped (see Backslash Escapes). There are several other differences between basic and extended regular expressions, but these are the most significant.

Variable: regexp/extended

Compile an extended regular expression rather than a basic regexp. This is the default behavior; this flag will not usually be needed. If a call to make-regexp includes both regexp/basic and regexp/extended flags, the one which comes last will override the earlier one.
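To make the basic/extended distinction concrete, the sketch below compiles the same pattern string both ways. The variable names ext-rx and bas-rx are illustrative.

```scheme
(use-modules (ice-9 regex))

(define ext-rx (make-regexp "a+"))                ; extended: ‘+’ repeats
(define bas-rx (make-regexp "a+" regexp/basic))   ; basic: ‘+’ is literal

(match:substring (regexp-exec ext-rx "caaat"))   ; ⇒ "aaa"
(regexp-exec bas-rx "caaat")                     ; ⇒ #f
(match:substring (regexp-exec bas-rx "ca+t"))    ; ⇒ "a+"
```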

Scheme Procedure: regexp-exec rx str [start [flags]]
C Function: scm_regexp_exec (rx, str, start, flags)

Match the compiled regular expression rx against str. If the optional integer start argument is provided, begin matching from that position in the string. Return a match structure describing the results of the match, or #f if no match could be found.

The flags argument changes the matching behavior. The following flag values may be supplied; use logior (see Bitwise Operations) to combine them.

Variable: regexp/notbol

Consider that the start offset into str is not the beginning of a line, so that the ‘^’ operator does not match there.

If rx was created with the regexp/newline option above, ‘^’ will still match after a newline in str.

Variable: regexp/noteol

Consider that the end of str is not the end of a line, so that the ‘$’ operator does not match there.

If rx was created with the regexp/newline option above, ‘$’ will still match before a newline in str.
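The following sketch shows the two flags in use, including combining them with logior. The regexp names start-rx and end-rx are illustrative.

```scheme
(use-modules (ice-9 regex))

(define start-rx (make-regexp "^foo"))
(define end-rx   (make-regexp "bar$"))

;; By default ‘^’ matches at the start offset, here position 3.
(match:start (regexp-exec start-rx "xx foo" 3))      ; ⇒ 3

;; With regexp/notbol the start offset no longer counts as a line start.
(regexp-exec start-rx "xx foo" 3 regexp/notbol)      ; ⇒ #f

;; Likewise regexp/noteol stops ‘$’ matching at the end of the string.
(regexp-exec end-rx "foobar" 0 regexp/noteol)        ; ⇒ #f

;; Flags are combined with logior.
(regexp-exec (make-regexp "^foobar$") "foobar" 0
             (logior regexp/notbol regexp/noteol))   ; ⇒ #f
```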

;; Regexp to match uppercase letters
(define r (make-regexp "[A-Z]*"))

;; Regexp to match letters, ignoring case
(define ri (make-regexp "[A-Z]*" regexp/icase))

;; Search for bob using regexp r
(match:substring (regexp-exec r "bob"))
⇒ ""                  ; only the empty string matches: no uppercase letters

;; Search for bob using regexp ri
(match:substring (regexp-exec ri "Bob"))
⇒ "Bob"               ; matched case insensitive

Scheme Procedure: regexp? obj
C Function: scm_regexp_p (obj)

Return #t if obj is a compiled regular expression, or #f otherwise.
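One possible use of this predicate is a procedure that accepts either a pattern string or a compiled regexp. The helper ensure-regexp below is hypothetical and shown only as a sketch.

```scheme
;; Hypothetical helper: accept a pattern string or a compiled regexp
;; and always return a compiled regexp.
(define (ensure-regexp pat-or-rx)
  (if (regexp? pat-or-rx)
      pat-or-rx
      (make-regexp pat-or-rx)))

(regexp? "[a-z]+")                   ; ⇒ #f, a plain string
(regexp? (ensure-regexp "[a-z]+"))   ; ⇒ #t
```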


Scheme Procedure: list-matches regexp str [flags]

Return a list of match structures which are the non-overlapping matches of regexp in str. regexp can be either a pattern string or a compiled regexp. The flags argument is as per regexp-exec above.

(map match:substring (list-matches "[a-z]+" "abc 42 def 78"))
⇒ ("abc" "def")

Scheme Procedure: fold-matches regexp str init proc [flags]

Apply proc to the non-overlapping matches of regexp in str, to build a result. regexp can be either a pattern string or a compiled regexp. The flags argument is as per regexp-exec above.

proc is called as (proc match prev) where match is a match structure and prev is the previous return from proc. For the first call prev is the given init parameter. fold-matches returns the final value from proc.

For example to count matches,

(fold-matches "[a-z][0-9]" "abc x1 def y2" 0
              (lambda (match count)
                (1+ count)))
⇒ 2
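Both procedures also accept a compiled regexp, and the match structures they supply can be examined with any of the match accessors. A small sketch, with number-rx as an illustrative name:

```scheme
(use-modules (ice-9 regex))

(define number-rx (make-regexp "[0-9]+"))

;; Positions of each match, via list-matches.
(map match:start (list-matches number-rx "a1 b22 c333"))
;; ⇒ (1 4 8)

;; Sum of the matched numbers, via fold-matches.
(fold-matches number-rx "a1 b22 c333" 0
              (lambda (m sum)
                (+ sum (string->number (match:substring m)))))
;; ⇒ 356
```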

Regular expressions are commonly used to find patterns in one string and replace them with the contents of another string. The following functions are convenient ways to do this.

Scheme Procedure: regexp-substitute port match item …

Write to port selected parts of the match structure match. Or if port is #f then form a string from those parts and return that.

Each item specifies a part to be written, and may be one of the following,

  • A string. String arguments are written out verbatim.
  • An integer. The submatch with that number is written (match:substring). Zero is the entire match.
  • The symbol ‘pre’. The portion of the matched string preceding the regexp match is written (match:prefix).
  • The symbol ‘post’. The portion of the matched string following the regexp match is written (match:suffix).

For example, changing a match and retaining the text before and after,

(regexp-substitute #f (string-match "[0-9]+" "number 25 is good")
                   'pre "37" 'post)
⇒ "number 37 is good"
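Because string-match returns #f when nothing matches, and regexp-substitute expects a match structure, code that cannot be sure of a match may want to check first. The helper replace-first below is a hypothetical sketch of that pattern.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: replace the first match of PATTERN in STR with
;; REPLACEMENT, or return STR unchanged when nothing matches.
(define (replace-first pattern str replacement)
  (let ((m (string-match pattern str)))
    (if m
        (regexp-substitute #f m 'pre replacement 'post)
        str)))

(replace-first "[0-9]+" "number 25 is good" "37")
;; ⇒ "number 37 is good"
(replace-first "[0-9]+" "no numbers here" "37")
;; ⇒ "no numbers here"
```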

Or matching a YYYYMMDD format date such as ‘20020828’ and re-ordering and hyphenating the fields.

(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute #f (string-match date-regex s)
                   'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
⇒ "Date 04-29-2002 12am. (20020429)"

Scheme Procedure: regexp-substitute/global port regexp target item…

Write to port selected parts of matches of regexp in target. If port is #f then form a string from those parts and return that. regexp can be a pattern string or a compiled regexp.

This is similar to regexp-substitute, but allows global substitutions on target. Each item behaves as per regexp-substitute, with the following differences,

  • A function. Called as (item match) with the match structure for the regexp match, it should return a string to be written to port.
  • The symbol ‘post’. This doesn’t output anything, but instead causes regexp-substitute/global to recurse on the unmatched portion of target.

    This must be supplied to perform a global search and replace on target; without it regexp-substitute/global returns after a single match and output.

For example, to collapse runs of tabs and spaces to a single hyphen each,

(regexp-substitute/global #f "[ \t]+"  "this   is   the text"
                          'pre "-" 'post)
⇒ "this-is-the-text"

Or using a function to reverse the letters in each word,

(regexp-substitute/global #f "[a-z]+"  "to do and not-do"
  'pre (lambda (m) (string-reverse (match:substring m))) 'post)
⇒ "ot od dna ton-od"
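A function item can also compute its replacement from the matched text, and the regexp may be compiled in advance. In this sketch the helper double-match is a made-up name.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helper: double the number held in match M and return
;; it as a string, as a function item must.
(define (double-match m)
  (number->string (* 2 (string->number (match:substring m)))))

(regexp-substitute/global #f (make-regexp "[0-9]+")
                          "3 apples and 12 pears"
                          'pre double-match 'post)
;; ⇒ "6 apples and 24 pears"
```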

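Without the post symbol, just one regexp match is processed. The following is the date example from regexp-substitute above, written without the separate string-match call. Because the target contains only one date, the post symbol here simply outputs the text that follows the match.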

(define date-regex
   "([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])")
(define s "Date 20020429 12am.")
(regexp-substitute/global #f date-regex s
                          'pre 2 "-" 3 "-" 1 'post " (" 0 ")")
⇒ "Date 04-29-2002 12am. (20020429)"
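The effect of the post symbol is easiest to see on a target with several matches. The following sketch runs the same substitution with and without it.

```scheme
(use-modules (ice-9 regex))

;; With ‘post’: every match is replaced.
(regexp-substitute/global #f "[0-9]" "a1b2c3" 'pre "#" 'post)
;; ⇒ "a#b#c#"

;; Without ‘post’: output stops after the first match.
(regexp-substitute/global #f "[0-9]" "a1b2c3" 'pre "#")
;; ⇒ "a#"
```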