Archive for Regular Expressions

Back references

Back references let you reuse a previously captured sub-expression within the regular expression itself. They are useful in situations such as matching HTML tags, where you want to match the closing tag but the name of the opening tag is not known in advance.

The syntax for a back reference is `\1`, or a backslash followed by any higher number that identifies a capturing sub-expression; the maximum number of back references allowed is 99.


<?php
$pattern = '!<(.*?)>.*?</\1>!';
$string = 'some text <tag> text </tag> some text';
preg_match($pattern, $string, $matches);
?>
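To see what the back reference does in practice, here is a minimal sketch that runs a pattern of the same shape against a string containing two different tags. The sample strings, the extra capturing group around the tag content, and the variable names are assumptions made for illustration.

```php
<?php
// Hypothetical sample input: two different tag names in one string.
$string = 'some <b>bold</b> and <i>italic</i> text';

// \1 must repeat whatever the first group captured, so <b> only pairs with </b>.
$pattern = '!<(.*?)>(.*?)</\1>!';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the tag name, $m[2] is the text between the tags.
    echo $m[1], ': ', $m[2], "\n";
}

// A closing tag that does not repeat the opening tag name is not matched.
$mismatched = preg_match($pattern, '<b>text</i>');
var_dump($mismatched); // int(0)
```

The loop prints `b: bold` and `i: italic`, one per line.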

Back references must refer to capturing sub-expressions; they cannot be used with non-capturing sub-expressions. The following pattern will not work. It fails to compile, so preg_match() raises a warning and returns false:

$pattern = '!<(?:.*?)>.*?</\1>!';

The pattern references a sub-expression that does not exist, because nothing was captured. A related case is a back reference to a capturing sub-expression that took no part in the match because another alternative was used, as in the following:

$pattern = '!(a|(bc))\2!';

This pattern compiles, but a match attempt that takes the `a` alternative always fails, because the second sub-expression `(bc)` never captured anything. It will match if the attempt takes the `bc` alternative, as in `bcbc`.
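Both failure modes can be checked with a short sketch. It uses a pattern of the form `(a|(bc))\2`, where `\2` points at the inner `(bc)` group. The `^` and `$` anchors and the test strings are assumptions added here so that each result is unambiguous.

```php
<?php
// \2 refers to the inner group (bc), which is only set when the bc alternative is taken.
$pattern = '!^(a|(bc))\2$!';

var_dump(preg_match($pattern, 'bcbc')); // int(1): group 2 captured "bc" and \2 repeats it
var_dump(preg_match($pattern, 'aa'));   // int(0): group 2 took no part, so \2 fails

// A back reference to a group that does not exist is a compile error.
// The @ operator hides the warning so only the return value is shown.
$result = @preg_match('!<(?:.*?)>.*?</\1>!', '<tag> text </tag>');
var_dump($result); // bool(false)
```

Note the difference between the two cases: an unset group gives an ordinary "no match" result of 0, while a missing group makes preg_match() return false.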


Non-capturing Parentheses

Sometimes we add sub-expressions as part of a larger expression but don’t need the match data for that sub-expression. This is where non-capturing parentheses (also known as grouping-only parentheses) come in useful. Non-capturing parentheses prevent the sub-expression match from being stored in the match array.

The syntax for non-capturing parentheses is: (?:…)


<?php

$string = '<tr><td>table data 1</td><td>table data 2</td></tr>';

$pattern = '/<tr><td>(.*?)<\/td><td>(.*?)<\/td><\/tr>/';

preg_match($pattern, $string, $match);

print_r($match);

?>

The above code snippet will print:

Array
(
[0] => <tr><td>table data 1</td><td>table data 2</td></tr>
[1] => table data 1
[2] => table data 2
)
Assuming that we only require the data in the second <td>, we can change the sub-expression in the first <td> to a non-capturing one as follows.


$pattern = '/<tr><td>(?:.*?)<\/td><td>(.*?)<\/td><\/tr>/';

and this will print:

Array
(
[0] => <tr><td>table data 1</td><td>table data 2</td></tr>
[1] => table data 2
)

The second method is generally considered more efficient than the first (capturing everything): the engine does not have to store the text matched by the first sub-expression, which also means it uses less memory.
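Non-capturing parentheses are also handy when parentheses are needed only to group alternatives or to apply a quantifier. The following is a minimal sketch of that use. The input string and the pattern are made up for illustration.

```php
<?php
// Hypothetical input; only the surname is of interest.
$string = 'Dear Mrs Smith, thank you for your order.';

// (?:Mr|Mrs|Ms) groups the alternatives without storing which one matched.
$pattern = '/(?:Mr|Mrs|Ms) (\w+)/';

preg_match($pattern, $string, $match);

print_r($match);
```

Here `$match[0]` is `Mrs Smith` and `$match[1]` is `Smith`; there is no separate entry for the title, so the surname keeps index 1 however many alternatives are added to the group.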


Named Capture

Named capturing stores the text matched by a sub-expression under a name, i.e. the match array will contain an element whose key is the name specified in the named capture and whose value is the matched text.

The syntax for named capturing is: (?P<name>…)

Look at the following code snippet and it will become clear.


<?php

$string = "<head><title>my title</title></head>";

$pattern = "/<title>(?P<page_title>.*?)<\/title>/";

preg_match($pattern, $string, $match);

echo $match['page_title'];

?>

What are the benefits of named capturing?

  1. It is easier to access the captured data by name than to work out the array index, especially in larger, more complex regular expressions that contain many sub-expressions.
  2. You don’t have to modify existing code, i.e. if another capturing sub-expression is added before the named capture, you can still access the value of the named capture using the same key. If named capturing were not used, the array index of every match after the new sub-expression would shift from $i to $i+1.
  3. The code is easier to read.
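The second benefit can be illustrated with a short sketch. The extra `(head)` group and the variable names `$before`, `$after`, `$m1` and `$m2` are assumptions introduced for this example.

```php
<?php
$string = "<head><title>my title</title></head>";

// Original pattern: the named group is also group number 1.
$before = "/<title>(?P<page_title>.*?)<\/title>/";

// A new capturing group is added in front, so the title becomes group number 2.
$after = "/<(head)><title>(?P<page_title>.*?)<\/title>/";

preg_match($before, $string, $m1);
preg_match($after, $string, $m2);

echo $m1['page_title'], "\n"; // my title
echo $m2['page_title'], "\n"; // my title (the same key still works)
echo $m1[1], "\n";            // my title
echo $m2[1], "\n";            // head (index 1 now points at the new group)
```

Note that preg_match() stores a named capture under its numeric index as well as its name, so `$m2[2]` also holds `my title`.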
