PHP preg_match and regex (Regular Expressions)
WARNING!!! For some reason the escaped characters in this article may not appear escaped, because the site's PHP script replaces them. You can find the original article on my archive server at http://rasterized.net:8080/regex.txt
This is a tutorial I decided to write because a lot of tutorials never explain regex syntax clearly or do not cover everything. I hope to make this a complete tutorial, and plan to expand on it as I find out more.
Now let me say, I am going to be explaining the syntax for preg_match() in PHP, but the function is built on Perl-style regular expressions, so I don't think there will be many compatibility issues.
Let's start with escape characters. Most of you reading this tutorial probably know what they are, but in regex there are things to keep in mind. The characters that you MUST escape with a backslash, if you are trying to match strings that include them, are:
* + ? . ( ) [ ] { } \ / | ^ $
(The '/' is on this list because it is the delimiter used around every pattern in this tutorial.)
Example:
The string: "How much money do you need? ($100, $200)"
Escaped: "How much money do you need\? \(\$100, \$200\)"
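If you would rather not escape a literal string by hand, PHP's preg_quote() can do it for you. The snippet below is only a small sketch using the example string above; the variable names and the longer $subject string are made up for illustration.

```php
<?php
// The literal text we want to find, special characters and all.
$literal = 'How much money do you need? ($100, $200)';

// preg_quote() puts a backslash in front of regex special characters.
// The second argument also escapes the delimiter we wrap the pattern in.
$escaped = preg_quote($literal, '/');
echo $escaped, "\n"; // How much money do you need\? \(\$100, \$200\)

// Hypothetical subject string that contains the literal text.
$subject = 'Q: How much money do you need? ($100, $200) A: $100';
$found = preg_match('/' . $escaped . '/', $subject);
echo $found, "\n"; // 1
```

Single quotes are used around the strings so that PHP does not try to interpret $100 or the backslashes itself.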
Now, you are going to learn the various wildcards (properly called quantifiers). You need to know what '*', '+', and '?' mean. They are really easy to figure out :D
'*' - match 0 or more of the preceding character
'+' - match 1 or more of the preceding character
'?' - match 0 or 1 of the preceding character
Examples:
'ref*' - match 're' followed by 0 or more f's
'sds_d+' - match 'sds_' followed by 1 or more d's
'terr?' - match 'ter' followed by 0 or 1 r's
Note: You are not limited to letters; a quantifier can follow any character you like.
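Here is a quick sketch that runs those three example patterns through preg_match(). The test strings are made up, and the patterns are wrapped in '^' and '$' (explained further down) so that the whole string has to match, which makes the effect of each quantifier easier to see.

```php
<?php
// Each entry is [pattern, subject]. The subjects are illustrative only.
$tests = [
    ['/^ref*$/',   're'],      // * allows zero f's
    ['/^ref*$/',   'refff'],   // * allows many f's
    ['/^sds_d+$/', 'sds_'],    // + needs at least one d, so this fails
    ['/^sds_d+$/', 'sds_ddd'],
    ['/^terr?$/',  'ter'],     // ? makes the second r optional
    ['/^terr?$/',  'terrr'],   // one r too many, so this fails
];

$results = [];
foreach ($tests as $test) {
    $matched = preg_match($test[0], $test[1]);
    $results[] = $matched;
    echo $test[0], ' on "', $test[1], '": ', $matched, "\n";
}
// $results is now [1, 1, 0, 1, 1, 0]
```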
If you want to specify how many characters to match, you can use the '{' and '}' brackets (usually called curly braces). The basic syntax is: (character){minimum[,maximum]}, written without spaces inside the braces.
Examples:
'something_r{2}' - match exactly 2 r's
'ni{3,}' - match 3 or more i's
'blay{5,10}' - match at least 5 y's but not more than 10, as in 5, 6, 7, 8, 9, or 10 y's
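The brace examples can be checked the same way. This is just a sketch with made-up test strings; the '^' and '$' around each pattern (covered in the next part) make sure the count applies to the whole string.

```php
<?php
$checks = [
    // {2}: exactly two r's
    'two r'      => preg_match('/^something_r{2}$/', 'something_rr'),
    'three r'    => preg_match('/^something_r{2}$/', 'something_rrr'),
    // {3,}: three or more i's
    'two i'      => preg_match('/^ni{3,}$/', 'nii'),
    'four i'     => preg_match('/^ni{3,}$/', 'niiii'),
    // {5,10}: between five and ten y's
    'five y'     => preg_match('/^blay{5,10}$/', 'bla' . str_repeat('y', 5)),
    'eleven y'   => preg_match('/^blay{5,10}$/', 'bla' . str_repeat('y', 11)),
];

foreach ($checks as $label => $matched) {
    echo $label, ': ', $matched, "\n";
}
// two r: 1, three r: 0, two i: 0, four i: 1, five y: 1, eleven y: 0
```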
Other special characters:
'.' - matches any single character (except a newline, by default)
Example: 'The letter of today is .!' - the '.' can stand for any character
'/' - the delimiter that marks the beginning and the end of the pattern (will be covered later)
'^' - denotes the beginning of the subject string; with the m modifier it also matches at the beginning of each line
'$' - denotes the end of the subject string; with the m modifier it also matches at the end of each line
'|' - the OR operator (will also be covered later)
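A small sketch of these special characters in action. The subject strings are made up, and the parentheses in the last pattern are only there to group the two alternatives.

```php
<?php
// '.' stands in for exactly one character.
$dot = preg_match('/^The letter of today is .!$/', 'The letter of today is Q!');

// '^' and '$' pin the pattern to the start and end of the subject.
$anchoredOk  = preg_match('/^abc$/', 'abc');
$anchoredBad = preg_match('/^abc$/', 'xabcx');

// '|' matches either the left or the right alternative.
$either  = preg_match('/^(cat|dog)$/', 'dog');
$neither = preg_match('/^(cat|dog)$/', 'bird');

echo $dot, $anchoredOk, $anchoredBad, $either, $neither, "\n"; // 11010
```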
Now let's get to the '[' and ']' brackets. These brackets define a character class: a set or range of characters, any one of which may match. (The dash '-' specifies a range; if you want to match a literal '-' inside the brackets, you have to escape it.) Example: '[a-zA-Z]' - matches 1 character that is either an upper case letter or a lower case letter.
You can also expand on this by using our friendly wildcards. Examples: '[0-9ab]*' - matches 0 or more characters that are 0 through 9, a, or b. '[0-9a-f]{32}' - matches exactly 32 characters that are 0 through 9, or a through f.
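One place the '[0-9a-f]{32}' pattern comes in handy is checking whether a string looks like an MD5 hash, since PHP's md5() returns 32 lowercase hex characters. The following is a minimal sketch; the variable names and sample strings are only for illustration.

```php
<?php
// md5() returns a string of 32 lowercase hexadecimal characters.
$hash = md5('hello');

// Anchored with ^ and $ so the whole string must be exactly 32 hex chars.
$isHash   = preg_match('/^[0-9a-f]{32}$/', $hash);
$tooShort = preg_match('/^[0-9a-f]{32}$/', 'abc123');

// A single upper or lower case letter.
$letter = preg_match('/^[a-zA-Z]$/', 'Q');

// An escaped dash inside the brackets matches a literal '-'.
$dash = preg_match('/^[a-z\-]+$/', 'well-known');

echo $isHash, $tooShort, $letter, $dash, "\n"; // 1011
```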
You should now understand the basic syntax of regex, so let's get into using it with preg_match()!
int preg_match ( string pattern, string subject [, array &matches])
Here is an example html that is stored in var $html:
OK, let's say you want to match all the tags that contain the word test. You would use: "/.*test.*<\/td>/" (notice the '/' slashes at the beginning and end, and the escaped '/' inside the pattern). In preg_match() that would be: preg_match("/.*test.*<\/td>/", $html); Note that preg_match() stops at the first match and returns 1 if the pattern matched or 0 if it did not. To get the number of times the pattern matches, use preg_match_all() with the same arguments, which here would return 3.
Now let's say you want to match all tags that contain the word test followed by a set of numbers. You would use: "/test[0-9]+<\/td>/" In preg_match() that would be: preg_match("/test[0-9]+<\/td>/", $html); This again returns 1 because there is at least one match; preg_match_all() with the same arguments would return 2.
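To see the difference between testing for a match and counting matches, here is a rough sketch. The $html below is NOT the sample from this article (that one did not survive); it is a made-up table chosen so the counts line up with the numbers mentioned above.

```php
<?php
// Hypothetical sample markup, invented for this sketch.
$html = "<table>\n"
      . "<tr><td>test123</td></tr>\n"
      . "<tr><td>a test here</td></tr>\n"
      . "<tr><td>test456</td></tr>\n"
      . "<tr><td>nota789</td></tr>\n"
      . "</table>";

// preg_match() stops at the first match: the result is 1 or 0.
$first = preg_match('/test[0-9]+<\/td>/', $html);

// preg_match_all() keeps going and returns the number of matches.
$withNumbers = preg_match_all('/test[0-9]+<\/td>/', $html);
$anyTest     = preg_match_all('/.*test.*<\/td>/', $html);

echo $first, ' ', $withNumbers, ' ', $anyTest, "\n"; // 1 2 3
```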
I am sure you have the basic idea now, but let's say you want to organize data within that HTML. Say you want to grab any string that sits between the tags and ends with the 3 numbers 789. You would use: "/(.*)789<\/td>/" Now in preg_match(): preg_match("/(.*)789<\/td>/", $html, $matches) (Notice the '(' and ')': they tell the regex engine to capture whatever matches at that position and store it in the array.)
Now $matches is an array. The first element ($matches[0]) is the whole string that was matched, and $matches[1] will contain 'nota', the part captured by the parentheses, because it is the only text between the tags that comes directly before 789.
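Here is a sketch of what ends up in $matches. Two things in it are assumptions of mine: the $html is made up, and the pattern starts with a literal <td> so that the captured part begins right after the opening tag.

```php
<?php
// Hypothetical sample markup, invented for this sketch.
$html = "<tr><td>test123</td></tr>\n"
      . "<tr><td>nota789</td></tr>\n";

// The parentheses capture whatever sits between <td> and 789</td>.
$result = preg_match('/<td>(.*)789<\/td>/', $html, $matches);

echo $result, "\n";      // 1
echo $matches[0], "\n";  // <td>nota789</td>
echo $matches[1], "\n";  // nota
```

If the pattern has more than one pair of parentheses, each pair gets its own slot: $matches[2], $matches[3], and so on, counted by the position of the opening parenthesis.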
The parentheses are not limited to (.*); you can also use things like ([0-9]) to capture a single digit, ([a-z]*) to capture 0 or more characters in the range a-z, and so on…
In my experience, preg_match() has a hard time when the text to be matched spans multiple lines. (One common cause is that '.' does not match a newline unless the s modifier is added after the closing delimiter.) Good luck, and feel free to post your questions or fix any mistakes that I might have made.
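One likely reason for the multi-line trouble is that '.' stops at line breaks. The sketch below shows the same pattern with and without the s modifier (the letter after the closing '/'); the $cell string is made up, and this may not be the only cause of the problems described above.

```php
<?php
// Hypothetical cell whose content is split over two lines.
$cell = "<td>line one\nline two</td>";

// Without a modifier, '.' will not cross the newline, so nothing matches.
$plain = preg_match('/<td>(.*)<\/td>/', $cell);

// With the s modifier, '.' also matches newlines.
$dotAll = preg_match('/<td>(.*)<\/td>/s', $cell, $m);

echo $plain, ' ', $dotAll, "\n"; // 0 1
echo $m[1], "\n";                // prints both lines of the cell
```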
ghost 18 years ago
I will post a more advanced article of regex including modifiers and such when I get the chance. Hope this helps people out a bit.
ghost 18 years ago
For some reason… the forward slashes and such are not escaped anymore…. I think it might be because of the php script thinking they are being escaped…
ghost 18 years ago
http://rasterized.net:8080/regex.txt <-- I uploaded it to my server here; this copy has the actual escaped characters and such…
ghost 18 years ago
Just like to say that the PHP cheat sheet has some useful notes about RegEx syntax and regexp.net is another useful resource
ghost 18 years ago
Just for the record: you can't use POSIX-style regexes with preg_match(), only with ereg() or eregi() :) Nice article ;)