PHP - Regex - Quantifier limit?


ghost's Avatar

I posted this on phpfreaks, and I'm going to post it here as well.

I am using preg_match to parse large dynamic HTML pages. I have a problem where preg_match fails at a certain point when using a lazy quantifier (it also fails when the entire expression is set as ungreedy).

Here is sample code illustrating the problem. test.txt is a 1 megabyte file with the letter 'A' repeating for the first 500kb. The text "I WILL NOT REACH HERE" occurs a little after the 500kb mark.

<?php
$handle = fopen("test.txt", "r");
$var = fread($handle, 1048576);
$arr = array();
echo preg_match('/.+?I WILL NOT REACH/s', $var, $arr); //will output 0
?>

Why does it fail? Can anyone suggest a workaround for this? It seems to fail around the 99999th character mark.

To give you an example of the situation where I use lazy quantifiers:

Consider the HTML code:

<div class="topictitle">title</div>(dynamically generated data)<div class="post">first post</div>

I would use a regular expression to extract the topic title and the first post via backreferences.

preg_match('/opictitle">([^<]+)<\/div>.+?ost">(.+?)<\/div>/s', $data, $output); //something similar to this.

The goal is to extract only the topic title and the first post. I cannot use greedy quantifiers because they would give the last post on the page. Regardless, even with greedy quantifiers, if there are more than 99999 characters AFTER the text being matched, it will also fail.

Can anyone suggest an alternative approach or maybe a workaround?

Thanks.

edit:smileys


ghost's Avatar

Would something like this help?


<?php

$example_post = '<div class="topictitle">title</div>(dynamically generated data)<div class="post">first post</div>';

preg_match('/<div class="topictitle">(.+?)<\/div>(.*?)<div class="post">(.+?)<\/div>/', $example_post, $output);

print_r($output);

?>


ghost's Avatar

themastersinner wrote: would something like this help ?

That just makes the dynamic content optional. You can assume there is always dynamic content. I should've been clearer about what I have tried:

.*?
.{1,}?
.+?
[^blah]+? (blah being a character guaranteed to not be in the text)
preg_match('/exp/U', $data, $out);

The problem occurs when the quantifier has to get through a large amount of text before the rest of the expression can match.


ghost's Avatar

Chinchilla3k wrote: <snip> Why does it fail? Can anyone suggest a work around for this? It seems to fail around the 99999'th character mark.. </snip>

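It looks like you're exceeding either the default pcre.backtrack_limit or pcre.recursion_limit. It's failing around the 99999 mark because the default limit is 100000. The limit counts matching steps rather than characters, but a lazy quantifier like .+? uses up roughly one step for every character it has to skip, so the two numbers end up nearly the same.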
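
A quick way to confirm which limit is being hit, as a minimal sketch: preg_last_error() reports why the last preg_* call gave up. The helper name explain_preg_failure and the generated string (standing in for test.txt) are assumptions for illustration. The sketch also sets pcre.backtrack_limit to 100000 explicitly and switches pcre.jit off, since newer PHP builds ship a higher default and a JIT that may not apply this limit in the same way.

```php
<?php
// Assumption: pin the limit to 100000 so the failure is repeatable.
ini_set('pcre.backtrack_limit', '100000');
// Assumption: on builds with a PCRE JIT, turn it off so the limit above applies.
ini_set('pcre.jit', '0');

// Hypothetical helper: map preg_last_error() codes to readable names.
function explain_preg_failure()
{
    $names = array(
        PREG_NO_ERROR              => 'no error',
        PREG_INTERNAL_ERROR        => 'internal error',
        PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exceeded',
        PREG_RECURSION_LIMIT_ERROR => 'recursion limit exceeded',
        PREG_BAD_UTF8_ERROR        => 'bad UTF-8',
    );
    $code = preg_last_error();
    return isset($names[$code]) ? $names[$code] : 'other error (' . $code . ')';
}

// Stand-in for test.txt: 500 KB of 'A', the marker text, then some filler.
$var = str_repeat('A', 512000) . 'I WILL NOT REACH HERE' . str_repeat('B', 1000);

$arr = array();
$result = preg_match('/.+?I WILL NOT REACH/s', $var, $arr);

// A result other than 1 does not have to mean "no match": ask why it stopped.
$reason = explain_preg_failure();
echo $reason, "\n";
```

If the reported reason is the backtrack limit, the expression itself is fine and only the amount of text being skipped is the problem.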

More info here: http://docs.php.net/manual/en/pcre.configuration.php

It looks as if both can be changed via the ini_set() function as of PHP version 5.2.0. Setting them much higher than 100000 may cause problems of its own, though (the manual warns that a very high recursion limit can use up the process stack and crash PHP). Have fun experimenting.
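
For the experimenting, here is a minimal sketch of raising the limit at runtime. It assumes the same setup as the test case in the first post, with a generated string standing in for test.txt and pcre.jit switched off so that the backtrack limit is the one in play. The value 2000000 is an arbitrary pick comfortably above the input size, not a recommendation.

```php
<?php
// Assumption: on builds with a PCRE JIT, turn it off so the limit applies.
ini_set('pcre.jit', '0');
// Start from 100000, set explicitly because newer builds default higher.
ini_set('pcre.backtrack_limit', '100000');

// Stand-in for test.txt: 500 KB of 'A', the marker text, then some filler.
$var = str_repeat('A', 512000) . 'I WILL NOT REACH HERE' . str_repeat('B', 1000);
$pattern = '/.+?I WILL NOT REACH/s';

// First attempt: the lazy quantifier needs more steps than the limit allows.
$before = preg_match($pattern, $var);
$before_error = preg_last_error();

// Raise the limit (arbitrary value above the input size) and try again.
ini_set('pcre.backtrack_limit', '2000000');
$after = preg_match($pattern, $var);
$after_error = preg_last_error();

echo 'before: ', var_export($before, true), ' (error code ', $before_error, ")\n";
echo 'after:  ', var_export($after, true), ' (error code ', $after_error, ")\n";
```

Raising the limit only moves the wall further out, so a bigger page can still hit it. If that feels fragile, another option is to find the start of each marker with strpos() and run the regular expression only on the small slice around it, so the quantifier never has to walk across the dynamic content.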