Tuesday, 18 September 2018

Extract the first paragraph text from a web page with PHP

This post looks at how to extract the first paragraph from an HTML page with PHP. It uses the strpos and substr functions to find the first <p> tag and the </p> tag that follows it, then gets the content between them.

Using strpos and substr

Assume the content to extract the paragraph from is in the variable $html (it may have come from a file, a database, a template, or a page downloaded from an external website). The following code works out the position of the first <p> tag and of the first </p> tag after it, then gets all the HTML between them, including the opening and closing tags:
$start = strpos($html, '<p>');
$end = strpos($html, '</p>', $start);
$paragraph = substr($html, $start, $end-$start+4);
Line 1 gets the position of the first opening <p> tag.
Line 2 gets the position of the first </p> after the first opening <p>.
Line 3 then uses substr to get the HTML. The third parameter is the number of characters to copy. It is calculated by subtracting $start from $end and adding the length of "</p>" (4 characters) so that the closing tag is included in the extracted HTML.
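
The three lines above assume both tags are present. strpos returns false when it cannot find the search string, so a slightly more defensive version might look like the sketch below. The function name firstParagraph is made up for this example, and the nullable return type assumes PHP 7.1 or later. Like the original code, it only matches a bare <p> tag with no attributes.

```php
<?php
// Hypothetical helper (not from the original post): returns the first
// <p>...</p> block including its tags, or null if no complete paragraph is found.
function firstParagraph(string $html): ?string
{
    // strpos returns false (not 0) when the tag is missing, so compare strictly.
    $start = strpos($html, '<p>');
    if ($start === false) {
        return null;
    }

    // Search for the closing tag only after the opening tag.
    $end = strpos($html, '</p>', $start);
    if ($end === false) {
        return null;
    }

    // Add the length of "</p>" so the closing tag is included in the result.
    return substr($html, $start, $end - $start + strlen('</p>'));
}

$html = '<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>';
echo firstParagraph($html), "\n"; // prints "<p>First paragraph.</p>"
```

Returning null lets the calling code decide what to do when a page has no paragraph, rather than silently extracting the wrong part of the page.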

Converting to plain text

If the extracted paragraph needs to be in plain text rather than HTML, use the following to remove the HTML tags and convert HTML entities into their plain-text characters:
$paragraph = html_entity_decode(strip_tags($paragraph));
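
As a small worked example of the line above, the sketch below uses a made-up sample paragraph that contains both real tags and an escaped tag. The flags and charset arguments are optional and are an addition here, not part of the original line; they are passed explicitly so the result does not depend on the defaults of the PHP version in use.

```php
<?php
// Sample input invented for this example: a real <p> element whose text
// contains an escaped tag (&lt;p&gt;) and an escaped ampersand.
$paragraph = '<p>Use the &lt;p&gt; tag for text &amp; more</p>';

// Strip the real tags first, then decode the entities. Doing it in this
// order means the escaped &lt;p&gt; survives as literal text.
$text = html_entity_decode(strip_tags($paragraph), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n"; // prints "Use the <p> tag for text & more"
```

The order of the two calls matters: decoding the entities before calling strip_tags would turn the escaped tag into a real one, and it would then be stripped along with the others.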
