## Stop using preg_* on HTML and start using \Dom\HTMLDocument instead
https://shkspr.mobi/blog/2025/05/stop-using-preg_-on-html-and-use-domhtmldocument/
It is a truth universally acknowledged that a programmer in possession of some HTML will eventually try to parse it with a regular expression.
This makes many people very angry and is widely regarded as a bad move.
In the bad old days, it was somewhat understandable for a PHP coder to run a quick-and-dirty `preg_replace()` on a scrap of code. They probably could control the input and there wasn't a great way to manipulate an HTML5 DOM.
Rejoice, sinners! PHP 8.4 is here to save your wicked souls. There's a new HTML5 Parser which makes _everything_ better and stops you having to write brittle regexen.
Here are a few tips - mostly notes to myself - but I hope you'll find them useful.
## Sanitise HTML
This is the most basic example. This loads HTML into a DOM, tries to fix all the mistakes it finds, and then spits out the result.
```php
$html = '<p id="yes" id="no"><em>Hi</div><h2>Test</h3><img />';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
echo $dom->saveHTML();
```
It uses `LIBXML_HTML_NOIMPLIED` because we don't want a full HTML document with a doctype, head, body, etc.
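To see what the flag changes, here is a small sketch which parses the same fragment with and without `LIBXML_HTML_NOIMPLIED`. The `$fragment` variable and its contents are made up for illustration, and I haven't shown the serialised output because the exact markup may differ on your setup.

```php
<?php
// Hypothetical fragment, used only for this comparison.
$fragment = '<p>Hi</p>';

// Without the flag, the parser supplies the implied <html>, <head> and <body>.
$full = \Dom\HTMLDocument::createFromString( $fragment, LIBXML_NOERROR, "UTF-8" );

// With the flag, only the fragment itself should come back.
$bare = \Dom\HTMLDocument::createFromString( $fragment, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );

$fullHtml = $full->saveHTML();
$bareHtml = $bare->saveHTML();

echo $fullHtml, PHP_EOL;
echo $bareHtml, PHP_EOL;
```

The first line printed should include a `<body>` wrapper; the second should not.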
If you want Pretty Printing, you can use my library.
## Get the plain text
OK, so you've got the DOM. How do you get the text of the body without any of the surrounding HTML?
```php
$html = '<p><em>Hello</em> World!</p>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );
echo $dom->body->textContent;
```
Note, this doesn't replace images with their alt text.
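If you do want the alt text, one possible approach is to swap each image for its `alt` attribute before reading `textContent`. This is only a sketch: the sample HTML is invented, and images without an `alt` attribute are simply left alone.

```php
<?php
$html = '<p>Look: <img src="cat.jpg" alt="a cat"> here</p>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

// Copy the matches into an array first, so replacing nodes cannot disturb the loop.
$images = iterator_to_array( $dom->querySelectorAll( "img[alt]" ) );

foreach ( $images as $img ) {
    // Passing a string to replaceWith() inserts it as a text node.
    $img->replaceWith( $img->getAttribute( "alt" ) );
}

$text = $dom->body->textContent;
echo $text;
```

Note that this changes the DOM itself, so do it on a throwaway copy if you still need the images afterwards.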
## Get a single element
You can use the same `querySelector()` function as you do in JavaScript!
```php
$element = $dom->querySelector( "h2" );
```
That returns a _pointer_ to the element in the DOM, not a copy, which means you can run:
```php
$element->setAttribute( "id", "interesting" );
echo $dom->querySelector( "h2" )->attributes["id"]->value;
```
And you will see that the DOM has been manipulated!
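One thing to watch: `querySelector()` returns `null` when nothing matches, so it is worth checking before calling methods on the result. A minimal sketch, using a made-up one-line document:

```php
<?php
$html = '<h2>Test</h2>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );

// There is no <h3> in this document, so this comes back as null.
$missing = $dom->querySelector( "h3" );
if ( $missing === null ) {
    echo "No h3 found", PHP_EOL;
}

$element = $dom->querySelector( "h2" );
if ( $element !== null ) {
    $element->setAttribute( "id", "interesting" );
}

// A second, independent query sees the change. The ?-> guards against null.
$id = $dom->querySelector( "h2" )?->getAttribute( "id" );
echo $id, PHP_EOL;
```

Here `getAttribute()` is used instead of the `attributes` array; both read the same attribute.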
## Search for multiple elements
Suppose you have a bunch of headings and you want to get all of them. You can use the same `querySelectorAll()` function as you do in JavaScript!
To get all headings, in the order they appear:
```php
$headings = $dom->querySelectorAll( "h1, h2, h3, h4, h5, h6" );
foreach ( $headings as $heading ) {
    // Do something
}
```
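As an example of "do something", here is a rough sketch which turns the headings into an indented outline. The sample HTML and the `$outline` array are my own invention for the demo.

```php
<?php
$html = '<h1>Title</h1><p>Intro</p><h2>First</h2><h3>Detail</h3><h2>Second</h2>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

$outline = [];
foreach ( $dom->querySelectorAll( "h1, h2, h3, h4, h5, h6" ) as $heading ) {
    // localName is the lowercase tag name, e.g. "h2"; drop the "h" to get the level.
    $level     = (int) substr( $heading->localName, 1 );
    $outline[] = [ $level, trim( $heading->textContent ) ];
}

// Indent each heading by two spaces per level below h1.
foreach ( $outline as [ $level, $text ] ) {
    echo str_repeat( "  ", $level - 1 ), $text, PHP_EOL;
}
```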
## Advanced Search
Suppose you have a bunch of links and you want to find only those which point to "example.com/test/". Again, you can use the same attribute selectors as you would in CSS. Here, `^=` matches any `href` which _starts with_ the given value:
```php
$dom->querySelectorAll( "a[href^=https\:\/\/example\.com\/test\/]" );
```
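If all those backslashes hurt your eyes, quoting the attribute value inside the selector should do the same job. A small sketch with three invented links, only one of which starts with the right prefix:

```php
<?php
$html = '<a href="https://example.com/test/a">A</a>'
      . '<a href="https://example.com/other">B</a>'
      . '<a href="http://example.com/test/c">C</a>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

// Single-quoted PHP string, double-quoted CSS value: no escaping needed.
$links = $dom->querySelectorAll( 'a[href^="https://example.com/test/"]' );

$hrefs = [];
foreach ( $links as $link ) {
    $hrefs[] = $link->getAttribute( "href" );
}
print_r( $hrefs );
```

Only the first link is collected; the second has the wrong path and the third uses `http` rather than `https`.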
## Replacing content
Sadly, it isn't quite as simple as setting the `innerHTML`. Each search returns a node. That node may have _children_. Those children will also be nodes which, themselves, may have children, and so on.
Let's take a simple example:
```php
$html = '<h2>Hello</h2>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
$element = $dom->querySelector( "h2" );
$element->childNodes[0]->textContent = "Goodbye";
echo $dom->saveHTML();
```
That changes "Hello" to "Goodbye".
But what if the element has child nodes?
```php
$html = '<h2>Hello <em>friend</em></h2>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
$element = $dom->querySelector( "h2" );
$element->childNodes[0]->textContent = "Goodbye";
echo $dom->saveHTML();
```
That outputs `<h2>Goodbye<em>friend</em></h2>`. The space has vanished because the first child was the text node `"Hello "`, trailing space included, and the whole node was replaced. So think carefully about the structure of the DOM and what you want to replace.
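Here is a sketch of two ways you might handle that, depending on whether the child elements should survive. The variable names `$keepChildren` and `$dropChildren` are just labels for this demo.

```php
<?php
$html = '<h2>Hello <em>friend</em></h2>';

// Option 1: replace only the leading text node, keeping the space before <em>.
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
$dom->querySelector( "h2" )->firstChild->textContent = "Goodbye ";
$keepChildren = $dom->saveHTML();

// Option 2: setting textContent on the element itself discards every child.
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
$dom->querySelector( "h2" )->textContent = "Goodbye";
$dropChildren = $dom->saveHTML();

echo $keepChildren, PHP_EOL, $dropChildren, PHP_EOL;
```

The first result keeps the `<em>`; the second is left with nothing but the new text inside the `<h2>`.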
## Adding a new node
This one is tricky! Let's suppose you have this:
```html
<div id="page">
   <main>
      <h2>Hello</h2>
```
You want to add an `<h1>` _before_ the `<h2>`. Here's how to do this.
First, you need to construct the DOM:
```php
$html = '<div id="page"><main><h2>Hello</h2>';
$dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
```
Next, you need to construct _an entirely new_ DOM for your new node.
```php
$newHTML = "<h1>Title</h1>";
$newDom = \Dom\HTMLDocument::createFromString( $newHTML, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );
```
Next, extract the new element from the new DOM, and import it into the original DOM:
```php
$element = $dom->importNode( $newDom->firstChild, true );
```
The element now needs to be inserted _somewhere_ in the original DOM. In this case, get the `h2`, tell its parent node to insert the new node _before_ the `h2`:
```php
$h2 = $dom->querySelector( "h2" );
$h2->parentNode->insertBefore( $element, $h2 );
echo $dom->saveHTML();
```
Out pops (indented here for readability):
```html
<div id="page">
   <main>
      <h1>Title</h1>
      <h2>Hello</h2>
   </main>
</div>
```
An alternative is to use the `appendChild()` method. Note that it appends it to the _end_ of the children. For example:
```php
$div = $dom->querySelector( "#page" );
$div->appendChild( $element );
echo $dom->saveHTML();
```
Produces:
```html
<div id="page">
   <main>
      <h2>Hello</h2>
   </main>
   <h1>Title</h1>
</div>
```
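For a simple element like this, you may be able to skip the second DOM entirely. This sketch creates the `<h1>` directly in the original document with `createElement()` and places it with `before()`; it is an alternative I'm suggesting, not a replacement for the import approach, which is probably less work for bigger chunks of markup.

```php
<?php
$html = '<div id="page"><main><h2>Hello</h2>';
$dom  = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR | LIBXML_HTML_NOIMPLIED, "UTF-8" );

// Create the element in the document which will own it, so no import is needed.
$h1 = $dom->createElement( "h1" );
$h1->textContent = "Title";

// before() inserts the new node as the previous sibling of the <h2>.
$h2 = $dom->querySelector( "h2" );
$h2->before( $h1 );

$result = $dom->saveHTML();
echo $result;
```

The new `<h1>` ends up inside `<main>`, immediately ahead of the `<h2>`.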
## And more?
I've only scratched the surface of what the new 8.4 HTML Parser can do. I've already rewritten lots of my yucky old `preg_` code to something which (hopefully) is less likely to break in catastrophic ways.
If you have any other tips, please leave a comment.