Terence Eden’s Blog<p><strong>Stop using preg_* on HTML and start using \Dom\HTMLDocument instead</strong></p><p><a href="https://shkspr.mobi/blog/2025/05/stop-using-preg_-on-html-and-use-domhtmldocument/" rel="nofollow noopener noreferrer" translate="no" target="_blank"><span class="invisible">https://</span><span class="ellipsis">shkspr.mobi/blog/2025/05/stop-</span><span class="invisible">using-preg_-on-html-and-use-domhtmldocument/</span></a></p><p>It is a truth universally acknowledged that a programmer in possession of some HTML will eventually try to parse it with a regular expression.</p><p><a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454" rel="nofollow noopener noreferrer" target="_blank">This makes many people very angry and is widely regarded as a bad move</a>.</p><p>In the bad old days, it was somewhat understandable for a PHP coder to run a quick-and-dirty <code>preg_replace()</code> on a scrap of code. They could probably control the input, and there wasn't a great way to manipulate an HTML5 DOM.</p><p>Rejoice sinners! PHP 8.4 is here to save your wicked souls. There's a new <a href="https://wiki.php.net/rfc/domdocument_html5_parser" rel="nofollow noopener noreferrer" target="_blank">HTML5 Parser</a> which makes <em>everything</em> better and stops you having to write brittle regexen.</p><p>Here are a few tips - mostly notes to myself - which I hope you'll find useful.</p><p><strong>Sanitise HTML</strong></p><p>This is the most basic example. 
This loads HTML into a DOM, tries to fix all the mistakes it finds, and then spits out the result.</p><pre><span class=""><span> PHP</span></span><code><span>$html</span> = <span>'<p id="yes" id="no"><em>Hi</div><h2>Test</h3><img />'</span>;
<span>$dom</span> = <span>\Dom\HTMLDocument</span>::<span>createFromString</span>( <span>$html</span>, <span>LIBXML_NOERROR</span> | <span>LIBXML_HTML_NOIMPLIED</span>, <span>"UTF-8"</span> );
<span>echo</span> <span>$dom</span>-><span>saveHTML</span>();</code></pre><p>It uses <code>LIBXML_HTML_NOIMPLIED</code> because we don't want a full HTML document with a doctype, head, body, etc.</p><p>If you want <a href="https://shkspr.mobi/blog/2025/04/introducing-pretty-print-html-for-php-8-4/" rel="nofollow noopener noreferrer" target="_blank">Pretty Printing, you can use my library</a>.</p><p><strong>Get the plain text</strong></p><p>OK, so you've got the DOM. How do you get the text of the body without any of the surrounding HTML?</p><pre><span class=""><span> PHP</span></span><code><span>$html</span> = <span>'<p><em>Hello</em> World!</p>'</span>;
<span>$dom</span> = <span>\Dom\HTMLDocument</span>::<span>createFromString</span>( <span>$html</span>, <span>LIBXML_NOERROR</span>, <span>"UTF-8"</span> );
<span>echo</span> <span>$dom</span>-><span>body</span>-><span>textContent</span>;</code></pre><p>Note, this doesn't replace images with their alt text.</p><p><strong>Get a single element</strong></p><p>You can use <a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector" rel="nofollow noopener noreferrer" target="_blank">the same <code>querySelector()</code> function as you do in JavaScript</a>!</p><pre><span class=""><span> PHP</span></span><code><span>$element</span> = <span>$dom</span>-><span>querySelector</span>( <span>"h2"</span> );</code></pre><p>That returns a <em>pointer</em> to the element. 
Which means you can run:</p><pre><span class=""><span> PHP</span></span><code><span>$element</span>-><span>setAttribute</span>( <span>"id"</span>, <span>"interesting"</span> );
<span>echo</span> <span>$dom</span>-><span>querySelector</span>( <span>"h2"</span> )-><span>attributes</span>[<span>"id"</span>]-><span>value</span>;</code></pre><p>And you will see that the DOM has been manipulated!</p><p><strong>Search for multiple elements</strong></p><p>Suppose you have a bunch of headings and you want to get all of them. You can use <a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll" rel="nofollow noopener noreferrer" target="_blank">the same <code>querySelectorAll()</code> function as you do in JavaScript</a>!</p><p>To get all headings, in the order they appear:</p><pre><span class=""><span> PHP</span></span><code><span>$headings</span> = <span>$dom</span>-><span>querySelectorAll</span>( <span>"h1, h2, h3, h4, h5, h6"</span> );
<span>foreach</span> ( <span>$headings</span> <span>as</span> <span>$heading</span> ) {
    <span>// Do something</span>
}</code></pre><p><strong>Advanced Search</strong></p><p>Suppose you have a bunch of links and you want to find only those whose <code>href</code> starts with "https://example.com/test/". Again, you can use <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors" rel="nofollow noopener noreferrer" target="_blank">the same attribute selectors</a> as you would elsewhere:</p><pre><span class=""><span> PHP</span></span><code><span>$dom</span>-><span>querySelectorAll</span>( <span>"a[href^=https\:\/\/example\.com\/test\/]"</span> );</code></pre><p><strong>Replacing content</strong></p><p>Sadly, it isn't quite as simple as setting the <code>innerHTML</code>. Each search returns a node. That node may have <em>children</em>. 
Those children will also be nodes which, themselves, may have children, and so on.</p><p>Let's take a simple example:</p><pre><span class=""><span> PHP</span></span><code><span>$html</span> = <span>'<h2>Hello</h2>'</span>;
<span>$dom</span> = <span>\Dom\HTMLDocument</span>::<span>createFromString</span>( <span>$html</span>, <span>LIBXML_NOERROR</span> | <span>LIBXML_HTML_NOIMPLIED</span>, <span>"UTF-8"</span> );
<span>$element</span> = <span>$dom</span>-><span>querySelector</span>( <span>"h2"</span> );
<span>$element</span>-><span>childNodes</span>[0]-><span>textContent</span> = <span>"Goodbye"</span>;
<span>echo</span> <span>$dom</span>-><span>saveHTML</span>();</code></pre><p>That changes "Hello" to "Goodbye".</p><p>But what if the element has child nodes?</p><pre><span class=""><span> PHP</span></span><code><span>$html</span> = <span>'<h2>Hello <em>friend</em></h2>'</span>;
<span>$dom</span> = <span>\Dom\HTMLDocument</span>::<span>createFromString</span>( <span>$html</span>, <span>LIBXML_NOERROR</span> | <span>LIBXML_HTML_NOIMPLIED</span>, <span>"UTF-8"</span> );
<span>$element</span> = <span>$dom</span>-><span>querySelector</span>( <span>"h2"</span> );
<span>$element</span>-><span>childNodes</span>[0]-><span>textContent</span> = <span>"Goodbye"</span>;
<span>echo</span> <span>$dom</span>-><span>saveHTML</span>();</code></pre><p>That outputs <code><h2>Goodbye<em>friend</em></h2></code>. Only the first text node was replaced, so think carefully about the structure of the DOM and what you want to replace.</p><p><strong>Adding a new node</strong></p><p>This one is tricky! Let's suppose you have this:</p><pre><span class=""><span> HTML</span></span><code><<span>div</span> <span>id</span>="page">
   <<span>main</span>>
      <<span>h2</span>>Hello</<span>h2</span>></code></pre><p>You want to add an <code><h1></code> <em>before</em> the <code><h2></code>. 
Here's how to do this.</p><p>First, you need to construct the DOM:</p><pre><span class=""><span> PHP</span></span><code><span>$html</span> = <span>'<div id="page"><main><h2>Hello</h2>'</span>;
<span>$dom</span> = <span>\Dom\HTMLDocument</span>::<span>createFromString</span>( <span>$html</span>, <span>LIBXML_NOERROR</span> | <span>LIBXML_HTML_NOIMPLIED</span>, <span>"UTF-8"</span> );</code></pre><p>Next, you need to construct <em>an entirely new</em> DOM for your new node.</p><pre><span class=""><span> PHP</span></span><code><span>$newHTML</span> = <span>"<h1>Title</h1>"</span>;
<span>$newDom</span> = <span>\Dom\HTMLDocument</span>::<span>createFromString</span>( <span>$newHTML</span>, <span>LIBXML_NOERROR</span> | <span>LIBXML_HTML_NOIMPLIED</span>, <span>"UTF-8"</span> );</code></pre><p>Next, extract the new element from the new DOM, and import it into the original DOM:</p><pre><span class=""><span> PHP</span></span><code><span>$element</span> = <span>$dom</span>-><span>importNode</span>( <span>$newDom</span>-><span>firstChild</span>, <span>true</span> );</code></pre><p>The element now needs to be inserted <em>somewhere</em> in the original DOM. 
In this case, get the <code>h2</code>, then tell its parent node to insert the new node <em>before</em> the <code>h2</code>:</p><pre><span class=""><span> PHP</span></span><code><span>$h2</span> = <span>$dom</span>-><span>querySelector</span>( <span>"h2"</span> );
<span>$h2</span>-><span>parentNode</span>-><span>insertBefore</span>( <span>$element</span>, <span>$h2</span> );
<span>echo</span> <span>$dom</span>-><span>saveHTML</span>();</code></pre><p>Out pops:</p><pre><span class=""><span> HTML</span></span><code><<span>div</span> <span>id</span>="page">
   <<span>main</span>>
      <<span>h1</span>>Title</<span>h1</span>>
      <<span>h2</span>>Hello</<span>h2</span>>
   </<span>main</span>>
</<span>div</span>></code></pre><p>An alternative is to use <a href="https://www.php.net/manual/en/domnode.appendchild.php" rel="nofollow noopener noreferrer" target="_blank">the <code>appendChild()</code> method</a>. Note that it appends the node to the <em>end</em> of the parent's children. For example:</p><pre><span class=""><span> PHP</span></span><code><span>$div</span> = <span>$dom</span>-><span>querySelector</span>( <span>"#page"</span> );
<span>$div</span>-><span>appendChild</span>( <span>$element</span> );
<span>echo</span> <span>$dom</span>-><span>saveHTML</span>();</code></pre><p>Produces:</p><pre><span class=""><span> HTML</span></span><code><<span>div</span> <span>id</span>="page">
   <<span>main</span>>
      <<span>h2</span>>Hello</<span>h2</span>>
   </<span>main</span>>
   <<span>h1</span>>Title</<span>h1</span>>
</<span>div</span>></code></pre><p><strong>And more?</strong></p><p>I've only scratched the surface of what the new 8.4 HTML Parser can do. 
I've already rewritten lots of my yucky old <code>preg_</code> code into something which (hopefully) is less likely to break in catastrophic ways.</p><p>If you have any other tips, please leave a comment.</p><p><a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/html/" target="_blank">#HTML</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/html5/" target="_blank">#HTML5</a> <a rel="nofollow noopener noreferrer" class="hashtag u-tag u-category" href="https://shkspr.mobi/blog/tag/php/" target="_blank">#php</a></p>
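<p>One more sketch, following on from the note in "Get the plain text" that <code>textContent</code> doesn't replace images with their alt text. This is a rough idea rather than a definitive approach: <code>textWithAlt()</code> is a made-up helper name, and it assumes PHP 8.4 or later. It swaps every <code>img</code> for a text node holding its alt text, and only then reads the text of the body.</p>

```php
<?php
// Sketch only: textWithAlt() is a hypothetical helper, not part of PHP or the post above.
// Assumes PHP 8.4+, which provides \Dom\HTMLDocument.
function textWithAlt( string $html ): string {
    // No LIBXML_HTML_NOIMPLIED here, so the parser adds <html>, <head> and <body>.
    $dom = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, "UTF-8" );

    // Copy the matches into an array first, so the loop is unaffected by the DOM changes.
    $images = iterator_to_array( $dom->querySelectorAll( "img" ) );

    foreach ( $images as $img ) {
        // getAttribute() returns null when the attribute is missing.
        $alt = $img->getAttribute( "alt" ) ?? "";
        // Swap the <img> element for a plain text node containing its alt text.
        $img->replaceWith( $dom->createTextNode( $alt ) );
    }

    return $dom->body->textContent;
}

echo textWithAlt( '<p>Hello <img src="world.png" alt="World!"></p>' );
```

<p>Images with no alt attribute simply vanish from the text. If you would rather see a placeholder, change the fallback string.</p>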