August 7, 2024
by Matt Raines

When a bug slips through the net

A graph of web server error log entries on various servers. One line follows a regular pattern of increasing gradually throughout the day and resetting overnight, but on two days the peak is much larger than usual.

Automated testing is a great thing and something we’re always looking at doing much, much more of. But sometimes it’s just unavoidable that errors in third party systems slip through and make it onto the production website. That’s especially the case for minor bugs that generate warning notices but have no noticeable impact on the site functionality. We can pick up on and deal with these snags before they become bigger problems thanks to our world class server monitoring, and open source means we can fix the problem upstream much more quickly.

Our server monitoring is provided by a on premise installation of Zabbix which we run at a third site outside our regular network. It monitors more than 18,000 data points across our server array, scanning 30 values a second and notifying us if one of 10,000 alert conditions is triggered, via Teams, Slack, SMS or email depending on severity. After some years of tweaking, its highly configurable dashboard now contains a beautiful array of graphs, gauges and blinkenlights plotting the 17 million values it has logged that it’s quite pleasurable to leave open on a spare screen. Clients can be notified too, and granted access to see only data relevant to the servers which host their specific applications.

So it was that a couple of weeks back on a Wednesday afternoon Gary let me know about a surprising spike in the size of the web server error log on one of our machines. Tracking error log file size and alerting when it’s larger than usual is a really effective way of getting an early notification that something’s slipped through the net. The log files are archived daily so the typical pattern is a gradual rise during the day and a reset overnight, but the incline that day was much steeper than usual.

Diagnosis

The error log is a text file where each line represents a reported problem. Many entries are to be expected, when visitors try to access missing pages or request information they are not entitled to view. Others report notices about unexpected situations that aren’t sufficient to stop code execution but should be investigated. Each line has the format

[date and time] [module:severity] [pid ID:tid ID] [client IP address:port] the actual error message

To find out which new messages are filling the file, I’m not interested in any of the information in square brackets. I just want to see which error messages appear the most frequently. Some entries don’t quite follow the same syntax so my initial attempt using cut -d ']' -f 5 didn’t quite – cut – it. I had to use sed to remove everything up to the final ] instead and get a long list of error messages without time, process, or connection information. Then sort and count before sorting again in numerical order to get the most common entries last.

sed 's/.*\] //' /var/log/apache2/error.log |sort |uniq -c |sort -n

On a typical day the most common entry might appear a few hundred times and be something like

657 AH01630: client denied by server configuration: .../xmlrpc.php

This particular Wednesday had a less usual most popular entry

4364 AH01071: Got error 'PHP message: PHP Warning: Undefined array key "host" in .../wp-includes/media.php on line 1372'

This is a built-in part of WordPress core and we had just recently updated to version 6.6.0. And because it’s Open Source, of course we can take a look at line 1372 and find out what it’s doing.

It’s a pretty simple block which has the admirable job of converting an image URL beginning http:// to one beginning https://.

/*
 * If currently on HTTPS, prefer HTTPS URLs when we know they're supported by the domain
 * (which is to say, when they share the domain name of the current request).
 */
if ( is_ssl() && ! str_starts_with( $image_baseurl, 'https' ) ) {
    // Since the `Host:` header might contain a port we should
    // compare it against the image URL using the same port.
    $parsed = parse_url( $image_baseurl );
    $domain = $parsed['host'];
    if ( isset( $parsed['port'] ) ) {
        $domain .= ':' . $parsed['port'];
    }
    if ( $_SERVER['HTTP_HOST'] === $domain ) {
        $image_baseurl = set_url_scheme( $image_baseurl, 'https' );
    }
}

That didn’t really leave me any the wiser but a bit of stepping through the code with XDebug later and I realised we have a relative URL for our WP_CONTENT_URL, something which is not officially supported but which makes it much easier to host separate instances of the websites for testing, development, and QA. A relative URL begins with “/”, which is not “https”, so the block gets triggered, can’t parse the URL, and outputs this PHP warning. There’s no actual damage done, the website loads fine, but it does rather fill the error log.

Solutions

I can’t edit the WordPress core code directly very easily without having issues whenever there’s an upstream change. So it took me a little while to come up with a workaround solution. Which is to convert WP_CONTENT_URL to an absolute URL if and only if it’s being called from this function. We get to keep all the benefits of relative URLs but this function can then work fine.

// wp_calculate_image_srcset does not work with relative WP_CONTENT_URL because it checks for
// URLs not beginning with https:// to determine whether to upgrade the paths to SSL.
// Add the site URL to the baseurl for uploads but only if called from wp_calculate_image_size.
// This allows us to continue using relative URLs elsewhere but does have the two downsides that:
//  1. it is rather slow
//  2. rendered HTML now includes absolute URLs
add_filter('upload_dir', function($upload_dir){
    if (in_array('wp_calculate_image_srcset', array_column(debug_backtrace(), 'function'))) {
        if (!str_contains($upload_dir['baseurl'], '://')) {
            $upload_dir['baseurl'] =
                trailingslashit(get_option('siteurl'))
                . ltrim($upload_dir['baseurl'], '/');
        }
    }
    return $upload_dir;
});

Which I committed, tested and put live and duly declared the problem fixed.

It took a further nudge from Gary to remind me the problem wasn’t really fixed, and I ought to file a bug report or a pull request with WordPress to notify them about the problem. And lo and behold, two days later, the problem is fixed and scheduled for inclusion in the next maintenance release of WordPress, version 6.6.2. Open Source for the win!