Specification vs. implementation

There are a lot of times when the specification says one thing but common implementations do another. Here are some especially common examples to watch out for.

Attribute quoting

According to the HTML specification, single-quoted and unquoted attributes are both perfectly valid; for example, these HTML fragments should be absolutely equivalent:

<a href="https://example.com/">Check out my website</a>
<a href='https://example.com/'>Check out my website</a>
<a href=https://example.com/>Check out my website</a>

However, there are many, many client implementations which expect any quoted attributes to be double-quoted, and even some which do not support unquoted attributes at all. For example, I’ve seen many implementations assume that a single-quoted attribute is an unquoted attribute whose value happens to include the quote characters, so they treat these as equivalent:

<a href='https://example.com/'>Check out my website</a>
<a href="'https://example.com/'">Check out my website</a>

which is to say, if <a href='https://example.com/'>Check out my website</a> appears on the page https://foo.example/~bob/homepage.html, the URL is then interpreted as being https://foo.example/~bob/'https://example.com/' (or https://foo.example/~bob/%27https://example.com/%27 if we’re being strict about URL-encoding).

Unquoted attributes are also often subject to all sorts of weird behavior, especially in how the entities within them get decoded.

Email systems are historically particularly bad about this; the impetus for this article was discovering that my email provider does not support single-quoted attributes, and goes so far as to convert the quotes to &#39; entities, causing even more problems downstream.

So, for maximum compatibility, it’s best to always use double-quoted attributes, regardless of what the HTML specification says.
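As a rough illustration of the spec-compliant behavior, here is a minimal sketch using Python’s standard html.parser, which reads all three quoting styles as the same value. The HrefCollector class is a made-up helper for this example and is not taken from any real client; the final line shows the percent-encoded form of the misread, quote-wrapped value.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


fragments = [
    '<a href="https://example.com/">Check out my website</a>',
    "<a href='https://example.com/'>Check out my website</a>",
    "<a href=https://example.com/>Check out my website</a>",
]

collector = HrefCollector()
for fragment in fragments:
    collector.feed(fragment)
collector.close()

# A compliant parser reports the same value for all three fragments.
print(collector.hrefs)

# What a strict URL-encoder makes of the quote-wrapped misreading.
print(quote("'https://example.com/'", safe=":/"))  # %27https://example.com/%27
```

Running something like this against the output of your own templates is a cheap way to confirm what a well-behaved client should see, even though it says nothing about the badly behaved ones.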

Protocol-relative URLs

Back in the day, it was pretty common for websites to serve things up in a mixture of HTTP (plaintext) and HTTPS (encrypted), and there were reasons to want static pages to link to external resources with the same scheme (for example, an HTTP page referring to an external image with an HTTP URL, but the HTTPS version of the page using an HTTPS URL for the image).

I cannot find any official HTML specification for these, but the commonly accepted standard, per the generic URI syntax (RFC 3986), treats a reference that begins with // as a network-path reference: the //hostname part supplies the authority (host), and only the scheme is taken from the current page. That is to say, a protocol-relative URL of //example.com/foo should be resolved by adding the current page’s scheme to the URL. For example, from https://example.com/~bob/homepage.html, a link to //website.example/meow.gif should be interpreted as https://website.example/meow.gif, while from http://example.com/~alice/ the same link would become http://website.example/meow.gif.

Unfortunately, a lot of software out there just sees that the link starts with a / and assumes it’s a site-relative URL instead, so from https://example.com/~bob/ it is interpreted as https://example.com/website.example/meow.gif.
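For comparison, a resolver that follows the generic URI rules inherits only the scheme. The sketch below uses urljoin from Python’s standard library with the same example URLs as above; it shows what the correct result looks like, not how any particular browser implements it.

```python
from urllib.parse import urljoin

link = "//website.example/meow.gif"

# The reference keeps its own host and path; only the scheme is inherited.
from_https = urljoin("https://example.com/~bob/homepage.html", link)
from_http = urljoin("http://example.com/~alice/", link)

print(from_https)  # https://website.example/meow.gif
print(from_http)   # http://website.example/meow.gif
```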

You can see how your browser implements such a link.

In any case, it’s better to be explicit about your URL scheme, and in general if a site supports https it’s best to just link to that version anyway.

Path coalescing

In most operating systems, there is a convention that .. refers to the parent directory, so for example the path /foo/bar/../baz is equivalent to /foo/baz. Additionally, . refers to the current directory, and / is seen as a path separator. So a path of /foo/bar/./baz is equivalent to /foo/bar/baz, for example.

Most web-based things will automatically apply these rules, even if it’s technically incorrect; for example, both Apache and nginx will internally rewrite the URL so that such paths are treated as equivalent before the request even touches the backing application, and even if they don’t, it seems that most application stacks will also pre-coalesce the URL.

But on the client side, browsers will also automatically do this path coalescing before they even form the URL to be requested; for example, ../blog/ and even https://junk.sockpuppet.band/foo/bar/../../songlets/ never even show up in the DOM with any of the .. components in most browsers (although I have seen some clients preserve them in some cases). Strictly speaking, those URLs shouldn’t even be equivalent, because foo/bar is a nonexistent path on both of those sites, so based purely on filesystem rules those should result in a 404 Not Found error. But things are being short-circuited for the sake of friendliness. And if you enter a URL manually, by copy-pasting e.g. https://beesbuzz.biz/foo/../code/ into your location bar, every modern browser I’ve tried will just automagically coalesce the path component.

(Note that how it coalesces // is inconsistent, in my experience; some browsers treat it as a subdirectory with an empty name, while others treat it as if it’s the same as a single /, the same as UNIX.)
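To make the two flavors of coalescing concrete, here is a minimal sketch. The remove_dot_segments function is my own simplified take on URL-style dot-segment removal and assumes an absolute path (one starting with /); posixpath.normpath from the standard library stands in for the UNIX-style behavior, which also collapses // and drops the trailing slash.

```python
import posixpath


def remove_dot_segments(path):
    """Resolve /./ and /../ in an absolute URL path, leaving // untouched."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            if is_last:
                output.append("")  # "/foo/." becomes "/foo/"
        elif segment == "..":
            if len(output) > 1:  # never pop the leading empty segment
                output.pop()
            if is_last:
                output.append("")  # "/foo/bar/.." becomes "/foo/"
        else:
            output.append(segment)
    return "/".join(output)


path = "/foo//moo/./bar/../baz/"
print(remove_dot_segments(path))  # /foo//moo/baz/
print(posixpath.normpath(path))   # /foo/moo/baz
```

The empty segment between the two slashes counts as a real segment in the URL-style version, so a later .. can remove it; the UNIX-style version throws it away up front.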

But it’s not necessarily the case that the path will be coalesced. For example, here’s a trivial WSGI application that just passes through a couple of things from the request:

app.py
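def app(environ, start_response):
    """Minimal WSGI app that echoes request details back as plain text."""
    start_response("200 OK", [("Content-Type", "text/plain")])

    # RAW_URI is a gunicorn-specific environ key holding the request target
    # as it was received; it is not part of the WSGI specification.
    for key in ('HTTP_HOST', 'RAW_URI'):
        yield f'{key}: {environ[key]}\n'.encode('utf-8')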

And here are some outputs when run through gunicorn; for starters, by default, curl coalesces /./ and /../ (but not //) client-side:

bean:~ $ curl -i http://localhost:8000/foo//moo/./bar/../baz/
HTTP/1.1 200 OK
Server: gunicorn
Date: Wed, 08 Apr 2026 20:36:44 GMT
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain

HTTP_HOST: localhost:8000
RAW_URI: /foo//moo/baz/
bean:~ $ curl -i http://localhost:8000/foo//moo/../../bar/
HTTP/1.1 200 OK
Server: gunicorn
Date: Wed, 08 Apr 2026 20:42:56 GMT
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain

HTTP_HOST: localhost:8000
RAW_URI: /foo/bar/

But a request that uses the path as-is (curl’s --path-as-is option) will still pass through unchanged, at least through gunicorn itself:

bean:~ $ curl -i http://localhost:8000/foo//moo/./bar/../baz/ --path-as-is
HTTP/1.1 200 OK
Server: gunicorn
Date: Wed, 08 Apr 2026 20:40:09 GMT
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain

HTTP_HOST: localhost:8000
RAW_URI: /foo//moo/./bar/../baz/

But in other testing I have found that, at least with a stack of nginx+gunicorn+Flask, the path coalescing takes place somewhere before it hits the actual application. (I do not have the patience to try to figure out where, exactly, not that it even matters.)

All this is to say, you cannot expect runs of multiple / or paths containing /./ or /../ to remain intact, even when the request is being made at the wire level, but you also cannot assume that the path will be pre-coalesced.
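If an application does care, one option is to coalesce the path itself rather than trusting whatever sits in front of it. The following is only a sketch under some assumptions: canonical_path and CoalescePath are hypothetical names, the UNIX-style rules (collapsing // as well) are my choice rather than anything a specification requires, and it only touches PATH_INFO.

```python
import posixpath


def canonical_path(path):
    """Collapse //, /./ and /../ UNIX-style, keeping a trailing slash."""
    normalized = posixpath.normpath(path)
    if normalized.startswith("//"):
        # normpath deliberately keeps exactly two leading slashes.
        normalized = "/" + normalized.lstrip("/")
    if path.endswith("/") and not normalized.endswith("/"):
        normalized += "/"
    return normalized


class CoalescePath:
    """Hypothetical WSGI middleware: normalize PATH_INFO for the wrapped app."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path:
            environ["PATH_INFO"] = canonical_path(path)
        return self.app(environ, start_response)
```

Wrapping an application as CoalescePath(app) means the application sees the same path whether or not a proxy or client already coalesced it.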

Case-sensitivity/case-folding

Case-sensitivity and lack thereof in the hostname is also something you cannot rely on:

$ curl -ivvv http://beesbuzz.biz/ | head
* Host beesbuzz.biz:80 was resolved.
[...]
> GET / HTTP/1.1
> Host: beesbuzz.biz
[...]
< link: <https://webmention.io/beesbuzz.biz/webmention>; rel="webmention"
< Link: <https://beesbuzz.biz/_tokens>; rel="token_endpoint"
[...]

$ curl -sivvv http://BeesBuzz.Biz/ | head
* Host BeesBuzz.Biz:80 was resolved.
[...]
> GET / HTTP/1.1
> Host: BeesBuzz.Biz
[...]
< link: <https://webmention.io/beesbuzz.biz/webmention>; rel="webmention"
< Link: <https://beesbuzz.biz/_tokens>; rel="token_endpoint"
[...]

In this case, note that curl preserved the case of the domain name in the Host: header, but something within the stack converted the hostname to all-lowercase (as can be seen in the link headers in the response). Whether this is happening in nginx or Flask is uncertain; I, again, do not feel a particular need to figure out where this takes place, although I’d assume it’s at the vhost level, and therefore in nginx. gunicorn, however, does preserve the case of the hostname (using the same minimal WSGI app as above):

bean:~ $ curl http://LocalHost:8000/
HTTP_HOST: LocalHost:8000
RAW_URI: /

So, as with path coalescing, you cannot assume that elements will be case-folded for you, but you also cannot assume that they won’t be.
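In practice that means folding the case yourself wherever you compare hostnames. A small sketch, where request_host is a hypothetical helper for a WSGI application and urlsplit comes from Python’s standard library:

```python
from urllib.parse import urlsplit


def request_host(environ):
    """Return the Host header from a WSGI environ, folded to lowercase."""
    return environ.get("HTTP_HOST", "").lower()


parts = urlsplit("http://BeesBuzz.Biz/")
print(parts.netloc)    # BeesBuzz.Biz (case kept as written)
print(parts.hostname)  # beesbuzz.biz (urllib lowercases this attribute)
print(request_host({"HTTP_HOST": "LocalHost:8000"}))  # localhost:8000
```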

And of course, path resolution for resources is up to the underlying implementation; a webserver running on macOS or Windows will (usually) treat /foo.jpg and /foo.JPG as the same resource, while on Linux, those are different resources. Of course the browser will hopefully treat them as separate for the purpose of caching, but: you cannot guarantee this.

As one of my college professors once said, “If it makes a difference whether something is case-sensitive or not, you have made a mistake.”

http vs. https in general

Nothing in the HTTP specification says that the same path on two different schemes will reflect the same resource; for example, http://example.com/ and https://example.com/ can very well be completely different websites. But I have seen plenty of browsers, web crawlers, and other software assume that they are one and the same!

At its most trivial, this very site will have slightly different content for the two versions; there are a handful of places where out of necessity, some links do not appear on the http version, or where they are rendered as absolute links and will match the original request’s scheme rather than directing to https.

But things can be a lot more complicated. For example, once upon a time I ran a site where the http version was an informational page and the https version was the webmail for the domain. It was silly to do it that way, and I stopped doing it when browsers started being “helpful” about automatically converting http URLs to https (not to mention when I stopped hosting my own email and switched to other hosting providers), but you absolutely cannot just assume that two pages will be the same despite different URL schemes.

(Also, remember that URL schemes other than http and https exist! FTP, Gopher, and others might have fallen out of fashion, but they still exist. Not to mention nascent protocols like Gemini.)

From a server implementation standpoint, you should assume that clients can and will treat differing schemes as identical, so if a website is available from both protocols, the content should match between them, and if something is only available via https, then an http request to the same resource should redirect to the https one.

But from a client standpoint, you really should consider the scheme to be a part of the URL.
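On the client side, one way to honor this is to keep the scheme in whatever key you use for caching or de-duplication. A minimal sketch follows; cache_key is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlsplit


def cache_key(url):
    """Build a lookup key that keeps the scheme as part of the identity."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port, parts.path or "/", parts.query)


print(cache_key("http://example.com/") == cache_key("https://example.com/"))  # False
print(cache_key("https://Example.com/") == cache_key("https://example.com/"))  # True
```

The hostname is case-folded by urlsplit, but the scheme, port, path, and query all stay significant.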

www. prefixes (and other subdomain issues)

Back in the early days of the Internet, it was common for a domain to host a whole bunch of different services, for example ftp.example.com, irc.example.com, mail.example.com, and so on, and many of these would even be hosted by separate physical servers with their own IP addresses. So when the web started up as an experimental thing it was super common to just spin up another server named www, and that was the one and only way that people would reference the website; there often wouldn’t even be a root domain A record.

In those days, the hostname used to resolve the site had no impact on the resource returned; in fact the Host: request header didn’t even exist, and it wasn’t until quite some time later that browsers started sending that, to support name-based virtual hosting. Every website needed its own IP address. (Note that many non-HTTP protocols still have this limitation.)

As the web became the primary use of the Internet, the www prefix convention remained, and you had a big hot mess of differing implementations:

  • Dedicated-IP hosts that have the root record and www resolve to the same server, which would then serve up the same content on either hostname
  • Sites that would map both example.com and www.example.com to the same virtual host configuration
  • Sites that would redirect www.example.com to example.com
  • Sites that would redirect example.com to www.example.com
  • Sites that serve up entirely different content for example.com vs. www.example.com
  • Hosts that only resolve from one or the other

Pretty much all of these remain to this very day, and to make things even more fun, many clients try to do “helpful” things. For example, if example.com doesn’t resolve, a client might automatically redirect to www.example.com (or put up a prompt to that effect), and a web crawler that sees both hostnames might just assume that both are the same, or follow the preference of whoever implemented it.

I don’t even know what the best practice should be in this case. I guess it should be something like:

  • Clients should assume that www.example.com and example.com are different websites and use canonical URLs to sort out which is the “real” one if they both exist, even if this means potentially crawling the same site twice
  • Servers should redirect to the one that is correct
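For the client half of that, here is a rough sketch of reading a page’s declared canonical URL using Python’s standard html.parser. CanonicalFinder is a hypothetical helper, and the sketch assumes the page declares its canonical URL with a link element whose rel includes canonical; it ignores other mechanisms such as HTTP Link headers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Hypothetical helper: find the first <link rel="canonical"> on a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            # The href may be relative, so resolve it against the page URL.
            self.canonical = urljoin(self.page_url, attributes["href"])


page = '<html><head><link rel="canonical" href="https://example.com/blog/"></head></html>'
finder = CanonicalFinder("https://www.example.com/blog/")
finder.feed(page)
print(finder.canonical)  # https://example.com/blog/
```

A crawler could fetch both hostnames, compare the canonical URLs they report, and only then decide whether it is looking at one site or two.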

Then again, the same issue comes up with sites that are available from multiple separate domains, and I’ve also seen situations where badly-behaved crawlers will assume that all subdomains are equivalent (e.g. alice.example.com and bob.example.com), sometimes even getting confused by ccTLDs that are multi-level (like .uk) and thinking that, for example, example.co.uk and google.co.uk are the same site because they’re both subdomains of co.uk! (This was especially bad back when so-called “dynamic DNS” providers were super common.)
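The multi-level suffix mistake is easy to reproduce. In the sketch below, naive_site mimics a crawler that just keeps the last two labels, while registrable_domain consults a list of public suffixes. Both function names are made up, and SAMPLE_PUBLIC_SUFFIXES is a tiny hand-picked set for illustration; a real client would load the full Public Suffix List instead.

```python
# Tiny hand-picked sample, for illustration only (an assumption of this sketch).
SAMPLE_PUBLIC_SUFFIXES = {"com", "uk", "co.uk"}


def naive_site(hostname):
    """The buggy approach: assume the last two labels identify the site."""
    return ".".join(hostname.lower().split(".")[-2:])


def registrable_domain(hostname):
    """Return the longest matching public suffix plus one more label, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    for index in range(len(labels)):
        if ".".join(labels[index:]) in SAMPLE_PUBLIC_SUFFIXES:
            if index == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[index - 1:])
    return None


print(naive_site("example.co.uk"), naive_site("google.co.uk"))
print(registrable_domain("example.co.uk"), registrable_domain("google.co.uk"))
```

Even the suffix-aware version only tells you who registered the name; alice.example.com and bob.example.com still share a registrable domain while being entirely separate sites.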

Redirections

There are so many different kinds of HTTP redirection, each with different implications on caching, HTTP method, and equivalence.

Clients should probably just note the type and target of a redirection rather than try to treat the URLs as equivalent; for example, /code and /code/ are distinct URLs and should be treated as such.

Like, in theory, /code could not redirect and instead have entirely different content from /code/, but in practice, this will almost certainly cause Problems, and I’m sure there’s even crawlers out there which strip off trailing slashes and then expect the actual request to be redirected.
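As a sketch of what “note the type and target” might look like, the following records a redirect hop without pretending the two URLs are the same thing. REDIRECT_KINDS and describe_redirect are hypothetical names; the table covers only the five common redirect status codes and summarizes their usual handling.

```python
from urllib.parse import urljoin

# status code -> (lifetime, what happens to the request method)
REDIRECT_KINDS = {
    301: ("permanent", "POST may be changed to GET"),
    302: ("temporary", "POST may be changed to GET"),
    303: ("temporary", "changed to GET"),
    307: ("temporary", "preserved"),
    308: ("permanent", "preserved"),
}


def describe_redirect(request_url, status, location):
    """Return a record of one redirect hop, or None if it is not a redirect."""
    kind = REDIRECT_KINDS.get(status)
    if kind is None or not location:
        return None
    lifetime, method = kind
    return {
        "from": request_url,
        "to": urljoin(request_url, location),  # Location may be relative
        "lifetime": lifetime,
        "method": method,
    }


hop = describe_redirect("https://example.com/code", 301, "/code/")
print(hop["from"], "->", hop["to"], f"({hop['lifetime']})")
```

Keeping both the source and the target lets a client remember that /code pointed at /code/ at some moment without treating the two as interchangeable.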

In conclusion

When implementing client software, you should do whatever you can to follow the specification, but when implementing server software, you should also be aware of common client implementation issues.

If you do run into an implementation issue, it is of course a kindness to inform the implementor of the mistake, but some of these issues are common enough that it’s best to accommodate the common misunderstandings and just sigh quietly about it.
