More fun with encodings

On a Slack I’m on, there was a conversation wondering why so many websites disallow passwords with spaces, punctuation, “special” characters, and so on; shouldn’t they all be hashing the passwords rather than storing them in plain text anyway?

Yes, they should, but that’s not where the problem is. Once again, encodings become a problem.

Here is my not-very-lucid, unedited ramble:

the encoding process can happen in a bunch of spots. The browser does the encoding into what it assumes is appropriate for what it thinks the webserver is expecting, charset-wise, and then the server may or may not decode the URL-encoded data into what the hash algorithm is expecting. so for example, if your password is i like béans and the client thinks the server is expecting ISO-8859-1, it will get sent as i%20like%20b%E9ans
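To make that concrete, here is a rough Python sketch of the client side, using the standard library's urllib.parse.quote as a stand-in for whatever a browser actually does (that substitution is my assumption; real browsers have their own form-encoding rules):

```python
from urllib.parse import quote

# "i like béans", with the é written as an escape so it is
# unambiguously the single code point U+00E9.
password = "i like b\u00e9ans"

# What a client might send if it believes the server wants ISO-8859-1.
as_latin1 = quote(password, encoding="iso-8859-1")

# What a client might send if it believes the server wants UTF-8.
as_utf8 = quote(password, encoding="utf-8")

print(as_latin1)  # i%20like%20b%E9ans
print(as_utf8)    # i%20like%20b%C3%A9ans
```

Same password, two different strings on the wire, and nothing in the string itself says which charset was used.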

so the server might try hashing i%20like%20b%E9ans or it might decode it to the ISO-8859-1 bytestream that corresponds to i like béans and hash that or it might try decoding it to UTF-8 and then the %E9 might throw an exception or get converted to who-knows-what, and then the resulting stream may or may not get hashed correctly because it might be expecting a string or it might be expecting a byte sequence
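Those branches look roughly like this in Python. SHA-256 is only a placeholder here (a real site should use a proper password hashing scheme), and the variable names are made up for illustration:

```python
import hashlib
from urllib.parse import unquote

received = "i%20like%20b%E9ans"  # the ISO-8859-1 form from the client

# Branch 1: hash the still-encoded string as-is.
raw_hash = hashlib.sha256(received.encode("ascii")).hexdigest()

# Branch 2: decode as ISO-8859-1, then hash those bytes.
latin1_text = unquote(received, encoding="iso-8859-1")
latin1_hash = hashlib.sha256(latin1_text.encode("iso-8859-1")).hexdigest()

# Branch 3: decode as UTF-8. A lone 0xE9 byte is not valid UTF-8,
# so a strict decoder throws...
try:
    unquote(received, encoding="utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...and a lenient one swaps in U+FFFD, the replacement character.
lenient_text = unquote(received, encoding="utf-8", errors="replace")

# And the hash function wants bytes, not a string.
try:
    hashlib.sha256(latin1_text)
    str_hash_failed = False
except TypeError:
    str_hash_failed = True

print(raw_hash == latin1_hash)          # False
print(strict_failed, str_hash_failed)   # True True
print(lenient_text == "i like b\ufffdans")  # True
```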

or the client might think the server is expecting UTF-8 so then it gets sent as i%20like%20b%C3%A9ans and then the server might hash that or it might think it’s ISO-8859-1 and decode it as i like bÃ©ans and then hash that in whatever way makes sense (which again might be done as a string or it might throw an exception)
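A small sketch of that mismatch, again with urllib.parse standing in for the server's decoder (an assumption on my part) and SHA-256 as a placeholder hash:

```python
import hashlib
from urllib.parse import unquote

received = "i%20like%20b%C3%A9ans"  # the UTF-8 form from the client

# Decoded with the charset the client actually used.
correct = unquote(received, encoding="utf-8")

# Decoded by a server that wrongly assumes ISO-8859-1: the two UTF-8
# bytes C3 A9 become two separate characters.
mojibake = unquote(received, encoding="iso-8859-1")

print(correct)   # i like béans
print(mojibake)  # i like bÃ©ans

# If both strings are then turned into UTF-8 bytes for hashing,
# the hashes no longer match.
correct_hash = hashlib.sha256(correct.encode("utf-8")).hexdigest()
mojibake_hash = hashlib.sha256(mojibake.encode("utf-8")).hexdigest()
print(correct_hash == mojibake_hash)  # False
```

The mangled string only hashes to the same value if it happens to be turned back into bytes with ISO-8859-1, which is the kind of thing that works by accident until some layer changes.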

there is definitely a right way to handle this stuff, but many many many things can go wrong and it’s easier for web developers to just 🤷‍♀️ and 🖕

oh and even if the server does just hash the encoded string, there’s no guarantee that it’ll be encoded the same way by a future client, which might decide to encode it as i%20like%20b%c3%a9ans instead, which is also valid but hashes differently
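Here is a quick Python check of that, with SHA-256 as a placeholder hash and a hypothetical helper named sha256_hex:

```python
import hashlib
from urllib.parse import unquote

upper = "i%20like%20b%C3%A9ans"
lower = "i%20like%20b%c3%a9ans"


def sha256_hex(data: bytes) -> str:
    """Hex digest of the given bytes (hypothetical helper for this example)."""
    return hashlib.sha256(data).hexdigest()


# Both spellings decode to the same password...
same_password = unquote(upper) == unquote(lower)

# ...but the encoded strings themselves are different bytes.
same_raw_hash = sha256_hex(upper.encode("ascii")) == sha256_hex(lower.encode("ascii"))

print(same_password, same_raw_hash)  # True False
```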

or it might encode it to a different charset

there’s only one way for it to go perfectly right (and a couple of ways which happen to work temporarily) but there’s PLENTY of ways for it to get fucked up.

Oh and to add to the above, a lot of punctuation causes similar weirdness; space characters sometimes get encoded as + instead of %20, and basically it’s all a gigantic mess.
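The space thing is easy to reproduce in Python, where urllib.parse has separate functions for the two conventions. The password "a b+c" is just a made-up example:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

password = "a b+c"

percent_style = quote(password)    # spaces become %20
plus_style = quote_plus(password)  # spaces become +

print(percent_style)  # a%20b%2Bc
print(plus_style)     # a+b%2Bc

# Decoding with the matching convention round-trips.
print(unquote_plus(plus_style))  # a b+c

# Decoding with the wrong one silently yields a different password.
print(unquote(plus_style))       # a+b+c
```

Nothing throws in the mismatched case; the password just quietly stops being the one the user typed.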

Anyway, here’s what Wikipedia has to say about percent-encoding, and then there’s just so many places in the HTTP and application stack where the decoding can heck up.

And that’s assuming everything even normalizes the same. Thanks to the magic of Unicode there can be many different ways for the same characters to be encoded, which might change between different operating systems or browsers, or even different versions of the same operating system and browser. And forget about consistency between older Unicode glyphs which now get replaced (sometimes) with emoji, or how some OSes now add a ZWJ and gender and/or race marker to emoji that didn’t use to have them or whatever.
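For the plain é case, the normalization problem looks like this in Python. The helper canonical_password_bytes is hypothetical, and picking NFC plus UTF-8 is my assumption for the sketch rather than the one true answer; it also does nothing about the emoji situation:

```python
import hashlib
import unicodedata

composed = "b\u00e9ans"     # é as one code point (U+00E9)
decomposed = "be\u0301ans"  # e followed by a combining acute accent (U+0301)

# They render the same but are different strings and different bytes.
print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False


def canonical_password_bytes(password: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8, before hashing.

    Hypothetical helper: the choice of NFC here is an assumption.
    """
    return unicodedata.normalize("NFC", password).encode("utf-8")


hash_a = hashlib.sha256(canonical_password_bytes(composed)).hexdigest()
hash_b = hashlib.sha256(canonical_password_bytes(decomposed)).hexdigest()
print(hash_a == hash_b)  # True
```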

(that said if a website is specifically disallowing just a small subset of characters, like ' and ; and nothing else, you can be pretty sure they’re just storing your password in plaintext with terrible database code)

tl;dr i really should be in bed why do i keep doing this to myself