
June 22, 2009

Base-10 file sizes

by fluffy at 11:30 AM
So, Mac OS X 10.6 (Snow Leopard) will be using base-10 file sizes. There has been a lot of nerd outcry over this, but frankly, I think it's about freaking time.

There is absolutely no reason to use base-2 file sizes. Yes, computers deal with things in terms of base 2, but nobody else does. When you look at a file that is 104768926 bytes big, you think, "oh, 105 megabytes," not "100 megabytes." As files get bigger and bigger, the disparity between MB and MiB gets worse and worse.
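To put rough numbers on that disparity, here is a minimal Python sketch. The helper name `prefix_gap` is made up for illustration and does not come from any library.

```python
# Compare decimal (SI) and binary (IEC) prefixes.
PREFIXES = ["K", "M", "G", "T"]


def prefix_gap(power: int) -> float:
    """Percentage by which 1024**power exceeds 1000**power."""
    return (1024 ** power / 1000 ** power - 1) * 100


size = 104768926  # the example file size from the paragraph above
print(f"{size / 1000**2:.1f} MB")   # 104.8 MB
print(f"{size / 1024**2:.1f} MiB")  # 99.9 MiB

# The gap widens with every step up the prefix ladder.
for power, name in enumerate(PREFIXES, start=1):
    print(f"{name}iB is {prefix_gap(power):.1f}% larger than {name}B")
```

The gap is about 2.4% at the kilo level and roughly 10% by the time you reach tera.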

People have long accused hard drive manufacturers of "inflating" drive sizes by using base-10 instead of base-2, but really it's been the fault of OS makers for deflating them, based on some really ridiculous legacy which dates back to the 70s: namely, that it was a lot easier for OSes to just report how many 1K clusters were available, or to shift the byte count right by 10 bits (>>10) instead of dividing by 1000, or whatever.

The practice of 1024-as-K has also led to all sorts of weirdness, like 1.44MB disks (which were 1440KiB, i.e. 1474560 bytes - neither 1.44MB nor 1.44MiB).
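The floppy arithmetic is easy to check. This is just a quick sketch; `floppy_sizes` is a made-up helper name, and the "mixed" unit is my own label for the 1000 x 1024 hybrid that the marketing number implies.

```python
FLOPPY_BYTES = 1440 * 1024  # 1440 KiB, sold as "1.44MB"


def floppy_sizes(n_bytes: int) -> dict:
    """Express a byte count in decimal, binary, and hybrid 'megabytes'."""
    return {
        "MB": n_bytes / 1000**2,          # true decimal megabytes
        "MiB": n_bytes / 1024**2,         # true binary mebibytes
        "mixed": n_bytes / (1000 * 1024), # the hybrid unit behind "1.44"
    }


for unit, value in floppy_sizes(FLOPPY_BYTES).items():
    print(f"{value:.5f} {unit}")
```

Only the hybrid unit comes out to 1.44; the disk is about 1.47 MB or about 1.41 MiB.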

"But computer parts are sold in terms of 1024 units!" is also crap. The only part that has ever been sold on that basis is RAM, which actually makes sense for various technological reasons not worth getting into. CPU speed is base-10. Network adapters are base-10. Bus speed is base-10. And hard drives are sold based on base-10, but reported based on base-2.

Okay, so RAM sizes will be somewhat disparate from hard disk sizes, but really, why does that matter? RAM sizes only matter to programmers, and as a ballpark figure for users for having "enough" memory. Just because a file on disk takes 1200KB doesn't mean it will take 1200KB of RAM; chances are it will take much more. (Granted, there are a lot of spots where it makes sense for code to use power-of-2 sizes, for things like memory allocation and caches and the like, but that doesn't need to be reported to the user.)

The only place where hard disk size really has any base-2 issue is that file systems tend to allocate things in base-2-sized chunks (usually 512 or 1024 bytes), but that's not counting the overhead of the filesystem itself, and anyway the vast majority of files (the ones which take enough space for hard drive availability to be an issue) are so large that the cluster size essentially just amounts to rounding error. Okay, so the "real" storage space taken by a 123456789-byte file on 1024-byte clusters is actually 123457536 bytes, but that's still a lot closer to 123.5MB than it is to 117.7MB!
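Here is that rounding worked out as a small sketch. The function name `allocated_size` and the 1024-byte default are assumptions for illustration; real filesystems vary in cluster size and add their own overhead.

```python
def allocated_size(n_bytes: int, cluster: int = 1024) -> int:
    """Round a file size up to a whole number of clusters."""
    clusters = -(-n_bytes // cluster)  # ceiling division on integers
    return clusters * cluster


size = 123456789
on_disk = allocated_size(size)

print(on_disk)                          # 123457536
print(on_disk - size)                   # 747 bytes of slack
print(f"{on_disk / 1000**2:.1f} MB")    # decimal megabytes
print(f"{on_disk / 1024**2:.1f} MiB")   # binary mebibytes
```

The slack from cluster rounding is a few hundred bytes, while the choice of unit moves the reported number by several megabytes.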

In short: Apple is doing a good thing by finally freeing us of some ridiculous legacy which has no bearing on reality.

Okay, so it does mean there will be a mismatch between file sizes reported on OSX 10.6 vs. any other OS, but when does that actually matter?

Comments

#12194 06/22/2009 12:28 pm
One presumes that command line utilities will use the IEC 60027 (kibi, mebi, gibi...) standard rather than the SI one. At least when they're piping their output or run from a script and whatnot.

But yep, that all sounds very legacy nowadays.
#12195 06/22/2009 01:20 pm
Well, what tools even report sizes in other-than-bytes anyway, these days?

du and df have options of -h (which is specifically intended to be human-readable and not machine-parsed), du -k (which specifically is KiB), and du alone (which specs that it is 'cluster count' which just happens to be KiB on many platforms).

I guess "find" has file size constraints, but that also very specifically specifies what the units are (bytes,KiB,MiB,GiB) and even then, that still gets to the issue of, why are people searching on file sizes, and if they are, shouldn't it be intended for the human, not based on some arbitrary legacy idea of what disk space entailed?

Everything I've ever written which deals with sizes is dealing with it either in terms of bytes or is unit-agnostic and is just sorting based on largest-to-smallest, so it still doesn't even matter.
#12196 06/22/2009 01:22 pm
Quite correct. I haven't dealt with anything that close to the metal in a long time, so I made a silly assumption.
#12197 06/22/2009 01:37 pm
This is perhaps the first time I've ever agreed with Apple on any decision.
#12198 belisle (unregistered) 06/22/2009 04:23 pm
fluffy:
du and df have options of -h (which is specifically intended to be human-readable and not machine-parsed)


Right. If you're writing code that internally deals with English prefixes at any time other than representing a result to the user, you're doing it wrong.
#12367 Stephen (unregistered) 09/02/2009 10:42 pm SSDs
How about SSDs and other flash memory? Aren't they base-2 sizes (like 128MB, 256MB and so on)?
#12368 09/02/2009 11:02 pm
Their total raw capacity is, but it's diminished by the file system itself, as well as built-in wear leveling mechanisms and so on. Also, I suspect that many SSDs have defective blocks which are detected during testing and then diverted around in software, just like magnetic media.

Anyway, just because it happens to be a power of 2 doesn't mean that we should continue to make contortions just to make an arbitrary (for all intents and purposes) number slightly cleaner-looking. I mean, even if a 128GiB SSD were exactly 128*2^30 bytes large, you're still using a base-10 number in front of that GiB signifier. A 1280GiB SSD is 1.25TiB, not 1.28.
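A quick sketch of that conversion, with `gib_to_tib` as a made-up helper name:

```python
GIB = 2**30
TIB = 2**40


def gib_to_tib(gib: float) -> float:
    """Convert gibibytes to tebibytes (divide by 1024, not 1000)."""
    return gib * GIB / TIB


print(gib_to_tib(1280))              # 1.25
print(128 * GIB)                     # 137438953472 bytes
print(f"{128 * GIB / 10**9:.1f} GB") # the same drive in decimal gigabytes
```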

There is no reason for us to make the numbers vaguely "more computery" when all it does is complicate matters.

This reminds me of a time when I was taking a tour of some veterinary research lab with a bunch of other CS students (as part of some "expanding your horizons" thing). The guy who was showing off their computerized animal testing facilities quite proudly showed how he was setting a timer to 16 seconds "because computers are better at dealing with powers of 2, right?" completely not understanding that no, the computer doesn't actually care how many bits are set in the start value when it begins.