Filesize unit conversion algorithm

A place to discuss the implementation and style of computer programs.


User avatar
Archgeek
Posts: 185
Joined: Wed May 02, 2007 6:00 am UTC
Location: Central US
Contact:

Filesize unit conversion algorithm

Postby Archgeek » Tue Aug 19, 2014 10:40 pm UTC

So upon poking around at how to format a decimal number in PHP to a certain precision, I ran into various ways of calculating the appropriate suffix for a given magnitude of file size, with my two favourites below:

Code: Select all

$byteUnits = array(' B', ' kB', ' MB', ' GB');
$base = floor(log($size) / log(1024)); // find the power of 1024 in play
$size /= (1 << (10 * $base)); // bitwise sorcery

//fails for file sizes greater than 2^90
$i = 0;
$byteUnits = array(' B', ' kB', ' MB', ' GB', ' TB', ' PB', ' EB', ' ZB', ' YB');
//find the largest unit that cannot be expressed as a whole number of the next largest
while ($size >= 1024) {
    $size /= 1024;
    $i++;
}

More fearsome things involving sprintf and regexes also abounded, all claiming efficiency, which I doubt, what with the reckless function calls. It made me curious, though, what is the most computationally efficient way of doing such a simple task? A case statement would just be a few branches and a single divide, which seems pretty cheap, but a loop has the elegance of an amusing little lookup table. I'm guessing grabbing characters out of an array [BkMGTPEZY] or some such would be more memory efficient, but I'm not sure if that'd be more or less efficient in its capacity as a lookup table.
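For concreteness, the "case statement" idea above might look like this in C -- a sketch only, where the function name, the 64-bit size type, and the cut-off at TB are my assumptions rather than anything from the thread:

Code: Select all

```c
/* A chain of comparisons picks the unit index, then one shift scales
   the size -- the "few branches and a single divide" approach. */
static const char *units[] = {" B", " kB", " MB", " GB", " TB"};

const char *pick_unit(unsigned long long size, unsigned long long *scaled)
{
    int i = 0;
    if      (size >= (1ULL << 40)) i = 4;
    else if (size >= (1ULL << 30)) i = 3;
    else if (size >= (1ULL << 20)) i = 2;
    else if (size >= (1ULL << 10)) i = 1;
    *scaled = size >> (10 * i);   /* the "single divide", done as a shift */
    return units[i];
}
```

For example, pick_unit(5ULL * 1024 * 1024, &s) leaves s == 5 and returns " MB".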

What might say those who've had a proper algorithms class (neverminding that it's all roughly O(n))?
"That big tube down the side was officially called a "systems tunnel", which is aerospace contractor speak for "big tube down the side."

korona
Posts: 495
Joined: Sun Jul 04, 2010 8:40 pm UTC

Re: Filesize unit conversion algorithm

Postby korona » Tue Aug 19, 2014 11:08 pm UTC

It's O(log(n)). Integer division by a power of 2 is fast because it translates to a bitwise shift. I doubt that there is anything faster than an unrolled divide-and-check approach.
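The unrolled divide-and-check might be sketched in C like so (my own hedged reading of the suggestion, truncated at TB just to keep it short):

Code: Select all

```c
/* Each step is one compare plus one 10-bit shift; there is no loop
   counter to maintain and no loop branch to predict. */
const char *scale_down(unsigned long long *n)
{
    if (*n < 1024) return " B";
    *n >>= 10; if (*n < 1024) return " kB";
    *n >>= 10; if (*n < 1024) return " MB";
    *n >>= 10; if (*n < 1024) return " GB";
    *n >>= 10; return " TB";
}
```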

User avatar
Thesh
Made to Fuck Dinosaurs
Posts: 6232
Joined: Tue Jan 12, 2010 1:55 am UTC
Location: Colorado

Re: Filesize unit conversion algorithm

Postby Thesh » Wed Aug 20, 2014 2:35 am UTC

Code: Select all

$i = floor(log($size) / log(1024));
$size = $size / pow(1024, $i); // ^ is bitwise XOR in PHP, so pow() is needed here


You'll have to correct the functions since I haven't touched PHP in years and I'm too lazy to look them up.
Summum ius, summa iniuria.

User avatar
Archgeek
Posts: 185
Joined: Wed May 02, 2007 6:00 am UTC
Location: Central US
Contact:

Re: Filesize unit conversion algorithm

Postby Archgeek » Wed Aug 20, 2014 4:26 am UTC

korona wrote:It's O(log(n)). Integer division by a power of 2 is fast because it translates to a bitwise shift. I doubt that there is anything faster than an unrolled divide-and-check approach.


Dang, really? I thought it was O(n) as the worst case involves a division for every power of 1024, up to 9 in the dang crazy case. Does something make the divides overwhelm the loopery?

Thesh wrote:

Code: Select all

$i = floor(log($size)/log(1024))
$size = $size/1024^$i


This also speaks to the core curiosity -- are two calls to the log function, one to the floor function, an exponentiation, and a divide really cheaper than n branches and a divide, or than two logs, a floor, a bit shift, and a divide? I must know how so. If I knew the x86 (x64?) instruction set well enough to sum up the cycles, I could just do that, but I do not, and I know that documentation is tome-like in nature. I just distrust function calls, what with their big pile of stack accesses and branchery.
"That big tube down the side was officially called a "systems tunnel", which is aerospace contractor speak for "big tube down the side."

User avatar
Thesh
Made to Fuck Dinosaurs
Posts: 6232
Joined: Tue Jan 12, 2010 1:55 am UTC
Location: Colorado

Re: Filesize unit conversion algorithm

Postby Thesh » Wed Aug 20, 2014 4:34 am UTC

Well, you can always test it, but how much performance do you really need? The time to execute any of those is going to be negligible and I highly doubt that's where your performance hit will come from. That said, logarithms, multiplication, and exponentiation of floats are all assembly instructions (and log(1024) is a constant if it's really that important) so they can be pretty fast depending on how they are implemented. In your case, you are doing branching (one of the slowest operations), multiple divisions, and multiple additions.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth
Summum ius, summa iniuria.

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: Filesize unit conversion algorithm

Postby EvanED » Wed Aug 20, 2014 4:48 am UTC

Archgeek wrote:
korona wrote:It's O(log(n)). Integer division by a power of 2 is fast because it translates to a bitwise shift. I doubt that there is anything faster than an unrolled divide-and-check approach.


Dang, really? I thought it was O(n) as the worst case involves a division for every power of 1024, up to 9 in the dang crazy case. Does something make the divides overwhelm the loopery?
Well, what's n? If it's the bit width, then a division for every power of 1024 is O(n). If it's the numerical value, then a division for every power of 1024 is O(lg n).

(Both can be reasonable to pay attention to in different cases. The distinction shows up in other places too. For example, there are a lot of numerical algorithms that you'll see people say are "polynomial" but are technically pseudopolynomial: that is, they are polynomial in the value but not in the actual bit width (which is what the formal, TM-based definition of complexity requires). Wikipedia's article gives an example of the naive primality testing algorithm that just divides by 1, 2, 3, etc.; this is polynomial in the numerical value of the input but exponential in the actual width of the input, and hence is pseudopolynomial. Repeated division by 1024 is basically the same case, but down a level of exponentials.)
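That trial-division example can be sketched in a few lines of C (the function name is mine):

Code: Select all

```c
#include <stdbool.h>

/* Naive trial-division primality test: O(n) division attempts, i.e.
   polynomial in the numerical value n but exponential in the bit
   width of n -- pseudopolynomial in the formal sense. */
bool is_prime_naive(unsigned int n)
{
    if (n < 2) return false;
    for (unsigned int d = 2; d < n; d++)
        if (n % d == 0) return false;
    return true;
}
```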

User avatar
Archgeek
Posts: 185
Joined: Wed May 02, 2007 6:00 am UTC
Location: Central US
Contact:

Re: Filesize unit conversion algorithm

Postby Archgeek » Wed Aug 20, 2014 2:33 pm UTC

Thesh: It's academic really, I'm just curious. Woah, exponents are right in the instruction set? I thought they were loops of multiplications. 'Guess the x86 folks went nuts on saving instruction fetches. PHP's an interpreted language, though, so I've no guarantee it's being cool enough to use those and not do full store-all-the-registers-and-jump subroutine calls.
Does division by 1024^i beat out a right-shift by 10*i bits?

EvanED: Yeah, I was taking n to be the number of bits/10, which kind of rigs the measure to wind up O(n), but I was a bit tired when I wrote that. Interesting that formal complexity focuses on bit width. I'll have to remember that.
"That big tube down the side was officially called a "systems tunnel", which is aerospace contractor speak for "big tube down the side."

User avatar
Xenomortis
Not actually a special flower.
Posts: 1421
Joined: Thu Oct 11, 2012 8:47 am UTC

Re: Filesize unit conversion algorithm

Postby Xenomortis » Wed Aug 20, 2014 3:14 pm UTC

Archgeek wrote:Thesh: It's academic really, I'm just curious. Woah, exponents are right in the instruction set?

Only for floats, not integers.
x87 (floating point coprocessor for x86) has FYL2X which computes y * log2(x).
Maybe there's an SSE equivalent, I don't know.

Archgeek wrote:Does division by 1024^i beat out a right-shift by 10*i bits?

I'd expect right shifts to edge out by a near immeasurable margin.

Archgeek wrote:It made me curious, though, what is the most computationally efficient way of doing such a simple task? A case statement would just be a few branches and a single divide, which seems pretty cheap, but a loop has the elegance of an amusing little lookup table. I'm guessing grabbing characters out of an array [BkMGTPEZY] or some such would be more memory efficient, but I'm not sure if that'd be more or less efficient in its capacity as a lookup table.

In a compiled context (can't speak for PHP), branching is typically slow(er), particularly if it can't easily predict the right path ahead of time. Although in this case you could eliminate the branching and implement what's basically an unrolled loop with some clever arithmetic statements.
As for the "multiple strings" vs "single character array", a single array would be better since these things tend to end up in random places on the heap.
In C, I might* try this:

Code: Select all

unsigned long long n = ... ;
char units[27] = { 'B', 0, 0, 'k', 'B', 0, 'M', 'B', 0, 'G', 'B', 0, 'T', 'B', 0,
                   'P', 'B', 0, 'E', 'B', 0, 'Z', 'B', 0, 'Y', 'B', 0 };
char* unit = units;
while ( n >= 1024 ) {
  n >>= 10;
  unit += 3;
}

But in PHP, you're given less freedom to abuse strings and arrays like that.

*going out of my way to avoid indirection and have everything on the stack
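One way to do the branch elimination mentioned above -- a sketch of my own, not necessarily what was meant -- is to note that the unit index is just the position of the highest set bit divided by 10. __builtin_clzll is a GCC/Clang builtin, not ISO C:

Code: Select all

```c
/* Branchless unit selection: bit length of n, divided by 10.
   (n | 1) keeps the builtin's input nonzero, since clz(0) is undefined. */
int unit_index(unsigned long long n)
{
    int top_bit = 63 - __builtin_clzll(n | 1);
    return top_bit / 10;   /* 0 = B, 1 = kB, 2 = MB, ... */
}
```

Scaling is then a single shift by 10 * unit_index(n), with no compares at all.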

Archgeek wrote:I just distrust function calls what with their big pile of stack accesses and branchery.

Touching the stack is cheap - your code does it all the time. And there isn't any branching.
(Although I have heard that function calls in PHP are expensive, but I cannot remember where.)

User avatar
Thesh
Made to Fuck Dinosaurs
Posts: 6232
Joined: Tue Jan 12, 2010 1:55 am UTC
Location: Colorado

Re: Filesize unit conversion algorithm

Postby Thesh » Wed Aug 20, 2014 3:16 pm UTC

Archgeek wrote:Woah, exponents are right in the instruction set? I thought they were loops of multiplication

Well, 2^x is in the instruction set, then you compute y^x = 2^(x * log2(y)); no looping required. Of course, since in this case it's 1024^x, it is simply 2^(10x) (really, this works with anything where y is a constant, you just precompute log2(y)). But yes, in this case bitshifting is marginally faster since it's an integer operation instead of a float one.
Summum ius, summa iniuria.

User avatar
You, sir, name?
Posts: 6983
Joined: Sun Apr 22, 2007 10:07 am UTC
Location: Chako Paul City
Contact:

Re: Filesize unit conversion algorithm

Postby You, sir, name? » Mon Aug 25, 2014 3:25 pm UTC

I'd imagine something like this, if we want something that checks most likely sizes first (and we're by the nature of the beast never going to have an extremely large number of extremely large files)...

Code: Select all

if (! (x & ~0x3ff)) { return "b"; } // 1024 = 0x400
else if (! (x & ~0xfffff)) { return "kb"; } // 1024*1024 = 0x100000
else if (! (x & ~0x3fffffff)) { return "mb"; } // etc.

I edit my posts a lot and sometimes the words wrong order words appear in sentences get messed up.

