photosinensis wrote:btilly wrote:If your dataset is big enough that it has to live on disk, then quicksort is very, very painfully slow.
Of course, if your data is on disk, any operations on it will be painfully slow. Best to avoid thrashing at all costs. Of course, that's another discussion and/or flamewar.
It is only painfully slow if you use the wrong algorithms, and have the wrong expectations.
For example suppose you're trying to sort a billion items, each about 50 bytes, on commodity hardware. If it turns at 6000 rpm, then it turns 100 times per second, leading to random access times of about 1/200th of a second. Sorting this with a quicksort takes billions of random accesses to disk. (Many of the accesses can hit cache instead of disk, but billions still need to go to disk.) A billion accesses will take almost 58 days. Pain galore.
Now suppose the same disk has a sustained throughput of 70 MB/second. If we're using an unoptimized mergesort, we need to read each piece of data 30 times and write it 30 times. We have 50 GB of data. So we have 300 GB of data transfer. Which takes 71 minutes.
Slow? Sure. But given the size of the problem and the hardware available, a little over an hour is quite reasonable. However you see that quicksort does not run at reasonable speed by any reasonable measure.
Not about your other point, when you take code that is meant to run against RAM and run it against virtual memory on disk, the time to follow a pointer is now disk seek time. Looking at how slow that is, you can see why a computer seems to freeze when it starts "thrashing". The algorithms that are appropriate for normal use are a worst case scenario once data lives on disk. That doesn't mean that data can't be on disk while still having reasonable performance. Relational databases do that all of the time. But you need to access that data in appropriate patterns.
If you have never tried to understand how to use disk appropriately, but have encountered the worst case scenarios, then you can be pardoned for your excessively negative view of working with datasets that live on disk. However in the right circumstances it is quite reasonable to handle data on disk. (Though, of course, it would be nicer if you had enough RAM for it to all be there. But sometimes that is just not possible.)