## Removing repeated numbers from a very large list.

A place to discuss the science of computers and programs, from algorithms to computability.

Formal proofs preferred.

Moderators: phlip, Moderators General, Prelates

undecim
Posts: 289
Joined: Tue Jan 19, 2010 7:09 pm UTC
Contact:

### Removing repeated numbers from a very large list.

I have a list of 2^32 32-bit integers (stored in a 16GiB file). I know that none of them are unique, and I need to remove repeated ints (including non-adjacent repeated ints).

But I have no idea where to start on this... Any hints?
Blue, blue, blue

eta oin shrdlu
Posts: 450
Joined: Sat Jan 19, 2008 4:25 am UTC

### Re: Removing repeated numbers from a very large list.

Here's a hint: 2^32 bits is only 512 MiB; on a reasonably modern computer you should be able to allocate an array this big.

jaap
Posts: 2076
Joined: Fri Jul 06, 2007 7:06 am UTC
Contact:

### Re: Removing repeated numbers from a very large list.

Spoiler:
.. using an array of 2^32 bits (512MB), like a BitSet (Java) or vector<bool> (C++)

Spoiler:
For each element n on your list with duplicates, set the nth bit in your array. Once you've done that, you can construct a version of the list without duplicates by going through the array and including n only if the nth bit is set.

undecim
Posts: 289
Joined: Tue Jan 19, 2010 7:09 pm UTC
Contact:

### Re: Removing repeated numbers from a very large list.

eta oin shrdlu wrote:Here's a hint: 2^32 bits is only 512 MiB; on a reasonably modern computer you should be able to allocate an array this big.

2^32 is the number of integers. They are 32-bit integers.

In other words, 2^37 bits, or 16 GiB.

EDIT: Sorry, I misunderstood... I see what you're saying now.

This should do what I need. Thank you.
Last edited by undecim on Wed Mar 02, 2011 6:29 pm UTC, edited 1 time in total.
Blue, blue, blue

achan1058
Posts: 1783
Joined: Sun Nov 30, 2008 9:50 pm UTC

### Re: Removing repeated numbers from a very large list.

Even if it were a longer list, you could "bucket" the file into ranges of manageable size.

MHD
Posts: 630
Joined: Fri Mar 20, 2009 8:21 pm UTC
Location: Denmark

### Re: Removing repeated numbers from a very large list.

I would suggest doing some sort of partition sorting of the file. But the bitmask looks cool too.
EvanED wrote:be aware that when most people say "regular expression" they really mean "something that is almost, but not quite, entirely unlike a regular expression"

Killamus
Posts: 163
Joined: Sun Jan 24, 2010 3:26 am UTC

### Re: Removing repeated numbers from a very large list.

Insert them into a database via a program, then use "Select distinct <columnname> from <tablename>"
Then write the results to a file.

Xanthir
My HERO!!!
Posts: 5281
Joined: Tue Feb 20, 2007 12:49 am UTC
Contact:

### Re: Removing repeated numbers from a very large list.

So, perhaps you didn't see the part where he has 2^32 of them.

(Plus, your suggestion basically boils down to "sort and remove duplicates", which he can do much more easily than using a database.)
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))

Moose Hole
Posts: 398
Joined: Fri Jul 09, 2010 1:34 pm UTC

### Re: Removing repeated numbers from a very large list.

You could put them into a hash table. Then, go through each bucket and sort/delete duplicates, because the bucket size will be more manageable than your file. You could alternatively do a sort/delete on each insert to keep the table from becoming large.

Killamus
Posts: 163
Joined: Sun Jan 24, 2010 3:26 am UTC

### Re: Removing repeated numbers from a very large list.

Xanthir wrote:So, perhaps you didn't see the part where he has 2^32 of them.

(Plus, your suggestion basically boils down to "sort and remove duplicates", which he can do much more easily than using a database.)

Yeah, I didn't say it'd be fast, just that it'd be easy. I can write a C# program to insert every line from a file into a database in under 5 minutes. I gave him the select statement to use. Then just copy/paste into a file.

Moose Hole
Posts: 398
Joined: Fri Jul 09, 2010 1:34 pm UTC

### Re: Removing repeated numbers from a very large list.

I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.

Spambot5546
Posts: 1466
Joined: Thu Apr 29, 2010 7:34 pm UTC

### Re: Removing repeated numbers from a very large list.

Transactions usually refers to individual database "hits". The query he's describing will involve one transaction (the query itself), and then the database will perform the shit-ton of processing necessary to sort and delete in order to return the data. The only thing that makes the database option slower is the overhead of hitting the database and the database returning the data.
"It is bitter – bitter", he answered,
"But I like it
Because it is bitter,
And because it is my heart."

Killamus
Posts: 163
Joined: Sun Jan 24, 2010 3:26 am UTC

### Re: Removing repeated numbers from a very large list.

Moose Hole wrote:I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.

No matter what, this is going to take a long time. Just reading 16 gigs into anything will take a while. My solution is the fastest in terms of programming time, which (unless otherwise specified) is more important than execution time. Another option is just reading from the file, storing unique integers into memory and discarding the rest, which doesn't involve a SQL database, but runs into issues with running out of memory/exponentially increasing search time, unless it's sorted after each number entered.

Thesh
Posts: 5833
Joined: Tue Jan 12, 2010 1:55 am UTC

### Re: Removing repeated numbers from a very large list.

A bitmask is pretty quick to write, and that would probably be the way I would do it as well. It's only a few lines of code to find/set the bit. It will probably run in half to a third of the time of the database approach (the DB has to write to the transaction log, and then to the data page when it checkpoints, on top of the reads... if you only have one disk it will probably be worse).
Summum ius, summa iniuria.

phlip
Restorer of Worlds
Posts: 7550
Joined: Sat Sep 23, 2006 3:56 am UTC
Location: Australia
Contact:

### Re: Removing repeated numbers from a very large list.

Killamus wrote:exponentially increasing search time, unless it's sorted after each number entered.

Quadratically, not exponentially. And it would still be quadratic if you sorted after each number (that's basically an insertion sort, which is O(n^2)).

Using a bit mask for all the possible values, and setting bits as you see numbers come in, is linear in the size of the list. As is reading the whole list and using a sort method that relies on a specific limit on the values' range, like radix sort. The bitmask option is probably the best... it's probably the fastest, and many languages have a space-efficient bitmask data structure built-in.

Code: Select all

`enum ಠ_ಠ {°□°╰=1, °Д°╰, ಠ益ಠ╰};void ┻━┻︵​╰(ಠ_ಠ ⚠) {exit((int)⚠);}`
[he/him/his]

cogman
Posts: 112
Joined: Sun May 18, 2008 2:17 pm UTC

### Re: Removing repeated numbers from a very large list.

Don't use a vector of bool. Bools are not 1 bit, but a full byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

Killamus wrote:
Moose Hole wrote:I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.

No matter what, this is going to take a long time. Just reading 16 gigs into anything will take a while. My solution is the fastest in terms of programming time, which (unless otherwise specified) is more important than execution time. Another option is just reading from the file, storing unique integers into memory and discarding the rest, which doesn't involve a SQL database, but runs into issues with running out of memory/exponentially increasing search time, unless it's sorted after each number entered.

Umm... Using a database is going to increase the time it takes to do this by a very large factor. Disk IO is going to be the main bottleneck for this, this is true. However, shoving this into a database is just adding stupid amounts of overhead. A standard HD can read around 40MB/s, which means that a non-database solution will take somewhere around 7 minutes to complete.

Using a database, on the other hand, adds HUGE amounts of IO (already the main bottleneck), which would increase the runtime by a very large factor.

phlip wrote:
Killamus wrote:exponentially increasing search time, unless it's sorted after each number entered.

Quadratically, not exponentially. And it would still be quadratic if you sorted after each number (that's basically an insertion sort, which is O(n^2)).

Using a bit mask for all the possible values, and setting bits as you see numbers come in, is linear in the size of the list. As is reading the whole list and using a sort method that relies on a specific limit on the values' range, like radix sort. The bitmask option is probably the best... it's probably the fastest, and many languages have a space-efficient bitmask data structure built-in.

Not to mention the fact that it is quite possible the full file might not need to be read in: after all, he said that every number has a duplicate, which means once the bit mask is full, there is no reason to look at any more of the file.
Last edited by phlip on Thu Mar 10, 2011 10:26 pm UTC, edited 1 time in total.
Reason: Merged your triple post for you. Use the edit button next time.

jaap
Posts: 2076
Joined: Fri Jul 06, 2007 7:06 am UTC
Contact:

### Re: Removing repeated numbers from a very large list.

cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but a full byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.

phlip
Restorer of Worlds
Posts: 7550
Joined: Sat Sep 23, 2006 3:56 am UTC
Location: Australia
Contact:

### Re: Removing repeated numbers from a very large list.

cogman wrote:Not to mention the fact that it is quite possible the full file might not need to be read in: after all, he said that every number has a duplicate, which means once the bit mask is full, there is no reason to look at any more of the file.

Unlikely... they said they have 2^32 numbers, and there are 2^32 possible numbers... if they have any duplicates at all, then there'll be some numbers they have 0 of. Indeed, that's the point - they're trying to find out which numbers they have, and which ones they've missed.

Code: Select all

`enum ಠ_ಠ {°□°╰=1, °Д°╰, ಠ益ಠ╰};void ┻━┻︵​╰(ಠ_ಠ ⚠) {exit((int)⚠);}`
[he/him/his]

_Axle_
Posts: 253
Joined: Fri Sep 24, 2010 7:33 pm UTC

### Re: Removing repeated numbers from a very large list.

jaap wrote:
cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but a full byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.

While that is true, it saves memory but eats more CPU time per operation.
Yakk wrote:Computer Science is to Programming as Materials Physics is to Structural Engineering.

tkx
Posts: 1
Joined: Thu Mar 10, 2011 11:18 pm UTC

### Re: Removing repeated numbers from a very large list.

phlip wrote:
cogman wrote:Not to mention the fact that it is quite possible the full file might not need to be read in: after all, he said that every number has a duplicate, which means once the bit mask is full, there is no reason to look at any more of the file.

Unlikely... they said they have 2^32 numbers, and there are 2^32 possible numbers... if they have any duplicates at all, then there'll be some numbers they have 0 of. Indeed, that's the point - they're trying to find out which numbers they have, and which ones they've missed.

Still, you can stop when "total read + seen only once = 2^32". This allows you not to read the whole file (except if it contains only one number, in which case you can avoid reading the last number as well).

cogman
Posts: 112
Joined: Sun May 18, 2008 2:17 pm UTC

### Re: Removing repeated numbers from a very large list.

jaap wrote:
cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but a full byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.

That's crazy, definitely not what I expected. (Guess you learn something new every day.)

How do they accomplish this? I thought it was somewhat impossible to do type checking in C++ (especially with a primitive). Does the compiler have to detect it?

phlip
Restorer of Worlds
Posts: 7550
Joined: Sat Sep 23, 2006 3:56 am UTC
Location: Australia
Contact:

### Re: Removing repeated numbers from a very large list.

tkx wrote:Still, you can stop when "total read + seen only once = 2^32". This allows you not to read the whole file (except if it contains only one number, in which case you can avoid reading the last number as well).

... how? Surely if there are any numbers you haven't seen, and any numbers you haven't read, you have to read them in case any of them are ones you haven't seen yet... and the only way there'll be no numbers you haven't seen is if there's exactly one of each and you've already read the whole file.

cogman wrote:How do they accomplish this? I thought it was somewhat impossible to do type checking in C++ (especially with a primitive). Does the compiler have to detect it?

It's just how templates work in C++... you can have special cases:

Code: Select all

```cpp
#include <iostream>
#include <typeinfo>

template <class T> void foo()
{
   std::cout << "Called generic foo<T> with T=" << typeid(T).name() << std::endl;
}

template <> void foo<bool>()
{
   std::cout << "Called specific foo<bool>" << std::endl;
}

int main()
{
  foo<int>();    // prints "generic" and some string that represents "int"
  foo<double>(); // prints "generic" and some string that represents "double"
  foo<bool>();   // prints "specific"
  return 0;
}
```
When compiled, unless the optimiser gets involved, this will end up with three different foo() functions in the program - foo<int>, foo<double> and foo<bool>. The first two will take their definition from the general template, and the third will take its definition from the override. The same works with template classes and whatever. So <vector> contains something like:

Code: Select all

```cpp
template <class T> class vector { /* generic array-based code */ };
template <> class vector<bool> { /* specific bitfield-based code */ };
```

Code: Select all

`enum ಠ_ಠ {°□°╰=1, °Д°╰, ಠ益ಠ╰};void ┻━┻︵​╰(ಠ_ಠ ⚠) {exit((int)⚠);}`
[he/him/his]

Yakk
Poster with most posts but no title.
Posts: 11077
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

### Re: Removing repeated numbers from a very large list.

I had a nice solution to this. I guess I didn't hit "post".

In any case, what I'd do is write a big-bit-mask class that has 2^32 entries (indexed by a size_t).

To avoid 32 bit issues (the size of a single vector is limited to 2^32-1), I'd have two vectors (high and low) each of size 2^31 (or 1<<31).

operator[](size_t) would return a "pseudo_bool_ref", a class that has a pointer to a vector, an offset, and overridden operator bool() and operator=(bool).
operator[](size_t) const would return a bool.

If the input was bigger than or equal to 2^31, it would use the high vector -- otherwise the low vector.

The main code would consist of reading through the array, checking [] to see if it was already there -- and if it was already there, it would note it. Then it would set it.

The bottleneck in that program would be disk IO, and to a lesser extent the memory cache. So the next optimization step would involve tweaking the disk IO pipeline so that pages are being delivered from disk, summarily dispatched, and then thrown away. Once that is optimized, I'd get to work on the memory locality problem (if it turned out to be a problem) of the bit-array if I wanted extreme performance.

But the naive implementation I mentioned above will be pretty close to disk IO speed.

...

Aha, found this written post:

This will work on 32-bit systems (creating a single vector of size 2^32 has issues on 32-bit systems):

Code: Select all

```cpp
// Assumes #include <vector>; Assert() is the poster's own assertion macro.
struct big_bitmask
{
  std::vector<bool> low;
  std::vector<bool> high;
  big_bitmask() : low(), high()
  {
    low.resize(size_t(1)<<31);
    high.resize(size_t(1)<<31);
  }
  struct bool_pseudo_ref
  {
    std::vector<bool>* vec;
    size_t offset;
    bool_pseudo_ref( std::vector<bool>& base, size_t offset_ ) : vec(&base), offset(offset_)
    {
      Assert(vec);
      Assert(offset < (size_t(1)<<31));
    }
    operator bool() const
    {
      Assert(vec);
      if (!vec) return false;
      return vec->at(offset);
    }
    bool_pseudo_ref& operator=(bool src)
    {
      Assert(vec);
      if (vec) vec->at(offset) = src;
      return *this;
    }
  };
  bool_pseudo_ref operator[](size_t i)
  {
    if (i >= (size_t(1)<<31))
      return bool_pseudo_ref( high, i - (size_t(1)<<31) );
    else
      return bool_pseudo_ref( low, i );
  }
  bool operator[](size_t i) const
  {
    if (i >= (size_t(1)<<31))
      return high.at(i - (size_t(1)<<31));
    else
      return low.at(i);
  }
};
```

That should be a relatively fully functional bitmask that supports 2^32 elements in it.

Simply read in each element. Check that your bitmask[value] is false -- if it is true, you have a duplicate. Then set bitmask[value] to true.

Make sure you don't needlessly copy your big_bitmask around -- it is reasonably large.
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

jaap
Posts: 2076
Joined: Fri Jul 06, 2007 7:06 am UTC
Contact:

### Re: Removing repeated numbers from a very large list.

Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),

Really?
A C++ vector is indexed by a size_t type, which is an unsigned type. I would think it was able to cover the whole range even if it is 32 bit (though it is more likely to be 64 bit nowadays for most std libraries). Memory addressing should also not be an issue.

Yakk
Poster with most posts but no title.
Posts: 11077
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

### Re: Removing repeated numbers from a very large list.

jaap wrote:
Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),

Really?
A C++ vector is indexed by a size_t type, which is an unsigned type.

I said 2^32-1, not 2^31-1.
I would think it was able to cover the whole range even if it is 32 bit (though it is more likely to be 64 bit nowadays for most std libraries).

The lowest size_t is 0. All values exist, and on 32 bit systems it is 32 bit. Thus the highest value, by the law of pigeons, is 2^32-1.

I believe that on 32 bit systems, std libraries still use 32 bit sizes?

On 64 bit systems this is not going to be a problem. And maybe the std::vector<bool> specialization replaces the size type with a 64 bit one in some std libraries even on 32 bit systems -- I don't know.
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

jaap
Posts: 2076
Joined: Fri Jul 06, 2007 7:06 am UTC
Contact:

### Re: Removing repeated numbers from a very large list.

Yakk wrote:
jaap wrote:
Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),

Really?
A C++ vector is indexed by a size_t type, which is an unsigned type.

I said 2^32-1, not 2^31-1.
I would think it was able to cover the whole range even if it is 32 bit (though it is more likely to be 64 bit nowadays for most std libraries).

The lowest size_t is 0. All values exist, and on 32 bit systems it is 32 bit. Thus the highest value, by the law of pigeons, is 2^32-1.

If a vector has items in it from index 0 to index 2^32-1, then it has size 2^32.

Thesh
Posts: 5833
Joined: Tue Jan 12, 2010 1:55 am UTC

### Re: Removing repeated numbers from a very large list.

The maximum size is implementation-specific, and won't necessarily match the range of size_type. If you really want to know, print out vector::max_size().

Note that because max_size() is the total number of elements, if size_type is a 32 bit unsigned int, the maximum number of elements will be 2^32-1.
Summum ius, summa iniuria.

jaap
Posts: 2076
Joined: Fri Jul 06, 2007 7:06 am UTC
Contact:

### Re: Removing repeated numbers from a very large list.

Thesh wrote:Note that because max_size() is the total number of elements, if size_type is a 32 bit unsigned int, the maximum number of elements will be 2^32-1.

Bleghh! So any implementation of an STL vector that allows it to grow so large that the whole range of size_type values is needed to index it cannot be valid, as max_size() needs to return a size_type with value [maximum index + 1]. That's kinda sucky.

Thesh
Posts: 5833
Joined: Tue Jan 12, 2010 1:55 am UTC

### Re: Removing repeated numbers from a very large list.

It really doesn't matter. If you are limited to four gigs of ram, it's impossible to fill it anyway. Even if your program doesn't require an OS and has all the memory to itself, the stack, in most cases, will consume more than one byte.
Summum ius, summa iniuria.

jaap
Posts: 2076
Joined: Fri Jul 06, 2007 7:06 am UTC
Contact:

### Re: Removing repeated numbers from a very large list.

Thesh wrote:It really doesn't matter. If you are limited to four gigs of ram, it's impossible to fill it anyway. Even if your program doesn't require an OS and has all the memory to itself, the stack, in most cases, will consume more than one byte.

Well, it does matter in this case, as it is therefore impossible to make a vector<bool> of size 2^32 (only 512MB excluding overhead) if size_t is 32 bit.

Thesh
Posts: 5833
Joined: Tue Jan 12, 2010 1:55 am UTC

### Re: Removing repeated numbers from a very large list.

True, this is the one exception.
Summum ius, summa iniuria.

Lanzaa
Posts: 33
Joined: Fri Apr 06, 2007 6:24 am UTC

### Re: Removing repeated numbers from a very large list.

While in this case a bit array is probably best, it may be interesting to look at using Bloom filters for this problem.

http://en.wikipedia.org/wiki/Bloom_filter

If there were a very large number of repeated numbers, such that the total number of unique values in the file was small, a Bloom filter would be better than a bit array. The simplest reason is that it would take up less memory. I would wager that the Bloom filter would also be significantly faster than a naive bit array, because the smaller table would cause the cache to thrash less.

It sounds like undecim has already decided how to accomplish the goal. Has anyone attempted the problem themselves?

I would like to see the results of a speed test, maybe using a quick random number generator instead of a file, to remove the disk IO bottleneck and the effects of number conversion.

FuzzyPanda
Posts: 7
Joined: Thu Mar 25, 2010 11:34 pm UTC

### Re: Removing repeated numbers from a very large list.

It was specified that the integers are all 32-bit, so we don't need to worry about exceeding our 2^32-element bitmask, but what about only using a small portion of it? This seems pretty wasteful.

Here is a proposed modification which would increase the running time but potentially decrease the space used (which seems necessary for such a huge array).

The current running time is 2n: n to convert the initial array to the bitmask, plus n to convert the bitmask into the unique array.
The current space usage is 2^32 entries, which may exceed memory and slow down the computer - a problem which can be easily avoided.

Instead, we add an additional pass through the 2^32-element array to establish the lowest and highest numbers, and allocate a bitmask of size (highest - lowest). This could be enormously smaller than 2^32 entries, and with arrays of this size it's best not to be wasteful.

The new running time is 3n: n to establish the memory usage, plus n to convert the initial array to the bitmask, plus n to convert the bitmask into the unique array.
The new space usage is between 1 and 2^32 entries, which can be an excellent improvement.

NOTE: 3n and 2n are both O(n), so the improved memory performance should be well worth the extra processor time.

EDIT: To Lanzaa: Sorry, I somehow missed your post about the Bloom filter, which pretty much does what I wrote. Although I'm a little worried about the possibility of false positives.

Moose Hole
Posts: 398
Joined: Fri Jul 09, 2010 1:34 pm UTC

### Re: Removing repeated numbers from a very large list.

The worst case is (2^32)/2 unique integers, because the OP stated "none of them are unique." Like I said, use a hash table, which can be as small as you like in terms of hashes. On collisions, check for duplicates in the hash list, which will be small compared to the original file.

Thesh
Posts: 5833
Joined: Tue Jan 12, 2010 1:55 am UTC

### Re: Removing repeated numbers from a very large list.

Moose Hole wrote:The worst case is (2^32)/2 integers, because the OP stated "none of them are unique."

I wouldn't take that literally; I believe he meant "none of them are guaranteed to be unique."

Also, a hash table won't be efficient for this, because you have to keep every unique integer in the hash table to handle collisions, unless you use a perfect hash function (which is easy to write in this case). Even if there are only 2^31 unique integers, it still takes up 2^33 bytes. The only perfect hash function is essentially the identity, "int perfectIntHash(int a) { return a; }", and then you are pretty much just using a bitmask. Any hash function with an output smaller than 4 bytes will likely have a large number of collisions.
Summum ius, summa iniuria.

Yakk
Poster with most posts but no title.
Posts: 11077
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

### Re: Removing repeated numbers from a very large list.

The efficiency boost I'd aim at would be to hash my numbers into buckets for later consumption.

I.e., you hash your number, find the bucket, and drop it in the bucket. Ideally these writes would not require reading the contents of the bucket (maintain bucket write pointers and sizes elsewhere -- with few enough buckets, this has better cache-coherency). As the buckets threaten to overflow, you pick the most full bucket and request that the memory system fully cache it. Then you "rectify" the bucket, detecting collisions and compressing its state down to a bitmask.

The goal would be that, instead of randomly accessing the bitmask, you work in a small amount of memory that fits in a cache, then blit a bunch of "close" bitmask elements to the bitmask as your cache fills up. The write-without-reading trick can be used to avoid poisoning your cache (and the instruction that does this exists on modern x86 processors, which is just as important); so can the "I'm about to want to access this memory, can you bring it to the cache while I do something else" prefetch instructions.

In practice, write the basic bitmask one. It is reasonably robust and fast. See if it is fast enough. If it isn't fast enough, trade off robustness risk and programmer time for speed.
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

not baby Newt
Posts: 110
Joined: Wed Feb 03, 2010 11:30 pm UTC

### Re: Removing repeated numbers from a very large list.

If anyone wants an excuse for exotic solutions, assume we want a sort instead.

Btw, I've learned the name "bucket sort" as what you might use when the possible values are few (start by setting up a counter for each possible value). It got me confused for a second reading Yakk, who has bigger buckets.

Allocating 2^32 ints is no fun, but you could read the file 8 times, first discarding everything over 2^32/8, etc. Eight times the disk IO is way slow, but probably eight times faster to write than a sort that caches to disk effectively.

poohat
Posts: 230
Joined: Mon Apr 07, 2008 6:21 am UTC

### Re: Removing repeated numbers from a very large list.

FuzzyPanda wrote:It was specified that the integers are all 32-bit, so we don't need to worry about exceeding our 2^32-element bitmask, but what about only using a small portion of it? This seems pretty wasteful.

Here is a proposed modification which would increase the running time but potentially decrease the space used (which seems necessary for such a huge array).

The current running time is 2n: n to convert the initial array to the bitmask, plus n to convert the bitmask into the unique array.
The current space usage is 2^32 entries, which may exceed memory and slow down the computer - a problem which can be easily avoided.

Instead, we add an additional pass through the 2^32-element array to establish the lowest and highest numbers, and allocate a bitmask of size (highest - lowest). This could be enormously smaller than 2^32 entries, and with arrays of this size it's best not to be wasteful.

You could do it probabilistically by just looking at the first 100,000 numbers in the file: the min/max of that sample, combined with the number of repeated values and the standard deviation, should give you a reasonable estimate of the min/max of the whole array. And most good hash table implementations will automatically resize for you if you start getting too many collisions.

Although I wouldn't bother personally; I think people are exaggerating how long it's going to take. I doubt what he's doing will take more than a couple of hours. 2^32 isn't that big a number; it's only about 4 billion.