Removing repeated numbers from a very large list.

Postby undecim » Wed Mar 02, 2011 5:56 pm UTC

I have a list of 2^32 32-bit integers (stored in a 16 GiB file). I know that none of them are unique, and I need to remove the repeated ints (including non-adjacent repeats).

But I have no idea where to start on this... Any hints?

Re: Removing repeated numbers from a very large list.

Postby eta oin shrdlu » Wed Mar 02, 2011 6:06 pm UTC

Here's a hint: 2^32 bits is only 512 MiB; on a reasonably modern computer you should be able to allocate an array this big.

Re: Removing repeated numbers from a very large list.

Postby jaap » Wed Mar 02, 2011 6:07 pm UTC

How about ...
Spoiler:
... using an array of 2^32 bits (512 MiB), like a BitSet (Java) or vector<bool> (C++)?

Spoiler:
For each element n on your list with duplicates, set the nth bit in your array. Once you've done that, you can construct a version of the list without duplicates by going through the array and including n only if the nth bit is set.
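For concreteness, here is a minimal single-pass sketch of that idea (a hedged illustration, not undecim's actual code: the file names and the raw little-endian binary layout are assumptions, and allocating a 2^32-element vector<bool> needs a 64-bit build). It keeps the first occurrence of each value:
Code: Select all
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // One bit per possible 32-bit value: 2^32 bits = 512 MiB.
    std::vector<bool> seen(uint64_t(1) << 32, false);

    std::FILE* in = std::fopen("ints.bin", "rb");    // hypothetical input
    std::FILE* out = std::fopen("unique.bin", "wb"); // hypothetical output
    if (!in || !out) return 1;

    uint32_t n;
    while (std::fread(&n, sizeof n, 1, in) == 1)
    {
        if (!seen[n])   // first occurrence: mark the bit and keep the value
        {
            seen[n] = true;
            std::fwrite(&n, sizeof n, 1, out);
        }
    }
    std::fclose(in);
    std::fclose(out);
    return 0;
}
Doing it in jaap's two phases instead (set all the bits, then walk the array and emit each n whose bit is set) also sorts the output as a side effect.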

Re: Removing repeated numbers from a very large list.

Postby undecim » Wed Mar 02, 2011 6:20 pm UTC

eta oin shrdlu wrote:Here's a hint: 2^32 bits is only 512 MiB; on a reasonably modern computer you should be able to allocate an array this big.


2^32 is the number of integers, and they are 32-bit integers.

In other words, that's 2^37 bits, or 16 GiB.


EDIT: Sorry, I misunderstood... I see what you're saying now.

This should do what I need. Thank you.

Re: Removing repeated numbers from a very large list.

Postby achan1058 » Wed Mar 02, 2011 6:25 pm UTC

Even if it were a longer list, you could split the file into "buckets" covering value ranges of manageable size.
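A hedged sketch of what that bucketing could look like, splitting on the top 4 bits into 16 temporary files so that each range needs only a 2^28-bit (32 MiB) array; all file names are made up:
Code: Select all
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    // Pass 1: partition the input by the top 4 bits into 16 bucket files.
    std::FILE* in = std::fopen("ints.bin", "rb");
    std::FILE* buckets[16];
    for (int i = 0; i < 16; ++i)
        buckets[i] = std::fopen(("bucket" + std::to_string(i) + ".bin").c_str(), "wb");

    uint32_t n;
    while (std::fread(&n, sizeof n, 1, in) == 1)
        std::fwrite(&n, sizeof n, 1, buckets[n >> 28]);
    std::fclose(in);

    // Pass 2: dedup each 2^28-value range with its own small bit array.
    std::FILE* out = std::fopen("unique.bin", "wb");
    for (int i = 0; i < 16; ++i)
    {
        std::fclose(buckets[i]);
        std::FILE* b = std::fopen(("bucket" + std::to_string(i) + ".bin").c_str(), "rb");
        std::vector<bool> seen(uint32_t(1) << 28, false);
        while (std::fread(&n, sizeof n, 1, b) == 1)
        {
            uint32_t off = n & ((uint32_t(1) << 28) - 1);
            if (!seen[off]) { seen[off] = true; std::fwrite(&n, sizeof n, 1, out); }
        }
        std::fclose(b);
    }
    std::fclose(out);
    return 0;
}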

Re: Removing repeated numbers from a very large list.

Postby MHD » Mon Mar 07, 2011 8:42 am UTC

I would suggest doing some sort of partition sorting of the file. But the bitmask looks cool too.

Re: Removing repeated numbers from a very large list.

Postby Killamus » Tue Mar 08, 2011 2:29 am UTC

Insert them into a database via a program, then use "Select distinct <columnname> from <tablename>"
Then write the results to a file.

Re: Removing repeated numbers from a very large list.

Postby Xanthir » Tue Mar 08, 2011 6:43 am UTC

So, perhaps you didn't see the part where he has 2^32 of them.

(Plus, your suggestion basically boils down to "sort and remove duplicates", which he can do much more easily than using a database.)

Re: Removing repeated numbers from a very large list.

Postby Moose Hole » Tue Mar 08, 2011 2:44 pm UTC

You could put them into a hash table. Then, go through each bucket and sort/delete duplicates, because the bucket size will be more manageable than your file. You could alternatively do a sort/delete on each insert to keep the table from becoming large.
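A minimal sketch of that hash-table route using the standard container rather than hand-rolled buckets (file names are placeholders). Note the set must hold every distinct value it has seen, so the worst case costs several gigabytes, far more than the 512 MiB bit array:
Code: Select all
#include <cstdint>
#include <cstdio>
#include <unordered_set>

int main()
{
    std::unordered_set<uint32_t> seen;
    std::FILE* in = std::fopen("ints.bin", "rb");
    std::FILE* out = std::fopen("unique.bin", "wb");
    if (!in || !out) return 1;

    uint32_t n;
    while (std::fread(&n, sizeof n, 1, in) == 1)
        if (seen.insert(n).second)   // .second is true only for new values
            std::fwrite(&n, sizeof n, 1, out);

    std::fclose(in);
    std::fclose(out);
    return 0;
}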

Re: Removing repeated numbers from a very large list.

Postby Killamus » Tue Mar 08, 2011 3:35 pm UTC

Xanthir wrote:So, perhaps you didn't see the part where he has 2^32 of them.

(Plus, your suggestion basically boils down to "sort and remove duplicates", which he can do much more easily than using a database.)


Yeah, I didn't say it'd be fast; I said it'd be easy. I can write a C# program to insert every line from a file into a database in under 5 minutes. I gave him the select statement to use. Then just copy/paste the results into a file.

Re: Removing repeated numbers from a very large list.

Postby Moose Hole » Tue Mar 08, 2011 3:44 pm UTC

I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.

Re: Removing repeated numbers from a very large list.

Postby Spambot5546 » Tue Mar 08, 2011 3:49 pm UTC

"Transactions" usually refers to individual database "hits". The query he's describing will involve one transaction - the query - after which the database performs the shit-ton of processing necessary to sort and dedupe and then returns the data. The only things that make the database option slower are the overhead of hitting the database and of the database returning the data.
"It is bitter – bitter", he answered,
"But I like it
Because it is bitter,
And because it is my heart."
Spambot5546
 
Posts: 1350
Joined: Thu Apr 29, 2010 7:34 pm UTC

Re: Removing repeated numbers from a very large list.

Postby Killamus » Tue Mar 08, 2011 5:03 pm UTC

Moose Hole wrote:I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.


No matter what, this is going to take a long time. Just reading 16 gigs into anything will take a while. My solution is the fastest in terms of programming time, which (unless otherwise specified) is more important than execution time. Another option is just reading from the file, storing unique integers in memory and discarding the rest, which doesn't involve a SQL database, but runs into issues with running out of memory/exponentially increasing search time, unless it's sorted after each number entered.

Re: Removing repeated numbers from a very large list.

Postby Thesh » Wed Mar 09, 2011 8:17 am UTC

A bitmask is pretty quick to write, and that would probably be the way I would do it as well. It's only a few lines of code to find/set each bit. It will probably run in about half to a third of the time the database would take (the DB has to write to the transaction log, and then to the data page when it checkpoints, on top of the reads... if you only have one disk it will probably be worse).

Re: Removing repeated numbers from a very large list.

Postby phlip » Wed Mar 09, 2011 8:38 am UTC

Killamus wrote:exponentially increasing search time, unless it's sorted after each number entered.

Quadratically, not exponentially. And it would still be quadratic if you sorted after each number (that's basically an insertion sort, which is O(n^2)).

Using a bit mask for all the possible values, and setting bits as you see numbers come in, is linear in the size of the list. As is reading the whole list and using a sort method that relies on a specific limit on the values' range, like radix sort. The bitmask option is probably the best... it's probably the fastest, and many languages have a space-efficient bitmask data structure built-in.
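For illustration, here is a sketch of the radix-sort flavour of that argument: four byte-wide counting passes, each linear in the length of the list. This in-memory version only shows the mechanics; the real 16 GiB input would have to be sorted in chunks. Once sorted, duplicates are adjacent and one more linear scan drops them:
Code: Select all
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort of 32-bit values, one byte per pass (4 passes total).
void radix_sort(std::vector<uint32_t>& a)
{
    std::vector<uint32_t> tmp(a.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        size_t count[257] = {0};
        for (size_t i = 0; i < a.size(); ++i)      // histogram of this byte
            ++count[((a[i] >> shift) & 0xFF) + 1];
        for (int b = 0; b < 256; ++b)              // prefix sums = bucket starts
            count[b + 1] += count[b];
        for (size_t i = 0; i < a.size(); ++i)      // stable scatter into tmp
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        a.swap(tmp);
    }
}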

Re: Removing repeated numbers from a very large list.

Postby cogman » Thu Mar 10, 2011 8:59 pm UTC

Don't use a vector of bool. Bools are not 1 bit, but a whole byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

Killamus wrote:
Moose Hole wrote:I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.


No matter what, this is going to take a long time. Just reading 16 gigs into anything will take a while. My solution is the fastest in terms of programming time, which (unless otherwise specified) is more important than execution time. Another option is just reading from the file, storing unique integers in memory and discarding the rest, which doesn't involve a SQL database, but runs into issues with running out of memory/exponentially increasing search time, unless it's sorted after each number entered.

Umm... Using a database is going to increase the time it takes to do this by a very large factor. Disk IO is going to be the main bottleneck here, it's true. But shoving this into a database just adds stupid amounts of overhead. A standard HD can read around 40MB/s, which means a non-database solution will take somewhere around 6 minutes to complete.

Using a database, on the other hand, adds HUGE amounts of IO (already the main bottleneck), which would slow things down by a very large factor.

phlip wrote:
Killamus wrote:exponentially increasing search time, unless it's sorted after each number entered.

Quadratically, not exponentially. And it would still be quadratic if you sorted after each number (that's basically an insertion sort, which is O(n^2)).

Using a bit mask for all the possible values, and setting bits as you see numbers come in, is linear in the size of the list. As is reading the whole list and using a sort method that relies on a specific limit on the values' range, like radix sort. The bitmask option is probably the best... it's probably the fastest, and many languages have a space-efficient bitmask data structure built-in.

Not to mention the fact that it is quite possible the full file might not need to be read in; after all, he said that every number has a duplicate, which means once the bit mask is full there is no reason to look at any more of the file.

Re: Removing repeated numbers from a very large list.

Postby jaap » Thu Mar 10, 2011 10:18 pm UTC

cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but a whole byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.

Re: Removing repeated numbers from a very large list.

Postby phlip » Thu Mar 10, 2011 10:25 pm UTC

cogman wrote:Not to mention the fact that it is quite possible the full file might not need to be read in; after all, he said that every number has a duplicate, which means once the bit mask is full there is no reason to look at any more of the file.

Unlikely... they said they have 2^32 numbers, and there are 2^32 possible numbers... if they have any duplicates at all, then there'll be some numbers they have 0 of. Indeed, that's the point - they're trying to find out which numbers they have, and which ones they've missed.

Re: Removing repeated numbers from a very large list.

Postby _Axle_ » Thu Mar 10, 2011 11:17 pm UTC

jaap wrote:
cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but a whole byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.


While that is true, it saves memory but eats more CPU time for operations.

Re: Removing repeated numbers from a very large list.

Postby tkx » Thu Mar 10, 2011 11:51 pm UTC

phlip wrote:
cogman wrote:Not to mention the fact that it is quite possible the full file might not need to be read in; after all, he said that every number has a duplicate, which means once the bit mask is full there is no reason to look at any more of the file.

Unlikely... they said they have 2^32 numbers, and there are 2^32 possible numbers... if they have any duplicates at all, then there'll be some numbers they have 0 of. Indeed, that's the point - they're trying to find out which numbers they have, and which ones they've missed.


Still, you can stop when "total read + seen only once = 2^32". This allows you not to read the whole file (except if it contains only one number, in which case you can avoid reading the last number as well).

Re: Removing repeated numbers from a very large list.

Postby cogman » Fri Mar 11, 2011 12:29 am UTC

jaap wrote:
cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but a whole byte wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.

In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.

That's crazy, definitely not what I expected. (Guess you learn something new every day.)

How do they accomplish this? I thought it was somewhat impossible to do type checking in C++ (especially with a primitive). Does the compiler have to detect it?

Re: Removing repeated numbers from a very large list.

Postby phlip » Fri Mar 11, 2011 1:59 am UTC

tkx wrote:Still, you can stop when "total read + seen only once = 2^32". This allows you not to read the whole file (except if it contains only one number, in which case you can avoid reading the last number as well).

... how? Surely if there are any numbers you haven't seen, and any numbers you haven't read, you have to read them in case any of them are ones you haven't seen yet... and the only way there'll be no numbers you haven't seen is if there's exactly one of each and you've already read the whole file.

cogman wrote:How do they accomplish this? I thought it was somewhat impossible to do type checking in C++ (especially with a primitive). Does the compiler have to detect it?

It's just how templates work in C++... you can have special cases:
Code: Select all
#include <iostream>
#include <typeinfo>
template <class T> void foo()
{
  std::cout << "Called generic foo<T> with T=" << typeid(T).name() << std::endl;
}

template <> void foo<bool>()
{
  std::cout << "Called specific foo<bool>" << std::endl;
}

int main()
{
  foo<int>(); // prints "generic" and some string that represents "int"
  foo<double>(); // prints "generic" and some string that represents "double"
  foo<bool>(); // prints "specific"
  return 0;
}
When compiled, unless the optimiser gets involved, this will end up with three different foo() functions in the program - foo<int>, foo<double> and foo<bool>. The first two will take their definition from the general template, and the third will take its definition from the override. The same works with template classes and whatever. So <vector> contains something like:
Code: Select all
template <class T> class vector { generic array-based code; };
template <> class vector<bool> { specific bitfield-based code; };

Re: Removing repeated numbers from a very large list.

Postby Yakk » Tue Mar 15, 2011 3:13 pm UTC

I had a nice solution to this. I guess I didn't hit "post".

In any case, what I'd do is write a big-bit-mask class that has 2^32 entries (indexed by a size_t).

To avoid 32 bit issues (the size of a single vector is limited to 2^32-1), I'd have two vectors (high and low) each of size 2^31 (or 1<<31).

operator[](size_t) would return a "pseudo_bool_ref", a class that holds a pointer to a vector and an offset, with overridden operator bool() and operator=(bool).
operator[](size_t) const would return a bool.

If the input was bigger than or equal to 2^31, it would use the high vector -- otherwise the low vector.

The main code would consist of reading through the array, checking [] to see if each value was already there -- and if it was, noting the duplicate. Then it would set the bit.

The bottleneck in that program would be disk IO, and to a lesser extent the memory cache. So the next optimization step would involve tweaking the disk IO pipeline so that pages are being delivered from disk, summarily processed, and then thrown away. Once that was optimized, I'd get to work on the memory-locality problem of the bit-array (if it turned out to be a problem) if I wanted extreme performance.

But the naive implementation I mentioned above will be pretty close to disk IO speed.

...

Aha, found this written post:

This will work on 32-bit systems (creating a single vector of size 2^32 has issues on 32-bit systems):
Code: Select all
#include <cassert>
#include <cstddef>
#include <vector>

struct big_bitmask
{
  // Two half-sized vectors, so no single vector needs 2^32 elements.
  std::vector<bool> low;
  std::vector<bool> high;
  big_bitmask():low(), high()
  {
    low.resize(size_t(1)<<31);
    high.resize(size_t(1)<<31);
  }
  struct bool_pseudo_ref
  {
    std::vector<bool>* vec;
    size_t offset;
    bool_pseudo_ref( std::vector<bool>& base, size_t offset_ ): vec(&base), offset(offset_) {
      assert(vec);
      assert(offset < (size_t(1)<<31));
    }
    operator bool() const {
      assert(vec);
      if (!vec) return false;
      return vec->at(offset);
    }
    bool_pseudo_ref& operator=(bool src)
    {
      assert(vec);
      if (!vec) return *this;
      vec->at(offset) = src;
      return *this;
    }
  };
  // Note: indices 0 to 2^32-1 all fit in size_t even on 32-bit systems.
  bool_pseudo_ref operator[](size_t i)
  {
    if (i >= (size_t(1)<<31))
    {
      return bool_pseudo_ref( high, i - (size_t(1)<<31) );
    } else {
      return bool_pseudo_ref( low, i );
    }
  }
  bool operator[](size_t i) const
  {
    if (i >= (size_t(1)<<31))
    {
      return high.at(i - (size_t(1)<<31));
    } else {
      return low.at(i);
    }
  }
};

That should be a relatively fully functional bitmask that supports 2^32 elements.

Simply read in each element. Check that your bitmask[value] is false -- if it is true, you have a duplicate. Then set bitmask[value] to true.
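That loop might look something like this (a sketch assuming the big_bitmask class above; the file names are placeholders, and it keeps the first occurrence of each value):
Code: Select all
#include <cstdint>
#include <cstdio>

int main()
{
    big_bitmask seen;   // ~512 MiB of bits, all initially false
    std::FILE* in = std::fopen("ints.bin", "rb");
    std::FILE* out = std::fopen("unique.bin", "wb");
    if (!in || !out) return 1;

    uint32_t n;
    while (std::fread(&n, sizeof n, 1, in) == 1)
    {
        if (!seen[n])   // bit clear: first time this value appears
        {
            seen[n] = true;
            std::fwrite(&n, sizeof n, 1, out);
        }
    }
    std::fclose(in);
    std::fclose(out);
    return 0;
}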

Make sure you don't needlessly copy your big_bitmask around -- it is reasonably large.

Re: Removing repeated numbers from a very large list.

Postby jaap » Tue Mar 15, 2011 4:25 pm UTC

Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),

Really?
A C++ vector is indexed by a size_t type, which is an unsigned type. I would think it was able to cover the whole range even if it is 32-bit (though it is more likely to be 64-bit nowadays for most std libraries). Memory addressing should also not be an issue.

Re: Removing repeated numbers from a very large list.

Postby Yakk » Tue Mar 15, 2011 4:36 pm UTC

jaap wrote:
Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),

Really?
A C++ vector is indexed by a size_t type, which is an unsigned type.

I said 2^32-1, not 2^31-1.
I would think it was able to cover the whole range even if it is 32-bit (though it is more likely to be 64-bit nowadays for most std libraries).

The lowest size_t is 0. All values exist, and on 32-bit systems it is 32 bits wide. Thus the highest value, by the law of pigeons, is 2^32-1.

I believe that on 32-bit systems, std libraries still use 32-bit sizes?

On 64-bit systems this is not going to be a problem. And maybe the std::vector<bool> specialization replaces the size type with a 64-bit one in some std libraries even on 32-bit systems -- I don't know.

Re: Removing repeated numbers from a very large list.

Postby jaap » Tue Mar 15, 2011 5:02 pm UTC

Yakk wrote:
jaap wrote:
Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),

Really?
A C++ vector is indexed by a size_t type, which is an unsigned type.

I said 2^32-1, not 2^31-1.
I would think it was able to cover the whole range even if it is 32-bit (though it is more likely to be 64-bit nowadays for most std libraries).

The lowest size_t is 0. All values exist, and on 32-bit systems it is 32 bits wide. Thus the highest value, by the law of pigeons, is 2^32-1.


If a vector has items in it from index 0 to index 2^32-1, then it has size 2^32.

Re: Removing repeated numbers from a very large list.

Postby Thesh » Tue Mar 15, 2011 5:14 pm UTC

The maximum size is implementation-specific, and won't necessarily match size_type. If you really want to know, print out vector::max_size().

Note that because max_size() is the total number of elements, if size_type is a 32-bit unsigned int, the maximum number of elements will be 2^32-1.

Re: Removing repeated numbers from a very large list.

Postby jaap » Tue Mar 15, 2011 5:28 pm UTC

Thesh wrote:Note that because max_size() is the total number of elements, if size_type is a 32-bit unsigned int, the maximum number of elements will be 2^32-1.

Bleghh! So any implementation of an STL vector that allows it to grow so large that the whole range of size_type values is needed to index it cannot be valid, since max_size() needs to return a size_type with value [maximum index + 1]. That's kinda sucky.

Re: Removing repeated numbers from a very large list.

Postby Thesh » Tue Mar 15, 2011 5:34 pm UTC

It really doesn't matter. If you are limited to four gigs of ram, it's impossible to fill it anyway. Even if your program doesn't require an OS and has all the memory to itself, the stack, in most cases, will consume more than one byte.

Re: Removing repeated numbers from a very large list.

Postby jaap » Tue Mar 15, 2011 5:37 pm UTC

Thesh wrote:It really doesn't matter. If you are limited to four gigs of ram, it's impossible to fill it anyway. Even if your program doesn't require an OS and has all the memory to itself, the stack, in most cases, will consume more than one byte.

Well, it does matter in this case, as it is therefore impossible to make a vector<bool> of size 2^32 (only 512 MiB excluding overhead) if size_t is 32-bit.

Re: Removing repeated numbers from a very large list.

Postby Thesh » Tue Mar 15, 2011 5:42 pm UTC

True, this is the one exception.

Re: Removing repeated numbers from a very large list.

Postby Lanzaa » Wed Mar 16, 2011 7:42 am UTC

While a bit array is probably best in this case, it may be interesting to look at Bloom filters for this problem.

http://en.wikipedia.org/wiki/Bloom_filter

If there were a very large number of repeats, such that the total number of distinct values in the file was small, a Bloom filter would be better than a bit array. The simplest reason is that it would take up less memory. I would wager that the Bloom filter would also be significantly faster than a naive bit array, because it would cause the cache to thrash less.
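A toy two-hash Bloom filter along those lines (the filter size and the mixing function are arbitrary, untuned choices). Keep in mind its answers are one-sided: "definitely new" is reliable, but "maybe seen" can be wrong, so a filter alone may wrongly discard some genuinely new values unless they get a second, exact check:
Code: Select all
#include <cstddef>
#include <cstdint>
#include <vector>

struct bloom
{
    std::vector<bool> bits;
    explicit bloom(size_t n) : bits(n, false) {}

    // 64-bit mixer (MurmurHash3-style finaliser); the two halves of the
    // result serve as two roughly independent hash functions.
    static uint64_t mix(uint64_t x)
    {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33; return x;
    }

    void insert(uint32_t v)
    {
        uint64_t h = mix(v);
        bits[h % bits.size()] = true;
        bits[(h >> 32) % bits.size()] = true;
    }
    bool maybe_contains(uint32_t v) const
    {
        uint64_t h = mix(v);
        return bits[h % bits.size()] && bits[(h >> 32) % bits.size()];
    }
};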

It sounds like undecim has already decided how to accomplish the goal. Has anyone attempted the problem themselves?


I would like to see the results of a speed test, maybe using a quick random number generator instead of a file to remove the disk IO bottleneck and effects of number conversion.

Re: Removing repeated numbers from a very large list.

Postby FuzzyPanda » Thu Mar 17, 2011 1:54 am UTC

It was specified that the integers are all 32-bit, so we don't need to worry about exceeding our 2^32-element hash table, but what about only using a small portion of it? That seems pretty wasteful.

Here is a proposed modification which would increase the time complexity but potentially decrease the space complexity (which seems necessary for such a huge array).

The current time complexity is 2n: n to convert the initial array to the hash table, plus n to convert the hash table into the unique array.
The current space complexity is 2^32, which will exceed memory and slow down the computer - a problem which can be easily avoided.

Instead, we add an additional pass through the 2^32-element array to establish the lowest and highest numbers, and allocate a hash table of size (highest - lowest). This could be enormously smaller than 2^32 elements, and with arrays of this size it's best not to be wasteful.

The new time complexity is 3n: n to establish the memory usage, n to convert the initial array to the bitmask, and n to convert the bitmask into the unique array.
The new space complexity is between 1 and 2^32, which can be an excellent improvement.

NOTE: 3n and 2n are both O(n), so the improved memory performance should be well worth the extra processor time.
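A sketch of that two-pass variant (file names are placeholders; the size arithmetic is done in 64 bits in case the range spans nearly all of 2^32, which would again need a 64-bit build):
Code: Select all
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // Pass 1: find the lowest and highest values present.
    std::FILE* in = std::fopen("ints.bin", "rb");
    if (!in) return 1;
    uint32_t n, lo = UINT32_MAX, hi = 0;
    while (std::fread(&n, sizeof n, 1, in) == 1)
    {
        if (n < lo) lo = n;
        if (n > hi) hi = n;
    }
    if (lo > hi) return 1;   // empty file

    // Pass 2: dedup with a bit array sized to the actual value range.
    std::vector<bool> seen(uint64_t(hi) - lo + 1, false);
    std::rewind(in);
    std::FILE* out = std::fopen("unique.bin", "wb");
    while (std::fread(&n, sizeof n, 1, in) == 1)
    {
        if (!seen[n - lo])
        {
            seen[n - lo] = true;
            std::fwrite(&n, sizeof n, 1, out);
        }
    }
    std::fclose(in);
    std::fclose(out);
    return 0;
}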

EDIT: To Lanzaa: Sorry, I somehow missed your post about the Bloom filter, which pretty much does what I wrote. Although I'm a little worried about the possibility of false positives.

Re: Removing repeated numbers from a very large list.

Postby Moose Hole » Thu Mar 17, 2011 1:47 pm UTC

The worst case is (2^32)/2 distinct integers, because the OP stated "none of them are unique." Like I said, use a hash table, which can be as small as you like in terms of hash buckets. On collisions, check for duplicates within the bucket's list, which will be small compared to the original file.

Re: Removing repeated numbers from a very large list.

Postby Thesh » Thu Mar 17, 2011 2:02 pm UTC

Moose Hole wrote:The worst case is (2^32)/2 distinct integers, because the OP stated "none of them are unique."


I wouldn't take that literally; I believe he meant "none of them are guaranteed to be unique."

Also, a hash table won't be efficient for this, because you have to keep every unique integer in the hash table to handle collisions, unless you use a perfect hash function (which is easy to write in this case). Even if there are only 2^31 unique integers, it still takes up 2^33 bytes. The only perfect hash function is "int perfectIntHash(int a) { return a; }", and then you are pretty much just using a bitmask. Any hash value smaller than 4 bytes will likely produce a large number of collisions.

Re: Removing repeated numbers from a very large list.

Postby Yakk » Thu Mar 17, 2011 2:32 pm UTC

The efficiency boost I'd aim at would be to hash my numbers into buckets for later consumption.

I.e., you hash your number, find the bucket, and drop it in. Ideally these writes would not require reading the contents of the bucket (maintain bucket write pointers and sizes elsewhere -- with few enough buckets, this has better cache-coherency). As the buckets threaten to overflow, you pick the fullest bucket and request that the memory system fully cache it. Then you "rectify" the bucket, detecting collisions and compressing its state down to a bitmask.

The goal would be that, instead of randomly accessing the bitmask, you work in a small amount of memory that fits in cache, then blit a bunch of "close" bitmask elements to the bitmask as your cache fills up. The write-without-reading trick can be used to avoid poisoning your cache (and the instruction that does this exists on modern x86 processors, which is just as important), as can the "I'm about to want to access this memory, can you bring it into the cache while I do something else" instructions.
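A rough sketch of that batching structure, minus the prefetch and write-without-reading instructions (the 256-way split and the flush threshold are arbitrary choices; each flush touches only one 2^24-bit, i.e. 2 MiB, slice of the bit array, so the random writes stay within a cache-friendly window):
Code: Select all
#include <cstddef>
#include <cstdint>
#include <vector>

struct batched_bitmask
{
    std::vector<bool> bits;            // 2^32 bits = 512 MiB (64-bit build)
    std::vector<uint32_t> buf[256];    // staging buffer per 2^24-value range

    batched_bitmask() : bits(uint64_t(1) << 32, false) {}

    // Stage a value; drain its range's buffer once it grows big enough.
    void add(uint32_t v)
    {
        buf[v >> 24].push_back(v);
        if (buf[v >> 24].size() >= 4096) flush(v >> 24);
    }

    // Drain one buffer into its slice of the bit array, spotting duplicates.
    void flush(unsigned k)
    {
        for (size_t i = 0; i < buf[k].size(); ++i)
        {
            uint32_t v = buf[k][i];
            if (bits[v]) { /* duplicate detected here */ }
            else bits[v] = true;
        }
        buf[k].clear();
    }

    void flush_all() { for (unsigned k = 0; k < 256; ++k) flush(k); }
};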

In practice, write the basic bitmask one. It is reasonably robust and fast. See if it is fast enough. If it isn't fast enough, trade off robustness risk and programmer time for speed.

Re: Removing repeated numbers from a very large list.

Postby not baby Newt » Mon Mar 21, 2011 1:53 pm UTC

If anyone wants an excuse for exotic solutions, assume we want a sort instead.

Btw, I've learned the name "bucket sort" as what you might use when the possible values are few (start by setting up a counter for each possible value). That got me confused for a second reading Yakk, who has bigger buckets.

Allocating 2^32 ints is no fun, but you could read the file 8 times, first discarding everything over 2^32/8, and so on. Eight times the disk IO is way slow, but probably eight times faster to write than a sort that caches to disk effectively.
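A sketch of that eight-pass version (file names are placeholders): every pass streams the whole file but only handles one eighth of the value range, so the bit array shrinks to 2^29 bits, a mere 64 MiB:
Code: Select all
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::FILE* out = std::fopen("unique.bin", "wb");
    for (uint32_t pass = 0; pass < 8; ++pass)
    {
        std::vector<bool> seen(uint32_t(1) << 29, false);  // 64 MiB of bits
        std::FILE* in = std::fopen("ints.bin", "rb");
        uint32_t n;
        while (std::fread(&n, sizeof n, 1, in) == 1)
        {
            if ((n >> 29) != pass) continue;               // not this pass's slice
            uint32_t off = n & ((uint32_t(1) << 29) - 1);
            if (!seen[off]) { seen[off] = true; std::fwrite(&n, sizeof n, 1, out); }
        }
        std::fclose(in);
    }
    std::fclose(out);
    return 0;
}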

Re: Removing repeated numbers from a very large list.

Postby poohat » Wed Mar 23, 2011 5:22 am UTC

FuzzyPanda wrote:It was specified that the integers are all 32-bit, so we don't need to worry about exceeding our 2^32-element hash table, but what about only using a small portion of it? That seems pretty wasteful.

Here is a proposed modification which would increase the time complexity but potentially decrease the space complexity (which seems necessary for such a huge array).

The current time complexity is 2n: n to convert the initial array to the hash table, plus n to convert the hash table into the unique array.
The current space complexity is 2^32, which will exceed memory and slow down the computer - a problem which can be easily avoided.

Instead, we add an additional pass through the 2^32-element array to establish the lowest and highest numbers, and allocate a hash table of size (highest - lowest). This could be enormously smaller than 2^32 elements, and with arrays of this size it's best not to be wasteful.

You could do it probabilistically by just looking at the first 100000 numbers in the file: the min/max of those, combined with the number of repeated values and the standard deviation, should give you a reasonable estimate of the min/max of the whole array. And most good hash table implementations will automatically resize for you if you start getting too many collisions.

Although I wouldn't bother personally; I think people are exaggerating how long it's going to take. I doubt what he's doing will take more than a couple of hours. 2^32 isn't that big a number; it's only about 4 billion.

