Moderators: phlip, Larson, Moderators General, Prelates
eta oin shrdlu wrote:Here's a hint: 2^32 bits is only 512 MiB; on a reasonably modern computer you should be able to allocate an array this big.
EvanED wrote:be aware that when most people say "regular expression" they really mean "something that is almost, but not quite, entirely unlike a regular expression"
Xanthir wrote:So, perhaps you didn't see the part where he has 2^32 of them.
(Plus, your suggestion basically boils down to "sort and remove duplicates", which he can do much more easily than using a database.)
Moose Hole wrote:I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.
Killamus wrote:exponentially increasing search time, unless it's sorted after each number entered.
Killamus wrote:Moose Hole wrote:I went out to find some database benchmarks and the best I could find was about 2000 transactions per second. For 2^32 records, that query takes almost 25 days. Unless I don't understand what transactions are.
No matter what, this is going to take a long time. Just reading 16 gigs into anything will take a while. My solution is the fastest in terms of programming time, which (Unless otherwise specified) is more important then execution time. Another option is just reading from the file, store unique integers into memory and discard the rest, which doesn't involve a SQL database, but runs into issues with running out of memory/exponentially increasing search time, unless it's sorted after each number entered.
phlip wrote:Killamus wrote:exponentially increasing search time, unless it's sorted after each number entered.
Quadratically, not exponentially. And it would still be quadratic if you sorted after each number (that's basically an insertion sort, which is O(n2)).
Using a bit mask for all the possible values, and setting bits as you see numbers come in, is linear in the size of the list. As is reading the whole list and using a sort method that relies on a specific limit on the values' range, like radix sort. The bitmask option is probably the best... it's probably the fastest, and many languages have a space-efficient bitmask data structure built-in.
cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but 8 bytes wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.
cogman wrote:Not to mention the fact that it is quite possible the full file might not need to be read in, after all, he said that every number has a duplicate, which means once the bit mask is full, there is no reason to look at any more files.
jaap wrote:cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but 8 bytes wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.
In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.
Yakk wrote:Computer Science is to Programming as Materials Physics is to Structural Engineering.
phlip wrote:cogman wrote:Not to mention the fact that it is quite possible the full file might not need to be read in, after all, he said that every number has a duplicate, which means once the bit mask is full, there is no reason to look at any more files.
Unlikely... they said they have 232 numbers, and there are 232 possible numbers... if they have any duplicates at all, then there'll be some numbers they have 0 of. Indeed, that's the point - they're trying to find out which numbers they have, and which ones they've missed.
jaap wrote:cogman wrote:Don't use a vector of bool. Bools are not 1 bit, but 8 bytes wide. A vector of bool would not be a manageable size. Rather use a bitset and do what is suggested here.
In C++ a vector<bool> has a special implementation that does use only 1 bit per bool.
tkx wrote:Still, you can stop when "total read + seen only once = 2^32". This allows you not to read the whole file (except if it contains only one number, in which case you can avoid reading the last number as well).
cogman wrote:How do they accomplish this? I thought it was somewhat impossible to do type checking in C++ (especially with a primitive). Does the compiler have to detect it?
#include <iostream>
#include <typeinfo>
template <class T> void foo()
{
std::cout << "Called generic foo<T> with T=" << typeid(T).name() << std::endl;
}
template <> void foo<bool>()
{
std::cout << "Called specific foo<bool>" << std::endl;
}
int main()
{
foo<int>(); // prints "generic" and some string that represents "int"
foo<double>(); // prints "generic" and some string that represents "double"
foo<bool>(); // prints "specific"
return 0;
}template <class T> class vector { generic array-based code; };
template <> class vector<bool> { specific bitfield-based code; };struct big_bitmask
{
std::vector<bool> low;
std::vector<bool> high;
big_bitmask():low(), high()
{
low.resize(1<<31);
high.resize(1<<31);
}
struct bool_pseudo_ref
{
std::vector<bool>* vec;
size_t offset;
bool_pseudo_ref( std::vector<bool>& base, size_t offset_ ): vec(&base), offset(offset_) {
Assert(vec);
Assert(offset < (1<<31));
}
operator bool() const {
Assert(vec);
if (!vec) return false;
return vec->at(offset);
};
bool_pseudo_ref& operator=(bool src)
{
Assert(vec);
if (!vec) return *this;
return vec->at(offset) = src;
return *this;
}
};
bool_pseudo_ref operator[](size_t i)
{
std::tr1::int64_t offset = i;
if (offset >= (1<<31))
{
offset -= (1<<31);
return bool_pseudo_ref( high, size_t(offset) );
} else {
return bool_pseudo_ref( low, size_t(offset) );
}
}
bool operator[](size_t i) const
{
if (offset >= (1<<31))
{
offset -= (1<<31);
return high.at(offset);
} else {
return low.at(offset);
}
}
};
Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),
jaap wrote:Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),
Really?
A C++ vector is indexed by a size_t type, which is an unsigned type.
It would think it was able to cover the whole range even if it is 32 bit (though it is more likely to be 64 bit nowadays for most std libraries).
Yakk wrote:jaap wrote:Yakk wrote:To avoid 32 bit issues (the size of a single vector is limited to 2^32-1),
Really?
A C++ vector is indexed by a size_t type, which is an unsigned type.
I said 2^32-1, not 2^31-1.It would think it was able to cover the whole range even if it is 32 bit (though it is more likely to be 64 bit nowadays for most std libraries).
The lowest size_t is 0. All values exist, and on 32 bit systems it is 32 bit. Thus the highest value, by the law of pigeons, is 2^32-1.
Thesh wrote:Note that because max_size() is the total number of elements, if size_type is a 32 bit unsigned int, the maximum number of elements will be 2^32-1.
Thesh wrote:It really doesn't matter. If you are limited to four gigs of ram, it's impossible to fill it anyway. Even if your program doesn't require an OS and has all the memory to itself, the stack, in most cases, will consume more than one byte.
Moose Hole wrote:The worst case is (2^32)/2 integers, because the OP stated "none of them are unique."
FuzzyPanda wrote:It was specified that the integers are all 32-bit, so we don't need to worry about exceeding our 2^32 element hash table, but what about only using a small portion of it? This seems pretty wasteful
Here is a proposed modification which would increase the time complexity and potentially decrease the space complexity (which seems necessary for such a huge array).
The current time complexity is 2n - n to convert the initial array to the hash table + n to convert the hash table into the unique array
The current space complexity is 2^32, which will exceed memory and slow down the computer - a problem which can be easily avoided
Instead, we add an additional n pass through the 2^32 element array to establish the lowest and highest numbers, and allocate a hash table of size (highest - lowest), this could be enormously smaller than 2^32 elements and with arrays of this size, it's best not to be wasteful..
Users browsing this forum: No registered users and 3 guests