Page 1 of 1

Choosing datastructures

Posted: Sun Oct 24, 2010 12:04 pm UTC
by HungryHobo
I've been thinking lately about the huge difference that using the right datastructures in your code can make.
Now I've learned about the basic data structures but I've well aware that there's quite a few traps situations where they're far less than optimal.

I was thinking of trying to make an app where I could tick boxes or choose options to specify my criteria.

exampe:
"Will it be updated more often than read or read more often than updated"

or perhaps something to be able to specify the expected ratio of reads:inserts:updates:deletions since I know some datatypes are cheaper to update at a cost of read speed or cheaper to read at the cost of slower updates or inserts.

Some datastructures are better in highly parallel environments than others.
Some are better when the amount of memory available is significantly larger than the amount of data to be stored.

etc.

And once the parameters are set it lists the datastructures which are the best fit for your requirements.

I'm familiar with most of the common ones,
BST's, lists, arrays, heaps, queues, hashtables etc etc

but there's no shortage of ones which I've never heard of - even the list on wikipedia http://en.wikipedia.org/wiki/List_of_data_structures has plenty I've never even heard of.

So is my idea madness or feasible?

If feasible any suggestions for criteria/options?

Re: Choosing datastructures

Posted: Sun Oct 24, 2010 3:16 pm UTC
by Axidos
Your idea sounds useful if it can be comprehensive enough, and as long as nobody comes to believe it's the be-all end-all (I am sure there are useful structures it will not know, or structures it will know but fail to recommend in some situations it should).

I only have a comment on the design of the test: Start off with a handful of generic questions to narrow down the scope some, then begin dynamically serving new questions to narrow the options down within the likely scope.
E.G. If you ask whether they're working with numbers, you can immediately grade string-exclusive data structures as irrelevant, and serve up some new questions which try to narrow down the sorts of numeric data structures they should use. My concern is that if you try to serve them every possible question, 90% of them will probably turn out to be unnecessary in that they didn't really contribute to finding out the answer.

Since that could make the test seem endless after a little while, I'd suggest some kind of progress meter saying basically "This is how close I am to being reasonably sure of your options", if you take that approach.

Re: Choosing datastructures

Posted: Sun Oct 24, 2010 8:10 pm UTC
by letterX
Axidos wrote:Your idea sounds useful if it can be comprehensive enough, and as long as nobody comes to believe it's the be-all end-all (I am sure there are useful structures it will not know, or structures it will know but fail to recommend in some situations it should).

I only have a comment on the design of the test: Start off with a handful of generic questions to narrow down the scope some, then begin dynamically serving new questions to narrow the options down within the likely scope.
E.G. If you ask whether they're working with numbers, you can immediately grade string-exclusive data structures as irrelevant, and serve up some new questions which try to narrow down the sorts of numeric data structures they should use. My concern is that if you try to serve them every possible question, 90% of them will probably turn out to be unnecessary in that they didn't really contribute to finding out the answer.

Since that could make the test seem endless after a little while, I'd suggest some kind of progress meter saying basically "This is how close I am to being reasonably sure of your options", if you take that approach.


Speaking of: there's a datastructure for this kind of process!

Re: Choosing datastructures

Posted: Tue Oct 26, 2010 9:43 am UTC
by Carnildo
Axidos wrote:I only have a comment on the design of the test: Start off with a handful of generic questions to narrow down the scope some, then begin dynamically serving new questions to narrow the options down within the likely scope.
E.G. If you ask whether they're working with numbers, you can immediately grade string-exclusive data structures as irrelevant, and serve up some new questions which try to narrow down the sorts of numeric data structures they should use. My concern is that if you try to serve them every possible question, 90% of them will probably turn out to be unnecessary in that they didn't really contribute to finding out the answer.

There are two ways of classifying data structures. The first is on what operations are possible (eg. determining relative ordering of two elements isn't possible in a set); the second is on what operations are fast (eg. a perfect hash table has O(1) lookup time). I think I'd start by asking what operations need to be possible.

Keep in mind that some data structures have unusual limitations on "possible". For example, with a Bloom filter, "is this element in the data structure" is not a possible operation, while "is this element not in the data structure" is possible.

Re: Choosing datastructures

Posted: Tue Nov 02, 2010 7:57 pm UTC
by MHD
This is actually an interesting concept... Might make a simple non interactive decision tree.

But things like a 2,3 tree with list ends as a sequence is pretty much all purpose... So what about them?

Re: Choosing datastructures

Posted: Wed Nov 03, 2010 3:17 am UTC
by Carnildo
MHD wrote:This is actually an interesting concept... Might make a simple non interactive decision tree.

But things like a 2,3 tree with list ends as a sequence is pretty much all purpose... So what about them?

What operations are fast? What operations are slow? There are plenty of data structures that can perform all common operations, but they don't do them all equally well. You wouldn't use a linked list if you needed fast random access, and you wouldn't use an array if you needed fast random inserts. This is about picking the best data structure for a given task.

Re: Choosing datastructures

Posted: Wed Nov 03, 2010 2:56 pm UTC
by Yakk
And why use O-notation?

Give an estimate of the cost of copying the data being stored and the keys, and comparing. These estimates should be ranges.

Give a range of possible container sizes.

Generate explicit running times from the above (which can be distributions, because some data structures are effectively random in their performance). Expose best, worst, average and 95% confidence running time ranges for the entire domain above, possibly on a range of hardware models.

You'd also want a 20-questions like system, where it drills down on some questions to find out if stranger data structures are usable.

Re: Choosing datastructures

Posted: Wed Nov 03, 2010 4:16 pm UTC
by Jplus
Yakk wrote:(...) and 95% confidence running time ranges for the entire domain above, possibly on a range of hardware models.

Maybe this is a bit overdone.

Re: Choosing datastructures

Posted: Wed Nov 03, 2010 4:58 pm UTC
by Yakk
:-)

Re: Choosing datastructures

Posted: Thu Nov 04, 2010 9:26 pm UTC
by MHD
Maybe have parameters of "How easy to code?" for imperative and functional languages. Also, add pointer-tangle-ness and maybe something about using it in purely functional where no data is ever changed and all apparent changes are copies.

Re: Choosing datastructures

Posted: Thu Nov 04, 2010 10:16 pm UTC
by Yakk
With a decent language & library support, you shouldn't have to code reasonably famous versions of these creatures.

The pure functional bit is interesting: for a data structure to be purely functional friendly, you need internal references to be monodirectional (some kind of table might get around this?) and maybe high locality (so when you make a change, you don't end up with tonnes of changes all over the place).

Or maybe ... a data structure that can be cheaply and efficiently delta'd is a good pure functional data structure.

I'm not sure what you mean by pointer-tangle-ness -- you mean the ease to which someone manually trying to walk the pointers can get lost?

Re: Choosing datastructures

Posted: Fri Nov 05, 2010 10:29 pm UTC
by headprogrammingczar
Multi-directional references can be simulated with a zipper, with varying degrees of ugliness. A doubly linked list can be implemented as:

Code: Select all

data DLL a = DLL [a] a [a]

forward (DLL fs x (l:ls)) = DLL (x:fs) l ls

backward (DLL (f:fs) x ls) = DLL fs f (x:ls)

fromList (a:as) = DLL [] a as