Coding: Fleeting Thoughts

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Tue Nov 20, 2018 5:53 am UTC

There are a bunch of nice css transforms, powerful enough to create a full animated 3d layout, if you like. When clicking somewhere, the browser will reverse those transforms, accurately determining which element was under the mouse pointer at the time of the event.
At the same time, the browser will also absolutely refuse to give you any helpful coordinates telling you *where* the element has been clicked, so you either have to reimplement the transformation matrix formulas and do it yourself, or you have to overlay your widgets in a grid of 1x1 divs and hope that your clients don't notice any slowdown. *sigh*

User avatar
Xenomortis
Not actually a special flower.
Posts: 1443
Joined: Thu Oct 11, 2012 8:47 am UTC

Re: Coding: Fleeting Thoughts

Postby Xenomortis » Tue Nov 20, 2018 11:46 pm UTC

My knowledge of the details of dynamic linking is limited; I don't really know what happens when "dlopen" is called.
But the following situation occurs with some program at work, and my intuition tells me that "bad things" could happen.

There exists an application (call it "app"). It links to "foo_v1.so"

Code: Select all

$ ldd app
...
foo_v1.so => [path/to/foo_v1.so] [hex address]
...

foo_v1.so may, at some point during the execution of app, open another shared object "bar.so".
bar.so however, links to foo_v2.so - a different version of foo.

Code: Select all

$ ldd bar.so
...
foo_v2.so => [path/to/foo_v2.so] [hex address]
...


Library foo holds a fair amount of internal state, some of which I believe is allocated statically. Calls to functions in foo will often modify this internal state. foo is not threadsafe (by default anyway).

So, what's happening here?
Are there two instances of "foo" loaded? What happens with the symbol collisions? What gets used when bar wants a function from foo - which version does it get?
Why doesn't "app" occasionally explode? Why do there appear to be no consequences?
Are we just lucky that bar doesn't actually call much from foo?

Edit:
Functions are resolved as offsets from the load point of the library? So bar only "sees" foo_v2?

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Wed Nov 21, 2018 6:34 pm UTC

Xenomortis wrote:my intuition tells me that "bad things" could happen.

That is a reasonable default assumption. If you wish to dive into the inner workings of symbol resolution, read this: https://akkadia.org/drepper/dsohowto.pdf

>> Are there two instances of "foo" loaded?

Yes. Unless they're the same file, they're different libraries. The fact that both names contain "foo" doesn't mean anything.

>> What happens with the symbol collisions?

If v1 and v2 have a different API/ABI, then those symbols *should* be versioned, and you should only link against versioned symbols, hence there would be no symbol collision.

Otherwise, symbol resolution simply returns the first match, but with a well defined search order.

Generally, the linker just returns void pointers. The linker does not guarantee that the symbol you're getting complies with the ABI that you're expecting. It's the system maintainer's responsibility to ensure that the exported symbols in the search path match the expected symbols of the installed executables.

>> What gets used when bar wants a function from foo - which version does it get?

Page 10 of the pdf linked above deals with that specific case. The global scope is searched before the local scope introduced during dlopen() - if your symbols are unversioned, it gets the version that your app was linked against, not the one that bar links against.
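
To make the setup concrete, here's a minimal sketch of the dlopen side (hypothetical: the app does the dlopen itself and bar.so exports a function named bar_f; in your scenario foo_v1.so does the loading, but the scope rules are the same):

Code: Select all

/* Minimal sketch, assuming glibc defaults. bar.so is loaded with RTLD_LOCAL,
 * so its dependency foo_v2.so only joins bar's local scope; unversioned
 * references inside bar.so still resolve against the global scope
 * (app + foo_v1.so) first. Build with -ldl. */
#include <dlfcn.h>
#include <cstdio>

int main() {
    void* bar = dlopen("./bar.so", RTLD_NOW | RTLD_LOCAL);
    if (!bar) {
        std::fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    // bar_f is a hypothetical function exported by bar.so
    auto bar_f = reinterpret_cast<int (*)()>(dlsym(bar, "bar_f"));
    if (bar_f)
        std::printf("bar_f() = %d\n", bar_f());
    dlclose(bar);
    return 0;
}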

>> Why doesn't "app" occasionally explode? Why do there appear to be no consequences?

Either because those symbols are versioned, or because you've been lucky, or because you've been unlucky and those bugs will only turn up in production while you're on vacation.

>> Library foo holds a fair amount of internal state, some of which I believe is allocated statically.

If both your app and bar.so link against foo.so, which has a global state, then you can get data races on that global state without even introducing a thread. In addition to being thread-safe, you also need foo to be reentrant.

bar.so linking against a different version (thus getting its own internal state) may work in your favour here, but that's obviously not a good thing to rely on.

User avatar
Yakk
Poster with most posts but no title.
Posts: 11109
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Coding: Fleeting Thoughts

Postby Yakk » Wed Nov 21, 2018 7:28 pm UTC

So, in C++ land you can safely version your libraries like this:

Code: Select all

namespace my_library_ns {
  inline namespace version2_0_1 {
    // API goes here
  }
}

now clients can just use `my_library_ns::foo()` and not even know about the `inline` namespace. But they link against `my_library_ns::version2_0_1::foo` magically.
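
A hypothetical client translation unit, to make the linking point concrete (header name made up):

Code: Select all

// Hypothetical client code for a header shaped like the snippet above.
// The call is written against the unqualified name, but the object file
// references the mangled my_library_ns::version2_0_1::foo, so a version
// mismatch surfaces as a link error instead of silently calling the wrong foo.
#include "my_library.h"  // hypothetical header containing the namespaces above

int main() {
    my_library_ns::foo();  // resolves through the inline namespace
    return 0;
}
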
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

User avatar
Xenomortis
Not actually a special flower.
Posts: 1443
Joined: Thu Oct 11, 2012 8:47 am UTC

Re: Coding: Fleeting Thoughts

Postby Xenomortis » Wed Nov 21, 2018 7:55 pm UTC

Yakk wrote:So, in C++...

"foo" and "bar" are C (and we can compile both). However "app" appears to be C++ (we don't have the secret sauce) and was distributed with its own version of "foo".


Tub wrote:>> What happens with the symbol collisions?

If v1 and v2 have a different API/ABI, then those symbols *should* be versioned, and you should only link against versioned symbols, hence there would be no symbol collision.

Otherwise, symbol resolution simply returns the first match, but with a well defined search order.

Generally, the linker just returns void pointers. The linker does not guarantee that the symbol you're getting complies with the ABI that you're expecting. It's the system maintainer's responsibility to ensure that the exported symbols in the search path match the expected symbols of the installed executables.

The ABI's consistent - I guess the question is, what objects get modified by the different libraries. What happens with their internal data?

Suppose then (no longer completely representative of the case at hand):
foo_v1.c:

Code: Select all

static int x = 0;
void foo_f1() {
    x++;
}

and foo_v2.c:

Code: Select all

static int x = 0;
int foo_f1() {
    return ++x;
}
int foo_f2() {
    x *= -1;
    return x;
}


bar.c

Code: Select all

int bar_f() {
  foo_f1();
  return foo_f2();
}


app.c

Code: Select all

int main() {
  foo_f1();
  /*
  ..dlopen(bar.so) etc..
  */
  int y = bar_f();
}


What's the value of y in app.c? I suppose "-2", but that assumes foo_f2 reads the same x that foo_f1 incremented (i.e. the x that would have come from the global scope).
And what would it be if foo_v2.c looked like:

Code: Select all

static int x = 0;
static int y = 2;
int foo_f1() {
    return ++x;
}
int foo_f2() {
    x *= -1;
    return x + y;
}


Tub wrote:If you wish to dive into the inner workings of symbol resolution, read this: https://akkadia.org/drepper/dsohowto.pdf

I'll try to, but maybe not after beer. ;)

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Wed Nov 21, 2018 8:53 pm UTC

Xenomortis wrote:The ABI's consistent

If that is the case, then you'd expect v1 to be a symlink to v2, and you wouldn't have this trouble. If you wish to override a vendor-supplied binary, you can use LD_PRELOAD to make app use your system's v2.

Xenomortis wrote:"foo" and "bar" are C (and we can compile both). However "app" appears to be C++ (we don't have the secret sauce) and was distributed with its own version of "foo".

If you have the source of foo, then you should know if it has global internal state, if it's thread-safe/reentrant, and how it solves its own versioning.

You can force bar to use a separate copy of foo by statically linking, if that avoids trouble.

>> Suppose then (no longer completely representative of the case at hand):

assuming that code compiles as intended:

* if the symbols are versioned, bar gets both functions from f2, and y will be -1
* if the symbols are unversioned, bar would be linked with foo_f1 from v1, and foo_f2 from v2, and y is 0.

There is no case where y = -2. x cannot be shared between both versions unless it's exported.

>> And what would it be if foo_v2.c looked like:

results are two larger than above.

User avatar
Xeio
Friends, Faidites, Countrymen
Posts: 5100
Joined: Wed Jul 25, 2007 11:12 am UTC
Location: C:\Users\Xeio\
Contact:

Re: Coding: Fleeting Thoughts

Postby Xeio » Thu Nov 29, 2018 6:45 pm UTC

Sometimes when I see reflection use I think "Has science gone too far?".

Like, I mean, it's cool to get every type from the current assembly, check whether each has a particular static method, and then invoke it... I guess...

But there's only one of those aforementioned types in the whole solution, so they could have just had a line to call it directly. I guess in the future, if/when we add more, you're saved one line of code at each of the places these are called from... but... Ehhhh.

EDIT: Apparently this is for an implementation of IRegisterMetadata, which already works like the above, so there's a reflection loop looking for implementations of IRegisterMetadata which is itself looking for implementations of this particular static method. I'm not sure why the XAML designers wouldn't just implement IRegisterMetadata directly...

I'll just choose to believe that there is some unknown performance reason to throw away type safety and add the fragility that, if someone ever creates an unrelated static method with the same name in the assembly, the whole thing blows up.

User avatar
Link
Posts: 1395
Joined: Sat Mar 07, 2009 11:33 am UTC
Location: ᘝᓄᘈᖉᐣ
Contact:

Re: Coding: Fleeting Thoughts

Postby Link » Fri Nov 30, 2018 11:34 am UTC

Gahh, kludges, kludges everywhere. Eigen's matrices can use custom complex types, but ComplexEigenSolver requires std::complex (possibly of a custom real type), so now I have to use kludges to convert my mpc-based complex numbers (via boost::multiprecision::mpc_complex) to std::complex of an mpfr-derived real type to compute the eigenvalues and -vectors, and then convert those back to mpc. Looks like there's a bug report for this, but in the meantime I have to use this stupid kludge. Bollocks.

EDIT: completely unrelated, but in modern C++, what is the consensus on defining inline functions in header files: good practice or bad? With template bodies already needing to be put in header files anyway, and inline being only a hint to the compiler, it's tempting to just slap an inline keyword on every function that needs to be used in multiple source files and put the whole definition in the header (this is a well-contained project that doesn't need to expose an API or provide a library to other projects). Since modern compilers seem to be smart enough to figure out function call optimisation on their own most of the time, I don't see any significant downsides to doing so, but I still have the old "declaration in the header, definition in the source" mantra firmly rooted in the back of my mind, so it's making me a bit uncomfortable to break it so heavily.

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Sat Dec 01, 2018 12:29 pm UTC

Link wrote:EDIT: completely unrelated, but in modern C++, what is the consensus on defining inline functions in header files: good practice or bad?

.cpp file: "Eh, I guess a really smart compiler could still inline it"
.h file, as inline: "I trust the compiler to make a good choice. The linker will remove the redundant non-inline instantiations, right?"
.h file, as static inline: "That should save the linker some trouble."
.h file, #define: "I want to make it as difficult as possible for the compiler to disobey my wishes."

It depends on your toolchain's capabilities for link-time optimization, your compiler's ability to make good inlining decisions (see: pgo), and your personal priorities regarding compilation speed vs. running speed vs. binary size vs. code style. If in doubt, compare both using your favourite toolchain.
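
For reference, the ".h file, as inline" option in its minimal form (file and function names made up); these days inline is mostly about letting the same definition appear in every translation unit, with the actual inlining decision left to the compiler or LTO:

Code: Select all

// my_math.h -- minimal sketch of "definition in the header, marked inline".
// `inline` relaxes the one-definition rule so every .cpp including this
// header can carry the definition; whether calls are inlined is still the
// optimizer's decision.
#pragma once

inline double squared_distance(double x1, double y1, double x2, double y2) {
    const double dx = x2 - x1;
    const double dy = y2 - y1;
    return dx * dx + dy * dy;
}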

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Tue Dec 04, 2018 3:12 am UTC

Not really coding, but I don't see a "helpdesk: fleeting thoughts":
I'm copying (cp) some large files from a remote computer (over sshfs) to an encrypted partition (luks/dmcrypt stuff) and it keeps switching between download & 'write', and encrypt & write.
During the former, ssh, sshfs and cp are quite busy: top reports cpu use up to 50%, nethogs says 100MB/s traffic, and iotop says both total disk read and write are 100MB/s (and actual reads and writes at 0).
During the latter, some kcryptds and dmcrypt_write get busy: 10% cpu use per kcryptd, no network traffic, and iotop reports some incredible peaks in actual writes (up to 2GB/s) while staying at 0 for most of the time and for the other statistics.

Why is the computer being so silly? And who is to blame/can it be remedied or worked around?
I've tried mounting the encrypted partition with sync, but, while it seemingly removed this switching behaviour, the throughput was an abysmal 5MB/s.


(Note: both disks are old-fashioned spinny things with read and write speeds of about 100MB/s.)
(Note: I disabled the write cache on the disk for reasons; dunno if that affects the situation.)

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Tue Dec 04, 2018 12:47 pm UTC

This could be perfectly harmless, if your total average throughput matches what the disk supports. In a perfect world, network speed would adjust to disk speed, but several layers of buffers and queues may prevent that.
IIRC there are a few procfs/sysctl knobs (vm.dirty_ratio and friends) you can try to influence dirty page handling. Disabling the write cache may have influenced the behaviour, too.
But the first thing I'd try is to get rid of buffers in sshfs/fuse. How does scp and/or rsync work for you?

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Thu Dec 27, 2018 8:05 pm UTC

Argh, why does C++ not zero out memory in edge cases? Just because my class/struct doesn't have an explicit constructor and is a local variable (of the main function for crying out loud) shouldn't be a reason to leave junk in there. Perhaps the most ridiculous part of this behaviour is that an initializer "MyStruct s = {}" does zero all members of the object. :x
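
For the record, a minimal sketch of the two behaviours (MyStruct here is made up):

Code: Select all

// With no user-provided constructor, a plain local declaration leaves the
// members default-initialized (i.e. indeterminate), while "= {}"
// value-initializes the object and zeroes every member.
#include <iostream>

struct MyStruct {
    int a;
    double b;
};

int main() {
    MyStruct junk;          // members hold junk; reading them is UB
    MyStruct zeroed = {};   // all members zeroed
    std::cout << zeroed.a << ' ' << zeroed.b << '\n';  // prints "0 0"
    (void)junk;
    return 0;
}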

Tub wrote:But the first thing I'd try is to get rid of buffers in sshfs/fuse. How does scp and/or rsync work for you?

Unfortunately it was a one time thing, so I just left it running overnight. But, if I ever do a big transfer again, I'll go for rsync and check if that gives a solid >50MB/s.

User avatar
Soupspoon
You have done something you shouldn't. Or are about to.
Posts: 4052
Joined: Thu Jan 28, 2016 7:00 pm UTC
Location: 53-1

Re: Coding: Fleeting Thoughts

Postby Soupspoon » Thu Dec 27, 2018 9:27 pm UTC

A true fleeting thought, I had this morning. How unloved the NOP opcode is, these days!

Not that anybody really delves as deep as that in code these days (commented-out lines don't count++). Last time I remember using it was in Redcode, or else something self-modifying (GA) in a similar pseudo low-level instruction set I was messing with.

commodorejohn
Posts: 1160
Joined: Thu Dec 10, 2009 6:21 pm UTC
Location: Placerville, CA
Contact:

Re: Coding: Fleeting Thoughts

Postby commodorejohn » Thu Dec 27, 2018 10:02 pm UTC

Flumble wrote:Argh, why does C++ not zero out memory in edge cases? Just because my class/struct doesn't have an explicit constructor and is a local variable (of the main function for crying out loud) shouldn't be a reason to leave junk in there. Perhaps the most ridiculous part of this behaviour is that an initializer "MyStruct s = {}" does zero all members of the object. :x

It's C-land, Flumble. That's far too much implicit, non-programmer-specified behavior. You should think of these things yourself!

Soupspoon wrote:A true fleeting thought, I had this morning. How unloved the NOP opcode is, these days!

Not that anybody really delves as deep as that in code these days (commented-out lines don't count++). Last time I remember using it was in Redcode, or else something self-modifying (GA) in a similar pseudo low-level instruction set I was messing with.

Pretty much. Even for those of us who like to dabble in assembly language, a NOP instruction has pretty much two uses: leaving room to insert code later (which is really only useful in a situation where you're going to be editing/debugging code in-memory rather than re-assembling from source) or as a means to implement a delay (which is much less useful in newer architectures where you can have a hardware timer for many such applications.)
"'Legacy code' often differs from its suggested alternative by actually working and scaling."
- Bjarne Stroustrup
www.commodorejohn.com - in case you were wondering, which you probably weren't.

User avatar
hotaru
Posts: 1044
Joined: Fri Apr 13, 2007 6:54 pm UTC

Re: Coding: Fleeting Thoughts

Postby hotaru » Thu Dec 27, 2018 11:33 pm UTC

commodorejohn wrote:Pretty much. Even for those of us who like to dabble in assembly language, a NOP instruction has pretty much two uses: leaving room to insert code later (which is really only useful in a situation where you're going to be editing/debugging code in-memory rather than re-assembling from source) or as a means to implement a delay (which is much less useful in newer architectures where you can have a hardware timer for many such applications.)

there's also a third use, which is much more common: https://en.wikipedia.org/wiki/NOP_slide

Code: Select all

factorial = product . enumFromTo 1
isPrime n =
  factorial (n - 1) `mod` n == n - 1

User avatar
Yakk
Poster with most posts but no title.
Posts: 11109
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Coding: Fleeting Thoughts

Postby Yakk » Sun Dec 30, 2018 1:29 am UTC

The best part is that zeroing all members of all variables implicitly would not be a breaking change.

Because the value of the uninitialized data is unspecified, and 0 is an option.

It would have performance impact on some code bases. You'd have to add a way to create uninitialized data for those that care.
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

commodorejohn
Posts: 1160
Joined: Thu Dec 10, 2009 6:21 pm UTC
Location: Placerville, CA
Contact:

Re: Coding: Fleeting Thoughts

Postby commodorejohn » Sun Dec 30, 2018 3:04 am UTC

Seems like the kind of thing that the #pragma directive exists for.
"'Legacy code' often differs from its suggested alternative by actually working and scaling."
- Bjarne Stroustrup
www.commodorejohn.com - in case you were wondering, which you probably weren't.

User avatar
chridd
Has a vermicelli title
Posts: 836
Joined: Tue Aug 19, 2008 10:07 am UTC
Location: ...Earth, I guess?
Contact:

Re: Coding: Fleeting Thoughts

Postby chridd » Sun Dec 30, 2018 10:28 pm UTC

Yakk wrote:It would have performance impact on some code bases. You'd have to add a way to create uninitialized data for those that care.
I'd imagine a lot of cases could be handled by an optimizer. If the optimizer can prove that the initial value isn't used, it can eliminate the initializer; if not, likely the code is (intentionally or not) depending on the initial value.
~ chri d. d. /tʃɹɪ.di.di/ (Phonotactics, schmphonotactics) · she(?)(?(?)(?))(?(?(?))(?))(?) · Forum game scores
mittfh wrote:I wish this post was very quotable...
chridd (on Discord) wrote:
Dummy wrote:Sorry You're Gay Dads
SYG'D
marionic (on Discord) wrote:sleep in grave

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Mon Dec 31, 2018 12:27 pm UTC

Yakk wrote:It would have performance impact on some code bases. You'd have to add a way to create uninitialized data for those that care.

But there already is a compatible and standardized way to specify whether you want initialized or uninitialized memory for your struct. To create uninitialized data you write
> struct Foo bar;
and the way to create initialized data is to write
> struct Foo bar = {};

There's no need to redefine the standard, to add compiler switches, to add pragmas.. all you have to do is write the proper code.
chridd wrote:I'd imagine a lot of cases could be handled by an optimizer. If the optimizer can prove that the initial value isn't used, it can eliminate the initializer; if not, likely the code is (intentionally or not) depending on the initial value.

Optimizing compilers can already do that.

Even better, when they can prove that you may be using an uninitialized value, they can warn about it and let you fix your code. Never compile without -Wall -Werror.

User avatar
PM 2Ring
Posts: 3684
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Sydney, Australia

Re: Coding: Fleeting Thoughts

Postby PM 2Ring » Mon Dec 31, 2018 1:34 pm UTC

Reading uninitialised memory is undefined behaviour, so any program that does that is already skating on very thin ice.

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Mon Dec 31, 2018 1:43 pm UTC

Tub wrote:and the way to create initialized data is to write
> struct Foo bar = {};

To me that reads "initialize no members to a particular value" i.e. don't initialize anything. Why did the C people decide that all unlisted members in an initializer list should be zeroed? That seems horribly inefficient! Maybe I want to leave .big_static_buffer uninitialized! (But I do want to use the initializer list for the other members because the notation is nice.) (Sure, an optimizer can easily remove the zeroing if it sees that a member is reassigned before use, but think of the time before optimizers!)

Tub wrote:Even better, when they can prove that you may be using an uninitialized value, they can warn about it and let you fix your code. Never compile without -Wall -Werror.

Oh right, I keep forgetting to add those flags. I'm spoiled by "no type errors ⇒ program works" in some other languages.


On NOPs: they're also nice to put after a branching instruction to make the life of a processor's pipeline easier*... or the life of a compiler, when the processor is known to execute the instruction after a branch.
And it's really useful for overwriting instructions that you don't want executed for whatever reason. :roll:

*Actually this might be just in my imagination. Surely nowadays a processor is much faster with its branch prediction and speculative execution than a NOP could ever help with preventing a pipeline flush.

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Tue Jan 01, 2019 2:03 am UTC

Flumble wrote:To me that reads "initialize no members to a particular value" i.e. don't initialize anything. Why did the C people decide that all unlisted members in an initializer list should be zeroed?

I wasn't part of that decision, but I can take a guess.

An initializer list is not a shorthand for a bunch of assignments. It's an initialization. Which means that the variable is henceforth in a defined state.
Boiling it down to two states ("uninitialized" and "initialized", but not "partially initialized") makes it easier to reason about the code. Can a partially initialized struct be safely passed to another function? The compiler usually doesn't know what the other function does, so it cannot tell.

This makes more sense if you take a look at the POSIX APIs. Some of them will accept a pointer to a struct as some parameter, but will only use certain fields of that struct while ignoring others. It's good practice (or even an API requirement) to null all unused fields.
Because it's tedious to write memset(&foo, 0, sizeof(foo)), and even more tedious and error prone to have a giant initializer list assigning null everywhere, it's useful to have a shorthand that does just that: initialize what you need, null everything else.
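
For example, a minimal sketch (C++ here, assuming a POSIX system): struct sigaction has more fields than most callers ever set.

Code: Select all

// "= {}" zeroes every field of the struct up front; we then fill in only
// the fields this call actually uses.
#include <signal.h>

void install_handler(void (*handler)(int)) {
    struct sigaction sa = {};   // all fields zeroed, including sa_flags
    sa.sa_handler = handler;    // the one field we care about
    sigemptyset(&sa.sa_mask);   // explicitly empty signal mask
    sigaction(SIGINT, &sa, nullptr);
}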

Flumble wrote:Maybe I want to leave .big_static_buffer uninitialized! (But I do want to use the initializer list for the other members because the notation is nice.)

It makes sense to reserve the shorthand notations for the common case, while requiring more verbose code for the uncommon case.

If you need partial initialization, you'll have to accept the verbose variant:
> struct Foo bar;
> bar.a = 42;
> bar.b = 43;
> /* do not set .big_static_buffer */

Flumble wrote:I'm spoiled by "no type errors ⇒ program works" in some other languages.

I don't know... there seems to be an additional step between "program is semantically valid and defined" and "program does what I want it to do", regardless of language.

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Fri Jan 04, 2019 4:00 am UTC

How hard would it be to find the longest path over all English* words of 1-5 letters, connected by adding, changing or deleting a letter? There are quite a lot of them and their degrees vary wildly. (I'm quite certain there's no Hamiltonian path, though, because quick and quack can't get anywhere without an extra letter.)


Tub wrote:
Flumble wrote:I'm spoiled by "no type errors ⇒ program works" in some other languages.

I don't know... there seems to be an additional step between "program is semantically valid and defined" and "program does what I want it to do", regardless of language.

That just means the semantics aren't specific enough. :P
But, yes, there's still a possible gap between "seems to run fine" and "runs correctly". However, I think that gap gets smaller with more abstractions and their semantics. In this case the semantics are even there, they're just out of sight because I forgot to -Wall.


*I guess that means "has a Merriam-Webster entry" or "has an «English» section on wiktionary".

User avatar
Soupspoon
You have done something you shouldn't. Or are about to.
Posts: 4052
Joined: Thu Jan 28, 2016 7:00 pm UTC
Location: 53-1

Re: Coding: Fleeting Thoughts

Postby Soupspoon » Fri Jan 04, 2019 7:25 am UTC

Flumble wrote:How hard would it be to find the longest path over all English* words of 1-5 letters, connected by adding, changing or deleting a letter? There are quite a lot of them and their degrees vary wildly.
I was considering a similar problem of my own, just the other day.

1) Get a dictionary list file. Easily available, but note that ones intended for password cracking (brute-force, rainbowing, etc) will have things like "abc123" and "qwertyuiop". You can use a list with all these, I suppose, because any that survive any of the further stages can be spotted (no longer in the midst of half a million other words, some of which may look as invalid only because you don't know them) but maybe you want to look for a "scrabble" list, if available, because that's got only valid words (including "aa", "qi", "re" and other such game-winners) to start with. Anyway, you can still normalise to lower-case (if necessary) or remove those whose capitalisation indicates being proper nouns (if you don't want them).

2) For each word in the dictionary list, blindly apply the length*25 letter changes, (length+1)*26 letter additions and (length) letter removals to see if any/many match another word in the list. You can discard any that have no matches.

3) Any that do, extract this and all first-order matches from the initial dictionary list and store in a mini list of its own (within a list of such lists, or keyword (perl-hash or python-dictionary or whatever your system calls it) index containing linked lists - if the latter, I suggest using the alphabetically-first word as key, but other systems can work as well). Also bring those dictionary-ripped words to the front of the testing queue (depends on how you're shuffling your data).

4) Repeat, but once you've started putting word-groups into sublists, you search in all sublists (except the one that the current word is in) for words not already tested (suggestion: postfix an asterisk or other meta-character on the ones that caused a grab of others, then skip over them again when seeking matches, though allow them to be sought; though you can asterisk the ones only grabbed, instead, if you want, with reverse intent). If a sublist item finds a word in another sublist, merge the sublists, as well as dragging in any more words still in the main input stack.

5) Once you've tested every word (nothing left in the input stack, no sublist has any word as yet untried) you now have bunches of related words. "Quick" and "quack" will be (the only two?) words in one. You can then work with each smaller-than-whole-dictionary bunch to make sure you have the full net of relationships. (If you didn't, during the above processing, already store all direct neighbourships (backwards and forwards) alongside the word entry. But that depends if you're happy to deal with lists of lists of word+list structures.)

6) You can analyse them for various metrics. Number of words in each bag (related, though most probably not neighbouringly so), looking for longest non-intersecting chains (related to the salesman problem, if you wish to weight by the particular inter-word transform used), looking to prove or easily throw out the possibility of an Euler path (Königsberg bridge problem, every relationship travelled just once, words may be revisited multiple times), and even find such a Hamiltonian for this subset only.

If you decide to remove a nonsense word from a bunch (after discovering it) you'd have to reprocess the bunch in case it splits into two or more now-unlinked lists. If you find more words then just push them into the (emptied) input dictionary to allow them to be checked for mergers. It's fairly resilient. The initial sorting could be made more intelligent to save some of the brute forcing and save processor cycles, but cycles are relatively cheap (as is storage space, these days, so you don't even have to worry about that overhead) and it's how you derive the bunch (or subset of a bunch) that is down to what you then want to derive, and how.

Depending on how you store your listed data in memory structures, I'd suggest exporting the bunches to files. Apart from redoing with word additions/removals, you are now at leisure to use a subsequent script to import each of those bunches to do the bunch-analysis. Practice on the small ones (which you can visually and manually check, to see that your netting and net-reading gives what you want) then once you're happy target the largest one. It'll either break things (too much time or too much memory, because you still aren't being smart enough - then you know you need to improve your methods) or it quickly churns out the most likely maximal solution to your problem, which can be used to truncate "how low you can go" down the list back to the bottom, before ruling out any more solutions that could beat it.

(Adapted and adopted from a thought-project of similar scope considered on a Boxing Day walk. Not yet put into practice, so obviously there are optimisations and time-saving methods I haven't discovered as possible yet, and I only have a basic idea about the Order of the time/space complexity I'd encounter. It should all work without issue other than coding errors, though.)

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Fri Jan 04, 2019 12:25 pm UTC

The longest path problem is NP-hard. No good approximations are known. I wouldn't advise trying to solve this exactly on >100k nodes. Memory is not going to be a problem, but there's just too many possible paths to check them all.

The good news is that we can place an upper bound k on the number of neighbours each node can have, regardless of word count. This turns many algorithms from O(n^2) into O(n * k) = O(n).

1) Create a graph (with adjacency lists) in O(n * k)
2) Split the graph into connected components, in O(n) per component

Now we have the same grouping that Soupspoon achieved after step 5), but it's already in graph form, and it's been done efficiently. If words are added or removed, re-running these steps should be quick.

3) Randomly find a few paths in your largest component(s), in O(n) per path. This gets us a lower bound for the longest path, and we can safely discard all components smaller than that. If we're lucky, only one component remains.

At this point, we either try to brute force this thing (but we won't live long enough to see it finished), or we approximate. One approach would be to generate a whole lot of "good" solutions and hope that the best of them is good enough.
1) generate a random path in O(n * k)
2) for each node not in the path, see if it forms a triangle with an edge in the path. If so, merge it into the path.
(As each merge may open up new possibilities for merges, the trivial approach requires checking O(n^2) nodes. I believe it can be done checking only O(n * k) nodes, but I'm too lazy for a formal proof).

Repeat until you're happy with the longest path, or until patience runs out.
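
As an illustration of step 1), a sketch of the neighbour-list construction (assuming lowercase ASCII words already loaded into a vector; all names made up):

Code: Select all

// Build the edit-distance-1 graph by generating every deletion, substitution
// and insertion of each word and looking the candidates up in a hash set,
// rather than comparing all O(n^2) word pairs.
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Graph = std::unordered_map<std::string, std::vector<std::string>>;

Graph build_graph(const std::vector<std::string>& words) {
    const std::unordered_set<std::string> dict(words.begin(), words.end());
    Graph adj;
    for (const std::string& w : words) {
        std::unordered_set<std::string> nbrs;  // set: dedupes candidates
        auto try_add = [&](const std::string& cand) {
            if (cand != w && dict.count(cand)) nbrs.insert(cand);
        };
        for (std::size_t i = 0; i < w.size(); ++i) {
            try_add(w.substr(0, i) + w.substr(i + 1));       // deletion
            for (char c = 'a'; c <= 'z'; ++c) {
                std::string sub = w;
                sub[i] = c;
                try_add(sub);                                // substitution
            }
        }
        for (std::size_t i = 0; i <= w.size(); ++i)          // insertion
            for (char c = 'a'; c <= 'z'; ++c)
                try_add(w.substr(0, i) + c + w.substr(i));
        adj[w].assign(nbrs.begin(), nbrs.end());
    }
    return adj;
}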

User avatar
Soupspoon
You have done something you shouldn't. Or are about to.
Posts: 4052
Joined: Thu Jan 28, 2016 7:00 pm UTC
Location: 53-1

Re: Coding: Fleeting Thoughts

Postby Soupspoon » Fri Jan 04, 2019 8:13 pm UTC

Additional possible optimisation:

I'd sort of havered over the "build the net from the start" bit, because lists-of-lists-of-nodes-of-a-net is getting highly structural. But when you have the net, it's likely (though not definite) that a longest-chain will have terminator nodes with just a single neighbour. Try using these to (unidirectionally) grow your first paths from. We know that there's no longer path by extending this node in the direction opposite the forward neighbour, because it has no other neighbour.

However:
1) If there's a perfect Hamiltonian (or several), then there are no single-neighboured. All max-length journeys start and end at adjacent neighbours but instead go all the way round the possible chain(s) instead of crossing this ring-split junction. And finding the ring(s) would be a different search (if you knew it existed) or a potential discovery on a "I've got no loose ends!" fall-back search.

2) Longer paths may still exist. The opposite end to a solo-neighbour might itself terminate in one of the other solo-neighbours, or it might swerve off into a different path where it racks up many links as it bounces around until it finds itself in a non-solo-neighbour (nominally) whose neighbours are unviable due to already being on the track towards it. Similarly, it is possible that by unlinking the starting-solo and re-retreating to an alternate neighbour of the start-neighbour (to increasing degrees) a similar loop of self-hemming-in can be discovered (assuming we already found/refuted every longer chain starting from alternate solo-starts passing onto the chain N links inwards from the original solo-start) leading to a double-ended hemming in.

2a) Or even (when exhaustively tested, to find the final form of longest link) end up with something like an Alexander horned sphere, to the limit of node resolution, due to the specifics of the net topology.

But it's a start. And I'd be interested to find out how many nodes there are in the largest contiguous net. Probably <<100k, but I couldn't tell you if it was <50k, <10k or even <1000. Without trying. I'll find me a scrabble dictionary file and do the initial trivial subdivision, perhaps, shortly. If that takes too long (without saving connectivities, other than group membership) then the next stage is probably going to be a write-off. ;)

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Mon Jan 07, 2019 6:03 pm UTC

Collins Scrabble Words 2015 (276,643 words)
(not linking it here, apparently word lists can be copyrighted..)

The largest component has 85,526 words. Everything else is in components of 100 words or fewer. 44692 words have no neighbour at all, making it very tempting to sneak one of them into the word game thread.

Each word is linked to, on average, 1.58 other words. Within the largest component, each word is linked to, on average, 3.68 other words.

Within the largest component, 4974 words have only 1 neighbour, so our longest path cannot exceed 80554 words.
12281 words have exactly two neighbours, 12526 have three, and the counts continue for four, five, six, ... neighbours: 10481, 7712, 5831, 4658, 3630, 2982, 2563, 2148, 1810, 1607, 1429, 1287, 1114, 946, 853, 765, 633, 590, 511, 468, 441, 339, 349, 274, 257, 222, 208, 176, 161, 138, 136, 118, 112, 88, 81, 86, 71, 72, 64, 63, 56, 35, 38, 21, 27, 10, 17, 15, 6, 9, 7, 7, 1, 5, 2, 1, 0, 1, 1, 1, 1.


Performance-wise, building the graph with 276k words takes ~3 minutes, with 85k words 25 seconds (single-threaded), but that's the trivial O(n^2) algorithm which I haven't bothered to optimize.

Creating ~100k paths takes less than a minute. The deterministic search always favours jumping to nodes with lots of neighbours.

Code: Select all

Found path of length 9204 using deterministic search (85526 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9203 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).
Found path of length 9204 using random search (100000 attempts).

/edit: ~20 million paths later, still nothing better than 9204

So 9204 seems to be a good answer. I haven't implemented code for merging additional nodes into the middle of the path, but a random search should eventually fix that by itself.

I have not written extensive test cases, so everything I wrote may be wrong. Playtime is over, so I'm dropping the pen for now.

User avatar
Soupspoon
You have done something you shouldn't. Or are about to.
Posts: 4052
Joined: Thu Jan 28, 2016 7:00 pm UTC
Location: 53-1

Re: Coding: Fleeting Thoughts

Postby Soupspoon » Mon Jan 07, 2019 9:41 pm UTC

(Snap! That's exactly the list *I* was using, and your figures don't look different to what I quickly got before the weekend took me away from it. But you got to the pathfinding bit before me. Fair enough. Maybe I'll optimise and nicify the script for others' consumption. It's an unreadable munge-mess right now. ;))

User avatar
Link
Posts: 1395
Joined: Sat Mar 07, 2009 11:33 am UTC
Location: ᘝᓄᘈᖉᐣ
Contact:

Re: Coding: Fleeting Thoughts

Postby Link » Mon Jan 28, 2019 4:46 pm UTC

Modern C++ has made me lazy: I just spent an hour debugging only to find out my weird problems were caused by auto.

User avatar
Yakk
Poster with most posts but no title.
Posts: 11109
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Coding: Fleeting Thoughts

Postby Yakk » Wed Jan 30, 2019 4:19 pm UTC

That expression template should insist on being called with `&&` rvalue only.

Then you'd get a build break and not runtime insanity.
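
One way to read that, as a toy sketch (made-up expression type, not Eigen's actual API): ref-qualify the evaluating call so that an expression parked in a named auto variable no longer compiles.

Code: Select all

// Toy sketch: evaluation is only allowed on rvalues, so keeping the lazy
// expression around in a named variable (which is what `auto` invites) is a
// compile-time error rather than a dangling-reference surprise at runtime.
struct SumExpr {
    const int* lhs;
    const int* rhs;
    int eval() && { return *lhs + *rhs; }  // temporaries only
    int eval() &  = delete;                // reject named lvalues
};

SumExpr lazy_add(const int& a, const int& b) { return SumExpr{&a, &b}; }

int main() {
    int a = 2, b = 3;
    int ok = lazy_add(a, b).eval();  // fine: evaluated immediately
    // auto e = lazy_add(a, b);
    // int bad = e.eval();           // error: eval() & is deleted
    return ok == 5 ? 0 : 1;
}
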
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Sat Feb 02, 2019 10:42 pm UTC

Someone has gone through the trouble of transforming the contents of ss/netstat or a pcap file (or traversing the sockets by itself, though that doesn't sound very *nix-like) to a DOT graph, right? This sounds like a common enough thing to want that it should be somewhere on the web. Then where is it? :?

User avatar
Soupspoon
You have done something you shouldn't. Or are about to.
Posts: 4052
Joined: Thu Jan 28, 2016 7:00 pm UTC
Location: 53-1

Re: Coding: Fleeting Thoughts

Postby Soupspoon » Sun Feb 03, 2019 4:28 am UTC

ISTR using some software in the '90s that allowed that. It defaulted to a perimeter to perimeter (on an arbitrary ring layout) vector assigning IPs/MACs/whatever as the unique machine IDs in tracking LAN traffic (with thicker lines, or different hue, to show proportionate use) but you could add in a vector-file to plot a desk-plan (if LAN) or office-to-office diagram (if WAN-gatewayed) if you wanted something even Management would understand at a glance. Wouldn't take much to make it fully geolocate.

If I could remember what the software was… though it was probably a premium product, there's almost certainly some sort of an OpenFoo version of it, by now. If not back then, when we paid through the nose for the privilege. It's been a while since I used anything of that kind, sorry. But it wouldn't be too hard to mock up your own, to access public location data (if you avoid the obvious problems).

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Tue Feb 05, 2019 1:07 am UTC

I don't really care about the geolocation, but a quick-and-dirty tool to get an impression of which machines are connected to you from which addresses (and perhaps which/how many ports) would be nice. I guess sysadmins would want an interactive map of a whole LAN (over which they have control through various routers) at once, so there ought to be fancy software for that. But I'm on a campus network and just wanted to know real quick if my machine is connecting to something interesting without wading through the long list that netstat gives. (or doing a pcap for that matter)
I guess I'll write a script and put it on github when I feel like it.



Anyway, Xanthir, why is Element.querySelector such a pain? :x

Code: Select all

// say I have this document
<div id="level1">
  <div id="level2">
    <div id="level3"></div>
  </div>
</div>

const outer = document.querySelector("div")
// outer is #level1 alright

const second = outer.querySelector("div")
// second is #level2 alright

const inner = outer.querySelector("div div")
// ARGH inner is #level2 again

const realInner = outer.querySelector(":scope div div")
// "yay", we got a workaround in most browsers

Today was the first time I actually ran into a problem with it (where "query" and ":scope query" are different elements), so for the longest time I've thought element.querySelector considers element as the root.
I understand now that it just grabs the first element (after the current node) that matches the selector, but I don't get why. Why doesn't querySelector(All) perform the query local to the element?

User avatar
Xanthir
My HERO!!!
Posts: 5389
Joined: Tue Feb 20, 2007 12:49 am UTC
Location: The Googleplex
Contact:

Re: Coding: Fleeting Thoughts

Postby Xanthir » Tue Feb 05, 2019 9:32 pm UTC

Because there are two very reasonable models of what one should mean when evaluating a selector against an element scope:

1. The result is scope-filtered: run the selector normally, then filter out all the results that aren't in scope’s subtree.
2. The result is scope-anchored: run the selector on the subtree defined by scope, treating scope as the root element.

#1 is easy to define and implement; selectors work normally, you don't need to worry about selectors like :root maybe meaning something different, etc. However, #2 is what most people expect, most of the time; it's also what jQuery implemented. The author of the QS spec just chose wrong. (There used to be another method defined on elements in the DOM spec that had the behavior you expect, but it got dropped for naming reasons that I don't recall the details of right now.)

And, to be fair, going with #1 + :scope does let you get both behaviors, while #2 requires a more complicated solution to get both behaviors. (Iirc, I suggested doing #2, but allowing you to pass an element as a second argument to make it filter, so you could do `document.querySelector("div div", scopeEl)` to get the behavior of today's `scopeEl.querySelector("div div")`.)
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))

Tub
Posts: 446
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Coding: Fleeting Thoughts

Postby Tub » Wed Feb 06, 2019 10:44 pm UTC

I really don't know where my computer is getting its language settings from. I know that some websites will stupidly use geoip databases to determine my language instead of just looking at the HTTP header that I specifically configured to request English. But local software? I install and configure everything as US English (LANG=en_US.UTF-8), just with additional support for other locales, and with European date and money formats (LC_TIME, LC_MONETARY). Still, every now and then some software somehow decides to not be English.

Today, git took the cake.

Code: Select all

 ~> git pull
Bereits aktuell.
 ~/> git push
Everything up-to-date


Why?

User avatar
PM 2Ring
Posts: 3684
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Sydney, Australia

Re: Coding: Fleeting Thoughts

Postby PM 2Ring » Thu Feb 07, 2019 10:47 am UTC

Image

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Sun Feb 10, 2019 12:01 am UTC

Hmph, I filled a disk to the brim during a git commit. The command didn't complain, but I fear it hasn't completed the commit.
Is there a way to verify the correctness of the repository? (git status reports "nothing to commit, working tree clean")

User avatar
Yakk
Poster with most posts but no title.
Posts: 11109
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Coding: Fleeting Thoughts

Postby Yakk » Sun Feb 10, 2019 1:18 am UTC

I assume you have another repo out there with almost the same contents?
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

User avatar
Xenomortis
Not actually a special flower.
Posts: 1443
Joined: Thu Oct 11, 2012 8:47 am UTC

Re: Coding: Fleeting Thoughts

Postby Xenomortis » Sun Feb 10, 2019 1:31 am UTC

git fsck?
I suspect you're probably fine if git didn't complain during the operation.

Try cloning it elsewhere from that repo, and then diff the working trees.

User avatar
Flumble
Yes Man
Posts: 2182
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Coding: Fleeting Thoughts

Postby Flumble » Sun Feb 10, 2019 2:23 am UTC

Xenomortis wrote:git fsck?

You're kidding. Who adds a subcommand 'fsck'? (Oh right, git started out as a glorified filesystem.)
Hmm, it also says everything is okay, so I guess everything is okay. It only needed that last byte to finish the commit or something.


Yakk wrote:I assume you have another repo out there with almost the same contents?

I also have every commit as a separate directory, so no data would be lost, only a bunch of time because the copies and commits take 5 minutes each. :P

But now I realized that git doesn't store metadata (except for the execute bit), so it's time I assimilate borg.

