I was thinking about mirrors. What if we had more? What should they look like? Should we just duplicate the one mirror that already exists? I don't think that would be a good idea. Most importantly, the ketchup would be done too redundantly. But isn't redundancy molpish? It is. But different kinds of redundancy have different levels of molpishness^M. The mirrors should cooperate in some way. In this post I want to tell you some of my thoughts on mirrors. Some are general, some are specific. I hope this post won't end up in total chaos.
So how would I do such a multi-mirror system? The first thing to think about is data organisation. The stored information has to be divided into some kind of datapieces. They could be implemented as database entries, files, or something else. I would most probably implement them as text files. These datapieces would represent things like posts, signatures, user information, etc. Yes, posts. Currently, the posts are analysed separately but are remembered together, in newpages (to make the code simpler). But it can no longer stay like that (does anyone here know how phpBB decides which post is on what newpage? Or what would be a good way to do it?). The datapieces would be divided into fields containing information. The fields would differ depending on what the datapiece represents, but some would be common for all of them:
- the type - it defines if it's a post or a sig, etc.
- the ID - it defines which (post, for example) it is
- a timestamp - defining when this was last created/updated
- a list of mirrors that already have this information
The ID must be unique inside one category; the full ID is created by joining the first two fields: type-ID, for example POST-3708991. The full ID also defines the file name. For some types there has to be a field defining if the thing it represents comes from or goes to a mirror - I'll explain that later. Other fields depend on the type of the datapiece. For example, for a post it would be the post time, the author's user ID, the HTML content, the BBCode content, and the information whether there is a signature or not. What's my vision of the file format? Each field should start on a new line. But it can occupy more than one line (why? There was a post of Neil_boekend that stopped bothasar_t, because mawk failed on a too long line. But it could safely be broken into multiple lines). The beginning of a new field would be marked with a symbol, for example @. Then comes the field name, then a =, then the content. To avoid some problems (and for some other reasons) the content should only use alphanumerical characters, _ and space. Any other characters would need something similar to URL encoding. But to not confuse it with actual URL encoding, which will happen too, let's use another character, #. This also gives an opportunity to use some reserved characters for other purposes, like inserting references to attachments. An example; this:
<img src="http://forums.xkcd.com/download/file.php?id=47064" alt="Image">
becomes:
@html=#3Cimg src#3D#22[ATTACHMENT-47064]#22 alt#3D#22Image#22#3E
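The #-escaping described above could be sketched like this (a minimal sketch, assuming two hex digits per escaped byte, analogous to URL %-encoding; the function names are illustrative):

```python
import re

# Characters allowed to appear unescaped in a field value.
SAFE = re.compile(r'[A-Za-z0-9_ ]')

def escape_field(text):
    """Escape every character outside [A-Za-z0-9_ ] as #XX (hex).
    Works on UTF-8 bytes so multi-byte characters escape cleanly."""
    out = []
    for b in text.encode('utf-8'):
        c = chr(b)
        out.append(c if SAFE.match(c) else '#%02X' % b)
    return ''.join(out)

def unescape_field(text):
    """Reverse of escape_field: turn #XX sequences back into bytes."""
    raw = bytearray()
    i = 0
    while i < len(text):
        if text[i] == '#':
            raw.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(text[i]))
            i += 1
    return raw.decode('utf-8')
```

With this, `escape_field('<img>')` gives `#3Cimg#3E`, matching the example above.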
How does the information propagate? During a ketchup, the bot notices that some of the information is changed or added. It creates new datapiece files, with the current timestamp and only itself on the mirrors list, and adds them to (or replaces them in) the data directory. And BEFORE THAT it also creates special reference files in another directory. Another bot, running in regular time intervals, will pick them up and know which data to propagate. Why before? If the data is changed first and the botcastle crashes before the reference files are created, the other bot will not know that there is data to propagate. What channel to use for this? I'd prefer using the HTTP file upload mechanism. One mirror would upload a file on a special URL on the other mirror. The other mirror analyses the file and generates a reply. Because the exchanged information can contain passwords and because the publicly available URL should not allow data injection, both the file and the reply should be encrypted. The bot encrypts the file and sends it; the other receives it, decrypts it, and analyses it. All mirrors in one multi-mirror system share the same keys. What does the bot send? The whole datapiece file.
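The reference-before-data ordering could look like this (a sketch; the directory names are illustrative assumptions). A crash between the two steps leaves a reference pointing at an unchanged datapiece, which only causes a harmless re-send, rather than a changed datapiece that nobody propagates:

```python
import os, tempfile

BASE = tempfile.mkdtemp()
DATA_DIR = os.path.join(BASE, 'data')        # the datapieces themselves
QUEUE_DIR = os.path.join(BASE, 'propagate')  # reference files for the propagation bot
os.makedirs(DATA_DIR)
os.makedirs(QUEUE_DIR)

def store_datapiece(full_id, content):
    # 1. reference file first, so the propagation bot always finds it
    open(os.path.join(QUEUE_DIR, full_id), 'w').close()
    # 2. then the datapiece, via rename so readers never see a half-write
    tmp = os.path.join(DATA_DIR, full_id + '.tmp')
    with open(tmp, 'w') as f:
        f.write(content)
    os.replace(tmp, os.path.join(DATA_DIR, full_id))
```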
How does the other mirror react? If it doesn't have the information with this ID or it has an older version it adds the new one to its data directory, adds itself to the mirrors list, and creates a response. The response consists of the first 4 fields, the type, the ID, the timestamp, and the updated mirrors list. The bot then updates its own mirrors list and sends the file to the next mirror until all mirrors have it.
If the mirror already has a newer version, it doesn't accept it. Instead it generates a response that is the full version of the newer file. The bot, seeing a newer timestamp in this response, abandons the propagation and updates its own database instead.
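The receiving mirror's decision could be sketched like this (assuming datapieces are held as dicts with the common fields described above; the store and mirror names are illustrative):

```python
MY_NAME = 'mirror-B'   # illustrative name of this mirror

def handle_incoming(store, piece):
    """Decide whether to accept an incoming datapiece and build the response."""
    key = '%s-%s' % (piece['type'], piece['id'])
    local = store.get(key)
    if local is not None and local['timestamp'] > piece['timestamp']:
        # We hold a newer version: reject, reply with the full newer file.
        return local
    # Accept: store it, add ourselves to the mirrors list, and reply
    # with just the first four fields (type, ID, timestamp, mirrors).
    piece = dict(piece)
    if MY_NAME not in piece['mirrors']:
        piece['mirrors'] = piece['mirrors'] + [MY_NAME]
    store[key] = piece
    return {k: piece[k] for k in ('type', 'id', 'timestamp', 'mirrors')}
```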
If the mirror doesn't respond it's marked as temporarily mustarded (and that information goes to the propagation queue as well) and all information that should be sent to it is queued in a special directory. It's checked again later, in regular time intervals. If the mirror still doesn't respond for a time longer than a predefined limit (some weeks, or so) it's believed to have reached eternal mustard and is removed from the list of all mirrors. If it molpies up someday, it will have to re-register itself and do some mirror-ketchup.
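One possible bookkeeping for the mustard states above (a sketch; the threshold value and field names are illustrative assumptions for the "some weeks" in the text):

```python
import time

ETERNAL_MUSTARD = 21 * 24 * 3600   # three weeks, as an example limit

def after_delivery_attempt(mirrors, name, responded, now=None):
    """Update a mirror's mustard state after trying to send data to it."""
    if now is None:
        now = time.time()
    m = mirrors[name]
    if responded:
        m['mustarded_since'] = None        # responsive again
    elif m.get('mustarded_since') is None:
        m['mustarded_since'] = now         # temporarily mustarded
    elif now - m['mustarded_since'] > ETERNAL_MUSTARD:
        del mirrors[name]                  # eternal mustard: drop from the list
```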
When the bot has successfully sent the datapiece to all of the mirrors (or to some of them, while the rest are temporarily mustarded) it goes on to the next datapiece in its queue. At this point all mirrors (except the mustarded ones) have that data, but only two know the complete mirrors list; the rest have a more or less complete one. But each mirror receiving the information also adds it to its own data propagation queue. The bot also does this when it receives a reply with a newer version. Now the bots on the other mirrors also attempt to propagate the data. If one sends it to a mirror that already has it, that's redundant, but both mirrors update the list of mirrors that have it. So no, it's not going to be an all-mirrors-sending-to-all-others situation. The mirrors are selected in a random order. Some data will be exchanged redundantly, but that can't be avoided in a decentralised system. It is also important because if the original mirror crashes, others continue the propagation.
When the data is an attachment, avatar, or image, then the datapiece file does not include the actual file. The propagation is done a little differently then. The mirror receives the datapiece file, sees that it's an attachment (or else) and which one it is, downloads it from the original mirror, and only then sends the response.
There could be other cases of communication. If mirror A sends to mirror B only the first 3 or 4 fields, it means that it wants to exchange the list of mirrors that have this datapiece. Mirror B replies with the first 4 fields too - unless it has a newer version, in which case it sends the full file. If A sends only the first two fields (type, ID), it means it wants to download the datapiece (because maybe it's a new mirror or is reketchupping after mustard). B replies with the full file. If A sends only the first field (type), B replies with a list of all entries in the category (again, new mirror or reketchup).
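These request cases could be dispatched by how many of the common fields the request carries (a sketch, assuming the store is a dict keyed by full ID; all names are illustrative):

```python
def dispatch(store, fields):
    """Answer a request that carries 1, 2, or 3-4 of the common fields."""
    n = len(fields)
    if n == 1:                       # type only: list the whole category
        return sorted(k for k in store if k.startswith(fields['type'] + '-'))
    key = '%s-%s' % (fields['type'], fields['id'])
    if n == 2:                       # type + ID: a full download request
        return store.get(key)
    local = store[key]               # 3 or 4 fields: mirror-list exchange
    if local['timestamp'] > fields['timestamp']:
        return local                 # we hold a newer version: full file
    return {k: local[k] for k in ('type', 'id', 'timestamp', 'mirrors')}
```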
How is ketchup done? We don't want two mirrors to redundantly ketchup the same newpage at the same time. The ketchup tasks can no longer be cronjobs. Instead we will have ketchuptask datapiece files. They contain the information on how that ketchup should be done (how many newpages in one run, how long to wait between downloads, etc.). It also says what the minimal time interval between ketchups is and when it was last completed. A ketchup bot molpies up regularly and checks the files. If the last ketchup was completed longer ago than the minimal time interval, the bot starts to ketchup. But before that it has to reserve the ketchup. A ketchup-reservation datapiece is created and sent to other mirrors. Only then, the ketchup begins. Unlike with other datapieces, the older reservations replace the newer ones. After the ketchup, the ketchup datapiece is updated and the reservation is removed. The reservations have a timeout. If a mirror wants to ketchup and sees that another mirror made a reservation that is too old now, it will remove it. A similar mechanism would be used for mustardpost delivering. I'd like all mustardposts to be visible on every mirror, after the regular posts in the mirrored thread. They would be distinguished by a different background color or something like that.
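The reservation check a ketchup bot could run might look like this (a sketch; the timeout value and field names are illustrative assumptions):

```python
RESERVATION_TIMEOUT = 2 * 3600     # e.g. two hours before a reservation goes stale

def may_ketchup(task, reservation, now):
    """May this mirror start the ketchup described by the task datapiece?"""
    if now - task['last_completed'] < task['min_interval']:
        return False               # completed too recently
    if reservation is not None and now - reservation['timestamp'] <= RESERVATION_TIMEOUT:
        return False               # another mirror is (apparently) on it
    return True                    # no reservation, or a stale one to remove and take over
```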
Because more than one process can operate on the data, race conditions can happen. To avoid this, some locking must be implemented. Before a process does something with a datapiece that may involve writing to it, it has to create a tempfile and lock it, then read, update, and write the datapiece, then unlock the tempfile and remove it. If the file was already locked, the process should wait until it isn't.
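A minimal sketch of this lock-then-update cycle, using POSIX flock (the .lock suffix is an illustrative convention; flock blocks until any other holder releases the lock, which gives the "wait until it isn't locked" behaviour):

```python
import fcntl, os

def update_datapiece(path, update):
    """Read the datapiece at path, apply update(text) -> text, write it back."""
    lock_path = path + '.lock'
    with open(lock_path, 'w') as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # waits if another process holds it
        try:
            with open(path) as f:
                text = f.read()
            with open(path, 'w') as f:
                f.write(update(text))
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    os.remove(lock_path)                   # tidy up the tempfile
```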
Some datatypes have a field that defines the direction (from or to a mirror). That's because I want to make it possible to change the avatar or signature from the mirror. And when the OTT is not responding at the time, it should be updated later. And we don't want the ketchup bot to replace the mirror-uploaded sig with the old one from the OTT.
When the mirror is viewed, the references to attachments, avatars, images are replaced with URLs. If the mirror doesn't have them archived yet but the reference is already there, it will replace it with the URL of the original.
These are some of my thoughts on a multi-mirror system. There are more but they are not fully converted to actual words.

^M that's a quote from the Redundant Book Of Redundancies
ETA: where does UNG come from? mrobdex only links to this: viewtopic.php?p=3556493&hilit=ungs#p3556493