I am currently writing a program where I have a graph and want to find the minimal sub-graph which satisfies my requirements.

The graph is directed with four types of nodes. It has many cycles but no 1 cycles. There are hundreds to low thousands of nodes. Nodes generally connected to around ten other nodes, but that can vary a lot.

The types of node and how they are connected:

Root nodes which may point to any number of Process nodes

Leaf nodes which may be pointed to by any number of Process nodes

Process nodes which may be pointed to by any number of Root or Product nodes and may point to any number of Product or Leaf nodes.

Product nodes which may be pointed to by any number of Process nodes and which may point to any number of Process nodes.

I wish to find the sub-graph which satisfies the following requirements:

Any Root or Leaf which is adjacent to a node in the sub-graph must be in the sub-graph

For every Process node in the sub-graph, all Product nodes which point to it must be included in the sub-graph

For every Product node in the sub-graph, at least one Process node which points to that Product must be present in the sub-graph

Medium priority: Minimize Root, Product, and Process nodes included in sub-graph

Low priority: Maximize Leaf nodes

Could use some help from smart people on xkcd with this algorithm.

## Finding a subgraph with given properties.

**Moderators:** phlip, Moderators General, Prelates

### Re: Finding a subgraph with given properties.

Per your description, the smallest subgraph satisfying your requirements is the empty graph. I assume you're starting with either a product or leaf, and you're looking for the smallest subgraph telling you how to build that product with the least amount of processing steps?

And if I understand your description, the only choices you have is to pick just one of the available parents of each product node?

I think the first step would be to estimate the amount of possible subgraphs that could satisfy your query. If that number is small, you can evaluate all of them and pick the best one. If that number is large, you need to employ heuristics (which will end up with a lot of trial and error).

You could implement a recursive marking algorithm: start with your product node, mark all its parents, mark all their parents etc. until you've reached the roots. For every product node you've marked, note how many parents it has, and multiply those numbers. For example, if that graph contains products A, B and C, which can be produced by 3, 4 and 5 different processes respectively, then you have an upper bound of 3*4*5 = 60 different subgraphs that satisfy your requirements.

How many combinations would your largest queries yield?

And if I understand your description, the only choices you have is to pick just one of the available parents of each product node?

I think the first step would be to estimate the amount of possible subgraphs that could satisfy your query. If that number is small, you can evaluate all of them and pick the best one. If that number is large, you need to employ heuristics (which will end up with a lot of trial and error).

You could implement a recursive marking algorithm: start with your product node, mark all its parents, mark all their parents etc. until you've reached the roots. For every product node you've marked, note how many parents it has, and multiply those numbers. For example, if that graph contains products A, B and C, which can be produced by 3, 4 and 5 different processes respectively, then you have an upper bound of 3*4*5 = 60 different subgraphs that satisfy your requirements.

How many combinations would your largest queries yield?

### Re: Finding a subgraph with given properties.

Actually I start with a process node and my goal is to find a cycle of processes that can produces all its inputs. I'm starting with all the process / resource definitions from the Mars Simulation Project and my goal is to use it as a starting point for finding possible seed factory configurations.

### Re: Finding a subgraph with given properties.

So it looks like if you map vertices onto products and sets onto processes, (you need to have at least one parent process per product, and you need at least one set to cover each vertex), you can reduce the problem of minimizing non-leaf nodes into set cover, which is NP-hard and can't be approximated better than order log n.

In fact since you can have multiple layers of product and process and product and process nodes, you can even build some sort of hierarchical set-covering problem into the graph, where the set of points which must be covered on layer i+1 depends on the sets selected on layer i of the problem.

Here's a rough sketch of a algorithm which might get reasonable answers:

Start with a single process node, P. Now we are forced to add in all parent product nodes and parent root nodes.

Each parent root node needs select at least one parent process node, (if any exist). Do so at random with a fixed seed. Now we just have process nodes left to satisfy, so repeat the process of selecting all parent product nodes and root nodes, etc.

When we're done, we'll have picked an alternating sequence of process nodes, product nodes, process nodes, product nodes, etc. So far we've taken O(n) time. Now in the first set of selected product nodes, begin trying out greedy alterations of which parent process to select. If we do each product node one by one separately, and each product node has at most O(n) parent nodes, then there are O(n^3) possible choices here. For each of these choices, throw out the rest of the solution and rerun the algorithm using the new choices of parent process, taking the solution with the best score (lowest # nodes) as the final choice. Now proceed to the second layer of process nodes, and repeat. Continue through all layers of process nodes. This should taken O(n^4) time in total.

Finally, at the end, include all the leaf children of all the process nodes.

In fact since you can have multiple layers of product and process and product and process nodes, you can even build some sort of hierarchical set-covering problem into the graph, where the set of points which must be covered on layer i+1 depends on the sets selected on layer i of the problem.

Here's a rough sketch of a algorithm which might get reasonable answers:

Start with a single process node, P. Now we are forced to add in all parent product nodes and parent root nodes.

Each parent root node needs select at least one parent process node, (if any exist). Do so at random with a fixed seed. Now we just have process nodes left to satisfy, so repeat the process of selecting all parent product nodes and root nodes, etc.

When we're done, we'll have picked an alternating sequence of process nodes, product nodes, process nodes, product nodes, etc. So far we've taken O(n) time. Now in the first set of selected product nodes, begin trying out greedy alterations of which parent process to select. If we do each product node one by one separately, and each product node has at most O(n) parent nodes, then there are O(n^3) possible choices here. For each of these choices, throw out the rest of the solution and rerun the algorithm using the new choices of parent process, taking the solution with the best score (lowest # nodes) as the final choice. Now proceed to the second layer of process nodes, and repeat. Continue through all layers of process nodes. This should taken O(n^4) time in total.

Finally, at the end, include all the leaf children of all the process nodes.

### Who is online

Users browsing this forum: No registered users and 5 guests