Akula wrote:Our team has turned into this hate-fueled juggernaut of profit. It's goddamn wonderful.
Michael McClary, in alt.fusion, wrote:Irrigation of the land with seawater desalinated by fusion power is ancient. It's called 'rain'.
MoonBuggy wrote:Awesome, I like it a lot (as do I like wing's idea - depending on how bandwidth intensive it is I will also volunteer my machines to the cause).
I also took the liberty of converting it from a 2MB to a 250KB PNG (same resolution). Hope you don't mind.
[Edit] Looks like a dump of just the current content from Wikipedia is only 3.4GB (makes sense to cache it locally if we're going to be mapping the lot, or they'll probably block my IP) and I now really want to make a poster that shows a map of the whole of Wikipedia.
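The "cache it locally" idea can be sketched roughly like this. It is only an illustration: `cached_fetch`, the temporary cache directory and the stand-in fetcher are hypothetical names that do not appear in the script in this thread, and a real crawler would pass in a fetcher that actually downloads the page.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical cache directory; a real run would use a persistent path.
my $cache_dir = tempdir(CLEANUP => 1);

# Return the page body for $url, calling $fetch->($url) only on a cache miss.
sub cached_fetch {
    my ($url, $fetch) = @_;
    my $file = File::Spec->catfile($cache_dir, md5_hex($url) . '.html');
    if (-e $file) {
        open(my $in, '<', $file) or die "Cannot read $file: $!";
        local $/;    # slurp the whole cached file
        my $body = <$in>;
        close $in;
        return $body;
    }
    my $body = $fetch->($url);
    open(my $out, '>', $file) or die "Cannot write $file: $!";
    print {$out} $body;
    close $out;
    return $body;
}

# Stand-in fetcher (an assumption for the demo): counts calls, no network.
my $calls = 0;
my $fake_fetch = sub { $calls++; return "<html>body of $_[0]</html>" };

cached_fetch('http://en.wikipedia.org/wiki/Xkcd', $fake_fetch) for 1 .. 3;
print "fetcher called $calls time(s)\n";
```

With something like this, re-running the crawl with different depth or spread settings would only hit Wikipedia for pages that have not been seen before.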
package wikicrawl;
use Graph::Easy;
use strict;
use Graph::Easy::Parser;
use LWP;
use HTML::TokeParser;
#things that shouldn't be looked at
my %bad = ("Wikipedia" => 1,
"Image" => 1,
"Talk" => 1,
"Help" => 1,
"Template" => 1,
"Portal" => 1,
"Special" => 1,
"User" => 1,
"Category" => 1
);
#max depth to crawl
my $maxDepth = 3;
#max number of links per node
my $maxSpread = 4;
#if 1, create graph from local txt
#if 0, crawl wikipedia for new data
my $useTxt = 0;
my $data = "graph {output: graphviz}\n";
my %visitedLinks;
sub crawl {
    my ($page, $depth) = @_;
    #don't come back
    return if $visitedLinks{$page};
    #crawl page
    my $ua = LWP::UserAgent->new();
    my $res = $ua->request(HTTP::Request->new(GET => $page));
    #page title without the " - Wikipedia..." suffix
    (my $title = $res->title) =~ s/ -.*//;
    print "Crawling " . $title . " at depth " . $depth . "\n";
    #set flag
    $visitedLinks{$page} = 1;
    my $content = $res->content;
    #parse anchors
    my $parser = HTML::TokeParser->new(\$content) or die("Could not parse page.");
    #iterate; $i counts accepted links only, and $maxSpread == 0 means no limit
    for(my $i = 0; (my $token = $parser->get_tag("a")) && ($i < $maxSpread || $maxSpread == 0);) {
        my $url = $token->[1]{href};
        my $alt = $token->[1]{title};
        next if $url !~ m/^\/wiki\//; #no pages outside of wikipedia
        next if $alt =~ m/\[/; #no brackets
        my @chunks = split ":", substr($url, 6); #extract special pages, if any
        next if $bad{$chunks[0]}; #no bad pages
        $i++;
        #print $url . "\n";
        $data .= sprintf("[%s] -> [%s]\n", $title, $alt) if $title ne $alt; #this is the format that graphEasy uses
        #same host as the root node below
        crawl("http://en.wikipedia.org" . $url, $depth + 1) unless $depth + 1 > $maxDepth; #recurse
    }
}
#root node
my $start = "http://en.wikipedia.org/wiki/Xkcd";
if($useTxt && open(STARTDATA, "<", "data.txt")) {
    local $/; #slurp the whole file
    $data = <STARTDATA>;
    close STARTDATA;
}
else {
    crawl($start, 0);
    open(DATA, ">", "data.txt") or die("Could not open file!");
    print DATA $data;
    close DATA;
}
print "Generating chart...\n";
my $parser = Graph::Easy::Parser->new();
my $graph = $parser->from_text($data);
open(OUTPUT, ">", "graph_output.txt") or die("Could not open file!");
print OUTPUT $graph->output();
close OUTPUT;
digraph GRAPH_0 {
// Generated by Graph::Easy 0.60 at Mon Mar 10 16:45:22 2008
edge [ arrowhead=open ];
graph [ rankdir=LR overlap=false ];
node [
fontsize=11,
fillcolor=white,
style=filled,
shape=box ];
xkcd -> Website [ color="#000000" ]
xkcd -> "Randall Munroe" [ color="#000000" ]
xkcd -> September [ color="#000000" ]
xkcd -> Author [ color="#000000" ]
Author -> "Michel Foucault" [ color="#000000" ]
tiny wrote:A poster? How about a gigantic 3D model that you can climb? Would that work?
I think the reason I sometimes get unconnected branches is that, to save bandwidth, I take the name of the target article from the link's alternate text, but sometimes the article name doesn't match it (for example, the alternate text "MIT" leads to Massachusetts Institute of Technology). Of course, bandwidth is moot if you're going to download Wikipedia.
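One possible way around the mismatch, sketched under assumptions: key every node by its URL path while crawling, remember the real title of each page once it has been fetched, and only turn paths into labels when writing the graph text. The helper names (`record_page`, `record_link`, `edges_as_text`) and the sample paths are made up for illustration and are not part of the script above. Pages that are linked but never fetched still fall back to the alternate text.

```perl
use strict;
use warnings;

my %title_of;   # URL path => real page title, known only for fetched pages
my @edges;      # each entry: [from_path, to_path, alternate_text]

# Call when a page has actually been fetched and its title is known.
sub record_page {
    my ($path, $title) = @_;
    $title_of{$path} = $title;
}

# Call for every link found on a page.
sub record_link {
    my ($from, $to, $alt) = @_;
    push @edges, [$from, $to, $alt];
}

# Build Graph::Easy-style "[A] -> [B]" lines, preferring real titles.
sub edges_as_text {
    my $out = '';
    for my $edge (@edges) {
        my ($from, $to, $alt) = @$edge;
        my $src = $title_of{$from} // $from;
        my $dst = $title_of{$to}   // $alt;   # fall back to the alternate text
        next if $src eq $dst;
        $out .= "[$src] -> [$dst]\n";
    }
    return $out;
}

# Hypothetical sample data, not taken from a real crawl.
record_page('/wiki/Xkcd', 'xkcd');
record_link('/wiki/Xkcd', '/wiki/MIT', 'MIT');
record_page('/wiki/MIT', 'Massachusetts Institute of Technology');
record_link('/wiki/MIT', '/wiki/Cambridge', 'Cambridge');

print edges_as_text();
```

Because both edges resolve '/wiki/MIT' to the same label, the MIT node stays connected instead of splitting into "MIT" and "Massachusetts Institute of Technology".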
perl -Ilib examples/wikicrawl.pl --maxdepth=4 --maxspread=3 --lang=en
recurve boy wrote:If you are on Mac you can use http://pathway.screenager.be/
It doesn't map everything. But if you're like me and like to read Wikipedia occasionally - it's still interesting if not entirely accurate - it will map where you have been, where you can go etc and save the graph.
It's perhaps more interesting than just mapping the entire wiki since we can keep track of your reading habits, or you can play little games like see how many links are between say 'Tungsten' and 'Wet T-shirt Contest'.
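The "how many links between two articles" game comes down to a shortest-path search over the link graph. A minimal breadth-first sketch, using a made-up toy graph (the links below are hypothetical and not real Wikipedia data) and a hypothetical helper called `link_distance`:

```perl
use strict;
use warnings;

# Hypothetical toy link graph: article => [articles it links to].
my %links = (
    'Tungsten'   => ['Metal', 'Light bulb'],
    'Metal'      => ['Chemistry'],
    'Light bulb' => ['Electricity', 'Party'],
    'Party'      => ['Wet T-shirt Contest'],
);

# Breadth-first search: returns the minimum number of links to follow
# from $from to reach $to, or undef if $to is unreachable.
sub link_distance {
    my ($graph, $from, $to) = @_;
    my %dist  = ($from => 0);
    my @queue = ($from);
    while (@queue) {
        my $page = shift @queue;
        return $dist{$page} if $page eq $to;
        for my $next (@{ $graph->{$page} || [] }) {
            next if exists $dist{$next};    # already reached by a shorter path
            $dist{$next} = $dist{$page} + 1;
            push @queue, $next;
        }
    }
    return undef;
}

my $hops = link_distance(\%links, 'Tungsten', 'Wet T-shirt Contest');
print defined $hops ? "$hops links\n" : "no path\n";
```

On a real map the hash would be filled from crawled pages, and the answer depends on link direction, since Wikipedia links are one-way.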
tels wrote:@Xeio: This would need some interactive renderer, tho. (I think something in Java has been done, it was probably named zoomviewer)
nyeguy wrote:recurve boy wrote:If you are on Mac you can use http://pathway.screenager.be/
It doesn't map everything. But if you're like me and like to read Wikipedia occasionally - it's still interesting if not entirely accurate - it will map where you have been, where you can go etc and save the graph.
It's perhaps more interesting than just mapping the entire wiki since we can keep track of your reading habits, or you can play little games like see how many links are between say 'Tungsten' and 'Wet T-shirt Contest'.
That application is so cool. Definitely saved.