Map of Wikipedia rooted at XKCD

Postby integral » Fri Apr 18, 2008 10:43 pm UTC

I don't know where this would fit, so I decided to put it here. This is a map of a part of Wikipedia, done with Graph::Easy and Perl. The root node is XKCD, the depth is 3, and the max links per page is 3.

Click to view

I have no idea why that loop of philosophy is in the bottom right. I think it's a bug in my inflector.
Last edited by integral on Fri Apr 18, 2008 11:10 pm UTC, edited 1 time in total.
integral
 
Posts: 56
Joined: Fri Mar 28, 2008 1:53 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby wing » Fri Apr 18, 2008 10:48 pm UTC

So do the world a favor and map it all! Or post the scripts and I'll do it.
I AM A SEXY, SHOELESS GOD OF WAR!
Akula wrote:Our team has turned into this hate-fueled juggernaut of profit. It's goddamn wonderful.
wing
the /b/slayer
 
Posts: 1876
Joined: Tue May 29, 2007 5:56 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby MoonBuggy » Fri Apr 18, 2008 10:57 pm UTC

Awesome, I like it a lot (as I do wing's idea - depending on how bandwidth-intensive it is I will also volunteer my machines to the cause).

I also took the liberty of converting it from a 2MB to a 250KB PNG (same resolution). Hope you don't mind.
Spoiler:
test.png
test.png (251.13 KiB) Viewed 3998 times


[Edit] Looks like a dump of just the current content from Wikipedia is only 3.4GB (makes sense to cache it locally if we're going to be mapping the lot, or they'll probably block my IP) and I now really want to make a poster that shows a map of the whole of Wikipedia.
Michael McClary, in alt.fusion, wrote:Irrigation of the land with seawater desalinated by fusion power is ancient. It's called 'rain'.
MoonBuggy
 
Posts: 683
Joined: Thu Sep 06, 2007 12:14 am UTC
Location: On a journey through time and space

Re: Map of Wikipedia rooted at XKCD

Postby integral » Fri Apr 18, 2008 11:16 pm UTC

MoonBuggy wrote:Awesome, I like it a lot (as I do wing's idea - depending on how bandwidth-intensive it is I will also volunteer my machines to the cause).

I also took the liberty of converting it from a 2MB to a 250KB PNG (same resolution). Hope you don't mind.
Spoiler:
test.png


[Edit] Looks like a dump of just the current content from Wikipedia is only 3.4GB (makes sense to cache it locally if we're going to be mapping the lot, or they'll probably block my IP) and I now really want to make a poster that shows a map of the whole of Wikipedia.


Thanks, I've uploaded the small version.

This is the script I used. The Graph::Easy module is required.
Code: Select all
package wikicrawl;

use Graph::Easy;
use strict;
use Graph::Easy::Parser;
use LWP;
use HTML::TokeParser;

#things that shouldn't be looked at
my %bad = ("Wikipedia" => 1,
         "Image" => 1,
         "Talk" => 1,
         "Help" => 1,
         "Template" => 1,
         "Portal" => 1,
         "Special" => 1,
         "User" => 1,
         "Category" => 1
        );

#max depth to crawl
my $maxDepth = 3;
#max number of links per node
my $maxSpread = 4;

#if 1, create graph from local txt
#if 0, crawl wikipedia for new data
my $useTxt = 0;
my $data = "graph {output: graphviz}\n";

my %visitedLinks;

sub crawl {
   my ($page, $depth) = @_;
   #don't come back
   return if $visitedLinks{$page} == 1;

   #crawl page
   my $ua = LWP::UserAgent->new();
   my $res = $ua->request(HTTP::Request->new(GET => $page));
   (my $title = $res->title) =~ s/ -.*//; #strip " - Wikipedia, the free encyclopedia" from the title
   print "Crawling " . $title . " at depth " . $depth . "\n";

   #set flag
   $visitedLinks{$page} = 1;
   my $content = $res->content;

   #parse anchors
   my $parser = HTML::TokeParser->new(\$content) or die("Could not parse page.");

   #iterate
   for(my $i = 0; (my $token = $parser->get_tag("a")) && ($i < $maxSpread || $maxSpread == 0);) {
      my $url = $token->[1]{href};
      my $alt = $token->[1]{title};


      next if $url !~ m/^\/wiki\//; #no pages outside of wikipedia
      next if $alt =~ m/\[/; #no brackets
      my @chunks = split ":", substr($url, 6); #extract special pages, if any
      next if $bad{$chunks[0]} == 1; #no bad pages

      $i++;
      #print $url . "\n";
      $data .= sprintf("[%s] -> [%s]\n", $title, $alt) if $title ne $alt; #this is the format that graphEasy uses
      crawl("http://en.wikipedia.com" . $url, $depth + 1) unless $depth + 1 > $maxDepth; #recurse
   }
}

#root node
my $start = "http://en.wikipedia.org/wiki/Xkcd";

if($useTxt && open(STARTDATA, "data.txt")) {
   local $/; #slurp the whole file
   $data = <STARTDATA>;
   close STARTDATA;
}
else {
   crawl($start, 0);
   open(DATA, ">", "data.txt") or die("Could not open file!");
   print DATA $data;
   close DATA;
}

print "Generating chart...\n";
my $parser = Graph::Easy::Parser->new();
my $graph = $parser->from_text($data);
open(OUTPUT, ">", "graph_output.txt") or die("Could not open file!");
print OUTPUT $graph->output();
close OUTPUT;


That gives something similar to this:
Code: Select all
digraph GRAPH_0 {

  // Generated by Graph::Easy 0.60 at Mon Mar 10 16:45:22 2008

  edge [ arrowhead=open ];
  graph [ rankdir=LR overlap=false ];
  node [
    fontsize=11,
    fillcolor=white,
    style=filled,
    shape=box ];

  xkcd -> Website [ color="#000000" ]
  xkcd -> "Randall Munroe" [ color="#000000" ]
  xkcd -> September [ color="#000000" ]
  xkcd -> Author [ color="#000000" ]
  Author -> "Michel Foucault" [ color="#000000" ]


This is then read by Graphviz to get the finished product. Graphviz has several layout engines that dictate how the web is laid out, but I forget which one I used to get this web.
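If anyone wants to render one themselves: the script writes the Graphviz source to graph_output.txt, so something like this should do it (assuming Graphviz is installed; I think I used neato, but dot, fdp, twopi and circo each give a different layout, and the output file name is arbitrary):
Code: Select all
neato -Tpng graph_output.txt -o wikimap.png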

I really would like a 3D web for Wikipedia though. It's only due to the cap on links per page that the web is even readable, because otherwise every page would have 20-50 links to other pages and it'd be nearly incomprehensible.

I think the reason why I sometimes have unconnected branches is that, to save bandwidth, I get the name of the target article from the alternate text, but sometimes the article name doesn't match the alternate text (e.g. alternate text "MIT" leads to Massachusetts Institute of Technology). Of course, bandwidth is moot if you're going to download Wikipedia.
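If I wanted to fix that properly, something like this could replace the alt-text lookup (just a sketch meant to drop into the script above, so it assumes LWP is already loaded; it costs one extra request per link, which defeats the point of saving bandwidth):
Code: Select all
#sketch only, not in the script above: resolve a link's real title by
#fetching the page and reading its parsed <title>, instead of trusting
#the anchor's title attribute
sub canonical_title {
   my ($url) = @_;
   my $ua = LWP::UserAgent->new();
   my $res = $ua->request(HTTP::Request->new(GET => $url));
   (my $title = $res->title) =~ s/ -.*//; #strip " - Wikipedia, the free encyclopedia"
   return $title;
}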
Last edited by integral on Fri Apr 18, 2008 11:19 pm UTC, edited 2 times in total.
integral
 
Posts: 56
Joined: Fri Mar 28, 2008 1:53 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby Chrismclegless » Fri Apr 18, 2008 11:17 pm UTC

Heh - a map of Wikipedia would suddenly change the dynamic of the wikisoccer game in Forum Games...
Londo: Maybe it was something I said?
G'Kar: Maybe it is everything you say.

How tasty is YOUR brain?

GENERATION 21: The first time you see this, copy it into your sig on any forum and add 1 to the generation. Social experiment.
Chrismclegless
 
Posts: 240
Joined: Wed Dec 05, 2007 8:25 pm UTC
Location: Sigma 957

Re: Map of Wikipedia rooted at XKCD

Postby thedufer » Sun Apr 20, 2008 3:17 am UTC

MoonBuggy wrote:Awesome, I like it a lot (as I do wing's idea - depending on how bandwidth-intensive it is I will also volunteer my machines to the cause).

I also took the liberty of converting it from a 2MB to a 250KB PNG (same resolution). Hope you don't mind.
Spoiler:
test.png


[Edit] Looks like a dump of just the current content from Wikipedia is only 3.4GB (makes sense to cache it locally if we're going to be mapping the lot, or they'll probably block my IP) and I now really want to make a poster that shows a map of the whole of Wikipedia.


That'd be an enormous poster. Or it'd be in a font too small to read. I mean, over 2 million article titles? Not gonna fit on any of the walls in my house...
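Rough numbers, just to put it in perspective: even at a cramped 4 cm x 0.5 cm = 2 cm^2 per title, 2 million titles comes to about 4,000,000 cm^2 = 400 m^2 of paper, i.e. a square roughly 20 m on a side, before you draw a single edge.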

In other news, this sounds like fun and I'm looking into writing a C program to do it in 3D in OpenGL. Unfortunately my first download of the wikipedia dump got corrupted, so I'll hafta try again.
thedufer
 
Posts: 263
Joined: Mon Aug 06, 2007 2:11 am UTC
Location: Northern VA (not to be confused with VA)

Re: Map of Wikipedia rooted at XKCD

Postby wing » Sun Apr 20, 2008 4:33 am UTC

I'm in the process of setting up a machine for the ubercrawl. If anyone would like to play with it further after I'm done (maybe a bugfix to prevent whatever the hell happened with the philosophy cluster on that test map?), I'd be willing to hand out shell logins (Debian) in exchange for promises to be nice to my home bandwidth (which shouldn't be a problem, since I'm mirroring Wikipedia).

I may have things running as soon as a few hours from now.
I AM A SEXY, SHOELESS GOD OF WAR!
Akula wrote:Our team has turned into this hate-fueled juggernaut of profit. It's goddamn wonderful.
wing
the /b/slayer
 
Posts: 1876
Joined: Tue May 29, 2007 5:56 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby tiny » Sun Apr 20, 2008 12:52 pm UTC

A poster? How about a gigantic 3D model that you can climb? Would that work? :-D
"I write what I see, the endless procession to the guillotine." ~ de Sade
tiny
 
Posts: 771
Joined: Tue Oct 23, 2007 5:34 pm UTC
Location: Below the fifth cellar.

Re: Map of Wikipedia rooted at XKCD

Postby wst » Sun Apr 20, 2008 1:09 pm UTC

tiny wrote:A poster? How about a gigantic 3D model that you can climb? Would that work? :-D

How about a giant 4D model immersed in 3D space? (To prevent you getting trapped.)
I'm thinking... giant Klein bottle. With the map of Wikipedia imposed on both (one?) sides.
"If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family anatidae on our hands." - Douglas Adams
wst
 
Posts: 2584
Joined: Sat Nov 24, 2007 10:06 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby integral » Sun Apr 20, 2008 5:27 pm UTC

The problem with the philosophy:

I think the reason why I sometimes have unconnected branches is that, to save bandwidth, I get the name of the target article from the alternate text, but sometimes the article name doesn't match the alternate text (e.g. alternate text "MIT" leads to Massachusetts Institute of Technology). Of course, bandwidth is moot if you're going to download Wikipedia.


I just looked over my raw data for the small graph, and it looks like "Roland Barth" linked to "Western Philosophy", which was actually "Western philosophy." Western philosophy then spawned the little cluster.
integral
 
Posts: 56
Joined: Fri Mar 28, 2008 1:53 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby Robin S » Sun Apr 20, 2008 10:51 pm UTC

Something similar to this was done a couple of years ago for the whole of Wikipedia, with statistical analysis to boot: Temporal Analysis of the Wikigraph
This is a placeholder until I think of something more creative to put here.
Robin S
 
Posts: 3579
Joined: Wed Jun 27, 2007 7:02 pm UTC
Location: London, UK

Re: Map of Wikipedia rooted at XKCD

Postby tels » Sat May 17, 2008 6:50 am UTC

Heh,

this is a very cool usage of Graph::Easy - and clearly I need to optimize its graphviz output :D

If you have any problems with it, don't hesitate to file a bug report! :oops:

Tels
tels
 
Posts: 7
Joined: Fri May 16, 2008 8:35 pm UTC
Location: Antarctica

Re: Map of Wikipedia rooted at XKCD

Postby Xeio » Sat May 17, 2008 2:22 pm UTC

Actually, trimming the tree to 4-5 levels deep for each rendering, and being able to traverse from article to article, would be pretty neat. It'd be like the real wiki, but less informative (unless you're really curious about how articles link to each other). Displaying the whole 2 million at once would be a bit overkill though; just too much information.
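Once the link data exists, the re-rooting part is cheap; here's a toy sketch in Perl (to match the crawler), with made-up edges, that dumps a depth-limited Graph::Easy snippet around a chosen root:
Code: Select all
use strict;
use warnings;

#toy sketch: emit a Graph::Easy snippet for everything within N hops of a
#chosen root, given an adjacency list (the edges below are made up)
my %links = (
   'Xkcd'     => ['Webcomic', 'Randall Munroe'],
   'Webcomic' => ['Internet', 'Comics'],
   'Internet' => ['Computer network'],
);

sub subgraph {
   my ($root, $maxdepth) = @_;
   my $out   = "graph {output: graphviz}\n";
   my %depth = ($root => 0);
   my @queue = ($root);
   while (@queue) {
      my $page = shift @queue;
      next if $depth{$page} >= $maxdepth;
      for my $next (@{ $links{$page} || [] }) {
         $out .= "[$page] -> [$next]\n";
         next if exists $depth{$next};
         $depth{$next} = $depth{$page} + 1;
         push @queue, $next;
      }
   }
   return $out;
}

print subgraph('Xkcd', 2);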
Xeio
Friends, Faidites, Countrymen
 
Posts: 4858
Joined: Wed Jul 25, 2007 11:12 am UTC
Location: C:\Users\Xeio\

Re: Map of Wikipedia rooted at XKCD

Postby integral » Sat May 17, 2008 7:32 pm UTC

I have a feeling the completed graph would be unreadable in a 2-dimensional image.
integral
 
Posts: 56
Joined: Fri Mar 28, 2008 1:53 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby tels » Sat May 17, 2008 8:05 pm UTC

@Xeio: This would need some interactive renderer, tho. (I think something in Java has been done, it was probably named zoomviewer)
@integral: I think you are right :)

In any event, integral kindly gave me permission to release his script under the GPL as part of Graph::Easy :) So I added it to my examples folder and worked on it a bit. While doing so, I found (and fixed) two bugs in Graph::Easy - woohoo! Thank you! :D

If you want to see the new version, please download http://bloodgate.com/perl/packages/deve ... .63.tar.gz and look into examples/. Here is how you use it:

Code: Select all
perl -Ilib examples/wikicrawl.pl --maxdepth=4 --maxspread=3 --lang=en


Things I have done:

* command-line support for options
* POD with some help
* support for Unicode
* added support for different languages (currently working are "en", "fr" and "de")
* the crawling is now breadth-first instead of depth-first (to correctly identify the depth of a node)
* redirects are correctly handled (e.g. "Western Philosophy" vs. "Western philosophy", "MIT" vs. "Massachusetts Institute of Technology", etc.)
* nodes are now colored according to their depth (along the Hue part of the HSL color space)
* filter out disambiguation pages
* assorted fixes

Here are three example files rendered with the method above for the three languages:

http://bloodgate.com/perl/graph/wikicrawl/

(Edit: fixed the bug with redirections in French (it was a Unicode issue) and updated the example files above)
(Edit #2: Fixed the bug that colored nodes wrongly. Updated the examples; now the colors are finally right :D)
Last edited by tels on Sat May 17, 2008 10:28 pm UTC, edited 1 time in total.
tels
 
Posts: 7
Joined: Fri May 16, 2008 8:35 pm UTC
Location: Antarctica

Re: Map of Wikipedia rooted at XKCD

Postby tels » Sat May 17, 2008 10:20 pm UTC

A little bit of brainstorming about mapping it all:

* the node size must be reduced. This could be done by dropping the article names and making them just small dots, with the article name as a mouse-over link. That would work for SVG (aka computer output). For a giant poster, one could render it as SVG and print it at a really high resolution on really big paper - but would it fit 2 million nodes?
* processing the data is hard. Crawling is not practical, but processing the XML dump is equally challenging. Plus, rendering it might take weeks.

Have to investigate how to do this. :twisted:
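A very rough sketch of the dump side, just to show the shape of it (the file name is only a placeholder for the uncompressed pages-articles dump; a real version needs redirect handling, namespaces, templates and so on):
Code: Select all
#!/usr/bin/perl
use strict;
use warnings;

#rough sketch only: pull "[title] -> [link]" pairs straight out of an
#uncompressed XML dump with regexes; a real parser would be more robust
my $title;
open(my $dump, '<:utf8', 'pages-articles.xml') or die("Could not open dump: $!");
while (my $line = <$dump>) {
   $title = $1 if $line =~ m{<title>([^<]+)</title>};
   while ($line =~ m/\[\[([^\]|#]+)/g) {    #matches [[Target]] and [[Target|label]]
      my $target = $1;
      next if $target =~ m/:/;              #skip Image:, Category:, etc. namespaces
      print "[$title] -> [$target]\n" if defined $title && $title ne $target;
   }
}
close $dump;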
tels
 
Posts: 7
Joined: Fri May 16, 2008 8:35 pm UTC
Location: Antarctica

Re: Map of Wikipedia rooted at XKCD

Postby recurve boy » Sun May 18, 2008 1:22 am UTC

If you are on Mac you can use http://pathway.screenager.be/

It doesn't map everything. But if you're like me and like to read Wikipedia occasionally, it will map where you have been, where you can go, etc., and save the graph - still interesting, if not entirely accurate.

It's perhaps more interesting than just mapping the entire wiki, since you can keep track of your reading habits, or play little games like seeing how many links are between, say, 'Tungsten' and 'Wet T-shirt Contest'.
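(For anyone building their own map: that "how many links between X and Y" game is just a breadth-first search over the link data. A toy sketch, with made-up edges:)
Code: Select all
use strict;
use warnings;

#toy sketch: shortest link distance between two articles, given an
#adjacency list like the one a crawler or dump parser could build
my %links = (
   'Tungsten' => ['Metal', 'Sweden'],
   'Metal'    => ['Music', 'T-shirt'],
   'T-shirt'  => ['Wet T-shirt contest'],
);

sub link_distance {
   my ($from, $to) = @_;
   my %dist  = ($from => 0);
   my @queue = ($from);
   while (@queue) {
      my $page = shift @queue;
      return $dist{$page} if $page eq $to;
      for my $next (@{ $links{$page} || [] }) {
         next if exists $dist{$next};
         $dist{$next} = $dist{$page} + 1;
         push @queue, $next;
      }
   }
   return -1; #unreachable
}

print link_distance('Tungsten', 'Wet T-shirt contest'), "\n"; #prints 3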
recurve boy
 
Posts: 353
Joined: Wed Jan 31, 2007 5:48 am UTC
Location: Sydney Australia

Re: Map of Wikipedia rooted at XKCD

Postby nyeguy » Sun May 18, 2008 2:28 am UTC

recurve boy wrote:If you are on Mac you can use http://pathway.screenager.be/

It doesn't map everything. But if you're like me and like to read Wikipedia occasionally, it will map where you have been, where you can go, etc., and save the graph - still interesting, if not entirely accurate.

It's perhaps more interesting than just mapping the entire wiki, since you can keep track of your reading habits, or play little games like seeing how many links are between, say, 'Tungsten' and 'Wet T-shirt Contest'.

That application is so cool. Definitely saved.
nyeguy
 
Posts: 580
Joined: Sat Aug 04, 2007 5:59 pm UTC

Re: Map of Wikipedia rooted at XKCD

Postby Xeio » Sun May 18, 2008 2:49 am UTC

tels wrote:@Xeio: This would need some interactive renderer, tho. (I think something in Java has been done, it was probably named zoomviewer)


Once you build the dataset, assuming you manage to organize it decently (indexing nodes by article name, then attaching the links somehow, I suppose), it doesn't seem like rendering would be a 'huge' hurdle; scaling it to render the entire dataset would be... difficult. I'm not sure how you're drawing the nodes exactly, but if the library allows selecting a node, you could navigate by clicking and just reload based on the new base node.

Though getting a dataset of 2 million nodes would be messy (and big; you'd probably want to build it in one program, then render it in another). As for rendering, I have no idea about libraries to do it, and even if there are, I can't help but think they wouldn't handle 2 million nodes gracefully (it already looks like what you're using is having trouble keeping everything from overlapping).

I'd suggest testing by increasing the depth of the rendering to see how well it handles each additional layer (the dataset grows exponentially, and the graphs only get more complex... :shock: ).
Xeio
Friends, Faidites, Countrymen
 
Posts: 4858
Joined: Wed Jul 25, 2007 11:12 am UTC
Location: C:\Users\Xeio\

Re: Map of Wikipedia rooted at XKCD

Postby tels » Sun May 18, 2008 7:43 am UTC

Rendering 2 million nodes:

This topic is often discussed on the graphviz mailing list :) In our current code, Graph::Easy is just used to construct the graph and then feed it to a Graphviz renderer (usually neato), and that renderer generates a static image. This does not scale very well; with 2 million nodes you end up with a gigantic image that doesn't even fit into memory.

What one needs is a viewer that does it differently.

Here are a few interactive viewers listed:

http://www.graphviz.org/Resources.php

This is the one I remembered:

http://zvtm.sourceforge.net/zgrviewer.html

Edit: Building such an interactive viewer is a lot of work, and currently none of my code even comes close or has a GUI (except my little AJAX test, but that is so far from complete that I won't even mention the URL :D)
tels
 
Posts: 7
Joined: Fri May 16, 2008 8:35 pm UTC
Location: Antarctica

Re: Map of Wikipedia rooted at XKCD

Postby tels » Sun May 18, 2008 7:51 am UTC

nyeguy wrote:
recurve boy wrote:If you are on Mac you can use http://pathway.screenager.be/

It doesn't map everything. But if you're like me and like to read Wikipedia occasionally - it's still interesting if not entirely accurate - it will map where you have been, where you can go etc and save the graph.

It's perhaps more interesting than just mapping the entire wiki since we can keep track of your reading habits, or you can play little games like see how many links are between say 'Tungsten' and 'Wet T-shirt Contest'.

That application is so cool. Definitely saved.


Unfortunately it's Mac only... :(
tels
 
Posts: 7
Joined: Fri May 16, 2008 8:35 pm UTC
Location: Antarctica

Re: Map of Wikipedia rooted at XKCD

Postby tels » Mon May 19, 2008 5:19 pm UTC

I've started to work on doing the graph from the XML dumps. In some sense it was easier than I thought, but in other ways much more complicated :D

To track my progress I started this article on my wiki:

http://bloodgate.com/wiki/Graphing_Wikipedia

Will need some time to fill in the details and upload the sample pictures, but in a few hours you should have some pretty graphs from the iewiki - English, German or French are definitely too big for the current software I have :D
tels
 
Posts: 7
Joined: Fri May 16, 2008 8:35 pm UTC
Location: Antarctica

Re: Map of Wikipedia rooted at XKCD

Postby integral » Tue May 20, 2008 7:39 pm UTC

The uncompressed XML is actually only 15 GB.
integral
 
Posts: 56
Joined: Fri Mar 28, 2008 1:53 am UTC

Re: Map of Wikipedia rooted at XKCD

Postby tels » Wed May 21, 2008 6:40 pm UTC

That's some good news, but 15 GB is still way too much to fit in memory in one go.

I haven't had time to post my code to the wiki yet, but my current experiments with the iewiki and then the ndswiki indicate it will take some real big power to work on the enwiki - the ndswiki is only 8.6 MB compressed but already takes minutes just to convert the XML to a graph, not to mention rendering the result.

It needs some serious thinning of the data to be able to handle 2 million nodes without ending up with an unreadable graphic :)
tels
 
Posts: 7
Joined: Fri May 16, 2008 8:35 pm UTC
Location: Antarctica

