xkcd download script

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Prelates, Moderators General

xkcd download script

Postby the mishanator » Tue Aug 03, 2010 11:23 pm UTC

I've written this script:
Code: Select all
#!/bin/bash

cd $HOME/*top
mkdir xkcd_comics
cd xkcd_comics
for ((i=0;i<=1000;i++)); do #this really is variable, i just use 1000 because i'm too lazy to change it every time.
     wget http://xkcd.com/$i
     link=$(cat $PWD/*.html | grep -i hotlinking | sed -e 's\<h3>Image URL (for hotlinking/embedding): \\g' -e 's\</h3>\\g')
     name=$(echo $link | sed -e 's\http://imgs.xkcd.com/comics/\\g')
     newName="(${i})-${name}"
     rm *.html
     wget -O $newName $link
done


it basically is to download and properly name xkcd comics... however it misses exactly one comic... i have no idea which it is as i've yet to find the time to go through them all and count. how can i fix this/make the code better?
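For anyone following along, the grep/sed extraction step can be exercised offline against a sample of the markup the script expects. The h3 line below is a made-up sample (the real page markup may differ), and it uses ordinary `|` sed delimiters rather than the backslash delimiters above, which many seds reject:

```shell
# Hypothetical sample of the "hotlinking" line the script greps for.
h3='<h3>Image URL (for hotlinking/embedding): http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg</h3>'
i=1
# Strip the fixed prefix and the closing tag to leave just the URL.
link=$(printf '%s\n' "$h3" | sed -e 's|<h3>Image URL (for hotlinking/embedding): ||' -e 's|</h3>||')
name=${link##*/}            # drop everything up to the last slash
newName="(${i})-${name}"
echo "$newName"             # (1)-barrel_cropped_(1).jpg
```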
the mishanator
 
Posts: 209
Joined: Mon May 31, 2010 6:49 pm UTC

Re: xkcd download script

Postby hotaru » Tue Aug 03, 2010 11:25 pm UTC

the mishanator wrote:however it misses exactly one comic...

it wouldn't happen to be number 404, would it?
Code: Select all
uint8_t f(uint8_t n)
{ if (!(n&1)) return 2;
  if (n==169) return 13; if (n==121||n==143) return 11;
  if (n==77||n==91) return 7; if (n==3||n==5) return 0;
  n=(n>>4)+(n&0xF); n+=n>>4; n&=0xF;
  return (n==3||n==6||n==9||n==12||n==15)?3:(n==5||n==10)?5:0; }
hotaru
 
Posts: 949
Joined: Fri Apr 13, 2007 6:54 pm UTC

Re: xkcd download script

Postby ++$_ » Tue Aug 03, 2010 11:55 pm UTC

the mishanator wrote:how can i fix this/make the code better?
Write it in perl! That makes everything better!

Also, make it get the title text too, and put it into an html file that you can use to view the comic in a browser with the title text and even a nice title in the browser title bar!
Code: Select all
#!perl
#cd to the directory before running the script!

my $i;
for ($i=1;;++$i) {
    last unless $i == 404 or `wget -O - http://xkcd.com/$i` =~ m{(<img src="(http://imgs\.xkcd\.com/comics/([^"]+))"(.*?)/>)};
    system("wget -O $i-$3 $2");
    open(OUT, ">comic$i.html") or warn "Couldn't create file comic$i.html: $!\n";
    print OUT <<EOF;
<html><head><title>$3</title></head><body><img src="$i-$3" $4/></body></html>
EOF
    close OUT;
}
print "Done at $i\n";
This has been tested exactly 0 times, so it might do something completely different from what I think it should. Heck, for all I know it doesn't compile.
++$_
Mo' Money
 
Posts: 2370
Joined: Thu Nov 01, 2007 4:06 am UTC

Re: xkcd download script

Postby the mishanator » Tue Aug 03, 2010 11:58 pm UTC

Yeah, it's 404... I wonder why it misses it. I don't know much Perl... I just used bash because it's easy.

++$_, it does get the titles... it pins them to the name of the downloaded image along with the number.
the mishanator
 
Posts: 209
Joined: Mon May 31, 2010 6:49 pm UTC

Re: xkcd download script

Postby Sc4Freak » Wed Aug 04, 2010 12:01 am UTC

It's a joke on XKCD's part. Comic #404, amusingly enough, gives you a 404.
Sc4Freak
 
Posts: 673
Joined: Thu Jul 12, 2007 4:50 am UTC
Location: Redmond, Washington

Re: xkcd download script

Postby ++$_ » Wed Aug 04, 2010 12:14 am UTC

the mishanator wrote:++$_ , it does get the titles... it pins it to the name of the d/l'd image along with the number
Yeah, but it doesn't get the title text (usually known as alt-text, but if you get the "alt" attribute you will end up getting the wrong thing).
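To make the distinction concrete: the hover text lives in the img tag's title attribute, not its alt attribute. A quick sketch against a made-up tag (the attribute values here are illustrative, not scraped):

```shell
# Hypothetical img tag in the style of a comic page.
tag='<img src="http://imgs.xkcd.com/comics/family_circus.jpg" title="This was my friend David'\''s idea" alt="Family Circus" />'
hover=$(printf '%s\n' "$tag" | sed 's/.*title="\([^"]*\)".*/\1/')  # the hover ("title") text
alt=$(printf '%s\n' "$tag" | sed 's/.*alt="\([^"]*\)".*/\1/')      # the "wrong thing"
echo "$hover"
echo "$alt"
```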

EDIT: I wish I found bash easy.
++$_
Mo' Money
 
Posts: 2370
Joined: Thu Nov 01, 2007 4:06 am UTC

Re: xkcd download script

Postby the mishanator » Wed Aug 04, 2010 1:44 am UTC

I'm not sure I understand you. It seems to me that it totally gets the title text.

EDIT: look at this
Code: Select all
[mikhail@mvs-desktop ~]$ cd /home/mikhail/Desktop/xkcd_comics && ls | head -n 5
(100)-family_circus.jpg
(101)-laser_scope.jpg
(102)-back_to_the_future.jpg
(103)-moral_relativity.jpg
(104)-find_you.jpg
the mishanator
 
Posts: 209
Joined: Mon May 31, 2010 6:49 pm UTC

Re: xkcd download script

Postby ++$_ » Wed Aug 04, 2010 2:25 am UTC

By "title text" I mean the text that pops up when you mouse over the image of the comic. For example, on #100, it is "This was my friend David's idea".
++$_
Mo' Money
 
Posts: 2370
Joined: Thu Nov 01, 2007 4:06 am UTC

Re: xkcd download script

Postby hotaru » Wed Aug 04, 2010 2:32 am UTC

is there some reason you're using the html instead of the json interface?
Code: Select all
#!/usr/bin/perl

use JSON;
use LWP::Simple;
use DBI;

my $end = from_json get 'http://xkcd.com/info.0.json';
die 'couldn\'t get json for latest comic.' unless $end;
my $dbh = DBI->connect('dbi:SQLite:dbname=xkcd.db', '', '');
mkdir 'comics';
$dbh->do('CREATE TABLE IF NOT EXISTS comics (link TEXT, alt TEXT, num INTEGER '.
  'PRIMARY KEY, month INTEGER, transcript TEXT, safe_title TEXT, img TEXT, day'.
  ' INTEGER, title TEXT, news TEXT, year INTEGER);') or die 'database error.';
my $start = $dbh->selectrow_hashref('SELECT * FROM comics ORDER BY num DESC LI'.
  'MIT 1;');
  my $sth = $dbh->prepare('INSERT INTO comics VALUES(?,?,?,?,?,?,?,?,?,?,?);')
    or die 'database error.';
for $i ($start->{num} + 1..$end->{num})
{ next if $i == 404;
  print "downloading comic $i...";
  my $json = from_json get "http://xkcd.com/$i/info.0.json";
  die "couldn't get json for comic $i." unless $json;
  my $img = $json->{img};
  $json->{img} =~ s!^.*/([^/]+)$!$1!;
  $json->{img} =~ s/\.+/./g;
  getstore $img, "comics/$json->{img}";
  $sth->execute(
    map { $json->{$_} }
        qw(link alt num month transcript safe_title img day title news year))
    or die 'database error.';
  print "done.\n"; }


edit: is there some way to make the forum not break code like that?
edit 2: thanks, phlip.
Last edited by hotaru on Wed Aug 04, 2010 3:46 am UTC, edited 1 time in total.
hotaru
 
Posts: 949
Joined: Fri Apr 13, 2007 6:54 pm UTC

Re: xkcd download script

Postby phlip » Wed Aug 04, 2010 2:33 am UTC

the mishanator wrote:how can i fix this/make the code better?

Use the JSON.

[edit] Gah, ninjaed yet again.
hotaru wrote:is there some way to make the forum not break code like that?
It's the double dollar signs, they trigger jsMath. Try using the $hashref->{$key} syntax instead of $$hashref{$key}.

Also, any particular reason you're re-preparing the insert on every loop?

Also also, davean hasn't promised he won't change the JSON format in the future, just that he'll give warning and bump up the version number if properties get removed or changed... properties can (and have) been added without warning (transcript being the most recent example). So assuming that "values %$json" will have the right things in the right order is dangerous. Of course, it's still less dangerous than reading the HTML pages, which could be completely redesigned at any point, but still.
While no one overhear you quickly tell me not cow cow.
but how about watch phone?
phlip
Restorer of Worlds
 
Posts: 7179
Joined: Sat Sep 23, 2006 3:56 am UTC
Location: Australia

Re: xkcd download script

Postby the mishanator » Wed Aug 04, 2010 3:38 am UTC

Oh, the mouseovers, lol. I did not realize you meant that.
I don't suppose it would be too hard to get those into a file:
Code: Select all
mouseover=$(cat $PWD/*.html | grep -i "title text" | sed -e 's/{{[Tt]itle text: //g' -e 's|}}</div>||g' -e "s/&#39;/'/g")
and have it write to some file, say mouseovers.txt
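Shell quoting around the &#39; entity is fiddly (a backslash-escaped apostrophe inside single quotes won't survive), so here is the same extraction exercised against a made-up transcript line, with the entity handled in a double-quoted sed expression:

```shell
# Hypothetical sample of a transcript line; the real markup may differ.
line='{{Title text: This was my friend David&#39;s idea}}</div>'
mouseover=$(printf '%s\n' "$line" \
  | sed -e 's/{{[Tt]itle text: //' -e 's|}}</div>||' -e "s/&#39;/'/g")
echo "$mouseover" >> mouseovers.txt   # append each comic's line to one file
cat mouseovers.txt
```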
the mishanator
 
Posts: 209
Joined: Mon May 31, 2010 6:49 pm UTC

Re: xkcd download script

Postby PM 2Ring » Wed Aug 04, 2010 8:55 am UTC

Please use the JSON, it makes things a lot simpler. Eg,

n=123; wget -O - http://xkcd.com/$n/info.0.json | sed 's/", "/"\n"/g; s/\\n/\n/g'; echo

If you let n='', you get the data for the current comic.
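Since the info.0.json payload is a single line of JSON, even a field extraction can be sketched offline. The sample blob below imitates the response shape (field order and contents are assumptions here, and a real JSON parser is safer, since fields can be added without warning):

```shell
# Hypothetical, abridged sample of an info.0.json response.
json='{"num": 100, "img": "http://imgs.xkcd.com/comics/family_circus.jpg", "title": "Family Circus", "alt": "This was my friend David'\''s idea"}'
img=$(printf '%s\n' "$json" | sed 's/.*"img": "\([^"]*\)".*/\1/')
name=${img##*/}   # basename without spawning another process
echo "$img"
echo "$name"
```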
PM 2Ring
 
Posts: 3277
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Mid north coast, NSW, Australia

Re: xkcd download script

Postby samwho? » Tue Aug 10, 2010 9:34 pm UTC

Here's my far less glamorous PHP approach ^_^

Code: Select all
<?php
/*
 * xkcd comic downloader script.
 *
 * Written on August 10th, 2010 by Sam Rose, http://lbak.co.uk.
 * Functions I got from elsewhere have been properly cited and linked to.
 *
 * Tested on PHP 5.3 using both the cURL and allow_url_fopen methods. Seems to
 * work using both, but it is very possible that there are still bugs.
 * Be careful! ;)
 *
 * The script accepts the following GET variables:
 *
 * start - The comic you want to start downloading from. Numerical values only.
 *         Defaults to 1.
 *
 * end - The comic you want to finish downloading at. Numerical values only.
 *       Accepts -1 as a value if you want to download until you get to the end.
 *       Defaults to -1.
 *
 * directory - The directory you want to save the images to. I recommend using
 *             a relative filepath. Defaults to the current directory.
 *
 * saveas - Can be set to "number", "name", or "both". Determines the naming
 *          convention used to save the files. Number saves it as the comic
 *          number, name saves it as the comic name and both saves it as both,
 *          number first.
 *
 *
 * Enjoy! :)
*/

//Set the script to use as much time as it wants to.
set_time_limit(0);

//Identifying the directory you want to save to and creating it if it doesn't
//exist. Uses current directory if the get variable isn't set.
if (isset($_GET['directory'])) {
    //Add a forward slash to the end if there isn't one already.
    if (substr($_GET['directory'],strlen($_GET['directory'])-1,1) != "/") {
        $_GET['directory'] .= "/";
    }

    //Checks if the directory currently exists.
    if (is_dir($_GET['directory'])) {
        define("SAVE_PATH",$_GET['directory']);
    }
    else {
        //Make the directory if it doesn't exist.
        if (!mkdir($_GET['directory'])) {
            //Fail safely if the directory can't be created.
            echo "There was a problem while trying to create the directory '".
                    $_GET['directory']."'<br />";
            exit;
        }
        else {
            define("SAVE_PATH",$_GET['directory']);
        }
    }
}
else {
    define("SAVE_PATH","");
}

if (get_curl() == false && get_allow_url_fopen() == false) {
    echo "Sorry, you need either cURL to be installed or allow_url_fopen to be
        set to true in php.ini to run this script.";
    exit;
}

//read $_GET variables
if (isset($_GET['start'])) {
    if ($_GET['start'] > 0) {
        $start = intval($_GET['start']);
    }
    else {
        $start = 1;
    }
}
else {
    $start = 1;
}
if (isset($_GET['end'])) {
    $end = intval($_GET['end']);
}
else {
    //Set end to -1 to specify the script to continue until comics run out.
    $end = -1;
}
if (isset($_GET['saveas'])) {
    if ($_GET['saveas'] == "name") {
        $save_as = "name";
    }
    else if ($_GET['saveas'] == "number") {
        $save_as = "number";
    }
    else {
        $save_as = "both";
    }
}
else {
    $save_as = "both";
}

$comic = $start;
while (true) {

    //skip the 404 joke comic
    if ($comic == 404) {
        $comic++;
    }

    //get the image tag of the comic number supplied
    if (($comic_info = getComicInfo($comic)) == false) {
        //break the loop if the comic doesn't exist
        break;
    }
    //Check if we've hit the specified end
    if ($end > 0 && $comic > $end) {
        break;
    }
   
    //Image url
    $image_url = $comic_info['img'];

    //Hover text
    $title_text = $comic_info['alt'];

    //Comic name
    $alt_text = $comic_info['title'];

    //Declare where we're going to save the comic and the name of it.
    if ($save_as == "name") {
        $img_file_name = SAVE_PATH . $alt_text . "." .
                getFiletype($image_url);
    }
    else if ($save_as == "number") {
        $img_file_name = SAVE_PATH . $comic . "." . getFiletype($image_url);
    }
    else {
        $img_file_name = SAVE_PATH . $comic . " - " . $alt_text ."." .
                getFiletype($image_url);
    }

    //Save the comic image.
    if (file_put_contents($img_file_name, get_web_page($image_url))) {
        //Just some confirmation text.
        echo "'" .$image_url . "' saved as '" . getcwd() . "/" .
                $img_file_name . "'<br />";
    }
    else {
        echo "Could not save '" . $image_url . "' as '" . getcwd() .
                "/" . $img_file_name . "'<br />";
    }

    //Save the hover over text of the comic
    if ($save_as == "name") {
        $alt_file_name = SAVE_PATH . $alt_text . ".txt";
    }
    else if ($save_as == "number") {
        $alt_file_name = SAVE_PATH . $comic . ".txt";
    }
    else {
        $alt_file_name = SAVE_PATH . $comic . " - " . $alt_text .".txt";
    }

    if (($alt_file = fopen($alt_file_name, 'wb')) != false) {
        fwrite($alt_file, $title_text);
        echo "Comic " . intval($comic) . " hover text saved as " .
                getcwd() . "/" . $alt_file_name . "<br /><br />";
        fclose($alt_file);
    }
    else {
        echo "Could not save hover text for comic " . intval($comic) .
                ".<br /><br />";
    }

    //Advance to the next comic
    $comic++;
}

function getComicInfo($comic) {
    $comic = intval($comic);
    if (($page = get_web_page("http://xkcd.com/$comic/info.0.json")) != false) {
        return json_decode($page, true);
    }
    else {
        return false;
    }
}

function get_curl() {
    if  (in_array('curl', get_loaded_extensions())) {
        return true;
    }
    else {
        return false;
    }
}
function get_allow_url_fopen() {
    $allow_url_fopen = ini_get("allow_url_fopen");
    if ($allow_url_fopen != "" && $allow_url_fopen != null) {
        return true;
    }
    else {
        return false;
    }
}
function getFiletype($file) {
    return end(explode(".",$file));
}
/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
 * array containing the HTTP server response header fields and content.
 *
 * Courtesy of:
 * http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_using_curl
 */
function get_web_page( $url ) {
    if (get_curl()) {
        $options = array(
                CURLOPT_RETURNTRANSFER => true,     // return web page
                CURLOPT_HEADER         => false,    // don't return headers
                CURLOPT_FOLLOWLOCATION => true,     // follow redirects
                CURLOPT_ENCODING       => "",       // handle all encodings
                CURLOPT_USERAGENT      => "spider", // who am i
                CURLOPT_AUTOREFERER    => true,     // set referer on redirect
                CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
                CURLOPT_TIMEOUT        => 120,      // timeout on response
                CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        if ($err != 0) {
            return false;
        }

        return $content;
    }
    else if (get_allow_url_fopen()) {
        return file_get_contents($url);
    }
    else {
        return false;
    }
}

//Flush the output buffer, required because of the set_time_limit call.
flush();
?>
samwho?
 
Posts: 9
Joined: Sat Aug 29, 2009 5:29 pm UTC

Re: xkcd download script

Postby resshin » Wed Aug 11, 2010 7:56 am UTC

Here's mine.
It's in Perl, using the JSON interface as input, and it outputs the images and alt text into HTML files.
You can define CSS to decorate the pages; just put it in the result directory with the name "style.css".
Tested on Windows.

JSON module downloaded from, err, I'm not sure if I can post links already...
Just google for "perl json" and go to the first result. :mrgreen:

Usage:
<script_name> $start $end $dirname
where:
$start = save comic from number ... (default 1)
$end = save comic until number ... (default -1)
$dirname = directory to store results, relative to working directory (default "result")

Here's the script:
Code: Select all
use JSON;
use File::Basename;
use LWP::Simple;

my($start, $end, $dirname) = @ARGV;
$start ||= 1;
$end ||= -1;
$dirname ||= "result";

mkdir $dirname;
mkdir $dirname . "/img";
$json = JSON->new->allow_nonref;

for($i=$start; $i<=$end || $end < 0; $i++) {
   #read json
   print "Reading comic: " . $i . "\n";
   $txt = get("http://www.xkcd.com/" . $i . "/info.0.json") or ($i == 404 ? next : last);
   $comic = from_json($txt);
   
   #save img
   print "Saving image...\n";
   $imgname = basename($comic->{"img"});
   getstore($comic->{"img"}, $dirname . "/img/" . $imgname);
   
   #save html
   print "Saving html...\n";
   $current = sprintf "%03d.html", $i;
   $prev = sprintf "%03d.html", $i==405 ? 403 : $i-1;
   $next= sprintf "%03d.html", $i==403 ? 405 : $i+1;
   open FILE, ">" ,  $dirname . "/" . $current;
   printf FILE "
   <html>
      <head>
         <title>%s</title>
         <link rel='stylesheet' type='text/css' href='style.css'/>
         <meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>
      </head>
      <body>
         <div class='panel'>
            <h1 id='title'>%s</h1>
            <div class='nav'>
               <a id='prev' href='%s'>Previous</a>
               <a id='next' href='%s'>Next</a>
            </div>
            <img src='img/%s'/>
            <div class='alt'>%s</div>
         </div>
      </body>
   </html>",
   $comic->{"title"}, $comic->{"title"}, $prev, $next, $imgname, $comic->{"alt"};
   close FILE;
   
   print "\n";
}


Here's my CSS: (save as "style.css")
Code: Select all
html, body {
   font: normal 12px Tahoma;
   background: #96A8C8;
}

.panel {
   width: 800px;
   margin: 0 auto;
   background: #fff;
   border: 1px solid #000;
}

h1 {
   font-size: 1.2em;
   text-align: center;
}

.nav {
   margin: 1em 0;
   text-align: center;
}

a {
   background: #6E7B91;
   border: 1px solid #000;
   font-variant: small-caps;
   color: #fff;
   text-decoration: none;
   padding: 0.5em;
}

img {
   display: block;
   margin: 0 auto;
}

.alt {
   margin: 1em 0;
   text-align: center;
}
resshin
 
Posts: 5
Joined: Sun Nov 22, 2009 12:44 pm UTC

Re: xkcd download script

Postby karlandtanya » Sat Jun 23, 2012 2:33 pm UTC

I like to read them during the day, then every couple weeks grab them for safekeeping...

Code: Select all
#!/bin/bash
#All the good parts stolen from smart people in xkcd forums and elsewhere.
die () {
    echo >&2 "$@"
    exit 1
}

[ "$#" -eq 2 ] || die "2 arguments (from , to) required, $# provided"
echo $1 | grep -E -q '^[0-9]+$' || die "Numeric argument required, $1 provided"
echo $2 | grep -E -q '^[0-9]+$' || die "Numeric argument required, $2 provided"

for i in $(eval echo {$1..$2}) #get from $1 to $2
do
   if [ "$i" = '404' ]; then i=`expr $i + 1`; fi
   echo "getting xkcd $i"   
   wget -qO- xkcd.com/$i | sed 's:>:>\x0a:g' > temp #put the newlines back in... a hack but it worked.  Would it be more correct to just tell grep to use > as the newline?
   set -f #disable globbing
   string=`grep '^<img src="http://imgs.xkcd.com/comics/' temp`
   title=`echo $string | sed -e 's/.* alt="//'   -e 's/".*//'`
   imgurl=`echo $string | sed -e 's/.* src="//'   -e 's/".*//'`
   if grep xkcd.com/$i/large temp; then imgurl=${imgurl%.*}_large.${imgurl##*\.}; fi #this is a hack; it would be better to follow links in the main image...
   imgout=`printf "%03d" $i`.`echo $imgurl | sed "s/.*\\///"`
   imgout=`echo $imgout | sed -e "s/_cropped_(1)//" -e "s/_noline_(1)//"`
   txtout=${imgout%.*}.txt
   i=`expr $i + 1`
   wget -qO- $imgurl > $imgout
   echo $string | sed -e 's/.* title="//' -e 's/".*//' -e "s/\&#39\;/\'/g" -e "s/\&quot\;/\"/g"  > $txtout  #cheesy patches; there's gotta be a nice html parser to do this generically!
done
rm temp
echo -e "\n\nMissed these:"
find . -type f -name "*\." -exec echo {} \; -exec rm {}txt \; -exec rm {} \;
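The parameter expansions in the `_large` handling above are worth unpacking; a standalone sketch (the URL is made up):

```shell
imgurl='http://imgs.xkcd.com/comics/blarg.png'
ext=${imgurl##*.}       # strip longest prefix through the last dot: "png"
stem=${imgurl%.*}       # strip shortest suffix from the last dot: ".../blarg"
large="${stem}_large.${ext}"
base=${imgurl##*/}      # same effect as: echo $imgurl | sed "s/.*\///"
echo "$large"
echo "$base"
```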
karlandtanya
 
Posts: 3
Joined: Sat Feb 13, 2010 1:33 am UTC

Re: xkcd download script

Postby hotaru » Sat Jun 23, 2012 6:03 pm UTC

karlandtanya wrote:
Code: Select all
there's gotta be a nice html parser to do this generically!

PM 2Ring wrote:Please use the JSON, it makes things a lot simpler. Eg,

n=123; wget -O - http://xkcd.com/$n/info.0.json | sed 's/", "/"\n"/g; s/\\n/\n/g'; echo

If you let n='', you get the data for the current comic.
hotaru
 
Posts: 949
Joined: Fri Apr 13, 2007 6:54 pm UTC

Re: xkcd download script

Postby cactusfrenzy » Thu May 22, 2014 5:03 am UTC

A shot in the dark here...but it doesn't hurt to ask.

Well, I have a Java program used to parse and download images from the internet, and I was trying to make it work with the XKCD images.
My biggest issue here is that I wanted to change the downloaded image, making it bigger at the bottom so it could fit the alt/title text in there.

Since I use a CBZ/CBR Android comic reader, I didn't want to use the approach of creating offline HTML files with the title text as mouseover text.

I tried BufferedImage and Graphics2D.drawString, but it didn't work nicely; I'd need to do a lot of string splitting so the lines can fit in the image.

Anyway, I don't need to use Java at all for that, I don't mind learning/using another language.
Does anybody know of a simpler way to achieve this?
cactusfrenzy
 
Posts: 1
Joined: Thu May 22, 2014 4:53 am UTC

Re: xkcd download script

Postby Jplus » Fri May 23, 2014 7:05 am UTC

Hey, I never saw this thread before... how funny. Our fellow forums member scarecrovv wrote a nice xkcd comic fetcher/searcher/data extractor that ships with the red spider project. The programs are called xkcd-fetch, xkcd-search and json-parse, respectively.

@cactusfrenzy: depending on whether and how you are storing the comics, HTML+JavaScript might actually be perfect for your purpose. Care to elaborate on what you are doing and what you want exactly?
Feel free to call me Julian. J+ is just an abbreviation.
Image coding and xkcd combined
Jplus
 
Posts: 1554
Joined: Wed Apr 21, 2010 12:29 pm UTC
Location: classified

