Simple perl script to download all comics


Gordo4444
Posts: 8
Joined: Wed Oct 21, 2009 11:01 pm UTC

Simple perl script to download all comics

Postby Gordo4444 » Wed Oct 21, 2009 11:16 pm UTC

Hey everyone, I've been a fan of the xkcd comics for a while now, and I just realized there was a forum here, so I checked the coding board for a script or program that would automatically download all of the comics and save them to a folder on my desktop. Needless to say I couldn't find one, so I made one! This script uses WWW::Mechanize and WWW::Mechanize::Image, so you will need those in your Perl library to be able to run it. Keep in mind I'm not a professional programmer, just a teenage kid stuck in his house because he has mono, so don't hate me for my amateur coding style...

Code: Select all

#!/usr/bin/perl -w

use WWW::Mechanize;
use WWW::Mechanize::Image;
use strict;
system("CLS");


print "Initilizing Script...\n";
my $mech = WWW::Mechanize->new();


foreach my $NUM (1..403) {
    my $U = "http://xkcd.com/$NUM/";
    print "Loading image #$NUM\n";
    $mech->get($U);
    my $image = $mech->find_image( url_regex => qr/comics/i );
    $mech->get($image->url());
    print "Saving Image...\n";
    $mech->save_content("C:/Documents and Settings/Owner/Desktop/xkcd/$NUM.png");
}

print "Skipping 404 because of a sick joke...\n";
### Try going to http://xkcd.com/404/; it was a funny joke until it broke my code, so I just had to make a quick change ###

foreach my $NUM (405..652) {
    my $U = "http://xkcd.com/$NUM/";
    print "Loading image #$NUM\n";
    $mech->get($U);
    my $image = $mech->find_image( url_regex => qr/comics/i );
    $mech->get($image->url());
    print "Saving Image...\n";
    $mech->save_content("C:/Documents and Settings/Owner/Desktop/xkcd/$NUM.png");
    ### You can change the path ^ to whatever you want, just make sure they are forward slashes and not backslashes ###
}
print "All Images Downloaded!\n";
print "Have a nice day! =)\n";


Remember to change the place where you want the images saved in the code TWICE!! I had to split the code into two loops because of the 404 joke, so just be sure to change both paths, unless you want all the images saved on your desktop in a folder labeled xkcd.
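For what it's worth, the whole thing can also be written as a single loop that just skips #404, so the save path only has to appear once. This is only a rough sketch using the same WWW::Mechanize calls and the same example path as the script above:

Code: Select all

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

foreach my $NUM (1..652) {
    next if $NUM == 404;    # xkcd.com/404/ is a deliberate 404 page, so skip it
    print "Loading image #$NUM\n";
    $mech->get("http://xkcd.com/$NUM/");
    my $image = $mech->find_image( url_regex => qr/comics/i );
    $mech->get($image->url());
    print "Saving Image...\n";
    $mech->save_content("C:/Documents and Settings/Owner/Desktop/xkcd/$NUM.png");
}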

Oh, another thing: the images themselves are in .png format, but I found it easier to just save them as .jpg; they will still open in Windows Picture Viewer. But hey, that's just my opinion. I hope you guys enjoy this code as much as I enjoyed making it.

EDIT: At the time of writing/posting this code there were 652 comics, which took up 37.1 MB of space. I'm about to start on a script that will check for updates and automatically download them for you.
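A rough outline of what such an update script could look like (an untested sketch: the last.txt state file is just an invented example, and the regex leans on the front page's "Permanent link to this comic: http://xkcd.com/NNN/" line to find the newest number):

Code: Select all

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $dir   = "C:/Documents and Settings/Owner/Desktop/xkcd";
my $state = "$dir/last.txt";    # holds the number of the last comic we saved

# Figure out where we left off (0 means start from the beginning)
my $last = 0;
if (open my $IN, '<', $state) {
    $last = <$IN>;
    chomp $last;
    close $IN;
}

# Ask the front page what the newest comic number is
my $mech = WWW::Mechanize->new();
$mech->get("http://xkcd.com/");
my ($current) = $mech->content =~ m{http://xkcd\.com/(\d+)/};
die "Could not find the current comic number\n" unless $current;

# Download anything newer than the last run
foreach my $NUM ($last + 1 .. $current) {
    next if $NUM == 404;    # the 404 joke again
    print "Loading image #$NUM\n";
    $mech->get("http://xkcd.com/$NUM/");
    my $image = $mech->find_image( url_regex => qr/comics/i );
    $mech->get($image->url());
    $mech->save_content("$dir/$NUM.png");
}

# Remember how far we got for next time
open my $OUT, '>', $state or die "Cannot write $state: $!";
print $OUT $current;
close $OUT;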


Peace!
~Gordo~

Gordo4444
Posts: 8
Joined: Wed Oct 21, 2009 11:01 pm UTC

Re: Simple perl script to download all comics

Postby Gordo4444 » Fri Oct 23, 2009 7:03 pm UTC

Thanks! Like I said, I'm far from even an amateur programming level. I was actually in the middle of posting this topic when I saw that the 404 page broke my script, and I did a quick fix instead of the most logical solution. I actually never thought of using the difference idea, thanks for the tip!

~Gordo~

Gordo4444
Posts: 8
Joined: Wed Oct 21, 2009 11:01 pm UTC

Re: Simple perl script to download all comics

Postby Gordo4444 » Fri Oct 30, 2009 2:43 am UTC

Hey guys, I made another script that automates a download process. This one downloads another webcomic called Cyanide and Happiness. I'm a big fan of both them and xkcd, and after writing my Perl script to download all of the xkcd comics I decided to make one to download all of the C&H comics as well.

So here's the code...

Code: Select all

#!/usr/bin/perl -w

use WWW::Mechanize;
use WWW::Mechanize::Image;
use strict;
system("CLS");

print "Initilizing Script...\n";
my $mech = WWW::Mechanize->new();




foreach my $NUM (15..1841) {
    print "Loading image #$NUM\n";
    my $U = "http://www.explosm.net/comics/$NUM/";
    $mech->get($U);
    my $image = $mech->find_image( alt_regex => qr/Cyanide and Happiness, a daily webcomic/i );

    if (defined $image) {
        $mech->get($image->url());
        print "Saving Image...\n";
        $mech->save_content("C:/Documents and Settings/Owner/Desktop/C&H/$NUM.jpg");
    }
    else {
        print "Skipping #$NUM because it doesn't exist...\n";
    }
}
print "All Images Downloaded!\n";


I had a problem with some of their comics. It seems they've deleted 1-14, then 14-39, along with a bunch of others, so the script kept trying to pass an undefined variable and kept breaking. So I just set up a defined() check to test whether the comic exists, and if not it tells you it is skipping it. Well, I hope you guys enjoy it. Like I said in my original post, I'm no master coder, just a bored teenager...
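A related gotcha, depending on the WWW::Mechanize version: with autocheck enabled (the default in newer releases), get() will die outright on an HTTP error before find_image() ever runs. Wrapping the fetch in an eval (or passing autocheck => 0 to the constructor) keeps the loop alive in that case too. A sketch of the eval variant of the same loop:

Code: Select all

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

foreach my $NUM (15..1841) {
    print "Loading image #$NUM\n";

    # get() dies on an HTTP error when autocheck is on, so trap that
    my $ok = eval { $mech->get("http://www.explosm.net/comics/$NUM/"); 1 };
    unless ($ok) {
        print "Skipping #$NUM because the page would not load...\n";
        next;
    }

    my $image = $mech->find_image( alt_regex => qr/Cyanide and Happiness, a daily webcomic/i );
    if (defined $image) {
        $mech->get($image->url());
        print "Saving Image...\n";
        $mech->save_content("C:/Documents and Settings/Owner/Desktop/C&H/$NUM.jpg");
    }
    else {
        print "Skipping #$NUM because it doesn't exist...\n";
    }
}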

Peace!
~Gordo~

CortoPasta
Posts: 38
Joined: Fri Aug 15, 2008 5:51 pm UTC

Re: Simple perl script to download all comics

Postby CortoPasta » Fri Oct 30, 2009 5:44 pm UTC

Sick! It's cool seeing someone throw stuff out there for everyone to use

Gordo4444
Posts: 8
Joined: Wed Oct 21, 2009 11:01 pm UTC

Re: Simple perl script to download all comics

Postby Gordo4444 » Fri Oct 30, 2009 6:28 pm UTC

No problem! I love perl and making useful scripts. Just wanted to share them with other people...

roboman
Posts: 12
Joined: Wed Jul 16, 2008 12:12 am UTC

Re: Simple perl script to download all comics

Postby roboman » Fri Oct 30, 2009 8:37 pm UTC

I wrote this a while back; admittedly it is a bit verbose, but it does have a couple of tricks that maybe you can incorporate.

- It goes backwards, so you do not have to worry about checking for updates.
- It also grabs the alt text, and puts each image and its text in their own directory.
- It uses a default directory ($ENV{HOME}/Desktop), which can be changed.

Code: Select all

#!/usr/bin/perl

use LWP::Simple;
#  use Smart::Comments;

## Objectives ##

#  Download all comics from xkcd.com
#  Ability to download new comics
#  Download ALT text
#  Saved in: ~/Desktop

# Set Specifics
$sitePrefix = "http://xkcd.com/";

## Path to main XKCD directory ##
$path = "$ENV{HOME}/Desktop";


mkdir "$path/XKCD", 0755 or print "XKCD Directory Exists\n";
chomp($path = "$path/XKCD");

$d = get($sitePrefix);
if ($d =~ /http:\/\/xkcd.com\/(\d+)\//) {
    $current = $1;
}

# Obtains all individual comic data
sub getComicData {
    my $siteData = get("$sitePrefix$current/");
    my @data = split /\n/, $siteData;
    foreach (@data) {
        if (/http:\/\/xkcd.com\/(\d+)\//) {
            $current = $1;
        }
        if (/src="(http:\/\/imgs.xkcd.com\/comics\/.+\.\w{3})"/) {
            $currentUrl = $1;
            if (/alt="(.+?)"/) {
                $title = $1;
            $title = "House of Pancakes" if $current == 472;  # Color title on comic 472 with weird syntax
            }
            if (/title="(.+?)"/) {    #title commonly know as 'alt' text
                $alt = $1;
            }
        }
    }
}

chdir "$path" or die "Cannot change directory: $!";
&getComicData();
while ( get("$sitePrefix$current/")){ ### Writing Files $current: $title
    print "Writting Files $current: $title\n";
    # Create directories for individual comics
    mkdir "$current $title", 0755 or die "Previously Downloaded";
    chdir "$path/$current $title" or die "Cannot change directory: $!";

    # Save image file
    $image = get($currentUrl);
    open my $IMAGE, '>>', "$title.png"
        or die "Cannot create file!";
    print $IMAGE $image;
    close $IMAGE;

    # Save alt text
    open my $TXT, '>>', "$title ALT.txt"
        or die "Cannot create file!";
    print $TXT $alt;
    close $TXT;
    chdir "$path" or die "Cannot change directory: $!";
    $current--;

    # Check for non existent 404 comic
    $current-- if $current == 404;

    &getComicData();
}


# End Gracefully
print "Download Complete\n"


Hopefully you can take something from that.

isaaclw
Posts: 2
Joined: Fri Jan 08, 2010 7:15 am UTC

Re: Simple perl script to download all comics

Postby isaaclw » Fri Jan 08, 2010 7:32 am UTC

I wrote my own (and then thought I'd see who else had written a script). It's not just for xkcd, though; it handles basically any comic.
(I have a config file populated with about 20 different ones.)

roboman wrote:- It goes backwards, so you do not have to worry about checking for updates.

I didn't think about that...
I guess if you followed some kind of RSS feed, you could pull out the number and decrement that. Starting at the beginning is easier though, and also lets you read them in order...

roboman wrote:- It also grabs the alt text, and puts each image and its text in their own directory.

You should check out ImageMagick.
These are the lines of code I use to dump the alt text to an image file:

Code: Select all

# Ask ImageMagick for the comic's dimensions, pull out the width,
# then render the title text into an image of that same width.
my $pic_details = `identify $store_path.$ext`;
$pic_details =~ /$store_path.$ext [A-Z]+ (\d+?x)\d+? /;
system("convert -pointsize 12 -size '$1' caption:\"$text\" $store_path.title.$ext");


That way I can "zip" everything together and make "comic books" (cbz files have to be only pictures...)
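Since a .cbz really is just a zip archive containing nothing but images, packing a folder of downloaded comics into one can be done from Perl as well. A small sketch using Archive::Zip, where the directory and output names are only examples:

Code: Select all

#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

my $dir = "xkcd";    # folder of downloaded images (example name)
my $zip = Archive::Zip->new();

# Add every image in the folder, then write the archive with a .cbz extension
my @images = glob("$dir/*.png");
foreach my $file (sort @images) {
    $zip->addFile($file);
}
$zip->writeToFileNamed("xkcd.cbz") == AZ_OK
    or die "Could not write xkcd.cbz\n";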

I'll upload my comic downloader once I'm more satisfied with its completion.

isaaclw
Posts: 2
Joined: Fri Jan 08, 2010 7:15 am UTC

Re: Simple perl script to download all comics

Postby isaaclw » Wed Jan 13, 2010 12:26 pm UTC

Well, I thought I'd release this here first.
comic downloader: http://www.isaaclw.com/cmcdwnlr2/

Readme:

Code: Select all

'cmcdwnlr' is a tool that parses through comic sites and
downloads new comics.
It is released under the GNU GPL:
    http://www.gnu.org/licenses/gpl-3.0.txt

==Modes==
It has two modes:
- Counting mode, where the site url or image contains an
    incrementing number
- Parsing mode, where the site has a "next" or "back"
    button.
Counting mode is the default unless the "next" value is set
in the conf file.

In Counting mode, the files are grabbed in two ways:
- Direct download.
    If the image's own url increments, then the direct
    link to the image can be listed by itself in the
    conf file.
    This means that the "title" text (or mouseover text)
    cannot be gathered.
- Parsed download.
    If the url of the page where the image is stored is
    in incrementing format, then a regular expression
    'pic' can be set to catch the image in the page.
    With this option, you can also set "get_text" in
    order to capture the title text / mouseover text of
    the image.

In Parsing mode, the files are grabbed by starting at the
Url specified and crawling through the site by finding a
link. This means you need to set two regular expressions:
- The regular expression for the picture
- The regular expression that allows the parser to find the
    next link.

==Conf file==
cmcdwnlr looks for a conf file with the same name as the
program, except with a ".conf" at the end.
If the program is saved at "/usr/path/comic.downloader",
the conf file should be at "/usr/path/comic.downloader.conf".
The file is set up in the following format:

COMICNAME:
    mode1=""
    mode2=""

The modes are predesigned, but the names are not. You can
choose any name for your comic. This name is the name of
the folder the comic will be saved in.
There are currently 5 Modes:
- url: the start url, or the "expression" for the
    images/pages
- pic: the regular expression for the image. When used,
    the 'url' will be assumed to be an "expression" for
    the page the image is located at.
- next: the regular expression for the next page. When
    used, the 'url' will be assumed to be the start page.
    'pic' must also be assigned.
- get_text: if set to true (or almost anything else), then
    the title text (or mouseover text) of the downloaded
    image will also be gathered, and saved to
    $dir/$comic.title.$ext
    This option needs to have 'pic' set in order to
    function.
- double_urls: say a site has 15 comics, and you visit the
    16th. Some sites give you a blank page (with no image
    to parse out); others give you the previous comic
    again. In the latter case we don't want to download
    the comic twice, so set this option to true (or almost
    anything) to detect that.

=='url' expressions==
In the case where you are using the "counting" downloader,
the url will contain a number. In order to allow cmcdwnlr to
insert the number, use {} to show the location. The contents
of the brackets will change depending on the "padding":

1-9999  : {0} effectively turning off padding:
ex:         http://comicwopadding.com/{0}.png
will catch: http://comicwopadding.com/1.png
and:        http://comicwopadding.com/99.png
but not:    http://comicwopadding.com/09.png

001-999 : {3} pad the numbers to three:
ex:         http://comicwpadding.com/{3}.png
will catch: http://comicwpadding.com/001.png
and:        http://comicwpadding.com/999.png
but not:    http://comicwpadding.com/0001.png
or:         http://comicwpadding.com/1.png

=='pic'/'next' expressions==
It's a wise idea to look up regular expressions. Since it's
easy to mess up on the expression, testing is a good idea.
TODO: insert a regular expression tester.
The actual link, or picture, should be enclosed in the first
group of parens. Parentheses can be used to enclose other
portions of text, but the first enclosed group of text will
be assumed to be the link/picture to act upon.

==Acting on URLS==
Since many sites embed relative links, cmcdwnlr will act
upon the parsed url/picture as if it were in a browser.
ie:
if the path is of the form:
    /path/to/file
then it will be counted as a domain link
if the path is of the form:
    path/to/file
then it will be counted as a relative link from the
current page.
if the path is of the form:
    http://comic.com/path/to/file
then it will be counted as a new link entirely.


==End of downloading==
Cases where downloading ends:
1) If the header of 'url' returns a 404, exit. In the case
    it doesn't, it's assumed that there's a picture inside,
    or it is the picture.
2) If after parsing with 'pic' or 'next' no path is found,
    exit.
3) If double_urls is set, and the links on both page 35 and
    page 36 match, then exit before downloading 36.

Bug Reports:
send to isaaclw@gmail.com
Bug reports without the 'conf' file will be ignored.
Please include any and all cmcdwnlr output when the error
occurs.
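To make the conf format above concrete, an entry for a counting-mode comic with a parsed download and title text might look something like the following. The field values and the regular expression are guesses for illustration only and have not been tested against cmcdwnlr:

Code: Select all

XKCD:
    url="http://xkcd.com/{0}/"
    pic="(http://imgs\.xkcd\.com/comics/\S+?\.(?:png|jpg))"
    get_text="true"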

vladi
Posts: 1
Joined: Mon Mar 08, 2010 8:56 pm UTC

Re: Simple perl script to download all comics

Postby vladi » Mon Mar 08, 2010 8:59 pm UTC

Oooooor, you could try out Woofy http://code.google.com/p/woofy/. It only works on Windows, but it has a (mostly) shiny GUI.

Richard.Williams
Posts: 1
Joined: Tue Apr 20, 2010 5:03 pm UTC

Re: Simple perl script to download all comics

Postby Richard.Williams » Tue Apr 20, 2010 5:14 pm UTC

I also love comics and writing scripts, so I wrote this script to automatically download the daily comic from xkcd to my computer.


Code: Select all

# Script DailyComic.txt
var string out, server, url, localfile

isstart "dc" "dailycomics" "Mozilla/4.0"
iscon "dc" "http://xkcd.com" > $out
stex -c -r "^<h3&Image&URL&\:^]" $out > null
stex -c "[^http:^" $out > $url ; stex "^.png^[" $url > null
stex -c "]^/^3" $url > $server ; stex -c -p "^/^l[" $url > $localfile
isdiscon "dc"

echo -e "DEBUG: Getting " $url " and saving it as " $localfile " in current directory."
iscon "dc" $server > null
isget -b "dc" $url > null
issave -b "dc" $localfile
isdiscon "dc"
isend "dc"
system ("\""+$localfile+"\"")



Copy and paste the script into the file "C:/Scripts/DailyComic.txt". The script is in biterscripting ( http://www.biterscripting.com ); you can download it from there. It's free, it's good for getting stuff, and it's the scripting language closest to the UNIX shell that works on Windows, although any other scripting language will do as well.

Start biterscripting. Copy and paste this command into it.

script "C:/Scripts/DailyComic.txt"


Or, to do this daily, create a desktop icon, and assign it this command

"C:/biterscripting/biterscripting.exe" "C:/Scripts/DailyComic.txt"

Each time you double-click the desktop icon, it will download the latest comic to the computer without opening a browser, and will also show the comic on screen.
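For anyone who would rather stay in Perl like the earlier scripts in this thread, roughly the same daily-download idea looks like this (just a sketch: the save path is an example, and the final "start" line for opening the image is Windows-specific):

Code: Select all

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get("http://xkcd.com/");

# Find today's comic image on the front page and fetch it
my $image = $mech->find_image( url_regex => qr/comics/i );
$mech->get($image->url());

my $localfile = "C:/Scripts/xkcd-today.png";    # example path
$mech->save_content($localfile);

# Open the saved image in the default viewer (Windows-specific)
system(qq{start "" "$localfile"});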

