Help trawling through 170+ pages of a forum thread

"Please leave a message at the beep, we will get back to you when your support contract expires."


dudiobugtron
Posts: 1098
Joined: Mon Jul 30, 2012 9:14 am UTC
Location: The Outlier

Help trawling through 170+ pages of a forum thread

Postby dudiobugtron » Mon May 13, 2013 2:00 am UTC

So, I want to collate all of the posts in this thread:
viewtopic.php?f=14&t=6989
into one easy-to-read text document. This is for my own benefit only, but I would of course share the results with others.

In true xkcd spirit, I'd like to do this by automating the task somehow. However, with my current level of programming skill, the easiest way for me to do this would be for me to manually do 'select all - copy - paste' on each page, and then use MS Word's "Find and replace" tool to clear out the excess.

So, my question is: what program(s) or programming language(s) should I use or learn to best achieve this goal? Any other hints or tips? If I ended up with a program that could do this for any particular thread, that would be even better.

PM 2Ring
Posts: 3664
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Mid north coast, NSW, Australia

Re: Help trawling through 170+ pages of a forum thread

Postby PM 2Ring » Mon May 13, 2013 3:21 am UTC

I'd probably do something like this in Python. The forum HTML is nicely organised, so there's probably no need to use a fully-fledged HTML parser - you can extract the desired text using regular expressions.

I gather you just want the output to be a plain text file. Do you also want to extract the names of the post authors and the date of the post?

I only looked at the first page of that thread, but I noticed that some of the posts use coloured text; that sort of thing should be easy enough to clean up with regexes. Do any of the posts contain HTML links? If so, what do you want to do about them?

FWIW, here's a simple script I wrote a little while ago that prints the list of current topics in a forum. It doesn't even use the Python re module, it just uses the string find method.

Code: Select all

#! /usr/bin/env python

''' Fetch the list of current topics on an XKCD forum '''

import urllib

def PrintForumTopicList(num):
    url = 'http://forums.xkcd.com/viewforum.php?f=%d' % num
    f = urllib.urlopen(url)
    data = f.read()
    f.close()

    key = 'class="topictitle"'
    keylen = len(key)
    for line in data.splitlines():
        i = line.find(key)
        if i != -1:
            print line[i + keylen + 1 : -4]

def main():
    #Forum 11 is the Coding forum
    PrintForumTopicList(11)   

if __name__ == '__main__': 
  main()
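(For anyone reading this on a modern system: the script above is Python 2, where urllib.urlopen exists. Under Python 3 that call moved to urllib.request.urlopen and the response is bytes, not text. A rough Python 3 sketch of the same idea, with the line-scanning split into its own function - extract_titles and print_forum_topic_list are my own names, not part of the original script:)

```python
from urllib.request import urlopen

KEY = 'class="topictitle"'

def extract_titles(page_html):
    """Pull titles out of lines like <a ... class="topictitle">Title</a>."""
    titles = []
    for line in page_html.splitlines():
        i = line.find(KEY)
        if i != -1:
            # skip past the key and the closing '>', drop the trailing '</a>'
            titles.append(line[i + len(KEY) + 1 : -4])
    return titles

def print_forum_topic_list(num):
    url = 'http://forums.xkcd.com/viewforum.php?f=%d' % num
    with urlopen(url) as f:
        page_html = f.read().decode('utf-8', errors='replace')
    for title in extract_titles(page_html):
        print(title)
```

Splitting the parsing from the fetching also means the string-scanning part can be tried out on a saved HTML file without hitting the network.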


Re: Help trawling through 170+ pages of a forum thread

Postby PM 2Ring » Mon May 13, 2013 9:44 am UTC

Ok, here's a basic Python script that saves the contents of a thread (or part thereof) to a text file. Please let me know if you need help understanding what it does or how to use it. :)

I guess this thread might be more suited to the Coding forum...

Code: Select all

#! /usr/bin/env python

''' Fetch data from a page of an XKCD forum thread
and print the author, time & contents of each post

Created by PM 2Ring, 2013.5.13
'''

import sys, re, urllib

#Regex patterns for thread data extraction
pauthortime = re.compile(r'<p class="author">.*>(.*)</a></strong> &raquo; (.*) </p>')
pcontent = re.compile(r'<div class="content">(.*?)</div>', re.S)
pbr = re.compile(r'<br />')
pspan = re.compile(r'<span.*?>|</span>', re.S)

#Number of posts per page
pagesize = 40

def DoPage(url, ofile):
    print >>sys.stderr, " Fetching '%s'" % url

    f = urllib.urlopen(url)
    data = f.read()
    f.close()
   
    while True:
        #Get Author's name and post time
        a = pauthortime.search(data)
        if not a:
            break
           
        ofile.write('By %s at %s\n' % (a.group(1), a.group(2)))
        data = data[a.end():]
       
        #Get post contents
        a = pcontent.search(data)
        content = a.group(1)
        data = data[a.end():]
       
        #Strip out HTML linebreaks
        content = pbr.sub('', content)
       
        #Strip out HTML spans, which contain style & color data
        content = pspan.sub('', content)
        ofile.write('%s\n\n' % content)

def main():
    #Display a basic usage message if invoked with the '-h' option
    if len(sys.argv) > 1 and sys.argv[1] == '-h':
        print >>sys.stderr, "Usage: python %s [thread_base_URL] [output_filename] [start_page] [end_page]" % sys.argv[0]
        print >>sys.stderr, "You will probably need to put the URL string in quotes."
        sys.exit()
       
    #Get commandline arguments. Default values are for the 1st page of the "Good News, Bad News" thread
    baseurl = len(sys.argv) > 1 and sys.argv[1] or 'http://forums.xkcd.com/viewtopic.php?f=14&t=6989'
    oname = len(sys.argv) > 2 and sys.argv[2] or 'xkcdthread.txt'
    lopage = len(sys.argv) > 3 and int(sys.argv[3]) or 1
    hipage = len(sys.argv) > 4 and int(sys.argv[4]) or lopage
   
    print >>sys.stderr, "baseurl: '%s'\noname: '%s'\nlopage: %d\nhipage: %d\n" % (baseurl, oname, lopage, hipage)
   
    #Convert page numbers to message numbers
    lonum = (lopage - 1) * pagesize
    hinum = (hipage) * pagesize

    ofile = open(oname, 'wt')
    for i in xrange(lonum, hinum, pagesize):
        #print >>sys.stderr, i
        url = '%s&start=%d' % (baseurl, i)
        DoPage(url, ofile)
    ofile.close()
   
if __name__ == '__main__': 
  main()
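(A side note on those default-argument lines: the cond and a or b idiom predates Python's conditional expression, and it silently picks the wrong branch whenever the wanted value is falsy - an empty string, 0, and so on. A small sketch of the safer form; get_arg is my own helper name, not something in the script above:)

```python
def get_arg(argv, i, default):
    """Return argv[i] if it was supplied, else default."""
    return argv[i] if len(argv) > i else default

# The old 'and/or' idiom misfires on falsy values:
old_style = True and 0 or 1    # yields 1, even though 0 was intended
new_style = 0 if True else 1   # the conditional expression yields 0
```

So e.g. oname = get_arg(sys.argv, 2, 'xkcdthread.txt') always honours an explicitly supplied argument. (Conditional expressions were added in Python 2.5, which this script's other features already assume.)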


Re: Help trawling through 170+ pages of a forum thread

Postby dudiobugtron » Mon May 13, 2013 10:56 am UTC

Thanks heaps for those example scripts; that's really helpful. I've been meaning to learn Python for a while, so this is a good excuse. I will certainly learn it up to the point where I can understand your script (or at least to where I can compile and run your script...) ;)

PM 2Ring wrote:I guess this thread might be more suited to the Coding forum...

I initially thought that, but the description of the forum made me think this one was more appropriate.


Re: Help trawling through 170+ pages of a forum thread

Postby PM 2Ring » Mon May 13, 2013 1:09 pm UTC

No worries. I couldn't remember whether or not you knew Python. The second script will basically do what you want, although there's room for improvement.

To run the script you need to install Python 2 on your machine, but that's pretty easy to do and the download for Windows is only about 20 MB. You don't need any fancy libraries, just the basic Python package.

Once you have Python installed, you can run the script from the commandline. Assuming the script is named SaveThread.py in the current directory (or Python script path), just type

python SaveThread.py thread_base_URL output_filename start_page end_page
or
python SaveThread.py -h
to get a simple help message.

All of the commandline arguments are optional, but since the argument parsing is very basic, you must give the arguments in the order specified above. The default arguments are
'http://forums.xkcd.com/viewtopic.php?f=14&t=6989' 'xkcdthread.txt' 1 1
so if you just type
python SaveThread.py
the script will save the text from the 1st page of the "Good News, Bad News" thread to a file called 'xkcdthread.txt' in the current directory.


Python runs in an interpreter, so you don't need to compile the code I posted (although the Python interpreter does convert a script to a "semi-compiled" form known as bytecode, which makes interpreting more efficient). There'd be little advantage in writing a program like this in a compiled language: the conversion to bytecode is quite fast, and downloading the data is the slow step. But I think you'll find that the script can download forum pages a lot faster than you could do it manually with your browser. :)
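(You can actually peek at that bytecode. A Python 3 sketch using the standard dis module - the add function here is just an illustration, not anything from the thread scripts:)

```python
import dis

def add(a, b):
    return a + b

# The interpreter has already compiled the function body to bytecode;
# dis.Bytecode wraps it so we can inspect the instructions it will run.
opnames = [ins.opname for ins in dis.Bytecode(add)]

# The raw bytecode itself is just a bytes object hanging off the function.
raw = add.__code__.co_code
```

The exact instruction names vary between Python versions, but dis.dis(add) will pretty-print them for whichever interpreter you're running.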


Re: Help trawling through 170+ pages of a forum thread

Postby dudiobugtron » Mon May 13, 2013 9:44 pm UTC

Thanks heaps PM 2Ring! Although by giving me a fish, you've left me a bit less motivation to teach myself to fish. ;)
Having the author and post time is what I was going for, but for my own interest I'll try to figure out how to do it with just the post content, since that might be useful for other threads (eg: the continuous story ones). I imagine I just need to get rid of the carefully labelled part that fetches that info, and then any later references to it.


Re: Help trawling through 170+ pages of a forum thread

Postby PM 2Ring » Tue May 14, 2013 3:55 am UTC

dudiobugtron wrote:Thanks heaps PM 2Ring!

Glad to help, dudiobugtron. I've thought about writing a script like this for a while; your post just gave me the necessary motivation.
dudiobugtron wrote:Although by giving me a fish, there is a bit less motivation for me to teach myself to fish. ;)

Good point. But as I said there's room for improvement in that script. And hopefully, you'll have more tasks that you'll want to write programs for. :)

Actually, I just realised that I neglected to translate HTML entities in that script, so it currently outputs &gt; instead of >, etc. I'll fix that shortly.

dudiobugtron wrote:Having the author and post time is what I was going for, but for my own interest I'll try to figure out how to do it with just the post content since it might be useful for other threads (eg: the continuous story ones). I imagine I just need to get rid of the carefully labelled part which fetches that info, and then any other later references to that.


The simplest way is to just comment out the line that prints the author & time info:

ofile.write('By %s at %s\n' % (a.group(1), a.group(2)))

You do that by putting a hash mark # at the start of the line.

If you want to make the program more efficient by cutting out the search for the author & time info, then you need to change this

Code: Select all

    while True:
        #Get Author's name and post time
        a = pauthortime.search(data)
        if not a:
            break
           
        ofile.write('By %s at %s\n' % (a.group(1), a.group(2)))
        data = data[a.end():]
       
        #Get post contents
        a = pcontent.search(data)
        content = a.group(1)
        data = data[a.end():]

to this

Code: Select all

    while True:
        #Get post contents
        a = pcontent.search(data)
        if not a:
            break
        content = a.group(1)
        data = data[a.end():]


I'm thinking that it might be a Good Idea to print the thread post number at the start of each post to help eliminate duplicates when saving threads in multiple stages, eg for threads that are still growing.

And it might be an idea to clean up various things, like quotes, links and images, although they don't seem to occur much in the "Good News, Bad News" thread.
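(One sketch of such a cleanup, in the same regex style as the pbr/pspan patterns above - these patterns and the clean_extras name are my own, and they assume simple, non-nested phpBB markup:)

```python
import re

# Keep a link's text and append its target URL in parentheses
plink = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.S)
# Drop inline images entirely
pimg = re.compile(r'<img\s[^>]*>')

def clean_extras(content):
    content = plink.sub(lambda m: '%s (%s)' % (m.group(2), m.group(1)), content)
    return pimg.sub('', content)
```

Quote blocks are fiddlier, since phpBB nests them in divs and blockquotes; a proper HTML parser starts to pay off at that point.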



Edit

Ok, here's an improved version, which prints the post number and translates HTML entities in the post contents and the author name into their extended ASCII (Latin-1) equivalents.

Code: Select all

#! /usr/bin/env python

''' Fetch data from a page of an XKCD forum thread
and print the author, time & contents of each post

Created by PM 2Ring, 2013.5.13
Added HTML entity translation 2013.5.14
'''

import sys, re, urllib

#Dictionary to convert HTML entities to ASCII
entityASCII = {
    'quot'  : '\x22', 'amp'   : '\x26', 'lt'    : '\x3c', 'gt'    : '\x3e',
    'nbsp'  : '\xa0', 'iexcl' : '\xa1', 'cent'  : '\xa2', 'pound' : '\xa3',
    'curren': '\xa4', 'yen'   : '\xa5', 'brvbar': '\xa6', 'sect'  : '\xa7',
    'uml'   : '\xa8', 'copy'  : '\xa9', 'ordf'  : '\xaa', 'laquo' : '\xab',
    'not'   : '\xac', 'shy'   : '\xad', 'reg'   : '\xae', 'macr'  : '\xaf',
    'deg'   : '\xb0', 'plusmn': '\xb1', 'sup2'  : '\xb2', 'sup3'  : '\xb3',
    'acute' : '\xb4', 'micro' : '\xb5', 'para'  : '\xb6', 'middot': '\xb7',
    'cedil' : '\xb8', 'sup1'  : '\xb9', 'ordm'  : '\xba', 'raquo' : '\xbb',
    'frac14': '\xbc', 'frac12': '\xbd', 'frac34': '\xbe', 'iquest': '\xbf',
    'Agrave': '\xc0', 'Aacute': '\xc1', 'Acirc' : '\xc2', 'Atilde': '\xc3',
    'Auml'  : '\xc4', 'Aring' : '\xc5', 'AElig' : '\xc6', 'Ccedil': '\xc7',
    'Egrave': '\xc8', 'Eacute': '\xc9', 'Ecirc' : '\xca', 'Euml'  : '\xcb',
    'Igrave': '\xcc', 'Iacute': '\xcd', 'Icirc' : '\xce', 'Iuml'  : '\xcf',
    'ETH'   : '\xd0', 'Ntilde': '\xd1', 'Ograve': '\xd2', 'Oacute': '\xd3',
    'Ocirc' : '\xd4', 'Otilde': '\xd5', 'Ouml'  : '\xd6', 'times' : '\xd7',
    'Oslash': '\xd8', 'Ugrave': '\xd9', 'Uacute': '\xda', 'Ucirc' : '\xdb',
    'Uuml'  : '\xdc', 'Yacute': '\xdd', 'THORN' : '\xde', 'szlig' : '\xdf',
    'agrave': '\xe0', 'aacute': '\xe1', 'acirc' : '\xe2', 'atilde': '\xe3',
    'auml'  : '\xe4', 'aring' : '\xe5', 'aelig' : '\xe6', 'ccedil': '\xe7',
    'egrave': '\xe8', 'eacute': '\xe9', 'ecirc' : '\xea', 'euml'  : '\xeb',
    'igrave': '\xec', 'iacute': '\xed', 'icirc' : '\xee', 'iuml'  : '\xef',
    'eth'   : '\xf0', 'ntilde': '\xf1', 'ograve': '\xf2', 'oacute': '\xf3',
    'ocirc' : '\xf4', 'otilde': '\xf5', 'ouml'  : '\xf6', 'divide': '\xf7',
    'oslash': '\xf8', 'ugrave': '\xf9', 'uacute': '\xfa', 'ucirc' : '\xfb',
    'uuml'  : '\xfc', 'yacute': '\xfd', 'thorn' : '\xfe', 'yuml'  : '\xff'
}

#Regex patterns for thread data extraction
pauthortime = re.compile(r'<p class="author">.*>(.*)</a></strong> &raquo; (.*) </p>')
pcontent = re.compile(r'<div class="content">(.*?)</div>', re.S)
pbr = re.compile(r'<br />')
pspan = re.compile(r'<span.*?>|</span>', re.S)
pentity = re.compile(r'&(\w+?);')

def xlate(matchobj):
    return entityASCII.get(matchobj.group(1), matchobj.group(0))

def entity2ascii(s):
    return pentity.sub(xlate, s)

#Number of posts per page
pagesize = 40

def DoPage(baseurl, start, ofile):
    url = '%s&start=%d' % (baseurl, start)
   
    print >>sys.stderr, " Fetching '%s'" % url
    f = urllib.urlopen(url)
    data = f.read()
    f.close()

    count = start
    while True:
        #Get Author's name and post time
        a = pauthortime.search(data)
        if not a:
            break
           
        author = a.group(1)
        time = a.group(2)
        ofile.write('%d: %s at %s\n' % (count, entity2ascii(author), time))
        data = data[a.end():]
       
        #Get post contents
        a = pcontent.search(data)
        content = a.group(1)
        data = data[a.end():]
       
        #Strip out HTML linebreaks
        content = pbr.sub('', content)
       
        #Strip out HTML spans, which contain style & color data
        content = pspan.sub('', content)
        ofile.write('%s\n\n' % entity2ascii(content))
       
        count += 1

def main():
    #Display a basic usage message if invoked with the '-h' option
    if len(sys.argv) > 1 and sys.argv[1] == '-h':
        print >>sys.stderr, "Usage: python %s [thread_base_URL] [output_filename] [start_page] [end_page]" % sys.argv[0]
        print >>sys.stderr, "You will probably need to put the URL string in quotes."
        sys.exit()
       
    #Get commandline arguments. Default values are for the 1st page of the "Good News, Bad News" thread
    baseurl = len(sys.argv) > 1 and sys.argv[1] or 'http://forums.xkcd.com/viewtopic.php?f=14&t=6989'
    oname = len(sys.argv) > 2 and sys.argv[2] or 'xkcdthread.txt'
    lopage = len(sys.argv) > 3 and int(sys.argv[3]) or 1
    hipage = len(sys.argv) > 4 and int(sys.argv[4]) or lopage
   
    print >>sys.stderr, "baseurl: '%s'\noname: '%s'\nlopage: %d\nhipage: %d\n" % (baseurl, oname, lopage, hipage)
   
    #Convert page numbers to message numbers
    lonum = (lopage - 1) * pagesize
    hinum = (hipage) * pagesize

    ofile = open(oname, 'wt')
    for i in xrange(lonum, hinum, pagesize):
        #print >>sys.stderr, i
        DoPage(baseurl, i, ofile)
    ofile.close()
   
if __name__ == '__main__': 
  main()
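(Side note: that big entityASCII table can be avoided by leaning on the standard library. Python 2 ships the same mapping in the htmlentitydefs module, and in Python 3 a single call does the whole job. A minimal Python 3 sketch; entity2text is my own wrapper name:)

```python
import html

def entity2text(s):
    # html.unescape knows every named HTML entity, plus numeric forms
    # like &#62; and &#x3e;, and returns a proper Unicode string
    return html.unescape(s)
```

For example, entity2text('2 &gt; 1 &amp; caf&eacute;') gives '2 > 1 & café'.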


Re: Help trawling through 170+ pages of a forum thread

Postby dudiobugtron » Tue May 14, 2013 6:54 am UTC

That's a good idea (no pun intended) - it seems like the sort of thing that other people (including myself) might find more generally useful as well. (I initially mostly wanted it so I could read through the good news bad news thread without having to click 'next page' 180 times.)

PM 2Ring wrote:The simplest way is to just comment out the line that prints the author & time info:

ofile.write('By %s at %s\n' % (a.group(1), a.group(2)))

You do that by putting a hash mark # at the start of the line.

That is much simpler! Very clever. (I'd already figured out that # marks a comment, since the editor that came with Python helpfully colour-codes everything; I'm used to seeing // or %.)


Re: Help trawling through 170+ pages of a forum thread

Postby PM 2Ring » Tue May 14, 2013 11:46 am UTC

Commenting-out stuff is a standard program development technique. :)

FWIW, using # for comments is common practice in the *nix world - it's the comment introducer in shell scripting languages like sh & bash, and it's also used in awk and sed.

