Python, conventions?

fiyarburst
Posts: 30
Joined: Sat Feb 09, 2008 10:55 pm UTC

Python, conventions?

Postby fiyarburst » Mon Aug 11, 2008 10:53 pm UTC

I was working on this script last night (not so sure about this revision) for this forum game.
I feel there's a more elegant way to do what I'm trying to do, but I'm still coming to grips with Python's indentation, colons and whitespace. I come from a Java background.

Any thoughts/comments on better coding practices, any improvements? A C++ person I was talking to once told me that for loops and breaks are just as elegant (if not more) as what I did with the while loop. But that's not what they taught us in school.

I had to hardcode the URLs for each page of the thread. That's not going to work for 10000 posts, I think. :?

misskwiz
Posts: 96
Joined: Wed Mar 21, 2007 6:19 am UTC

Re: Python, conventions?

Postby misskwiz » Mon Aug 11, 2008 11:42 pm UTC

Code: Select all

for i in xrange(0, htmlSource.__len__()-20):

is the same as

Code: Select all

for i in xrange(len(htmlSource)-20):


The body of the loop is very inefficient:

Code: Select all

i = i + 21
content = htmlSource[i:i+6]
post = ''
while content != '</div>':
   post = post + htmlSource[i]
   i = i+1
   content = htmlSource[i:i+6]
write(post)

This is because Python strings are immutable, so every post = post + htmlSource[i] builds a brand new string. A more efficient way is to collect the pieces in a list and join them at the end:

Code: Select all

post = []
while content != '</div>' :
   post.append(htmlSource[i])
   i = i+1
   content = htmlSource[i:i+6]
post = ''.join(post)
write(post)

The join method puts the string it is called on between every pair of list elements, so:

Code: Select all

>>> a = ['a','b','c']
>>> '|'.join(a)
'a|b|c'

or, in the case of the empty string:

Code: Select all

>>> a = ['a','b','c']
>>> ''.join(a)
'abc'
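
For what it's worth, the character-by-character loop can probably be avoided altogether with str.find and a slice. A minimal sketch, assuming the 21 characters being skipped are the `<div class="content">` tag (the original script isn't shown in this thread, so that marker and the names extract_posts and sample are guesses for illustration), and written for a current Python 3:

```python
def extract_posts(html_source, open_tag='<div class="content">', close_tag='</div>'):
    """Return the text between each open_tag and the next close_tag."""
    posts = []
    pos = 0
    while True:
        start = html_source.find(open_tag, pos)
        if start == -1:          # no more posts
            break
        start += len(open_tag)   # skip past the opening tag itself
        end = html_source.find(close_tag, start)
        if end == -1:            # unterminated post: stop rather than guess
            break
        posts.append(html_source[start:end])   # one slice per post, no per-character work
        pos = end + len(close_tag)
    return posts

# Made-up input, just to show the shape of the result.
sample = '<div class="content">first</div><p>x</p><div class="content">second</div>'
print(extract_posts(sample))
```

Each post is copied once by the slice, so there is nothing to join afterwards.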
I am currently enjoying the pathetic anger bread of a dissatisfied life.

mat-tina
Posts: 331
Joined: Mon Jun 02, 2008 3:33 pm UTC

Re: Python, conventions?

Postby mat-tina » Tue Aug 12, 2008 5:58 am UTC

fiyarburst wrote:I had to hardcode the URLs for each page of the thread. That's not going to work for 10000 posts, I think. :?


No, you didn't. All you have to do is adjust the start-part of the url:

Code: Select all

for start in [str(a) for a in xrange(0, NUM_POSTS, 40)]:
   parse("http://forums.xkcd.com/viewtopic.php?f=14&t=26056&st=0&sk=t&sd=a&start=" + start)


I have no idea how to get the number of posts without hardcoding it, but that's less of a problem.


EDIT: Is there a way to make the above code more generator friendly?

More EDIT: Also, I'm not sure that opening and closing the output file for every message is such a good idea... Store everything in a list and flush that now and then, or open the file at the start of the program, and don't close it before everything has been written.
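
A rough sketch of the "open once" idea, in Python 3. The function name write_posts and the sample data are made up, and io.StringIO stands in for the real output file so the example runs without touching the disk:

```python
import io

def write_posts(posts, out):
    """Write each post on its own line to an already-open file-like object."""
    for post in posts:
        out.write(post + '\n')

# In the real script this would be a single open(...) around the whole loop;
# StringIO is only a stand-in here.
out = io.StringIO()
for page_posts in (['post 1', 'post 2'], ['post 3']):   # made-up pages
    write_posts(page_posts, out)
print(out.getvalue())
```

The point is only that the file object is created once and handed to whatever does the writing.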
Felltir wrote:has no sig, and therefore something to hide
GENERATION n: The first time you see this, copy it into your sig on any forum. If n is an even number, divide it by 2. If it's odd, multiply it by 3 and add 1. Prove that this sequence converges to 1 for all n.

Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway

Re: Python, conventions?

Postby Berengal » Tue Aug 12, 2008 5:12 pm UTC

Lately I keep coming across topics where I almost instantly think regexen are the answer, which I find weird. This is no exception. The worst thing is, I've been right about it too.

Code: Select all

from __future__ import with_statement # Sometimes required. Officially required. Read below
import urllib
import re # because everyone likes regexen
# Why import sys? You don't use it.


postPattern = re.compile('<div class="content">(.*?)</div>')

def parse(url):

  # "with" is a nice, recent addition to Python, and is the preferred idiom for opening files.
  # Simply put, the next line creates a new block the same way ifs, whiles and fors do.
  # It opens the file (or site) on entry, assigns it a name (with the optional "as" clause),
  # and closes it when the block exits. It also closes the resource on most exceptional block exits (such as exceptions).
  with urllib.urlopen(url) as site:
    source = site.read()
 
  # This goes through the html source and puts everything that matches the regex in a list.
  # Also, it strips the surrounding tags. Read up on regexen if you wonder why.
  posts = postPattern.findall(source)

  # This should be self-explanatory.
  with file('xmas.txt', 'a') as f:
    f.write('\n')
    f.write('\n'.join(posts))


if __name__ == '__main__':
  start = 0
  step = 40 # Guess what this does
  end = 10000
  for i in xrange(start, end, step):
    # String formatting in Python: "string % (format arguments)" (parentheses can be dropped in unambiguous cases)
    # Example: "I wrote %d scripts on %s" % (5, 'Monday')
    parse('http://forums.xkcd.com/viewtopic.php?f=14&t=26056&st=0&sk=t&sd=a&start=%d' % i)


Edit: During testing I found a minor typo (in the url), and that urlopen doesn't have support for the with statement yet (it's still an old-style object as well, which is bad). The url typo is fixed, but I left the essay on the with statement behind, because I love that little word.
It is practically impossible to teach good programming to students who are motivated by money: As potential programmers they are mentally mutilated beyond hope of regeneration.

Yakk
Poster with most posts but no title.
Posts: 11129
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Python, conventions?

Postby Yakk » Tue Aug 12, 2008 10:03 pm UTC

Notice that the downside is that nested <div> elements break it: the match ends at the first </div> it meets. An RE-based solution like this cannot catch that (but the OP's didn't either).

And yes, the OP's post is a manual RE-like effort: REs are more optimized than your hacky hand-code job, so use them. :-)
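
To make the nested <div> problem concrete, here is a small Python 3 check using the same pattern as the script above. The input string, including the `class="quote"` div, is invented for the demonstration:

```python
import re

post_pattern = re.compile('<div class="content">(.*?)</div>')

# A post whose body contains another div (made-up markup).
nested = '<div class="content">before <div class="quote">quoted</div> after</div>'
print(post_pattern.findall(nested))
```

The non-greedy match stops at the inner closing tag, so the result is `['before <div class="quote">quoted']` and the text after the inner div is lost.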
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

r1chard
Posts: 281
Joined: Thu Dec 06, 2007 2:17 am UTC
Location: Melbourne, AU

Re: Python, conventions?

Postby r1chard » Wed Aug 13, 2008 4:38 am UTC

When parsing HTML the answer is almost always to use Beautiful Soup. From its description:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway

Re: Python, conventions?

Postby Berengal » Wed Aug 13, 2008 1:27 pm UTC

Yakk wrote:Notice that the downside is that nested <div> elements end up screwing up. A RE based solution cannot catch this (but the OP's didn't either).

Yes, I know, but I was coding to the specifications the original script satisfied (that is, produce the same output as the original). Nested divs are problematic, but still doable with a little recursion and use of string.find.

Then again, using an HTML library like the previously mentioned Beautiful Soup is probably the best solution.
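
As an illustration of what a real parser buys you, here is a sketch that tracks div nesting depth. It uses the standard library's html.parser from Python 3 rather than Beautiful Soup (a swap made so the example has no third-party dependency); the class name PostExtractor and the sample markup are made up. Note that it collects only the text of each post and drops inner tags, which differs from the regex version's output:

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collect the text inside each <div class="content">, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # 0 means "not inside a post"
        self.current = []     # text pieces of the post being read
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:                          # nested div inside a post
            self.depth += 1
        elif ('class', 'content') in attrs:     # start of a new post
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if self.depth == 0:                 # the post's own closing tag
                self.posts.append(''.join(self.current))

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)

parser = PostExtractor()
parser.feed('<div class="content">before <div class="quote">quoted</div> after</div>')
print(parser.posts)
```

Here the text after the inner div survives, because the post only ends when the depth counter returns to zero.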

DaymItzJack
Posts: 9
Joined: Tue Jun 17, 2008 7:39 pm UTC

Re: Python, conventions?

Postby DaymItzJack » Wed Aug 13, 2008 5:48 pm UTC

mat-tina wrote:
fiyarburst wrote:I had to hardcode the URLs for each page of the thread. That's not going to work for 10000 posts, I think. :?

Code: Select all

for start in [str(a) for a in xrange(0, NUM_POSTS, 40)]:
   parse("http://forums.xkcd.com/viewtopic.php?f=14&t=26056&st=0&sk=t&sd=a&start=" + start)

EDIT: Is there a way to make the above code more generator friendly?

Code: Select all

for start in xrange(0, NUM_POSTS, 40):
   parse("http://forums.xkcd.com/viewtopic.php?f=14&t=26056&st=0&sk=t&sd=a&start=" + str(start))
?

mat-tina
Posts: 331
Joined: Mon Jun 02, 2008 3:33 pm UTC

Re: Python, conventions?

Postby mat-tina » Wed Aug 13, 2008 6:03 pm UTC

DaymItzJack wrote:

Code: Select all

for start in xrange(0, NUM_POSTS, 40):
   parse("http://forums.xkcd.com/viewtopic.php?f=14&t=26056&st=0&sk=t&sd=a&start=" + str(start))
?


Now I just feel stupid...

:)

fiyarburst
Posts: 30
Joined: Sat Feb 09, 2008 10:55 pm UTC

Re: Python, conventions?

Postby fiyarburst » Fri Aug 15, 2008 1:47 am UTC

mat-tina wrote:I have no idea how to get the number of posts without hardcoding, though, but that's less of a problem.

That's the only reason I wasn't already using a loop. The only way I can think of to find out when to stop incrementing the post count is by digging through the input and finding the "Page # of #".

zahlman
Posts: 638
Joined: Wed Jan 30, 2008 5:15 pm UTC

Re: Python, conventions?

Postby zahlman » Fri Aug 15, 2008 2:12 am UTC

fiyarburst wrote:
mat-tina wrote:I have no idea how to get the number of posts without hardcoding, though, but that's less of a problem.

That's the only reason I wasn't already using a loop. The only way I can think of to find out when to stop incrementing the post count is by digging through the input and finding the "Page # of #".


Why not stop incrementing it when an attempt to retrieve a page fails? :)

Code: Select all

import itertools

for counter in itertools.count():
   # modify parse() to return False if the urlopen fails, and True otherwise.
   if not parse("http://forums.xkcd.com/viewtopic.php?f=14&t=26056&st=0&sk=t&sd=a&start=" + str(counter * 40)): break


BTW, instead of re-opening the file for each parse() call, you could pass the file object in as a parameter. :)
Belial wrote:I once had a series of undocumented and nonstandardized subjective experiences that indicated that anecdotal data is biased and unreliable.

fiyarburst
Posts: 30
Joined: Sat Feb 09, 2008 10:55 pm UTC

Re: Python, conventions?

Postby fiyarburst » Fri Aug 15, 2008 2:27 am UTC

zahlman wrote:Why not stop incrementing it when an attempt to retrieve a page fails? :)

Because it won't fail; it'll increment indefinitely and just keep retrieving the last page. A weird phpBB thing, I guess.

Xeio
Friends, Faidites, Countrymen
Posts: 5101
Joined: Wed Jul 25, 2007 11:12 am UTC
Location: C:\Users\Xeio\

Re: Python, conventions?

Postby Xeio » Fri Aug 15, 2008 6:38 am UTC

Eh, not to hijack the topic or anything, but as I'm playing around with python can someone explain the unexpected results below to me? Ok, so the first block tests using previously discovered primes to find new ones, the second one tests all the possible factors between 2 and sqrt(n) to find the primes. So, why does the latter perform so much better? Does iterating over a list add that much overhead?

I should note that I've tried pre-allocating a list (this nearly doubled the running time compared with the first test :shock: ), as well as iterating over a [:] copy of the list and using indexes. They all performed about the same as the first case or worse. So I'm thinking I did something wrong; can anyone point it out to me? Even at numbers such as 300,000 the latter case has to check 3-4x as many numbers.

Code: Select all

import time, math

#runtime ~120 seconds
#uses already found primes to check
def PrimeTest1(maxTest):
    timer = time.time()
    current = 3 #start at 3, wheee
    primes = [2] #2 is a given :P
    while current < maxTest: #primes up to this number
        temp = math.sqrt(current)
        for factor in primes:   #check only with previous primes               
            if factor < temp:   #only up to the sqrt
                if current%factor==0:   #not a prime
                    break                   
        else:
            primes.append(current)
        current+=2
    print "Test 1"
    print time.time()-timer
    print len(primes)

#runtime ~8 seconds
#checks ALL the numbers up to the sqrt
def PrimeTest2(maxTest):
    timer = time.time()
    current = 3 #start at 3, wheee
    primes = [2] #2 is a given :P
    while current < maxTest: #primes up to this number
        temp = math.sqrt(current)
        factor = 2  #start at 2...                   
        while factor < temp: #check only up to sqrt               
            if current%factor == 0: #not a prime
                break
            factor+=1  #count up one
        else:
            primes.append(current)
        current+=2
    print "Test 2"
    print time.time()-timer
    print len(primes)

a = 500000
PrimeTest1(a)
PrimeTest2(a)

mat-tina
Posts: 331
Joined: Mon Jun 02, 2008 3:33 pm UTC

Re: Python, conventions?

Postby mat-tina » Fri Aug 15, 2008 12:32 pm UTC

Xeio wrote:Does iterating over a list add that much overhead?


Yes.
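
List iteration does cost something, but a bigger factor here is probably that the for loop in PrimeTest1 never stops early: once factor reaches the square root, the `if` is simply skipped and the loop keeps walking the rest of the primes list. For every prime candidate that means visiting every prime found so far, whereas the while loop in PrimeTest2 really does stop at the square root. A sketch with an explicit break, in Python 3 (the name primes_below is made up). It also compares with `factor * factor > current`, because `factor < temp` in both tests lets squares of primes such as 9 through as "prime":

```python
def primes_below(limit):
    """Return all primes below limit, trial-dividing by earlier primes only."""
    if limit <= 2:
        return []
    primes = [2]
    for current in range(3, limit, 2):      # odd candidates only
        is_prime = True
        for factor in primes:
            if factor * factor > current:   # past the square root: stop looking
                break
            if current % factor == 0:       # found a divisor: not prime
                is_prime = False
                break
        if is_prime:
            primes.append(current)
    return primes

print(primes_below(30))
```

With the early exit, each candidate is tested only against primes up to its square root, which is what the first test was meant to do.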


More hijacking of threads:

@Berengal: with urllib.urlopen doesn't work for me. It quits with an AttributeError: addinfourl instance has no attribute '__exit__' .

This works, though:

Code: Select all

from __future__ import with_statement
from contextlib import closing
import urllib

with closing(urllib.urlopen("http://example.com")) as page:
      print page.read()


I just thought that needed to be said for the sake of future generations browsing the Internet for help. (Hi, post-apocalyptic society!)
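
For anyone wondering what closing actually does: it wraps any object that has a close() method and calls that method when the with block exits. A tiny self-contained illustration, written for Python 3; the Resource class is a made-up stand-in for something like the object urlopen returned above:

```python
from contextlib import closing

class Resource(object):
    """Stand-in for an object that has close() but no __enter__/__exit__."""
    def __init__(self):
        self.closed = False
    def read(self):
        return 'data'
    def close(self):
        self.closed = True

res = Resource()
with closing(res) as page:    # page is the same object as res
    text = page.read()
print(text, res.closed)       # close() has been called by the time we get here
```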

RoadieRich
The Black Hand
Posts: 1037
Joined: Tue Feb 12, 2008 11:40 am UTC
Location: Behind you

Re: Python, conventions?

Postby RoadieRich » Fri Aug 15, 2008 2:24 pm UTC

fiyarburst wrote:Because it won't fail; it'll increment indefinitely and just keep retrieving the last page. A weird phpBB thing, I guess.


You can use that as a convenient way of telling when to stop: compare the received page to the previous one. If you get two pages with identical content, you know you've reached the end. It may not be elegant, but it saves either editing your script every time you run it or adding user input and all the problems that could bring, especially if you plan on releasing the code.

You should also consider whether to include a posts-per-page setting. As this will rarely change, it could be hard-coded.
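
A minimal sketch of that stopping rule, in Python 3. The names fetch_all_pages, fetch and fake_fetch are invented, and the fetch function is passed in as a parameter so the example can run against a fake forum instead of the network:

```python
def fetch_all_pages(fetch, step=40):
    """Call fetch(start) for start = 0, step, 2*step, ... until a page repeats."""
    pages = []
    previous = None
    start = 0
    while True:
        page = fetch(start)
        if page == previous:     # same content twice in a row: past the end
            break
        pages.append(page)
        previous = page
        start += step
    return pages

# Fake forum with three pages that keeps serving the last one, as described above.
fake_pages = ['page one', 'page two', 'page three']

def fake_fetch(start):
    return fake_pages[min(start // 40, len(fake_pages) - 1)]

print(fetch_all_pages(fake_fetch))
```

On a real forum the raw HTML may differ between two requests for the same page (timestamps, session ids and so on), so comparing the extracted posts rather than the whole page is likely to be more reliable.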
73, de KE8BSL loc EN26.

Berengal
Superabacus Mystic of the First Rank
Posts: 2707
Joined: Thu May 24, 2007 5:51 am UTC
Location: Bergen, Norway

Re: Python, conventions?

Postby Berengal » Fri Aug 15, 2008 4:09 pm UTC

Looking through the source for this page, I found this:

Code: Select all

         <div class="pagination">
         <a href="#unread">First unread post</a> &bull; 15 posts          &bull; Page <strong>1</strong> of <strong>1</strong>      </div>
This could probably be used in some kind of productive fashion, I wager.
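
One way it might be used, sketched in Python 3: pull the second number out of the "Page X of Y" fragment with a regex. The names page_pattern and page_count are made up, and the pattern assumes the markup looks exactly like the snippet above:

```python
import re

# Matches the "Page <strong>1</strong> of <strong>1</strong>" fragment shown above.
page_pattern = re.compile(r'Page <strong>(\d+)</strong> of <strong>(\d+)</strong>')

def page_count(html_source):
    """Return the total number of pages, or None if the marker is missing."""
    match = page_pattern.search(html_source)
    if match is None:
        return None
    return int(match.group(2))

snippet = ('<a href="#unread">First unread post</a> &bull; 15 posts          '
           '&bull; Page <strong>1</strong> of <strong>1</strong>      </div>')
print(page_count(snippet))
```

Multiplying the page count by the posts-per-page step would then give the upper bound for the start parameter.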

mat-tina wrote:@Berengal: with urllib.urlopen doesn't work for me. It quits with an AttributeError: addinfourl instance has no attribute '__exit__' .

This works, though:

Code: Select all

from __future__ import with_statement
from contextlib import closing
import urllib

with closing(urllib.urlopen("http://example.com")) as page:
      print page.read()


I just thought that needed to be said for the sake of future generations browsing the Internet for help. (Hi, post-apocalyptic society!)

Berengal wrote:Edit: During testing I found a minor typo (in the url), and that urlopen doesn't have support for the with statement yet (it's still an old-style object as well, which is bad). The url typo is fixed, but I left the essay on the with statement behind, because I love that little word.


I didn't know about closing though. Neat.

