Python - Best way to turn string of digits into list of ints

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

MostlyHarmless
Posts: 154
Joined: Sun Oct 22, 2006 4:29 am UTC

Python - Best way to turn string of digits into list of ints

Postby MostlyHarmless » Wed Jul 15, 2015 6:43 am UTC

First off, I apologize if this belongs in another thread. The best I could find was this, but I didn't want to necro a 6 year old thread just to post something almost on-topic.

On to the question: I've been teaching myself python (3.4), mostly to use for applied math stuff. In a lot of problems I've been working on, I need to take a string of digits (usually from a .txt file, sometimes from input()) and turn the digits into integers for processing. I've found four approaches that seem reasonable, but I'm not sure which is better (and under what conditions other solutions will be better). Let's say, for the sake of argument, that my string is called "line" and that I have to loop through the digits exactly once. Here are the options I've found:

1) nums = map(int, line)
2) nums = (int(chr) for chr in line)
3) nums = list(map(int, line))
4) nums = [int(chr) for chr in line]

Since we're assuming that I only have to loop through nums once, I think option 3 is strictly worse than option 1, but I'm not sure about the others. Is it better to use map than a generator expression? If the string is sufficiently short, am I better off making a list like in option 4? Are there even better methods that I haven't thought of?

I realize the most likely answer is that they're all roughly equal, but I'm really curious. Plus, in numerical methods a little more efficiency seems to go an awfully long way.

User avatar
karhell
Posts: 687
Joined: Wed Jun 19, 2013 4:56 pm UTC
Location: Breizh

Re: Python - Best way to turn string of digits into list of

Postby karhell » Wed Jul 15, 2015 2:07 pm UTC

First, let's have a look at your options :

Code: Select all

nums = map(int, line)
  #Returns a Map object; needs to be iterated on to be of any use.
nums = (int(chr) for chr in line)
  #Returns a Generator object; also needs to be iterated on.
nums = list(map(int, line))
  # Returns a list; can be iterated on, can be accessed by index
nums = [int(chr) for chr in line]
  # Returns the exact same list; pretty much equivalent to 3

If I'm not mistaken, to the interpreter, 1 and 2 both read "create an iterator that will call int() on all elements of line", while 3 and 4 read as "call int() on all elements of line and create a list of the results".

This question provides some insight in the differences between list comprehension and generator expressions.
In short, they're pretty much equivalent, with generators handling very large sequences better, and lists being slightly faster and more flexible (since list expressions will create the full list off the bat, potentially hogging a lot of memory, but actually loading everything in memory, while generators will yield the results as they're needed, using up less memory, but needing to load the next element with every call).

All in all, whichever you end up using will depend on the data you expect to process. If you're looking at a 300Mb file, or are particularly resource-starved, use map(), otherwise, list() works just fine.

Oh, and if it's just for a single iteration, don't bother assigning a variable:
(using bracket notation since I like it better, personally :P)

Code: Select all

#Generator:
for i in (int(c) for c in line):
  process_number(i)

#List:
for i in [int(c) for c in line]:
  process_number(i)

#Or the alternative, taking advantage of the fact that a string is already an iterable (behaves very much like a list, in fact) :
for i in line:
  process_number(int(i))
Last edited by karhell on Thu Jul 16, 2015 12:32 pm UTC, edited 1 time in total.
AluisioASG wrote:191 years ago, the great D. Pedro I drew his sword and said: "Indent thy code or die!"
lmjb1964 wrote:We're weird but it's okay.
ColletArrow, katakissa, iskinner, thunk, GnomeAnne, Quantized, and any other Blitzers, have fun on your journey!

User avatar
Xanthir
My HERO!!!
Posts: 5401
Joined: Tue Feb 20, 2007 12:49 am UTC
Location: The Googleplex
Contact:

Re: Python - Best way to turn string of digits into list of

Postby Xanthir » Wed Jul 15, 2015 2:14 pm UTC

Agree with karhell: if it's short, just use a list (a comprehension is best); if it's long, use map() or a generator comprehension (same thing). Generators are more expensive computationally, but use drastically less memory for large datas, so whatever fits your data is best.
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))

User avatar
PM 2Ring
Posts: 3708
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Sydney, Australia

Re: Python - Best way to turn string of digits into list of

Postby PM 2Ring » Thu Jul 16, 2015 12:20 pm UTC

@MostlyHarmless
What they said. Generators (including generator expressions) run slightly slower than list comprehension and take more setup time, so list comprehensions will beat them for stuff like this unless you're processing ridiculously huge strings of digits.

BTW, in general, don't use `chr` as a variable name since that shadows Python's built-in `chr` function. You can get away with it here since it only affects the local scope inside a gen expression or list comp (in Python 2 it would be bad since a list comp doesn't create a new scope - it runs in the same scope as the surrounding code).


karhell wrote:Oh, and if it's just for a single iteration, don't bother assigning a variable:
(using bracket notation since I like it better, personally :P)

Code: Select all

#Generator:
for nums in (int(c) for c in line):
  process_number(i)

#List:
for nums in [int(c) for c in line]:
  process_number(i)

#Or the alternative, taking advantage of the fact that a string is already an iterable (behaves very much like a list, in fact) :
for nums in line:
  process_number(int(i))

Nice post. But those nums should be i (or vice versa).

User avatar
karhell
Posts: 687
Joined: Wed Jun 19, 2013 4:56 pm UTC
Location: Breizh

Re: Python - Best way to turn string of digits into list of

Postby karhell » Thu Jul 16, 2015 12:32 pm UTC

PM 2Ring wrote:BTW, in general, don't use `chr` as a variable name since that shadows Python's built-in `chr` function. You can get away with it here since it only affects the local scope inside a gen expression or list comp (in Python 2 it would be bad since a list comp doesn't create a new scope - it runs in the same scope as the surrounding code).

That's pretty good to know. I wasn't aware that list comps create their own scope, now.
PM 2Ring wrote:<own post snipped>
Nice post. But those nums should be i (or vice versa).

Oh, aye. Got mixed up between MostlyHarmless's variable names and what I tested with to make sure I wasn't talking out of my arse. I'll fix that right now.
AluisioASG wrote:191 years ago, the great D. Pedro I drew his sword and said: "Indent thy code or die!"
lmjb1964 wrote:We're weird but it's okay.
ColletArrow, katakissa, iskinner, thunk, GnomeAnne, Quantized, and any other Blitzers, have fun on your journey!

MostlyHarmless
Posts: 154
Joined: Sun Oct 22, 2006 4:29 am UTC

Re: Python - Best way to turn string of digits into list of

Postby MostlyHarmless » Thu Jul 16, 2015 11:33 pm UTC

Thanks for the replies! I picked up "Learning Python" by Mark Lutz and he seems to mostly agree: list comprehensions are usually better than generators unless you have a very large number of digits. (I've been solving some project Euler problems, and things like "the first ten million fibonacci numbers" seem like good places for generators.)

Interestingly, from Lutz's book and a few of my own small tests, map and generator expressions aren't really equivalent. For instance, assuming f is already defined somewhere and seq is an iterable, the code
map(f, seq)
is faster than
(f(x) for x in seq)

However, if expr is some expression involving x, then
map(lambda x: expr, seq)
is slower than
(expr for x in seq)

So my takeaway is that list comprehensions are almost always better, but if I need to turn an extremely long string of digits to integers, map is a little better than a generator expression (since int() is built in). (My real takeaway should probably be that premature optimization is the root of all evil and that I'd learn faster if I focused on important things.)

Also, thanks for pointing out the name `chr`. I actually knew that, and in my real code I used `c' instead, but I wanted to make things more readable and screwed it up. Always good to get a reminder. =)

User avatar
PM 2Ring
Posts: 3708
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Sydney, Australia

Re: Python - Best way to turn string of digits into list of

Postby PM 2Ring » Fri Jul 17, 2015 9:01 am UTC

karhell wrote:I wasn't aware that list comps create their own scope, now.

Lots of people have been bitten by the "scope-leaking" of Python 2 list comps. Most of us just learned to live with it, but enough people complained that it was changed for Python 3, even though it makes list comps slightly less efficient. They couldn't change it in Python 2 because it would break existing code that relies on the scope-leaking.


MostlyHarmless wrote:([...] things like "the first ten million fibonacci numbers" seem like good places for generators.)

Spoiler:
The ten millionth Fibonacci number is almost 1.1298343782253998e+2089876. So you probably don't have enough RAM to hold a Fibonacci list that big if you compute them as exact integers. :) OTOH, that number is way too big to fit in a Python float; fib(1475) ~= 8.0776376321562252e+307 is the biggest Fibonacci number that float can handle. FWIW, I calculated those Fibonacci numbers using the arbitrary-precision math package mpmath.

Another good place for generators is when you're using `any()` or `all()`. Both of those functions short-circuit, i.e., they stop processing further elements as soon as they have a definitive result. So if you call them on a generator you only generate as many elements as necessary, but if you call them on a list comp the whole list has already been created before `any()` or `all()` start processing.

Conversely, using the `str.join()` method on a gen exp is much more inefficient than using it on a list comp.
Eg

Code: Select all

#bad
"".join(str(i) for i in range(10))
#good
"".join([str(i) for i in range(10)])

`.join()` has to scan through the input strings twice. The first time is to find the length of each input string so it can determine the total size of the output string. So if you pass it a generator it has to build a list so it can access the strings again after it's got their lengths.

MostlyHarmless wrote:Interestingly, from Lutz's book and a few of my own small tests, map and generator expressions aren't really equivalent. For instance, assuming f is already defined somewhere and seq is an iterable, the code
map(f, seq)
is faster than
(f(x) for x in seq)

However, if expr is some expression involving x, then
map(lambda x: expr, seq)
is slower than
(expr for x in seq)


All things being equal, `map()` is a little more efficient than a "hand-made" gen exp, which (kind of) explains your 1st pair of examples. But in your second pair the `map()` form is slower due to the overhead in calling a function; there's also a little bit of overhead in setting up the lambda, but that initial overhead becomes insignificant when processing a large iterable.

A useful tool for when you get curious about stuff like this is dis the Python bytecode disassembler.

MostlyHarmless wrote:So my takeaway is that list comprehensions are almost always better, but if I need to turn an extremely long string of digits to integers, map is a little better than a generator expression (since int() is built in). (My real takeaway should probably be that premature optimization is the root of all evil and that I'd learn faster if I focused on important things.)


I fully agree with Tony Hoare & Donald Knuth that premature optimization is the root of all evil. But that doesn't mean that we can get away with writing unnecessarily inefficient code. :) Randall Hyde has written an excellent article on this topic, available on ACM website The Fallacy of Premature Optimization.

MostlyHarmless wrote:Also, thanks for pointing out the name `chr`. I actually knew that, and in my real code I used `c' instead, but I wanted to make things more readable and screwed it up. Always good to get a reminder. =)


No worries. A lot of the time when I see things like `chr`, `int`, `str`, `list`, `set`, etc being used as variable names on Stack Overflow it turns out that the OP was doing exactly what you were. :)

User avatar
cemper93
Posts: 209
Joined: Sun Feb 20, 2011 2:35 pm UTC
Location: `pwd`

Re: Python - Best way to turn string of digits into list of

Postby cemper93 » Tue Jul 21, 2015 4:39 pm UTC

I fully agree with Tony Hoare & Donald Knuth that premature optimization is the root of all evil. But that doesn't mean that we can get away with writing unnecessarily inefficient code. :) Randall Hyde has written an excellent article on this topic, available on ACM website The Fallacy of Premature Optimization.

As long as the results are within +-200% or so, the optimization is probably unnecessary unless the application is already in use and users are complaining about speed performing a particular operation. And even then there are a lot of steps you should take before changing a list comprehension into a map. These include, but are not limited to, using a profiler to make sure that you guessed correctly as to what function is causing the slowdown, and using a different algorithm with lower big-O complexity.

User avatar
Xanthir
My HERO!!!
Posts: 5401
Joined: Tue Feb 20, 2007 12:49 am UTC
Location: The Googleplex
Contact:

Re: Python - Best way to turn string of digits into list of

Postby Xanthir » Thu Jul 23, 2015 1:02 am UTC

You should be using a profile before doing any non-obvious-and-trivial optimization, anyway. It's always *very* surprising where your actual slowdowns are - there's often trivial places you can change that have a huge benefit, and they can guide you to where you might need to make big changes that you wouldn't want to make without data.

For example, a few months ago I was doing perf work on [url=github.com/tabatkins/bikeshed]Bikeshed[/url], and found that 10% of my entire execution time was from a single badly-optimized check for duplicate definitions. It took two minutes to fix and shaved 10+ seconds off of most runs, and I'd never have known it was an issue otherwise. Similarly, I invoked an HTML parser all throughout my codebase, and it was costing me more than a third of my runtime; it was painful and involved to switch over nearly all my uses to DOM-buildling, but it was very worthwhile (and solved a bunch of bugs to boot - no more places for me to forget to escape something!).
(defun fibs (n &optional (a 1) (b 1)) (take n (unfold '+ a b)))

lvc
Posts: 20
Joined: Mon Jun 15, 2009 5:27 am UTC

Re: Python - Best way to turn string of digits into list of

Postby lvc » Sat Jul 25, 2015 7:49 am UTC

PM 2Ring wrote:Conversely, using the `str.join()` method on a gen exp is much more inefficient than using it on a list comp.
Eg

Code: Select all

#bad
"".join(str(i) for i in range(10))
#good
"".join([str(i) for i in range(10)])

`.join()` has to scan through the input strings twice. The first time is to find the length of each input string so it can determine the total size of the output string. So if you pass it a generator it has to build a list so it can access the strings again after it's got their lengths.

This is theoretically true (only because list comprehensions are optimised by the interpreter more than calling list() on an equivalent generator expression - other than that, the fact that join listifies it anyway only means the generator would be *no faster* than the list, not that it would be *slower*). But the difference turns out to be negligible - the time to scan the list twice and build the string easily dominates. To test this, I ran these timings:

Code: Select all

from timeit import timeit

passes = 1000000


g_time = timeit("''.join(str(i) for i in range(10))",
   number=passes)
g_avg = g_time/passes

l_time = timeit("''.join([str(i) for i in range(10)])",
   number=passes)
l_avg = l_time/passes

percent_difference = abs(l_time - g_time)/max(l_time,g_time) * 100

print('         \tTotal \t Average')
print('Generator\t{:>f} \t {:>f}\t'.format(g_time, g_avg))
print('Listcomp \t{:>f} \t {:>f}\t'.format(l_time, l_avg))

print('% difference: {:>.2f}'.format(percent_difference))


For different size ranges. Results:

Code: Select all

range(10)
            Total     Average
Generator   3.482450     0.000003   
Listcomp    3.121575     0.000003   
% difference: 10.36

range(100)
            Total     Average
Generator   24.632491     0.000025   
Listcomp    22.269575     0.000022   
% difference: 9.59

range(1000)
            Total     Average
Generator   236.696367     0.000237   
Listcomp    218.486281     0.000218   
% difference: 7.69


The biggest gap here is a difference of 25 microseconds. So using a generator isn't "much more inefficient", it's at worst about a tenth slower. Unless you're doing repeated joins on large collections in a tight loop, you might as well avoid the (admittedly also quite marginal) amount of added line noise.

User avatar
PM 2Ring
Posts: 3708
Joined: Mon Jan 26, 2009 3:19 pm UTC
Location: Sydney, Australia

Re: Python - Best way to turn string of digits into list of

Postby PM 2Ring » Sat Jul 25, 2015 10:51 am UTC

I wouldn't call a 10% difference negligible, but YMMV. But do I agree that 25 microseconds is negligible.

The relative speed difference of the two methods depends on the speed of the computation inside the list comp / gen exp: if that computation is slow it may dominate the total time, hiding the difference in speed between doing join on a list comp vs on a gen exp.

I've modified your timing code a little to demonstrate this. As well as measuring the raw times of the list comp and the gen exp, my version also measures the time taken to do the string conversions in a simple `for` loop, and subtracts that time from the raw times to (approximately) determine the time taken to perform the join operation.

My version does 3 repetitions of each timing and takes the minimum, as recommended in the Python docs of timeit and as endorsed by Python core developers Raymond Hettinger, Tim Peters, and Guido van Rossum.

Here's the code:

Code: Select all

#!/usr/bin/env python
from __future__ import print_function, division

import timeit

size = 1000
passes = 5000000 // size
repeat = 3

def timing(stmt):
    return min(timeit.repeat(stmt=stmt, number=passes, repeat=repeat))

cmd = {
    'for': "for i in range(%d):str(i)" % size,
    'join_lc': "''.join([str(i) for i in range(%d)])" % size,
    'join_ge': "''.join(str(i) for i in range(%d))" % size,
}

print('size={0}, passes={1}, repeat={2}\n'.format(size, passes, repeat))

print('Raw times')
print('         \tTotal\t\tAverage')

n_time = timing(cmd['for'])
print('Null     \t{0:>f}\t{1:>e}'.format(n_time, n_time / passes))

l_time = timing(cmd['join_lc'])
print('Listcomp \t{0:>f}\t{1:>e}'.format(l_time, l_time / passes))

g_time = timing(cmd['join_ge'])
print('Generator\t{0:>f}\t{1:>e}'.format(g_time, g_time / passes))

percent_difference = abs(l_time - g_time) / max(l_time, g_time) * 100
print('% difference: {0:>.2f}'.format(percent_difference))

print('\nAdjusted times')

l_time -= n_time
g_time -= n_time
l_avg = l_time / passes
g_avg = g_time / passes
percent_difference = abs(l_time - g_time) / max(l_time, g_time) * 100

print('         \tTotal\t\tAverage')
print('Listcomp \t{0:>f}\t{1:>e}'.format(l_time, l_time / passes))
print('Generator\t{0:>f}\t{1:>e}'.format(g_time, g_time / passes))
print('% difference: {0:>.2f}'.format(percent_difference))


And here are typical results on my trusty old 2GHz Pentium 4 machine running Python 2.6.6 on Mepis 11 Linux.

Code: Select all

size=10, passes=500000, repeat=3

Raw times
                Total           Average
Null            4.290210        8.580420e-06
Listcomp        7.798899        1.559780e-05
Generator       9.622689        1.924538e-05
% difference: 18.95

Adjusted times
                Total           Average
Listcomp        3.508689        7.017378e-06
Generator       5.332479        1.066496e-05
% difference: 34.20

--------------------------------------------

size=100, passes=50000, repeat=3

Raw times
                Total           Average
Null            4.669853        9.339706e-05
Listcomp        5.780933        1.156187e-04
Generator       6.454906        1.290981e-04
% difference: 10.44

Adjusted times
                Total           Average
Listcomp        1.111080        2.222160e-05
Generator       1.785053        3.570106e-05
% difference: 37.76

--------------------------------------------

size=1000, passes=5000, repeat=3

Raw times
                Total           Average
Null            4.919475        9.838950e-04
Listcomp        5.977431        1.195486e-03
Generator       6.319833        1.263967e-03
% difference: 5.42

Adjusted times
                Total           Average
Listcomp        1.057956        2.115912e-04
Generator       1.400358        2.800715e-04
% difference: 24.45


As you can see, the adjusted times show a difference of 25% - 35%, which is somewhat more significant (IMO) than the differences of the raw times.

FWIW, as I mentioned in an earlier post, there are scoping differences between the Python 2 and Python 3 list comps, and that can affect their speed slightly.

lvc
Posts: 20
Joined: Mon Jun 15, 2009 5:27 am UTC

Re: Python - Best way to turn string of digits into list of

Postby lvc » Sun Jul 26, 2015 12:31 am UTC

A third slower still isn't "much more inefficient", especially when the biggest actual time difference you've reported on your Pentium 4 is still only ~70 microseconds. Is this *really* going to be the slow part of any actual program?


Return to “Coding”

Who is online

Users browsing this forum: No registered users and 10 guests