函数式编程指引

作者:A. M. Kuchling
发布版本:0.31

本文档提供恰当的 Python 函数式编程范例,在函数式编程简单的介绍之后,将简单介绍Python中关于函数式编程的特性如 iteratorgenerator 以及相关库模块如 itertoolsfunctools 等。

概述

This section explains the basic concept of functional programming; if you’re just interested in learning about Python language features, skip to the next section.

编程语言支持通过以下几种方式来解构具体问题:

  • 大多数的编程语言都是 过程式 的,所谓程序就是一连串告诉计算机怎样处理程序输入的指令。C、Pascal 甚至 Unix shells 都是过程式语言。
  • 声明式 语言中,你编写一个用来描述待解决问题的说明,并且这个语言的具体实现会指明怎样高效的进行计算。 SQL 可能是你最熟悉的声明式语言了。 一个 SQL 查询语句描述了你想要检索的数据集,并且 SQL 引擎会决定是扫描整张表还是使用索引,应该先执行哪些子句等等。
  • 面向对象 程序会操作一组对象。 对象拥有内部状态,并能够以某种方式支持请求和修改这个内部状态的方法。Smalltalk 和 Java 都是面向对象的语言。 C++ 和 Python 支持面向对象编程,但并不强制使用面向对象特性。
  • 函数式 编程则将一个问题分解成一系列函数。 理想情况下,函数只接受输入并输出结果,对一个给定的输入也不会有影响输出的内部状态。 著名的函数式语言有 ML 家族(Standard ML,Ocaml 以及其他变种)和 Haskell。

The designers of some computer languages choose to emphasize one particular approach to programming. This often makes it difficult to write programs that use a different approach. Other languages are multi-paradigm languages that support several different approaches. Lisp, C++, and Python are multi-paradigm; you can write programs or libraries that are largely procedural, object-oriented, or functional in all of these languages. In a large program, different sections might be written using different approaches; the GUI might be object-oriented while the processing logic is procedural or functional, for example.

在函数式程序里,输入会流经一系列函数。每个函数接受输入并输出结果。函数式风格反对使用带有副作用的函数,这些副作用会修改内部状态,或者引起一些无法体现在函数的返回值中的变化。完全不产生副作用的函数被称作“纯函数”。消除副作用意味着不能使用随程序运行而更新的数据结构;每个函数的输出必须只依赖于输入。

Some languages are very strict about purity and don’t even have assignment statements such as a=3 or c = a + b, but it’s difficult to avoid all side effects. Printing to the screen or writing to a disk file are side effects, for example. For example, in Python a print statement or a time.sleep(1) both return no useful value; they’re only called for their side effects of sending some text to the screen or pausing execution for a second.

函数式风格的 Python 程序并不会极端到消除所有 I/O 或者赋值的程度;相反,他们会提供像函数式一样的接口,但会在内部使用非函数式的特性。比如,函数的实现仍然会使用局部变量,但不会修改全局变量或者有其他副作用。

函数式编程可以被认为是面向对象编程的对立面。对象就像是颗小胶囊,包裹着内部状态和随之而来的能让你修改这个内部状态的一组调用方法,以及由正确的状态变化所构成的程序。函数式编程希望尽可能地消除状态变化,只和流经函数的数据打交道。在 Python 里你可以把两种编程方式结合起来,在你的应用(电子邮件信息,事务处理)中编写接受和返回对象实例的函数。

函数式设计在工作中看起来是个奇怪的约束。为什么你要消除对象和副作用呢?不过函数式风格有其理论和实践上的优点:

  • 形式证明。
  • 模块化。
  • 组合性。
  • 易于调试和测试。

形式证明

一个理论上的优点是,构造数学证明来说明函数式程序是正确的相对更容易些。

很长时间,研究者们对寻找证明程序正确的数学方法都很感兴趣。这和通过大量输入来测试,并得出程序的输出基本正确,或者阅读一个程序的源代码然后得出代码看起来没问题不同;相反,这里的目标是一个严格的证明,证明程序对所有可能的输入都能给出正确的结果。

证明程序正确性所用到的技术是写出 不变量,也就是对于输入数据和程序中的变量永远为真的特性。然后对每行代码,你说明这行代码执行前的不变量 X 和 Y 以及执行后稍有不同的不变量 X’ 和 Y’ 为真。如此一直到程序结束,这时候在程序的输出上,不变量应该会与期望的状态一致。

函数式编程之所以要消除赋值,是因为赋值在这个技术中难以处理;赋值可能会破坏赋值前为真的不变量,却并不产生任何可以传递下去的新的不变量。

不幸的是,证明程序的正确性很大程度上是经验性质的,而且和 Python 软件无关。即使是微不足道的程序都需要几页长的证明;一个中等复杂的程序的正确性证明会非常庞大,而且,极少甚至没有你日常所使用的程序(Python 解释器,XML 解析器,浏览器)的正确性能够被证明。即使你写出或者生成一个证明,验证证明也会是一个问题;里面可能出了差错,而你错误地相信你证明了程序的正确性。

模块化

函数式编程的一个更实用的优点是,它强制你把问题分解成小的方面。因此程序会更加模块化。相对于一个进行了复杂变换的大型函数,一个小的函数更明确,更易于编写, 也更易于阅读和检查错误。

易于调试和测试

测试和调试函数式程序相对来说更容易。

调试很简单是因为函数通常都很小而且清晰明确。当程序无法工作的时候,每个函数都是一个可以检查数据是否正确的接入点。你可以通过查看中间输入和输出迅速找到出错的函数。

测试更容易是因为每个函数都是单元测试的潜在目标。在执行测试前,函数并不依赖于需要重现的系统状态;相反,你只需要给出正确的输入,然后检查输出是否和期望的结果一致。

组合性

当你编写函数式风格的程序时,你会写出很多带有不同输入和输出的函数。其中一些不可避免地会局限于特定的应用,但其他的却可以广泛的用在程序中。举例来说,一个接受文件夹目录返回所有文件夹中的 XML 文件的函数; 或是一个接受文件名,然后返回文件内容的函数,都可以应用在很多不同的场合。

久而久之你会形成一个个人工具库。通常你可以重新组织已有的函数来组成新的程序,然后为当前的工作写一些特殊的函数。

迭代器

我会从 Python 的一个语言特性, 编写函数式风格程序的重要基石开始说起:迭代器。

An iterator is an object representing a stream of data; this object returns the data one element at a time. A Python iterator must support a method called next() that takes no arguments and always returns the next element of the stream. If there are no more elements in the stream, next() must raise the StopIteration exception. Iterators don’t have to be finite, though; it’s perfectly reasonable to write an iterator that produces an infinite stream of data.

The built-in iter() function takes an arbitrary object and tries to return an iterator that will return the object’s contents or elements, raising TypeError if the object doesn’t support iteration. Several of Python’s built-in data types support iteration, the most common being lists and dictionaries. An object is called an iterable object if you can get an iterator for it.

你可以手动试验迭代器的接口。

>>> L = [1,2,3]
>>> it = iter(L)
>>> print it
<...iterator object at ...>
>>> it.next()
1
>>> it.next()
2
>>> it.next()
3
>>> it.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>

Python expects iterable objects in several different contexts, the most important being the for statement. In the statement for X in Y, Y must be an iterator or some object for which iter() can create an iterator. These two statements are equivalent:

for i in iter(obj):
    print i

for i in obj:
    print i

可以用 list()tuple() 这样的构造函数把迭代器具体化成列表或元组:

>>> L = [1,2,3]
>>> iterator = iter(L)
>>> t = tuple(iterator)
>>> t
(1, 2, 3)

序列的解压操作也支持迭代器:如果你知道一个迭代器能够返回 N 个元素,你可以把他们解压到有 N 个元素的元组:

>>> L = [1,2,3]
>>> iterator = iter(L)
>>> a,b,c = iterator
>>> a,b,c
(1, 2, 3)

Built-in functions such as max() and min() can take a single iterator argument and will return the largest or smallest element. The "in" and "not in" operators also support iterators: X in iterator is true if X is found in the stream returned by the iterator. You’ll run into obvious problems if the iterator is infinite; max(), min() will never return, and if the element X never appears in the stream, the "in" and "not in" operators won’t return either.

Note that you can only go forward in an iterator; there’s no way to get the previous element, reset the iterator, or make a copy of it. Iterator objects can optionally provide these additional capabilities, but the iterator protocol only specifies the next() method. Functions may therefore consume all of the iterator’s output, and if you need to do something different with the same stream, you’ll have to create a new iterator.

支持迭代器的数据类型

我们已经知道列表和元组支持迭代器。实际上,Python 中的任何序列类型,比如字符串,都自动支持创建迭代器。

Calling iter() on a dictionary returns an iterator that will loop over the dictionary’s keys:

>>> m = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
...      'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
>>> for key in m:
...     print key, m[key]
Mar 3
Feb 2
Aug 8
Sep 9
Apr 4
Jun 6
Jul 7
Jan 1
May 5
Nov 11
Dec 12
Oct 10

Note that the order is essentially random, because it’s based on the hash ordering of the objects in the dictionary.

Applying iter() to a dictionary always loops over the keys, but dictionaries have methods that return other iterators. If you want to iterate over keys, values, or key/value pairs, you can explicitly call the iterkeys(), itervalues(), or iteritems() methods to get an appropriate iterator.

dict() 构造函数可以接受一个迭代器,然后返回一个有限的 (key, value) 元组的数据流:

>>> L = [('Italy', 'Rome'), ('France', 'Paris'), ('US', 'Washington DC')]
>>> dict(iter(L))
{'Italy': 'Rome', 'US': 'Washington DC', 'France': 'Paris'}

Files also support iteration by calling the readline() method until there are no more lines in the file. This means you can read each line of a file like this:

for line in file:
    # do something for each line
    ...

集合可以从可遍历的对象获取内容,也可以让你遍历集合的元素:

S = set((2, 3, 5, 7, 11, 13))
for i in S:
    print i

生成器表达式和列表推导式

迭代器的输出有两个很常见的使用方式,1) 对每一个元素执行操作,2) 选择一个符合条件的元素子集。比如,给定一个字符串列表,你可能想去掉每个字符串尾部的空白字符,或是选出所有包含给定子串的字符串。

列表推导式和生成器表达时(简写:”listcomps” 和 “genexps”)让这些操作更加简明,这个形式借鉴自函数式程序语言 Haskell(https://www.haskell.org/)。你可以用以下代码去掉一个字符串流中的所有空白字符:

line_list = ['  line 1\n', 'line 2  \n', ...]

# Generator expression -- returns iterator
stripped_iter = (line.strip() for line in line_list)

# List comprehension -- returns list
stripped_list = [line.strip() for line in line_list]

你可以加上条件语句 "if" 来选取特定的元素:

stripped_list = [line.strip() for line in line_list
                 if line != ""]

通过列表推导式,你会获得一个 Python 列表;stripped_list 就是一个包含所有结果行的列表,并不是迭代器。 生成器表达式会返回一个迭代器,它在必要的时候计算结果,避免一次性生成所有的值。 这意味着,如果迭代器返回一个无限数据流或者大量的数据,列表推导式就不太好用了。 这种情况下生成器表达式会更受青睐。

生成器表达式两边使用圆括号 (“()”) ,而列表推导式则使用方括号 (“[]”)。生成器表达式的形式为:

( expression for expr in sequence1
             if condition1
             for expr2 in sequence2
             if condition2
             for expr3 in sequence3 ...
             if condition3
             for exprN in sequenceN
             if conditionN )

再次说明,列表推导式只有两边的括号不一样(方括号而不是圆括号)。

这些生成用于输出的元素会成为 expression 的后继值。其中 if 语句是可选的;如果给定的话 expression 只会在符合条件时计算并加入到结果中。

生成器表达式总是写在圆括号里面,不过也可以算上调用函数时用的括号。如果你想即时创建一个传递给函数的迭代器,可以这么写:

obj_total = sum(obj.count for obj in list_all_objects())

其中 for...in 语句包含了将要遍历的序列。这些序列并不必须同样长,因为它们会从左往右开始遍历,而 不是 同时执行。对每个 sequence1 中的元素,sequence2 会从头开始遍历。sequence3 会对每个 sequence1sequence2 的元素对开始遍历。

换句话说,列表推导式器是和下面的 Python 代码等价:

for expr1 in sequence1:
    if not (condition1):
        continue   # Skip this element
    for expr2 in sequence2:
        if not (condition2):
            continue   # Skip this element
        ...
        for exprN in sequenceN:
            if not (conditionN):
                continue   # Skip this element

            # Output the value of
            # the expression.

这说明,如果有多个 for...in 语句而没有 if 语句,输出结果的长度就是所有序列长度的乘积。如果你的两个列表长度为3,那么输出的列表长度就是9:

>>> seq1 = 'abc'
>>> seq2 = (1,2,3)
>>> [(x,y) for x in seq1 for y in seq2]
[('a', 1), ('a', 2), ('a', 3),
 ('b', 1), ('b', 2), ('b', 3),
 ('c', 1), ('c', 2), ('c', 3)]

为了不让 Python 语法变得含糊,如果 expression 会生成元组,那这个元组必须要用括号括起来。下面第一个列表推导式语法错误,第二个则是正确的:

# Syntax error
[ x,y for x in seq1 for y in seq2]
# Correct
[ (x,y) for x in seq1 for y in seq2]

生成器

生成器是一类用来简化编写迭代器工作的特殊函数。普通的函数计算并返回一个值,而生成器返回一个能返回数据流的迭代器。

毫无疑问,你已经对如何在 Python 和 C 中调用普通函数很熟悉了,这时候函数会获得一个创建局部变量的私有命名空间。当函数到达 return 表达式时,局部变量会被销毁然后把返回给调用者。之后调用同样的函数时会创建一个新的私有命名空间和一组全新的局部变量。但是,如果在退出一个函数时不扔掉局部变量会如何呢?如果稍后你能够从退出函数的地方重新恢复又如何呢?这就是生成器所提供的;他们可以被看成可恢复的函数。

这里有简单的生成器函数示例:

def generate_ints(N):
    for i in range(N):
        yield i

Any function containing a yield keyword is a generator function; this is detected by Python’s bytecode compiler which compiles the function specially as a result.

When you call a generator function, it doesn’t return a single value; instead it returns a generator object that supports the iterator protocol. On executing the yield expression, the generator outputs the value of i, similar to a return statement. The big difference between yield and a return statement is that on reaching a yield the generator’s state of execution is suspended and local variables are preserved. On the next call to the generator’s .next() method, the function will resume executing.

这里有一个 generate_ints() 生成器的示例:

>>> gen = generate_ints(3)
>>> gen
<generator object generate_ints at ...>
>>> gen.next()
0
>>> gen.next()
1
>>> gen.next()
2
>>> gen.next()
Traceback (most recent call last):
  File "stdin", line 1, in <module>
  File "stdin", line 2, in generate_ints
StopIteration

You could equally write for i in generate_ints(5), or a,b,c = generate_ints(3).

Inside a generator function, the return statement can only be used without a value, and signals the end of the procession of values; after executing a return the generator cannot return any further values. return with a value, such as return 5, is a syntax error inside a generator function. The end of the generator’s results can also be indicated by raising StopIteration manually, or by just letting the flow of execution fall off the bottom of the function.

You could achieve the effect of generators manually by writing your own class and storing all the local variables of the generator as instance variables. For example, returning a list of integers could be done by setting self.count to 0, and having the next() method increment self.count and return it. However, for a moderately complicated generator, writing a corresponding class can be much messier.

The test suite included with Python’s library, test_generators.py, contains a number of more interesting examples. Here’s one generator that implements an in-order traversal of a tree using generators recursively.

# A recursive generator that generates Tree leaves in in-order.
def inorder(t):
    if t:
        for x in inorder(t.left):
            yield x

        yield t.label

        for x in inorder(t.right):
            yield x

另外两个 test_generators.py 中的例子给出了 N 皇后问题(在 NxN 的棋盘上放置 N 个皇后,任何一个都不能吃掉另一个),以及马的遍历路线(在NxN 的棋盘上给马找出一条不重复的走过所有格子的路线)的解。

向生成器传递值

在 Python 2.4 及之前的版本中,生成器只产生输出。一旦调用生成器的代码创建一个迭代器,就没有办法在函数恢复执行的时候向它传递新的信息。你可以设法实现这个功能,让生成器引用一个全局变量或者一个调用者可以修改的可变对象,但是这些方法都很繁杂。

在 Python 2.5 里有一个简单的将值传递给生成器的方法。yield 变成了一个表达式,返回一个可以赋给变量或执行操作的值:

val = (yield i)

我建议你在处理 yield 表达式返回值的时候, 总是 两边写上括号,就像上面的例子一样。括号并不总是必须的,但是比起记住什么时候需要括号,写出来会更容易一点。

(PEP 342 explains the exact rules, which are that a yield-expression must always be parenthesized except when it occurs at the top-level expression on the right-hand side of an assignment. This means you can write val = yield i but have to use parentheses when there’s an operation, as in val = (yield i) + 12.)

Values are sent into a generator by calling its send(value) method. This method resumes the generator’s code and the yield expression returns the specified value. If the regular next() method is called, the yield returns None.

这里有一个简单的每次加1的计数器,并允许改变内部计数器的值。

def counter (maximum):
    i = 0
    while i < maximum:
        val = (yield i)
        # If value provided, change counter
        if val is not None:
            i = val
        else:
            i += 1

这是改变计数器的一个示例

>>> it = counter(10)
>>> print it.next()
0
>>> print it.next()
1
>>> print it.send(8)
8
>>> print it.next()
9
>>> print it.next()
Traceback (most recent call last):
  File "t.py", line 15, in <module>
    print it.next()
StopIteration

Because yield will often be returning None, you should always check for this case. Don’t just use its value in expressions unless you’re sure that the send() method will be the only method used to resume your generator function.

In addition to send(), there are two other new methods on generators:

  • throw(type, value=None, traceback=None) is used to raise an exception inside the generator; the exception is raised by the yield expression where the generator’s execution is paused.

  • close() raises a GeneratorExit exception inside the generator to terminate the iteration. On receiving this exception, the generator’s code must either raise GeneratorExit or StopIteration; catching the exception and doing anything else is illegal and will trigger a RuntimeError. close() will also be called by Python’s garbage collector when the generator is garbage-collected.

    如果你要在 GeneratorExit 发生的时候清理代码,我建议使用 try: ... finally: 组合来代替 GeneratorExit

这些改变的累积效应是,让生成器从单向的信息生产者变成了既是生产者,又是消费者。

生成器也可以成为 协程 ,一种更广义的子过程形式。子过程可以从一个地方进入,然后从另一个地方退出(从函数的顶端进入,从 return 语句退出),而协程可以进入,退出,然后在很多不同的地方恢复(yield 语句)。

内置函数

我们可以看看迭代器常常用到的函数的更多细节。

Two of Python’s built-in functions, map() and filter(), are somewhat obsolete; they duplicate the features of list comprehensions but return actual lists instead of iterators.

map(f, iterA, iterB, ...) returns a list containing f(iterA[0], iterB[0]), f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ....

>>> def upper(s):
...     return s.upper()
>>> map(upper, ['sentence', 'fragment'])
['SENTENCE', 'FRAGMENT']
>>> [upper(s) for s in ['sentence', 'fragment']]
['SENTENCE', 'FRAGMENT']

As shown above, you can achieve the same effect with a list comprehension. The itertools.imap() function does the same thing but can handle infinite iterators; it’ll be discussed later, in the section on the itertools module.

filter(predicate, iter) returns a list that contains all the sequence elements that meet a certain condition, and is similarly duplicated by list comprehensions. A predicate is a function that returns the truth value of some condition; for use with filter(), the predicate must take a single value.

>>> def is_even(x):
...     return (x % 2) == 0
>>> filter(is_even, range(10))
[0, 2, 4, 6, 8]

这也可以写成列表推导式:

>>> [x for x in range(10) if is_even(x)]
[0, 2, 4, 6, 8]

filter() also has a counterpart in the itertools module, itertools.ifilter(), that returns an iterator and can therefore handle infinite sequences just as itertools.imap() can.

reduce(func, iter, [initial_value]) doesn’t have a counterpart in the itertools module because it cumulatively performs an operation on all the iterable’s elements and therefore can’t be applied to infinite iterables. func must be a function that takes two elements and returns a single value. reduce() takes the first two elements A and B returned by the iterator and calculates func(A, B). It then requests the third element, C, calculates func(func(A, B), C), combines this result with the fourth element returned, and continues until the iterable is exhausted. If the iterable returns no values at all, a TypeError exception is raised. If the initial value is supplied, it’s used as a starting point and func(initial_value, A) is the first calculation.

>>> import operator
>>> reduce(operator.concat, ['A', 'BB', 'C'])
'ABBC'
>>> reduce(operator.concat, [])
Traceback (most recent call last):
  ...
TypeError: reduce() of empty sequence with no initial value
>>> reduce(operator.mul, [1,2,3], 1)
6
>>> reduce(operator.mul, [], 1)
1

If you use operator.add() with reduce(), you’ll add up all the elements of the iterable. This case is so common that there’s a special built-in called sum() to compute it:

>>> reduce(operator.add, [1,2,3,4], 0)
10
>>> sum([1,2,3,4])
10
>>> sum([])
0

For many uses of reduce(), though, it can be clearer to just write the obvious for loop:

# Instead of:
product = reduce(operator.mul, [1,2,3], 1)

# You can write:
product = 1
for i in [1,2,3]:
    product *= i

enumerate(iter) counts off the elements in the iterable, returning 2-tuples containing the count and each element.

>>> for item in enumerate(['subject', 'verb', 'object']):
...     print item
(0, 'subject')
(1, 'verb')
(2, 'object')

enumerate() 常常用于遍历列表并记录达到特定条件时的下标:

f = open('data.txt', 'r')
for i, line in enumerate(f):
    if line.strip() == '':
        print 'Blank line at line #%i' % i

sorted(iterable, [cmp=None], [key=None], [reverse=False]) collects all the elements of the iterable into a list, sorts the list, and returns the sorted result. The cmp, key, and reverse arguments are passed through to the constructed list’s .sort() method.

>>> import random
>>> # Generate 8 random numbers between [0, 10000)
>>> rand_list = random.sample(range(10000), 8)
>>> rand_list
[769, 7953, 9828, 6431, 8442, 9878, 6213, 2207]
>>> sorted(rand_list)
[769, 2207, 6213, 6431, 7953, 8442, 9828, 9878]
>>> sorted(rand_list, reverse=True)
[9878, 9828, 8442, 7953, 6431, 6213, 2207, 769]

(For a more detailed discussion of sorting, see the Sorting mini-HOWTO in the Python wiki at https://wiki.python.org/moin/HowTo/Sorting.)

The any(iter) and all(iter) built-ins look at the truth values of an iterable’s contents. any() returns True if any element in the iterable is a true value, and all() returns True if all of the elements are true values:

>>> any([0,1,0])
True
>>> any([0,0,0])
False
>>> any([1,1,1])
True
>>> all([0,1,0])
False
>>> all([0,0,0])
False
>>> all([1,1,1])
True

Small functions and the lambda expression

When writing functional-style programs, you’ll often need little functions that act as predicates or that combine elements in some way.

If there’s a Python built-in or a module function that’s suitable, you don’t need to define a new function at all:

stripped_lines = [line.strip() for line in lines]
existing_files = filter(os.path.exists, file_list)

If the function you need doesn’t exist, you need to write it. One way to write small functions is to use the lambda statement. lambda takes a number of parameters and an expression combining these parameters, and creates a small function that returns the value of the expression:

lowercase = lambda x: x.lower()

print_assign = lambda name, value: name + '=' + str(value)

adder = lambda x, y: x+y

An alternative is to just use the def statement and define a function in the usual way:

def lowercase(x):
    return x.lower()

def print_assign(name, value):
    return name + '=' + str(value)

def adder(x,y):
    return x + y

Which alternative is preferable? That’s a style question; my usual course is to avoid using lambda.

One reason for my preference is that lambda is quite limited in the functions it can define. The result has to be computable as a single expression, which means you can’t have multiway if... elif... else comparisons or try... except statements. If you try to do too much in a lambda statement, you’ll end up with an overly complicated expression that’s hard to read. Quick, what’s the following code doing?

total = reduce(lambda a, b: (0, a[1] + b[1]), items)[1]

You can figure it out, but it takes time to disentangle the expression to figure out what’s going on. Using a short nested def statements makes things a little bit better:

def combine (a, b):
    return 0, a[1] + b[1]

total = reduce(combine, items)[1]

But it would be best of all if I had simply used a for loop:

total = 0
for a, b in items:
    total += b

Or the sum() built-in and a generator expression:

total = sum(b for a,b in items)

Many uses of reduce() are clearer when written as for loops.

Fredrik Lundh once suggested the following set of rules for refactoring uses of lambda:

  1. Write a lambda function.
  2. Write a comment explaining what the heck that lambda does.
  3. Study the comment for a while, and think of a name that captures the essence of the comment.
  4. Convert the lambda to a def statement, using that name.
  5. Remove the comment.

I really like these rules, but you’re free to disagree about whether this lambda-free style is better.

itertools 模块

itertools 模块包含很多常用的迭代器以及用来组合迭代器的函数。本节会用些小的例子来介绍这个模块的内容。

这个模块里的函数大致可以分为几类:

  • 从已有的迭代器创建新的迭代器的函数。
  • 接受迭代器元素作为参数的函数。
  • 选取部分迭代器输出的函数。
  • 给迭代器输出分组的函数。

创建新的迭代器

itertools.count(n) returns an infinite stream of integers, increasing by 1 each time. You can optionally supply the starting number, which defaults to 0:

itertools.count() =>
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
itertools.count(10) =>
  10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...

itertools.cycle(iter) saves a copy of the contents of a provided iterable and returns a new iterator that returns its elements from first to last. The new iterator will repeat these elements infinitely.

itertools.cycle([1,2,3,4,5]) =>
  1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...

itertools.repeat(elem, [n]) returns the provided element n times, or returns the element endlessly if n is not provided.

itertools.repeat('abc') =>
  abc, abc, abc, abc, abc, abc, abc, abc, abc, abc, ...
itertools.repeat('abc', 5) =>
  abc, abc, abc, abc, abc

itertools.chain(iterA, iterB, ...) takes an arbitrary number of iterables as input, and returns all the elements of the first iterator, then all the elements of the second, and so on, until all of the iterables have been exhausted.

itertools.chain(['a', 'b', 'c'], (1, 2, 3)) =>
  a, b, c, 1, 2, 3

itertools.izip(iterA, iterB, ...) takes one element from each iterable and returns them in a tuple:

itertools.izip(['a', 'b', 'c'], (1, 2, 3)) =>
  ('a', 1), ('b', 2), ('c', 3)

It’s similar to the built-in zip() function, but doesn’t construct an in-memory list and exhaust all the input iterators before returning; instead tuples are constructed and returned only if they’re requested. (The technical term for this behaviour is lazy evaluation.)

这个迭代器设计用于长度相同的可迭代对象。如果可迭代对象的长度不一致,返回的数据流的长度会和最短的可迭代对象相同

itertools.izip(['a', 'b'], (1, 2, 3)) =>
  ('a', 1), ('b', 2)

然而,你应该避免这种情况,因为所有从更长的迭代器中取出的元素都会被丢弃。这意味着之后你也无法冒着跳过被丢弃元素的风险来继续使用这个迭代器。

itertools.islice(iter, [start], stop, [step]) returns a stream that’s a slice of the iterator. With a single stop argument, it will return the first stop elements. If you supply a starting index, you’ll get stop-start elements, and if you supply a value for step, elements will be skipped accordingly. Unlike Python’s string and list slicing, you can’t use negative values for start, stop, or step.

itertools.islice(range(10), 8) =>
  0, 1, 2, 3, 4, 5, 6, 7
itertools.islice(range(10), 2, 8) =>
  2, 3, 4, 5, 6, 7
itertools.islice(range(10), 2, 8, 2) =>
  2, 4, 6

itertools.tee(iter, [n]) replicates an iterator; it returns n independent iterators that will all return the contents of the source iterator. If you don’t supply a value for n, the default is 2. Replicating iterators requires saving some of the contents of the source iterator, so this can consume significant memory if the iterator is large and one of the new iterators is consumed more than the others.

itertools.tee( itertools.count() ) =>
   iterA, iterB

where iterA ->
   0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...

and   iterB ->
   0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...

itertools.tee(iter, [n]) 可以复制一个迭代器;它返回 n 个能够返回源迭代器内容的独立迭代器。如果你不提供参数 n,默认值为 2。复制迭代器需要保存源迭代器的一部分内容,因此在源迭代器比较大的时候会显著地占用内存;同时,新迭代器中的一个会比其他迭代器占用更多的内存。

Two functions are used for calling other functions on the contents of an iterable.

itertools.imap(f, iterA, iterB, ...) returns a stream containing f(iterA[0], iterB[0]), f(iterA[1], iterB[1]), f(iterA[2], iterB[2]), ...:

itertools.imap(operator.add, [5, 6, 5], [1, 2, 3]) =>
  6, 8, 8

The operator module contains a set of functions corresponding to Python’s operators. Some examples are operator.add(a, b) (adds two values), operator.ne(a, b) (same as a!=b), and operator.attrgetter('id') (returns a callable that fetches the "id" attribute).

itertools.starmap(func, iter) assumes that the iterable will return a stream of tuples, and calls f() using these tuples as the arguments:

itertools.starmap(os.path.join,
                  [('/usr', 'bin', 'java'), ('/bin', 'python'),
                   ('/usr', 'bin', 'perl'),('/usr', 'bin', 'ruby')])
=>
  /usr/bin/java, /bin/python, /usr/bin/perl, /usr/bin/ruby

选择元素

另外一系列函数根据谓词选取一个迭代器中元素的子集。

itertools.ifilter(predicate, iter) returns all the elements for which the predicate returns true:

def is_even(x):
    return (x % 2) == 0

itertools.ifilter(is_even, itertools.count()) =>
  0, 2, 4, 6, 8, 10, 12, 14, ...

itertools.ifilterfalse(predicate, iter) is the opposite, returning all elements for which the predicate returns false:

itertools.ifilterfalse(is_even, itertools.count()) =>
  1, 3, 5, 7, 9, 11, 13, 15, ...

itertools.takewhile(predicate, iter) returns elements for as long as the predicate returns true. Once the predicate returns false, the iterator will signal the end of its results.

def less_than_10(x):
    return (x < 10)

itertools.takewhile(less_than_10, itertools.count()) =>
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9

itertools.takewhile(is_even, itertools.count()) =>
  0

itertools.dropwhile(predicate, iter) discards elements while the predicate returns true, and then returns the rest of the iterable’s results.

itertools.dropwhile(less_than_10, itertools.count()) =>
  10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...

itertools.dropwhile(is_even, itertools.count()) =>
  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...

Grouping elements

The last function I’ll discuss, itertools.groupby(iter, key_func=None), is the most complicated. key_func(elem) is a function that can compute a key value for each element returned by the iterable. If you don’t supply a key function, the key is simply each element itself.

groupby() collects all the consecutive elements from the underlying iterable that have the same key value, and returns a stream of 2-tuples containing a key value and an iterator for the elements with that key.

city_list = [('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL'),
             ('Anchorage', 'AK'), ('Nome', 'AK'),
             ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ'),
             ...
            ]

def get_state ((city, state)):
    return state

itertools.groupby(city_list, get_state) =>
  ('AL', iterator-1),
  ('AK', iterator-2),
  ('AZ', iterator-3), ...

where
iterator-1 =>
  ('Decatur', 'AL'), ('Huntsville', 'AL'), ('Selma', 'AL')
iterator-2 =>
  ('Anchorage', 'AK'), ('Nome', 'AK')
iterator-3 =>
  ('Flagstaff', 'AZ'), ('Phoenix', 'AZ'), ('Tucson', 'AZ')

groupby() assumes that the underlying iterable’s contents will already be sorted based on the key. Note that the returned iterators also use the underlying iterable, so you have to consume the results of iterator-1 before requesting iterator-2 and its corresponding key.

The functools module

The functools module in Python 2.5 contains some higher-order functions. A higher-order function takes one or more functions as input and returns a new function. The most useful tool in this module is the functools.partial() function.

For programs written in a functional style, you’ll sometimes want to construct variants of existing functions that have some of the parameters filled in. Consider a Python function f(a, b, c); you may wish to create a new function g(b, c) that’s equivalent to f(1, b, c); you’re filling in a value for one of f()’s parameters. This is called “partial function application”.

The constructor for partial takes the arguments (function, arg1, arg2, ... kwarg1=value1, kwarg2=value2). The resulting object is callable, so you can just call it to invoke function with the filled-in arguments.

Here’s a small but realistic example:

import functools

def log (message, subsystem):
    "Write the contents of 'message' to the specified subsystem."
    print '%s: %s' % (subsystem, message)
    ...

server_log = functools.partial(log, subsystem='server')
server_log('Unable to open socket')

The operator module

The operator module was mentioned earlier. It contains a set of functions corresponding to Python’s operators. These functions are often useful in functional-style code because they save you from writing trivial functions that perform a single operation.

Some of the functions in this module are:

  • Math operations: add(), sub(), mul(), div(), floordiv(), abs(), …
  • Logical operations: not_(), truth().
  • Bitwise operations: and_(), or_(), invert().
  • Comparisons: eq(), ne(), lt(), le(), gt(), and ge().
  • Object identity: is_(), is_not().

Consult the operator module’s documentation for a complete list.

Revision History and Acknowledgements

The author would like to thank the following people for offering suggestions, corrections and assistance with various drafts of this article: Ian Bicking, Nick Coghlan, Nick Efford, Raymond Hettinger, Jim Jewett, Mike Krell, Leandro Lameiro, Jussi Salmela, Collin Winter, Blake Winton.

Version 0.1: posted June 30 2006.

Version 0.11: posted July 1 2006. Typo fixes.

Version 0.2: posted July 10 2006. Merged genexp and listcomp sections into one. Typo fixes.

Version 0.21: Added more references suggested on the tutor mailing list.

Version 0.30: Adds a section on the functional module written by Collin Winter; adds short section on the operator module; a few other edits.

引用文献

通用文献

Structure and Interpretation of Computer Programs, Harold Abelson, Gerald Jay Sussman 和 Julie Sussman 著。全文可见 https://mitpress.mit.edu/sicp/ 。在这部计算机科学的经典教科书中,第二和第三章讨论了使用序列和流来组织程序内部的数据传递。书中的示例采用 Scheme 语言,但其中这些章节中描述的很多设计方法同样适用于函数式风格的 Python 代码。

http://www.defmacro.org/ramblings/fp.html: 一个使用 Java 示例的函数式编程的总体介绍,有很长的历史说明。

https://en.wikipedia.org/wiki/Functional_programming: 一般性的函数式编程的 Wikipedia 条目。

https://en.wikipedia.org/wiki/Coroutine: 协程条目。

https://en.wikipedia.org/wiki/Currying: 函数柯里化条目。

Python 相关

http://gnosis.cx/TPiP/:David Mertz 书中的第一章 Text Processing in Python,”Utilizing Higher-Order Functions in Text Processing” 标题部分讨论了文本处理的函数式编程。

Mertz also wrote a 3-part series of articles on functional programming for IBM’s DeveloperWorks site; see

part 1, part 2, and part 3,

Python 文档

itertools 模块文档。

operator 模块文档。

PEP 289: “Generator Expressions”

PEP 342: “Coroutines via Enhanced Generators” 描述了 Python 2.5 中新的生成器特性。