Reading Notes
Python Parallel Programming Cookbook (《Python并行编程实战》), Chapter 3: Process-Based Parallelism
The multiprocessing module is part of the Python standard library and is used to implement process-based parallelism.
It requires that child processes be able to import the main module, which is why the entry point is protected with if __name__ == '__main__'.
Steps for creating a process:
import multiprocessing

def myFunc(i):
    print("calling myFunc from process n : %s" % i)
    for j in range(0, i):
        print('output from myFunc is : %s' % j)

if __name__ == '__main__':
    for i in range(6):
        process = multiprocessing.Process(target=myFunc, args=(i,))
        process.start()
        process.join()
join() makes the main process wait for the child to finish; without it, the child process may be left idle without terminating and would have to be killed manually.
calling myFunc from process n : 0
calling myFunc from process n : 1
output from myFunc is : 0
calling myFunc from process n : 2
output from myFunc is : 0
output from myFunc is : 1
calling myFunc from process n : 3
output from myFunc is : 0
output from myFunc is : 1
output from myFunc is : 2
calling myFunc from process n : 4
output from myFunc is : 0
output from myFunc is : 1
output from myFunc is : 2
output from myFunc is : 3
calling myFunc from process n : 5
output from myFunc is : 0
output from myFunc is : 1
output from myFunc is : 2
output from myFunc is : 3
output from myFunc is : 4
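Note that calling join() inside the loop above makes each child finish before the next one starts, so the six processes actually run one after another. A minimal sketch (my own variant, not from the book) that starts all children first and joins them afterwards, so they can run concurrently:

import multiprocessing

def myFunc(i):
    print("calling myFunc from process n : %s" % i)

if __name__ == '__main__':
    processes = [multiprocessing.Process(target=myFunc, args=(i,)) for i in range(6)]
    for p in processes:
        p.start()          # launch all children first
    for p in processes:
        p.join()           # then wait for every child to finish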
Attributes of the currently running process can be accessed with multiprocessing.current_process(); the name attribute identifies the process by name.
import multiprocessing
import time

def myFunc():
    name = multiprocessing.current_process().name
    print("Starting process name = %s\n" % name)
    time.sleep(3)
    print("Exiting process name = %s" % name)

if __name__ == '__main__':
    process_with_name = multiprocessing.Process(name='myFunc process', target=myFunc)
    process_without_default_name = multiprocessing.Process(target=myFunc)
    process_with_name.start()
    process_without_default_name.start()
    # The main process blocks until both children finish
    process_with_name.join()
    process_without_default_name.join()
Starting process name = myFunc process
Starting process name = Process-2
Exiting process name = Process-2
Exiting process name = myFunc process
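Besides name, the object returned by current_process() also exposes attributes such as pid and daemon. A small sketch (my own illustration, not from the book):

import multiprocessing

def show_info():
    proc = multiprocessing.current_process()
    # name, pid and daemon describe the process this code is running in
    print('name=%s pid=%s daemon=%s' % (proc.name, proc.pid, proc.daemon))

if __name__ == '__main__':
    p = multiprocessing.Process(name='info process', target=show_info)
    p.start()
    p.join()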
A daemon process is one that runs in the background; assigning process.daemon = True before start() turns the process into a daemon. Daemon children are terminated automatically when the main process exits, which is why the background process in the example below produces no output of its own.
import multiprocessing
import time

def foo():
    name = multiprocessing.current_process().name
    print('Starting %s \n' % name)
    if name == 'background_process':
        for i in range(5):
            print('--> %d\n' % i)
            time.sleep(1)
    else:
        for i in range(5, 10):
            print('--> %d\n' % i)
            time.sleep(1)
    print('Exiting %s\n' % name)

if __name__ == '__main__':
    background_process = multiprocessing.Process(
        name='background_process', target=foo)
    background_process.daemon = True
    No_background_process = multiprocessing.Process(
        name='No_background_process', target=foo)
    No_background_process.daemon = False
    background_process.start()
    No_background_process.start()
Starting No_background_process
--> 5
--> 6
--> 7
--> 8
--> 9
Exiting No_background_process
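To actually see the daemon's output, the main process has to stay alive long enough, for example by joining it. A minimal sketch (my own variant; background_task is a hypothetical helper):

import multiprocessing
import time

def background_task():
    # hypothetical helper used only for this illustration
    for i in range(3):
        print('daemon --> %d' % i)
        time.sleep(1)

if __name__ == '__main__':
    bg = multiprocessing.Process(name='background_process', target=background_task)
    bg.daemon = True
    bg.start()
    bg.join()  # without this join, the daemon is killed as soon as the main process exits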
No software is perfect, and killing a process is sometimes necessary.
import multiprocessing
import time

def foo():
    print('starting function')
    for i in range(0, 10):
        print('-->%d\n' % i)
        time.sleep(1)
    print('Finished function')

if __name__ == '__main__':
    p = multiprocessing.Process(target=foo)
    print('Process before execution:', p, p.is_alive())
    p.start()
    print('Process running:', p, p.is_alive())
    p.terminate()
    print('Process terminated:', p, p.is_alive())
    p.join()
    print('Process joined:', p, p.is_alive())
    print('Process exit code:', p.exitcode)
Process before execution: <Process name='Process-1' parent=13404 initial> False
Process running: <Process name='Process-1' pid=2592 parent=13404 started> True
Process terminated: <Process name='Process-1' pid=2592 parent=13404 started> True
Process joined: <Process name='Process-1' pid=2592 parent=13404 stopped exitcode=-SIGTERM> False
Process exit code: -15
Meaning of exitcode: 0 means the process terminated without errors; a positive value means the process had an error and exited with that code; a negative value -N means the process was killed by signal N, so the -15 here corresponds to the SIGTERM sent by terminate().
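A small sketch (my own illustration, not from the book) showing the three cases:

import multiprocessing
import time

def exit_ok():
    pass                          # returns normally -> exitcode 0

def exit_error():
    raise RuntimeError('boom')    # uncaught exception -> exitcode 1

def run_forever():
    time.sleep(30)                # will be terminated -> negative exitcode

if __name__ == '__main__':
    p1 = multiprocessing.Process(target=exit_ok)
    p2 = multiprocessing.Process(target=exit_error)
    p3 = multiprocessing.Process(target=run_forever)
    for p in (p1, p2, p3):
        p.start()
    p3.terminate()
    for p in (p1, p2, p3):
        p.join()
    print(p1.exitcode, p2.exitcode, p3.exitcode)  # typically 0, 1, -15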
To implement a custom multiprocessing subclass, define a new class that inherits from multiprocessing.Process and override its run() method:
import multiprocessing

class MyProcess(multiprocessing.Process):
    def run(self):
        print('called run method by %s' % self.name)

if __name__ == '__main__':
    for i in range(10):
        process = MyProcess()
        process.start()
        process.join()
called run method by MyProcess-1
called run method by MyProcess-2
called run method by MyProcess-3
called run method by MyProcess-4
called run method by MyProcess-5
called run method by MyProcess-6
called run method by MyProcess-7
called run method by MyProcess-8
called run method by MyProcess-9
called run method by MyProcess-10
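If the subclass needs arguments of its own, also override __init__ and remember to call the parent constructor. A minimal sketch (SquareProcess is a hypothetical example of mine, not from the book):

import multiprocessing

class SquareProcess(multiprocessing.Process):
    def __init__(self, value):
        multiprocessing.Process.__init__(self)  # initialize the Process machinery
        self.value = value

    def run(self):
        print('%s computed %d' % (self.name, self.value * self.value))

if __name__ == '__main__':
    processes = [SquareProcess(i) for i in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()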
A queue is a first-in, first-out (FIFO) data structure, just like people standing in line.
The classic producer/consumer problem is used here as an exercise: producers create items, consumers consume them, and a shared buffer holds the items in between. Mutually exclusive access to the buffer and produce-before-consume synchronization are required; the problem itself is a classic operating-systems topic and is not discussed in detail here.
This implementation does not use semaphores explicitly; in principle access to the queue must be mutually exclusive, but multiprocessing.Queue is itself thread- and process-safe, so put/get need no extra locking.
import multiprocessing
import random
import time

class producer(multiprocessing.Process):
    def __init__(self, queue):
        multiprocessing.Process.__init__(self)
        self.queue = queue

    def run(self):
        for i in range(5):
            item = random.randint(0, 256)
            self.queue.put(item)
            print('Process Producer : item %d appended to queue by %s' % (item, self.name))
            time.sleep(1)
            print('The size of queue is %s' % self.queue.qsize())

class consumer(multiprocessing.Process):
    def __init__(self, queue):
        multiprocessing.Process.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            if self.queue.empty():
                print('the queue is empty')
                break
            else:
                time.sleep(2)
                item = self.queue.get()
                print('Process Consumer : item %d popped from queue by %s\n' % (item, self.name))
                time.sleep(1)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    process_producer = producer(queue)
    process_consumer = consumer(queue)
    process_producer.start()
    time.sleep(1)
    process_consumer.start()
    process_producer.join()
    process_consumer.join()
Process Producer : item 23 appended to queue by producer-1
The size of queue is 1
Process Producer : item 18 appended to queue by producer-1
The size of queue is 2
Process Producer : item 0 appended to queue by producer-1
Process Consumer : item 23 popped from queue by consumer-2
The size of queue is 2
Process Producer : item 3 appended to queue by producer-1
The size of queue is 3
Process Producer : item 36 appended to queue by producer-1
The size of queue is 4
Process Consumer : item 18 popped from queue by consumer-2
Process Consumer : item 0 popped from queue by consumer-2
Process Consumer : item 3 popped from queue by consumer-2
Process Consumer : item 36 popped from queue by consumer-2
the queue is empty
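Note that the consumer above relies on queue.empty(), which can race with a producer that has not finished yet. A common alternative is to push a sentinel value and stop when it is received; a minimal sketch (the None sentinel and the helper names are my own choices, not from the book):

import multiprocessing

def produce(queue):
    for item in range(5):
        queue.put(item)
    queue.put(None)          # sentinel: nothing more will be produced

def consume(queue):
    while True:
        item = queue.get()
        if item is None:     # sentinel received, stop consuming
            break
        print('consumed %d' % item)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=produce, args=(queue,))
    c = multiprocessing.Process(target=consume, args=(queue,))
    p.start()
    c.start()
    p.join()
    c.join()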
What is a pipe? A pipe connects two processes, which communicate through it by sending and receiving objects.
Process 1 creates the numbers 0~9 and sends them into pipe_1; process 2 receives the numbers from pipe_1, squares them, and sends them into pipe_2; finally the results are received from pipe_2.
import multiprocessing

def create_items(pipe):
    output_pipe, _ = pipe
    for item in range(10):
        output_pipe.send(item)
    output_pipe.close()

def multiply_items(pipe_1, pipe_2):
    close, input_pipe = pipe_1
    close.close()
    output_pipe, _ = pipe_2
    try:
        while True:
            item = input_pipe.recv()
            output_pipe.send(item * item)  # send the square of each received item downstream
    except EOFError:
        output_pipe.close()

if __name__ == '__main__':
    # Process 1 creates the numbers 0~9 and sends them into pipe_1
    pipe_1 = multiprocessing.Pipe(True)
    process_pipe_1 = multiprocessing.Process(target=create_items, args=(pipe_1,))
    process_pipe_1.start()
    # Process 2 receives numbers from pipe_1, squares them and sends them into pipe_2
    pipe_2 = multiprocessing.Pipe(True)
    process_pipe_2 = multiprocessing.Process(target=multiply_items, args=(pipe_1, pipe_2,))
    process_pipe_2.start()
    # Close the write ends held by the main process
    pipe_1[0].close()
    pipe_2[0].close()
    # Print the results
    try:
        while True:
            print(pipe_2[1].recv())
    except EOFError:
        print("End")
0
1
4
9
16
25
36
49
64
81
End
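Pipe() actually returns a pair of Connection objects, and in the duplex case either end can both send and receive. A minimal sketch of the simpler two-ended usage (my own illustration, not from the book):

import multiprocessing

def child(conn):
    msg = conn.recv()        # receive a message from the parent
    conn.send(msg.upper())   # reply on the same connection
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()  # duplex by default
    p = multiprocessing.Process(target=child, args=(child_conn,))
    p.start()
    parent_conn.send('hello')
    print(parent_conn.recv())  # prints HELLO
    p.join()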
Why is synchronization needed? When several processes work together, the order of execution sometimes has to be guaranteed strictly, otherwise errors or unpredictable results may occur.
The synchronization primitives for processes are very similar to those of the threading module: Lock, RLock, Semaphore, Condition, Event, and Barrier.
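Only the Barrier is demonstrated below; as a quick taste of the other primitives, here is a minimal Lock sketch (my own illustration, not from the book) that keeps two processes from interleaving their output:

import multiprocessing

def printer(lock, label):
    with lock:  # only one process at a time may hold the lock
        for i in range(3):
            print('%s -> %d' % (label, i))

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    a = multiprocessing.Process(target=printer, args=(lock, 'A'))
    b = multiprocessing.Process(target=printer, args=(lock, 'B'))
    a.start()
    b.start()
    a.join()
    b.join()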
A Barrier in Python makes a fixed number of processes wait until all of them have reached it; only then may they continue. Below is an example of synchronizing with a Barrier.
# Using a Barrier to synchronize processes
import multiprocessing
from multiprocessing import Barrier, Lock, Process
from time import time
from datetime import datetime

def test_with_barrier(synchronizer, serializer):
    name = multiprocessing.current_process().name
    synchronizer.wait()
    now = time()
    with serializer:
        print('process %s ----> %s' % (name, datetime.fromtimestamp(now)))

def test_without_barrier():
    name = multiprocessing.current_process().name
    now = time()
    print('process %s ----> %s' % (name, datetime.fromtimestamp(now)))

if __name__ == '__main__':
    synchronizer = Barrier(2)
    serializer = Lock()
    Process(name='p1 - test_with_barrier', target=test_with_barrier, args=(synchronizer, serializer)).start()
    Process(name='p2 - test_with_barrier', target=test_with_barrier, args=(synchronizer, serializer)).start()
    Process(name='p3 - test_without_barrier', target=test_without_barrier).start()
    Process(name='p4 - test_without_barrier', target=test_without_barrier).start()
process p3 - test_without_barrier ----> 2022-01-21 14:10:45.443280
process p4 - test_without_barrier ----> 2022-01-21 14:10:45.453279
process p2 - test_with_barrier ----> 2022-01-21 14:10:45.473277
process p1 - test_with_barrier ----> 2022-01-21 14:10:45.473277
In the code above, the Barrier makes p1 and p2 each wait at synchronizer.wait() until both have arrived, and only then do they continue together; that is why p1 and p2 print the same timestamp.
With the process-pool mechanism, a function applied to many input values can be parallelized by distributing the input data across several processes, which gives data parallelism.
The multiprocessing.Pool class handles such simple parallel-processing tasks.
Pool provides methods such as apply(), apply_async(), map(), map_async(), close(), and join():
import multiprocessing
import time

def function_square(data):
    x = data * data
    x = 2 * x + 9
    x = x * x
    x = x - 1000
    x = x * 6
    x = x * x
    x = x / 78
    x = x * 2.98
    x = x * 0.492
    return x

if __name__ == '__main__':
    inputs = list(range(0, 1000000))
    pool = multiprocessing.Pool(processes=8)
    t0 = time.time()
    pool_outputs = pool.map(function_square, inputs)
    t1 = time.time()
    t2 = time.time()
    outputs = list(map(function_square, inputs))
    t3 = time.time()
    pool.close()
    pool.join()
    print('pool: %.12f | no_pool: %.12f' % (t1 - t0, t3 - t2))
pool: 0.280999183655 | no_pool: 0.480020523071
To sum up, Pool.map distributes the mapping across multiple processes, but the extra overhead of creating workers and shipping data between processes is significant; for very cheap functions or small inputs the plain built-in map can be just as fast or faster, although in the run above, with a million inputs, the pool version actually finished first.
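Besides map(), Pool also offers asynchronous variants such as apply_async() and map_async(), which return an AsyncResult immediately instead of blocking. A minimal sketch (my own illustration, not from the book):

import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        async_result = pool.map_async(square, range(10))  # returns immediately
        print(async_result.get())   # blocks until all results are ready
        single = pool.apply_async(square, (7,))
        print(single.get())          # prints 49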