Purpose: contains constants and classes for working with text.
Python version: 1.4 and later.
capwords(): capitalizes the first letter of every word in a string.
>>> import string
>>> s = 'The quick brown fox jumped over the lazy dog'
>>> string.capwords(s)
'The Quick Brown Fox Jumped Over The Lazy Dog'
1. Doing the same with a list
>>> s
'The quick brown fox jumped over the lazy dog'
>>> " ".join(map(lambda x: x[0].upper() + x[1:], s.split(" ")))
'The Quick Brown Fox Jumped Over The Lazy Dog'
However, if the words are separated by more than one space, s.split(" ") produces empty strings and x[0] raises IndexError, so the list version is flawed. A revised version that walks the string character by character:
>>> ss
'The quick brown fox jumped over the lazy dog'
>>> for index in range(len(ss)):
        # capitalize a non-space character that starts the string or follows a space
        if ss[index] != " " and (index == 0 or ss[index - 1] == " "):
            ss = ss[:index] + ss[index].upper() + ss[index + 1:]

>>> ss
'The Quick Brown Fox Jumped Over The Lazy Dog'
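The difference between the approaches shows up when the input has runs of spaces. The following is a minimal sketch in Python 3 syntax (print() is a function there, unlike the Python 2 listings above); capitalize_words is a hypothetical helper written for this illustration, not part of the string module.

```python
import re
import string

def capitalize_words(text):
    """Uppercase the first character of every run of non-space characters,
    leaving the original spacing untouched (hypothetical helper)."""
    return re.sub(r'\S+', lambda m: m.group(0)[0].upper() + m.group(0)[1:], text)

s = 'The  quick brown   fox'

collapsed = string.capwords(s)    # splits on any whitespace, rejoins with single spaces
preserved = capitalize_words(s)   # keeps the double and triple spaces

print(repr(collapsed))
print(repr(preserved))
```

string.capwords() is usually enough; the regular-expression variant only matters when the original spacing has to survive.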
maketrans(): used together with the translate() method to map one set of characters to another; this is better than calling replace() repeatedly.
>>> import string
>>> leet = string.maketrans('abegiloprstz', '463611092572')
>>> s
'The quick brown fox jumped over the lazy dog'
>>> s.translate(leet)
'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'
1. Doing the same with repeated replace() calls
>>> s
'The quick brown fox jumped over the lazy dog'
>>> subStr = s
>>> for old, new in zip('abegiloprstz', '463611092572'):
        subStr = subStr.replace(old, new)

>>> subStr
'Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06'
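string.maketrans() exists only in Python 2. As a rough sketch of the same idea in Python 3, the table is built with the static method str.maketrans() instead; the variable names below are chosen for this illustration.

```python
# Python 3: maketrans() is a static method of str, not a function in the string module
s = 'The quick brown fox jumped over the lazy dog'

leet = str.maketrans('abegiloprstz', '463611092572')
translated = s.translate(leet)
print(translated)

# The optional third argument lists characters to delete during translation
no_vowels = s.translate(str.maketrans('', '', 'aeiou'))
print(no_vowels)
```

One table lookup per character replaces twelve passes over the string, which is why translate() is preferred over chained replace() calls.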
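When building strings with string.Template, a variable is identified by prefixing its name with $ (as in $var) or, when it has to be set off from the surrounding text, by also wrapping it in braces (as in ${var}).
A simple example:
import string

values = {'var': 'foo'}

# substitution with string.Template; a literal $ is written by doubling it ($$)
t = string.Template("""
Variable        : $var
Escape          : $$
Variable in text: ${var}iable
""")
print 'TEMPLATE:', t.substitute(values)

# string interpolation with the % operator, matching values by keyword;
# a literal % is written by doubling it (%%)
s = """
Variable        : %(var)s
Escape          : %%
Variable in text: %(var)siable
"""
print 'INTERPOLATION:', s % values
Interpreter output:
>>>
TEMPLATE:
Variable        : foo
Escape          : $
Variable in text: fooiable

INTERPOLATION:
Variable        : foo
Escape          : %
Variable in text: fooiable

One important difference between templates and standard string interpolation is that templates ignore the type of the arguments. Values are converted to strings, and the strings are inserted into the result. No formatting options are available.
With safe_substitute(), a template that is missing some of its values does not raise an exception:
import string

values = {'var': 'foo'}
t = string.Template("$var is here but $missing is not provided")

try:
    print 'substitute() :', t.substitute(values)
except KeyError, err:
    print 'ERROR:', str(err)

# if a value is not provided, the placeholder is left unchanged
print 'safe_substitute():', t.safe_substitute(values)
Interpreter output:
>>>
substitute() : ERROR: 'missing'
safe_substitute(): foo is here but $missing is not provided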
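For readers on Python 3, the same comparison might look like the sketch below. Only the exception syntax and print() change; the variable names are illustrative.

```python
from string import Template

t = Template('$var is here but $missing is not provided')
values = {'var': 'foo'}

try:
    strict = t.substitute(values)       # raises KeyError for $missing
except KeyError as err:
    strict = 'ERROR: %s' % err

safe = t.safe_substitute(values)        # leaves $missing in place

print(strict)
print(safe)
```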
The default syntax of string.Template can be changed by adjusting the regular expression patterns it uses to find variable names in the template body. A simple way to do this is to change the delimiter and idpattern class attributes.
import string

template_text = """
Delimiter : %%
Replaced  : %with_underscore
Ignored   : %notunderscored
"""

d = {'with_underscore': 'replaced',
     'notunderscored': 'not replaced',
     }

# the delimiter is changed to %
# variable names must match '[a-z]+_[a-z]+', i.e. contain an underscore in the middle
class MyTemplate(string.Template):
    delimiter = '%'
    idpattern = '[a-z]+_[a-z]+'

t = MyTemplate(template_text)
print 'Modified ID pattern'
print t.safe_substitute(d)
Interpreter output:
>>>
Modified ID pattern

Delimiter : %
Replaced  : replaced
Ignored   : %notunderscored
For more complex changes, override the pattern attribute and define an entirely new regular expression. The pattern provided must contain four named groups, for the escaped delimiter, the named variable, the variable name wrapped in braces, and invalid delimiter patterns.
import re
import string

class MyTemplate(string.Template):
    delimiter = '{{'    # change the delimiter to '{{'
    pattern = r"""
    \{\{(?:
    (?P<escaped>\{\{)|
    (?P<named>[_a-z][_a-z0-9]*)\}\}|
    (?P<braced>[_a-z][_a-z0-9]*)\}\}|
    (?P<invalid>)
    )
    """

t = MyTemplate("""
{{{{
{{var}}
{{foo}}
""")

print 'MATCHES:', t.pattern.findall(t.template)
print 'SUBSTITUTED:', t.safe_substitute(var='123replacement', foo='replacement')
Interpreter output:
>>>
MATCHES: [('{{', '', '', ''), ('', 'var', '', ''), ('', 'foo', '', '')]
SUBSTITUTED:
{{
123replacement
replacement

Note: I do not yet understand how the four groups in pattern are used.
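One way to see what the four groups do is to run the default pattern of string.Template over a template and report which group captured each match. This is only an exploratory sketch in Python 3 syntax; the sample template and the kinds list are made up for the illustration.

```python
import string

template_text = '$$ $name ${braced}text $!'

kinds = []
for m in string.Template.pattern.finditer(template_text):
    # exactly one of the four named groups is not None for each match
    kind = [name for name, value in m.groupdict().items() if value is not None][0]
    kinds.append(kind)
    print(repr(m.group(0)), '->', kind)
```

In this reading, escaped marks a doubled delimiter that becomes a literal one, named and braced both carry a variable name to look up, and invalid catches a delimiter that is followed by nothing usable (substitute() raises ValueError for it, while safe_substitute() leaves it alone).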
Purpose: formats text by adjusting where line breaks occur in a paragraph.
Python version: 2.5 and later.
When pretty-printing is needed, the textwrap module can be used to format the text for output. It offers programmatic functionality similar to the paragraph wrapping or filling features found in many text editors.
sample_text = """
    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors
    """
This is saved in the module textwrap_example.py so that the later examples can import it.
Filling text to a given width
>>> import textwrap
>>> from textwrap_example import sample_text
>>> print textwrap.fill(sample_text, width = 50)
     The textwrap module can be used to format
text for output in     situations where pretty-
printing is desired. It offers     programmatic
functionality similar to the paragraph wrapping
or filling features found in many text editors
The result shows that only the first line is indented; none of the remaining lines are.
dedent() removes the whitespace prefix that is common to all lines:
>>> print textwrap.dedent(sample_text)

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping
or filling features found in many text editors

dedent() can be used to strip the indentation first, and fill() to rewrap the result to a given width:
>>> dedented_text = textwrap.dedent(sample_text).strip()
>>> for width in [45, 70]:
        print '%d Columns:\n' % width
        print textwrap.fill(dedented_text, width=width)
        print

45 Columns:

The textwrap module can be used to format
text for output in situations where pretty-
printing is desired. It offers programmatic
functionality similar to the paragraph
wrapping or filling features found in many
text editors

70 Columns:

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers programmatic
functionality similar to the paragraph wrapping or filling features
found in many text editors

A nicer result is a hanging indent, where the first line is indented differently from the lines that follow it (here the first line has no indent and the rest are indented four spaces):
>>> dedented_text = textwrap.dedent(sample_text).strip()
>>> print textwrap.fill(dedented_text, initial_indent='', subsequent_indent=' ' * 4, width = 50,)
The textwrap module can be used to format text for
    output in situations where pretty-printing is
    desired. It offers programmatic functionality
    similar to the paragraph wrapping or filling
    features found in many text editors
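The dedent-then-fill sequence can be condensed into a small self-contained sketch. It uses Python 3 syntax and a shortened sample paragraph invented for the illustration, so it does not depend on textwrap_example.py.

```python
import textwrap

raw = """
    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired.
    """

paragraph = textwrap.dedent(raw).strip()      # drop the common indent and outer blank lines
wrapped = textwrap.fill(paragraph,
                        width=40,
                        initial_indent='',          # first line starts at the margin
                        subsequent_indent=' ' * 4)  # later lines hang by four spaces
lines = wrapped.splitlines()
print(wrapped)
```

The width limit includes the indent, so the hanging lines hold fewer characters of text than the first line.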
The search() function takes the pattern and the text to scan as input, and returns a Match object when the pattern is found, or None otherwise.
Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs:
>>> import re
>>> pattern = 'this'
>>> text = 'Does this text match the pattern?'
>>> match = re.search(pattern, text)
>>> dir(match)
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
>>> match.string
'Does this text match the pattern?'
>>> match.start
<built-in method start of _sre.SRE_Match object at 0x0000000002A96648>
>>> match.start()
5
>>> match.re
<_sre.SRE_Pattern object at 0x0000000002A9E258>
>>> match.re()
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    match.re()
TypeError: '_sre.SRE_Pattern' object is not callable
>>> match.re.pattern
'this'
Note: start is a method and must be called, while re is an attribute holding the compiled pattern. Using dir() and help() to inspect what each object offers is very useful.
If an expression is used frequently, it is more efficient to compile it. The compile() function converts an expression string into a RegexObject.
import re

# precompile the patterns
regexes = [re.compile(p) for p in ['this', 'that']]
text = 'Does this text match the pattern'

print 'Text: %r\n' % text

for regex in regexes:
    print 'Seeking "%s" ->' % regex.pattern,
    if regex.search(text):
        print 'match'
    else:
        print 'no match'
Interpreter output:
>>>
Text: 'Does this text match the pattern'

Seeking "this" -> match
Seeking "that" -> no match
>>> type(regexes)
<type 'list'>
>>> regexes
[<_sre.SRE_Pattern object at 0x0000000002BAE0E8>, <_sre.SRE_Pattern object at 0x0000000002BAE258>]
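The findall() function returns all substrings of the input that match the pattern without overlapping. finditer() returns an iterator that produces Match objects instead of strings.
import re

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

for match in re.findall(pattern, text):
    print 'Found "%s"' % match

# re.finditer(pattern, text) is called only once; it returns an iterator,
# and the for loop receives one Match object per iteration
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print 'Found "%s" at %d:%d' % (text[s:e], s, e)
Interpreter output:
>>>
Found "ab"
Found "ab"
Found "ab" at 0:2
Found "ab" at 5:7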
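The three entry points seen so far (search(), findall(), finditer()) can be compared side by side. This is a small sketch in Python 3 syntax that reuses the sample strings from the listings above; the variable names are illustrative.

```python
import re

# search(): first match only, or None
match = re.search('this', 'Does this text match the pattern?')
first_span = match.span() if match else None     # (start, end) of the match

# findall(): matching substrings; finditer(): Match objects with positions
text = 'abbaaabbbbaaaaa'
substrings = re.findall('ab', text)
spans = [(m.start(), m.end()) for m in re.finditer('ab', text)]

print(first_span, substrings, spans)
```

Checking the result of search() for None before using it avoids an AttributeError when nothing matches.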
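Regular expressions support more powerful patterns than simple literal text strings. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every literal character to be present in the pattern. All of these features are used by combining literal text values with metacharacters that are part of the regular expression pattern syntax implemented by re.
import re

def test_patterns(text, patterns=[]):
    # look for each pattern in the text and print the matches
    for pattern, desc in patterns:
        print 'Pattern %r (%s)\n' % (pattern, desc)
        print ' %r' % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print ' %s%r|' % (prefix, substr),
        print
    return

if __name__ == "__main__":
    test_patterns('abbaaabbbbaaaaa', [('ab', "'a' followed by 'b'"),])
This is saved in the file re_test_patterns.py.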
There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times. With + it is repeated at least once. With ? it is repeated zero or one time. {m} repeats it exactly m times, and {m,n} repeats it at least m and at most n times. Leaving out n, as in {m,}, means at least m times with no upper limit.
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*', 'a followed by zero or more b'),
     ('ab+', 'a followed by one or more b'),
     ('ab?', 'a followed by zero or one b'),
     ('ab{3}', 'a followed by three b'),
     ('ab{2,3}', 'a followed by two to three b'),
     ])
Interpreter output:
>>>
Pattern 'ab*' (a followed by zero or more b)

 'abbaabbba'
 'abb'| ...'a'| ....'abbb'| ........'a'|
Pattern 'ab+' (a followed by one or more b)

 'abbaabbba'
 'abb'| ....'abbb'|
Pattern 'ab?' (a followed by zero or one b)

 'abbaabbba'
 'ab'| ...'a'| ....'ab'| ........'a'|
Pattern 'ab{3}' (a followed by three b)

 'abbaabbba'
 ....'abbb'|
Pattern 'ab{2,3}' (a followed by two to three b)

 'abbaabbba'
 'abb'| ....'abbb'|
Normally, when processing a repetition instruction, re consumes as much of the input as possible while matching the pattern. This so-called "greedy" behavior may result in fewer individual matches, or in matches that include more of the input text than intended. Greediness can be turned off by following the repetition instruction with "?":
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('ab*?', 'a followed by zero or more b'),
     ('ab+?', 'a followed by one or more b'),
     ('ab??', 'a followed by zero or one b'),
     ('ab{3}?', 'a followed by three b'),
     ('ab{2,3}?', 'a followed by two to three b'),
     ])
Interpreter output:
>>>
Pattern 'ab*?' (a followed by zero or more b)

 'abbaabbba'
 'a'| ...'a'| ....'a'| ........'a'|
Pattern 'ab+?' (a followed by one or more b)

 'abbaabbba'
 'ab'| ....'ab'|
Pattern 'ab??' (a followed by zero or one b)

 'abbaabbba'
 'a'| ...'a'| ....'a'| ........'a'|
Pattern 'ab{3}?' (a followed by three b)

 'abbaabbba'
 ....'abbb'|
Pattern 'ab{2,3}?' (a followed by two to three b)

 'abbaabbba'
 'abb'| ....'abb'|
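The greedy and non-greedy results above can be collected into one table with findall(). This is a compact sketch in Python 3 syntax that does not need re_test_patterns.py; the results dictionary is just a convenience for the illustration.

```python
import re

text = 'abbaabbba'
results = {}

for pattern in ['ab*', 'ab+', 'ab?', 'ab{3}', 'ab{2,3}']:
    lazy = pattern + '?'                        # a trailing ? turns off greediness
    results[pattern] = re.findall(pattern, text)
    results[lazy] = re.findall(lazy, text)
    print('%-8s %-28r %-9s %r' % (pattern, results[pattern], lazy, results[lazy]))
```

Note that {3} has a fixed count, so its greedy and non-greedy forms match the same text.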
A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] matches either a or b:
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b, not greedy'),
     ])
Interpreter output (note the effect of greedy matching):
>>>
Pattern '[ab]' (either a or b)

 'abbaabbba'
 'a'| .'b'| ..'b'| ...'a'| ....'a'| .....'b'| ......'b'| .......'b'| ........'a'|
Pattern 'a[ab]+' (a followed by 1 or more a or b)

 'abbaabbba'
 'abbaabbba'|
Pattern 'a[ab]+?' (a followed by 1 or more a or b, not greedy)

 'abbaabbba'
 'ab'| ...'aa'|
A character set can also be used to exclude specific characters. A caret (^) at the start of the set means to look for characters that are not in the set that follows.
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation',
    # find all runs of characters that do not contain "-", "." or a space
    [('[^-. ]+', 'sequences without -, ., or space'),
     ])
Interpreter output:
>>>
Pattern '[^-. ]+' (sequences without -, ., or space)

 'This is some text -- with punctuation'
 'This'| .....'is'| ........'some'| .............'text'| .....................'with'| ..........................'punctuation'|
A character range defines a character set that includes all of the contiguous characters between a start point and a stop point:
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of lowercase or uppercase letters'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase'),
     ])
Interpreter output:
>>>
Pattern '[a-z]+' (sequences of lowercase letters)

 'This is some text -- with punctuation'
 .'his'| .....'is'| ........'some'| .............'text'| .....................'with'| ..........................'punctuation'|
Pattern '[A-Z]+' (sequences of uppercase letters)

 'This is some text -- with punctuation'
 'T'|
Pattern '[a-zA-Z]+' (sequences of lowercase or uppercase letters)

 'This is some text -- with punctuation'
 'This'| .....'is'| ........'some'| .............'text'| .....................'with'| ..........................'punctuation'|
Pattern '[A-Z][a-z]+' (one uppercase followed by lowercase)

 'This is some text -- with punctuation'
 'This'|
As a special case of a character set, the metacharacter "." indicates that the pattern should match any single character in that position.
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b'),
     ])
Interpreter output:
>>>
Pattern 'a.' (a followed by any one character)

 'abbaabbba'
 'ab'| ...'aa'|
Pattern 'b.' (b followed by any one character)

 'abbaabbba'
 .'bb'| .....'bb'| .......'ba'|
Pattern 'a.*b' (a followed by anything, ending in b)

 'abbaabbba'
 'abbaabbb'|
Pattern 'a.*?b' (a followed by anything, ending in b)

 'abbaabbba'
 'ab'| ...'aab'|
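A condensed view of the character-set examples, as a sketch in Python 3 syntax: findall() is used here instead of test_patterns(), and the variable names are illustrative.

```python
import re

sentence = 'This is some text -- with punctuation'

words = re.findall('[^-. ]+', sentence)          # exclude "-", "." and space
capitalized = re.findall('[A-Z][a-z]+', sentence)  # a range inside a set

# "." matches any single character; compare greedy and non-greedy forms
greedy = re.findall('a.*b', 'abbaabbba')
lazy = re.findall('a.*?b', 'abbaabbba')

print(words, capitalized, greedy, lazy)
```

Inside a set, the hyphen is literal when it comes first (or right after the caret), which is why '[^-. ]' works without escaping.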
The escape codes recognized by re are:
Escape code | Meaning
\d | a digit
\D | a non-digit
\s | whitespace (tab, space, newline, etc.)
\S | non-whitespace
\w | alphanumeric (letters, digits, and the underscore)
\W | non-alphanumeric
from re_test_patterns import test_patterns

test_patterns(
    'A prime #1 example!',
    [(r'\d+', 'sequence of digits'),
     (r'\D+', 'sequence of nondigits'),
     (r'\s+', 'sequence of whitespace'),
     (r'\S+', 'sequence of nonwhitespace'),
     (r'\w+', 'alphanumeric characters'),
     (r'\W+', 'nonalphanumeric')
     ])
Interpreter output:
>>>
Pattern '\\d+' (sequence of digits)

 'A prime #1 example!'
 .........'1'|
Pattern '\\D+' (sequence of nondigits)

 'A prime #1 example!'
 'A prime #'| ..........' example!'|
Pattern '\\s+' (sequence of whitespace)

 'A prime #1 example!'
 .' '| .......' '| ..........' '|
Pattern '\\S+' (sequence of nonwhitespace)

 'A prime #1 example!'
 'A'| ..'prime'| ........'#1'| ...........'example!'|
Pattern '\\w+' (alphanumeric characters)

 'A prime #1 example!'
 'A'| ..'prime'| .........'1'| ...........'example'|
Pattern '\\W+' (nonalphanumeric)

 'A prime #1 example!'
 .' '| .......' #'| ..........' '| ..................'!'|
To match characters that are part of the regular expression syntax, escape them in the search pattern:
from re_test_patterns import test_patterns

test_patterns(
    r'\d+ \D+ \s+',
    [(r'\\.\+', 'escape code'),
     ])
Interpreter output:
>>>
Pattern '\\\\.\\+' (escape code)

 '\\d+ \\D+ \\s+'
 '\\d+'| .....'\\D+'| ..........'\\s+'|
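The escape codes, and the escaping of metacharacters, can be tried out directly with findall(). The sketch below uses Python 3 syntax; re.escape() is not used in the listings above and is shown here only as one convenient way to escape a literal string.

```python
import re

sample = 'A prime #1 example!'
digits = re.findall(r'\d+', sample)
word_chars = re.findall(r'\w+', sample)

# Matching literal backslash and plus characters requires escaping them
codes = re.findall(r'\\.\+', r'\d+ \D+ \s+')

# re.escape() builds a pattern that matches its argument literally
literal = re.compile(re.escape('a.b'))
print(digits, word_chars, codes)
print(bool(literal.search('a.b')), bool(literal.search('axb')))
```

Raw strings (the r prefix) keep the backslashes intact so that they reach the regular expression engine unchanged.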
Anchoring instructions specify the relative location in the input text where the pattern should appear.
Anchor code | Meaning
^ | start of string, or line
$ | end of string, or line
\A | start of string
\Z | end of string
\b | empty string at the beginning or end of a word
\B | empty string not at the beginning or end of a word
from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [(r'^\w+', 'word at start of string'),
     (r'\A\w+', 'word at start of string'),
     (r'\w+\S*$', 'word near end of string, skip punctuation'),
     (r'\w+\S*\Z', 'word near end of string, skip punctuation'),
     (r'\w*t\w*', 'word containing t'),
     (r'\bt\w+', 't at start of word'),
     (r'\w+t\b', 't at end of word'),
     (r'\Bt\B', 't not start or end of word'),
     ])
Interpreter output:
>>>
Pattern '^\\w+' (word at start of string)

 'This is some text -- with punctuation.'
 'This'|
Pattern '\\A\\w+' (word at start of string)

 'This is some text -- with punctuation.'
 'This'|
Pattern '\\w+\\S*$' (word near end of string, skip punctuation)

 'This is some text -- with punctuation.'
 ..........................'punctuation.'|
Pattern '\\w+\\S*\\Z' (word near end of string, skip punctuation)

 'This is some text -- with punctuation.'
 ..........................'punctuation.'|
Pattern '\\w*t\\w*' (word containing t)

 'This is some text -- with punctuation.'
 .............'text'| .....................'with'| ..........................'punctuation'|
Pattern '\\bt\\w+' (t at start of word)

 'This is some text -- with punctuation.'
 .............'text'|
Pattern '\\w+t\\b' (t at end of word)

 'This is some text -- with punctuation.'
 .............'text'|
Pattern '\\Bt\\B' (t not start or end of word)

 'This is some text -- with punctuation.'
 .......................'t'| ..............................'t'| .................................'t'|
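The word-boundary anchors are the least intuitive of the group, so a short sketch may help. It uses Python 3 syntax and the same sample sentence; the variable names are illustrative.

```python
import re

text = 'This is some text -- with punctuation.'

starts_with_t = re.findall(r'\bt\w+', text)    # t at the start of a word
ends_with_t = re.findall(r'\w+t\b', text)      # t at the end of a word
inner_t = [m.start() for m in re.finditer(r'\Bt\B', text)]  # t inside a word

print(starts_with_t, ends_with_t, inner_t)
```

\b and \B match positions rather than characters, so they add nothing to the length of the matched text.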
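If it is known in advance that only a subset of the full input should be searched, the regular expression match can be constrained further by telling re to limit the search range. For example, if the pattern must appear at the front of the input, using match() instead of search() anchors the search without having to include an anchor in the pattern explicitly.
>>> import re
>>> text = 'This is some text -- with punctuation.'
>>> pattern = 'is'
>>> m = re.match(pattern, text)
>>> print m
None
>>> s = re.search(pattern, text)
>>> print s
<_sre.SRE_Match object at 0x0000000002C265E0>
The search() method of a compiled regular expression also accepts optional start and end position parameters (pos and endpos) that limit the search to a substring of the input:
import re

text = 'This is some text -- with punctuation.'
pattern = re.compile(r'\b\w*is\w*\b')

print 'Text:', text
print

pos = 0
while True:
    match = pattern.search(text, pos)
    if not match:
        break
    s = match.start()
    e = match.end()
    print ' %2d : %2d = "%s"' % (s, e - 1, text[s:e])
    # continue searching from the end of the previous match
    pos = e
Interpreter output:
>>>
Text: This is some text -- with punctuation.

  0 :  3 = "This"
  5 :  6 = "is"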
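The difference between match() and search(), and the effect of the pos argument, can be checked with a few lines. This is a sketch in Python 3 syntax; found is a list introduced only for the illustration.

```python
import re

text = 'This is some text -- with punctuation.'

at_start = re.match('is', text)       # anchored at position 0, so no match
anywhere = re.search('is', text)      # finds the "is" inside "This"

# Advance pos past each match to walk through the input
pattern = re.compile(r'\b\w*is\w*\b')
found = []
pos = 0
while True:
    m = pattern.search(text, pos)
    if not m:
        break
    found.append((m.start(), m.end(), m.group(0)))
    pos = m.end()

print(at_start, anywhere.span(), found)
```

The loop is essentially a hand-written finditer(); it mainly shows what the pos argument does.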
Searching for pattern matches is the basis of the powerful capabilities provided by regular expressions. Adding groups to a pattern isolates parts of the matching text. Groups are defined by enclosing patterns in parentheses, "(" and ")":
from re_test_patterns import test_patterns

test_patterns(
    'abbaaabbbbaaaaa',
    [('a(ab)', 'a followed by literal ab'),
     ('a(a*b*)', 'a followed by 0-n a and 0-n b'),
     ('a(ab)*', 'a followed by 0-n ab'),
     ('a(ab)+', 'a followed by 1-n ab'),
     ])
Interpreter output:
>>>
Pattern 'a(ab)' (a followed by literal ab)

 'abbaaabbbbaaaaa'
 ....'aab'|
Pattern 'a(a*b*)' (a followed by 0-n a and 0-n b)

 'abbaaabbbbaaaaa'
 'abb'| ...'aaabbbb'| ..........'aaaaa'|
Pattern 'a(ab)*' (a followed by 0-n ab)

 'abbaaabbbbaaaaa'
 'a'| ...'a'| ....'aab'| ..........'a'| ...........'a'| ............'a'| .............'a'| ..............'a'|
Pattern 'a(ab)+' (a followed by 1-n ab)

 'abbaaabbbbaaaaa'
 ....'aab'|
To access the substrings matched by the individual groups within a pattern, use the groups() method of the Match object (or group() for a single group):
import re

text = 'This is some text -- with punctuation.'

print text
print

patterns = [
    (r'^(\w+)', 'word at start of string'),
    (r'(\w+)\S*$', 'word at end, with optional punctuation'),
    (r'(\bt\w+)\W+(\w+)', 'word starting with t, another word'),
    (r'(\w+t)\b', 'word ending with t'),
    ]

for pattern, desc in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Pattern %r (%s)\n' % (pattern, desc)
    print ' ', match.groups()
    print
Interpreter output:
>>>
This is some text -- with punctuation.

Pattern '^(\\w+)' (word at start of string)

  ('This',)

Pattern '(\\w+)\\S*$' (word at end, with optional punctuation)

  ('punctuation',)

Pattern '(\\bt\\w+)\\W+(\\w+)' (word starting with t, another word)

  ('text', 'with')

Pattern '(\\w+t)\\b' (word ending with t)

  ('text',)

Python extends the basic grouping syntax with named groups. Referring to groups by name makes it easier to modify the pattern later without also having to modify the code that uses the match results. To set the name of a group, use the syntax (?P<name>pattern):
import re

text = 'This is some text -- with punctuation.'

print text
print

patterns = [
    r'^(?P<first_word>\w+)',
    r'(?P<last_word>\w+)\S*$',
    r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
    r'(?P<ends_with_t>\w+t)\b',
    ]

for pattern in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print 'Matching "%s"' % pattern
    print ' ', match.groups()
    print ' ', match.groupdict()
    print
Interpreter output:
>>>
This is some text -- with punctuation.

Matching "^(?P<first_word>\w+)"
  ('This',)
  {'first_word': 'This'}

Matching "(?P<last_word>\w+)\S*$"
  ('punctuation',)
  {'last_word': 'punctuation'}

Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
  ('text', 'with')
  {'other_word': 'with', 't_word': 'text'}

Matching "(?P<ends_with_t>\w+t)\b"
  ('text',)
  {'ends_with_t': 'text'}

Note: groupdict() returns a dictionary that maps group names to the matched substrings. Named groups are also included in the ordered sequence returned by groups().
An updated version of test_patterns() that also shows the numbered and named groups matched by each pattern:
import re

def test_patterns(text, patterns=[]):
    # look for each pattern in the text and print the matches with their groups
    for pattern, desc in patterns:
        print 'Pattern %r (%s)\n' % (pattern, desc)
        print ' %r' % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            prefix = ' ' * (s)
            print ' %s%r%s ' % (prefix, text[s:e], ' ' * (len(text) - e)),
            print match.groups()
            if match.groupdict():
                print '%s%s' % (' ' * (len(text) - s), match.groupdict())
        print
    return

if __name__ == "__main__":
    test_patterns('abbaabbba', [(r'a((a*)(b*))', "'a' followed by 0-n a and 0-n b"),])
Interpreter output:
>>>
Pattern 'a((a*)(b*))' ('a' followed by 0-n a and 0-n b)

 'abbaabbba'
 'abb'        ('bb', '', 'bb')
    'aabbb'   ('abbb', 'a', 'bbb')
         'a'  ('', '', '')

Groups are also useful for specifying alternative patterns. Use the pipe symbol (|) to indicate that one pattern or another should match:
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a+)|(b+))', 'a then seq. of a or seq. of b'),
     (r'a((a|b)+)', 'a then seq. of [ab]'),
     ])
Interpreter output:
>>>
Pattern 'a((a+)|(b+))' (a then seq. of a or seq. of b)

 'abbaabbba'
 'abb'        ('bb', None, 'bb')
    'aa'      ('a', 'a', None)

Pattern 'a((a|b)+)' (a then seq. of [ab])

 'abbaabbba'
 'abbaabbba'  ('bbaabbba', 'a')

Defining a group that contains a subpattern is also useful when the string matching the subpattern is not part of what should be extracted from the full text. These groups are called "non-capturing". A non-capturing group can be used to describe a repetition pattern or alternatives without isolating the matching portion of the string in the value returned. To create a non-capturing group, use the syntax (?:pattern).
from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a+)|(b+))', 'capturing form'),
     (r'a((?:a+)|(?:b+))', 'noncapturing'),
     ])
Interpreter output:
>>>
Pattern 'a((a+)|(b+))' (capturing form)

 'abbaabbba'
 'abb'        ('bb', None, 'bb')
    'aa'      ('a', 'a', None)

Pattern 'a((?:a+)|(?:b+))' (noncapturing)

 'abbaabbba'
 'abb'        ('bb',)
    'aa'      ('a',)

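The three kinds of groups (numbered, named, non-capturing) can be contrasted in a few lines. A sketch in Python 3 syntax, reusing patterns from the listings above; the variable names are illustrative.

```python
import re

# Capturing vs non-capturing alternatives: only the outer group is reported
capturing = re.search(r'a((a+)|(b+))', 'abbaabbba').groups()
noncapturing = re.search(r'a((?:a+)|(?:b+))', 'abbaabbba').groups()

# Named groups are available by position and by name
text = 'This is some text -- with punctuation.'
m = re.search(r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)', text)
by_position = m.groups()
by_name = m.groupdict()

print(capturing, noncapturing, by_position, by_name)
```

A group that did not take part in the match is reported as None, as the second element of capturing shows.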
Option flags change the way the matching engine processes an expression. The flags can be combined with a bitwise OR operation and then passed to compile(), search(), match(), and the other functions that accept a pattern for searching.
IGNORECASE causes literal characters and character ranges in the pattern to match both uppercase and lowercase characters.
import re

text = 'This is some text -- with punctuation.'
pattern = r'\bT\w+'
with_case = re.compile(pattern)
without_case = re.compile(pattern, re.IGNORECASE)

print 'Text:\n %r' % text
print 'Pattern:\n %s' % pattern
print 'Case-sensitive:'
for match in with_case.findall(text):
    print ' %r' % match
print 'Case-insensitive:'
for match in without_case.findall(text):
    print ' %r' % match
Interpreter output:
>>>
Text:
 'This is some text -- with punctuation.'
Pattern:
 \bT\w+
Case-sensitive:
 'This'
Case-insensitive:
 'This'
 'text'
Two flags affect how searching works on multiline input: MULTILINE and DOTALL. The MULTILINE flag controls how the pattern-matching code processes anchoring instructions for text containing newline characters. When multiline mode is turned on, the anchor rules for ^ and $ apply at the beginning and end of each line, in addition to the entire string:
import re

text = 'This is some text -- with punctuation.\nA second line.'
pattern = r'(^\w+)|(\w+\S*$)'
single_line = re.compile(pattern)
multiline = re.compile(pattern, re.MULTILINE)

print 'Text:\n %r' % text
print 'Pattern:\n %s' % pattern
print 'Single Line:'
for match in single_line.findall(text):
    print ' %r' % (match,)
print 'Multiline :'
for match in multiline.findall(text):
    print ' %r' % (match,)
Interpreter output:
>>>
Text:
 'This is some text -- with punctuation.\nA second line.'
Pattern:
 (^\w+)|(\w+\S*$)
Single Line:
 ('This', '')
 ('', 'line.')
Multiline :
 ('This', '')
 ('', 'punctuation.')
 ('A', '')
 ('', 'line.')
DOTALL is the other flag related to multiline text. Normally, the dot character (.) matches every character in the input text except a newline. This flag allows the dot to match newlines as well.
import re

text = 'This is some text -- with punctuation.\nA second line.'
pattern = r'.+'
no_newlines = re.compile(pattern)
dotall = re.compile(pattern, re.DOTALL)

print 'Text:\n %r' % text
print 'Pattern:\n %s' % pattern
print 'No newlines:'
for match in no_newlines.findall(text):
    print ' %r' % (match,)
print 'Dotall :'
for match in dotall.findall(text):
    print ' %r' % (match,)
Interpreter output:
>>>
Text:
 'This is some text -- with punctuation.\nA second line.'
Pattern:
 .+
No newlines:
 'This is some text -- with punctuation.'
 'A second line.'
Dotall :
 'This is some text -- with punctuation.\nA second line.'
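The effect of the two flags is easiest to see by running the same pattern with and without each one. A sketch in Python 3 syntax on the same two-line sample text; the variable names are illustrative.

```python
import re

text = 'This is some text -- with punctuation.\nA second line.'

first_words_single = re.findall(r'^\w+', text)                 # ^ only at start of string
first_words_multi = re.findall(r'^\w+', text, re.MULTILINE)    # ^ at start of every line

lines_default = re.findall(r'.+', text)                        # . stops at the newline
lines_dotall = re.findall(r'.+', text, re.DOTALL)              # . crosses the newline

print(first_words_single, first_words_multi)
print(lines_default, lines_dotall)
```

The flags are independent and can be combined, for example re.MULTILINE | re.DOTALL.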
Verbose expression syntax: allows comments and extra whitespace to be embedded in the pattern.
import re

address = re.compile(
    '''
    [\w\d.+-]+       # username
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'[email protected]',
    u'[email protected]',
    u'[email protected]',
    u'[email protected]'
    ]

for candidate in candidates:
    match = address.search(candidate)
    print '%-30s %s' % (candidate, 'Matches' if match else 'No match')
Interpreter output:
>>>
[email protected]              Matches
[email protected]              Matches
[email protected]              Matches
[email protected]              No match
This version can be extended to parse input that contains a person's name as well as an email address.
import re

address = re.compile(
    '''
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+)
       \s*
       <
    )?
    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)
    )
    >?
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'[email protected]',
    u'[email protected]',
    u'[email protected]',
    u'[email protected]'
    u'First Last <[email protected]>',
    u'No Brackets [email protected]',
    u'First Last',
    u'First Middle Last <[email protected]>',
    u'First M. Last <[email protected]>',
    u'<[email protected]>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Name :', match.groupdict()['name']
        print ' Email:', match.groupdict()['email']
    else:
        print ' No match'
Interpreter output:
>>>
Candidate: [email protected]
 Name : None
 Email: [email protected]
Candidate: [email protected]
 Name : None
 Email: [email protected]
Candidate: [email protected]
 Name : None
 Email: [email protected]
Candidate: [email protected] Last <[email protected]>
 Name : example.fooFirst Last
 Email: [email protected]
Candidate: No Brackets [email protected]
 Name : None
 Email: [email protected]
Candidate: First Last
 No match
Candidate: First Middle Last <[email protected]>
 Name : First Middle Last
 Email: [email protected]
Candidate: First M. Last <[email protected]>
 Name : First M. Last
 Email: [email protected]
Candidate: <[email protected]>
 Name : None
 Email: [email protected]
Note: the fourth and fifth entries in candidates are not separated by a comma, so Python concatenates the two string literals into a single candidate. That is why they appear merged in the output.
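If flags cannot be added when the expression is compiled, such as when a pattern is passed as an argument to a library function that will compile it later, the flags can be embedded in the expression string itself. For example, to turn on case-insensitive matching, add (?i) to the beginning of the expression.
import re

text = 'This is some text -- with punctuation.'
pattern = r'(?i)\bT\w+'
regex = re.compile(pattern)

print 'Text :', text
print 'Pattern :', pattern
print 'Matches :', regex.findall(text)
Interpreter output:
>>>
Text : This is some text -- with punctuation.
Pattern : (?i)\bT\w+
Matches : ['This', 'text']
The abbreviations for all of the flags are:
Flag | Abbreviation
IGNORECASE | i
MULTILINE | m
DOTALL | s
UNICODE | u
VERBOSE | x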
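Embedded flags can be combined by listing several abbreviations inside the same group. The following is a sketch in Python 3 syntax; the two-line string is made up for the illustration.

```python
import re

text = 'This is some text -- with punctuation.'
case_insensitive = re.findall(r'(?i)\bT\w+', text)

# (?im) turns on IGNORECASE and MULTILINE together
two_lines = 'This is\ntext here'
line_starts = re.findall(r'(?im)^t\w+', two_lines)

print(case_insensitive, line_starts)
```

Keeping the flag group at the very start of the pattern is the safest placement, since newer Python versions reject global flags that appear elsewhere in the expression.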
In many cases it is useful to match one part of a pattern only if some other part also matches. For example, the email expression above should match only when the angle brackets are paired. The modified version below uses a positive look-ahead assertion to match the pair. The look-ahead assertion syntax is (?=pattern):
import re

address = re.compile(
    '''
    ((?P<name>
       ([\w.,]+\s+)*[\w.,]+
     )
     \s+
    )
    (?= (<.*>$)
        |
        ([^<].*[^>]$)
    )
    <?
    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)
    )
    >?
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'[email protected]',
    u'No Brackets [email protected]',
    u'Open Bracket <[email protected]>',
    u'Close Bracket [email protected]>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Name :', match.groupdict()['name']
        print ' Email:', match.groupdict()['email']
    else:
        print ' No match'
Interpreter output:
>>>
Candidate: [email protected]
 No match
Candidate: No Brackets [email protected]
 Name : No Brackets
 Email: [email protected]
Candidate: Open Bracket <[email protected]>
 Name : Open Bracket
 Email: [email protected]
Candidate: Close Bracket [email protected]>
 No match
A negative look-ahead assertion, (?!pattern), requires that the pattern does not match the text following the current position. For example, the email recognition pattern can be modified to ignore the noreply addresses commonly used by automated systems:
import re

address = re.compile(
    '''
    ^
    (?!noreply@.*$)
    [\w\d.+-]+       # username
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)
    $
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'[email protected]',
    u'[email protected]',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match:', candidate[match.start():match.end()]
    else:
        print ' No match'
Interpreter output:
>>>
Candidate: [email protected]
 Match: [email protected]
Candidate: [email protected]
 No match
The corresponding negative look-behind assertion uses the syntax (?<!pattern):
address = re.compile(
    '''
    ^
    [\w\d.+-]+       # username
    (?<!noreply)
    @
    ([\w\d.]+\.)+    # domain name prefix
    (com|org|edu)
    $
    ''',
    re.UNICODE | re.VERBOSE)
A positive look-behind assertion, written with the syntax (?<=pattern), finds text that follows a given pattern:
import re

twitter = re.compile(
    '''
    (?<=@)
    ([\w\d_]+)
    ''',
    re.UNICODE | re.VERBOSE)

text = '''This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.'''

print text
for match in twitter.findall(text):
    print 'Handle:', match
Interpreter output:
>>>
This text includes two Twitter handles.
One for @ThePSF, and one for the author, @doughellmann.
Handle: ThePSF
Handle: doughellmann
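The look-around assertions can be exercised with much smaller patterns. The sketch below uses Python 3 syntax; the addresses at example.com are invented for the illustration and are not the ones used in the listings above.

```python
import re

# Positive look-behind: the @ must precede the handle but is not part of the match
text = 'One for @ThePSF, and one for the author, @doughellmann.'
handles = re.findall(r'(?<=@)\w+', text)

# Negative look-ahead: reject addresses that start with "noreply@"
address = re.compile(r'^(?!noreply@)[\w.+-]+@([\w.]+\.)+(com|org|edu)$')
accepted = address.match('first.last@example.com')    # hypothetical address
rejected = address.match('noreply@example.com')       # hypothetical address

print(handles, bool(accepted), bool(rejected))
```

Assertions consume no input, so the text they inspect is still available to the rest of the pattern.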
Matched values can also be used in later parts of an expression. The easiest way is to refer to a previously matched group by its id number, using \num:
import re

address = re.compile(
    r'''
    (\w+)               # first name
    \s+
    (([\w.]+)\s+)?      # optional middle name or initial
    (\w+)               # last name
    \s+
    <
    (?P<email>
      \1
      \.
      \4
      @
      ([\w\d.]+\.)+
      (com|org|edu)
    )
    >
    ''',
    re.UNICODE | re.VERBOSE | re.IGNORECASE)

candidates = [
    u'First Last <[email protected]>',
    u'Different Name <[email protected]>',
    u'First Middle Last <[email protected]>',
    u'First M. Last <[email protected]>',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match name:', match.group(1), match.group(4)
        print ' Match email:', match.group(5)
    else:
        print ' No match'
Interpreter output:
>>>
Candidate: First Last <[email protected]>
 Match name: First Last
 Match email: [email protected]
Candidate: Different Name <[email protected]>
 No match
Candidate: First Middle Last <[email protected]>
 Match name: First Last
 Match email: [email protected]
Candidate: First M. Last <[email protected]>
 Match name: First Last
 Match email: [email protected]
Creating back-references by numerical id has two disadvantages. First, the groups have to be renumbered whenever the expression changes, which makes it hard to maintain. Second, at most 99 references can be created, and an expression with more than 99 groups brings even more serious maintenance problems.
For this reason Python expressions can use (?P=name) to refer to the value of a named group matched earlier in the expression:
address = re.compile(
    r'''
    (?P<first_name>\w+)    # first name
    \s+
    (([\w.]+)\s+)?         # optional middle name or initial
    (?P<last_name>\w+)     # last name
    \s+
    <
    (?P<email>
      (?P=first_name)
      \.
      (?P=last_name)
      @
      ([\w\d.]+\.)+
      (com|org|edu)
    )
    >
    ''',
    re.UNICODE | re.VERBOSE | re.IGNORECASE)
The other mechanism for using back-references in expressions chooses a different pattern depending on whether a previous group matched. The email pattern can be corrected so that the angle brackets are required if a name is present, but not if the email address is by itself. The syntax is (?(id)yes-expression|no-expression), where id is the group name or number, yes-expression is the pattern to use if the group has a value, and no-expression is the pattern to use otherwise.
import re

address = re.compile(
    r'''
    ^
    (?P<name>
       ([\w.]+\s+)*[\w.]+
    )?
    \s*
    (?(name)
      (?P<brackets>(?=(<.*>$)))
      |
      (?=([^<].*[^>]$))
    )
    (?(brackets)<|\s*)
    (?P<email>
      [\w\d.+-]+
      @
      ([\w\d.]+\.)+
      (com|org|edu)
    )
    (?(brackets)>|\s*)
    $
    ''',
    re.UNICODE | re.VERBOSE)

candidates = [
    u'First Last <[email protected]>',
    u'No Brackets [email protected]',
    u'Open Bracket <[email protected]',
    u'Close Bracket [email protected]>',
    u'[email protected]',
    ]

for candidate in candidates:
    print 'Candidate:', candidate
    match = address.search(candidate)
    if match:
        print ' Match name:', match.groupdict()['name']
        print ' Match email:', match.groupdict()['email']
    else:
        print ' No match'
Interpreter output:
>>>
Candidate: First Last <[email protected]>
 Match name: First Last
 Match email: [email protected]
Candidate: No Brackets [email protected]
 No match
Candidate: Open Bracket <[email protected]
 No match
Candidate: Close Bracket [email protected]>
 No match
Candidate: [email protected]
 Match name: None
 Match email: [email protected]
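The three back-reference forms are easier to follow on deliberately small patterns. This is a sketch in Python 3 syntax; the sample sentence and the address user@example.com are made up for the illustration and are far simpler than the email expression above.

```python
import re

sentence = 'this is is a test'

# \1 and (?P=name) both require the same text to appear again
by_number = re.search(r'\b(\w+)\s+\1\b', sentence).group(1)
by_name = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', sentence).group('word')

# (?(1)>) requires a closing bracket only if group 1 matched an opening one
bracketed = re.compile(r'^(<)?(\w+@\w+\.com)(?(1)>)$')
outcomes = [bool(bracketed.match(candidate)) for candidate in [
    '<user@example.com>',     # both brackets: accepted
    'user@example.com',       # no brackets: accepted
    '<user@example.com',      # opening bracket only: rejected
    'user@example.com>',      # closing bracket only: rejected
]]

print(by_number, by_name, outcomes)
```

When the no-expression part of a conditional is omitted, as here, an unmatched group simply makes the conditional match the empty string.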
Use sub() to replace all occurrences of a pattern with another string:
import re

bold = re.compile(r'\*{2}(.*?)\*{2}')
text = 'Make this **bold**. This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\1</b>', text)
Interpreter output:
>>>
Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This <b>too</b>.
To use named groups in the replacement, use the syntax \g<name>. The count argument limits the number of substitutions performed:
import re

bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}', re.UNICODE)
text = 'Make this **bold**. This **too**.'

print 'Text:', text
print 'Bold:', bold.sub(r'<b>\g<bold_text></b>', text, count=1)
Interpreter output:
>>>
Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This **too**.
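Beyond a replacement string, sub() also accepts a function, and subn() reports how many substitutions were made. Neither appears in the listings above; the sketch below, in Python 3 syntax, shows them only as closely related variations.

```python
import re

bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}')
text = 'Make this **bold**. This **too**.'

all_tagged = bold.sub(r'<b>\g<bold_text></b>', text)
first_only = bold.sub(r'<b>\g<bold_text></b>', text, count=1)

# A function receives each Match object and returns the replacement text
shouted = bold.sub(lambda m: m.group('bold_text').upper(), text)

# subn() returns the new string together with the number of replacements
result, replacements = bold.subn(r'<b>\1</b>', text)

print(all_tagged, first_only, shouted, replacements, sep='\n')
```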
str.split() is one of the most frequently used methods for breaking strings apart in order to parse them. It only supports literal separators, though, so when paragraphs are separated by two or more newlines a regular expression is needed. A first attempt uses findall() with the pattern (.+?)\n{2,}.
import re

text = '''Paragraph one
on two lines.

Paragraph two.


Paragraph three.'''

for num, para in enumerate(re.findall(r'(.+?)\n{2,}', text, flags=re.DOTALL)):
    print num, repr(para)
    print
Interpreter output (note the {2,} repetition):
>>>
0 'Paragraph one\non two lines.'

1 'Paragraph two.'

The last paragraph is missing, because it is not followed by two or more newlines. The pattern can be extended to accept the end of the string as well, but split() handles this case more simply:
import re

text = '''Paragraph one
on two lines.

Paragraph two.


Paragraph three.'''

print 'With findall:'
for num, para in enumerate(re.findall(r'(.+?)(\n{2,}|$)', text, flags=re.DOTALL)):
    print num, repr(para)
    print

print
print 'With split:'
for num, para in enumerate(re.split(r'\n{2,}', text)):
    print num, repr(para)
    print
Interpreter output:
>>>
With findall:
0 ('Paragraph one\non two lines.', '\n\n')

1 ('Paragraph two.', '\n\n\n')

2 ('Paragraph three.', '')


With split:
0 'Paragraph one\non two lines.'

1 'Paragraph two.'

2 'Paragraph three.'

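One more property of re.split() is worth a quick look: when the separator pattern is wrapped in a capturing group, the separators are returned along with the pieces. A sketch in Python 3 syntax on the same three-paragraph text:

```python
import re

text = 'Paragraph one\non two lines.\n\nParagraph two.\n\n\nParagraph three.'

paragraphs = re.split(r'\n{2,}', text)           # separators are discarded
with_separators = re.split(r'(\n{2,})', text)    # separators are kept in the list

for num, para in enumerate(paragraphs):
    print(num, repr(para))
```

Keeping the separators is handy when the original text has to be reassembled exactly after processing the pieces.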