Using re.findall, a regex staple for Python web scraping

I. Regex syntax used with re.findall

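1. Single-character patterns

. : any character except a newline

[] : a character set; [aoe] or [a-w] matches any single character in the set

\d : a digit, [0-9] (for str patterns, other Unicode decimal digits too unless re.ASCII is set)

\D : a non-digit

\w : letters, digits, and underscore; for str patterns this includes Chinese and other Unicode word characters

\W : any character not matched by \w

\s : any whitespace character, including space, tab, form feed, and so on; for ASCII text, equivalent to [ \f\n\r\t\v]

\S : any non-whitespace character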
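
The character classes above can be tried directly with re.findall. This is a minimal sketch; the sample string is made up purely to exercise each class and is not from the original article.

```python
import re

# Hypothetical sample text, chosen only to exercise each character class.
text = 'user_01 住在 Room 42\t(east wing)'

digits = re.findall(r'\d', text)      # each single digit
word_runs = re.findall(r'\w+', text)  # runs of letters, digits, underscore, CJK characters
spaces = re.findall(r'\s', text)      # each whitespace character (spaces and the tab)
vowels = re.findall(r'[aoe]', text)   # any one character from the set a, o, e

print(digits)     # ['0', '1', '4', '2']
print(word_runs)  # ['user_01', '住在', 'Room', '42', 'east', 'wing']
print(spaces)     # [' ', ' ', ' ', '\t', ' ']
print(vowels)     # ['e', 'o', 'o', 'e', 'a']
```

Raw strings (the r'' prefix) keep Python from interpreting the backslashes before the regex engine sees them.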

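2. Quantifiers

* : any number of times (>= 0)

+ : at least once (>= 1)

? : optional, 0 or 1 time

{m} : exactly m times, e.g. hello{3}

{m,} : at least m times, e.g. hello{3,}

{m,n} : between m and n times, inclusive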
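
A quantifier applies only to the item immediately before it, so in hello{3,} it is the final o that repeats. The following sketch, using a made-up test string, shows how each quantifier behaves on the same input.

```python
import re

# Hypothetical test string: "hell" followed by 0, 1, 2, 3, and 4 o's.
text = 'hell hello helloo hellooo helloooo'

star = re.findall(r'hello*', text)         # o repeated 0 or more times
plus = re.findall(r'hello+', text)         # o repeated 1 or more times
optional = re.findall(r'hello?', text)     # o present 0 or 1 time
exact = re.findall(r'hello{3}', text)      # o repeated exactly 3 times
at_least = re.findall(r'hello{3,}', text)  # o repeated 3 or more times
between = re.findall(r'hello{1,2}', text)  # o repeated 1 to 2 times

print(star)      # ['hell', 'hello', 'helloo', 'hellooo', 'helloooo']
print(plus)      # ['hello', 'helloo', 'hellooo', 'helloooo']
print(optional)  # ['hell', 'hello', 'hello', 'hello', 'hello']
print(exact)     # ['hellooo', 'hellooo']
print(at_least)  # ['hellooo', 'helloooo']
print(between)   # ['hello', 'helloo', 'helloo', 'helloo']
```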

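3. Anchors

$ : matches at the end of the string (with re.M, at the end of each line)

^ : matches at the start of the string (with re.M, at the start of each line)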
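
Anchors match a position rather than a character. A small sketch with an invented sentence shows that each anchor picks out only the occurrence at that position.

```python
import re

# Hypothetical sample: the word "python" appears at both ends of the string.
text = 'python is fun, I like python'

starts = re.findall(r'^python', text)  # only the occurrence at the very start
ends = re.findall(r'python$', text)    # only the occurrence at the very end
both = re.findall(r'^python$', text)   # would need the whole string to be "python"

print(starts)  # ['python']
print(ends)    # ['python']
print(both)    # []
```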

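4. Groups

(ab) : treats ab as a single unit; when the pattern contains a group, re.findall returns the text captured by the group instead of the whole match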
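
How groups change what re.findall returns is easy to get wrong, so here is a short sketch on a made-up string. It compares a non-capturing group, one capturing group, and two capturing groups.

```python
import re

# Hypothetical sample text for illustration only.
text = 'abab ab x=1 y=22'

no_capture = re.findall(r'(?:ab)+', text)  # non-capturing group: whole matches are returned
one_group = re.findall(r'(ab)+', text)     # one group: only the group's last repetition is returned
pairs = re.findall(r'(\w)=(\d+)', text)    # two groups: a list of tuples, one tuple per match

print(no_capture)  # ['abab', 'ab']
print(one_group)   # ['ab', 'ab']
print(pairs)       # [('x', '1'), ('y', '22')]
```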

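5. Greedy mode

.* : matches as many characters as possible

6. Non-greedy (lazy) mode

.*? : matches as few characters as possible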
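
The difference between the two modes shows up as soon as a closing delimiter appears more than once. A minimal sketch, assuming a made-up HTML fragment:

```python
import re

# Hypothetical HTML fragment with two <b> elements.
html = '<b>one</b> and <b>two</b>'

greedy = re.findall(r'<b>(.*)</b>', html)  # .* runs to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', html)   # .*? stops at the first </b> it can

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```

When scraping, the lazy form is usually the one wanted, since a page tends to repeat the same closing tag many times.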

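7. re.findall can match across multiple lines, and the result depends on the flag passed as the third argument.

re.findall(pattern, string, re.M)

- re.M : multi-line mode; ^ and $ match at the start and end of every line

- re.S : single-line (DOTALL) mode; . also matches newlines, so a match that spans lines contains \n

- re.I : ignore case

- re.sub(pattern, replacement, string) : not a flag but a related function; replaces every match of pattern in string with replacement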
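
The flags and re.sub listed above can be compared side by side. This is an illustrative sketch on an invented three-line string; flags are combined with the | operator.

```python
import re

# Hypothetical three-line input; the last line uses upper-case "ID".
text = 'id: 1\nid: 2\nID: 3'

no_flag = re.findall(r'^id: (\d)', text)                   # ^ matches only at the start of the string
multi = re.findall(r'^id: (\d)', text, re.M)               # ^ matches at the start of every line
multi_icase = re.findall(r'^id: (\d)', text, re.M | re.I)  # also ignores case

dot_default = re.findall(r'id.*', text)    # . stops at each newline
dot_all = re.findall(r'id.*', text, re.S)  # . crosses newlines

masked = re.sub(r'\d', '#', text)  # replace every digit with '#'

print(no_flag)      # ['1']
print(multi)        # ['1', '2']
print(multi_icase)  # ['1', '2', '3']
print(dot_default)  # ['id: 1', 'id: 2']
print(dot_all)      # ['id: 1\nid: 2\nID: 3']
print(repr(masked)) # 'id: #\nid: #\nID: #'
```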

II. Examples

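1. Extract python

'''

import re

key = 'javapythonc++php'

print(re.findall('python', key))  # ['python']

print(re.findall('python', key)[0])  # [0] takes the first match out of the list, so this prints python without brackets or quotes

'''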

2. Extract hello word

'''

key = '<html><h1>hello word</h1></html>'

print(re.findall('<h1>.*</h1>', key))

print(re.findall('<h1>(.*)</h1>', key))

print(re.findall('<h1>(.*)</h1>', key)[0])

'''

3. Extract 170

'''

key = '这个女孩身高170厘米'  # "This girl is 170 cm tall"

print(re.findall(r'\d+', key)[0])  # 170

'''

4. Extract http:// and https://

'''

key = 'http://www.baidu.com and https://www.cnblogs.com'

print(re.findall('https?://', key))

'''

5. Extract hello

'''

key = 'lalala<hTml>hello</HtMl>hahaha'  # target: the text between <hTml> and </HtMl>

print(re.findall('<[hH][tT][mM][lL]>(.*)</[hH][tT][mM][lL]>', key))  # ['hello']

'''
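
The per-letter character sets in the example above can also be replaced by the re.I flag. A sketch of that alternative, reusing the same key:

```python
import re

key = 'lalala<hTml>hello</HtMl>hahaha'

# re.I makes the literal tag names match regardless of case.
result = re.findall(r'<html>(.*)</html>', key, re.I)

print(result)  # ['hello']
```

The flag keeps the pattern readable, at the cost of making every letter in the pattern case-insensitive rather than only the tag name.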

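6. Extract hit. Greedy mode matches as much as possible, so the non-greedy .*? is used here

'''

key = 'qiang@hit.edu.com'  # with ? the match is non-greedy; without ? it is greedy and would return hit.edu.

print(re.findall(r'h.*?\.', key))  # ['hit.']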

'''

7. Match every saas and sas

'''

key = 'saas and sas and saaas'

print(re.findall('sa{1,2}s',key))

'''

8. Match the lines that start with i

'''

key = """fall in love with you

i love you very much

i love she

i love her

"""

print(re.findall('^i.*', key, re.M))

'''

9. Match across all lines

'''

key = """

<div>细思极恐

你的队友在看书,

你的闺蜜在减肥,

你的敌人在磨刀,

隔壁老王在练腰.

</div>

"""

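print(re.findall('.*', key, re.S))  # re.S lets . match \n, so the whole text comes back as one match, followed by an empty match at the end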

'''
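
In scraping, re.S is more often combined with a group and a lazy quantifier to pull out the content of one element that spans several lines. A minimal sketch, assuming a shortened, made-up page fragment rather than the text above:

```python
import re

# Hypothetical multi-line fragment; the <div> content spans two lines.
page = '<div>line one\nline two\n</div>'

per_line = re.findall(r'.+', page)                    # without re.S, . stops at each newline
no_flag = re.findall(r'<div>(.*?)</div>', page)       # fails: . cannot cross the newlines
inner = re.findall(r'<div>(.*?)</div>', page, re.S)   # succeeds: . crosses newlines

print(per_line)  # ['<div>line one', 'line two', '</div>']
print(no_flag)   # []
print(inner)     # ['line one\nline two\n']
```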

————————————————

Original link: https://blog.csdn.net/icanflyingg/article/details/124128611
