Python Web Scraping from Scratch (5): The pyQuery Library

 

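What is pyQuery:

  A powerful and flexible web page parsing library. If writing regular expressions feels too tedious (I can't write them myself), if BeautifulSoup's syntax is too hard to remember, and if you already know jQuery's syntax, then PyQuery is the best choice for you.

Installing pyQuery: run pip3 install pyquery.

Basic usage of pyQuery:

Initialization:

Initializing from a string: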

 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a string

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""

from pyquery import PyQuery as pq

# Parse the HTML string, then select every <a> element.
doc = pq(html)
print(doc('a'))</pre> 
 </div> 
 <p>Output:</p> 
 <p><strong><a href="http://img.e-com-net.com/image/info8/021ebcdec18f4e02949068ec07b3c304.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/021ebcdec18f4e02949068ec07b3c304.jpg" alt="Python Web Scraping from Scratch (5): The pyQuery Library, image 1" width="650" height="126" style="border:1px solid black;"></a></strong></p> 
 <p><strong>Initializing from a URL:</strong></p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a URL

from pyquery import PyQuery as pq

# PyQuery fetches the page at this URL and parses the response.
doc = pq('http://www.baidu.com')
print(doc('input'))</pre> 
 </div> 
 <p>Output:</p> 
 <p><a href="http://img.e-com-net.com/image/info8/4ea8ce6c301e44e2a38cc32a371ec99b.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/4ea8ce6c301e44e2a38cc32a371ec99b.jpg" alt="" width="650" height="83"></a></p> 
 <p><strong>Initializing from a file:</strong></p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a file

from pyquery import PyQuery as pq

# baidu.html must exist in the current working directory.
doc = pq(filename='baidu.html')
print(doc('title'))</pre> 
 </div> 
 <p>Output:</p> 
 <p><a href="http://img.e-com-net.com/image/info8/b52a63af43f54c459e0e5b6257698e94.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/b52a63af43f54c459e0e5b6257698e94.jpg" alt="Python Web Scraping from Scratch (5): The pyQuery Library, image 2" width="650" height="176" style="border:1px solid black;"></a></p> 
 <p>Selection works the same way as in jQuery: selecting by id, name, or class is identical, and so are many other selectors.</p> 
 <p><strong>Basic CSS selectors:</strong></p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# CSS selectors

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""
from pyquery import PyQuery as pq

doc = pq(html)
# '.title' selects every element whose class is "title".
print(doc('.title'))</pre> 
 </div> 
 <p>Output:</p> 
 <p><a href="http://img.e-com-net.com/image/info8/09124e5acf284aacb9a60c16a08d4f19.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/09124e5acf284aacb9a60c16a08d4f19.jpg" alt="" width="650" height="100"></a></p> 
 <p><strong>Finding elements:</strong></p> 
 <p><strong>Child elements:</strong></p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.title')
print(type(items))
print(items)
# find() searches the descendants of the selected elements.
p = items.find('b')
print(type(p))
print(p)</pre> 
 </div> 
 <p>This code selects the elements whose class is title (.title is a class selector, not an id selector). Two elements match: a p tag and an a tag. We then call find to locate the b tag inside them. Output:</p> 
 <p><a href="http://img.e-com-net.com/image/info8/11489db670944dc9b19fe82af7291cae.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/11489db670944dc9b19fe82af7291cae.jpg" alt="Python Web Scraping from Scratch (5): The pyQuery Library, image 3" width="650" height="119" style="border:1px solid black;"></a></p> 
 <p>Note that find searches the descendants of every element in the current selection.</p> 
 <p><strong>children:</strong></p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements with children()

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.title')
# children() returns only the direct children of the selected elements.
print(items.children())</pre> 
 </div> 
 <p>Output:</p> 
 <p><a href="http://img.e-com-net.com/image/info8/41051fcea4f54c448240d8bbdebc7ed9.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/41051fcea4f54c448240d8bbdebc7ed9.jpg" alt="Python Web Scraping from Scratch (5): The pyQuery Library, image 4" width="650" height="122" style="border:1px solid black;"></a></p> 
 <p>You can also pass a selector to children() to filter the result:</p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements filtered by a selector

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.title')
# Keep only the direct children that are <b> elements.
print(items.children('b'))</pre> 
 </div> 
 <p>The output is the same as above.</p> 
 <p><strong>Parent elements:</strong></p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Parent element

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('#link1')
print(items)
# parent() returns the direct parent of the selected element.
print(items.parent())</pre> 
 </div> 
 <p>Output:</p> 
 <p><a href="http://img.e-com-net.com/image/info8/4befe231cf7e410d83e24cd3ea189331.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/4befe231cf7e410d83e24cd3ea189331.jpg" alt="Python Web Scraping from Scratch (5): The pyQuery Library, image 5" width="650" height="167" style="border:1px solid black;"></a></p> 
 <p>Only the direct parent is returned here. The parents method returns all ancestors (the parent, its parent, and so on), and it also accepts a selector to filter them:</p> 
 <div class="cnblogs_code"> 
  <pre>#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Ancestor elements

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('#link1')
print(items)
# Keep only the ancestors that match the 'body' selector.
print(items.parents('body'))</pre> 
 </div> 

Output:

[Image 6: screenshot of the output]

Sibling elements:

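#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sibling elements

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('#link1')
print(items)
# Keep only the siblings of #link1 that match '#link2'.
print(items.siblings('#link2'))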

Output:

[Image 7: screenshot of the output]

That covers the methods for finding elements. Next, let's look at how to iterate over elements.

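Iteration

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Iterating over elements

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('a')
# items() yields each matched element as its own PyQuery object.
for k, v in enumerate(items.items()):
    print(k, v)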

Output:

[Image 8: screenshot of the output]

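 Getting information:

  Getting attributes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Get attributes

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('a')
print(items)
# Two equivalent ways to read the href attribute.
print(items.attr('href'))
print(items.attr.href)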

Output:

[Image 9: screenshot of the output]

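  Getting text:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Get text

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('a')
print(items)
# text() returns the text content as a string.
print(items.text())
print(type(items.text()))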

Output:

[Image 10: screenshot of the output]

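  Getting HTML:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Get HTML

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('a')
# html() returns the inner HTML of the first matched element.
print(items.html())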

Output:

[Image 11: screenshot of the output]

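DOM manipulation:

addClass and removeClass

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: addClass, removeClass

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('#link2')
print(items)
items.addClass('addStyle')  # can also be written as add_class
print(items)
items.remove_class('sister')  # can also be written as removeClass
print(items)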

Output:

[Image 12: screenshot of the output]

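attr and css:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: attr, css

# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('#link2')
# Set a name attribute on the element.
items.attr('name', 'addname')
print(items)
# Set an inline style on the element.
items.css('width', '100px')
print(items)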

This adds a new attribute to the element; if the attribute already exists, its old value is overwritten.

Output:

[Image 13: screenshot of the output]

remove:

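#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: remove

# Sample document: a .wrap element (assumed to be a div) holding text and a <p>.
html = """
<div class="wrap">
    Hello World
    <p>This is a paragraph.</p>
</div>
"""
from pyquery import PyQuery as pq

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print("After remove:")
print(wrap)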

Output:

[Image 14: screenshot of the output]

There are many other DOM methods. If you want to learn more, read the official documentation: https://pyquery.readthedocs.io/en/latest/api.html

Pseudo-class selectors:

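#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Pseudo-class selectors
# Sample document: same structure as the earlier examples.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">Once upo a time were three little sister;and theru name were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/title" class="sister" id="link3">Title</a> Title</p>
        <p class="story">...</p>
    </body>
</html>
"""
from pyquery import PyQuery as pq

doc = pq(html)
wrap = doc('a:first-child')  # <a> elements that are the first child of their parent
print(wrap)
wrap = doc('a:last-child')  # <a> elements that are the last child of their parent
print(wrap)
wrap = doc('a:nth-child(2)')  # <a> elements that are the second child of their parent
print(wrap)
wrap = doc('a:gt(2)')  # matches whose index is greater than 2; indexes start at 0 (0 1 2 3 4), not 1
print(wrap)
wrap = doc('a:nth-child(2n)')  # <a> elements in even child positions (2, 4, ...)
print(wrap)
wrap = doc('a:contains(Lacie)')  # <a> elements whose text contains "Lacie"
print(wrap)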

I won't list them all one by one here. To learn more about CSS selectors, see the CSS tutorial on W3School: http://www.w3school.com.cn/css/index.asp

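That roughly covers how to use pyQuery. For more detail, read the official documentation: https://pyquery.readthedocs.io/en/latest/

Code for the examples above: https://gitee.com/dwyui/pyQuery.git

Thank you for reading. If anything here is incorrect, please point it out. Thanks.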
