Basic Concepts
HTTP Protocol
Wikipedia: The Hypertext Transfer Protocol (HTTP) is a stateless application-level protocol for distributed, collaborative, hypertext information systems.
HyperText Transfer Protocol
A stateless application-layer protocol based on the request/response model.
It uses a URL to identify each network resource, in the following format:
http://host[:port][path]
- host: a valid Internet host domain name or IP address
- port: the port number; the default is 80
- path: the path of the requested resource
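As a quick aside (not in the original notes), Python's standard urllib.parse splits a URL into exactly these parts:

```python
from urllib.parse import urlparse

# Split a URL into scheme / host / port / path components.
url = urlparse('http://example.com:8080/docs/index.html')
print(url.scheme)    # http
print(url.hostname)  # example.com
print(url.port)      # 8080
print(url.path)      # /docs/index.html
```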
Operations on resources:
- GET: request the resource at the URL
- HEAD: request only the response header of the resource at the URL, i.e., its header information
- POST: append new data to the resource at the URL
- PUT: store a resource at the URL, replacing whatever was there
- PATCH: partially update the resource at the URL, i.e., change part of its content
- DELETE: delete the resource stored at the URL

PATCH vs. PUT:
- Suppose the URL holds a UserInfo record with 20 fields (UserID, UserName, ...), and the user changed only UserName:
  - PATCH: submit only the UserName update to the URL
  - PUT: must submit all 20 fields to the URL; fields left out are deleted
- The main benefit of PATCH: it saves network bandwidth (see the sketch below)
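For illustration, a hedged sketch of how these methods map onto the third-party requests library; the endpoint and payload here are made up:

```python
import requests

url = 'http://example.com/userinfo/42'            # hypothetical endpoint
full_record = {'UserID': 42, 'UserName': 'Tom'}   # ...plus the other 18 fields

r = requests.get(url)       # GET: fetch the resource
r = requests.head(url)      # HEAD: fetch only the response headers
r = requests.put(url, data=full_record)             # PUT: replace the whole record
r = requests.patch(url, data={'UserName': 'Tom'})   # PATCH: update just one field
r = requests.delete(url)    # DELETE: remove the resource
```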
Response status codes
- 2xx Success
- 3xx Redirection
  - 300 Multiple Choices: several representations are available; handle or discard
  - 301 Moved Permanently: permanent redirect
  - 302 Found: temporary redirect
  - 304 Not Modified: the requested resource has not changed; discard
  - Note: some Python libraries (urllib2, requests, ...) already handle redirects and follow them automatically (see the sketch after this list)
- 4xx Client error
  - 400 Bad Request: the request has a syntax error the server cannot understand (bad request parameters or path)
  - 401 Unauthorized: the request is not authorized; this status code is used together with the WWW-Authenticate header (no permission to access)
  - 403 Forbidden: the server received the request but refuses to serve it (not logged in / IP banned / ...)
  - 404 Not Found: the requested resource does not exist
- 5xx Server error
  - 500 Internal Server Error: the server hit an unexpected error
  - 503 Service Unavailable: the server cannot handle the request right now; it may recover after a while
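In requests, the status code and any automatically followed redirects are exposed on the response object; a quick hedged sketch (made-up URL):

```python
import requests

r = requests.get('http://example.com/some/page')  # hypothetical URL
print(r.status_code)   # e.g. 200
print(r.history)       # the 301/302 hops that requests followed automatically
r.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
```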
HTTP Headers
- Request headers:
```
Accept: text/plain
Accept-Charset: utf-8
Accept-Encoding: gzip, deflate
Accept-Language: en-US
Connection: keep-alive
Content-Length: 348
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0
Cookie: $version=1; Skin=new;
Date: ...
Host: ...
...
```
- Response headers:
```
Status: 200 OK
Accept: text/plain;charset=utf-8
Content-Encoding: gzip
Content-Language: en-US
Content-Length: 348
Set-Cookie: UserID=xxx; Max-Age=3600; Version=1; ...
Location: ...
Last-Modified: ...
...
```
Depth-First vs. Breadth-First Crawling
```
        A
       / \
      B   C
     / \
D,E,F,G  X,Y,Z
    |
H,I,J,K
```
- Depth-first crawling (vertical)
  - Stack (recursion, last in first out)
  - A -> B -> D -> H -> I,J,K -> E,F,G -> C -> X,Y,Z
- Breadth-first crawling (horizontal)
  - Queue (first in first out)
  - A -> B,C -> D,E,F,G ; X,Y,Z -> H,I,J,K
- Strategy considerations (a breadth-first sketch follows this list):
  - Important pages tend to sit close to the seed sites
  - A page may be reachable via many paths (the web is a graph)
  - Breadth-first crawling lends itself to multiple crawlers working in parallel
  - In practice, combine depth with breadth
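A minimal breadth-first sketch, assuming a hypothetical get_links(url) helper that returns a page's outgoing links; swapping the FIFO queue for a LIFO stack (pop from the end) would make it depth-first:

```python
from collections import deque

def bfs_crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl: a FIFO queue yields level-by-level order."""
    queue = deque([seed])
    seen = {seed}                  # doubles as the no-revisit record
    while queue and len(seen) <= max_pages:
        url = queue.popleft()      # FIFO -> breadth-first
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
```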
No-Revisit (Deduplication) Strategy
- Keep a history of crawled URLs
  - Store them in a database (slow)
  - Use a HashSet (limited by memory)
  - Compress the URLs as much as possible
    - MD5/SHA-1 hash each URL into a fixed-length digest; still long, so the hash is usually taken modulo some size
    - BitMap approach: build a BitSet and map each URL (possibly its MD5) through a hash function onto one or more bit positions
    - BloomFilter: a BitMap that applies several hash functions per URL
    - Note: all of these allow some collisions
- In practice (see the sketch below):
  - Estimate the number of pages on the site
  - Pick a hash algorithm and a space threshold that keep the collision probability low
  - Pick suitable storage structures and algorithms
- Notes:
  - With few pages (the common case), no compression is needed
  - With many pages, a BloomFilter can compress the URLs; the key is to compute the collision probability and size the storage from it
  - In a distributed system, the hash space can be partitioned across multiple hosts
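A toy sketch of the BloomFilter idea above; the bit-array size and hash count are illustrative only, not tuned against a real collision target:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted MD5 hashes over one bit array."""
    def __init__(self, size_bits=1 << 20, k=3):
        self.size = size_bits
        self.k = k
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.k):
            digest = hashlib.md5(('%d:%s' % (i, url)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add('http://example.com/a')
print('http://example.com/a' in seen)  # True
print('http://example.com/b' in seen)  # False (with high probability)
```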
Restrictions on Web Crawlers
- Source inspection: check the User-Agent field of incoming HTTP requests and only answer browsers or friendly crawlers
- Published policy: the Robots protocol, by which a site tells crawlers which pages may be fetched and which may not
  - It lives in a robots.txt file in the site root (the Robots Exclusion Standard)
  - The Robots protocol is advisory rather than binding; a crawler may ignore it, but that carries legal risk
  - Basic syntax (`*` means all, `/` means the root directory):
```
User-agent: *
Disallow: /
```
  - e.g. https://www.jd.com/robots.txt (a parsing sketch follows below):
```
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
```
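Python's standard urllib.robotparser can evaluate such rules; a small sketch against the file above:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.jd.com/robots.txt')
rp.read()  # download and parse robots.txt
print(rp.can_fetch('*', 'https://www.jd.com/pop/123.html'))  # False per the rules above
print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/'))     # False: banned outright
```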
Site Structure Analysis
- Use the information in the sitemap
- Analyze the site's directory structure
- Page parsers:
  - Fuzzy matching:
    - Regular expressions
  - Structured parsing:
    - html.parser
    - BeautifulSoup
    - lxml
    - ...
Document Parsing with Re
Regular Expressions
Regular Expression (regex)
- A general framework for expressing a set of strings concisely
- Strength: one concise line can stand for a whole family of strings (the line is the feature, i.e., the pattern)
- Uses (mainly string matching):
  - Expressing textual signatures (viruses, intrusions, ...)
  - Matching all or part of a string
  - Finding or replacing a group of strings
  - ...
The syntax consists of characters and operators. Common operators:

Matching a single character:

| Operator | Description | Example |
|---|---|---|
| `.` | any single character | / |
| `[ ]` | character set: matches one of the listed characters | `[abc]` matches a, b, or c; `[a-z]` matches a single character from a to z |
| `[^ ]` | negated set: excludes the listed characters | `[^abc]` matches a single character that is not a, b, or c |
| `\d` | digit, equivalent to `[0-9]` | / |
| `\D` | non-digit | / |
| `\w` | word character, equivalent to `[A-Za-z0-9_]` | / |
| `\W` | non-word character | / |
| `\s` | whitespace (space, tab, ...) | / |
| `\S` | non-whitespace | / |

Matching quantity:

| Operator | Description | Example |
|---|---|---|
| `*` | zero or more of the preceding character | `abc*` matches ab, abc, abcc, abccc, ... |
| `+` | one or more of the preceding character | `abc+` matches abc, abcc, abccc, ... |
| `?` | zero or one of the preceding character | `abc?` matches ab, abc |
| `{m}` | exactly m of the preceding character | `ab{2}c` matches abbc |
| `{m,}` | at least m of the preceding character | `ab{2,}c` matches abbc, abbbc, abbbbc, ... |
| `{m,n}` | m to n (inclusive) of the preceding character | `ab{1,2}c` matches abc, abbc |

Matching boundaries:

| Operator | Description | Example |
|---|---|---|
| `^` | start of the string | `^abc` matches abc at the start of a string |
| `$` | end of the string | `abc$` matches abc at the end of a string |
| `\b` | word boundary; note it matches the boundary between a word and a symbol, not the separator itself (a word can be Chinese/English characters or digits; a symbol can be punctuation, space, tab, newline) | given "a nice day" and "a niceday", `\bnice\b` matches the "nice" in "a nice day" |
| `\B` | non-word-boundary | given "a nice day" and "a niceday", `\bnice\B` matches the "nice" in "a niceday" |

Matching groups:

| Operator | Description | Example |
|---|---|---|
| `\|` | either the left or the right expression | `abc\|def` matches abc, def |
| `( )` | grouping marker; only the `\|` operator may be used inside | `(abc)` matches abc; `(abc\|def)` matches abc, def |
| `\num` | backreference to what group num matched | `<(\w*)><(\w*)>.*</\2></\1>` matches `<html><h1>hh</h1></html>` but not `<html><h1>hh</h1></abc>` |
| `(?P<name>)` | gives a group a name | `<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>` matches `<html><h1>hh</h1></html>` but not `<html><h1>hh</h1></abc>` |
| `(?P=name)` | backreference to what the group named name matched | / |
eg1:
- A set of strings (infinitely many): 'PY', 'PYY', 'PYYY', 'PYYYY', ..., 'PYYYY......'
- Regular expression (a concise expression of the infinite set): `PY+`

eg2:
- A set of strings: 'PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'
- Regular expression (concise form): `P(Y|YT|YTH|YTHO)?N`

eg3:
- Strings starting with 'PY', followed by at most 10 characters, none of which may be 'P' or 'Y' (e.g. 'PYABC' matches; 'PYKXYZ' does not)
- Regular expression (concise form of the feature set): `PY[^PY]{0,10}`

eg4:
- 'PN', 'PYN', 'PYYN', 'PYYYN', ...
- Regular expression: `PY{0,3}N`
Classic regular expressions:
- `^[A-Za-z]+$`: strings made of the 26 letters
- `^[A-Za-z0-9]+$`: strings made of the 26 letters and digits
- `[1-9]\d{5}`: 6-digit postal codes in mainland China
- `[\u4e00-\u9fa5]`: a Chinese character
- `\d{3}-\d{8}|\d{4}-\d{7}`: domestic phone numbers (e.g. 010-68913536)
- `(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])`: an IP address (4 octets)
  - 0-99: `[1-9]?\d`
  - 100-199: `1\d{2}`
  - 200-249: `2[0-4]\d`
  - 250-255: `25[0-5]`
  - Simplified (looser) forms: `\d+\.\d+\.\d+\.\d+` or `\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`
The re Library
- Python's standard library for string matching
- Import: `import re`
- How patterns are written:
  - raw string: `r'text'`, e.g. `r'[1-9]\d{5}'`, `r'\d{3}-\d{8}|\d{4}-\d{7}'`
  - plain string (more tedious), e.g. `'[1-9]\\d{5}'`, `'\\d{3}-\\d{8}|\\d{4}-\\d{7}'`
  - Note: a raw string does not re-escape backslashes, so raw strings are recommended whenever the pattern contains escapes
Functional usage: one-off operations

re.search(pattern, string, flags=0): scan for the first match; returns a match object
- pattern: the regular expression (string / raw string)
- string: the string to search
- flags: control flags
  - re.I, re.IGNORECASE: case-insensitive matching
  - re.M, re.MULTILINE: `^` also matches at the start of each line of the string
  - re.S, re.DOTALL: `.` matches any character (by default it matches everything except newline)
- eg:
```python
import re
match = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
if match:
    print(match.group(0))  # 100081
```
re.match(pattern, string, flags=0): match from the start of the string; returns a match object
- Same parameters as above
- eg:
```python
match = re.match(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(match.group(0))  # AttributeError: 'NoneType' object has no attribute 'group'
match = re.match(r'[1-9]\d{5}', '100081BIT TSU100084')
if match:
    print(match.group(0))  # 100081
```
re.findall(pattern, string, flags=0): scan the whole string; returns a list of the matching substrings
- Same parameters as above
- eg:
```python
ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')  # ['100081', '100084']
```
re.finditer(pattern, string, flags=0): scan the whole string; returns an iterator whose elements are match objects
- Same parameters as above
- eg:
```python
for match in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if match:
        print(match.group(0))
# 100081
# 100084
```
re.split(pattern, string, maxsplit=0, flags=0): split the string; returns a list
- maxsplit: maximum number of splits; the remainder is returned as the final element
- eg:
```python
re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')              # ['BIT', ' TSU', '']
re.split(r'[1-9]\d{3}', 'BIT100081 TSU100084')              # ['BIT', '81 TSU', '84']
re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)  # ['BIT', ' TSU100084']
```
re.sub(pattern, repl, string, count=0, flags=0): replace every matching substring; returns the resulting string
- repl: the replacement string
- string: the string to search
- count: maximum number of replacements
- eg:
```python
re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084')  # 'BIT:zipcode TSU:zipcode'
```
Object-oriented usage: compile once, operate many times
- Step 1: `regex = re.compile(pattern, flags=0)` compiles the pattern string into a regular-expression object
- Step 2: call the compiled object's methods (flags are fixed at compile time; see the sketch below):
  - regex.search(string)
  - regex.match(string)
  - regex.findall(string)
  - regex.finditer(string)
  - regex.split(string, maxsplit=0)
  - regex.sub(repl, string, count=0)
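A short sketch of the compiled form; behavior matches the one-off functions above:

```python
import re

zipcode = re.compile(r'[1-9]\d{5}')            # compile once
print(zipcode.search('BIT100081').group(0))    # 100081
print(zipcode.findall('BIT100081 TSU100084'))  # ['100081', '100084']
print(zipcode.sub(':zipcode', 'BIT100081'))    # BIT:zipcode
```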
The Match object:
- The result of one match; it carries plenty of information about the match
- Attributes:
  - .string: the text that was searched
  - .re: the pattern object (regular expression) used for the match
  - .pos: start position of the search
  - .endpos: end position of the search
- Methods:
  - .group(0): the matched string
  - .start(): start position of the match in the original string
  - .end(): end position of the match in the original string
  - .span(): returns (.start(), .end())
- eg:
```python
match = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
# Attributes:
match.string    # 'BIT100081 TSU100084'
match.re        # re.compile('[1-9]\\d{5}')
match.pos       # 0
match.endpos    # 19
# Methods:
match.group(0)  # '100081'
match.start()   # 3
match.end()     # 9
match.span()    # (3, 9)
```
Greedy matching (the default): returns the longest matching substring
```python
match = re.search(r'PY.*N', 'PYANBNCNDN')
match.group(0)  # 'PYANBNCNDN'
```
Minimal (lazy) matching: returns the shortest match; append `?` to the quantifier
- Any quantifier whose match length can vary becomes minimal with a trailing `?`:
  - `*?`: zero or more of the preceding character, minimal
  - `+?`: one or more of the preceding character, minimal
  - `??`: zero or one of the preceding character, minimal
  - `{m,n}?`: m to n (inclusive) of the preceding character, minimal
```python
match = re.search(r'PY.*?N', 'PYANBNCNDN')
match.group(0)  # 'PYAN'
```
Document Parsing with BeautifulSoup
A web-page parsing library; efficient, currently supports parsing html, xml, and html5 documents, and can be configured with different parsers.

Common parsers:

| Parser | Usage | Notes |
|---|---|---|
| html.parser | BeautifulSoup(content,'html.parser') | Python's built-in standard library; moderate speed and error tolerance; no external dependencies |
| lxml | BeautifulSoup(content,'lxml'), BeautifulSoup(content,'xml') | third-party (`pip install lxml`); fast (partial traversal); supports XML parsing; strong error tolerance; depends on a C extension |
| html5lib | BeautifulSoup(content,'html5lib') | third-party (`pip install html5lib`); slow; parses the way a browser does and produces HTML5-format documents; best error tolerance; no external C dependency |

Install (BeautifulSoup lives in the bs4 package, which must be installed separately):
```
pip install beautifulsoup4
```
Create a BeautifulSoup object, which parses the document into a structured DOM tree (HTML/XML <=> document tree <=> BeautifulSoup object):
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>data</p></body></html>", 'html.parser')
print(soup.p)
# Pretty-print (adds newlines and indentation to the HTML); also works on a tag: <tag>.prettify()
print(soup.prettify())
```
Accessing Nodes

| Basic BeautifulSoup element | Meaning | Usage | Example |
|---|---|---|---|
| Tag | a tag | `<tag>` | `soup.p` |
| Name | the tag's name, a string | `<tag>.name` | `soup.p.name` |
| Attributes | the tag's attributes, organized as a dict | `<tag>.attrs` | `soup.p.attrs`, `soup.p['attrname']` |
| NavigableString | the non-attribute string inside a tag (the string between `<>...</>`) | `<tag>.string` | `soup.p.string` |
| Comment | the comment part of a tag's string; a special kind of NavigableString | / | / |

- Test whether an attribute is set: `has_attr("attrname")`
- Get an attribute: `.attrs["attrname"]`, `["attrname"]`
- Get content: `.text`, `.get_text()`, `.string`

`.string` vs `.text`:
- `.string` on a Tag object returns a NavigableString object
- `.text` gets all the child strings and returns them concatenated
- Sample (a verification snippet follows the table):

| HTML | `.string` | `.text` |
|---|---|---|
| `<td>some text</td>` | some text | some text |
| `<td></td>` | None | (empty) |
| `<td><p>more text</p></td>` | more text | more text |
| `<td>even <p>more text</p></td>` | None (two or more strings: `.string` cannot tell which to return) | even more text (the concatenation of both strings) |
| `<td><!--This is comment--></td>` | This is comment | (empty) |
| `<td>even <!--This is comment--></td>` | None | even |
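A quick check of the fourth row of the table, using bs4:

```python
from bs4 import BeautifulSoup

td = BeautifulSoup('<td>even <p>more text</p></td>', 'html.parser').td
print(td.string)  # None: two child strings, so .string is ambiguous
print(td.text)    # 'even more text': all child strings concatenated
```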
Navigating the Tree
- Going down: children and descendants
  - .contents: returns a list of direct children
  - .children: returns an iterator (list_iterator) over direct children
  - .descendants: returns a generator over all descendant nodes
- Going up: parent and ancestors
  - .parent: the node's parent
  - .parents: a generator over all ancestor nodes (including the soup itself)
- Going sideways: siblings
  - .next_sibling / .previous_sibling: the next/previous sibling in document order
  - .next_siblings / .previous_siblings: generators over all following/preceding siblings in document order
- Going back and forth: document order, ignoring levels
  - .next_element / .next_elements
  - .previous_element / .previous_elements
Searching the Tree
- Searching down (descendants): find / find_all
- Searching up (ancestors): find_parent / find_parents
- Searching sideways (siblings): find_next_sibling / find_previous_sibling, find_next_siblings / find_previous_siblings
- Searching back and forth (document order, ignoring levels): find_next / find_all_next, find_previous / find_all_previous
- Notes:
  - Method parameters: (name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs); regular expressions can be used
    - name: tag name
    - attrs: tag attributes
    - recursive: whether to search all descendants; defaults to True
    - text: content string
    - limit: maximum number of results
  - `<tag>(..)` is equivalent to `<tag>.find_all(..)`, and `soup(..)` is equivalent to `soup.find_all(..)`
  - Each result element is a bs4.element.Tag object
Nodes can also be selected with CSS selectors: .select('...')
- Basic selectors:
  - `#id`
  - `tagName`
  - `.styleClass`
- Attribute filters:
  - `[attribute]`
  - `[attribute=value]`
  - `[attribute!=value]`
  - `[attribute^=value]`
  - `[attribute$=value]`
  - `[attribute*=value]`
- Hierarchy selectors:
  - `ancestor descendent`
  - `parent > child`
  - `prev + next` (the next sibling tag)
  - `prev ~ siblings` (all following sibling tags)
- Element filters:
  - `:not(selector)`
  - `:nth-of-type(index)`
  - `:nth-child(index)`
  - `:first-child`
  - `:last-child`
  - `:only-child`
- Content filters:
  - `:contains(text)`
  - `:empty`
  - `:has(selector)`
- Form-state filters:
  - `:enabled`
  - `:checked`
  - `:disabled`
- Combinations:
  - `selector1, selector2, selectorN`: the union of several selectors
  - `[selector1][selector2][selectorN]`: elements matching all the attribute selectors at once
Notes:
- BeautifulSoup uses an encoding auto-detection sublibrary to identify the document's encoding and convert it to Unicode; output is encoded as utf-8 (see the snippet below)
- Get attribute values: .attrs, .attrs['xxx']
- Get content: .text, .get_text(), .string, .strings
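A small sketch of the encoding detection, assuming a charset detector (chardet or charset-normalizer) is installed; detection is heuristic, so the guess may vary:

```python
from bs4 import BeautifulSoup

raw = '<html><p>数据</p></html>'.encode('gbk')  # bytes in a non-UTF-8 encoding
soup = BeautifulSoup(raw, 'html.parser')
print(soup.original_encoding)  # the detector's guess, e.g. 'gbk' or a superset like 'gb18030'
print(soup.p.text)             # 数据 -- already converted to a Unicode str
```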
Demo: Accessing Nodes
```python
from bs4 import BeautifulSoup

content = '''
<b>Chat with sb</b>
<a> This is title <!-- Guess --> </a>
<i><!--This is comment--></i>
<div id="div1">
<div id="div2">
<p id="test1" class="highlight">
Hello <a>Tom</a>
Nice to meet you <!-- This is a comment -->
</p>
<p class="story" id="test2">Story1</p>
<p class="story" id="test3">Story2</p>
</div>
</div>
'''
soup = BeautifulSoup(content, 'html.parser')
```
Tag `name` / `attrs`:
```python
print("soup.p:", soup.p)
# <p class="highlight" id="test1">
# Hello <a>Tom</a>
# Nice to meet you <!-- This is a comment -->
# </p>
print("soup.p.name:", soup.p.name)    # p
print("soup.p.attrs:", soup.p.attrs)  # {'id': 'test1', 'class': ['highlight']}
print("soup.p.attrs['class']:", soup.p.attrs["class"])  # ['highlight']
print("soup.p.attrs['id']:", soup.p.attrs["id"])        # test1
print("soup.p['class']:", soup.p["class"])              # ['highlight']
```
Tag `text` / `string`:
```python
print("soup.p.text:", soup.p.text)
#
# Hello Tom
# Nice to meet you
#
print("soup.p.get_text():", soup.p.get_text())
#
# Hello Tom
# Nice to meet you
#
print("type(soup.p.get_text()):", type(soup.p.get_text()))  # <class 'str'>
print('--- Demo: Tag <p> string ---')
print("soup.p.string:", soup.p.string)              # None
print("type(soup.p.string):", type(soup.p.string))  # <class 'NoneType'>
print("soup.p.strings:", soup.p.strings)
# <generator object Tag._all_strings at 0x00000000028FDD68>
for i, s in enumerate(soup.p.strings):
    print(i, ":", s)
# 0 :
# Hello
# 1 : Tom
# 2 :
# Nice to meet you
# 3 :
print('--- Demo: Tag <a> text/string ---')
print("soup.a.text:", soup.a.text)        # This is title
print("soup.a.string:", soup.a.string)    # None (a string plus a comment inside: ambiguous)
print("type(soup.a.string):", type(soup.a.string))  # <class 'NoneType'>
print('--- Demo: Tag <b> text/string ---')
print("soup.b.text:", soup.b.text)        # Chat with sb
print("soup.b.string:", soup.b.string)    # Chat with sb
print("type(soup.b.string):", type(soup.b.string))  # <class 'bs4.element.NavigableString'>
print('--- Demo: Tag <i> text/string ---')
print("soup.i.text:", soup.i.text)        # (empty: comments are not part of .text)
print("soup.i.string:", soup.i.string)    # This is comment
print("type(soup.i.string):", type(soup.i.string))  # <class 'bs4.element.Comment'>
```
Demo: Navigating the Tree
```python
from bs4 import BeautifulSoup
from bs4 import element  # needed by the print helpers below

content = '''
<b>Chat with sb</b>
<a> This is title <!-- Guess --> </a>
<i><!--This is comment--></i>
<div id="div1">
<div id="div2">
<p id="test1" class="highlight">
Hello <a>Tom</a>
Nice to meet you <!-- This is a comment -->
</p>
<p class="story" id="test2">Story1</p>
<p class="story" id="test3">Story2</p>
</div>
</div>
'''
soup = BeautifulSoup(content, 'html.parser')

def print_result(result):
    if type(result) == element.Tag or (type(result) == list and len(result) == 0):
        print(result)
        return
    for i, r in enumerate(result):
        print(i, ":", r)
    print('-------------------------')

def print_result_name(result):
    if type(result) == element.Tag or type(result) == element.NavigableString or (type(result) == list and len(result) == 0):
        print(result)
        return
    for i, r in enumerate(result):
        print(i, ":", r.name)
    print('-------------------------')
```
Going down:
- .contents
```python
print(type(soup.p.contents))  # <class 'list'>
print_result(soup.p.contents)
# 0 :
# Hello
# 1 : <a>Tom</a>
# 2 :
# Nice to meet you
# 3 : This is a comment
# 4 :
```
- .children
```python
print(soup.p.children)  # <list_iterator object at 0x0000000001E742E8>
print_result(soup.p.children)
# 0 :
# Hello
# 1 : <a>Tom</a>
# 2 :
# Nice to meet you
# 3 : This is a comment
# 4 :
```
- .descendants
```python
print(soup.p.descendants)  # <generator object Tag.descendants at 0x00000000028ADD68>
print_result(soup.p.descendants)
# 0 :
# Hello
# 1 : <a>Tom</a>
# 2 : Tom
# 3 :
# Nice to meet you
# 4 : This is a comment
# 5 :
```
Going up:
- .parent
```python
print(type(soup.p.parent))  # <class 'bs4.element.Tag'>
print_result(soup.p.parent)
# <div id="div2">
# <p class="highlight" id="test1">
# Hello <a>Tom</a>
# Nice to meet you <!-- This is a comment -->
# </p>
# <p class="story" id="test2">Story1</p>
# <p class="story" id="test3">Story2</p>
# </div>
```
- .parents
```python
print(soup.p.parents)  # <generator object PageElement.parents at 0x00000000028FDD68>
print_result_name(soup.p.parents)
# 0 : div
# 1 : div
# 2 : [document]
```
Going sideways:
- .next_sibling
```python
print_result(soup.p.next_sibling)
# 0 :
```
- .next_siblings
```python
print(soup.p.next_siblings)  # <generator object PageElement.next_siblings at 0x00000000028FDD68>
print_result(soup.p.next_siblings)
# 0 :
#
# 1 : <p class="story" id="test2">Story1</p>
# 2 :
#
# 3 : <p class="story" id="test3">Story2</p>
# 4 :
```
- vs. find_next_siblings (returns only tags, skipping the whitespace strings)
```python
print('--- Demo: `find_next_siblings()` ---')
result = soup.p.find_next_siblings()
print_result(result)
# 0 : <p class="story" id="test2">Story1</p>
# 1 : <p class="story" id="test3">Story2</p>
```
Going forth and back:
- .next_element
```python
print(soup.p.next_element)
#
# Hello
print(type(soup.p.next_element))  # <class 'bs4.element.NavigableString'>
```
- .next_elements
```python
print(soup.p.next_elements)  # <generator object PageElement.next_elements at 0x00000000028FDD68>
print_result(soup.p.next_elements)
# 0 :
# Hello
# 1 : <a>Tom</a>
# 2 : Tom
# 3 :
# Nice to meet you
# 4 : This is a comment
# 5 :
# 6 :
# 7 : <p class="story" id="test2">Story1</p>
# 8 : Story1
# 9 :
# 10 : <p class="story" id="test3">Story2</p>
# 11 : Story2
# 12 :
# 13 :
# 14 :
```
- vs. find_all_next() (tags only)
```python
result = soup.p.find_all_next()
print_result(result)
# 0 : <a>Tom</a>
# 1 : <p class="story" id="test2">Story1</p>
# 2 : <p class="story" id="test3">Story2</p>
```
Demo: Searching the Tree
```python
from bs4 import BeautifulSoup
from bs4 import element
import re

content = '''
<html><head><title>The Dormouse's story</title></head> <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify())

def print_result(result):
    if type(result) == element.Tag or (type(result) == list and len(result) == 0):
        print(result)
        return
    for i, r in enumerate(result):
        print(i, ":", r)
    print('-------------------------')

def print_result_name(result):
    if type(result) == element.Tag or (type(result) == list and len(result) == 0):
        print(result)
        return
    for i, r in enumerate(result):
        print(i, ":", r.name)
    print('-------------------------')
```
Searching down
- by name:
```python
print('--- Demo: `find_all("a")` ---')
result = soup.find_all('a')
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print('--- Demo: `find_all(["a","title"])` ---')
result = soup.find_all(['a', 'title'])
print_result(result)
# 0 : <title>The Dormouse's story</title>
# 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print('--- Demo: `find_all(True)` ---')
result = soup.find_all(True)  # True matches every tag in the document
print_result(result)
# 0 : <html>...</html>  (the whole document; full dumps repeat the content string above)
# 1 : <head><title>The Dormouse's story</title></head>
# 2 : <title>The Dormouse's story</title>
# 3 : <body>...</body>
# 4 : <p class="title"><b>The Dormouse's story</b></p>
# 5 : <b>The Dormouse's story</b>
# 6 : <p class="story"> ... (the three-sisters paragraph) ... </p>
# 7 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 8 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 9 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 10 : <p class="story">...</p>

print('--- Demo: `find_all(re.compile("b"))` ---')
result = soup.find_all(re.compile('b'))  # tag names containing 'b': body, b
print_result(result)
# 0 : <body> ... (the whole body) ... </body>
# 1 : <b>The Dormouse's story</b>
```
- by attrs:
```python
print('--- Demo: find_all("p","story") ---')
result = soup.find_all('p', 'story')
print_result(result)
# 0 : <p class="story"> ... (the three-sisters paragraph) ... </p>
# 1 : <p class="story">...</p>

print('--- Demo: find_all(id="link1") ---')
result = soup.find_all(id='link1')
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print('--- Demo: find_all(class_="sister") ---')
result = soup.find_all(class_='sister')
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print('--- Demo: find_all(id=re.compile("link")) ---')
result = soup.find_all(id=re.compile('link'))
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print('--- Demo: find_all(attrs={"class":"story"}) ---')
result = soup.find_all(attrs={'class': 'story'})
print_result(result)
# 0 : <p class="story"> ... (the three-sisters paragraph) ... </p>
# 1 : <p class="story">...</p>
```
- by recursive:
```python
print('--- Demo: find_all("a") ---')
result = soup.find_all('a')
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print('--- Demo: find_all("a", recursive=False) ---')
result = soup.find_all('a', recursive=False)  # direct children of soup only
print_result(result)
# []
```
- by string/text:
```python
print('--- Demo: find_all(string="three") ---')
result = soup.find_all(string='three')  # must match the whole string exactly
print_result(result)
# []

print('--- Demo: find_all(string=re.compile("e")) ---')
result = soup.find_all(string=re.compile('e'))
print_result(result)
# 0 : The Dormouse's story
# 1 : The Dormouse's story
# 2 :
# Once upon a time there were three little sisters; and their names were
#
# 3 : Elsie
# 4 : Lacie
# 5 : Tillie
# 6 : ; and they lived at the bottom of a well.
```
- by limit: find() amounts to find_all() with limit=1 (returning the single result instead of a list)
```python
print('--- Demo: find_all("a", limit=2) ---')
result = soup.find_all('a', limit=2)
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```
- by a user-defined filter function:
```python
print('--- Demo: using a filter function ---')
def my_filter(tag):
    return tag.has_attr('id') and re.match('link', tag.get("id"))
result = soup.find_all(my_filter)
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
Searching up:
- find_parents
```python
print('--- Demo: link2.`find_parents()` ---')
result = soup.find(id="link2").find_parents()
print_result_name(result)
# 0 : p
# 1 : body
# 2 : html
# 3 : [document]

print('--- Demo: link2.`find_parents("p")` ---')
result = soup.find(id="link2").find_parents('p')
print_result(result)
# 0 : <p class="story"> ... (the enclosing three-sisters paragraph) ... </p>
```
Searching sideways:
- find_next_siblings
```python
print('--- Demo: `find_next_siblings()` ---')
result = soup.find(id="link1").find_next_siblings()
print_result(result)
# 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
Searching forth and back:
- find_all_next
```python
print('--- Demo: `find_all_next()` ---')
result = soup.find(id="link1").find_all_next()
print_result(result)
# 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 2 : <p class="story">...</p>
```
Demo: CSS Selectors
```python
from bs4 import BeautifulSoup
from bs4 import element

content = '''
<html><head><title>The Dormouse's story</title></head> <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
<input type="text" disabled value="input something">
</body>
</html>
'''
soup = BeautifulSoup(content, 'html.parser')
```
Basic selectors
- `#id`
```python
print('--- Demo: `select("#link1")` ---')
result = soup.select("#link1")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print('--- Demo: `select("a#link2")` ---')
result = soup.select("a#link2")
print_result(result)
# 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```
- `tagName`
```python
print('--- Demo: `select("title")` ---')
result = soup.select("title")
print_result(result)
# 0 : <title>The Dormouse's story</title>
```
- `.styleClass`
```python
print('--- Demo: `select(".sister")` ---')
result = soup.select(".sister")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
Attribute filters
- `[attribute]`
```python
print('--- Demo: `select("a[href]")` ---')
result = soup.select('a[href]')
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
- `[attribute=value]`
```python
print('--- Demo: `select("[class=sister]")` ---')
result = soup.select("[class=sister]")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
- `[attribute^=value]`
```python
print("--- Demo: `select('a[href^=...]')` ---")
result = soup.select('a[href^="http://example.com/"]')
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
- `[attribute$=value]`
```python
print("--- Demo: `select('a[href$=tillie]')` ---")
result = soup.select('a[href$="tillie"]')
print_result(result)
# 0 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
- `[attribute*=value]`
```python
print("--- Demo: `select('a[href*=.com/el]')` ---")
result = soup.select('a[href*=".com/el"]')
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
- `[selector1][selector2][selectorN]`
```python
print("--- Demo: `[class=sister][id=link2]` --- ")
print_result(soup.select("[class=sister][id=link2]"))
# 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```
Hierarchy selectors
- `ancestor descendent`
```python
print('--- Demo: `select("body a")` ---')
result = soup.select("body a")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
- `parent > child`
```python
print('--- Demo: `select("body > a")` ---')
result = soup.select("body > a")
print_result(result)
# []  (the <a> tags are children of <p>, not direct children of <body>)

print('--- Demo: `select("p > a")` ---')
result = soup.select("p > a")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print('--- Demo: `select("p > a:nth-of-type(2)")` ---')
result = soup.select("p > a:nth-of-type(2)")
print_result(result)
# 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

print('--- Demo: `select("p > #link1")` ---')
result = soup.select("p > #link1")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
- `prev + next`: the immediately following sibling tag
```python
print('--- Demo: `select("#link1 + .sister")` ---')
result = soup.select("#link1 + .sister")
print_result(result)
# 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```
- `prev ~ siblings`: all following sibling tags
```python
print('--- Demo: `select("#link1 ~ .sister")` ---')
result = soup.select("#link1 ~ .sister")
print_result(result)
# 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
Element filters
- `:not(selector)`
```python
print("--- Demo: `p:not(.story)` --- ")
print_result(soup.select("p:not(.story)"))
# 0 : <p class="title"><b>The Dormouse's story</b></p>
```
- `:nth-of-type(index)`
```python
print('--- Demo: `select("p:nth-of-type(3)")` ---')
result = soup.select("p:nth-of-type(3)")
print_result(result)
# 0 : <p class="story">...</p>
```
- `:nth-child(index)`
```python
print("--- Demo: `p > :nth-child(1)` --- ")
print_result(soup.select("p > :nth-child(1)"))
# 0 : <b>The Dormouse's story</b>
# 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
- `:first-child`
```python
print("--- Demo: `p > :first-child` --- ")
print_result(soup.select("p > :first-child"))
# 0 : <b>The Dormouse's story</b>
# 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
- `:last-child`
```python
print("--- Demo: `p > :last-child` --- ")
print_result(soup.select("p > :last-child"))
# 0 : <b>The Dormouse's story</b>
# 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
- `:only-child`
```python
print("--- Demo: `p > :only-child` --- ")
print_result(soup.select("p > :only-child"))
# 0 : <b>The Dormouse's story</b>
```
Content filters
- `:contains(text)`
```python
print("--- Demo: `p:contains(story)` --- ")
print_result(soup.select("p:contains(story)"))
# 0 : <p class="title"><b>The Dormouse's story</b></p>
```
- `:empty`
```python
print("--- Demo: `p:empty` --- ")
print_result(soup.select("p:empty"))
# []
```
- `:has(selector)`
```python
print("--- Demo: `p:has(b)` --- ")
print_result(soup.select("p:has(b)"))
# 0 : <p class="title"><b>The Dormouse's story</b></p>
```
Form-state filters
- `:enabled`, `:disabled`, `:checked`
```python
print("--- Demo: `:disabled` --- ")
print_result(soup.select(":disabled"))
# 0 : <input disabled="" type="text" value="input something"/>
```
Others:
- `selector1, selector2, selectorN`
```python
print('--- Demo: `select("#link1,#link2")` ---')
result = soup.select("#link1,#link2")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```
- `select_one()`: returns only the first match
```python
print('--- Demo: `select_one(".sister")` ---')
result = soup.select_one(".sister")
print_result(result)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
- Get attribute values from the selected tags:
```python
print('--- Demo: get attribute value ---')
result = soup.select(".sister")
print_result(result)
# 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
print(result[0].get_text())   # Elsie
print(result[0].attrs)        # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
print(result[0].attrs['id'])  # link1
```
Document Parsing with XPath
- Uses path expressions to select nodes or node sets from an XML/HTML document
- Install: `pip install lxml`
- Import: `from lxml import etree`
- Note: like the regex engine, lxml is implemented in C; it is a high-performance Python HTML/XML parser, and XPath syntax makes it quick to locate specific elements and node information
- Tools: the Chrome extension XPath Helper quickly produces matching rules for page elements
Path expressions
- `//`: select matching nodes anywhere in the document, regardless of position
  - `//p` (`.//p`), `/p//a`, `//p/a`
- `/`: select from the root node
  - `/p` (`./p`), `/p/a`
- `.`: the current node; `..`: the parent of the current node
  - `./p`, `../p`, `//p/b/../a`
  - `root.xpath('//p/b')[0].xpath('./a')`
  - `root.xpath('//p/b')[0].xpath('../text()')`
  - `root.xpath('//p/b/..//a')[0].text`
- `@`: select attributes
  - `//@class`, `//p/@class`, `//p//@class`
  - `//p[@class]`, `//p[@class='s1']`, `//p[@class='s1']/@class`
- `/text()`, `string(.)`: select content
  - `//b/text()`, `//b//text()`
  - `string(.)`, `string(./description)`
- `[]`: predicates
  - `//p[1]`, `//p[last()]`, `//p[last()-1]`
  - `//p[position()<=2]`
  - `//p[@class]`, `//p[@class='s1']`
  - `//p[b]`, `//p[b/@class]`, `//p[b[@class='s1']]`
- `*`: wildcard, matches anything
  - `//p/*`, `//p//*`, `//p/*/a`
  - `//p[@*]`, `//*[@class='s1']`
- `|`: select several paths at once
  - `/p | //b`, `//p/a | //p/b[@class]`
- `and`, `or`, `not`:
  - `//a[@class='sister' and @id='link2']`, `//a[@class='sister'][@id='link2']`
  - `//a[@id='link1' or @class='outAstyle']`
  - `//a[not(@class='sister')]`
  - `//a[not(@class='sister') and @class or @id='link1']`
- Functions `xxx()`:
  - `starts-with()`: `//a[starts-with(@href,'http://example.com/')]`
  - `contains()`: `//a[contains(text(),'ie') and contains(@id,'link')]`
  - `text()`: `//b/text()`, `//b//text()`
  - `string(.)`: `data.xpath('//div[@class="name"]')[0].xpath('string(.)')`
- Axes `::` (a small sketch follows this list):
  - go self: `self::`, e.g. `//self::b`
  - go up: `ancestor::`, `ancestor-or-self::`, `parent::`, e.g. `//a/ancestor::p`
  - go down: `descendant::`, `child::`, e.g. `//p/descendant::a[not(@class)]`
  - go forward: `following::`, `following-sibling::`, e.g. `p[last()-1]/following::*`
  - go back: `preceding::`, `preceding-sibling::`, e.g. `p[2]/preceding::*`
  - get attributes: `attribute::`, e.g. `//a/attribute::*`, `//a/attribute::class`
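A small sketch of these axes with lxml (toy HTML, not from the original notes):

```python
from lxml import etree

html = etree.HTML('<p><b>bold</b><a id="x">link</a></p><p>tail</p>')
print(html.xpath('//a/ancestor::p')[0].tag)                     # p
print([e.tag for e in html.xpath('//p[1]/child::*')])           # ['b', 'a']
print([e.tag for e in html.xpath('//b/following-sibling::*')])  # ['a']
print(html.xpath('//a/attribute::id'))                          # ['x']
```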
- `lxml.etree._Element` members: `tag`, `attrib`, `text`, `.xpath('string(.)')`, `.get('attribute')`
Demo: Parsing HTML
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
content='''
<div>
<p class="title"><b class='bstyle'>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
; and they lived at the bottom of a well.
<p> hello ...<b><a> World </a></b> </p>
</p>
<p class="story">...<a class="outAstyle">Miss</a> </p>
</div>
'''
# html = etree.parse('./test.html',etree.HTMLParser())
html = etree.HTML(content)
print(html)
# <Element html at 0x1019312c8>
# result = etree.tostring(html)  # fills in missing/unclosed tags
# print(result.decode("utf-8"))
print(etree.tounicode(html))     # fills in missing/unclosed tags
# <html><body><div>
# <p class="title"><b class="bstyle">The Dormouse's story</b></p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
# and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
# ; and they lived at the bottom of a well.
# </p><p> hello ...<b><a> World </a></b> </p>
# <p class="story">...<a class="outAstyle">Miss</a> </p>
# </div>
# </body></html>
result = html.xpath("//p/b")
for i, r in enumerate(result):
    print(i, type(r), ":", r.tag, r.attrib, r.get('class'), r.text, r.xpath('string(.)'))
# 0 <class 'lxml.etree._Element'> : b {'class': 'bstyle'} bstyle The Dormouse's story The Dormouse's story
# 1 <class 'lxml.etree._Element'> : b {} None None World
###########################
# More tests: the test functions and helpers are defined below;
# the calls appear after the definitions.
def test_path_any(root):
    print("--- `//` ----")
    do_xpath(root, 'p')
    # []
    do_xpath(root, '//p')
    # [<Element p at 0x109f34148>, <Element p at 0x109f34188>, <Element p at 0x109f34248>, <Element p at 0x109f34288>]
    do_xpath(root, '//p/a/text()')
    # ['Elsie', 'Lacie', 'Tillie', 'Miss']
    do_xpath(root, '//p//a/text()')
    # ['Elsie', 'Lacie', 'Tillie', ' World ', 'Miss']
    do_xpath(root, './/a/text()')
    # ['Elsie', 'Lacie', 'Tillie', ' World ', 'Miss']
    print('--- `xpath` ---')
    print(root.xpath("//p/b//a"))
    # [<Element a at 0x10b555f08>]
    print(root.xpath("//p/b")[1].xpath("//a"))
    # [<Element a at 0x10b555f08>, <Element a at 0x10b5770c8>, <Element a at 0x10b577108>, <Element a at 0x10b577048>, <Element a at 0x10b577088>]
    print(root.xpath("//p/b")[1].xpath("./a"))
    # [<Element a at 0x10c719f48>]
    print(root.xpath("//p/b")[1].xpath("../text()"))
    # [' hello ...', ' ']
    print(root.xpath('//p/b/..//a')[0].text)
    #  World
    print('------------------------')
def test_path_attr(root):
    print("--- `@` ----")
    do_xpath(root, '/@class')
    # []
    do_xpath(root, '//@class')
    # ['title', 'bstyle', 'story', 'sister', 'sister', 'sister', 'story', 'outAstyle']
    do_xpath(root, '//p[@class]')
    # [<Element p at 0x10e4c3888>, <Element p at 0x10e4c36c8>, <Element p at 0x10e4c3708>]
    do_xpath(root, "//p[@class='story']")
    # [<Element p at 0x110ba8708>, <Element p at 0x110ba8548>]
    do_xpath(root, "//p/@class")
    # ['title', 'story', 'story']
    do_xpath(root, "//p[@class='story']/@class")
    # ['story', 'story']
    do_xpath(root, "//p[@class='story']//@class")
    # ['story', 'sister', 'sister', 'sister', 'story', 'outAstyle']
    print('------------------------')
def test_path_predicates(root):
    print("--- `[]` ----")
    do_xpath_detail(root, '//p[1]')
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    do_xpath_detail(root, '//p[last()]')
    # 0 : <p class="story">...<a class="outAstyle">Miss</a> </p>
    do_xpath_detail(root, '//p[last()-1]')
    # 0 : <p> hello ...<b><a> World </a></b> </p>
    do_xpath_detail(root, '//a[1]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a> World </a>
    # 2 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root, '//p/a[1]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root, '//a[position()<=2]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    # 2 : <a> World </a>
    # 3 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root, '//a[@class]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    # 2 : <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
    # 3 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root, '//a[@class="outAstyle"]')
    # 0 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root, '//p[b]')
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    # 1 : <p> hello ...<b><a> World </a></b> </p>
    do_xpath_detail(root, "//p[b/@class]")
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    do_xpath_detail(root, "//p[b[@class='bstyle']]")
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    print('------------------------')
def do_xpath(root, path):
    result = root.xpath(path)
    print("%s : \n%s" % (path, result))
    return result
def do_xpath_detail(root, path):
    result = root.xpath(path)
    print(path, ":")
    if type(result) == list and len(result) > 0:
        for i, r in enumerate(result):
            if type(r) == etree._Element:
                print(i, ":", etree.tounicode(r))
            else:
                print(i, ":", r)
    else:
        print(result)
    return result

# Run the tests now that everything above is defined:
test_path_any(html)
test_path_attr(html)
test_path_predicates(html)
```
Demo: Parsing XML
```python
from lxml import etree
content='''
<collection shelf="New Arrivals">
<movie title="Enemy Behind">
<type>War, Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
<type>Anime, Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>8</stars>
<description>A schientific fiction</description>
</movie>
</collection>
'''
root=etree.XML(content)
print(root)
print(etree.tounicode(root))
result = root.xpath('//movie')
for i, r in enumerate(result):
    print(i, r, ":", r.tag, r.attrib, r.get('title'))
    print("text:", r.text)
    print("string:", r.xpath('string(./description)'))
    print('rating:', r.xpath('./rating/text()'))
```
Document Parsing with JSONPath
- An information-extraction library for pulling specified data out of JSON documents; implementations exist in many languages, including JavaScript, Python, PHP, and Java
- JSONPath is to JSON what XPath is to XML; refer to "JSONPath - XPath for JSON"
- Python has two usable libraries:
  - `pip install jsonpath`, `import jsonpath`
  - `pip install jsonpath-rw`, `from jsonpath_rw import jsonpath, parse`; refer to its GitHub page

JSONPath operators
- `$`: the root node
- `@`: the current node
- `*`: wildcard, matches everything
- `..`: recursive descent (a quick example follows this list)
- `.`: child node
- `[]`: subscript / iterator (supports simple iteration, such as array indices or selecting by content)
  - `[start:end]`, `[start:end:step]`
  - `[,]`: multiple selections inside the iterator
- `()`: expression evaluation
- `?()`: filtering; the expression must evaluate to a boolean
JSON Conversion
`import json`
- Functions:
  - loads, load: JSON string / file -> Python object
  - dumps, dump: Python object -> JSON string / file
- Type mapping:

| JSON | Python |
|---|---|
| object | dict |
| array | list |
| string | str (unicode in Python 2) |
| number (int) | int (int/long in Python 2) |
| number (real) | float |
| true | True |
| false | False |
| null | None |
Example:
```python
import json
content='''
{"subjects":[
{"rate":"6.5","cover_x":1000,"title":"硬核","url":"https://movie.douban.com/subject/27109879/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2532653002.webp","id":"27109879","cover_y":1414,"is_new":false}
,{"rate":"7.1","cover_x":2000,"title":"奎迪:英雄再起","url":"https://movie.douban.com/subject/26707088/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp","id":"26707088","cover_y":2800,"is_new":false}
,{"rate":"6.1","cover_x":800,"title":"芳龄十六","url":"https://movie.douban.com/subject/30334122/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2549923514.webp","id":"30334122","cover_y":1185,"is_new":false}
,{"rate":"7.7","cover_x":1500,"title":"污垢","url":"https://movie.douban.com/subject/1945750/","playable":false,"cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp","id":"1945750","cover_y":2222,"is_new":false}
,{"rate":"6.8","cover_x":1179,"title":"欢乐满人间2","url":"https://movie.douban.com/subject/26611891/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2515404175.webp","id":"26611891","cover_y":1746,"is_new":false}
]}
'''
# 1. loads: string -> python obj
print('---- loads: --------------')
result=json.loads(content)
print(type(result)) # <class 'dict'>
print(result)
# 2. dumps: python obj -> string
print('---- dumps: --------------')
subjects=result.get('subjects')
result=json.dumps(subjects,ensure_ascii=False) # disable ASCII escaping; emit utf-8 text
print(type(result)) # <class 'str'>
print(result)
# 3. dump: python obj -> string -> file
print('---- dump: --------------')
json.dump(subjects,open('test.json','w'),ensure_ascii=False)
with open('test.json','r') as f:
    print(f.read())
# 4. load: file -> string -> python obj
print('---- load: --------------')
result=json.load(open('test.json','r'))
print(type(result)) # <class 'list'>
print(result)
print('-------------------------')
```
Demo: Parsing JSON with jsonpath
```python
import json
import jsonpath
content='''
{"subjects":[
{"rate":"6.5","cover_x":1000,"title":"硬核","url":"https://movie.douban.com/subject/27109879/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2532653002.webp","id":"27109879","cover_y":1414,"is_new":false}
,{"rate":"7.1","cover_x":2000,"title":"奎迪:英雄再起","url":"https://movie.douban.com/subject/26707088/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp","id":"26707088","cover_y":2800,"is_new":false}
,{"rate":"6.1","cover_x":800,"title":"芳龄十六","url":"https://movie.douban.com/subject/30334122/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2549923514.webp","id":"30334122","cover_y":1185,"is_new":false}
,{"rate":"7.7","cover_x":1500,"title":"污垢","url":"https://movie.douban.com/subject/1945750/","playable":false,"cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp","id":"1945750","cover_y":2222,"is_new":false}
,{"rate":"6.8","cover_x":1179,"title":"欢乐满人间2","url":"https://movie.douban.com/subject/26611891/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2515404175.webp","id":"26611891","cover_y":1746,"is_new":false}
]}
'''
# 0. load
obj=json.loads(content)
# 1. `[?()]`
results=jsonpath.jsonpath(obj,'$.subjects[?(float(@.rate)>=7)]')
print(type(results))
# <class 'list'>
print(results)
#[{'rate': '7.1', 'cover_x': 2000, 'title': '奎迪:英雄再起', 'url': 'https://movie.douban.com/subject/26707088/', 'playable': False, 'cover': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp', 'id': '26707088', 'cover_y': 2800, 'is_new': False}
# , {'rate': '7.7', 'cover_x': 1500, 'title': '污垢', 'url': 'https://movie.douban.com/subject/1945750/', 'playable': False, 'cover': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp', 'id': '1945750', 'cover_y': 2222, 'is_new': False}
# ]
# 2. `.xxx`
results=jsonpath.jsonpath(obj,'$.subjects[?(float(@.rate)>=7)].title')
print(results)
# ['奎迪:英雄再起', '污垢']
# 3. `[index1,index2]`
results=jsonpath.jsonpath(obj,'$.subjects[0,2,3].cover_x')
print(results)
# [1000, 800, 1500]
# 4. `[start:end]`
results=jsonpath.jsonpath(obj,'$.subjects[0:3].cover_x')
print(results)
# [1000, 2000, 800]
# 5. `[start:end:step]`
results=jsonpath.jsonpath(obj,'$.subjects[0:3:2].cover_x')
print(results)
# [1000, 800]
# 6. `?( && )`,`?(,)`
# cover_x cover_y
# 1000 1414
# 2000 2800
# 800 1185
# 1500 2222
# 1179 1746
results=jsonpath.jsonpath(obj,'$.subjects[?(@.cover_x>=1000 && @.cover_y<1500)]')
print(len(results))
# 1
results=jsonpath.jsonpath(obj,'$.subjects[?(@.cover_x>=1000,@.cover_y<1500)]')
print(len(results))
# 5
print('-------------------------')
```