Python Web Scraping Basics

Basic Concepts

The HTTP Protocol

Wikipedia: The Hypertext Transfer Protocol (HTTP) is a stateless application-level protocol for distributed, collaborative, hypertext information systems.

HyperText Transfer Protocol

  • A stateless, application-layer protocol built on the request/response model

  • Uses URLs to identify network resources, in the form: http://host[:port][path]

    • host: a valid Internet host name or IP address
    • port: port number; defaults to 80
    • path: path of the requested resource
  • Operations on resources (see the request sketch at the end of this section):

    • GET requests the resource at the URL
    • HEAD requests only the response headers of the resource at the URL
    • POST appends new data to the resource at the URL
    • PUT stores a resource at the URL, replacing whatever was there
    • PATCH partially updates the resource at the URL, changing only part of its content
    • DELETE deletes the resource stored at the URL
    • PATCH vs. PUT:
      • Suppose the URL holds a record UserInfo with 20 fields (UserID, UserName, ...), and the user changes only UserName
      • PATCH: submit just the UserName update to the URL
      • PUT: must submit all 20 fields; fields left out are deleted
      • The main benefit of PATCH: it saves network bandwidth
  • Response status codes

    • 2xx success
    • 3xx redirection
      • 300 Multiple Choices: several representations are available; the client may pick one or ignore them
      • 301 Moved Permanently: redirect
      • 302 Found: redirect
      • 304 Not Modified: the requested resource has not changed; discard
      • Note: some Python libraries (urllib2, requests, ...) handle redirects for you and follow them automatically
    • 4xx client error
      • 400 Bad Request: the request is malformed and cannot be understood by the server (bad parameters or path)
      • 401 Unauthorized: the request lacks authorization; this status code is used together with the WWW-Authenticate header (no permission to access)
      • 403 Forbidden: the server received the request but refuses to serve it (not logged in / IP banned / ...)
      • 404 Not Found: the requested resource does not exist
    • 5xx server error
      • 500 Internal Server Error: the server hit an unexpected error
      • 503 Service Unavailable: the server cannot handle the request right now; it may recover after a while
  • HTTP headers

    • Request headers
        Accept: text/plain
        Accept-Charset: utf-8
        Accept-Encoding: gzip,deflate
        Accept-Language: en-US
        Connection: keep-alive
        Content-Length: 348
        Content-Type: application/x-www-form-urlencoded
        User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0
        Cookie: $version=1; Skin=new;
        Date: ...
        Host: ...
        ....
      
    • Response headers
        Status: 200 OK
        Content-Type: text/plain;charset=utf-8
        Content-Encoding: gzip
        Content-Language: en-US
        Content-Length: 348
        Set-Cookie: UserID=xxx,Max-Age=3600;Version=1;...
        Location: ...
        Last-Modified: ...
        ...
      
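A minimal sketch of these pieces with the third-party requests library (httpbin.org is a public echo service, used here only as a placeholder URL):

    import requests  # pip install requests

    url = 'http://httpbin.org/get'

    # GET fetches the resource; a custom User-Agent travels in the request headers
    resp = requests.get(url, headers={'User-Agent': 'my-crawler/0.1'})
    print(resp.status_code)              # e.g. 200
    print(resp.headers['Content-Type'])  # a response header
    print(resp.request.headers)          # the request headers actually sent

    # HEAD fetches only the response headers, no body
    head = requests.head(url)
    print(head.headers)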

Depth-First vs. Breadth-First Crawling

        A
     /     \
    B       C
    /       \ 
D,E,F,G     X,Y,Z
|
H,I,J,K
  • Depth-first crawling (vertical)
    • stack (recursion, last in, first out)
    • A -> B -> D -> H -> I,J,K -> E,F,G -> C -> X,Y,Z
  • Breadth-first crawling (horizontal)
    • queue (first in, first out)
    • A -> B,C -> D,E,F,G ; X,Y,Z -> H,I,J,K
  • Strategy (see the traversal sketch after this list):
    • important pages tend to sit close to the seed site
    • a page may be reachable along many paths (the web is a graph)
    • breadth-first lends itself to parallel crawling with multiple crawlers
    • combine depth and breadth
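
A minimal sketch of the two traversal orders over the tree above; the adjacency dict stands in for real link extraction:

    from collections import deque

    # toy link graph for the tree above
    graph = {
        'A': ['B', 'C'], 'B': ['D', 'E', 'F', 'G'],
        'C': ['X', 'Y', 'Z'], 'D': ['H', 'I', 'J', 'K'],
    }

    def crawl(seed, breadth_first):
        frontier, seen, order = deque([seed]), set(), []
        while frontier:
            # FIFO (popleft) -> breadth-first; LIFO (pop) -> depth-first
            url = frontier.popleft() if breadth_first else frontier.pop()
            if url in seen:
                continue
            seen.add(url)
            order.append(url)
            children = graph.get(url, [])
            frontier.extend(children if breadth_first else reversed(children))
        return order

    print(crawl('A', breadth_first=False))
    # ['A', 'B', 'D', 'H', 'I', 'J', 'K', 'E', 'F', 'G', 'C', 'X', 'Y', 'Z']
    print(crawl('A', breadth_first=True))
    # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'X', 'Y', 'Z', 'H', 'I', 'J', 'K']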

Avoiding Duplicate Crawls

  • Record crawl history (URLs)
    • store in a database (slow)
    • use a HashSet (limited by memory)
  • Compress URLs where possible (see the Bloom-filter sketch after this list)
    • MD5/SHA-1 encode each URL into a fixed-length digest; still long, so the hash is usually reduced further with a modulus
    • BitMap: build a BitSet and hash each URL (possibly its MD5) onto one or more bits
    • BloomFilter: a BitMap with several hash functions
    • Note: some collisions are unavoidable
  • In practice:
    • estimate the number of pages on the site
    • pick hash algorithms and a space budget that keep the collision probability low
    • pick suitable storage structures and algorithms
  • Notes:
    • with few pages (the common case), no compression is needed
    • with many pages, compress URLs with a BloomFilter; the key is to compute the collision probability and size the storage accordingly
    • in a distributed system, the hash space can be partitioned across hosts
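
A toy sketch of the compression idea: hash each URL with MD5 and map it to k bit positions in a fixed-size bit array (a minimal Bloom filter; a real deployment would size the array from the estimated page count and an acceptable collision probability):

    import hashlib

    class TinyBloom:
        """Toy Bloom filter: k bit positions sliced out of one MD5 digest."""
        def __init__(self, size_bits=2**20, k=3):
            self.size = size_bits
            self.k = k
            self.bits = bytearray(size_bits // 8)

        def _positions(self, url):
            digest = hashlib.md5(url.encode('utf-8')).hexdigest()
            step = len(digest) // self.k
            # slice the 128-bit digest into k integers, then take a modulus
            return [int(digest[i*step:(i+1)*step], 16) % self.size
                    for i in range(self.k)]

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, url):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(url))

    seen = TinyBloom()
    seen.add('https://example.com/a')
    print('https://example.com/a' in seen)  # True
    print('https://example.com/b' in seen)  # False (with high probability)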

Limits on Web Crawlers

  • Source inspection: check the User-Agent field of incoming HTTP requests and respond only to browsers and friendly crawlers
  • Published policy: the Robots protocol, by which a site tells crawlers which pages may be crawled and which may not
    • a robots.txt file in the site root (Robots Exclusion Standard)
    • the Robots protocol is advisory, not binding; a crawler may ignore it, but at legal risk (a robotparser sketch follows the examples below)
    • Basic syntax
        # `*` means every crawler, `/` means the site root
        User-agent: *
        Disallow: /
      
    • eg: https://www.jd.com/robots.txt
        User-agent: *
        Disallow: /?*
        Disallow: /pop/*.html
        Disallow: /pinpai/*.html?*
        User-agent: EtaoSpider
        Disallow: /
        User-agent: HuihuiSpider
        Disallow: /
        User-agent: GwdangSpider
        Disallow: /
        User-agent: WochachaSpider
        Disallow: /
      
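The standard library can check these rules for you; a minimal sketch with urllib.robotparser, using the jd.com file above:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('https://www.jd.com/robots.txt')
    rp.read()  # fetch and parse robots.txt

    # can_fetch(useragent, url) applies the rules recorded for that user agent
    print(rp.can_fetch('*', 'https://www.jd.com/'))           # expected: True
    print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/'))  # expected: False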

Analyzing Site Structure

  • Use the information in the sitemap
  • Analyze the site's directory structure
  • Page parsers:
    • fuzzy matching:
      • regular expressions
    • structured parsing:
      • html.parser
      • BeautifulSoup
      • lxml
      • ...

Document Parsing with re

Regular Expressions

Regular Expression (regex)

  • A general framework for expressing a set of strings concisely
  • Key property: compact; one line says it all (the line is the feature, i.e. the pattern)
  • Uses (mainly string matching):
    • expressing features of classes of text (virus signatures, intrusion patterns, ...)
    • matching all or part of a string
    • finding or replacing a set of strings
    • ...
  • The syntax consists of characters and operators. Common operators:

    • Matching a single character

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | . | any single character | / |
      | [] | any character listed in the set | [abc] matches a, b, or c; [a-z] matches one character from a to z |
      | [^ ] | negated set: any single character not listed | [^abc] matches a single character other than a, b, or c |
      | \d | digit, equivalent to [0-9] | / |
      | \D | non-digit | / |
      | \w | word character, equivalent to [A-Za-z0-9_] | / |
      | \W | non-word character | / |
      | \s | whitespace (space, tab, ...) | / |
      | \S | non-whitespace | / |
    • Matching a quantity

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | * | 0 or more of the preceding character | abc* matches ab, abc, abcc, abccc, ... |
      | + | 1 or more of the preceding character | abc+ matches abc, abcc, abccc, ... |
      | ? | 0 or 1 of the preceding character | abc? matches ab or abc |
      | {m} | exactly m of the preceding character | ab{2}c matches abbc |
      | {m,} | at least m of the preceding character | ab{2,}c matches abbc, abbbc, abbbbc, ... |
      | {m,n} | m to n (inclusive) of the preceding character | ab{1,2}c matches abc and abbc |
    • Matching a boundary

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | ^ | start of string | ^abc: abc at the start of a string |
      | $ | end of string | abc$: abc at the end of a string |
      | \b | word boundary: the edge between a word and a non-word character, not the separator itself (words may be letters, CJK characters, or digits; non-word characters include punctuation, spaces, tabs, newlines) | given "a nice day" and "a niceday", \bnice\b matches the "nice" in "a nice day" |
      | \B | non-word-boundary | given "a nice day" and "a niceday", \bnice\B matches the "nice" in "a niceday" |
    • Matching groups

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | \| | either the left or the right expression | abc\|def matches abc or def |
      | ( ) | grouping; only \| may be used inside | (abc) matches abc; (abc\|def) matches abc or def |
      | \num | backreference to group num | <(\w*)><(\w*)>.*</\2></\1> matches <html><h1>hh</h1></html> but not <html><h1>hh</h1></abc> |
      | (?P<name>) | named group | <(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)> matches <html><h1>hh</h1></html> but not <html><h1>hh</h1></abc> |
      | (?P=name) | backreference to the group named name | / |
  • eg1:

    • a set of strings (infinitely many): 'PY', 'PYY', 'PYYY', 'PYYYY', ......, 'PYYYY......'
    • regex (a concise expression of the infinite set): PY+
  • eg2:
    • a set of strings: 'PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'
    • regex: P(Y|YT|YTH|YTHO)?N
  • eg3:
    • strings beginning with 'PY', followed by at most 10 characters, none of which is 'P' or 'Y' (e.g. 'PYABC' matches; 'PYKXYZ' does not)
    • regex: PY[^PY]{0,10}
  • eg4:

    • 'PN', 'PYN', 'PYYN', 'PYYYN'...
    • PY{0,3}N
  • Classic regex examples (checked in the sketch after this list):

    • ^[A-Za-z]+$ strings made up of the 26 letters
    • ^[A-Za-z0-9]+$ strings made up of letters and digits
    • [1-9]\d{5} 6-digit postal codes in mainland China
    • [\u4e00-\u9fa5] a Chinese character
    • \d{3}-\d{8}|\d{4}-\d{7} domestic (CN) phone numbers (e.g. 010-68913536)
    • (([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5]) an IP address (4 octets)
      • 0-99: [1-9]?\d
      • 100-199: 1\d{2}
      • 200-249: 2[0-4]\d
      • 250-255: 25[0-5]
      • looser simplified forms: \d+\.\d+\.\d+\.\d+ or \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
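
A quick check of a few of these patterns with the re module (re.fullmatch anchors the whole string):

    import re

    print(re.fullmatch(r'[A-Za-z]+', 'Python') is not None)   # True
    print(re.fullmatch(r'[1-9]\d{5}', '100081') is not None)  # True: postal code
    print(re.search(r'\d{3}-\d{8}|\d{4}-\d{7}', 'tel: 010-68913536').group())  # 010-68913536

    ip = r'(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'
    print(re.fullmatch(ip, '192.168.1.255') is not None)  # True
    print(re.fullmatch(ip, '192.168.1.256') is not None)  # False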

The re Library

  • Python's standard library for string matching
  • import re
  • Two ways to write a pattern:
    • raw string: r'text', e.g. r'[1-9]\d{5}', r'\d{3}-\d{8}|\d{4}-\d{7}'
    • plain string (more cumbersome), e.g. '[1-9]\\d{5}', '\\d{3}-\\d{8}|\\d{4}-\\d{7}'
    • Note: raw strings do not re-escape backslashes, so prefer a raw string whenever the pattern contains escapes
  • Functional usage: one-off operations

    • re.search(pattern, string, flags=0): find the first match anywhere; returns a match object
      • pattern: the regular expression (string / raw string)
      • string: the string to search
      • flags: control flags
        • re.I, re.IGNORECASE: case-insensitive matching
        • re.M, re.MULTILINE: ^ also matches at the start of each line
        • re.S, re.DOTALL: . matches any character, including newline (by default it matches everything except newline)
      • eg:
          import re
          match=re.search(r'[1-9]\d{5}','BIT100081 TSU100084')
          if match:
              print(match.group(0))   # 100081
        
    • re.match(pattern, string, flags=0): match only at the beginning of the string; returns a match object

      • same parameters as above
      • eg:

          match=re.match(r'[1-9]\d{5}','BIT100081 TSU100084')
          if match:
              print(match.group(0))   # no output: match is None (calling match.group(0) without the check raises AttributeError)
        
          match=re.match(r'[1-9]\d{5}','100081BIT TSU100084')
          if match:
              print(match.group(0))   # 100081
        
    • re.findall(pattern, string, flags=0): search; returns a list of all matching substrings
      • same parameters as above
      • eg:
          ls=re.findall(r'[1-9]\d{5}','BIT100081 TSU100084') # ['100081','100084']
        
    • re.finditer(pattern, string, flags=0): search; returns an iterator whose elements are match objects
      • same parameters as above
      • eg:
          for match in re.finditer(r'[1-9]\d{5}','BIT100081 TSU100084'):
              if match:
                  print(match.group(0))
          # 100081
          # 100084
        
    • re.split(pattern, string, maxsplit=0, flags=0): split; returns a list
      • maxsplit: maximum number of splits; the remainder is returned as the final element
      • eg:
          re.split(r'[1-9]\d{5}','BIT100081 TSU100084')   # ['BIT',' TSU',''] 
          re.split(r'[1-9]\d{3}','BIT100081 TSU100084')   # ['BIT', '81 TSU', '84']
          re.split(r'[1-9]\d{5}','BIT100081 TSU100084',maxsplit=1) # ['BIT',' TSU100084']
        
    • re.sub(pattern, repl, string, count=0, flags=0): replace every match; returns the resulting string
      • repl: the replacement string
      • string: the string to search
      • count: maximum number of replacements
      • eg:
          re.sub(r'[1-9]\d{5}',':zipcode','BIT100081 TSU100084') # 'BIT:zipcode TSU:zipcode'
        
  • Object-oriented usage: compile once, reuse many times

    • Step1: regex = re.compile(pattern, flags=0) compiles the pattern string into a pattern object (flags are fixed at compile time)
    • Step2:
      • regex.search(string[, pos[, endpos]])
      • regex.match(string[, pos[, endpos]])
      • regex.findall(string[, pos[, endpos]])
      • regex.finditer(string[, pos[, endpos]])
      • regex.split(string, maxsplit=0)
      • regex.sub(repl, string, count=0)
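    • eg: a minimal sketch of the compile-then-reuse pattern:

        import re

        zipcode = re.compile(r'[1-9]\d{5}')            # compile once
        print(zipcode.findall('BIT100081 TSU100084'))  # ['100081', '100084']
        print(zipcode.sub(':zipcode', 'BIT100081'))    # 'BIT:zipcode'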
  • Match objects:

    • the result of a single match, carrying detailed information
    • attributes:
      • .string: the text that was searched
      • .re: the pattern object used for the match
      • .pos: start position of the search
      • .endpos: end position of the search
    • methods:
      • .group(0): the matched substring
      • .start(): start index of the match in the original string
      • .end(): end index of the match in the original string
      • .span(): (.start(), .end())
    • eg:
        match=re.search(r'[1-9]\d{5}','BIT100081 TSU100084')
        # attributes:
        match.string    # 'BIT100081 TSU100084'
        match.re        # re.compile('[1-9]\\d{5}')
        match.pos       # 0
        match.endpos    # 19
        # methods:
        match.group(0)  # 100081
        match.start()   # 3
        match.end()     # 9
        match.span()    # (3,9)
      
  • Greedy matching (the default): returns the longest matching substring

      match = re.search(r'PY.*N', 'PYANBNCNDN')
      match.group(0)  # 'PYANBNCNDN'
    
  • Lazy (minimal) matching: returns the shortest match; append ? to the quantifier

      '''
      Any quantifier that can match variable lengths becomes lazy with a trailing ?
      `*?`: 0 or more of the preceding character, minimal
      `+?`: 1 or more of the preceding character, minimal
      `??`: 0 or 1 of the preceding character, minimal
      `{m,n}?`: m to n (inclusive) of the preceding character, minimal
      '''
      match = re.search(r'PY.*?N', 'PYANBNCNDN')
      match.group(0)  # 'PYAN'
    

Document Parsing with BeautifulSoup

A web-page parsing library; efficient; handles HTML, XML, and HTML5 documents, and can be configured with different parsers

Common parsers:

| Parser | Usage | Notes |
|--------|-------|-------|
| html.parser | BeautifulSoup(content,'html.parser') | Python standard library; moderate speed and fault tolerance; no external dependency |
| lxml | BeautifulSoup(content,'lxml'), BeautifulSoup(content,'xml') | third-party (pip install lxml); fast (partial traversal); supports XML; very fault-tolerant; depends on a C extension |
| html5lib | BeautifulSoup(content,'html5lib') | third-party (pip install html5lib); slow; parses the way a browser does and produces an HTML5 document; best fault tolerance; no external (C) dependency |
  1. Install (the library is imported as bs4 but installed under a different package name)

     pip install beautifulsoup4

  2. Create a BeautifulSoup object and parse the DOM tree structurally (HTML/XML <=> document tree <=> BeautifulSoup object)

     from bs4 import BeautifulSoup
    
     soup = BeautifulSoup("<html><body><p>data</p></body></html>",'html.parser')
     # omitting the parser also works, but emits a warning and picks the "best" installed parser:
     # soup = BeautifulSoup("<html><body><p>data</p></body></html>")
     print(soup.p)
    
     # pretty-print (inserts `\n` between the HTML markup and its content); also works on a tag: `<tag>.prettify()`
     print(soup.prettify())
    
  3. Accessing nodes

    | Element | Meaning | Access | Example |
    |---------|---------|--------|---------|
    | Tag | a tag | <tag> | soup.p |
    | Name | the tag's name, a string | <tag>.name | soup.p.name |
    | Attributes | the tag's attributes, organized as a dict | <tag>.attrs | soup.p.attrs, soup.p['attrname'] |
    | NavigableString | the non-attribute string inside a tag (the text between <>...</>) | <tag>.string | soup.p.string |
    | Comment | a comment inside a tag; a special kind of NavigableString | / | / |
    • test whether an attribute is set
      • has_attr("attrname")
    • get an attribute
      • .attrs["attrname"]
      • ["attrname"]
    • get the content
      • .text
      • .get_text()
    • .string vs .text

      • .string on a Tag object returns a NavigableString object.
      • .text gathers all child strings and returns their concatenation (with an optional separator).
      • sample:

        | Html | .string | .text |
        |------|---------|-------|
        | <td>some text</td> | some text | some text |
        | <td></td> | None | / |
        | <td><p>more text</p></td> | more text | more text |
        | <td>even <p>more text</p></td> | None (two child strings; .string cannot decide which) | even more text (.text concatenates both pieces) |
        | <td><!--This is comment--></td> | This is comment | / |
        | <td>even <!--This is comment--></td> | None | even |
  4. Navigating the Tree

    • Going down: children and descendants
      • .contents returns a list of direct children
      • .children returns a list_iterator over direct children
      • .descendants returns a generator over all descendants
    • Going up: parent and ancestors
      • .parent returns the node's parent
      • .parents returns a generator over all ancestors (ending at the BeautifulSoup object itself)
    • Going sideways: siblings
      • .next_sibling / .previous_sibling return the next/previous sibling in document order
      • .next_siblings / .previous_siblings return generators over all following/preceding siblings in document order
    • Going back and forth: document order, ignoring hierarchy
      • .next_element / .next_elements
      • .previous_element / .previous_elements
  5. Searching the Tree

    • Searching down (descendants):
      • find / find_all
    • Searching up (ancestors):
      • find_parent / find_parents
    • Searching sideways (siblings):
      • find_next_sibling / find_previous_sibling
      • find_next_siblings / find_previous_siblings
    • Searching back and forth (preceding/following nodes, ignoring hierarchy):
      • find_next / find_all_next
      • find_previous / find_all_previous
    • Notes:
      • method parameters are (name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs); regular expressions are accepted
        • name: tag name
        • attrs: tag attributes
        • recursive: whether to search all descendants; default True
        • text: content string
        • limit: maximum number of results
      • <tag>(..) is equivalent to <tag>.find_all(..)
      • soup(..) is equivalent to soup.find_all(..)
      • each result element is a bs4.element.Tag object
  6. Nodes can also be selected with CSS selectors: .select('...')

    • Basic selectors:
      • #id
      • tagName
      • .styleClass
    • Attribute filters:
      • [attribute]
      • [attribute=value]
      • [attribute!=value]
      • [attribute^=value]
      • [attribute$=value]
      • [attribute*=value]
    • Hierarchy selectors:
      • ancestor descendent
      • parent > child
      • prev + next (the immediately following sibling tag)
      • prev ~ siblings (all following sibling tags)
    • Element filters:
      • :not(selector)
      • :nth-of-type(index)
      • :nth-child(index)
      • :first-child
      • :last-child
      • :only-child
    • Content filters:
      • :contains(text)
      • :empty
      • :has(selector)
    • Form-state filters:
      • :enabled
      • :checked
      • :disabled
    • Combinations:
      • selector1, selector2, selectorN: the union of several selectors
      • [selector1][selector2][selectorN]: elements that satisfy all the attribute selectors at once
  7. Notes:

    • BeautifulSoup auto-detects the document encoding (via its Unicode, Dammit sub-library), converts everything to Unicode internally, and outputs UTF-8
    • get attribute values: .attrs, .attrs['xxx']
    • get content: .text, .get_text(), .string, .strings

Demo: Accessing Nodes

from bs4 import BeautifulSoup

content='''
<b>Chat with sb</b>
<a> This is title  <!-- Guess --> </a>
<i><!--This is comment--></i>
<div id="div1">
    <div id="div2">
        <p id="test" class="highlight">
            Hello <a>Tom</a>
            Nice to meet you <!-- This is a comment -->
        </p>
    </div>
</div>
'''
soup=BeautifulSoup(content,'html.parser')
  1. Tag name,attrs

     print("soup.p:",soup.p)
     # <p class="highlight" id="test1">
     #                 Hello <a>Tom</a>
     #                 Nice to meet you <!-- This is a comment -->
     # </p>
    
     print("soup.p.name:",soup.p.name)
     # p
    
     print("soup.p.attrs:",soup.p.attrs)
     # {'id': 'test1', 'class': ['highlight']}
    
     print("soup.p.attr['class']:",soup.p.attrs["class"])
     # ['highlight']
    
     print("soup.p.attrs['id']:",soup.p.attrs["id"])
     #  test1
    
     print("soup.p['class']:",soup.p["class"])
     #['highlight']
    
  2. Tag text/string

     print("soup.p.text:",soup.p.text)
     #
     #                Hello Tom
     #                Nice to meet you
     #
    
     print("soup.p.get_text():",soup.p.get_text())
     #
     #                Hello Tom
     #                Nice to meet you
     #
    
     print("type(soup.p.get_text()):",type(soup.p.get_text()))   # <class 'str'>
     print("-----------------------------------")
    
     print('--- Demo: Tag <p> string ---')
     print("soup.p.string:",soup.p.string)               # None
     print("type(soup.p.string)",type(soup.p.string))    # <class 'NoneType'>
     print("soup.p.strings:",soup.p.strings)             # <generator object Tag._all_strings at 0x00000000028FDD68>
     for i,s in enumerate(soup.p.strings):
         print(i,":",s)
     print("-----------------------------------")    
     # 0 :
     #                 Hello
     # 1 : Tom
     # 2 :
     #                 Nice to meet you
     # 3 :
    
     print('--- Demo: Tag <a> text/string ---')
     print("soup.a.text:",soup.a.text)                       # Chat with sb
     print("soup.a.string:",soup.a.string)                   # Chat with sb
     print("type(soup.a.string):",type(soup.a.string))       # <class 'bs4.element.NavigableString'>
     print("-----------------------------------")
    
     print('--- Demo: Tag <b> text/string ---')
     print("soup.b.text:",soup.b.text)                       # This is title
     print("soup.b.string:",soup.b.string)                   # None
     print("type(soup.b.string):",type(soup.b.string))       # <class 'NoneType'>
     print("-----------------------------------")
    
     print('--- Demo: Tag <i> text/string ---')
     print("soup.i.text:",soup.i.text)                       #
     print("soup.i.string:",soup.i.string)                   # This is comment
     print("type(soup.i.string):",type(soup.i.string))       # <class 'bs4.element.Comment'>
    

Demo: Navigating the Tree

from bs4 import BeautifulSoup
from bs4 import element

content='''
<b>Chat with sb</b>
<a> This is title  <!-- Guess --> </a>
<i><!--This is comment--></i>
<div id="div1">
    <div id="div2">
        <p id="test" class="highlight">
            Hello <a>Tom</a>
            Nice to meet you <!-- This is a comment -->
        </p>
    </div>
</div>
'''

soup=BeautifulSoup(content,'html.parser')
def print_result(result):
    if type(result)==element.Tag or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r)
    print('-------------------------')

def print_result_name(result):
    if type(result)==element.Tag or type(result)==element.NavigableString or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r.name)
    print('-------------------------')
  1. Going down:

    • .contents

        print(soup.p.contents)
        # <class 'list'>
      
        print_result(soup.p.contents)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 :
        #                 Nice to meet you
        # 3 :  This is a comment
        # 4 :
      
    • .children

        print(soup.p.children)          
        # <list_iterator object at 0x0000000001E742E8>
      
        print_result(soup.p.children)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 :
        #                 Nice to meet you
        # 3 :  This is a comment
        # 4 :
      
    • .descendants

        print('--- Demo: Tag <p> descendants ---')
        print(soup.p.descendants)
        # <generator object Tag.descendants at 0x00000000028ADD68>
      
        print_result(soup.p.descendants)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 : Tom
        # 3 :
        #                 Nice to meet you
        # 4 :  This is a comment
        # 5 :
      
  2. Going up:

    • .parent

        print(type(soup.p.parent))      
        # <class 'bs4.element.Tag'>
      
        print_result(soup.p.parent)
        # <div id="div2">
        # <p class="highlight" id="test1">
        #                 Hello <a>Tom</a>
        #                 Nice to meet you <!-- This is a comment -->
        # </p>
        # <p class="story" id="test2">Story1</p>
        # <p class="story" id="test3">Story2</p>
        # </div>
      
    • .parents

        print(soup.p.parents)           
        # <generator object PageElement.parents at 0x00000000028FDD68>
      
        print_result_name(soup.p.parents)
        # 0 : div
        # 1 : div
        # 2 : [document]
      
  3. Going sideway:

    • next_sibling
        print_result(soup.p.next_sibling)
        # 0 :
      
    • next_siblings

        print(soup.p.next_siblings)     
        # <generator object PageElement.next_siblings at 0x00000000028FDD68>
      
        print_result(soup.p.next_siblings)
        # 0 :
        # 
        # 1 : <p class="story" id="test2">Story1</p>
        # 2 :
        # 
        # 3 : <p class="story" id="test3">Story2</p>
        # 4 :
      
    • vs. find_next_siblings
        print('--- Demo: `find_next_siblings()` ---')
        result=soup.p.find_next_siblings()
        print_result(result)
        # 0 : <p class="story" id="test2">Story1</p>
        # 1 : <p class="story" id="test3">Story2</p>
      
  4. Going forth and back:

    • next_element
        print(soup.p.next_element)
        #
        # Hello
        print(type(soup.p.next_element))
        # <class 'bs4.element.NavigableString'>
      
    • next_elements

        print(soup.p.next_elements)     
        # <generator object PageElement.next_elements at 0x00000000028FDD68>
      
        print_result(soup.p.next_elements)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 : Tom
        # 3 :
        #                 Nice to meet you
        # 4 :  This is a comment
        # 5 :
        # 
        # 6 :
        # 
        # 7 : <p class="story" id="test2">Story1</p>
        # 8 : Story1
        # 9 :
        # 
        # 10 : <p class="story" id="test3">Story2</p>
        # 11 : Story2
        # 12 :
        # 
        # 13 :
        # 
        # 14 :
      
    • vs. find_all_next()
        result=soup.p.find_all_next()
        print_result(result)
        # 0 : <a>Tom</a>
        # 1 : <p class="story" id="test2">Story1</p>
        # 2 : <p class="story" id="test3">Story2</p>
      

Demo: Searching the Tree

from bs4 import BeautifulSoup
from bs4 import element
import re

content='''
<html><head><title>The Dormouse's story</title></head> <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''

soup=BeautifulSoup(content,'html.parser')
print(soup.prettify())

def print_result(result):
    if type(result)==element.Tag or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r)
    print('-------------------------')

def print_result_name(result):
    if type(result)==element.Tag or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r.name)
    print('-------------------------')
  1. Searching down

    • by name

        print('--- Demo: `find_all("a")` ---')
        result=soup.find_all('a')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: `find_all(["a","title"])` ---')
        result=soup.find_all(['a','title'])
        print_result(result)
        # 0 : <title>The Dormouse's story</title>
        # 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 2 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 3 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: `find_all(True)` ---')
        result=soup.find_all(True)
        print_result(result)
        # 0 : <html><head><title>The Dormouse's story</title></head> <body>
        # <p class="title"><b>The Dormouse's story</b></p>
        # <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # <p class="story">...</p>
        # </body>
        # </html>
        # 1 : <head><title>The Dormouse's story</title></head>
        # 2 : <title>The Dormouse's story</title>
        # 3 : <body>
        # <p class="title"><b>The Dormouse's story</b></p>
        # <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # <p class="story">...</p>
        # </body>
        # 4 : <p class="title"><b>The Dormouse's story</b></p>
        # 5 : <b>The Dormouse's story</b>
        # 6 : <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # 7 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 8 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 9 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        # 10 : <p class="story">...</p>
      
        print('--- Demo: `find_all(re.compile("b")` ---')
        result=soup.find_all(re.compile('b'))
        print_result(result)
        # 0 : <body>
        # <p class="title"><b>The Dormouse's story</b></p>
        # <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # <p class="story">...</p>
        # </body>
        # 1 : <b>The Dormouse's story</b>
      
    • by attrs

        print('--- Demo: find_all("p","story") ---')
        result=soup.find_all('p','story')
        print_result(result)
        # 0 : <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # 1 : <p class="story">...</p>
      
        print('--- Demo: find_all(id="link1") ---')
        result=soup.find_all(id='link1')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
        print('--- Demo: find_all(class_="sister") ---')
        result=soup.find_all(class_='sister')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: find_all(re.compile("link")) ---')
        result=soup.find_all(id=re.compile('link'))
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: find_all(attrs={"class":"story"}) ---')
        result=soup.find_all(attrs={'class':'story'})
        print_result(result)
        # 0 : <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # 1 : <p class="story">...</p>
      
    • by recursive

        print('--- Demo: find_all("a") ---')
        result=soup.find_all('a')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: find_all("a",recursive=False) ---')
        result=soup.find_all('a',recursive=False)
        print_result(result)
        # []
      
    • by string/text

        print('--- Demo: find_all(string="three") ---')
        result=soup.find_all(string='three')
        print_result(result)
        # []
      
        print('--- Demo: find_all(string=re.compile("e")) ---')
        result=soup.find_all(string=re.compile('e'))
        print_result(result)
        # 0 : The Dormouse's story
        # 1 : The Dormouse's story
        # 2 :
        # Once upon a time there were three little sisters; and their names were
        #
        # 3 : Elsie
        # 4 : Lacie
        # 5 : Tillie
        # 6 : ; and they lived at the bottom of a well.
      
    • by limit: find() is essentially find_all() with limit=1, returning the first match rather than a list
        print('--- Demo: find_all("a",limit=2) ---')
        result=soup.find_all('a',limit=2)
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
    • by self def function

        print('--- Demo: using `self def function` ---')
        def my_filter(tag):
            return tag.has_attr('id') and re.match('link',tag.get("id"))
      
        result=soup.find_all(my_filter)
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
  2. Searching up: find_parents

     print('--- Demo: link2.`find_parents()` ---')
     result=soup.find(id="link2").find_parents()
     print_result_name(result)
     # 0 : p
     # 1 : body
     # 2 : html
     # 3 : [document]
    
     print('--- Demo: link2.`find_parents("p")` ---')
     result=soup.find(id="link2").find_parents('p')
     print_result(result)
     # 0 : <p class="story">
     # Once upon a time there were three little sisters; and their names were
     # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
     # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
     # </p>
    
  3. Searching sideway: find_next_siblings

     print('--- Demo: `find_next_siblings()` ---')
     result=soup.find(id="link1").find_next_siblings()
     print_result(result)
     # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
     # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    
  4. Searching forth and back: find_all_next

     print('--- Demo: `find_all_next()` ---')
     result=soup.find(id="link1").find_all_next()
     print_result(result)
     # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
     # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
     # 2 : <p class="story">...</p>
    

Demo: CSS Selectors

from bs4 import BeautifulSoup
from bs4 import element

# print_result / print_result_name are the helpers from the previous demo

content='''
<html><head><title>The Dormouse's story</title></head> <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
<input type="text" disabled value="input something"/>
</body>
</html>
'''
soup=BeautifulSoup(content,'html.parser')
  1. Basic selectors

    • #id

        print('--- Demo: `select("#link1")` ---')
        result=soup.select("#link1")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
        print('--- Demo: `select("a#link1")` ---')
        result=soup.select("a#link2")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
    • tagName
        print('--- Demo: `select("title")` ---')
        result=soup.select("title")
        print_result(result)
        # 0 : <title>The Dormouse's story</title>
      
    • .styleClass
        print('--- Demo: `select(".sister")` ---')
        result=soup.select(".sister")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
  2. Attribute filters

    • [attribute]

        print('--- Demo: `select("a[href]")` ---')
        result=soup.select('a[href]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute=value]

        print('--- Demo: `select("[class=sister]")` ---')
        result=soup.select("[class=sister]")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute^=value]
        print('--- Demo: `select("a[href^="http://example.com/"]")` ---')
        result=soup.select('a[href^="http://example.com/"]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute$=value]
        print('--- Demo: `select("a[href$="tillie"])` ---')
        result=soup.select('a[href$="tillie"]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute*=value]
        print('--- Demo: `select("a[href*=".com/el"]")` ---')
        result=soup.select('a[href*=".com/el"]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • [selector1][selector2][selectorN]
        print("--- Demo: `[class='sister'][id=link2]` --- ")
        print_result(soup.select("[class=sister][id=link2]"))
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
  3. Hierarchy selectors

    • ancestor descendent
        print('--- Demo: `select("body a")` ---')
        result=soup.select("body a")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • parent > child

        print('--- Demo: `select("body > a") ---')
        result=soup.select("body > a")
        print_result(result)
        # []
      
        print('--- Demo: `select("p > a") ---')
        result=soup.select("p > a")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: `select("p > a:nth-of-type(2)")` ---')
        result=soup.select("p > a:nth-of-type(2)")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
        print('--- Demo: `select("p > #link1")` ---')
        result=soup.select("p > #link1")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • prev ~ siblings : all following sibling tags
        print('--- Demo: `select("#link1 ~ .sister")` ---')
        result=soup.select("#link1 ~ .sister")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • prev + next : the immediately following sibling tag
        print('--- Demo: `select("#link1 + .sister")` ---')
        result=soup.select("#link1 + .sister")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
  4. Element filters

    • :not(selector)
        print("--- Demo: `:not(.story)` --- ")
        print_result(soup.select("p:not(.story)"))
        # 0 : <p class="title"><b>The Dormouse's story</b></p>
      
    • :nth-of-type(index)
        print('--- Demo: `select("p:nth-of-type(3)")` ---')
        result=soup.select("p:nth-of-type(3)")
        print_result(result)
        # 0 : <p class="story">...</p>
      
    • :nth-child(index)
        print("--- Demo: `p > :nth-child(1)` --- ")
        print_result(soup.select("p > :nth-child(1)"))
        # 0 : <b>The Dormouse's story</b>
        # 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • :first-child
        print("--- Demo: `p > :first-child` --- ")
        print_result(soup.select("p > :first-child"))
        # 0 : <b>The Dormouse's story</b>
        # 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • :last-child
        print("--- Demo: `p > :last-child` --- ")
        print_result(soup.select("p > :last-child"))
        # 0 : <b>The Dormouse's story</b>
        # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • :only-child
        print("--- Demo: `p > :only-child` --- ")
        print_result(soup.select("p > :only-child"))
        # 0 : <b>The Dormouse's story</b>
      
  5. Content filters

    • :contains(text)
        print("--- Demo: `p:contains(story)` --- ")
        print_result(soup.select("p:contains(story)"))
        # 0 : <p class="title"><b>The Dormouse's story</b></p>
      
    • :empty
        print("--- Demo: `p:empty` --- ")
        print_result(soup.select("p:empty"))
        # []
      
    • :has(selector)
        print("--- Demo: `p:has(b)` --- ")
        print_result(soup.select("p:has(b)"))
        # 0 : <p class="title"><b>The Dormouse's story</b></p>
      
  6. Form-state filters

    • :enabled,:disabled,:checked
        print("--- Demo: `:disabled`` --- ")
        print_result(soup.select(":disabled"))
        # 0 : <input disabled="" type="text" value="input something"/>
      
  7. Others:

    • selector1, selector2, selectorN
        print('--- Demo: `select("#link1,#link2")` ---')
        result=soup.select("#link1,#link2")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
    • select_one()
        print('--- Demo: `select_one(".sister")` ---')
        result=soup.select_one(".sister")
        print_result(result)
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
  8. get attribute value:

     print('--- Demo: `get attribute value` ---')
     result=soup.select(".sister")
    
     print_result(result)
     # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
     # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
     # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    
     print(result[0].get_text())
     #Elsie
    
     print(result[0].attrs)
     #{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
    
     print(result[0].attrs['id'])
     #link1
    

Document Parsing with XPath

  • Uses path expressions to select nodes or node-sets from an XML/HTML document
  • Install: pip install lxml
  • Import: from lxml import etree
  • Note: like re, lxml is implemented in C; it is a high-performance HTML/XML parser for Python, and XPath expressions locate specific elements and node information quickly
  • Tools: the XPath Helper extension for Chrome quickly derives matching rules for page elements

Path Expressions

  • // : selects matching nodes anywhere below the current node, regardless of position

    • //p (.//p)
    • /p//a
    • //p/a
  • / : selects from the root node

    • /p (./p)
    • /p/a
  • . is the current node; .. is its parent

    • ./p
    • ../p
    • //p/b/../a
    • root.xpath('//p/b').xpath('./a')
    • root.xpath('//p/b').xpath('../text()')
    • root.xpath('//p/b/..//a')[0].text
  • @ : selects attributes

    • //@class
    • //p/@class
    • //p//@class
    • //p[@class]
    • //p[@class='s1']
    • //p[@class='s1']/@class
  • /text(), string(.) : select text content

    • "//b/text()"
    • //b//text()
    • string(.)
    • string(./description)
  • []: Predicates

    • //p[1],//p[last()],//p[last()-1]
    • //p[position()<=2]
    • //p[@class],//p[@class='s1']
    • //p[b],//p[b/@class],//p[b[@class='s1']]
  • * : wildcard, matches anything

    • //p/*
    • //p//*
    • //p/*/a
    • //p[@*]
    • //*[@class='s1']
  • | : combines several paths

    • /p | //b
    • //p/a | //p/b[@class]
  • and,or,not:

    • //a[@class='sister' and @id='link2'],//a[@class='sister'][@id='link2']
    • //a[@id='link1' or @class='outAstyle']
    • //a[not(@class='sister')]
    • //a[not(@class='sister') and @class or @id='link1']
  • functions xxx():

    • starts-with(): //a[starts-with(@href,'http://example.com/')]
    • contains(): //a[contains(text(),'ie') and contains(@id,'link')]
    • text(): //b/text(),//b//text()
    • string(.): data.xpath('//div[@class="name"]')[0].xpath('string(.)')
  • :: (axes; see the sketch at the end of this section)

    • go self: self::,eg: //self::b
    • go up: ancestor:: , ancestor-or-self::,parent::,eg: //a/ancestor::p
    • go down: descendant::,child::,eg: //p/descendant::a[not(@class)]
    • go forward: following::,following-sibling::, eg: p[last()-1]/following::*
    • go back: preceding::,preceding-sibling:: , eg: p[2]/preceding::*
    • get attributes: attribute::,eg: //a/attribute::*,//a/attribute::class
  • lxml.etree._Element:

    • tag
    • attrib
    • text
    • .xpath('string(.)')
    • .get('attribute')
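
A small self-contained sketch of the axes; the one-line HTML string is made up for illustration:

    from lxml import etree

    root = etree.HTML("<div><p class='s1'><b><a href='#'>x</a></b></p><p>y</p></div>")

    print(root.xpath('//a/ancestor::p/@class'))              # ['s1']
    print(root.xpath('//p/descendant::a/@href'))             # ['#']
    print(root.xpath('//p[1]/following-sibling::p/text()'))  # ['y']
    print(root.xpath('//a/attribute::*'))                    # ['#']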

Demo: Parsing HTML

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree

content='''
<div>
    <p class="title"><b class='bstyle'>The Dormouse's story</b></p>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        ; and they lived at the bottom of a well.
        <p> hello ...<b><a> World </a></b> </p>
    </p>
    <p class="story">...<a class="outAstyle">Miss</a> </p>
</div>
'''

# html = etree.parse('./test.html',etree.HTMLParser())
html = etree.HTML(content)
print(html)
# <Element html at 0x1019312c8>

# result = etree.tostring(html)     # fills in missing/unclosed tags
# print(result.decode("utf-8"))
print(etree.tounicode(html))        # fills in missing/unclosed tags
# <html><body><div>
#   <p class="title"><b class="bstyle">The Dormouse's story</b></p>
#   <p class="story">
#       Once upon a time there were three little sisters; and their names were
#       <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
#       <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
#       and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
#       ; and they lived at the bottom of a well.
#       </p><p> hello ...<b><a> World </a></b> </p>
#   <p class="story">...<a class="outAstyle">Miss</a> </p>
# </div>
# </body></html>

result=html.xpath("//p/b")
for i,r in enumerate(result):
    print(i,type(r),":",r.tag,r.attrib,r.get('class'),r.text,r.xpath('string(.)'))
# 0 <class 'lxml.etree._Element'> : b {'class': 'bstyle'} bstyle The Dormouse's story The Dormouse's story
# 1 <class 'lxml.etree._Element'> : b {} None None  World

###########################
# More tests: the helper functions are defined below; they are invoked at
# the end of the script so every name is defined before use.


def test_path_any(root):
    print("--- `//` ----")
    do_xpath(root,'p')
    # []
    do_xpath(root,'//p')
    # [<Element p at 0x109f34148>, <Element p at 0x109f34188>, <Element p at 0x109f34248>, <Element p at 0x109f34288>]
    do_xpath(root,'//p/a/text()')
    # ['Elsie', 'Lacie', 'Tillie', 'Miss']
    do_xpath(root,'//p//a/text()')
    # ['Elsie', 'Lacie', 'Tillie', ' World ', 'Miss']
    do_xpath(root,'.//a/text()')
    # ['Elsie', 'Lacie', 'Tillie', ' World ', 'Miss']

    print('--- `xpath` ---')
    print(root.xpath("//p/b//a"))
    # [<Element a at 0x10b555f08>]
    print(root.xpath("//p/b")[1].xpath("//a"))
    # [<Element a at 0x10b555f08>, <Element a at 0x10b5770c8>, <Element a at 0x10b577108>, <Element a at 0x10b577048>, <Element a at 0x10b577088>]
    print(root.xpath("//p/b")[1].xpath("./a"))
    # [<Element a at 0x10c719f48>]
    print(root.xpath("//p/b")[1].xpath("../text()"))
    # [' hello ...', ' ']
    print(root.xpath('//p/b/..//a')[0].text)
    # World
    print('------------------------')

def test_path_attr(root):
    print("--- `@` ----")
    do_xpath(root,'/@class')
    # []
    do_xpath(root,'//@class')
    # ['title', 'bstyle', 'story', 'sister', 'sister', 'sister', 'story', 'outAstyle']

    do_xpath(root,'//p[@class]')
    # [<Element p at 0x10e4c3888>, <Element p at 0x10e4c36c8>, <Element p at 0x10e4c3708>]
    do_xpath(root,"//p[@class='story']")
    # [<Element p at 0x110ba8708>, <Element p at 0x110ba8548>]

    do_xpath(root,"//p/@class")
    # ['title', 'story', 'story']
    do_xpath(root,"//p[@class='story']/@class")
    # ['story', 'story']
    do_xpath(root,"//p[@class='story']//@class")
    # ['story', 'sister', 'sister', 'sister', 'story', 'outAstyle']
    print('------------------------')

def test_path_predicates(root):
    print("--- `[]` ----")
    do_xpath_detail(root,'//p[1]')
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    do_xpath_detail(root,'//p[last()]')
    # 0 : <p class="story">...<a class="outAstyle">Miss</a> </p>
    do_xpath_detail(root,'//p[last()-1]')
    # 0 : <p> hello ...<b><a> World </a></b> </p>

    do_xpath_detail(root,'//a[1]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a> World </a>
    # 2 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root,'//p/a[1]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root,'//a[position()<=2]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    # 2 : <a> World </a>
    # 3 : <a class="outAstyle">Miss</a>

    do_xpath_detail(root,'//a[@class]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    # 2 : <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
    # 3 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root,'//a[@class="outAstyle"]')
    # 0 : <a class="outAstyle">Miss</a>

    do_xpath_detail(root,'//p[b]')
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    # 1 : <p> hello ...<b><a> World </a></b> </p>
    do_xpath_detail(root,"//p[b/@class]")
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    do_xpath_detail(root,"//p[b[@class='bstyle']]")
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    print('------------------------')

def do_xpath(root,path):
    result=root.xpath(path)
    print("%s : \n%s" % (path,result))
    return result

def do_xpath_detail(root,path):
    result=root.xpath(path)
    print(path,":")
    if type(result)==list and len(result)>0:
        for i,r in enumerate(result):
            if type(r)==etree._Element:
                print(i,":",etree.tounicode(r))
            else:
                print(i,":",r)
    else:
        print(result)
    return result

test_path_any(html)
test_path_attr(html)
test_path_predicates(html)

Demo: Parsing XML

from lxml import etree

content='''
<collection shelf="New Arrivals">
    <movie title="Enemy Behind">
       <type>War, Thriller</type>
       <format>DVD</format>
       <year>2003</year>
       <rating>PG</rating>
       <stars>10</stars>
       <description>Talk about a US-Japan war</description>
    </movie>
    <movie title="Transformers">
       <type>Anime, Science Fiction</type>
       <format>DVD</format>
       <year>1989</year>
       <rating>R</rating>
       <stars>8</stars>
       <description>A scientific fiction</description>
    </movie>
</collection>
'''
root=etree.XML(content)
print(root)
print(etree.tounicode(root))

result=root.xpath('//movie')
for i,r in enumerate(result):
    print(i,r,":",r.tag,r.attrib,r.get('title'))
    print("text:",r.text)
    print("string:",r.xpath('string(./description)'))
    print('rating:',r.xpath('./rating/text()'))

Document Parsing with JSONPath

  • An information-extraction library for pulling specified data out of JSON documents, with implementations in several languages: JavaScript, Python, PHP, Java
  • JSONPath is to JSON what XPath is to XML. Refer: JSONPath - XPath for JSON
  • Two Python libraries are available:
    • pip install jsonpath, import jsonpath
    • pip install jsonpath-rw, from jsonpath_rw import jsonpath,parse. Refer: Github

JSONPath Operators

  • $: the root node
  • @: the current node
  • *: wildcard, matches everything
  • ..: recursive descent
  • . : child node
  • []: child access / iteration (supports simple operations such as array indexing and selecting by content)
    • [start:end], [start:end:step]
    • [,] selects several items at once
  • (): expression evaluation
    • ?(): filter; the expression must evaluate to a boolean (see the sketch below)
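
A tiny sketch of $, .., and * with the jsonpath package (the store dict is made up for illustration):

    import jsonpath  # pip install jsonpath

    store = {'book': [{'title': 'A', 'price': 8}, {'title': 'B', 'price': 23}]}

    print(jsonpath.jsonpath(store, '$.book[*].title'))  # ['A', 'B']
    print(jsonpath.jsonpath(store, '$..price'))         # [8, 23]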

JSON Conversion

  • import json
  • functions:
    • loads, load: JSON string -> Python object
    • dumps, dump: Python object -> JSON string
  • type mapping:

    | JSON | Python |
    |------|--------|
    | object | dict |
    | array | list |
    | string | str |
    | number (int) | int |
    | number (real) | float |
    | true | True |
    | false | False |
    | null | None |

Example:

import json

content='''
{"subjects":[
    {"rate":"6.5","cover_x":1000,"title":"硬核","url":"https://movie.douban.com/subject/27109879/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2532653002.webp","id":"27109879","cover_y":1414,"is_new":false}
    ,{"rate":"7.1","cover_x":2000,"title":"奎迪:英雄再起","url":"https://movie.douban.com/subject/26707088/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp","id":"26707088","cover_y":2800,"is_new":false}
    ,{"rate":"6.1","cover_x":800,"title":"芳龄十六","url":"https://movie.douban.com/subject/30334122/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2549923514.webp","id":"30334122","cover_y":1185,"is_new":false}
    ,{"rate":"7.7","cover_x":1500,"title":"污垢","url":"https://movie.douban.com/subject/1945750/","playable":false,"cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp","id":"1945750","cover_y":2222,"is_new":false}
    ,{"rate":"6.8","cover_x":1179,"title":"欢乐满人间2","url":"https://movie.douban.com/subject/26611891/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2515404175.webp","id":"26611891","cover_y":1746,"is_new":false}
]}
'''

# 1. loads: string -> python obj
print('---- loads: --------------')
result=json.loads(content)
print(type(result))                             # <class 'dict'>
print(result)

# 2. dumps: python obj -> string
print('---- dumps: --------------')
subjects=result.get('subjects')
result=json.dumps(subjects,ensure_ascii=False)  # disable ASCII escaping; emit UTF-8 characters as-is
print(type(result))                             # <class 'str'>
print(result)

# 3. dump: python obj -> string -> file
print('---- dump: --------------')
json.dump(subjects,open('test.json','w'),ensure_ascii=False)
with open('test.json','r') as f:
    print(f.read())

# 4. load: file -> string -> python obj
print('---- load: --------------')
result=json.load(open('test.json','r'))
print(type(result))                             # <class 'list'>
print(result)

print('-------------------------')

Demo: Parsing JSON with JSONPath

import json
import jsonpath

content='''
{"subjects":[
    {"rate":"6.5","cover_x":1000,"title":"硬核","url":"https://movie.douban.com/subject/27109879/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2532653002.webp","id":"27109879","cover_y":1414,"is_new":false}
    ,{"rate":"7.1","cover_x":2000,"title":"奎迪:英雄再起","url":"https://movie.douban.com/subject/26707088/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp","id":"26707088","cover_y":2800,"is_new":false}
    ,{"rate":"6.1","cover_x":800,"title":"芳龄十六","url":"https://movie.douban.com/subject/30334122/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2549923514.webp","id":"30334122","cover_y":1185,"is_new":false}
    ,{"rate":"7.7","cover_x":1500,"title":"污垢","url":"https://movie.douban.com/subject/1945750/","playable":false,"cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp","id":"1945750","cover_y":2222,"is_new":false}
    ,{"rate":"6.8","cover_x":1179,"title":"欢乐满人间2","url":"https://movie.douban.com/subject/26611891/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2515404175.webp","id":"26611891","cover_y":1746,"is_new":false}
]}
'''

# 0. load the JSON string
obj=json.loads(content)

# 1. `[?()]`
results=jsonpath.jsonpath(obj,'$.subjects[?(float(@.rate)>=7)]')
print(type(results))
# <class 'list'>    
print(results)
#[{'rate': '7.1', 'cover_x': 2000, 'title': '奎迪:英雄再起', 'url': 'https://movie.douban.com/subject/26707088/', 'playable': False, 'cover': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp', 'id': '26707088', 'cover_y': 2800, 'is_new': False}
# , {'rate': '7.7', 'cover_x': 1500, 'title': '污垢', 'url': 'https://movie.douban.com/subject/1945750/', 'playable': False, 'cover': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp', 'id': '1945750', 'cover_y': 2222, 'is_new': False}
# ]

# 2. `.xxx`
results=jsonpath.jsonpath(obj,'$.subjects[?(float(@.rate)>=7)].title')
print(results)
# ['奎迪:英雄再起', '污垢']

# 3. `[index1,index2]`
results=jsonpath.jsonpath(obj,'$.subjects[0,2,3].cover_x')
print(results)
# [1000, 800, 1500]

# 4. `[start:end]`
results=jsonpath.jsonpath(obj,'$.subjects[0:3].cover_x')
print(results)
# [1000, 2000, 800]

# 5. `[start:end:step]`
results=jsonpath.jsonpath(obj,'$.subjects[0:3:2].cover_x')
print(results)
# [1000, 800]

# 6. `?( && )`,`?(,)`
# cover_x   cover_y
# 1000      1414
# 2000      2800
# 800       1185
# 1500      2222
# 1179      1746
results=jsonpath.jsonpath(obj,'$.subjects[?(@.cover_x>=1000 && @.cover_y<1500)]')
print(len(results))
# 1
results=jsonpath.jsonpath(obj,'$.subjects[?(@.cover_x>=1000,@.cover_y<1500)]')
print(len(results))
# 5
print('-------------------------')

Reference