Python Web Scraping Basics

Basic Concepts

The HTTP Protocol

Wikipedia: The Hypertext Transfer Protocol (HTTP) is a stateless application-level protocol for distributed, collaborative, hypertext information systems.

HyperText Transfer Protocol

  • A stateless, application-layer protocol built on the request/response model

  • Uses URLs to identify network resources, in the form: http://host[:port][path]

    • host: a valid Internet host name or IP address
    • port: port number; defaults to 80
    • path: path of the requested resource
  • Operations on resources (see the request sketch at the end of this section):

    • GET requests the resource at the URL
    • HEAD requests only the response headers of the resource at the URL
    • POST appends new data to the resource at the URL
    • PUT stores a resource at the URL, replacing whatever was there
    • PATCH partially updates the resource at the URL, changing only part of its content
    • DELETE deletes the resource stored at the URL
    • PATCH vs. PUT:
      • Suppose the URL holds a record UserInfo with 20 fields (UserID, UserName, ...), and the user changes only UserName
      • PATCH: submit just the UserName update to the URL
      • PUT: must submit all 20 fields; fields left out are deleted
      • The main benefit of PATCH: it saves network bandwidth
  • Response status codes

    • 2xx success
    • 3xx redirection
      • 300 Multiple Choices: several representations are available; the client may pick one or ignore them
      • 301 Moved Permanently: redirect
      • 302 Found: redirect
      • 304 Not Modified: the requested resource has not changed; discard
      • Note: some Python libraries (urllib2, requests, ...) handle redirects for you and follow them automatically
    • 4xx client error
      • 400 Bad Request: the request is malformed and cannot be understood by the server (bad parameters or path)
      • 401 Unauthorized: the request lacks authorization; this status code is used together with the WWW-Authenticate header (no permission to access)
      • 403 Forbidden: the server received the request but refuses to serve it (not logged in / IP banned / ...)
      • 404 Not Found: the requested resource does not exist
    • 5xx server error
      • 500 Internal Server Error: the server hit an unexpected error
      • 503 Service Unavailable: the server cannot handle the request right now; it may recover after a while
  • HTTP headers

    • Request headers
        Accept: text/plain
        Accept-Charset: utf-8
        Accept-Encoding: gzip,deflate
        Accept-Language: en-US
        Connection: keep-alive
        Content-Length: 348
        Content-Type: application/x-www-form-urlencoded
        User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0
        Cookie: $version=1; Skin=new;
        Date: ...
        Host: ...
        ....
      
    • Response headers
        Status: 200 OK
        Content-Type: text/plain;charset=utf-8
        Content-Encoding: gzip
        Content-Language: en-US
        Content-Length: 348
        Set-Cookie: UserID=xxx,Max-Age=3600;Version=1;...
        Location: ...
        Last-Modified: ...
        ...
      
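A minimal sketch of these pieces with the third-party requests library (httpbin.org is a public echo service, used here only as a placeholder URL):

    import requests  # pip install requests

    url = 'http://httpbin.org/get'

    # GET fetches the resource; a custom User-Agent travels in the request headers
    resp = requests.get(url, headers={'User-Agent': 'my-crawler/0.1'})
    print(resp.status_code)              # e.g. 200
    print(resp.headers['Content-Type'])  # a response header
    print(resp.request.headers)          # the request headers actually sent

    # HEAD fetches only the response headers, no body
    head = requests.head(url)
    print(head.headers)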

Depth-First vs. Breadth-First Crawling

        A
     /     \
    B       C
    /       \ 
D,E,F,G     X,Y,Z
|
H,I,J,K
  • Depth-first crawling (vertical)
    • stack (recursion, last in, first out)
    • A -> B -> D -> H -> I,J,K -> E,F,G -> C -> X,Y,Z
  • Breadth-first crawling (horizontal)
    • queue (first in, first out)
    • A -> B,C -> D,E,F,G ; X,Y,Z -> H,I,J,K
  • Strategy (see the traversal sketch after this list):
    • important pages tend to sit close to the seed site
    • a page may be reachable along many paths (the web is a graph)
    • breadth-first lends itself to parallel crawling with multiple crawlers
    • combine depth and breadth
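
A minimal sketch of the two traversal orders over the tree above; the adjacency dict stands in for real link extraction:

    from collections import deque

    # toy link graph for the tree above
    graph = {
        'A': ['B', 'C'], 'B': ['D', 'E', 'F', 'G'],
        'C': ['X', 'Y', 'Z'], 'D': ['H', 'I', 'J', 'K'],
    }

    def crawl(seed, breadth_first):
        frontier, seen, order = deque([seed]), set(), []
        while frontier:
            # FIFO (popleft) -> breadth-first; LIFO (pop) -> depth-first
            url = frontier.popleft() if breadth_first else frontier.pop()
            if url in seen:
                continue
            seen.add(url)
            order.append(url)
            children = graph.get(url, [])
            frontier.extend(children if breadth_first else reversed(children))
        return order

    print(crawl('A', breadth_first=False))
    # ['A', 'B', 'D', 'H', 'I', 'J', 'K', 'E', 'F', 'G', 'C', 'X', 'Y', 'Z']
    print(crawl('A', breadth_first=True))
    # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'X', 'Y', 'Z', 'H', 'I', 'J', 'K']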

Avoiding Duplicate Crawls

  • Record crawl history (URLs)
    • store in a database (slow)
    • use a HashSet (limited by memory)
  • Compress URLs where possible (see the Bloom-filter sketch after this list)
    • MD5/SHA-1 encode each URL into a fixed-length digest; still long, so the hash is usually reduced further with a modulus
    • BitMap: build a BitSet and hash each URL (possibly its MD5) onto one or more bits
    • BloomFilter: a BitMap with several hash functions
    • Note: some collisions are unavoidable
  • In practice:
    • estimate the number of pages on the site
    • pick hash algorithms and a space budget that keep the collision probability low
    • pick suitable storage structures and algorithms
  • Notes:
    • with few pages (the common case), no compression is needed
    • with many pages, compress URLs with a BloomFilter; the key is to compute the collision probability and size the storage accordingly
    • in a distributed system, the hash space can be partitioned across hosts
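
A toy sketch of the compression idea: hash each URL with MD5 and map it to k bit positions in a fixed-size bit array (a minimal Bloom filter; a real deployment would size the array from the estimated page count and an acceptable collision probability):

    import hashlib

    class TinyBloom:
        """Toy Bloom filter: k bit positions sliced out of one MD5 digest."""
        def __init__(self, size_bits=2**20, k=3):
            self.size = size_bits
            self.k = k
            self.bits = bytearray(size_bits // 8)

        def _positions(self, url):
            digest = hashlib.md5(url.encode('utf-8')).hexdigest()
            step = len(digest) // self.k
            # slice the 128-bit digest into k integers, then take a modulus
            return [int(digest[i*step:(i+1)*step], 16) % self.size
                    for i in range(self.k)]

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, url):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(url))

    seen = TinyBloom()
    seen.add('https://example.com/a')
    print('https://example.com/a' in seen)  # True
    print('https://example.com/b' in seen)  # False (with high probability)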

Limits on Web Crawlers

  • Source inspection: check the User-Agent field of incoming HTTP requests and respond only to browsers and friendly crawlers
  • Published policy: the Robots protocol, by which a site tells crawlers which pages may be crawled and which may not
    • a robots.txt file in the site root (Robots Exclusion Standard)
    • the Robots protocol is advisory, not binding; a crawler may ignore it, but at legal risk (a robotparser sketch follows the examples below)
    • Basic syntax
        # `*` means every crawler, `/` means the site root
        User-agent: *
        Disallow: /
      
    • eg: https://www.jd.com/robots.txt
        User-agent: *
        Disallow: /?*
        Disallow: /pop/*.html
        Disallow: /pinpai/*.html?*
        User-agent: EtaoSpider
        Disallow: /
        User-agent: HuihuiSpider
        Disallow: /
        User-agent: GwdangSpider
        Disallow: /
        User-agent: WochachaSpider
        Disallow: /
      
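The standard library can check these rules for you; a minimal sketch with urllib.robotparser, using the jd.com file above:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('https://www.jd.com/robots.txt')
    rp.read()  # fetch and parse robots.txt

    # can_fetch(useragent, url) applies the rules recorded for that user agent
    print(rp.can_fetch('*', 'https://www.jd.com/'))           # expected: True
    print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/'))  # expected: False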

Analyzing Site Structure

  • Use the information in the sitemap
  • Analyze the site's directory structure
  • Page parsers:
    • fuzzy matching:
      • regular expressions
    • structured parsing:
      • html.parser
      • BeautifulSoup
      • lxml
      • ...

Document Parsing with re

Regular Expressions

Regular Expression (regex)

  • A general framework for expressing a set of strings concisely
  • Key property: compact; one line says it all (the line is the feature, i.e. the pattern)
  • Uses (mainly string matching):
    • expressing features of classes of text (virus signatures, intrusion patterns, ...)
    • matching all or part of a string
    • finding or replacing a set of strings
    • ...
  • The syntax consists of characters and operators. Common operators:

    • Matching a single character

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | . | any single character | / |
      | [] | any character listed in the set | [abc] matches a, b, or c; [a-z] matches one character from a to z |
      | [^ ] | negated set: any single character not listed | [^abc] matches a single character other than a, b, or c |
      | \d | digit, equivalent to [0-9] | / |
      | \D | non-digit | / |
      | \w | word character, equivalent to [A-Za-z0-9_] | / |
      | \W | non-word character | / |
      | \s | whitespace (space, tab, ...) | / |
      | \S | non-whitespace | / |
    • Matching a quantity

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | * | 0 or more of the preceding character | abc* matches ab, abc, abcc, abccc, ... |
      | + | 1 or more of the preceding character | abc+ matches abc, abcc, abccc, ... |
      | ? | 0 or 1 of the preceding character | abc? matches ab or abc |
      | {m} | exactly m of the preceding character | ab{2}c matches abbc |
      | {m,} | at least m of the preceding character | ab{2,}c matches abbc, abbbc, abbbbc, ... |
      | {m,n} | m to n (inclusive) of the preceding character | ab{1,2}c matches abc and abbc |
    • Matching a boundary

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | ^ | start of string | ^abc: abc at the start of a string |
      | $ | end of string | abc$: abc at the end of a string |
      | \b | word boundary: the edge between a word and a non-word character, not the separator itself (words may be letters, CJK characters, or digits; non-word characters include punctuation, spaces, tabs, newlines) | given "a nice day" and "a niceday", \bnice\b matches the "nice" in "a nice day" |
      | \B | non-word-boundary | given "a nice day" and "a niceday", \bnice\B matches the "nice" in "a niceday" |
    • Matching groups

      | Operator | Meaning | Example |
      |----------|---------|---------|
      | \| | either the left or the right expression | abc\|def matches abc or def |
      | ( ) | grouping; only \| may be used inside | (abc) matches abc; (abc\|def) matches abc or def |
      | \num | backreference to group num | <(\w*)><(\w*)>.*</\2></\1> matches <html><h1>hh</h1></html> but not <html><h1>hh</h1></abc> |
      | (?P<name>) | named group | <(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)> matches <html><h1>hh</h1></html> but not <html><h1>hh</h1></abc> |
      | (?P=name) | backreference to the group named name | / |
  • eg1:

    • a set of strings (infinitely many): 'PY', 'PYY', 'PYYY', 'PYYYY', ......, 'PYYYY......'
    • regex (a concise expression of the infinite set): PY+
  • eg2:
    • a set of strings: 'PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'
    • regex: P(Y|YT|YTH|YTHO)?N
  • eg3:
    • strings beginning with 'PY', followed by at most 10 characters, none of which is 'P' or 'Y' (e.g. 'PYABC' matches; 'PYKXYZ' does not)
    • regex: PY[^PY]{0,10}
  • eg4:

    • 'PN', 'PYN', 'PYYN', 'PYYYN'...
    • PY{0,3}N
  • Classic regex examples (checked in the sketch after this list):

    • ^[A-Za-z]+$ strings made up of the 26 letters
    • ^[A-Za-z0-9]+$ strings made up of letters and digits
    • [1-9]\d{5} 6-digit postal codes in mainland China
    • [\u4e00-\u9fa5] a Chinese character
    • \d{3}-\d{8}|\d{4}-\d{7} domestic (CN) phone numbers (e.g. 010-68913536)
    • (([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5]) an IP address (4 octets)
      • 0-99: [1-9]?\d
      • 100-199: 1\d{2}
      • 200-249: 2[0-4]\d
      • 250-255: 25[0-5]
      • looser simplified forms: \d+\.\d+\.\d+\.\d+ or \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
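
A quick check of a few of these patterns with the re module (re.fullmatch anchors the whole string):

    import re

    print(re.fullmatch(r'[A-Za-z]+', 'Python') is not None)   # True
    print(re.fullmatch(r'[1-9]\d{5}', '100081') is not None)  # True: postal code
    print(re.search(r'\d{3}-\d{8}|\d{4}-\d{7}', 'tel: 010-68913536').group())  # 010-68913536

    ip = r'(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'
    print(re.fullmatch(ip, '192.168.1.255') is not None)  # True
    print(re.fullmatch(ip, '192.168.1.256') is not None)  # False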

The re Library

  • Python's standard library for string matching
  • import re
  • Two ways to write a pattern:
    • raw string: r'text', e.g. r'[1-9]\d{5}', r'\d{3}-\d{8}|\d{4}-\d{7}'
    • plain string (more cumbersome), e.g. '[1-9]\\d{5}', '\\d{3}-\\d{8}|\\d{4}-\\d{7}'
    • Note: raw strings do not re-escape backslashes, so prefer a raw string whenever the pattern contains escapes
  • Functional usage: one-off operations

    • re.search(pattern, string, flags=0): find the first match anywhere; returns a match object
      • pattern: the regular expression (string / raw string)
      • string: the string to search
      • flags: control flags
        • re.I, re.IGNORECASE: case-insensitive matching
        • re.M, re.MULTILINE: ^ also matches at the start of each line
        • re.S, re.DOTALL: . matches any character, including newline (by default it matches everything except newline)
      • eg:
          import re
          match=re.search(r'[1-9]\d{5}','BIT100081 TSU100084')
          if match:
              print(match.group(0))   # 100081
        
    • re.match(pattern, string, flags=0): match only at the beginning of the string; returns a match object

      • same parameters as above
      • eg:

          match=re.match(r'[1-9]\d{5}','BIT100081 TSU100084')
          if match:
              print(match.group(0))   # no output: match is None (calling match.group(0) without the check raises AttributeError)
        
          match=re.match(r'[1-9]\d{5}','100081BIT TSU100084')
          if match:
              print(match.group(0))   # 100081
        
    • re.findall(pattern, string, flags=0): search; returns a list of all matching substrings
      • same parameters as above
      • eg:
          ls=re.findall(r'[1-9]\d{5}','BIT100081 TSU100084') # ['100081','100084']
        
    • re.finditer(pattern, string, flags=0): search; returns an iterator whose elements are match objects
      • same parameters as above
      • eg:
          for match in re.finditer(r'[1-9]\d{5}','BIT100081 TSU100084'):
              if match:
                  print(match.group(0))
          # 100081
          # 100084
        
    • re.split(pattern, string, maxsplit=0, flags=0): split; returns a list
      • maxsplit: maximum number of splits; the remainder is returned as the final element
      • eg:
          re.split(r'[1-9]\d{5}','BIT100081 TSU100084')   # ['BIT',' TSU',''] 
          re.split(r'[1-9]\d{3}','BIT100081 TSU100084')   # ['BIT', '81 TSU', '84']
          re.split(r'[1-9]\d{5}','BIT100081 TSU100084',maxsplit=1) # ['BIT',' TSU100084']
        
    • re.sub(pattern, repl, string, count=0, flags=0): replace every match; returns the resulting string
      • repl: the replacement string
      • string: the string to search
      • count: maximum number of replacements
      • eg:
          re.sub(r'[1-9]\d{5}',':zipcode','BIT100081 TSU100084') # 'BIT:zipcode TSU:zipcode'
        
  • Object-oriented usage: compile once, reuse many times

    • Step1: regex = re.compile(pattern, flags=0) compiles the pattern string into a pattern object (flags are fixed at compile time)
    • Step2:
      • regex.search(string[, pos[, endpos]])
      • regex.match(string[, pos[, endpos]])
      • regex.findall(string[, pos[, endpos]])
      • regex.finditer(string[, pos[, endpos]])
      • regex.split(string, maxsplit=0)
      • regex.sub(repl, string, count=0)
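    • eg: a minimal sketch of the compile-then-reuse pattern:

        import re

        zipcode = re.compile(r'[1-9]\d{5}')            # compile once
        print(zipcode.findall('BIT100081 TSU100084'))  # ['100081', '100084']
        print(zipcode.sub(':zipcode', 'BIT100081'))    # 'BIT:zipcode'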
  • Match objects:

    • the result of a single match, carrying detailed information
    • attributes:
      • .string: the text that was searched
      • .re: the pattern object used for the match
      • .pos: start position of the search
      • .endpos: end position of the search
    • methods:
      • .group(0): the matched substring
      • .start(): start index of the match in the original string
      • .end(): end index of the match in the original string
      • .span(): (.start(), .end())
    • eg:
        match=re.search(r'[1-9]\d{5}','BIT100081 TSU100084')
        # attributes:
        match.string    # 'BIT100081 TSU100084'
        match.re        # re.compile('[1-9]\\d{5}')
        match.pos       # 0
        match.endpos    # 19
        # methods:
        match.group(0)  # 100081
        match.start()   # 3
        match.end()     # 9
        match.span()    # (3,9)
      
  • Greedy matching (the default): returns the longest matching substring

      match = re.search(r'PY.*N', 'PYANBNCNDN')
      match.group(0)  # 'PYANBNCNDN'
    
  • Lazy (minimal) matching: returns the shortest match; append ? to the quantifier

      '''
      Any quantifier that can match variable lengths becomes lazy with a trailing ?
      `*?`: 0 or more of the preceding character, minimal
      `+?`: 1 or more of the preceding character, minimal
      `??`: 0 or 1 of the preceding character, minimal
      `{m,n}?`: m to n (inclusive) of the preceding character, minimal
      '''
      match = re.search(r'PY.*?N', 'PYANBNCNDN')
      match.group(0)  # 'PYAN'
    

Document Parsing with BeautifulSoup

A web-page parsing library; efficient; handles HTML, XML, and HTML5 documents, and can be configured with different parsers

Common parsers:

| Parser | Usage | Notes |
|--------|-------|-------|
| html.parser | BeautifulSoup(content,'html.parser') | Python standard library; moderate speed and fault tolerance; no external dependency |
| lxml | BeautifulSoup(content,'lxml'), BeautifulSoup(content,'xml') | third-party (pip install lxml); fast (partial traversal); supports XML; very fault-tolerant; depends on a C extension |
| html5lib | BeautifulSoup(content,'html5lib') | third-party (pip install html5lib); slow; parses the way a browser does and produces an HTML5 document; best fault tolerance; no external (C) dependency |
  1. Install (the library is imported as bs4 but installed under a different package name)

     pip install beautifulsoup4

  2. Create a BeautifulSoup object and parse the DOM tree structurally (HTML/XML <=> document tree <=> BeautifulSoup object)

     from bs4 import BeautifulSoup
    
     soup = BeautifulSoup("<html><body><p>data</p></body></html>",'html.parser')
     # omitting the parser also works, but emits a warning and picks the "best" installed parser:
     # soup = BeautifulSoup("<html><body><p>data</p></body></html>")
     print(soup.p)
    
     # pretty-print (inserts `\n` between the HTML markup and its content); also works on a tag: `<tag>.prettify()`
     print(soup.prettify())
    
  3. Accessing nodes

    | Element | Meaning | Access | Example |
    |---------|---------|--------|---------|
    | Tag | a tag | <tag> | soup.p |
    | Name | the tag's name, a string | <tag>.name | soup.p.name |
    | Attributes | the tag's attributes, organized as a dict | <tag>.attrs | soup.p.attrs, soup.p['attrname'] |
    | NavigableString | the non-attribute string inside a tag (the text between <>...</>) | <tag>.string | soup.p.string |
    | Comment | a comment inside a tag; a special kind of NavigableString | / | / |
    • test whether an attribute is set
      • has_attr("attrname")
    • get an attribute
      • .attrs["attrname"]
      • ["attrname"]
    • get the content
      • .text
      • .get_text()
    • .string vs .text

      • .string on a Tag object returns a NavigableString object.
      • .text gathers all child strings and returns their concatenation (with an optional separator).
      • sample:

        | Html | .string | .text |
        |------|---------|-------|
        | <td>some text</td> | some text | some text |
        | <td></td> | None | / |
        | <td><p>more text</p></td> | more text | more text |
        | <td>even <p>more text</p></td> | None (two child strings; .string cannot decide which) | even more text (.text concatenates both pieces) |
        | <td><!--This is comment--></td> | This is comment | / |
        | <td>even <!--This is comment--></td> | None | even |
  4. Navigating the Tree

    • Going down: children and descendants
      • .contents returns a list of direct children
      • .children returns a list_iterator over direct children
      • .descendants returns a generator over all descendants
    • Going up: parent and ancestors
      • .parent returns the node's parent
      • .parents returns a generator over all ancestors (ending at the BeautifulSoup object itself)
    • Going sideways: siblings
      • .next_sibling / .previous_sibling return the next/previous sibling in document order
      • .next_siblings / .previous_siblings return generators over all following/preceding siblings in document order
    • Going back and forth: document order, ignoring hierarchy
      • .next_element / .next_elements
      • .previous_element / .previous_elements
  5. Searching the Tree

    • Searching down (descendants):
      • find / find_all
    • Searching up (ancestors):
      • find_parent / find_parents
    • Searching sideways (siblings):
      • find_next_sibling / find_previous_sibling
      • find_next_siblings / find_previous_siblings
    • Searching back and forth (preceding/following nodes, ignoring hierarchy):
      • find_next / find_all_next
      • find_previous / find_all_previous
    • Notes:
      • method parameters are (name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs); regular expressions are accepted
        • name: tag name
        • attrs: tag attributes
        • recursive: whether to search all descendants; default True
        • text: content string
        • limit: maximum number of results
      • <tag>(..) is equivalent to <tag>.find_all(..)
      • soup(..) is equivalent to soup.find_all(..)
      • each result element is a bs4.element.Tag object
  6. Nodes can also be selected with CSS selectors: .select('...')

    • Basic selectors:
      • #id
      • tagName
      • .styleClass
    • Attribute filters:
      • [attribute]
      • [attribute=value]
      • [attribute!=value]
      • [attribute^=value]
      • [attribute$=value]
      • [attribute*=value]
    • Hierarchy selectors:
      • ancestor descendent
      • parent > child
      • prev + next (the immediately following sibling tag)
      • prev ~ siblings (all following sibling tags)
    • Element filters:
      • :not(selector)
      • :nth-of-type(index)
      • :nth-child(index)
      • :first-child
      • :last-child
      • :only-child
    • Content filters:
      • :contains(text)
      • :empty
      • :has(selector)
    • Form-state filters:
      • :enabled
      • :checked
      • :disabled
    • Combinations:
      • selector1, selector2, selectorN: the union of several selectors
      • [selector1][selector2][selectorN]: elements that satisfy all the attribute selectors at once
  7. Notes:

    • BeautifulSoup auto-detects the document encoding (via its Unicode, Dammit sub-library), converts everything to Unicode internally, and outputs UTF-8
    • get attribute values: .attrs, .attrs['xxx']
    • get content: .text, .get_text(), .string, .strings

Demo: Accessing Nodes

from bs4 import BeautifulSoup

content='''
<b>Chat with sb</b>
<a> This is title  <!-- Guess --> </a>
<i><!--This is comment--></i>
<div id="div1">
    <div id="div2">
        <p id="test" class="highlight">
            Hello <a>Tom</a>
            Nice to meet you <!-- This is a comment -->
        </p>
    </div>
</div>
'''
soup=BeautifulSoup(content,'html.parser')
  1. Tag name,attrs

     print("soup.p:",soup.p)
     # <p class="highlight" id="test1">
     #                 Hello <a>Tom</a>
     #                 Nice to meet you <!-- This is a comment -->
     # </p>
    
     print("soup.p.name:",soup.p.name)
     # p
    
     print("soup.p.attrs:",soup.p.attrs)
     # {'id': 'test1', 'class': ['highlight']}
    
     print("soup.p.attr['class']:",soup.p.attrs["class"])
     # ['highlight']
    
     print("soup.p.attrs['id']:",soup.p.attrs["id"])
     #  test1
    
     print("soup.p['class']:",soup.p["class"])
     #['highlight']
    
  2. Tag text/string

     print("soup.p.text:",soup.p.text)
     #
     #                Hello Tom
     #                Nice to meet you
     #
    
     print("soup.p.get_text():",soup.p.get_text())
     #
     #                Hello Tom
     #                Nice to meet you
     #
    
     print("type(soup.p.get_text()):",type(soup.p.get_text()))   # <class 'str'>
     print("-----------------------------------")
    
     print('--- Demo: Tag <p> string ---')
     print("soup.p.string:",soup.p.string)               # None
     print("type(soup.p.string)",type(soup.p.string))    # <class 'NoneType'>
     print("soup.p.strings:",soup.p.strings)             # <generator object Tag._all_strings at 0x00000000028FDD68>
     for i,s in enumerate(soup.p.strings):
         print(i,":",s)
     print("-----------------------------------")    
     # 0 :
     #                 Hello
     # 1 : Tom
     # 2 :
     #                 Nice to meet you
     # 3 :
    
     print('--- Demo: Tag <a> text/string ---')
     print("soup.a.text:",soup.a.text)                       # Chat with sb
     print("soup.a.string:",soup.a.string)                   # Chat with sb
     print("type(soup.a.string):",type(soup.a.string))       # <class 'bs4.element.NavigableString'>
     print("-----------------------------------")
    
     print('--- Demo: Tag <b> text/string ---')
     print("soup.b.text:",soup.b.text)                       # This is title
     print("soup.b.string:",soup.b.string)                   # None
     print("type(soup.b.string):",type(soup.b.string))       # <class 'NoneType'>
     print("-----------------------------------")
    
     print('--- Demo: Tag <i> text/string ---')
     print("soup.i.text:",soup.i.text)                       #
     print("soup.i.string:",soup.i.string)                   # This is comment
     print("type(soup.i.string):",type(soup.i.string))       # <class 'bs4.element.Comment'>
    

Demo: Navigating the Tree

from bs4 import BeautifulSoup
from bs4 import element

content='''
<b>Chat with sb</b>
<a> This is title  <!-- Guess --> </a>
<i><!--This is comment--></i>
<div id="div1">
    <div id="div2">
        <p id="test" class="highlight">
            Hello <a>Tom</a>
            Nice to meet you <!-- This is a comment -->
        </p>
    </div>
</div>
'''

soup=BeautifulSoup(content,'html.parser')
def print_result(result):
    if type(result)==element.Tag or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r)
    print('-------------------------')

def print_result_name(result):
    if type(result)==element.Tag or type(result)==element.NavigableString or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r.name)
    print('-------------------------')
  1. Going down:

    • .contents

        print(soup.p.contents)
        # <class 'list'>
      
        print_result(soup.p.contents)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 :
        #                 Nice to meet you
        # 3 :  This is a comment
        # 4 :
      
    • .children

        print(soup.p.children)          
        # <list_iterator object at 0x0000000001E742E8>
      
        print_result(soup.p.children)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 :
        #                 Nice to meet you
        # 3 :  This is a comment
        # 4 :
      
    • .descendants

        print('--- Demo: Tag <p> descendants ---')
        print(soup.p.descendants)
        # <generator object Tag.descendants at 0x00000000028ADD68>
      
        print_result(soup.p.descendants)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 : Tom
        # 3 :
        #                 Nice to meet you
        # 4 :  This is a comment
        # 5 :
      
  2. Going up:

    • .parent

        print(type(soup.p.parent))      
        # <class 'bs4.element.Tag'>
      
        print_result(soup.p.parent)
        # <div id="div2">
        # <p class="highlight" id="test1">
        #                 Hello <a>Tom</a>
        #                 Nice to meet you <!-- This is a comment -->
        # </p>
        # <p class="story" id="test2">Story1</p>
        # <p class="story" id="test3">Story2</p>
        # </div>
      
    • .parents

        print(soup.p.parents)           
        # <generator object PageElement.parents at 0x00000000028FDD68>
      
        print_result_name(soup.p.parents)
        # 0 : div
        # 1 : div
        # 2 : [document]
      
  3. Going sideway:

    • next_sibling
        print_result(soup.p.next_sibling)
        # 0 :
      
    • next_siblings

        print(soup.p.next_siblings)     
        # <generator object PageElement.next_siblings at 0x00000000028FDD68>
      
        print_result(soup.p.next_siblings)
        # 0 :
        # 
        # 1 : <p class="story" id="test2">Story1</p>
        # 2 :
        # 
        # 3 : <p class="story" id="test3">Story2</p>
        # 4 :
      
    • vs. find_next_siblings
        print('--- Demo: `find_next_siblings()` ---')
        result=soup.p.find_next_siblings()
        print_result(result)
        # 0 : <p class="story" id="test2">Story1</p>
        # 1 : <p class="story" id="test3">Story2</p>
      
  4. Going forth and back:

    • next_element
        print(soup.p.next_element)
        #
        # Hello
        print(type(soup.p.next_element))
        # <class 'bs4.element.NavigableString'>
      
    • next_elements

        print(soup.p.next_elements)     
        # <generator object PageElement.next_elements at 0x00000000028FDD68>
      
        print_result(soup.p.next_elements)
        # 0 :
        #                 Hello
        # 1 : <a>Tom</a>
        # 2 : Tom
        # 3 :
        #                 Nice to meet you
        # 4 :  This is a comment
        # 5 :
        # 
        # 6 :
        # 
        # 7 : <p class="story" id="test2">Story1</p>
        # 8 : Story1
        # 9 :
        # 
        # 10 : <p class="story" id="test3">Story2</p>
        # 11 : Story2
        # 12 :
        # 
        # 13 :
        # 
        # 14 :
      
    • vs. find_all_next()
        result=soup.p.find_all_next()
        print_result(result)
        # 0 : <a>Tom</a>
        # 1 : <p class="story" id="test2">Story1</p>
        # 2 : <p class="story" id="test3">Story2</p>
      

Demo: Searching the Tree

from bs4 import BeautifulSoup
from bs4 import element
import re

content='''
<html><head><title>The Dormouse's story</title></head> <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''

soup=BeautifulSoup(content,'html.parser')
print(soup.prettify())

def print_result(result):
    if type(result)==element.Tag or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r)
    print('-------------------------')

def print_result_name(result):
    if type(result)==element.Tag or (type(result)== list and len(result)==0):
        print(result)
        return
    for i,r in enumerate(result):
        print(i,":",r.name)
    print('-------------------------')
  1. Searching down

    • by name

        print('--- Demo: `find_all("a")` ---')
        result=soup.find_all('a')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: `find_all(["a","title"])` ---')
        result=soup.find_all(['a','title'])
        print_result(result)
        # 0 : <title>The Dormouse's story</title>
        # 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 2 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 3 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: `find_all(True)` ---')
        result=soup.find_all(True)
        print_result(result)
        # 0 : <html><head><title>The Dormouse's story</title></head> <body>
        # <p class="title"><b>The Dormouse's story</b></p>
        # <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # <p class="story">...</p>
        # </body>
        # </html>
        # 1 : <head><title>The Dormouse's story</title></head>
        # 2 : <title>The Dormouse's story</title>
        # 3 : <body>
        # <p class="title"><b>The Dormouse's story</b></p>
        # <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # <p class="story">...</p>
        # </body>
        # 4 : <p class="title"><b>The Dormouse's story</b></p>
        # 5 : <b>The Dormouse's story</b>
        # 6 : <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # 7 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 8 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 9 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
        # 10 : <p class="story">...</p>
      
        print('--- Demo: `find_all(re.compile("b")` ---')
        result=soup.find_all(re.compile('b'))
        print_result(result)
        # 0 : <body>
        # <p class="title"><b>The Dormouse's story</b></p>
        # <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # <p class="story">...</p>
        # </body>
        # 1 : <b>The Dormouse's story</b>
      
    • by attrs

        print('--- Demo: find_all("p","story") ---')
        result=soup.find_all('p','story')
        print_result(result)
        # 0 : <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # 1 : <p class="story">...</p>
      
        print('--- Demo: find_all(id="link1") ---')
        result=soup.find_all(id='link1')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
        print('--- Demo: find_all(class_="sister") ---')
        result=soup.find_all(class_='sister')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: find_all(re.compile("link")) ---')
        result=soup.find_all(id=re.compile('link'))
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: find_all(attrs={"class":"story"}) ---')
        result=soup.find_all(attrs={'class':'story'})
        print_result(result)
        # 0 : <p class="story">
        # Once upon a time there were three little sisters; and their names were
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
        # </p>
        # 1 : <p class="story">...</p>
      
    • by recursive

        print('--- Demo: find_all("a") ---')
        result=soup.find_all('a')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: find_all("a",recursive=False) ---')
        result=soup.find_all('a',recursive=False)
        print_result(result)
        # []
      
    • by string/text

        print('--- Demo: find_all(string="three") ---')
        result=soup.find_all(string='three')
        print_result(result)
        # []
      
        print('--- Demo: find_all(string=re.compile("e")) ---')
        result=soup.find_all(string=re.compile('e'))
        print_result(result)
        # 0 : The Dormouse's story
        # 1 : The Dormouse's story
        # 2 :
        # Once upon a time there were three little sisters; and their names were
        #
        # 3 : Elsie
        # 4 : Lacie
        # 5 : Tillie
        # 6 : ; and they lived at the bottom of a well.
      
    • by limit: find() is essentially find_all() with limit=1, returning the first match rather than a list
        print('--- Demo: find_all("a",limit=2) ---')
        result=soup.find_all('a',limit=2)
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
    • by self def function

        print('--- Demo: using `self def function` ---')
        def my_filter(tag):
            return tag.has_attr('id') and re.match('link',tag.get("id"))
      
        result=soup.find_all(my_filter)
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
  2. Searching up: find_parents

     print('--- Demo: link2.`find_parents()` ---')
     result=soup.find(id="link2").find_parents()
     print_result_name(result)
     # 0 : p
     # 1 : body
     # 2 : html
     # 3 : [document]
    
     print('--- Demo: link2.`find_parents("p")` ---')
     result=soup.find(id="link2").find_parents('p')
     print_result(result)
     # 0 : <p class="story">
     # Once upon a time there were three little sisters; and their names were
     # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
     # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.
     # </p>
    
  3. Searching sideway: find_next_siblings

     print('--- Demo: `find_next_siblings()` ---')
     result=soup.find(id="link1").find_next_siblings()
     print_result(result)
     # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
     # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    
  4. Searching forth and back: find_all_next

     print('--- Demo: `find_all_next()` ---')
     result=soup.find(id="link1").find_all_next()
     print_result(result)
     # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
     # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
     # 2 : <p class="story">...</p>
    

Demo: CSS Selectors

from bs4 import BeautifulSoup
from bs4 import element

# print_result / print_result_name are the helpers from the previous demo

content='''
<html><head><title>The Dormouse's story</title></head> <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
</p>
<p class="story">...</p>
<input type="text" disabled value="input something"/>
</body>
</html>
'''
soup=BeautifulSoup(content,'html.parser')
  1. Basic selectors

    • #id

        print('--- Demo: `select("#link1")` ---')
        result=soup.select("#link1")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
        print('--- Demo: `select("a#link1")` ---')
        result=soup.select("a#link2")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
    • tagName
        print('--- Demo: `select("title")` ---')
        result=soup.select("title")
        print_result(result)
        # 0 : <title>The Dormouse's story</title>
      
    • .styleClass
        print('--- Demo: `select(".sister")` ---')
        result=soup.select(".sister")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
  2. Attribute filters

    • [attribute]

        print('--- Demo: `select("a[href]")` ---')
        result=soup.select('a[href]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute=value]

        print('--- Demo: `select("[class=sister]")` ---')
        result=soup.select("[class=sister]")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute^=value]
        print('--- Demo: `select("a[href^="http://example.com/"]")` ---')
        result=soup.select('a[href^="http://example.com/"]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute$=value]
        print('--- Demo: `select("a[href$="tillie"])` ---')
        result=soup.select('a[href$="tillie"]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • [attribute*=value]
        print('--- Demo: `select("a[href*=".com/el"]")` ---')
        result=soup.select('a[href*=".com/el"]')
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • [selector1][selector2][selectorN]
        print("--- Demo: `[class='sister'][id=link2]` --- ")
        print_result(soup.select("[class=sister][id=link2]"))
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
  3. Hierarchy selectors

    • ancestor descendent
        print('--- Demo: `select("body a")` ---')
        result=soup.select("body a")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • parent > child

        print('--- Demo: `select("body > a") ---')
        result=soup.select("body > a")
        print_result(result)
        # []
      
        print('--- Demo: `select("p > a") ---')
        result=soup.select("p > a")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
        print('--- Demo: `select("p > a:nth-of-type(2)")` ---')
        result=soup.select("p > a:nth-of-type(2)")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
        print('--- Demo: `select("p > #link1")` ---')
        result=soup.select("p > #link1")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • prev ~ siblings : all following sibling tags
        print('--- Demo: `select("#link1 ~ .sister")` ---')
        result=soup.select("#link1 ~ .sister")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • prev + next : the immediately following sibling tag
        print('--- Demo: `select("#link1 + .sister")` ---')
        result=soup.select("#link1 + .sister")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
  4. Element filters

    • :not(selector)
        print("--- Demo: `:not(.story)` --- ")
        print_result(soup.select("p:not(.story)"))
        # 0 : <p class="title"><b>The Dormouse's story</b></p>
      
    • :nth-of-type(index)
        print('--- Demo: `select("p:nth-of-type(3)")` ---')
        result=soup.select("p:nth-of-type(3)")
        print_result(result)
        # 0 : <p class="story">...</p>
      
    • :nth-child(index)
        print("--- Demo: `p > :nth-child(1)` --- ")
        print_result(soup.select("p > :nth-child(1)"))
        # 0 : <b>The Dormouse's story</b>
        # 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • :first-child
        print("--- Demo: `p > :first-child` --- ")
        print_result(soup.select("p > :first-child"))
        # 0 : <b>The Dormouse's story</b>
        # 1 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
    • :last-child
        print("--- Demo: `p > :last-child` --- ")
        print_result(soup.select("p > :last-child"))
        # 0 : <b>The Dormouse's story</b>
        # 1 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
      
    • :only-child
        print("--- Demo: `p > :only-child` --- ")
        print_result(soup.select("p > :only-child"))
        # 0 : <b>The Dormouse's story</b>
      
  5. Content filters

    • :contains(text)
        print("--- Demo: `p:contains(story)` --- ")
        print_result(soup.select("p:contains(story)"))
        # 0 : <p class="title"><b>The Dormouse's story</b></p>
      
    • :empty
        print("--- Demo: `p:empty` --- ")
        print_result(soup.select("p:empty"))
        # []
      
    • :has(selector)
        print("--- Demo: `p:has(b)` --- ")
        print_result(soup.select("p:has(b)"))
        # 0 : <p class="title"><b>The Dormouse's story</b></p>
      
  6. Form-state filters

    • :enabled,:disabled,:checked
        print("--- Demo: `:disabled`` --- ")
        print_result(soup.select(":disabled"))
        # 0 : <input disabled="" type="text" value="input something"/>
      
  7. Others:

    • selector1, selector2, selectorN
        print('--- Demo: `select("#link1,#link2")` ---')
        result=soup.select("#link1,#link2")
        print_result(result)
        # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
        # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      
    • select_one()
        print('--- Demo: `select_one(".sister")` ---')
        result=soup.select_one(".sister")
        print_result(result)
        # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
      
  8. get attribute value:

     print('--- Demo: `get attribute value` ---')
     result=soup.select(".sister")
    
     print_result(result)
     # 0 : <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
     # 1 : <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
     # 2 : <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    
     print(result[0].get_text())
     #Elsie
    
     print(result[0].attrs)
     #{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
    
     print(result[0].attrs['id'])
     #link1
    

Document Parsing with XPath

  • Uses path expressions to select nodes or node-sets from an XML/HTML document
  • Install: pip install lxml
  • Import: from lxml import etree
  • Note: like re, lxml is implemented in C; it is a high-performance HTML/XML parser for Python, and XPath expressions locate specific elements and node information quickly
  • Tools: the XPath Helper extension for Chrome quickly derives matching rules for page elements

Path Expressions

  • // : selects matching nodes anywhere below the current node, regardless of position

    • //p (.//p)
    • /p//a
    • //p/a
  • / : selects from the root node

    • /p (./p)
    • /p/a
  • . is the current node; .. is its parent

    • ./p
    • ../p
    • //p/b/../a
    • root.xpath('//p/b').xpath('./a')
    • root.xpath('//p/b').xpath('../text()')
    • root.xpath('//p/b/..//a')[0].text
  • @ : selects attributes

    • //@class
    • //p/@class
    • //p//@class
    • //p[@class]
    • //p[@class='s1']
    • //p[@class='s1']/@class
  • /text(), string(.) : select text content

    • "//b/text()"
    • //b//text()
    • string(.)
    • string(./description)
  • []: Predicates

    • //p[1],//p[last()],//p[last()-1]
    • //p[position()<=2]
    • //p[@class],//p[@class='s1']
    • //p[b],//p[b/@class],//p[b[@class='s1']]
  • * : wildcard, matches anything

    • //p/*
    • //p//*
    • //p/*/a
    • //p[@*]
    • //*[@class='s1']
  • | : combines several paths

    • /p | //b
    • //p/a | //p/b[@class]
  • and,or,not:

    • //a[@class='sister' and @id='link2'],//a[@class='sister'][@id='link2']
    • //a[@id='link1' or @class='outAstyle']
    • //a[not(@class='sister')]
    • //a[not(@class='sister') and @class or @id='link1']
  • functions xxx():

    • starts-with(): //a[starts-with(@href,'http://example.com/')]
    • contains(): //a[contains(text(),'ie') and contains(@id,'link')]
    • text(): //b/text(),//b//text()
    • string(.): data.xpath('//div[@class="name"]')[0].xpath('string(.)')
  • :: (axes; see the sketch at the end of this section)

    • go self: self::,eg: //self::b
    • go up: ancestor:: , ancestor-or-self::,parent::,eg: //a/ancestor::p
    • go down: descendant::,child::,eg: //p/descendant::a[not(@class)]
    • go forward: following::,following-sibling::, eg: p[last()-1]/following::*
    • go back: preceding::,preceding-sibling:: , eg: p[2]/preceding::*
    • get attributes: attribute::,eg: //a/attribute::*,//a/attribute::class
  • lxml.etree._Element:

    • tag
    • attrib
    • text
    • .xpath('string(.)')
    • .get('attribute')
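
A small self-contained sketch of the axes; the one-line HTML string is made up for illustration:

    from lxml import etree

    root = etree.HTML("<div><p class='s1'><b><a href='#'>x</a></b></p><p>y</p></div>")

    print(root.xpath('//a/ancestor::p/@class'))              # ['s1']
    print(root.xpath('//p/descendant::a/@href'))             # ['#']
    print(root.xpath('//p[1]/following-sibling::p/text()'))  # ['y']
    print(root.xpath('//a/attribute::*'))                    # ['#']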

Demo: Parsing HTML

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree

content='''
<div>
    <p class="title"><b class='bstyle'>The Dormouse's story</b></p>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        ; and they lived at the bottom of a well.
        <p> hello ...<b><a> World </a></b> </p>
    </p>
    <p class="story">...<a class="outAstyle">Miss</a> </p>
</div>
'''

# html = etree.parse('./test.html',etree.HTMLParser())
html = etree.HTML(content)
print(html)
# <Element html at 0x1019312c8>

# result = etree.tostring(html)     # fills in missing/unclosed tags
# print(result.decode("utf-8"))
print(etree.tounicode(html))        # fills in missing/unclosed tags
# <html><body><div>
#   <p class="title"><b class="bstyle">The Dormouse's story</b></p>
#   <p class="story">
#       Once upon a time there were three little sisters; and their names were
#       <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
#       <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
#       and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
#       ; and they lived at the bottom of a well.
#       </p><p> hello ...<b><a> World </a></b> </p>
#   <p class="story">...<a class="outAstyle">Miss</a> </p>
# </div>
# </body></html>

result=html.xpath("//p/b")
for i,r in enumerate(result):
    print(i,type(r),":",r.tag,r.attrib,r.get('class'),r.text,r.xpath('string(.)'))
# 0 <class 'lxml.etree._Element'> : b {'class': 'bstyle'} bstyle The Dormouse's story The Dormouse's story
# 1 <class 'lxml.etree._Element'> : b {} None None  World

###########################
# More tests: the helper functions are defined below; they are invoked at
# the end of the script so every name is defined before use.


def test_path_any(root):
    print("--- `//` ----")
    do_xpath(root,'p')
    # []
    do_xpath(root,'//p')
    # [<Element p at 0x109f34148>, <Element p at 0x109f34188>, <Element p at 0x109f34248>, <Element p at 0x109f34288>]
    do_xpath(root,'//p/a/text()')
    # ['Elsie', 'Lacie', 'Tillie', 'Miss']
    do_xpath(root,'//p//a/text()')
    # ['Elsie', 'Lacie', 'Tillie', ' World ', 'Miss']
    do_xpath(root,'.//a/text()')
    # ['Elsie', 'Lacie', 'Tillie', ' World ', 'Miss']

    print('--- `xpath` ---')
    print(root.xpath("//p/b//a"))
    # [<Element a at 0x10b555f08>]
    print(root.xpath("//p/b")[1].xpath("//a"))
    # [<Element a at 0x10b555f08>, <Element a at 0x10b5770c8>, <Element a at 0x10b577108>, <Element a at 0x10b577048>, <Element a at 0x10b577088>]
    print(root.xpath("//p/b")[1].xpath("./a"))
    # [<Element a at 0x10c719f48>]
    print(root.xpath("//p/b")[1].xpath("../text()"))
    # [' hello ...', ' ']
    print(root.xpath('//p/b/..//a')[0].text)
    # World
    print('------------------------')

def test_path_attr(root):
    print("--- `@` ----")
    do_xpath(root,'/@class')
    # []
    do_xpath(root,'//@class')
    # ['title', 'bstyle', 'story', 'sister', 'sister', 'sister', 'story', 'outAstyle']

    do_xpath(root,'//p[@class]')
    # [<Element p at 0x10e4c3888>, <Element p at 0x10e4c36c8>, <Element p at 0x10e4c3708>]
    do_xpath(root,"//p[@class='story']")
    # [<Element p at 0x110ba8708>, <Element p at 0x110ba8548>]

    do_xpath(root,"//p/@class")
    # ['title', 'story', 'story']
    do_xpath(root,"//p[@class='story']/@class")
    # ['story', 'story']
    do_xpath(root,"//p[@class='story']//@class")
    # ['story', 'sister', 'sister', 'sister', 'story', 'outAstyle']
    print('------------------------')

def test_path_predicates(root):
    print("--- `[]` ----")
    do_xpath_detail(root,'//p[1]')
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    do_xpath_detail(root,'//p[last()]')
    # 0 : <p class="story">...<a class="outAstyle">Miss</a> </p>
    do_xpath_detail(root,'//p[last()-1]')
    # 0 : <p> hello ...<b><a> World </a></b> </p>

    do_xpath_detail(root,'//a[1]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a> World </a>
    # 2 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root,'//p/a[1]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root,'//a[position()<=2]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    # 2 : <a> World </a>
    # 3 : <a class="outAstyle">Miss</a>

    do_xpath_detail(root,'//a[@class]')
    # 0 : <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    # 1 : <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    # 2 : <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.
    # 3 : <a class="outAstyle">Miss</a>
    do_xpath_detail(root,'//a[@class="outAstyle"]')
    # 0 : <a class="outAstyle">Miss</a>

    do_xpath_detail(root,'//p[b]')
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    # 1 : <p> hello ...<b><a> World </a></b> </p>
    do_xpath_detail(root,"//p[b/@class]")
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    do_xpath_detail(root,"//p[b[@class='bstyle']]")
    # 0 : <p class="title"><b class="bstyle">The Dormouse's story</b></p>
    print('------------------------')

def do_xpath(root,path):
    result=root.xpath(path)
    print("%s : \n%s" % (path,result))
    return result

def do_xpath_detail(root,path):
    result=root.xpath(path)
    print(path,":")
    if type(result)==list and len(result)>0:
        for i,r in enumerate(result):
            if type(r)==etree._Element:
                print(i,":",etree.tounicode(r))
            else:
                print(i,":",r)
    else:
        print(result)
    return result

test_path_any(html)
test_path_attr(html)
test_path_predicates(html)

Demo: Parsing XML

from lxml import etree

content='''
<collection shelf="New Arrivals">
    <movie title="Enemy Behind">
       <type>War, Thriller</type>
       <format>DVD</format>
       <year>2003</year>
       <rating>PG</rating>
       <stars>10</stars>
       <description>Talk about a US-Japan war</description>
    </movie>
    <movie title="Transformers">
       <type>Anime, Science Fiction</type>
       <format>DVD</format>
       <year>1989</year>
       <rating>R</rating>
       <stars>8</stars>
       <description>A scientific fiction</description>
    </movie>
</collection>
'''
root=etree.XML(content)
print(root)
print(etree.tounicode(root))

result=root.xpath('//movie')
for i,r in enumerate(result):
    print(i,r,":",r.tag,r.attrib,r.get('title'))
    print("text:",r.text)
    print("string:",r.xpath('string(./description)'))
    print('rating:',r.xpath('./rating/text()'))

Document Parsing with JSONPath

  • An information-extraction library for pulling specified data out of JSON documents, with implementations in several languages: JavaScript, Python, PHP, Java
  • JSONPath is to JSON what XPath is to XML. Refer: JSONPath - XPath for JSON
  • Two Python libraries are available:
    • pip install jsonpath, import jsonpath
    • pip install jsonpath-rw, from jsonpath_rw import jsonpath,parse. Refer: Github

JSONPath Operators

  • $: the root node
  • @: the current node
  • *: wildcard, matches everything
  • ..: recursive descent
  • . : child node
  • []: child access / iteration (supports simple operations such as array indexing and selecting by content)
    • [start:end], [start:end:step]
    • [,] selects several items at once
  • (): expression evaluation
    • ?(): filter; the expression must evaluate to a boolean (see the sketch below)
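
A tiny sketch of $, .., and * with the jsonpath package (the store dict is made up for illustration):

    import jsonpath  # pip install jsonpath

    store = {'book': [{'title': 'A', 'price': 8}, {'title': 'B', 'price': 23}]}

    print(jsonpath.jsonpath(store, '$.book[*].title'))  # ['A', 'B']
    print(jsonpath.jsonpath(store, '$..price'))         # [8, 23]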

JSON Conversion

  • import json
  • functions:
    • loads, load: JSON string -> Python object
    • dumps, dump: Python object -> JSON string
  • type mapping:

    | JSON | Python |
    |------|--------|
    | object | dict |
    | array | list |
    | string | str |
    | number (int) | int |
    | number (real) | float |
    | true | True |
    | false | False |
    | null | None |

Example:

import json

content='''
{"subjects":[
    {"rate":"6.5","cover_x":1000,"title":"硬核","url":"https://movie.douban.com/subject/27109879/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2532653002.webp","id":"27109879","cover_y":1414,"is_new":false}
    ,{"rate":"7.1","cover_x":2000,"title":"奎迪:英雄再起","url":"https://movie.douban.com/subject/26707088/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp","id":"26707088","cover_y":2800,"is_new":false}
    ,{"rate":"6.1","cover_x":800,"title":"芳龄十六","url":"https://movie.douban.com/subject/30334122/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2549923514.webp","id":"30334122","cover_y":1185,"is_new":false}
    ,{"rate":"7.7","cover_x":1500,"title":"污垢","url":"https://movie.douban.com/subject/1945750/","playable":false,"cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp","id":"1945750","cover_y":2222,"is_new":false}
    ,{"rate":"6.8","cover_x":1179,"title":"欢乐满人间2","url":"https://movie.douban.com/subject/26611891/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2515404175.webp","id":"26611891","cover_y":1746,"is_new":false}
]}
'''

# 1. loads: string -> python obj
print('---- loads: --------------')
result=json.loads(content)
print(type(result))                             # <class 'dict'>
print(result)

# 2. dumps: python obj -> string
print('---- dumps: --------------')
subjects=result.get('subjects')
result=json.dumps(subjects,ensure_ascii=False)  # disable ASCII escaping; emit UTF-8 characters as-is
print(type(result))                             # <class 'str'>
print(result)

# 3. dump: python obj -> string -> file
print('---- dump: --------------')
json.dump(subjects,open('test.json','w'),ensure_ascii=False)
with open('test.json','r') as f:
    print(f.read())

# 4. load: file -> string -> python obj
print('---- load: --------------')
result=json.load(open('test.json','r'))
print(type(result))                             # <class 'list'>
print(result)

print('-------------------------')

Demo: Parsing JSON with JSONPath

import json
import jsonpath

content='''
{"subjects":[
    {"rate":"6.5","cover_x":1000,"title":"硬核","url":"https://movie.douban.com/subject/27109879/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2532653002.webp","id":"27109879","cover_y":1414,"is_new":false}
    ,{"rate":"7.1","cover_x":2000,"title":"奎迪:英雄再起","url":"https://movie.douban.com/subject/26707088/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp","id":"26707088","cover_y":2800,"is_new":false}
    ,{"rate":"6.1","cover_x":800,"title":"芳龄十六","url":"https://movie.douban.com/subject/30334122/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2549923514.webp","id":"30334122","cover_y":1185,"is_new":false}
    ,{"rate":"7.7","cover_x":1500,"title":"污垢","url":"https://movie.douban.com/subject/1945750/","playable":false,"cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp","id":"1945750","cover_y":2222,"is_new":false}
    ,{"rate":"6.8","cover_x":1179,"title":"欢乐满人间2","url":"https://movie.douban.com/subject/26611891/","playable":false,"cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2515404175.webp","id":"26611891","cover_y":1746,"is_new":false}
]}
'''

# 0. load the JSON string
obj=json.loads(content)

# 1. `[?()]`
results=jsonpath.jsonpath(obj,'$.subjects[?(float(@.rate)>=7)]')
print(type(results))
# <class 'list'>    
print(results)
#[{'rate': '7.1', 'cover_x': 2000, 'title': '奎迪:英雄再起', 'url': 'https://movie.douban.com/subject/26707088/', 'playable': False, 'cover': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2544510053.webp', 'id': '26707088', 'cover_y': 2800, 'is_new': False}
# , {'rate': '7.7', 'cover_x': 1500, 'title': '污垢', 'url': 'https://movie.douban.com/subject/1945750/', 'playable': False, 'cover': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2548709468.webp', 'id': '1945750', 'cover_y': 2222, 'is_new': False}
# ]

# 2. `.xxx`
results=jsonpath.jsonpath(obj,'$.subjects[?(float(@.rate)>=7)].title')
print(results)
# ['奎迪:英雄再起', '污垢']

# 3. `[index1,index2]`
results=jsonpath.jsonpath(obj,'$.subjects[0,2,3].cover_x')
print(results)
# [1000, 800, 1500]

# 4. `[start:end]`
results=jsonpath.jsonpath(obj,'$.subjects[0:3].cover_x')
print(results)
# [1000, 2000, 800]

# 5. `[start:end:step]`
results=jsonpath.jsonpath(obj,'$.subjects[0:3:2].cover_x')
print(results)
# [1000, 800]

# 6. `?( && )`,`?(,)`
# cover_x   cover_y
# 1000      1414
# 2000      2800
# 800       1185
# 1500      2222
# 1179      1746
results=jsonpath.jsonpath(obj,'$.subjects[?(@.cover_x>=1000 && @.cover_y<1500)]')
print(len(results))
# 1
results=jsonpath.jsonpath(obj,'$.subjects[?(@.cover_x>=1000,@.cover_y<1500)]')
print(len(results))
# 5
print('-------------------------')

Reference