Python Web Crawling with Requests & AioHttp

Requests Introduction

  • A third-party HTTP client library: official site | docs
  • Supports HTTP keep-alive and connection pooling, session persistence via cookies, file uploads, automatic detection of the response encoding, internationalized URLs, automatic encoding of POST data, and more

  • vs. urllib:

    • urllib and urllib2 are Python standard-library modules (urllib3 is a separate third-party package)
    • urllib and urllib2 are two independent modules (in Python 3, urllib2 was folded into urllib.request)
    • requests is built on urllib3 and supports keep-alive (e.g. several requests reusing one socket), which makes it more convenient; see the comparison sketch at the end of this section (urllib vs. requests)
  • Install

    pip install requests
    
  • Usage

    import requests
    
    response=requests.get('http://www.baidu.com')
    print(type(response))
    print(response.status_code,response.reason)
    print(response.encoding,response.apparent_encoding)
    print(response.request.headers)
    print(response.headers)
    print(response.content)
    
  • Notes:

    • Requests' default transport adapter uses blocking IO: Response.content blocks until the entire response has been downloaded (streaming lets you receive the response a little at a time, but it is still blocking)
    • For non-blocking behavior, consider asynchronous alternatives such as grequests or requests-futures
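
A minimal comparison sketch for the urllib vs. requests point above (httpbin.org is used purely as a test URL):

import urllib.request
import requests

url = 'http://httpbin.org/get'

# urllib.request: you read raw bytes and decode them yourself
with urllib.request.urlopen(url, timeout=5) as resp:
    print(resp.status, resp.read().decode('utf-8')[:60])

# requests: one call; decoding, keep-alive and connection pooling are handled for you
r = requests.get(url, timeout=5)
print(r.status_code, r.text[:60])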

Requests Basic Objects and Methods

Request Object

  • requests.request(method,url,**kwargs) : constructs a request; the base method that underpins all of the methods below (method is one of the 7 HTTP verbs, e.g. GET/PUT/POST)
  • requests.get(url,params=None,**kwargs)
  • requests.head(url,**kwargs)
  • requests.post(url,data=None,json=None,**kwargs)
  • requests.put/patch(url,data=None,**kwargs)
  • requests.delete(url,**kwargs)
  • Method parameters (a combined sketch follows this list):
    • url
    • params : appended to the URL as query parameters (dict or byte sequence)
    • data : the request body (dict, byte sequence, or file object)
    • json : the request body (JSON-serializable data)
    • headers : custom HTTP headers (dict)
    • cookies : cookies to send with the request (dict or CookieJar)
    • auth : HTTP authentication credentials (tuple)
    • files : files to upload (dict)
    • timeout : timeout in seconds; defaults to None, i.e. wait indefinitely
    • proxies : proxy servers to use, optionally with login credentials (dict)
    • allow_redirects : whether to follow redirects (True/False, default True)
    • stream : whether to defer downloading the response body (True/False)
      • False (default): download the body immediately into memory (a very large file can exhaust memory)
      • True: defer downloading the body until Response.content is accessed (the connection stays open until all data has been read or Response.close is called; send the request inside a with statement to guarantee it is closed)
    • verify : whether to verify the SSL certificate (True/False, default True)
    • cert : path to a local SSL client certificate
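
A minimal sketch combining several of the parameters above (params, headers, timeout, allow_redirects); httpbin.org is just a test endpoint and the header value is illustrative:

import requests

r = requests.get(
    'http://httpbin.org/get',
    params={'q': 'python'},              # appended to the URL: /get?q=python
    headers={'User-Agent': 'demo/1.0'},  # custom request header (illustrative value)
    timeout=5,                           # seconds; avoids waiting forever
    allow_redirects=True,                # follow redirects (the default)
)
print(r.url)           # http://httpbin.org/get?q=python
print(r.status_code)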

Response Object

  • Class: <class 'requests.models.Response'>
  • Status
    • response.status_code
    • response.reason
  • Body
    • response.raw (raw response body, a urllib3.response.HTTPResponse; read it with raw.read(); requires stream=True on the request)
    • response.content (body as bytes)
    • response.text (body as a string, decoded according to encoding)
    • response.json() (body parsed as JSON, typically a dict)
  • Headers
    • response.headers
    • response.request.headers
  • Encoding
    • response.encoding (encoding guessed from the HTTP headers)
    • response.apparent_encoding (fallback for encoding, inferred from the response content itself)
  • response.raise_for_status()
    • Raises requests.HTTPError if the status code indicates an error (4xx/5xx)
    • Note: no extra if statement is needed, which makes try/except error handling straightforward (see the sketch after this list)
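
A minimal sketch exercising the Response attributes listed above against a test endpoint:

import requests

r = requests.get('http://httpbin.org/get', timeout=5)
r.raise_for_status()                      # raises requests.HTTPError on 4xx/5xx
print(r.status_code, r.reason)            # 200 OK
print(r.encoding, r.apparent_encoding)    # encoding from headers vs. sniffed from the body
print(r.headers['Content-Type'])          # response headers (case-insensitive dict)
print(r.request.headers)                  # headers that were actually sent
print(r.content[:60])                     # body as bytes
print(r.text[:60])                        # body decoded to str
print(r.json()['url'])                    # body parsed as JSON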

Exception Objects

  • requests.HTTPError : HTTP error
  • requests.URLRequired : a valid URL is required
  • requests.Timeout : the request timed out
  • requests.ConnectTimeout : timed out while connecting to the remote server
  • requests.ConnectionError : network connection error, e.g. DNS lookup failure or connection refused
  • requests.TooManyRedirects : exceeded the maximum number of redirects (a catch-all handling sketch follows this list)
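
All of these derive from requests.RequestException, so a single handler can act as a catch-all; a minimal sketch (the /status/404 endpoint is just a convenient way to trigger HTTPError):

import requests

try:
    r = requests.get('http://httpbin.org/status/404', timeout=3)
    r.raise_for_status()                 # 404 -> raises requests.HTTPError
except requests.Timeout as e:
    print('timed out:', e)
except requests.RequestException as e:   # base class of all the exceptions above
    print('request failed:', e)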

Requests Basic Examples

Visit http://httpbin.org/

requests.get

>>> r=requests.get('http://httpbin.org/get')
>>> type(r)
<class 'requests.models.Response'>
>>> r.status_code,r.reason
(200,'OK')
>>> r.encoding,r.apparent_encoding
(None, 'ascii')
>>> r.headers
{'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Thu, 21 Mar 2019 16:40:42 GMT', 'Server': 'nginx', 'Content-Length': '184', 'Connection': 'keep-alive'}
>>> r.request.headers
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> r.json()
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  ...
}

requests.head

>>> r=requests.head('http://httpbin.org/get')
>>> r.text
''
>>> r.headers
{'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Tue, 19 Mar 2019 13:16:24 GMT', 'Server': 'nginx', 'Connection': 'keep-alive'}

requests.post+data/json

# `data={...}`
# dict: automatically encoded as form data
# content-type: application/x-www-form-urlencoded
# request body: key1=value1&key2=value2
>>> record={'key1':'value1','key2':'value2'}
>>> r=requests.post('http://httpbin.org/post',data=record)
>>> r.request.headers['content-type']
'application/x-www-form-urlencoded'
>>> r.json()
{'args': {}, 'data': '', 'files': {}, 'form': {'key1': 'value1', 'key2': 'value2'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '23', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'json': None, ...}

# `data='...'`
# string: sent as-is as the request body
# request body: 'ABC123'
>>> record="ABC123"
>>> r=requests.post('http://httpbin.org/post',data=record)
>>> print(r.request.headers.get('content-type',None))
None
>>> r.json()
{'args': {}, 'data': 'ABC123', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '6', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'json': None, ...}

# `json={...}`
# dict, serialized to JSON
# content-type: application/json
# request body: {"key1": "value1", "key2": "value2"}
>>> record={'key1':'value1','key2':'value2'}
>>> r = requests.request('POST', 'http://httpbin.org/post', json=record)
>>> r.request.headers['Content-Type']
'application/json'
>>> r.json()
{'args': {}, 'data': '{"key1": "value1", "key2": "value2"}', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '36', 'Content-Type': 'application/json', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'json': {'key1': 'value1', 'key2': 'value2'}, ...}

kwargs: params

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://httpbin.org/get', params=kv) 
>>> r.url
http://httpbin.org/get?key1=value1&key2=value2
>>> r.json()
{'args': {'key1': 'value1', 'key2': 'value2'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'}, 'origin': '117.83.222.100, 117.83.222.100', 'url': 'https://httpbin.org/get?key1=value1&key2=value2'}

kwargs: auth

import requests
import base64
Endpoint="http://httpbin.org"

# 1. basic auth
r=requests.request('GET',Endpoint+'/basic-auth/Tom/Tom111')
print(r.status_code,r.reason)
# 401 UNAUTHORIZED

r=requests.request('GET',Endpoint+'/basic-auth/Tom/Tom111',auth=('Tom','Tom123'))
print(r.status_code,r.reason)
# 401 UNAUTHORIZED

r=requests.request('GET',Endpoint+'/basic-auth/Tom/Tom123',auth=('Tom','Tom123'))
print(r.status_code,r.reason)
print(r.request.headers)
print(r.text)
# 200 OK
# {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Basic VG9tOlRvbTEyMw=='}
# {
#   "authenticated": true,
#   "user": "Tom"
# }
print(base64.b64decode('VG9tOlRvbTEyMw=='))     # b'Tom:Tom123'
print('--------------------------')

# 2. oauth
r=requests.request('GET',Endpoint+'/bearer')
print(r.status_code,r.reason)           # 401 UNAUTHORIZED
print(r.headers)                        # Note: 'WWW-Authenticate': 'Bearer'

r=requests.request('GET',Endpoint+'/bearer',headers={'Authorization':'Bearer 1234567'})
print(r.status_code,r.reason)           # 200 OK
print(r.headers)
print('--------------------------')

# 3. advanced: custom authentication (subclass requests.auth.AuthBase)
from requests.auth import AuthBase
class MyAuth(AuthBase):
    def __init__(self,authType,token):
        self.authType=authType
        self.token=token
    def __call__(self,req):
        req.headers['Authorization']=' '.join([self.authType,self.token])
        return req
r=requests.request('GET',Endpoint+'/bearer',auth=MyAuth('Bearer','123456'))
print(r.status_code,r.reason)                   # 200 OK
print("Request Headers:",r.request.headers)
print("Response Headers:",r.headers)
print("Response Text:",r.text)

kwargs: cookies

>>> r=requests.request('GET','http://httpbin.org/cookies/set?freedom=test123')
>>> r.cookies
>>> r.request.headers
{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': 'freedom=test123'}

>>> cookies = dict(cookies_are='working')       # {'cookies_are':'working'}
>>> r = requests.get('http://httpbin.org/cookies', cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

>>> jar = requests.cookies.RequestsCookieJar()
>>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
>>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/get')
>>> r = requests.get('http://httpbin.org/cookies', cookies=jar)
>>> r.text
'{"cookies": {"tasty_cookie": "yum"}}'

kwargs: timeout

def timeout_request(url,timeout):
    try:
        resp=requests.get(url,timeout=timeout)
        resp.raise_for_status()
    except (requests.Timeout, requests.HTTPError) as e:
        print(e)
    except Exception as e:
        print("unknown exception:",e)
    else:
        print(resp.text)
        print(resp.status_code)

timeout_request('http://httpbin.org/get',0.1)
# HTTPConnectionPool(host='httpbin.org', port=80): Max retries exceeded with url: /get (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1025d9400>, 'Connection to httpbin.org timed out. (connect timeout=0.1)'))

kwargs: proxies

>>> pxs = { 'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321' }
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

kwargs: files

f={'image': open('黑洞1.jpg', 'rb')}
r = requests.post(Endpoint+'/post', files=f)
print(r.status_code,r.reason)
print(r.headers)
print(r.text[100:200])

print('--------------------------')
# POST Multiple Multipart-Encoded Files
multiple_files = [
    ('images', ('黑洞1.jpg', open('黑洞1.jpg', 'rb'), 'image/jpg')),
    ('images', ('极光1.jpg', open('极光1.jpg', 'rb'), 'image/jpg'))
]
r = requests.post(Endpoint+'/post', files=multiple_files)
print(r.status_code,r.reason)
print(r.headers)
print(r.text[100:200])
print('--------------------------')

kwargs: stream

with requests.get(Endpoint+"/stream/3",stream=True) as r:
    print(r.status_code,r.reason)
    contentLength=int(r.headers.get('content-length',0))
    print("content-length:",contentLength)
    # at this point only the response headers have been downloaded; the connection stays open, so we can decide how to read the body
    if contentLength<100:
        print(r.content)
    else:
        print('read line by line')
        lines=r.iter_lines() # alternatively, iter_content downloads the body chunk by chunk
        for line in lines:  
            if line:
                print(line)             
    print('Done')
print('--------------------------')

Exception

import requests

def do_request(url):
  try:
    r=requests.get(url,timeout=0.1)
    r.raise_for_status()
    r.encoding=r.apparent_encoding
  except (requests.Timeout, requests.HTTPError) as e:
    print(e)
  except Exception as e:
    print("Request Error:",e)
  else:
    print(r.text)
    print(r.status_code)
    return r

if __name__=='__main__':
  do_request("http://www.baidu.com")

Requests Advanced Usage

Event hooks

def get_key_info(response,*args,**kwargs):
    print("callback:content-type",response.headers['Content-Type'])
r=requests.get(Endpoint+'/get',hooks=dict(response=get_key_info))
print(r.status_code,r.reason)

# callback:content-type application/json
# 200 OK

Session

  • Persist certain parameters across requests

      # cookies persist across all requests made from the same Session instance, which also reuses urllib3's connection pooling
      s = requests.Session()
      r=s.get(Endpoint+'/cookies/set/mycookie/123456')
      print("set cookies",r.status_code,r.reason)     # set cookies 200 OK
      r = s.get(Endpoint+"/cookies")
      print("get cookies",r.status_code,r.reason)     # get cookies 200 OK
      print(r.text)
      # {
      #   "cookies": {
      #     "mycookie": "123456"
      #   }
      # }
    
  • Provide default data for request methods

      # implemented by setting attributes on the Session object (note: method-level parameters override the session-level ones)
      s = requests.Session()
      s.auth = ('user', 'pass')
      s.headers.update({'x-test': 'true'})
      # both 'x-test' and 'x-test2' are sent
      r=s.get(Endpoint+'/headers', headers={'x-test2': 'true'})
      print(r.request.headers)
      # {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'x-test': 'true', 'x-test2': 'true', 'Authorization': 'Basic dXNlcjpwYXNz'}
    
  • Use as a context manager

      with requests.Session() as s:       # ensures the session is closed when the with block exits, even if an exception occurs
          s.get('http://httpbin.org/cookies/set/mycookie/Test123')
          r = s.get(Endpoint+"/cookies")
          print("set cookies",r.status_code,r.reason)
          print(r.text)
          # {
          #   "cookies": {
          #     "mycookie": "Test123"
          #   }
          # }
      print("out with:")
      r = s.get(Endpoint+"/cookies")
      print("get cookies",r.status_code,r.reason)
      print(r.text)
      # {
      #   "cookies": {
      #     "mycookie": "Test123"
      #   }
      # }
    

Prepared Request

# allows extra processing of the body/headers etc. before the request is sent

from requests import Request, Session

s=Session()
req=Request('GET',Endpoint+'/get',headers={'User-Agent':'fake1.0.0'})
prepared=req.prepare()  # to get a PreparedRequest that carries session state (e.g. cookies), use s.prepare_request(req) instead; see the sketch after this example

# could do something with prepared.body/prepared.headers here
# ...

resp=s.send(prepared,timeout=3)
print(resp.status_code,resp.reason)
print("request headers:",resp.request.headers)
# {'User-Agent': 'fake1.0.0'}

print("response headers:",resp.headers)
# {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/json', 'Date': 'Thu, 21 Mar 2019 15:47:30 GMT', 'Server': 'nginx', 'Content-Length': '216', 'Connection': 'keep-alive'}

print(resp.text)
# {
#   "args": {},
#   "headers": {
#     "Accept-Encoding": "identity",
#     "Host": "httpbin.org",
#     "User-Agent": "fake1.0.0"
#   },
#   "origin": "117.83.222.100, 117.83.222.100",
#   "url": "https://httpbin.org/get"
# }
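
As noted in the comment above, s.prepare_request(req) (unlike req.prepare()) merges session state such as cookies, default headers and auth into the prepared request; a minimal sketch (the x-default header is purely illustrative):

from requests import Request, Session

s = Session()
s.headers.update({'x-default': 'from-session'})   # session-level default header

req = Request('GET', Endpoint+'/headers', headers={'User-Agent': 'fake1.0.0'})
prepared = s.prepare_request(req)   # merges session state into the PreparedRequest
print(prepared.headers)             # contains both 'x-default' and the fake User-Agent

resp = s.send(prepared, timeout=3)
print(resp.status_code, resp.reason)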

Chunk-Encoded Requests

# chunked transfer encoding: pass a generator or any iterator without a known length
def gen():
    yield b'hi '
    yield b'there! '
    yield b'How are you?'
    yield b'This is for test 123567890.....!'
    yield b'Test ABCDEFG HIGKLMN OPQ RST UVWXYZ.....!'

r=requests.post(Endpoint+'/post', data=gen())    # stream=True
print(r.status_code,r.reason,r.headers['content-length'])
for chunk in r.iter_content(chunk_size=100):        # chunk_size=None
    if chunk:
        print(chunk)
print('done')

Requests Application Examples

Download pic from http://www.nationalgeographic.com.cn

One-shot download (small file, stream=False)

import requests
import os

def download_small_file(url):
    try:
        r=requests.get(url)
        r.raise_for_status()
        print(r.status_code,r.reason)
        contentType=r.headers["Content-Type"]
        contentLength=int(r.headers.get("Content-Length",0))
        print(contentType,contentLength)
    except Exception as e:
        print(e)
    else:
        filename=r.url.split('/')[-1]
        print('filename:',filename)
        target=os.path.join('.',filename)
        if os.path.exists(target) and os.path.getsize(target):
            print('Exist -- Skip download!')
        else:
            with open(target,'wb') as fd:
                fd.write(r.content)
    print('done!')

if __name__ == '__main__':
    import time
    print('start')
    start = time.time()

    url="http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
    download_small_file(url)

    end=time.time()
    print('Runs %0.2f seconds.' % (end - start))
    print('end')

Streamed chunked download (large file, stream=True)

import requests
import os

def download_large_file(url):
    r = None
    try:
        r=requests.get(url,stream=True)
        r.raise_for_status()
        print(r.status_code,r.reason)
        contentType=r.headers["Content-Type"]
        contentLength=int(r.headers.get("Content-Length",0))
        print(contentType,contentLength)        
    except Exception as e:
        print(e)
    else:
        filename=r.url.split('/')[-1]
        print('filename:',filename)
        target=os.path.join('.',filename)
        if os.path.exists(target) and os.path.getsize(target):
            print('Exist -- Skip download!')
        else:
            with open(target,'wb') as fd:
                for chunk in r.iter_content(chunk_size=10240):
                    if chunk:
                        fd.write(chunk)
                        print('download:',len(chunk))
    finally:
        if r is not None:   # r stays None if requests.get itself failed
            r.close()
        print('close')

if __name__ == '__main__':
    import time
    print('start')
    start = time.time()

    url="http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
    download_large_file(url)

    end=time.time()
    print('Runs %0.2f seconds.' % (end - start))
    print('end')

Show a progress bar

import requests
import os

def download_with_progress(url):
    try:
        with requests.get(url,stream=True) as r:
            r.raise_for_status()
            print(r.status_code,r.reason)

            contentType=r.headers["Content-Type"]
            contentLength=int(r.headers.get("Content-Length",0))
            print(contentType,contentLength)

            filename=r.url.split('/')[-1]
            print('filename:',filename)

            target=os.path.join('.',filename)
            if os.path.exists(target) and os.path.getsize(target):
                print('Exist -- Skip download!')
            else:
                chunk_size=1024
                progress = ProgressBar(filename, total=contentLength, chunk_size=chunk_size, unit="KB")
                with open(target,'wb') as fd:
                    for chunk in r.iter_content(chunk_size=chunk_size):
                        if chunk:
                            fd.write(chunk)
                            #print('download:',len(chunk))
                            progress.refresh(len(chunk))

    except Exception as e:
        print(e)
    print('done')

# ProgressBar
class ProgressBar(object):
    def __init__(self,title,total,chunk_size=1024,unit='KB'):
        self.title=title
        self.total=total
        self.chunk_size=chunk_size
        self.unit=unit
        self.progress=0.0

    def __info(self):
        return "【%s】%s %.2f%s / %.2f%s" % (self.title,self.status,self.progress/self.chunk_size,self.unit,self.total/self.chunk_size,self.unit)

    def refresh(self,progress):
        self.progress += progress
        self.status="......"
        end_str='\r'
        if self.total>0 and self.progress>=self.total:
            end_str='\n'
            self.status='completed'
        print(self.__info(),end=end_str)

if __name__ == '__main__':
    import time
    print('start')
    start = time.time()

    url="http://image.nationalgeographic.com.cn/2017/0211/20170211061910157.jpg"
    download_with_progress(url)

    end=time.time()
    print('Runs %0.2f seconds.' % (end - start))
    print('end')

Multi-task downloads

  • Multi-process download: parallel (simultaneous)

      import multiprocessing
      from multiprocessing import Pool
    
      # do multiple downloads - multiprocessing
      def do_multiple_download_multiprocessing(url_list,targetDir):
          cpu_cnt=multiprocessing.cpu_count()
          print("系统进程数: %s, Parent Pid: %s" % (cpu_cnt,os.getpid()))
    
          p = Pool(cpu_cnt)
          results=[]
          for i,url in enumerate(url_list):
              result=p.apply_async(do_download,args=(i,url,targetDir,False,),callback=print_return)
              results.append(result)
          print('Waiting for all subprocesses done...')
          p.close()
          p.join()
          for result in results:
              print(os.getpid(),result.get())
          print('All subprocesses done.')
    
      # callback
      def print_return(result):
          print(os.getpid(),result)
    
  • Multi-threaded download: concurrent (interleaved)

      import threading
      def do_multiple_downloads_threads(url_list,targetDir):
          thread_list=[]
          for i,url in enumerate(url_list):
              thread=threading.Thread(target=do_download,args=(i,url,targetDir,True,))
              thread.start()
              thread_list.append(thread)
          print('Waiting for all threads done...')
          for thread in thread_list:
              thread.join()
          print('All threads done.')
    
  • verify

      import requests
      from bs4 import BeautifulSoup
      import os,time,threading
      import re
    
      # do download using `requests`
      def do_download(i,url,targetDir,isPrint=False):
          headers={
              'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0'
          }
          try:
              response=requests.get(url,headers=headers,stream=True,verify=False)
              response.raise_for_status()
          except Exception as e:
              print("Occur Exception:",e)
          else:
              content_length = int(response.headers.get('Content-Length',0))
              filename=str(i)+"."+url.split('/')[-1]
              print(response.status_code,response.reason,content_length,filename)
              progressBar=ProgressBar(filename, total=content_length,chunk_size=1024,unit="KB")
              with open(os.path.join(targetDir,filename),'wb') as fd:
                  for chunk in response.iter_content(chunk_size=1024):
                      if chunk:
                          fd.write(chunk)
                          progressBar.refresh(len(chunk))
              if isPrint:
                  print(os.getpid(),threading.current_thread().name,filename,"Done!")
              return '%s %s %s Done' % (os.getpid(),threading.current_thread().name,filename)
    
      # prepare download urls
      def url_list_crawler():
          url="http://m.ngchina.com.cn/travel/photo_galleries/5793.html"
          response=requests.get(url)
          print(response.status_code,response.reason,response.encoding,response.apparent_encoding)
          response.encoding=response.apparent_encoding
          soup=BeautifulSoup(response.text,'html.parser')
          #results=soup.select('div#slideBox ul a img')
          #results=soup.find_all('img')
          results=soup.select("div.sub_center img[src^='http']")
          url_list=[ r["src"] for r in results]
          print("url_list:",len(url_list))
          print(url_list)
          return url_list
    
      # main
      if __name__=='__main__':
          print('start')
    
          targetDir="/Users/cj/space/python/download"
          url="http://image.ngchina.com.cn/2019/0325/20190325110244384.jpg"
          url_list=url_list_crawler()
    
          start=time.time()
    
          # 0 download one file using `requests`
          do_download("A",url,targetDir)
          end = time.time()
          print('Total cost %0.2f seconds.' %  (end-start))
          start=end
    
          # 1 using multiple processings
          do_multiple_download_multiprocessing(url_list,targetDir)
          end = time.time()
          print('Total cost %0.2f seconds.' %  (end-start))
          start=end
    
          # 2 using multiple threads
          do_multiple_downloads_threads(url_list,targetDir)
          end = time.time()
          print('Total cost %0.2f seconds.' %  (end-start))
          start=end
    
          print('end')
    

aiohttp

Official site

Asynchronous HTTP Client/Server for asyncio and Python.

  • Supports both the HTTP client and the HTTP server side
  • A library for asynchronous web services (requests, by contrast, is synchronous and blocking); a concurrent-fetch sketch follows the Client Sample below
  • Supports server/client WebSockets without callback hell
  • install: pip install aiohttp

Client Sample

Refer to Client Quickstart

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://httpbin.org/headers')
        print(html)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
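
A minimal sketch extending the client sample above to fetch several URLs concurrently with asyncio.gather (asyncio.run requires Python 3.7+; the URLs are just httpbin test endpoints):

import asyncio
import aiohttp

async def fetch_status(session, url):
    async with session.get(url) as response:
        return url, response.status

async def main():
    urls = ['http://httpbin.org/get',
            'http://httpbin.org/headers',
            'http://httpbin.org/ip']
    async with aiohttp.ClientSession() as session:
        # schedule all requests at once; they run concurrently on a single thread
        results = await asyncio.gather(*(fetch_status(session, u) for u in urls))
        for url, status in results:
            print(status, url)

asyncio.run(main())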

Server Sample

Refer to Web Server Quickstart

from aiohttp import web

async def handle(request):
    name = request.match_info.get('name', "Anonymous")
    text = "Hello, " + name
    return web.Response(text=text)

app = web.Application()
app.add_routes([web.get('/', handle),
                web.get('/{name}', handle)])

web.run_app(app)

Application: downloading files concurrently with coroutines

Single-threaded & asynchronous & non-blocking

  • do download using aiohttp

      async def do_aiohttp_download(session,i,url,targetDir):
          async with session.get(url) as response:
              content_length = int(response.headers.get('Content-Length',0))
              filename=str(i)+"."+url.split('/')[-1]
              print(response.status,response.reason,content_length,filename)
              progressBar=ProgressBar(filename, total=content_length,chunk_size=1024,unit="KB")
              with open(os.path.join(targetDir,filename),'wb') as fd:
                  while True:
                      chunk=await response.content.read(1024)
                      if not chunk:
                          break
                      fd.write(chunk)
                      progressBar.refresh(len(chunk))
              await response.release()
          # print(filename,"Done!")
          return filename
    
      # callback
      def print_async_return(task):
          print(task.result(),"Done")
    
      def print_async_return2(i,task):
          print(i,":",task.result(),"Done")
    
  • case1: do one download

      async def do_async_download(i,url,targetDir):
          async with aiohttp.ClientSession() as session:
              return await do_aiohttp_download(session,i,url,targetDir)
    
  • case2: do multiple download

      # do multiple downloads - asyncio
      async def do_multiple_downloads_async(url_list,targetDir):
          async with aiohttp.ClientSession() as session:
              # tasks=[do_aiohttp_download(session,i,url,targetDir) for i,url in enumerate(url_list)]
              # await asyncio.gather(*tasks)
              tasks=[]
              for i,url in enumerate(url_list):
                  task=asyncio.create_task(do_aiohttp_download(session,i,url,targetDir))
                  # task.add_done_callback(print_async_return)
                  task.add_done_callback(functools.partial(print_async_return2,i))
                  tasks.append(task)
              # gather once, after all tasks have been scheduled (gathering inside the loop would serialize the downloads)
              await asyncio.gather(*tasks)
    
  • verify

      import os,time
      import asyncio
      import aiohttp
      import functools
      import re
      import requests                 # used by url_list_crawler below
      from bs4 import BeautifulSoup   # used by url_list_crawler below
    
      # prepare download urls
      def url_list_crawler():
          url="http://m.ngchina.com.cn/travel/photo_galleries/5793.html"
          response=requests.get(url)
          print(response.status_code,response.reason,response.encoding,response.apparent_encoding)
          response.encoding=response.apparent_encoding
          soup=BeautifulSoup(response.text,'html.parser')
          #results=soup.select('div#slideBox ul a img')
          #results=soup.find_all('img')
          results=soup.select("div.sub_center img[src^='http']")
          url_list=[ r["src"] for r in results]
          print("url_list:",len(url_list))
          print(url_list)
          return url_list
    
      # main          
      if __name__=='__main__':
          print('start')
    
          targetDir="/Users/cj/space/python/download"
          url="http://image.ngchina.com.cn/2019/0325/20190325110244384.jpg"
          url_list=url_list_crawler()
    
          start=time.time()
          # 1. download one file using `aiohttp`
          loop=asyncio.get_event_loop()
          loop.run_until_complete(do_async_download("A",url,targetDir))
          end = time.time()
          print('Total cost %0.2f seconds.' %  (end-start))
          start=end

          # 2. download many files using `aiohttp`
          # reuse the same loop; closing it after step 1 would make run_until_complete fail here
          loop.run_until_complete(do_multiple_downloads_async(url_list,targetDir))
          loop.close()
          end = time.time()
          print('Total cost %0.2f seconds.' %  (end-start))
          start=end
    
          print('end')
    

Reference