Python Web Crawlers

奇舞精选 · WeChat official account · 2024-07-04 18:00


The author is a front-end engineer at 360 奇舞团.

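Concept

A web crawler is a program that poses as a client in order to exchange data with a server; the key is to imitate how a human user behaves.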

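Steps of a crawler

Sending the network request

Request modules

requests: the most widely used HTTP library for Python (a third-party package). It sends HTTP/HTTPS requests and supports cookies, file uploads, and more.
urllib: part of the Python standard library. It provides interfaces for opening and reading URLs.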

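import requests

# The URL needs a scheme; without one, requests raises MissingSchema
res = requests.get('https://demo.com/?1')
print(res.content)  # raw response body as bytes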

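Setting the User-Agent

The User-Agent header tells the server about the client: its operating system, CPU type, browser and version, rendering engine, browser language, plugins, and so on.
Purpose: presenting different client environments can help get around the blocking and restrictions that some sites place on crawlers.
Custom User-Agent pool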

ua_list = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
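
One way such a pool might be used is to pick an entry at random for every request. The sketch below is only an illustration: the name UA_POOL, the helper build_request, and the URL are assumptions, and the request object is built but never sent.

```python
import random
from urllib import request

# A shortened pool for illustration; in practice this would be the full list.
UA_POOL = [
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
]


def build_request(url, pool=UA_POOL):
    """Return a urllib Request whose User-Agent is drawn at random from pool."""
    headers = {'User-Agent': random.choice(pool)}
    return request.Request(url=url, headers=headers)


# Placeholder URL; building the Request does not touch the network.
req = build_request('https://demo.com/?1')
# urllib stores header names capitalized, hence 'User-agent' here.
print(req.get_header('User-agent'))
```

Choosing the header per request, rather than once per process, spreads requests across the pool.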

A third-party module such as fake-useragent can also supply random browser UA strings.

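from fake_useragent import UserAgent

ua = UserAgent()

print(ua.chrome)  # a Chrome User-Agent string
print(ua.chrome)  # may differ from the line above: each access picks one at random
print(ua.random)  # a User-Agent string from any supported browser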

Getting the response content

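import ssl
import certifi  # third-party package that ships a CA certificate bundle
from urllib import request

url = 'https://demo.com/?1'  # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'}
context = ssl.create_default_context(cafile=certifi.where())  # verify HTTPS against certifi's bundle
req = request.Request(url=url, headers=headers)
res = request.urlopen(req, context=context)
html = res.read().decode('utf-8')  # assumes the page is UTF-8 encoded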

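Parsing the content

BeautifulSoup
Extracts data from HTML or XML files. It is not a parser itself; it works on top of a parser such as lxml or the built-in html.parser.
lxml
Processes XML and HTML data, and can be combined with BeautifulSoup as its parser.
Python re regular-expression parsing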

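from bs4 import BeautifulSoup

html_doc = """
<html><head><title>Site Title</title></head>
<body>
<p class="title">Site Title</p>
<p class="story">This is a paragraph.</p>
<p class="story">...</p>
</body></html>
"""

# Parse with the lxml parser (requires the lxml package)
soup = BeautifulSoup(html_doc, 'lxml')

# Find and print the title
title_tag = soup.title
print(title_tag.string)  # Site Title

# Find all paragraphs
p_tags = soup.find_all('p')
for tag in p_tags:
    print(tag.string)

# Find paragraphs with a specific class name
story_paragraphs = soup.find_all('p', class_='story')
for para in story_paragraphs:
    print(para.string)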
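
The re option listed above has no example of its own, so here is a minimal sketch of pulling links out of a page with a regular expression. The sample HTML and the name LINK_RE are made up for illustration. A pattern like this only suits simple, predictable markup; for anything nested or irregular, a real parser is usually the safer choice.

```python
import re

# Hypothetical page fragment used only for this example.
html = '''
<a href="https://demo.com/a">First</a>
<a class="nav" href='/b'>Second</a>
'''

# Capture the href value and the link text; accepts single or double quotes.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref=["\']([^"\']+)["\'][^>]*>(.*?)</a>',
    re.S | re.I,
)

links = LINK_RE.findall(html)
for href, text in links:
    print(href, text)
```

findall returns one (href, text) tuple per matched link, in document order.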

Storing the data

CSV files

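Writing
import csv

filename = 'people.csv'  # placeholder file name
data = [['Alice', 30], ['Bob', 25]]  # placeholder rows

# newline='' lets the csv module control line endings itself
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Age'])  # header row
    for row in data:
        writer.writerow(row)

Reading
with open(filename, 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)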

Crawled data can be stored in a relational database (such as MySQL or PostgreSQL), in a non-relational store (such as MongoDB or Redis), or in local files.
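
As a rough sketch of the database option, the snippet below uses Python's built-in sqlite3 module as a stand-in for a relational database such as MySQL or PostgreSQL. The table name people, its columns, and the sample rows are assumptions made for illustration; a real crawler would connect to its own database with the matching driver.

```python
import sqlite3

# Hypothetical scraped records: (name, age)
rows = [('Alice', 30), ('Bob', 25)]

# ':memory:' keeps the example self-contained; pass a file path to persist data.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE IF NOT EXISTS people (name TEXT PRIMARY KEY, age INTEGER)'
)

# Parameterized statements avoid building SQL from scraped text.
# INSERT OR REPLACE keeps a re-crawled record from appearing twice.
conn.executemany(
    'INSERT OR REPLACE INTO people (name, age) VALUES (?, ?)', rows
)
conn.commit()

saved = conn.execute('SELECT name, age FROM people ORDER BY name').fetchall()
print(saved)
conn.close()
```

Using the name as the primary key is only one possible way to deduplicate; a URL or record ID is often a better key in practice.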

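Practical considerations

Comply with laws and regulations: when writing a crawler, follow the rules in the target site's robots.txt and respect its copyright and privacy policies.
Set a reasonable request rate: avoid sending requests so frequently that they put pressure on the target site's servers.
Handle exceptions: deal properly with network errors, parsing errors, and other failures that may occur.
Clean the data: clean and organize the crawled data, removing useless information to keep it accurate and usable.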
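
The first three points can be combined in one small helper. This is a sketch under stated assumptions rather than a complete crawler: the function names, the one-second default delay, the retry count, and the sample robots.txt rules are all hypothetical, and the fetch step is passed in as a callable so it can be swapped or tested without network access.

```python
import time
from urllib import error, request, robotparser


def make_robot_checker(robots_lines):
    """Build a RobotFileParser from the lines of a robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser


def polite_fetch(url, checker, fetch, user_agent='*', delay=1.0, retries=2):
    """Fetch url only if robots.txt allows it, pausing before each attempt.

    fetch is any callable that takes a URL and returns the body.
    Returns None when the URL is disallowed or every attempt fails.
    """
    if not checker.can_fetch(user_agent, url):
        return None
    for attempt in range(retries + 1):
        time.sleep(delay)  # throttle so the server is not hammered
        try:
            return fetch(url)
        except (error.URLError, TimeoutError) as exc:
            print(f'attempt {attempt + 1} failed: {exc}')
    return None


def urllib_fetch(url):
    """One possible fetch callable, based on urllib."""
    with request.urlopen(url, timeout=10) as res:
        return res.read().decode('utf-8')


# Sample rules; a real crawler would download the site's own robots.txt.
rules = ['User-agent: *', 'Disallow: /private/']
checker = make_robot_checker(rules)
print(checker.can_fetch('*', 'https://demo.com/index.html'))
print(checker.can_fetch('*', 'https://demo.com/private/x'))
```

A call such as polite_fetch('https://demo.com/index.html', checker, urllib_fetch) would then perform the real request, with demo.com standing in for the target site.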