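Introduction
This article helps you learn the fundamentals of e-commerce data scraping, the tools involved, working Python scripts, and techniques for handling common challenges, so that you can extract data efficiently.
The tutorial explains what e-commerce scraping is, the main types of scrapers, the data they can extract, and how to build one in Python, with example scripts for Amazon, Walmart, and eBay.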
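E-commerce web scraping is the process of extracting data from online retail platforms such as Amazon, Walmart, and eBay. The data can be copied by hand, but it is usually collected with automated tools or scripts.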
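Data extracted from e-commerce sites can help you:
- Analyze product price fluctuations
- Track reviews and ratings
- Identify market trends
- Research competitors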
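These insights support informed decisions and strategic planning.
Note: tools for scraping e-commerce data are commonly called e-commerce scrapers.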
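The common types of e-commerce scraping tools are:
- Custom scripts: tailor-made scripts written in a programming language such as Python or JavaScript.
- No-code scrapers: tools that extract data without any programming, suited to non-technical users.
- Web scraping APIs: interfaces that return structured data programmatically and support real-time or large-scale scraping.
- Browser extensions: browser plug-ins that collect data directly on e-commerce pages.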
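E-commerce scrapers can typically extract the following data:
- Product details: name, description, specifications, images
- Pricing information: current price, discounts, historical price trends
- Customer reviews: ratings, review text, feedback
- Categories and tags: product categories and tags
- Seller information: name, rating, contact details
- Shipping details: shipping cost, delivery time, policies
- Inventory status: stock level, out-of-stock notices
- Marketing data: product listings, pricing strategies, promotions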
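One way to hold a few of the fields listed above is a small record type. The sketch below is only an illustration: the `ProductRecord` class, its field names, and the `to_json` helper are hypothetical and are not part of any library or of the scripts in this tutorial.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ProductRecord:
    """Hypothetical container for scraped product data (field names are assumptions)."""
    url: str
    name: str
    image: Optional[str] = None
    price: Optional[float] = None     # current price, if the page shows one
    rating: Optional[float] = None    # average review rating
    in_stock: Optional[bool] = None   # inventory status
    categories: List[str] = field(default_factory=list)


def to_json(records: List[ProductRecord]) -> str:
    """Serialize records to JSON; ensure_ascii=False keeps non-ASCII names readable."""
    return json.dumps([asdict(r) for r in records], indent=4, ensure_ascii=False)


# Placeholder values, not real scraped data
record = ProductRecord(url="https://example.com/item/1", name="Example laptop", price=499.99)
print(to_json([record]))
```

Fields a page does not expose simply stay `None`, so the same record shape can be reused across sites that publish different amounts of detail.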
Now let's see how to build an e-commerce scraper with Python!
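Before building an e-commerce scraper by hand, get familiar with the structure of the target site by inspecting its pages with the browser's developer tools (DevTools). For pages that do not depend on JavaScript rendering, two Python libraries are enough:
- Requests: sends HTTP requests and retrieves the raw HTML of a page.
- Beautiful Soup: parses HTML/XML documents and simplifies data extraction.
Install both with pip:
pip install requests beautifulsoup4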
For sites that load content dynamically or rely on JavaScript rendering (such as Amazon), use Selenium instead.
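A custom scraper typically works as follows:
- Connect to the target site: use Requests or Selenium to fetch the page and parse its HTML.
- Select the target elements: locate elements such as product images and prices with CSS selectors or XPath.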
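The element-selection step can be sketched with Beautiful Soup on an inline HTML string, so that it runs without any network access. The markup, the `li.product` selector, and the `parse_products` helper are made up for illustration; a real site needs selectors found by inspecting its own pages in DevTools.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded product listing page
SAMPLE_HTML = """
<ul>
  <li class="product"><a href="/item/1"><h2>Mouse A</h2></a><img src="/img/1.jpg"></li>
  <li class="product"><a href="/item/2"><h2>Mouse B</h2></a></li>
</ul>
"""


def parse_products(html):
    """Return one dict per product card found with CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select("li.product"):
        link = item.select_one("a")
        title = item.select_one("h2")
        image = item.select_one("img")
        # Skip cards that lack the essential elements
        if link is None or title is None:
            continue
        products.append({
            "url": link.get("href"),
            "name": title.get_text(strip=True),
            "image": image.get("src") if image else None,
        })
    return products


print(parse_products(SAMPLE_HTML))
```

Using `select_one` with a `None` check, rather than indexing the result of `select`, lets the loop tolerate cards that are missing an optional element such as the image.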
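This approach gives you full control over the process and a high degree of customization, but it takes technical knowledge to maintain, and every website needs its own script.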
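Below are example Python scraping scripts for Amazon, Walmart, and eBay:
Amazon scraping example
Amazon employs anti-scraping measures, so the script uses Selenium to get past them: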
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import json

# Start a Chrome session controlled by Selenium
driver = webdriver.Chrome(service=Service())

# Open the Amazon home page and search for "laptop"
driver.get("https://amazon.com/")
search_input_element = driver.find_element(By.ID, "twotabsearchtextbox")
search_input_element.send_keys("laptop")
search_button_element = driver.find_element(By.ID, "nav-search-submit-button")
search_button_element.click()

# Collect the URL, name, and image of each product on the results page
products = []
product_elements = driver.find_elements(By.CSS_SELECTOR, "[role=\"listitem\"][data-asin]")
for product_element in product_elements:
    url_element = product_element.find_element(By.CSS_SELECTOR, ".a-link-normal")
    url = url_element.get_attribute("href")
    name_element = product_element.find_element(By.CSS_SELECTOR, "h2")
    name = name_element.text
    image_element = product_element.find_element(By.CSS_SELECTOR, "img[data-image-load]")
    image = image_element.get_attribute("src")
    product = {
        "url": url,
        "name": name,
        "image": image
    }
    products.append(product)

# Export the scraped data to a JSON file
with open("products.json", "w", encoding="utf-8") as json_file:
    json.dump(products, json_file, indent=4)

# Close the browser and release its resources
driver.quit()

The resulting products.json file contains:
[
{
"url": "https://www.amazon.com/A315-24P-R7VH-Display-Quad-Core-Processor-Graphics/dp/B0BS4BP8FB/ref=sr_1_3?crid=1W7R6D59KV9L1&dib=eyJ2IjoiMSJ9.iBCtzwnCm6CE8Bx8hKmQ8ez6PkzMg3asWNhAxvflBg3pKVi5IxQUSDpcaksihO-jEO1nyLGkdoGk_2hNyQ7EWOa6epS_hZHxqV7msqdtcEZv4irFZRnYHcP5YnEwKu17BjsYS_IPI1tFVDS65v_roSCu_IiBNfotAEHSx4zOwQ4u1CRKfvnLjIX4VlECydRjsKaAQ-mErT89tyBUCfEGjzKPPZxwHi3Y0MoieuPceL8.jIuIrqzxNYISYPLHifRJq289Vy9Z6hqT8vmMcUQw9HY&dib_tag=se&keywords=laptop&qid=1735572968&sprefix=l%2Caps%2C271&sr=8-3",
"name": "Acer Aspire 3 A315-24P-R7VH Slim Laptop | 15.6\" Full HD IPS Display | AMD Ryzen 3 7320U Quad-Core Processor | AMD Radeon Graphics | 8GB LPDDR5 | 128GB NVMe SSD | Wi-Fi 6 | Windows 11 Home in S Mode",
"image": "https://m.media-amazon.com/images/I/61gKkYQn6lL._AC_UY218_.jpg"
},
// omitted for brevity...
{
"url": "https://www.amazon.com/Lenovo-Newest-Flagship-Chromebook-HubxcelAccesory/dp/B0CBJ46QZX/ref=sr_1_8?crid=1W7R6D59KV9L1&dib=eyJ2IjoiMSJ9.iBCtzwnCm6CE8Bx8hKmQ8ez6PkzMg3asWNhAxvflBg3pKVi5IxQUSDpcaksihO-jEO1nyLGkdoGk_2hNyQ7EWOa6epS_hZHxqV7msqdtcEZv4irFZRnYHcP5YnEwKu17BjsYS_IPI1tFVDS65v_roSCu_IiBNfotAEHSx4zOwQ4u1CRKfvnLjIX4VlECydRjsKaAQ-mErT89tyBUCfEGjzKPPZxwHi3Y0MoieuPceL8.jIuIrqzxNYISYPLHifRJq289Vy9Z6hqT8vmMcUQw9HY&dib_tag=se&keywords=laptop&qid=1735572968&sprefix=l%2Caps%2C271&sr=8-8",
"name": "Lenovo Newest Flagship Chromebook, 14'' FHD Touchscreen Slim Thin Light Laptop Computer, 8-Core MediaTek Kompanio 520 Processor, 4GB RAM, 64GB eMMC, WiFi 6,Chrome OS+HubxcelAccesory, Abyss Blue",
"image": "https://m.media-amazon.com/images/I/61KlKRdsQ7L._AC_UY218_.jpg"
}
]
If you run into a CAPTCHA, try using SeleniumBase instead.
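The product URLs in the sample output above carry long search-tracking query strings. As an optional post-processing step, the sketch below pulls out the 10-character product ID (ASIN) that follows `/dp/` and builds a shorter link from it. The helper names are hypothetical, the `sample` URL is a trimmed copy of the first URL in the output, and the code assumes the `/dp/<ASIN>` URL form seen there.

```python
import re
from typing import Optional

# Matches the 10-character ID that follows "/dp/" in an Amazon product URL
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(url: str) -> Optional[str]:
    """Return the ASIN found in the URL, or None when the URL has no /dp/ segment."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


def short_product_url(url: str) -> Optional[str]:
    """Build a short product link from the ASIN (assumes the amazon.com domain)."""
    asin = extract_asin(url)
    return f"https://www.amazon.com/dp/{asin}" if asin else None


# Trimmed version of the first URL in the sample output
sample = (
    "https://www.amazon.com/A315-24P-R7VH-Display-Quad-Core-Processor-Graphics"
    "/dp/B0BS4BP8FB/ref=sr_1_3?keywords=laptop"
)
print(short_product_url(sample))
```

Storing the ASIN alongside each record also gives a stable key for deduplicating products that appear more than once across searches.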
Walmart scraping example
- Target page: Walmart "keyboard" search page
- URL: https://www.walmart.com/search?q=keyboard
Walmart likewise requires Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import json

# Start a Chrome session controlled by Selenium
driver = webdriver.Chrome(service=Service())

# Open the Walmart search results page for "keyboard"
driver.get("https://www.walmart.com/search?q=keyboard")

# Collect the URL, name, and image of each product in the results
products = []
product_elements = driver.find_elements(By.CSS_SELECTOR, ".carousel-4[data-testid=\"carousel-container\"] li")
for product_element in product_elements:
    url_element = product_element.find_element(By.CSS_SELECTOR, "a")
    url = url_element.get_attribute("href")
    name_element = product_element.find_element(By.CSS_SELECTOR, "h3")
    # innerText returns the text even when the element is not visible
    name = name_element.get_attribute("innerText")
    image_element = product_element.find_element(By.CSS_SELECTOR, "img[data-testid=\"productTileImage\"]")
    image = image_element.get_attribute("src")
    product = {
        "url": url,
        "name": name,
        "image": image
    }
    products.append(product)

# Export the scraped data to a JSON file
with open("products.json", "w", encoding="utf-8") as json_file:
    json.dump(products, json_file, indent=4)

# Close the browser and release its resources
driver.quit()

The resulting products.json file contains:
[
{
"url": "https://www.walmart.com/sp/track?bt=1&eventST=click&plmt=sp-search-middle~desktop~Results%20for%20%22Electronics%22&pos=1&tax=3944_1089430_132959_1008621_7197407&rdf=1&rd=https%3A%2F%2Fwww.walmart.com%2Fip%2FLogitech-920-004536-Mk270-Keyboard-Mouse-USB-Wireless-Combo-Black%2F28540111%3FclassType%3DREGULAR%26adsRedirect%3Dtrue&adUid=094fb4ae-62f3-4954-ae99-b2938550d72c&mloc=sp-search-middle&pltfm=desktop&pgId=keyboard&pt=search&spQs=sAX_0l4wzWXzBji34bVpmheXU7_ETXGbDXcA9LhcshG_YbqBx24VWzt7yesHivpt1lpckuNhxQqbLidA-d8L4agqx_YPQVlj2EfM_TnEyfsSWiTEkvBaqgkaMzy6bgIZ4eC8t9-qqz7qtb7uXMz3cH92UCf5EEgQlfKwnxJ-SAF1EW1ouCjC10Ur3hELs3143xQPjxNUSUoN8FIF12fxJmTlSlTe4makoj1s2NoubYTqnlJLs3pohowJCRFT76Vl&storeId=3081&couponState=na&bkt=ace1_default%7Cace2_default%7Cace3_default%7Ccoldstart_off%7Csearch_default&classType=REGULAR",
"name": "Logitech Wireless Combo MK270",
"image": "https://i5.walmartimages.com/seo/Logitech-920-004536-Mk270-Keyboard-Mouse-USB-Wireless-Combo-Black_99591453-341e-4c5b-937e-b2ab9b321519.3860011d84a23ccd0732e46474590b15.jpeg?odnHeight=784&odnWidth=580&odnBg=FFFFFF"
},
{
"url": "https://www.walmart.com/sp/track?bt=1&eventST=click&plmt=sp-search-middle~desktop~Results%20for%20%22Electronics%22&pos=2&tax=3944_1089430_132959_1008621_7197407&rdf=1&rd=https%3A%2F%2Fwww.walmart.com%2Fip%2FSteelSeries-Apex-3-TKL-RGB-Gaming-Keyboard-Tenkeyless-Water-Dust-Resistant-PC-and-USB-A%2F996783321%3FclassType%3DVARIANT%26adsRedirect%3Dtrue&adUid=094fb4ae-62f3-4954-ae99-b2938550d72c&mloc=sp-search-middle&pltfm=desktop&pgId=keyboard&pt=search&spQs=Dp3ons-xIcmPw9Ze7UUZuW3PD9Dto_vYCLjglme5vSy5Ze1p4NXg3uzApRy4mgfB-dGDchsq6FDoaZeMy6Dmeagqx_YPQVlj2EfM_TnEyfv_0r9GA9WwEd1cWbcx63Diahe72Zw6lw8suSf-OFKKH6UaiJl_8Qtpar-x0VhgrMsbqG7gDKh5DkQZql3HeMLncWSwburhSEjvpT1dXlDoWKxUrZwxZhOMry-uCqhuSb7Y6B-xZGrNPjYyel0nw11Z&storeId=3081&couponState=na&bkt=ace1_default%7Cace2_default%7Cace3_default%7Ccoldstart_off%7Csearch_default&classType=VARIANT",
"name": "SteelSeries Apex 3 TKL RGB Gaming Keyboard - Tenkeyless - Water & Dust Resistant - PC and USB-A",
"image": "https://i5.walmartimages.com/seo/SteelSeries-Apex-3-TKL-RGB-Gaming-Keyboard-Tenkeyless-Water-Dust-Resistant-PC-and-USB-A_876430c2-eed8-404a-aa55-1c66193daf8e.8c617e57ba48bc49d003f917f85cb535.jpeg?odnHeight=784&odnWidth=580&odnBg=FFFFFF"
},
// omitted for brevity...
{
"url": "https://www.walmart.com/ip/DEP-06-Portable-Digital-Piano-with-X-Stand/7598762909?classType=REGULAR",
"name": "Donner Portable Digital Piano 88-key Synth Action Keyboard with X Stand, Pedal, Auto-accompaniment for Beginner, 128 Tones, 83 Rhythms, Support USB/MIDI/Melodics, Wireless Connection",
"image": "https://i5.walmartimages.com/seo/DEP-06-Portable-Digital-Piano-with-X-Stand_1175fc1e-c191-4c71-9e9a-7e4a13274487.6673e0430c23d122744cfb63ccc8c155.jpeg?odnHeight=784&odnWidth=580&odnBg=FFFFFF"
}
]
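The first two URLs in the sample output above appear to be ad-tracking redirects (`/sp/track`) whose `rd` query parameter holds the actual product URL in percent-encoded form. A minimal sketch for unwrapping them with the standard library follows. The `unwrap_tracking_url` helper is hypothetical, the `/sp/track` and `rd` conventions are inferred only from this sample, and `sample` is a trimmed copy of the first URL.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_tracking_url(url: str) -> str:
    """Return the URL stored in the `rd` parameter of a tracking link, else the input."""
    parts = urlsplit(url)
    # Only treat links under /sp/track as redirects (assumption based on the sample)
    if not parts.path.startswith("/sp/track"):
        return url
    # parse_qs splits the query string and percent-decodes each value
    targets = parse_qs(parts.query).get("rd")
    return targets[0] if targets else url


# Trimmed version of the first URL in the sample output
sample = (
    "https://www.walmart.com/sp/track?bt=1&pos=1"
    "&rd=https%3A%2F%2Fwww.walmart.com%2Fip%2FLogitech-920-004536-Mk270-Keyboard-Mouse-USB-Wireless-Combo-Black"
    "%2F28540111%3FclassType%3DREGULAR%26adsRedirect%3Dtrue&storeId=3081"
)
print(unwrap_tracking_url(sample))
```

Links that are already direct product URLs, such as the last entry in the sample, pass through unchanged.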
eBay scraping example
- Target page: eBay "mouse" search page
- URL: https://www.ebay.com/sch/i.html?_nkw=mouse
eBay does not require JavaScript rendering, so Requests and Beautiful Soup are enough:
import requests
from bs4 import BeautifulSoup
import json

# Search results page for "mouse"
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=mouse&_sacat=0"

# Send a browser-like User-Agent header with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
}

# Download the page and parse its HTML
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the URL, name, and image of each product in the results
products = []
product_elements = soup.select("li.s-item")
for product_element in product_elements:
    url_element = product_element.select("a[data-interactions]")[0]
    url = url_element["href"]
    name_element = product_element.select("[role=\"heading\"]")[0]
    name = name_element.text
    image_element = product_element.select("img")[0]
    image = image_element["src"]
    product = {
        "url": url,
        "name": name,
        "image": image
    }
    products.append(product)

# Export the scraped data to a JSON file
with open("products.json", "w", encoding="utf-8") as json_file:
    json.dump(products, json_file, indent=4)

The resulting products.json file contains:
[
{
"url": "https://www.ebay.com/itm/193168148815?_skw=mouse&itmmeta=01JGC679WKT327K11R9YCGMQAN&hash=item2cf9b8094f:g:8F4AAOSw3B1drMr-&itmprp=enc%3AAQAJAAAAwHoV3kP08IDx%2BKZ9MfhVJKlr8NKoodwElhyHbl4CwcBMRqdGJme95%2F3tIll4uI7QYBk4%2BUBpwVvwiXdAl2%2BcILZ9axc%2BdHSZStWWMxWVyq4JdZ6r52PrRP2aS1jUoFoJ11vL4KyH2S8R5ha71xBtDFcGA2%2BtzhTzcR7J25kxuxbyd%2Frd4YnKbTPKwhn2Q0TP8qL30BJKcj4FnJYP0zhgO4WOGgOCHQhM21%2BanVk%2Fl0eg1H8mqCU91mkgKAt8KghFmw%3D%3D%7Ctkp%3ABlBMULSenYaDZQ",
"name": "2.4GHz Wireless Optical Mouse Mice & USB Receiver For PC Laptop Computer DPI USA",
"image"