The data on the site https://www.futurepedia.io/ai-innovations is loaded dynamically via POST requests.
Looking at the request payloads for a few pages:
{"companies":[],"startDate":"2023-12-01T00:00:00.000Z","endDate":"2024-06-09T12:25:08.525Z","limit":25,"page":9,"categories":[],"itemTypes":[],"query":null}
{"companies":[],"startDate":"2023-12-01T00:00:00.000Z","endDate":"2024-06-09T12:25:08.525Z","limit":25,"page":7,"categories":[],"itemTypes":[],"query":null}
{"companies":[],"startDate":"2023-12-01T00:00:00.000Z","endDate":"2024-06-09T12:25:08.525Z","limit":25,"page":5,"categories":[],"itemTypes":[],"query":null}
The main difference between these three payloads is the "page" parameter, which is used for pagination and indicates which page of data is being requested. Specifically:
- The first payload requests page 9.
- The second payload requests page 7.
- The third payload requests page 5.
The other parameters ("companies", "startDate", "endDate", "limit", "categories", and "itemTypes") are identical across the three requests. "startDate" and "endDate" define the time range of the requested data, "limit" sets the number of items per page, and "categories" and "itemTypes" can be used to filter the data, but here they are empty, meaning no filters are applied. The "query" parameter is also null, meaning no search query is used.
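So the scraper only needs to vary "page" while keeping the other fields fixed. A minimal sketch of how such a payload could be built for an arbitrary page (the field values are copied from the captured requests; build_payload and its defaults are hypothetical names of my own):
import json
from datetime import datetime, timezone

def build_payload(page: int, limit: int = 25) -> str:
    """Build the POST body for one page, mirroring the captured payloads."""
    payload = {
        "companies": [],
        "startDate": "2023-12-01T00:00:00.000Z",
        # In the captured requests, endDate is simply the current time
        "endDate": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        "limit": limit,
        "page": page,
        "categories": [],
        "itemTypes": [],
        "query": None,
    }
    # The site sends the body as text/plain, so serialize it ourselves
    return json.dumps(payload)

print(build_payload(5))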
Looking at the JSON data that comes back:
{
  "products": [
    {
      "id": "2dd3fed5-fb31-473d-8c13-b731c9617657",
      "name": "Copilot for Data Factory",
      "company": {
        "name": "Microsoft",
        "slug": "microsoft"
      },
      "category": "Automation",
      "itemType": "Feature",
      "shortDescription": "Automates data ingestion and transformation processes",
      "longDescription": "Provides AI-driven guidance for data integration, ensuring up-to-date and accurate warehouse data.",
      "releaseDate": "2024-05-23",
      "sources": [
        "https://blog.fabric.microsoft.com/en-us/blog/announcing-the-public-preview-of-copilot-for-data-warehouse-in-microsoft-fabric?ft=All"
      ]
    },
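Each product record contains nested structures: "company" is a dict and "sources" is a list, so these values have to be flattened before they can be written into Excel cells. A minimal sketch of that preprocessing, assuming the record shape shown above (flatten_product is a hypothetical helper name):
import json

def flatten_product(product: dict) -> dict:
    """Convert nested dicts/lists to JSON strings so each value fits in a single Excel cell."""
    return {
        key: json.dumps(value, ensure_ascii=False) if isinstance(value, (dict, list)) else value
        for key, value in product.items()
    }

sample = {
    "name": "Copilot for Data Factory",
    "company": {"name": "Microsoft", "slug": "microsoft"},
    "releaseDate": "2024-05-23",
}
print(flatten_product(sample)["company"])  # {"name": "Microsoft", "slug": "microsoft"}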
The prompt given to ChatGPT:
You are a Python programming expert. Complete a Python scripting task with the following steps:
Create a new Excel file on drive F: AIInnovations20240609.xlsx
Scrape the web page:
Request URL: https://www.futurepedia.io/api/product-releases
Request method: POST
Status code: 200 OK
Remote address: 172.67.176.202:443
Referrer policy: strict-origin-when-cross-origin
Request payload:
{"companies":[],"startDate":"2023-12-01T00:00:00.000Z","endDate":"2024-06-09T12:25:08.525Z","limit":25,"page":{pagenumber},"categories":[],"itemTypes":[],"query":null}
The value of {pagenumber} starts at 1, increases by 1, and ends at 160;
Get the page's JSON data;
Extract the value of the "products" key from this JSON data; that value is itself JSON;
Extract the names of all keys in that JSON and write them as the Excel file's header row, and write the values of all keys into the Excel file's data columns;
Save the Excel file;
Note: print information to the screen at every step;
Pause 5-9 seconds after each page is scraped;
The JSON data needs to be preprocessed: convert nested dictionaries and lists into a format suitable for writing to Excel, for example by converting nested dictionaries to strings.
Set the request headers:
Accept: */*
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Content-Length: 156
Content-Type: text/plain;charset=UTF-8
Origin: https://www.futurepedia.io
Priority: u=1, i
Referer: https://www.futurepedia.io/ai-innovations
Sec-Ch-Ua: "Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
The source code generated by ChatGPT:
import requests
import openpyxl
import time
import random
import json
# File path and name
file_path = r"F:\AIInnovations20240609.xlsx"
# Create a new Excel workbook
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Product Releases"
# Initial page number
page_number = 1
# Request URL and method
url = "https://www.futurepedia.io/api/product-releases"
# Request headers
request_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Content-Length': '156',
    'Content-Type': 'text/plain;charset=UTF-8',
    'Origin': 'https://www.futurepedia.io',
    'Priority': 'u=1, i',
    'Referer': 'https://www.futurepedia.io/ai-innovations',
    'Sec-Ch-Ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}
# Flag for whether the header row has been written
headers_written = False
# Data-cleaning function: convert nested dicts or lists to strings
def clean_data(value):
    if isinstance(value, (dict, list)):
        return json.dumps(value, ensure_ascii=False)  # keep non-ASCII characters unescaped
    return value
# Scrape the data
while page_number <= 160:
    print(f"Scraping page {page_number}...")
    # Request payload
    payload = {