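Column: Dance with GenAI

Everything about generative AI (AIGC)

AI stock trading: batch-fetching all news for a listed company from Eastmoney (东方财富网)

Dance with GenAI · WeChat official account · 2024-05-25 05:08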

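Article

Task and goal: the user enters the name of a listed company, and the program then automatically fetches all of that company's news from Eastmoney in bulk.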

Time: 2024-05-23 03:57:43 -

News summary:

2024-05-23 03:57:43 - On May 22, iFLYTEK (科大讯飞) announced that the iFLYTEK Spark (讯飞星火) API is now officially open free of charge.

News link: http://finance.eastmoney.com/a/202405233084538683.html

Next page: >

//*[@id="app"]/div[3]/div[1]/div[4]/div/a[5]
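The sample item above suggests that the summary text on the results page starts with a timestamp followed by a " - " separator. A minimal sketch of splitting such a string into a datetime and the summary body, assuming that "YYYY-MM-DD HH:MM:SS - text" layout holds for every item (the function name split_time_and_summary is hypothetical, not part of the original script):

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Leading timestamp, an optional "-" separator, then the rest of the text.
_PATTERN = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*-?\s*(.*)$", re.DOTALL
)


def split_time_and_summary(raw: str) -> Tuple[Optional[datetime], str]:
    """Split 'YYYY-MM-DD HH:MM:SS - text' into (timestamp, text).

    Returns (None, stripped input) when no valid leading timestamp is found.
    """
    match = _PATTERN.match(raw)
    if match is None:
        return None, raw.strip()
    try:
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    except ValueError:  # digits in the right shape but not a real date
        return None, raw.strip()
    return stamp, match.group(2).strip()


sample = "2024-05-23 03:57:43 - 5月22日，科大讯飞宣布，讯飞星火API能力正式免费开放。"
stamp, summary = split_time_and_summary(sample)
print(stamp.isoformat(), summary)
```

Keeping the timestamp as a real datetime makes it easier to sort or filter the news by date later on.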

Step 1: enter the following prompt into DeepSeek:

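You are a Python web-scraping expert. Write a Python script that completes the following scraping task:

1. Ask the user for a keyword, accept it, and store it in the variable {stock};

2. Create a new Excel file in the F:\aivideo folder: {stock}.xlsx;

3. Set the chromedriver path to "D:\Program Files\chromedriver125\chromedriver.exe";

Open the page https://so.eastmoney.com/news/s?keyword={stock} with selenium;

Parse the page source and print it;

4. Locate the a tag with CSS selector = #app > div.main.container > div.c_l > div.news_list > div:nth-child(number1) > div.news_item_t > a (number1 runs from 1 to 10); extract its text as the news title and save it to column 1 of {stock}.xlsx; extract the a tag's href value as the news URL and save it to column 2 of {stock}.xlsx;

5. Locate the span tag with CSS selector = #app > div.main.container > div.c_l > div.news_list > div:nth-child(number2) > div.news_item_c > span.news_item_time (number2 runs from 1 to 10); extract its text as the news date and save it to column 3 of {stock}.xlsx;

6. Locate the span tag at #app > div.main.container > div.c_l > div.news_list > div:nth-child(number3) > div.news_item_c > span:nth-child(2) (number3 runs from 1 to 10); extract its text as the news summary and save it to column 4 of {stock}.xlsx;

7. Locate the a tag with title="下一页" ("next page") and simulate a user click to open it;

8. Wait a random 1 to 10 seconds;

Repeat steps 4 through 8 until "next page" has been clicked 5 times.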

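Notes:

Print progress information to the screen at every step;

Set request headers to cope with anti-scraping measures;

The DataFrame.append method was deprecated in pandas 1.4.0 and removed in pandas 2.0. To work around this, use the concat function instead of append.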
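The last note can be illustrated with a small sketch. It assumes pandas 2.x; the append_rows helper, the English column names, and the sample rows are all made up for illustration. The rows scraped from one page are wrapped in their own DataFrame and then joined to the running table with pd.concat:

```python
import pandas as pd

COLUMNS = ["title", "url", "date", "summary"]  # illustrative column names


def append_rows(table: pd.DataFrame, rows: list) -> pd.DataFrame:
    """Return a new DataFrame with `rows` (a list of dicts) added at the end."""
    if not rows:
        return table
    page = pd.DataFrame(rows, columns=COLUMNS)
    if table.empty:
        # Nothing collected yet, so the first page becomes the table.
        return page
    # ignore_index=True renumbers the rows 0..n-1 across both frames.
    return pd.concat([table, page], ignore_index=True)


table = pd.DataFrame(columns=COLUMNS)
table = append_rows(table, [
    {"title": "t1", "url": "http://example.com/1", "date": "2024-05-23", "summary": "s1"},
])
table = append_rows(table, [
    {"title": "t2", "url": "http://example.com/2", "date": "2024-05-22", "summary": "s2"},
    {"title": "t3", "url": "http://example.com/3", "date": "2024-05-21", "summary": "s3"},
])
print(len(table))
```

Unlike the old append, concat never modifies a frame in place, so the result has to be assigned back to the variable each time.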

Step 2: review the Python code generated by DeepSeek:

import os
import time
import random
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains

# Path to chromedriver
chromedriver_path = "D:\\Program Files\\chromedriver125\\chromedriver.exe"

# Request headers (only the user-agent entry is passed to Chrome below)
headers = {
    'authority': 'so.eastmoney.com',
    'method': 'GET',
    'path': '/news/s?keyword={stock}',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-encoding': 'gzip, deflate, br, zstd',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': '__root_domain_v=.eastmoney.com; _qddaz=QD.484716194472545; _ntes_origin_from=sogou; _ntes_nuid=af525edefbd3d363cd5876dbe902d85b; s_n_f_l_n3=01cb466d244b7db51716517702302; _antanalysis_s_id=1716517702995; UserProvince=%u5168%u56FD; ne_analysis_trace_id=1716518441938; vinfo_n_f_l_n3=01cb466d244b7db5.1.0.1716517702302.0.1716518672850',
    'if-modified-since': 'Fri, 24 May 2024 02:40:05 GMT',
    'priority': 'u=0, i',
    'referer': 'https://www.sogou.com/link?url=hedJjaC291M4oaTBlXc5yCiioa5eDVgw',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

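# Chrome options: reuse the user-agent string from the headers above
chrome_options = Options()
chrome_options.add_argument(f'user-agent={headers["user-agent"]}')

# Chrome service pointing at the chromedriver binary
service = Service(chromedriver_path)

# Start the Chrome browser
driver = webdriver.Chrome(service=service, options=chrome_options)

# Ask the user for the keyword
stock = input("Enter a keyword: ")

# Open the search results page
url = f'https://so.eastmoney.com/news/s?keyword={stock}'
driver.get(url)

# Wait for the page to finish loading
time.sleep(5)

# Get and print the page source
html = driver.page_source
print("Page source:")
print(html)

# Create the Excel file (columns: title, URL, date, summary)
excel_path = os.path.join('F:\\aivideo', f'{stock}.xlsx')
df = pd.DataFrame(columns=['新闻标题', '新闻URL', '新闻日期', '新闻摘要'])
df.to_excel(excel_path, index=False)

# Click "next page" 5 times in a loop
for _ in range(5):
    ...  # the loop body (steps 4 to 8 of the prompt) is cut off in the source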
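The listing stops at the loop header, so the body DeepSeek actually generated is not shown. As a rough sketch only, here is one way steps 4 to 8 of the prompt could be written. The helper names row_selectors and crawl are hypothetical, the selectors are copied from the prompt and have not been checked against the live page, and writing to Excel is left out:

```python
import random
import time

LIST_ROOT = "#app > div.main.container > div.c_l > div.news_list"
NEXT_PAGE = 'a[title="下一页"]'  # the "next page" link from step 7


def row_selectors(n: int) -> dict:
    """CSS selectors for the n-th news item on a page (n runs from 1 to 10)."""
    row = f"{LIST_ROOT} > div:nth-child({n})"
    return {
        "title": f"{row} > div.news_item_t > a",
        "date": f"{row} > div.news_item_c > span.news_item_time",
        "summary": f"{row} > div.news_item_c > span:nth-child(2)",
    }


def crawl(driver, clicks: int = 5) -> list:
    """Scrape the current page, click "next page", wait; repeat `clicks` times."""
    # Imported here so row_selectors can be used without selenium installed.
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    all_rows = []
    for page in range(1, clicks + 1):
        for n in range(1, 11):
            sel = row_selectors(n)
            try:
                link = driver.find_element(By.CSS_SELECTOR, sel["title"])
                date = driver.find_element(By.CSS_SELECTOR, sel["date"]).text
                summary = driver.find_element(By.CSS_SELECTOR, sel["summary"]).text
            except NoSuchElementException:
                print(f"page {page}, item {n}: not found, skipping")
                continue
            all_rows.append({
                "新闻标题": link.text,
                "新闻URL": link.get_attribute("href"),
                "新闻日期": date,
                "新闻摘要": summary,
            })
        print(f"page {page}: {len(all_rows)} rows collected so far")
        driver.find_element(By.CSS_SELECTOR, NEXT_PAGE).click()
        delay = random.uniform(1, 10)  # step 8: random 1 to 10 second pause
        print(f"waiting {delay:.1f} s before the next page")
        time.sleep(delay)
    return all_rows
```

With the driver, df and excel_path from the script, the returned list could be turned into a DataFrame, joined to df with pd.concat, and written out with to_excel. Skipping items that are not found keeps a short final page from stopping the whole run.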
