本文作者:石沛尧,中南财经政法大学金融学院
本文编辑:兰博文
技术总编:金 点
Stata and Python 数据分析
爬虫俱乐部Stata基础课程、Stata进阶课程和Python课程可在小鹅通平台查看,欢迎大家多多支持订阅!如需了解详情,可以通过课程链接(https://appbqiqpzi66527.h5.xiaoeknow.com/homepage/10)或课程二维码进行访问哦~
本文首先从TuShare中下载财务指标,进而整理出盈利能力数据、营运能力数据、成长能力数据、偿债能力数据、现金流量数据;其次,对A股上市公司基本面类数据分析,分析得到的表格所含变量内容如TuShare基本面类数据([TuShare -财经数据接口包](http://tushare.org/fundamental. html#id3))所示,最后,聚类可视化财务指标。首先,引入几个重要的包 :import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"#让一个cell可以同时输出多个语句的output
从TuShare中分别下载财务指标、现金流量表、利润表、资产负债表、主营业务构成表。代码如下:
import tushare as ts
ts.set_token('ca0cd5409f62a54869ffa4ad1f81f99e0a3d649f4e87cfc034dc0001')
pro = ts.pro_api()
stocks = pro.stock_basic(exchange='', list_status='L', fields='ts_code,symbol,name,area,industry,list_date')
stocks
ts_codes = stocks.ts_code
ts_codes
from tqdm import tqdm
import time
ts_codes = stocks.ts_code
financial_indicator_data = pd.DataFrame()
interval = 0.3
for code in tqdm(ts_codes, desc="下载进度"):
df = pro.fina_indicator(ts_code=code, period='20231231', fields='ts_code,ocf_to_profit,ocf_to_debt,ocf_to_shortdebt,roe,netprofit_margin,grossprofit_margin,eps,revenue_ps,ar_turn,arturn_days,inv_turn,invturn_days,ca_turn,current_ratio,quick_ratio,cash_ratio,ebit_to_interest,eqt_yoy ')
if not df.empty:
financial_indicator_data = pd.concat([financial_indicator_data, df], ignore_index=True)
time.sleep(interval)
financial_indicator_data.drop_duplicates(subset=['ts_code'], keep='first', inplace=True)
financial_indicator_data
financial_indicator_data.to_excel('financial_indicator_data.xlsx')
ts_codes = stocks.ts_code
financial_indicator_data2 = pd.DataFrame()
interval = 0.3
for code in tqdm(ts_codes, desc="下载进度"):
df = pro.fina_indicator(ts_code=code, period='20231231', fields='ts_code,or_yoy,op_yoy,equity_yoy,assets_yoy,basic_eps_yoy,eqt_yoy')
if not df.empty:
financial_indicator_data2 = pd.concat([financial_indicator_data2, df], ignore_index=True)
time.sleep(interval)
financial_indicator_data2.drop_duplicates(subset=['ts_code'], keep='first', inplace=True)
financial_indicator_data2
financial_indicator_data2.to_excel('financial_indicator_data2.xlsx')
ts_codes = stocks.ts_code
cash_flow_statement = pd.DataFrame()
interval = 0.3
for code in tqdm(ts_codes, desc="下载进度"):
df = pro.cashflow(ts_code=code, period='20231231', fields='ts_code,n_cashflow_act')
if not df.empty:
cash_flow_statement = pd.concat([cash_flow_statement, df], ignore_index=True)
time.sleep(interval)
cash_flow_statement.drop_duplicates(subset=['ts_code'], keep='first', inplace=True)
cash_flow_statement
cash_flow_statement.to_excel('cash_flow_statement.xlsx')
ts_codes = stocks.ts_code
income_statement = pd.DataFrame()
interval = 0.3
for code in tqdm(ts_codes, desc="下载进度"):
df = pro.income(ts_code=code, period='20231231', fields='ts_code,n_income,revenue,t_compr_income')
if not df.empty:
income_statement = pd.concat([income_statement, df], ignore_index=True)
time.sleep(interval)
income_statement.drop_duplicates(subset=['ts_code'], keep='first', inplace=True)
income_statement
income_statement.to_excel('income_statement.xlsx')
ts_codes = stocks.ts_code
balance_sheet = pd.DataFrame()
interval = 0.3
for code in tqdm(ts_codes, desc="下载进度"):
df = pro.balancesheet(ts_code=code, period='20231231', fields='ts_code,total_assets')
if not df.empty:
balance_sheet = pd.concat([balance_sheet, df], ignore_index=True)
time.sleep(interval)
balance_sheet.drop_duplicates(subset=['ts_code'], keep='first', inplace=True)
balance_sheet
balance_sheet.to_excel('balance_sheet.xlsx')
ts_codes = stocks.ts_code
main_business_composition = pd.DataFrame()
interval = 1
for code in tqdm(ts_codes, desc="下载进度"):
df = pro.fina_mainbz(ts_code=code, period='20231231', fields='ts_code,bz_sales')
if not df.empty:
main_business_composition = pd.concat([main_business_composition, df], ignore_index=True)
time.sleep(interval)
main_business_composition.drop_duplicates(subset=['ts_code'], keep='first', inplace=True)
main_business_composition
main_business_composition.to_excel('main_business_composition.xlsx')
从上述下载的excel中读取并整理数据,整理得到:盈利能力数据、营运能力数据、成长能力数据、偿债能力数据、现金流量数据。代码实现如下:
financial_indicator_data = pd.read_excel('financial_indicator_data.xlsx')
financial_indicator_data2 = pd.read_excel('financial_indicator_data2.xlsx')
cash_flow_statement = pd.read_excel('cash_flow_statement.xlsx')
income_statement = pd.read_excel('income_statement.xlsx')
balance_sheet = pd.read_excel('balance_sheet.xlsx')
main_business_composition = pd.read_excel('main_business_composition.xlsx')
financial_indicator_data
financial_indicator_data2
merged_data = pd.merge(financial_indicator_data, financial_indicator_data2, on="ts_code", how="inner")
merged_data = pd.merge(merged_data, cash_flow_statement, on="ts_code", how="inner")
merged_data = pd.merge(merged_data, income_statement, on="ts_code", how="inner")
merged_data = pd.merge(merged_data, balance_sheet, on="ts_code", how="inner")
merged_data = pd.merge(merged_data, main_business_composition, on="ts_code", how="inner")
merged_data
merged_data["cf_sales"] = merged_data["n_cashflow_act"] / merged_data["bz_sales"]
merged_data["rateofreturn"] = merged_data["n_cashflow_act"] / merged_data["total_assets"]
merged_data["sheqratio"] = merged_data["t_compr_income"] / merged_data["total_assets"]
merged_data["currentasset_days"] = 360/ merged_data["ca_turn"]
merged_data
基本面数据分析如下:首先,营运能力指标有:arturnover,应收账款周转率(次)、arturndays,应收账款周转天数(天)、inventory_turnover,存货周转率(次)、inventory_days,存货周转天数(天)、currentasset_turnover,流动资产周转率(次)、currentasset_days,流动资产周转天数(天)。代码实现如下:operation = merged_data.loc[:, ["ts_code", "ar_turn", "arturn_days", "inv_turn", "invturn_days", "ca_turn", "currentasset_days"]]
operation = pd.merge(operation, stocks[['ts_code', 'name']], on="ts_code", how="left")
operation.set_index("name", inplace=True)
operation.drop(columns=["ts_code"], inplace=True)
operation
profit = merged_data.loc[:, ["ts_code", "roe", "netprofit_margin", "grossprofit_margin", "n_income", "eps", "revenue",'revenue_ps']]
profit = pd.merge(profit, stocks[['ts_code', 'name']], on="ts_code", how="left")
profit.set_index("name", inplace=True)
profit.drop(columns=["ts_code"], inplace=True)
profit
其次,成长能力指标有:mbrg,主营业务收入增长率(%)、nprg,净利润增长率(%)、nav,净资产增长率、targ,总资产增长率、epsg,每股收益增长率、seg,股东权益增长率流动资产周转天数(天)。代码实现如下:growth = merged_data.loc[:, ["ts_code", "or_yoy", "op_yoy", "equity_yoy", "assets_yoy", "basic_eps_yoy", "eqt_yoy_x"]]
growth = pd.merge(growth, stocks[['ts_code', 'name']], on="ts_code", how="left")
growth.set_index("name", inplace=True)
growth.drop(columns=["ts_code"], inplace=True)
growth
再次,偿债能力指标有:currentratio,流动比率、quickratio,速动比率、cashratio,现金比率、icratio,利息支付倍数、sheqratio,股东权益比率、adratio,股东权益增长率。代码实现如下:debtpaying = merged_data.loc[:, ["ts_code", "current_ratio", "quick_ratio", "cash_ratio", "ebit_to_interest", "sheqratio", "eqt_yoy_x"]]
debtpaying = pd.merge(debtpaying, stocks[['ts_code', 'name']], on="ts_code", how="left")
debtpaying.set_index("name", inplace=True)
debtpaying.drop(columns=["ts_code"], inplace=True)
debtpaying
最后,现金流量指标有:cf_sales,经营现金净流量对销售收入比率、rateofreturn,资产的经营现金流量回报率、cf_nm,经营现金净流量与净利润的比率、cf_liabilities,经营现金净流量对负债比率、cashflowratio,现金流量比率。代码实现如下:cashflow = merged_data.loc[:, ["ts_code", "cf_sales", "rateofreturn", "ocf_to_profit", "ocf_to_debt", "ocf_to_shortdebt"]]
cashflow = pd.merge(cashflow, stocks[['ts_code', 'name']], on="ts_code", how="left")
cashflow.set_index("name", inplace=True)
cashflow.drop(columns=["ts_code"], inplace=True)
cashflow
将上述五张表中不同财务指标数据进行数据预处理、降维、聚类和可视化,最终画出三维散点图。代码实现如下:from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
var = [profit,growth,cashflow,debtpaying,operation]
names = ['盈利能力','成长能力','现金流量','偿债能力','营运能力']
SS = StandardScaler()
profit = profit.dropna(thresh=4000,axis=1)
profit = profit.fillna(profit.mean())
data = SS.fit_transform(profit)
pca = PCA(n_components=1)
pca.fit(data)
data_new = pca.transform(data)
data_new
pca_data = []
for i in range(5):
df = var[i]
df = df.dropna(thresh=4000, axis=1)
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna(axis=0)
df = df.fillna(df.mean())
SS = StandardScaler()
data = SS.fit_transform(df)
pca = PCA(n_components=1)
pca.fit(data)
data_new = pca.transform(data)
pca_data.append(pd.DataFrame(data_new, index=df.index, columns=[names[i]]))
from functools import reduce
df_pca = reduce(lambda left,right:pd.merge(left,right,left_index=True,right_index=True,how = 'inner'),pca_data)
df_pca.drop_duplicates(subset=None,keep='first',inplace=True,ignore_index=False)
df_pca
SS= StandardScaler()
df_pca_ts = SS.fit_transform(df_pca)
df = pd.DataFrame(df_pca_ts,columns=names,index = df_pca.index)
df
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4)
r_km = km.fit(df)
df['cluster_km'] = r_km.labels_
df
fig = plt.figure(figsize=(12,6))
axes = fig.add_subplot(111, projection='3d')
x = np.array(df.盈利能力)
y = np.array(df.偿债能力)
z = np.array(df.成长能力)
t = np.array(df.cluster_km)
axes.scatter3D(x, y, z, c=t)
axes.set_xlabel('盈利能力')
axes.set_ylabel('偿债能力')
axes.set_zlabel('营运能力')
plt.show()
通过QuantileTransformer将数据的分布进行规范化,提高模型的稳定性和性能,画出三维散点图。代码如下:df.cluster_km.value_counts()
from sklearn.preprocessing import QuantileTransformer as QT
qt = QT(output_distribution='normal')
df_ts = pd.DataFrame(qt.fit_transform(df.iloc[:,:-1]),index=df_pca.index,columns=df_pca.columns)
df_ts
km = KMeans(n_clusters=4)
r_km = km.fit(df_ts)
df_ts['cluster_km'] = r_km.labels_
fig = plt.figure()
axes = fig.add_subplot(111,projection='3d')
x = np.array(df_ts.盈利能力)
y = np.array(df_ts.偿债能力)
z = np.array(df_ts.成长能力)
t = np.array(df_ts.cluster_km)
axes.scatter3D(x,y,z,c=t)
axes.set_xlabel('盈利能力')
axes.set_ylabel('偿债能力')
axes.set_zlabel('营运能力')
plt.show()
以上就是基于TuShare中的财务指标数据进行A股上市公司基本面类数据分析的内容介绍。掌握TuShare财经数据接口包可以让你更加高效地进行财务数据分析,进而基于基本面数据得到业务推荐报告。如果你对Python数据分析还有其他问题或者想要了解更多内容,欢迎留言讨论!
重磅福利!为了更好地服务各位同学的研究,爬虫俱乐部将在小鹅通平台上持续提供金融研究所需要的各类指标,包括上市公司十大股东、股价崩盘、投资效率、融资约束、企业避税、分析师跟踪、净资产收益率、资产回报率、国际四大审计、托宾Q值、第一大股东持股比例、账面市值比、沪深A股上市公司研究常用控制变量等一系列深加工数据,基于各交易所信息披露的数据利用Stata在实现数据实时更新的同时还将不断上线更多的数据指标。我们以最前沿的数据处理技术、最好的服务质量、最大的诚意望能助力大家的研究工作!相关数据链接,请大家访问:(https://appbqiqpzi66527.h5.xiaoeknow.com/homepage/10)或扫描二维码:
对我们的推文累计打赏超过1000元,我们即可给您开具发票,发票类别为“咨询费”。用心做事,不负您的支持!
往期推文推荐
微信公众号“Stata and Python数据分析”分享实用的Stata、Python等软件的数据处理知识,欢迎转载、打赏。我们是由李春涛教授领导下的研究生及本科生组成的大数据处理和分析团队。
武汉字符串数据科技有限公司一直为广大用户提供数据采集和分析的服务工作,如果您有这方面的需求,请发邮件到[email protected],或者直接联系我们的数据中台总工程司海涛先生,电话:18203668525,wechat: super4ht。海涛先生曾长期在香港大学从事研究工作,现为知名985大学的博士生,爬虫俱乐部网络爬虫技术和正则表达式的课程负责人。
此外,欢迎大家踊跃投稿,介绍一些关于Stata和Python的数据处理和分析技巧。
投稿邮箱:[email protected]投稿要求:1)必须原创,禁止抄袭;2)必须准确,详细,有例子,有截图;注意事项:1)所有投稿都会经过本公众号运营团队成员的审核,审核通过才可录用,一经录用,会在该推文里为作者署名,并有赏金分成。2)邮件请注明投稿,邮件名称为“投稿+推文名称”。3)应广大读者要求,现开通有偿问答服务,如果大家遇到有关数据处理、分析等问题,可以在公众号中提出,只需支付少量赏金,我们会在后期的推文里给予解答。