[Python] Mining Review Text: Finding Like-Minded Users and Recommending Products

大数据挖掘DT数据分析 · WeChat official account · Big Data · 2017-04-10 23:03

WeChat official account 数据挖掘入门与实战: datadw

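Contents

Process
Multi-level classification of movie ratings
Entering user data
Computing the Pearson correlation coefficient to find like-minded users (inserting my own data)
Recommending movies to a user (weighted average of everyone's ratings)
Results and analysis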

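Process:

Scrape Douban movie user data with a crawler
Define movie rating levels using a multi-level classification
Compute the Pearson correlation between my ratings and each user's ratings
User-based similarity: find like-minded people, which reveals products a user is likely to enjoy but has not yet found
Item-based similarity: find similar products, which reveals potential customers (e.g., Amazon's "customers who bought item A also bought item B")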
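The steps above all operate on a nested dictionary of the form {user: {movie: rating}}. As a minimal sketch, assuming the crawler yields (user, movie, rating) triples, such a dictionary could be assembled as follows. The record layout and the `build_prefs` helper are hypothetical and do not come from the original post.

```python
from collections import defaultdict


def build_prefs(records):
    """Group (user, movie, rating) triples into {user: {movie: rating}}."""
    prefs = defaultdict(dict)
    for user, movie, rating in records:
        # A later rating for the same (user, movie) pair overwrites the earlier one
        prefs[user][movie] = rating
    return dict(prefs)


# Hypothetical scraped records, for illustration only
records = [
    ('justin', '消失的爱人', 3),
    ('justin', '霍比特人3', 4),
    ('totox', '消失的爱人', 2),
]
prefs = build_prefs(records)
print(prefs['justin'])
```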

Multi-level classification of movie ratings:

Very poor (很差)
Poor (较差)
OK (还行)
Recommended (推荐)
Highly recommended (力荐)
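The code in this article stores ratings as integers. Here is a minimal sketch of the conversion, assuming the five labels correspond to scores 1 through 5 in the order listed. The `RATING_LEVELS` table and the `label_to_score` helper are illustrative names, not part of the original code.

```python
# Assumed mapping: labels listed from worst to best, scored 1 to 5
RATING_LEVELS = ['很差', '较差', '还行', '推荐', '力荐']
LABEL_TO_SCORE = {label: score for score, label in enumerate(RATING_LEVELS, start=1)}


def label_to_score(label):
    """Convert a rating label to its integer score; raise on unknown labels."""
    try:
        return LABEL_TO_SCORE[label]
    except KeyError:
        raise ValueError('unknown rating label: %r' % label)


print(label_to_score('力荐'))  # 5
```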

# -*- coding: utf-8 -*-
# Ported to Python 3: the Python 2 reload(sys) / sys.setdefaultencoding hack is dropped
import json

user_info = {}

# Scraped data: each user's ratings for the five movies below, in order
user_dict = {
    'ns2250225': [4, 3, 4, 5, 4],
    'justin': [3, 4, 3, 4, 2],
    'totox': [2, 3, 5, 1, 4],
    'fabrice': [4, 1, 3, 4, 5],
    'doreen': [3, 4, 2, 5, 3],
}


# Enter the user data as {user: {movie: rating}}
def user_data(user_dict):
    for name in user_dict:
        user_info[name] = {'消失的爱人': user_dict[name][0]}
        user_info[name]['霍比特人3'] = user_dict[name][1]
        user_info[name]['神去村'] = user_dict[name][2]
        user_info[name]['泰坦尼克号'] = user_dict[name][3]
        user_info[name]['这个杀手不太冷'] = user_dict[name][4]


user_data(user_dict)

# Store the user data: the user name, then one tab-separated
# "movie, rating" line per rating, then a blank line between users
try:
    with open('user_data.txt', 'w', encoding='utf-8') as data:
        for key in user_info:
            data.write(key)
            for key2 in user_info[key]:
                data.write('\t')
                data.write(key2)
                data.write('\t')
                data.write('\t')
                data.write(str(user_info[key][key2]))
                data.write('\n')
            data.write('\n')
except IOError as err:
    print('File error: ' + str(err))
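The tab-separated file is easy to read by eye but needs a custom parser to load back. As an alternative sketch (not what the original post does), the same nested dictionary can be saved and restored with the standard `json` module. The file name `user_data.json`, the temporary directory, and the helper names are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_prefs(prefs, path):
    """Write {user: {movie: rating}} to a UTF-8 JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese movie titles readable in the file
        json.dump(prefs, f, ensure_ascii=False, indent=2)


def load_prefs(path):
    """Read the dictionary back from a JSON file written by save_prefs."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Round trip through a file in a temporary directory
sample = {'justin': {'消失的爱人': 3, '霍比特人3': 4}}
path = os.path.join(tempfile.mkdtemp(), 'user_data.json')
save_prefs(sample, path)
restored = load_prefs(path)
print(restored == sample)
```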

Computing the Pearson correlation coefficient to find like-minded users (inserting my own data):
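For reference, `sim_pearson` evaluates the Pearson coefficient in its single-pass "sums" form. Writing x_i and y_i for the two users' ratings of the n movies both have rated (symbols introduced here for illustration), the quantity it computes is:

```latex
r = \frac{\sum_i x_i y_i - \frac{1}{n}\sum_i x_i \sum_i y_i}
         {\sqrt{\left(\sum_i x_i^2 - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)^2\right)
                \left(\sum_i y_i^2 - \frac{1}{n}\Bigl(\sum_i y_i\Bigr)^2\right)}}
```

The denominator is zero when either user gave every shared movie the same rating; the code returns 0 in that case.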

from math import sqrt


# Compute the Pearson correlation (1 = perfect positive correlation, -1 = perfect negative correlation)
def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    # If there are no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Number of shared items, as a float so the divisions below are never truncated
    n = float(len(si))
    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r


# Insert my own data
user_info['me'] = {'消失的爱人': 5,
                   '神去村': 3,
                   '炸裂鼓手': 5}

# Find users whose Pearson correlation with me is > 0,
# i.e. users whose taste in movies is close to mine
for user in user_info:
    # Skip myself: my correlation with my own ratings is trivially 1
    if user == 'me':
        continue
    res = sim_pearson(user_info, 'me', user)
    if res > 0:
        print('user with taste similar to %s: %s' % ('me', user))
        print('result :%f\n' % res)
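One caveat is worth seeing in numbers: 'me' shares only two movies with every other user, and the Pearson coefficient of two points can only be 1, -1, or undefined (returned as 0 by the code above). A minimal sketch follows, using a standalone `pearson` helper that works on plain lists. The helper is hypothetical and separate from `sim_pearson`; it uses the mean-centered form of the same formula.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists; 0.0 if undefined."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    den = sqrt(var_x * var_y)
    return cov / den if den else 0.0


# Ratings of the two shared movies (消失的爱人, 神去村), taken from the data above
me = [5, 3]
others = {
    'ns2250225': [4, 4],
    'justin': [3, 3],
    'totox': [2, 5],
    'fabrice': [4, 3],
    'doreen': [3, 2],
}
scores = {name: pearson(me, ratings) for name, ratings in others.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print('%-10s %+.1f' % (name, score))
```

With so few shared ratings the coefficient only says whether two users ordered the two movies the same way, so a larger overlap would be needed before reading much into the ranking.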

Recommending movies to a user (weighted average of everyone's ratings)

# Recommend movies to a user (weighted average of everyone's ratings)
def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    # Return the list sorted from highest to lowest predicted rating
    rankings.sort(reverse=True)
    return rankings


# Recommend movies to me
res = getRecommendations(user_info, 'me')
print('Recommended movies:')
print(json.dumps(res, ensure_ascii=False))
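To make the weighting concrete, here is a small worked sketch of the score for a single unseen movie: each rating is multiplied by the rater's similarity to the target user, and the total is divided by the sum of those similarities. The similarity values, the user names `a`, `b`, `c`, and the `weighted_score` helper are made up for illustration.

```python
def weighted_score(ratings, sims):
    """Similarity-weighted average of ratings, using only raters with sim > 0."""
    total = 0.0
    sim_sum = 0.0
    for user, rating in ratings.items():
        sim = sims.get(user, 0)
        if sim <= 0:
            continue  # dissimilar users do not influence the prediction
        total += rating * sim
        sim_sum += sim
    # None signals that no similar user has rated the movie
    return total / sim_sum if sim_sum else None


# Hypothetical similarities to the target user, and ratings of one movie
sims = {'a': 1.0, 'b': 0.5, 'c': -0.5}
ratings = {'a': 5, 'b': 2, 'c': 1}
print(weighted_score(ratings, sims))  # (5*1.0 + 2*0.5) / (1.0 + 0.5) = 4.0
```

The prediction sits closer to the rating from `a`, the most similar user, than a plain mean of the three ratings would, and `c` is ignored entirely.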

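Results and analysis:

Users whose movie taste is close to mine: doreen, fabrice
Movies recommended to me: 泰坦尼克号 (Titanic), 这个杀手不太冷 (Léon: The Professional)
A user-based analysis finds people with similar tastes and recommends what those people liked, which surfaces products a user is likely to enjoy
An item-based analysis instead finds similar products and the people who like them, which surfaces potential customers for a product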
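The item-based view can reuse the same similarity code by flipping the dictionary so that movies become the outer keys. This is a minimal sketch, assuming the {user: {movie: rating}} layout used throughout. `transform_prefs` is an illustrative helper name and is not part of the original post.

```python
def transform_prefs(prefs):
    """Flip {user: {item: rating}} into {item: {user: rating}}."""
    flipped = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            flipped.setdefault(item, {})[user] = rating
    return flipped


# Small sample in the same layout as user_info
prefs = {
    'justin': {'消失的爱人': 3, '霍比特人3': 4},
    'totox': {'消失的爱人': 2, '霍比特人3': 3},
}
by_movie = transform_prefs(prefs)
print(by_movie['消失的爱人'])
```

Passing the flipped dictionary to `sim_pearson` with two movie titles in place of two user names then scores how similarly those two movies were rated.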
