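Guiding questions
1. How do we count a website's total number of hits?
2. How do we count the pages that cannot be accessed?
3. How does this article define and use Scala functions?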
Previous article
about云 log analysis in practice, log cleaning part 3: how to import a custom package in spark shell
http://www.aboutyun.com/forum.php?mod=viewthread&tid=22881
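In the previous article we added the core log-cleaning code, so what remains is computing statistics from the logs, starting with the simplest case: finding the pages that cannot be accessed.
After the import, we create an AccessLogParser instance:
val p = new AccessLogParser
This instance matters: we will use it in the steps below.
First, we need a sample of the logs to load.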
192.168.169.50 - - [17/Feb/2012:10:09:13 +0800] "GET /favicon.ico HTTP/1.1" 404 288 "-" "360se"
192.168.169.50 - - [17/Feb/2012:10:36:26 +0800] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"
192.168.169.50 - - [17/Feb/2012:10:36:26 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"
192.168.169.50 - - [17/Feb/2012:10:09:10 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; 360SE)"
192.168.55.230 - - [24/Feb/2012:09:48:58 +0800] "GET /favicon.ico HTTP/1.1" 404 288 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
192.168.169.50 - - [24/Feb/2012:09:45:03 +0800] "GET /server-status HTTP/1.1" 404 290 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; 360SE)"
192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET /icons/apache_pb.gif HTTP/1.1" 200 2326 "http://192.168.55.230/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
192.168.55.230 - - [24/Feb/2012:09:49:20 +0800] "GET /server-status HTTP/1.1" 404 290 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
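Each record in the sample follows Apache's combined log format: client IP, identity, user, timestamp, request line, status code, response size, referer, and user agent. To make that layout concrete, here is a minimal sketch that splits one record with a plain regular expression. It is not the AccessLogParser from the previous article; the names `CombinedLog`, `LogRecord`, and `parseLine` are made up for illustration, and the pattern assumes the quoted fields contain no embedded double quotes.

```scala
// Illustrative only: a plain-regex stand-in, not the AccessLogParser class.
// Can be pasted into spark-shell or the Scala REPL.

// One capture group per field of the combined log format.
val CombinedLog =
  """^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r

// Hypothetical record type holding the fields we care about.
final case class LogRecord(
  clientIp: String,
  timestamp: String,
  request: String,
  status: Int,
  responseSize: String, // kept as text because Apache may log "-" here
  referer: String,
  userAgent: String
)

// Returns None for any line that does not match the expected layout.
def parseLine(line: String): Option[LogRecord] = line match {
  case CombinedLog(ip, _, _, ts, req, status, size, referer, agent) =>
    Some(LogRecord(ip, ts, req, status.toInt, size, referer, agent))
  case _ => None
}

// First record of the sample.
val sample =
  "192.168.169.50 - - [17/Feb/2012:10:09:13 +0800] \"GET /favicon.ico HTTP/1.1\" 404 288 \"-\" \"360se\""

println(parseLine(sample).map(_.status)) // prints Some(404)
```

Returning an Option keeps malformed lines from throwing, which is convenient once the same function is mapped over a whole file.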
Save these lines as aboutyun.log.
Upload the file to Hadoop (HDFS):
hadoop fs -put aboutyun.log /
Verify that the upload succeeded.
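On the command line, the usual check is to list the target directory with hadoop fs -ls /. The same check can also be made from inside spark-shell. The sketch below assumes the shell's predefined `sc` and that its Hadoop configuration points at the cluster the file was uploaded to; the names `fs`, `target`, and `uploaded` are illustrative.

```scala
// Run inside spark-shell, where `sc` is predefined.
import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration that Spark itself was started with.
val fs = FileSystem.get(sc.hadoopConfiguration)
val target = new Path("/aboutyun.log")

// True only if the file is visible on the default file system.
val uploaded: Boolean = fs.exists(target)

if (uploaded) {
  // Report the size so an empty or truncated upload is easy to spot.
  println(s"found $target, ${fs.getFileStatus(target).getLen} bytes")
} else {
  println(s"$target is missing, check the upload step")
}
```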
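Counting the site's total hits
Next, we load the file:
var log = sc.textFile("/aboutyun.log")
Here sc has already been initialized by the shell, so we can use it directly; think of it as an instance of SparkContext.
After loading, we count the lines, which can also be read as the site's total number of hits. For this sample the total comes to 10.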
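The line count is a single action on the RDD. A minimal sketch, assuming it is run in spark-shell with `sc` available and the file already at /aboutyun.log (the name `totalHits` is illustrative):

```scala
// Run inside spark-shell, where `sc` is predefined.

// textFile is lazy: nothing is read until an action runs.
val log = sc.textFile("/aboutyun.log")

// count() is an action: it reads the file and returns the number of lines,
// one line per request, so this is the total number of hits.
val totalHits: Long = log.count()

println(s"total hits: $totalHits")
```

Because every request is logged as one line, this is only as accurate as the file is clean: blank or malformed lines are counted as hits too.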
Counting the pages that cannot be accessed
First we define a function that extracts a record's httpStatusCode, that is, its HTTP response code.