专栏名称: 云技术实践
关注云计算,云技术,云运维,云存储,存储,分布式,OpenStack,SDN,Ceph,虚拟化,运维,分享在云计算/虚拟化/运维项目实施中的资讯、经验、技术,坚持干货。
目录
相关文章推荐
架构师之路  ·  你的提示词根本只是在浪费算力,让deepse ... ·  2 天前  
架构师之路  ·  90%的用户不知道!触发DeepSeek深度 ... ·  3 天前  
架构师之路  ·  你的提示词根本只是在浪费算力,能让deeps ... ·  4 天前  
51好读  ›  专栏  ›  云技术实践

kafka监控实战(jmxtrans+InfluxDb+Grafana)

云技术实践  · 公众号  · 架构  · 2017-08-23 20:01

正文

前言


从上周一直在调研找一款好用的kafka监控,我测试使用过的KafkaOffsetMonitor、Burrow、kafka-monitor、Kafka-Manager,他们各有优缺点,具体情况我这里就不展开描述了,大家可以到它们的git上去查看, 并且它们基本上都是监控topic的写入和读取等等,没有提供对于整体集群的监控信息,比如集群的分片、延时、内存使用情况等等,无意中发现了jmxtrans,jmxtrans它是一个通过jmx采集java应用的数据采集器,他的输出可以是Graphite、StatsD、Ganglia、InfluxDb等等,刚好我们现有的监控是通过InfluxDb做数据存储的,通过Grafana做展示,下面就给大家介绍一下jmxtrans+InfluxDb+Grafana监控kafka的整体解决方案,并且不需要任何额外的开发工作,完全使用原生的。


二、环境介绍


1、角色

a、10.10.10.10   InfluxDb

b、10.10.10.100  Grafana

c、10.10.30.69   jmxtrans

d、kafka集群

10.10.20.14  node1

10.10.20.15 node2

10.10.20.16 node3

10.10.20.17 node4


2、软件版本

influxdb-1.2.4-1.x86_64

grafana-4.1.1-1484211277.x86_64

jmxtrans-266.rpm

kafka_2.10-0.9.0.0.jar.asc


3、架构图


三、配置规划


1、jmxtrans我们可以分别在每台kafka节点上部署,也可以部署到一台机器上,我这里是选择了后者,因为我的集群小,这样配置文件可以集中管理,如果集群比较大,可以考虑分散部署。


2、关于jmxtrans的配置文件,分全局指标(每个kafka节点)和topic指标,全局指标每个节点一个配置文件,命名规则:base_10.10.20.14.json,topic指标是每个topic一个配置文件,命名规则:falcon_monitor_us_17.json


四、监控指标


1、全局指标

每秒输入的流量

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec"

"attr" : [ "Count" ]

"resultAlias":"BytesInPerSec"

"tags"     : {"application" : "BytesInPerSec"}


每秒输入的流量

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec"

"attr" : [ "Count" ]

"resultAlias":"BytesOutPerSec"

"tags"     : {"application" : "BytesOutPerSec"}


每秒输入的流量

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec"

"attr" : [ "Count" ]

"resultAlias":"BytesRejectedPerSec"

"tags"     : {"application" : "BytesRejectedPerSec"}


每秒的消息写入总量

"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

"attr" : [ "Count" ]

"resultAlias":"MessagesInPerSec"

"tags"     : {"application" : "MessagesInPerSec"}


每秒FetchFollower的请求次数

"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower"

"attr" : [ "Count" ]

"resultAlias":"RequestsPerSec"

"tags"     : {"request" : "FetchFollower"}


每秒FetchConsumer的请求次数

"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer"

"attr" : [ "Count" ]

"resultAlias":"RequestsPerSec"

"tags"     : {"request" : "FetchConsumer"}


每秒Produce的请求次数

"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce"

"attr" : [ "Count" ]

"resultAlias":"RequestsPerSec"

"tags"     : {"request" : "Produce"}


内存使用的使用情况

"obj" : "java.lang:type=Memory"

"attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ]

"resultAlias":"MemoryUsage"

"tags"     : {"application" : "MemoryUsage"}


GC的耗时和次数

"obj" : "java.lang:type=GarbageCollector,name=*"

"attr" : [ "CollectionCount","CollectionTime" ]

"resultAlias":"GC"

"tags"     : {"application" : "GC"}


线程的使用情况

"obj" : "java.lang:type=Threading"

"attr" : [ "PeakThreadCount","ThreadCount" ]

"resultAlias":"Thread"

"tags"     : {"application" : "Thread"}


副本落后主分片的最大消息数量

"obj" : "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica"

"attr" : [ "Value" ]

"resultAlias":"ReplicaFetcherManager"

"tags"     : {"application" : "MaxLag"}


该broker上的partition的数量

"obj" : "kafka.server:type=ReplicaManager,name=PartitionCount"

"attr" : [ "Value" ]

"resultAlias":"ReplicaManager"

"tags"     : {"application" : "PartitionCount"}


正在做复制的partition的数量

"obj" : "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"

"attr" : [ "Value" ]

"resultAlias":"ReplicaManager"

"tags"     : {"application" : "UnderReplicatedPartitions"}


Leader的replica的数量

"obj" : "kafka.server:type=ReplicaManager,name=LeaderCount"

"attr" : [ "Value" ]

"resultAlias":"ReplicaManager"

"tags"     : {"application" : "LeaderCount"}


一个请求FetchConsumer耗费的所有时间

"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer"

"attr" : [ "Count","Max" ]

"resultAlias":"TotalTimeMs"

"tags"     : {"application" : "FetchConsumer"}


一个请求FetchFollower耗费的所有时间

"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower"

"attr" : [ "Count","Max" ]

"resultAlias":"TotalTimeMs"

"tags"     : {"application" : "FetchFollower"}


一个请求Produce耗费的所有时间

"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"

"attr" : [ "Count","Max" ]

"resultAlias":"TotalTimeMs"

"tags"     : {"application" : "Produce"}


2、topic的监控指标

falcon_monitor_us每秒的写入流量

"kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=falcon_monitor_us"

"attr" : [ "Count" ]

"resultAlias":"falcon_monitor_us"

"tags"     : {"application" : "BytesInPerSec"}


falcon_monitor_us每秒的输出流量

"kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=falcon_monitor_us"

"attr" : [ "Count" ]

"resultAlias":"falcon_monitor_us"

"tags"     : {"application" : "BytesOutPerSec"}


falcon_monitor_us每秒写入消息的数量

"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=falcon_monitor_us"

"attr" : [ "Count" ]

"resultAlias":"falcon_monitor_us"

"tags"     : {"application" : "MessagesInPerSec"}


falcon_monitor_us在每个分区最后的Offset

"obj" : "kafka.log:type=Log,name=LogEndOffset,topic=falcon_monitor_us,partition=*"

"attr" : [ "Value" ]

"resultAlias":"falcon_monitor_us"

"tags"     : {"application" : "LogEndOffset"}


PS:

1、参数说明

"obj"对应jmx的ObjectName,就是我们要监控的指标

"attr"对应ObjectName的属性,可以理解为我们要监控的指标的值

"resultAlias"对应metric 的名称,在InfluxDb里面就是MEASUREMENTS名

"tags" 对应InfluxDb的tag功能,对与存储在同一个MEASUREMENTS里面的不同监控指标可以做区分,我们在用Grafana绘图的时候会用到,建议对每个监控指标都打上tags


2、对于全局监控,每一个监控指标对应一个MEASUREMENTS,所有的kafka节点同一个监控指标数据写同一个MEASUREMENTS ,对于topc监控的监控指标,同一个topic所有kafka节点写到同一个MEASUREMENTS,并且以topic名称命名


五、安装


1、kafka

这里不详细介绍kafka集群的安装,主要说一下kafka的启动方式,因为我们需要通过jmx采集kafka的监控数据,所以在kafka的启动时候需要启动jmx端口,启动方式如下:

cd /data/kafka/bin/

JMX_PORT=9999 nohup ./kafka-server-start.sh ../config/server.properties  >/dev/null 2>&1 &


2、InfluxDb

yum -y install  influxdb   ##安装

/etc/init.d/influxdb start ##启动服务

[root@ip-10-10-10-10 jmxtrans]# influx

Connected to http://localhost:8086 version 1.3.2

InfluxDB shell version: 1.3.2

> CREATE USER "root" WITH PASSWORD '123456' WITH ALL PRIVILEGES    ##添加一个账号

>


3、Grafana

yum -y install  grafana          ##安装

/etc/init.d/grafana-server start ##启动服务


4、jmxtrans

wget http://central.maven.org/maven2/org/jmxtrans/jmxtrans/266/jmxtrans-266.rpm

rpm -ivh jmxtrans-266.rpm  ##安装

/etc/init.d/jmxtrans start ##启动


六、配置

这里主要介绍jmxtrans采集数据的配置文件撰写和Grafana绘图的配置注意事项,kafka和InfluxDb的配置这里不做描述。


1、jmxtrans

a、jmxtrans默认读取/var/lib/jmxtrans下的配置文件去采集数据的,所以我们把采集kafka监控数据的配置文件都在这个目录下,下面是我的配置文件命名规范:

[root@ip-10-10-30-69 jmxtrans]# ll

total 96

-rw-r--r-- 1 root root 1657 Aug 18 17:03 article-feedback-10min-json_14.json

-rw-r--r-- 1 root root 1657 Aug 18 17:03 article-feedback-10min-json_15.json

-rw-r--r-- 1 root root 1657 Aug 18 17:04 article-feedback-10min-json_16.json

-rw-r--r-- 1 root root 1657 Aug 18 17:04 article-feedback-10min-json_17.json

-rw-r--r-- 1 root root 8430 Aug 22 08:24 base_10.10.20.14.json

-rw-r--r-- 1 root root 8431 Aug 22 08:24 base_10.10.20.15.json

-rw-r--r-- 1 root root 8431 Aug 22 08:25 base_10.10.20.16.json

-rw-r--r-- 1 root root 8431 Aug 22 08:25 base_10.10.20.17.json

-rw-r--r-- 1 root root 2027 Aug 21 16:19 falcon_monitor_us_14.json

-rw-r--r-- 1 root root 2027 Aug 21 16:20 falcon_monitor_us_15.json

-rw-r--r-- 1 root root 2484 Aug 21 20:58 falcon_monitor_us_16.json

-rw-r--r-- 1 root root 2027 Aug 21 16:20 falcon_monitor_us_17.json

-rw-r--r-- 1 root root 2147 Aug 21 17:43 highgmp-articles-through-primary_14.json

-rw-r--r-- 1 root root 2147 Aug 21 17:46 highgmp-articles-through-primary_15.json

-rw-r--r-- 1 root root 2147 Aug 21 17:46 highgmp-articles-through-primary_16.json

-rw-r--r-- 1 root root 2147 Aug 21 17:47 highgmp-articles-through-primary_17.json

[root@ip-10-10-30-69 jmxtrans]# pwd

/var/lib/jmxtrans


b、全局监控的配置文件,以10.10.20.14为例:

[root@ip-10-10-30-69 jmxtrans]# cat base_10.10.20.14.json

{

"servers" : [ {

"port" : "9999",

"host" : "10.10.20.14",

"queries" : [ {

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",

"attr" : [ "Count","OneMinuteRate" ],

"resultAlias":"BytesInPerSec",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "BytesInPerSec"}

} ]

},

{

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec",

"attr" : [ "Count","OneMinuteRate" ],

"resultAlias":"BytesOutPerSec",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "BytesOutPerSec"}

} ]

},

{

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec",

"attr" : [ "Count","OneMinuteRate" ],

"resultAlias":"BytesRejectedPerSec",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "BytesRejectedPerSec"}

} ]

},

{

"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",

"attr" : [ "Count","OneMinuteRate" ],

"resultAlias":"MessagesInPerSec",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "MessagesInPerSec"}

} ]

},

{

"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer",

"attr" : [ "Count" ],

"resultAlias":"RequestsPerSec",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"request" : "FetchConsumer"}

} ]

},

{

"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower",

"attr" : [ "Count" ],

"resultAlias":"RequestsPerSec",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"request" : "FetchFollower"}

} ]

},

{

"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce",

"attr" : [ "Count" ],

"resultAlias":"RequestsPerSec",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"request" : "Produce"}

} ]

},

{

"obj" : "java.lang:type=Memory",

"attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ],

"resultAlias":"MemoryUsage",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "MemoryUsage"}

} ]

},

{

"obj" : "java.lang:type=GarbageCollector,name=*",

"attr" : [ "CollectionCount","CollectionTime" ],

"resultAlias":"GC",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "GC"}

} ]

},

{

"obj" : "java.lang:type=Threading",

"attr" : [ "PeakThreadCount","ThreadCount" ],

"resultAlias":"Thread",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "Thread"}

} ]

},

{

"obj" : "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica",

"attr" : [ "Value" ],

"resultAlias":"ReplicaFetcherManager",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "MaxLag"}

} ]

},

{

"obj" : "kafka.server:type=ReplicaManager,name=PartitionCount",

"attr" : [ "Value" ],

"resultAlias":"ReplicaManager",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "PartitionCount"}

} ]

},

{

"obj" : "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",

"attr" : [ "Value" ],

"resultAlias":"ReplicaManager",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "UnderReplicatedPartitions"}

} ]

},

{

"obj" : "kafka.server:type=ReplicaManager,name=LeaderCount",

"attr" : [ "Value" ],

"resultAlias":"ReplicaManager",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "LeaderCount"}

} ]

},

{

"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer",

"attr" : [ "Count","Max" ],

"resultAlias":"TotalTimeMs",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "FetchConsumer"}

} ]

},

{

"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower",

"attr" : [ "Count","Max" ],

"resultAlias":"TotalTimeMs",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "FetchConsumer"}

} ]

},

{

"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",

"attr" : [ "Count","Max" ],

"resultAlias":"TotalTimeMs",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "Produce"}

} ]

},

{

"obj" : "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec",

"attr" : [ "Count" ],

"resultAlias":"ReplicaManager",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "IsrShrinksPerSec"}

} ]

}

]

} ]

}


c、topic监控的配置文件,以falcon_monitor_us的10.10.20.14节点为例:


[root@ip-10-10-30-69 jmxtrans]# cat  falcon_monitor_us_14.json

{

"servers" : [ {

"port" : "9999",

"host" : "10.10.20.14",

"queries" : [ {

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=falcon_monitor_us",

"attr" : [ "Count" ],

"resultAlias":"falcon_monitor_us",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "BytesInPerSec"}

} ]

},

{

"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=falcon_monitor_us",

"attr" : [ "Count" ],

"resultAlias":"falcon_monitor_us",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "BytesOutPerSec"}

} ]

},

{

"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=falcon_monitor_us",

"attr" : [ "Count" ],

"resultAlias":"falcon_monitor_us",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "MessagesInPerSec"}

} ]

},

{

"obj" : "kafka.log:type=Log,name=LogEndOffset,topic=falcon_monitor_us,partition=*",

"attr" : [ "Value" ],

"resultAlias":"falcon_monitor_us",

"outputWriters" : [ {

"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",

"url" : "http://10.10.10.10:8086/",

"username" : "root",

"password" : "root",

"database" : "jmxDB",

"tags"     : {"application" : "LogEndOffset"}

} ]

}

]

} ]

}


2、Grafana配置

a、添加数据源

Url、Database、User、Password需要和jmxtrans采集数据配置文件里面的写一致,然后点击Save&Test,提示成功就正常了

b、创建一个dashboard,然后在这里配置每一个监控指标的图







请到「今天看啥」查看全文