Welcome to follow my GitHub and my 简书 (Jianshu).
This article shows how to run a TensorFlow program on a server with GPUs.
Normally you do not need to specify CPU or GPU explicitly; TensorFlow detects the available devices automatically. If a GPU is found, TensorFlow runs as many operations as possible on the first GPU it detects. When the machine has more than one GPU, the additional GPUs do not take part in the computation by default; to use them, you must assign ops to them explicitly, e.g. with tf.device('/gpu:%d' % i), which pins ops to a specific CPU or GPU.
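For illustration, here is a minimal placement sketch (not part of the article's source); allow_soft_placement lets TensorFlow fall back to the CPU when the requested GPU is unavailable:

import tensorflow as tf

# Pin a small matmul to GPU 0 and log where each op actually runs.
with tf.device('/gpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                      log_device_placement=True)) as sess:
    print(sess.run(c))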
The source code for this article is on GitHub, in the multi_gpu_train folder.
First, check how many GPUs on the current server are usable by TensorFlow.
from tensorflow.python.client import device_lib


def get_available_gpus():
    """
    Inspect the GPUs from the shell with: nvidia-smi
    Check which processes occupy them with: ps aux | grep PID
    :return: the names of the visible GPU devices
    """
    local_device_protos = device_lib.list_local_devices()
    print "all: %s" % [x.name for x in local_device_protos]
    print "gpu: %s" % [x.name for x in local_device_protos if x.device_type == 'GPU']
    return [x.name for x in local_device_protos if x.device_type == 'GPU']
The default (CPU-only) TensorFlow package cannot see the GPUs; install the GPU build instead:
pip install --upgrade tensorflow-gpu==1.2 -i http://mirrors.aliyun.com/pypi/simple --trusted-host mirrors.aliyun.com
Check whether the GPU libraries are on the loader path:
echo $LD_LIBRARY_PATH
Check the server's GPUs:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.20 Driver Version: 375.20 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m Off | 0000:04:00.0 Off | 0 |
| N/A 29C P0 68W / 235W | 0MiB / 11471MiB | 85% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
tensorflow-gpu==1.3.0 fails because libcudnn.so.6 cannot be found; rolling back to version 1.2 resolves it:
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
If the libcudnn library is missing, download the cuDNN library from NVIDIA's official website (link).
The devices finally displayed:
all: [u'/cpu:0', u'/gpu:0']
gpu: [u'/gpu:0']
Launch TensorBoard to visualize training:
tensorboard --logdir=/tmp/cifar10_train --port=8008
Source code
Use the tf.gfile module to prepare the training directory, then call the core method train():
def main(argv=None):  # pylint: disable=unused-argument
    cifar10.maybe_download_and_extract()  # download the dataset
    # Standard directory handling with the tf.gfile module
    if tf.gfile.Exists(FLAGS.train_dir):  # if previous training data exists
        tf.gfile.DeleteRecursively(FLAGS.train_dir)  # delete it recursively
    tf.gfile.MakeDirs(FLAGS.train_dir)  # create a fresh directory
    train()  # core method: train


if __name__ == '__main__':
    tf.app.run()
Create global_step, the training step counter, which is incremented automatically during training. Its name is global_step, its shape is [] (a scalar), its initial value is 0, and it is not a trainable parameter.
def train():
    """Train CIFAR-10 for a number of steps."""
    with tf.Graph().as_default(), tf.device('/cpu:0'):  # build the graph on CPU 0 by default
        # trainable=False: the global step is only incremented, never trained.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        # Number of batches per epoch: batch_size is 128, so 50000 / 128 = 390.625
        num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                                 FLAGS.batch_size)
        decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)  # steps between learning-rate decays

        lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                        global_step,
                                        decay_steps,
                                        cifar10.LEARNING_RATE_DECAY_FACTOR,
                                        staircase=True)  # learning-rate schedule, lr = Learning Rate

        opt = tf.train.GradientDescentOptimizer(lr)  # the optimizer takes the learning rate
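With staircase=True, tf.train.exponential_decay multiplies the learning rate by the decay factor once every decay_steps steps. A small sketch of the schedule, using the CIFAR-10 tutorial's default constants (initial rate 0.1, 350 epochs per decay, factor 0.1; treat these values as assumptions):

# Staircase exponential decay: lr = INITIAL_LR * FACTOR ** (step // decay_steps)
INITIAL_LR, FACTOR = 0.1, 0.1               # assumed cifar10.py defaults
num_batches_per_epoch = 50000 / 128.0       # ~390.6 batches per epoch
decay_steps = int(num_batches_per_epoch * 350)  # ~136718 steps between decays

def decayed_lr(step):
    return INITIAL_LR * FACTOR ** (step // decay_steps)

print(decayed_lr(0), decayed_lr(decay_steps), decayed_lr(2 * decay_steps))
# ~0.1, ~0.01 (after the first decay), ~0.001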
Use a prefetch queue to obtain batch_queue:
        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()  # distorted training images and their labels
        batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
            [images, labels], capacity=2 * FLAGS.num_gpus)  # prefetch queue
Build one tower per GPU; if fewer GPUs are available, only the available ones are used. The gradients of every tower are collected in tower_grads, reuse_variables() shares the variables across towers, and summaries gathers the summary data from the scope.
        tower_grads = []
        with tf.variable_scope(tf.get_variable_scope()):  # share the variable scope
            for i in xrange(FLAGS.num_gpus):  # loop over the available GPUs
                with tf.device('/gpu:%d' % i):  # pin this tower to GPU i
                    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                        # Only as many towers are built as there are GPUs.
                        print('running: %s_%d' % (cifar10.TOWER_NAME, i))
                        # Dequeues one batch for the GPU
                        image_batch, label_batch = batch_queue.dequeue()
                        # Calculate the loss for one tower of the CIFAR model. This function
                        # constructs the entire CIFAR model but shares the variables across
                        # all towers.
                        loss = tower_loss(scope, image_batch, label_batch)  # per-tower loss
                        # Reuse variables for the next tower.
                        tf.get_variable_scope().reuse_variables()  # share variables across towers
                        # Retain the summaries from the final tower.
                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                        # Calculate the gradients for the batch of data on this CIFAR tower.
                        grads = opt.compute_gradients(loss)  # gradients for this tower
                        # Keep track of the gradients across all towers.
                        tower_grads.append(grads)  # tower_grads collects every tower's gradients
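tower_loss is defined in the same source file. A sketch of what it typically looks like in the CIFAR-10 multi-GPU tutorial, assuming the cifar10.inference and cifar10.loss helpers from cifar10.py (details may differ from the article's exact code):

def tower_loss(scope, images, labels):
    """Build the CIFAR model on one tower and return its total loss."""
    logits = cifar10.inference(images)    # forward pass; variables are shared across towers
    _ = cifar10.loss(logits, labels)      # adds the cross-entropy loss to the 'losses' collection
    losses = tf.get_collection('losses', scope)       # only this tower's losses
    total_loss = tf.add_n(losses, name='total_loss')  # sum them into one scalar
    return total_loss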
Average the gradients over the towers, let the optimizer opt apply the averaged gradients, and record them in summaries.
        # We must calculate the mean of each gradient. Note that this is the
        # synchronization point across all towers.
        grads = average_gradients(tower_grads)  # average the gradients over the towers

        # Add a summary to track the learning rate.
        summaries.append(tf.summary.scalar('learning_rate', lr))

        # Add histograms for gradients.
        for grad, var in grads:
            if grad is not None:
                summaries.append(tf.summary.histogram(var.op.name + '/gradients', grad))

        # Apply the gradients to adjust the shared variables.
        apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

        # Add histograms for trainable variables.
        for var in tf.trainable_variables():
            summaries.append(tf.summary.histogram(var.op.name, var))
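average_gradients also comes from the source file. A sketch consistent with how it is called above (assuming the standard CIFAR-10 tutorial implementation): for each variable, it stacks the per-tower gradients and takes their mean.

def average_gradients(tower_grads):
    """tower_grads: one list per tower, each a list of (gradient, variable) pairs."""
    average_grads = []
    for grad_and_vars in zip(*tower_grads):            # group the same variable across towers
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)  # mean over the tower dimension
        v = grad_and_vars[0][1]                        # the variable is shared; take it from tower 0
        average_grads.append((grad, v))
    return average_grads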
Maintain exponential moving averages of the trainable variables:
        variable_averages = tf.train.ExponentialMovingAverage(
            cifar10.MOVING_AVERAGE_DECAY, global_step)  # exponential moving average of the variables
        variables_averages_op = variable_averages.apply(tf.trainable_variables())  # op that updates the shadow variables
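tf.train.ExponentialMovingAverage keeps a shadow copy of each variable and, every time variables_averages_op runs, moves the shadow towards the current value (when global_step is passed, the decay is additionally capped by min(decay, (1 + step) / (10 + step))). The update rule, in plain Python for illustration:

# Shadow update applied per variable by variables_averages_op (illustrative only):
def ema_update(shadow, value, decay=0.9999):
    return decay * shadow + (1 - decay) * value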
The training op and the summary op:
        # Group all updates into a single train op.
        train_op = tf.group(apply_gradient_op, variables_averages_op)  # the combined training op

        # Create a saver.
        saver = tf.train.Saver(tf.global_variables())  # checkpoint saver

        # Build the summary operation from the last tower summaries.
        summary_op = tf.summary.merge(summaries)  # merge all summaries
Initialize the variables and start the session:
        # Build an initialization operation to run below.
        init = tf.global_variables_initializer()

        # Start running operations on the Graph. allow_soft_placement must be set to
        # True to build towers on GPU, as some of the ops do not have GPU
        # implementations.
        sess = tf.Session(config=tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=FLAGS.log_device_placement))
        sess.run(init)

        # Start the queue runners.
        tf.train.start_queue_runners(sess=sess)

        summary_writer = tf.summary.FileWriter(FLAGS.train_dir, sess.graph)  # where the summaries are written
Print progress every 10 steps, write summaries every 100 steps, and save a checkpoint every 1000 steps:
        for step in xrange(FLAGS.max_steps):
            start_time = time.time()
            _, loss_value = sess.run([train_op, loss])
            duration = time.time() - start_time

            assert not np.isnan(loss_value), 'Model diverged with loss = NaN'

            if step % 10 == 0:
                num_examples_per_step = FLAGS.batch_size * FLAGS.num_gpus
                examples_per_sec = num_examples_per_step / duration
                sec_per_batch = duration / FLAGS.num_gpus

                format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                              'sec/batch)')
                print(format_str % (datetime.now(), step, loss_value,
                                    examples_per_sec, sec_per_batch))

            if step % 100 == 0:
                summary_str = sess.run(summary_op)
                summary_writer.add_summary(summary_str, step)

            # Save the model checkpoint periodically.
            if step % 1000 == 0 or (step + 1) == FLAGS.max_steps:
                checkpoint_path = os.path.join(FLAGS.train_dir, 'model.ckpt')
                saver.save(sess, checkpoint_path, global_step=step)
The core idea is to compute gradients on multiple GPUs in parallel, average those gradients, and maintain exponential moving averages of the variables. The TensorBoard output:
OK, that's all!