There was a really cute paper at the GAN workshop this year,
Generating Text via Adversarial Training
by Zhang, Gan, and Carin. In particular, they make several unusual
choices that appear important. (Warning: if you are not familiar with
GANs, this post will not make a lot of sense.)
- They use a convolutional neural network (CNN) as a
discriminator, rather than an RNN. In retrospect this seems like a good
choice, e.g., Tong Zhang has been crushing it in text classification with CNNs. CNNs are a bit easier to train than RNNs, so the net result is a powerful discriminator whose associated optimization problem is relatively easy. (A minimal sketch of such an encoder appears after this list.)
- They use a smooth approximation to the LSTM output in their generator, although this kind of trick appears everywhere, so it isn't so remarkable in isolation. (The idea is sketched after this list.)
- They use a pure moment matching criterion for the saddle point optimization (estimated over a mini-batch). GANs started with a pointwise discrimination loss, and more recent work has augmented that loss with moment matching style penalties, but here the saddle point optimization is pure moment matching. (So technically the discriminator isn't a discriminator; the text refers to it interchangeably as a discriminator or an encoder, which this explains.) A first-moment sketch of the criterion appears after this list.
- They are very smart about initialization. In particular, the
discriminator is pre-trained to distinguish between a true sentence and
the same sentence with two words swapped in position. (During
initialization, the discriminator is trained using a pointwise
classification loss.) This is interesting because swapping two words preserves many of the n-gram statistics of the input, i.e., many of the convolutional filters will
compute the exact same value. (I've had good luck recently using permuted sentences as negatives for other models; now I'm going to try swapping two words.) The swap is sketched after this list.
- They update the generator more frequently than the
discriminator, which runs counter to the standard folklore that the discriminator should move faster than the generator. Perhaps this is because the CNN optimization problem is much easier than the LSTM one; the use of a purely moment matching loss might also be relevant. (The schedule is sketched after this list.)
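For the first point, here is a minimal sketch of the kind of CNN sentence encoder that serves as the discriminator: embed the tokens, convolve over the sequence with several filter widths, and max-pool over time. The class name, embedding size, and filter widths are illustrative assumptions of mine, not values from the paper.

```python
import torch
import torch.nn as nn

class CNNSentenceEncoder(nn.Module):
    """Embed tokens, convolve with several filter widths, max-pool over time."""

    def __init__(self, vocab_size, emb_dim=300, num_filters=100, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # each width-w filter responds to a particular w-gram pattern in the embeddings
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=w) for w in widths]
        )

    def forward(self, tokens):                  # tokens: (batch, seq_len) word ids
        x = self.embed(tokens).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)         # (batch, num_filters * len(widths))
```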
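For the smooth LSTM output, a sketch of the trick under my own naming: instead of feeding back the embedding of a sampled or argmaxed word, feed back a softmax-weighted mixture of all word embeddings, which keeps the generator differentiable end to end. The paper uses a soft-argmax with a sharpening factor; the temperature here plays that role and its value is an illustrative assumption.

```python
import torch

def soft_word_embedding(logits, embedding_weight, temperature=0.01):
    # logits: (batch, vocab) scores over the vocabulary at one time step
    # embedding_weight: (vocab, emb_dim) word embedding matrix
    # a small temperature pushes the softmax toward a one-hot choice
    probs = torch.softmax(logits / temperature, dim=-1)  # (batch, vocab)
    return probs @ embedding_weight                      # (batch, emb_dim)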
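For the moment matching criterion, the simplest version matches only the mini-batch mean features; the paper's actual criterion is richer, so treat this as a first-moment sketch. Real and synthetic sentences are pushed through the CNN encoder; the generator tries to make the mean features agree while the encoder tries to make them disagree.

```python
import torch

def feature_moment_loss(real_feats, fake_feats):
    # real_feats, fake_feats: (batch, feat_dim) CNN encoder outputs for real
    # and synthetic sentences; compare their mini-batch means
    # the generator descends this loss; the encoder ascends it (the saddle point)
    return (real_feats.mean(dim=0) - fake_feats.mean(dim=0)).pow(2).sum()
```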
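For the initialization trick, the negative-example construction is just a two-position swap of a real sentence; a sketch:

```python
import random

def swap_two_words(sentence):
    # sentence: a list of tokens; return a copy with two positions swapped
    # pre-train the encoder to classify original vs. swapped with a pointwise loss
    out = list(sentence)
    if len(out) >= 2:
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```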
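And for the update schedule, in outline; the 5:1 ratio and the step functions are placeholders of mine, not values from the paper.

```python
def train(generator_step, encoder_step, num_iters, g_steps_per_d=5):
    # several generator updates per encoder ("discriminator") update,
    # the reverse of the usual GAN schedule
    for _ in range(num_iters):
        for _ in range(g_steps_per_d):
            generator_step()  # descend the moment matching loss w.r.t. the LSTM generator
        encoder_step()        # ascend it w.r.t. the CNN encoder
```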
The old complaint about neural network papers was that you couldn't
replicate them. Nowadays it is often easier to replicate neural
network papers than other papers, because you can just fork their code
on GitHub and run the experiment. However, I still find it difficult to
ascertain the relative importance of the various choices that were
made. For the choices enumerated above: what is the sensitivity of the
final result to these choices? Hard to say, but I've started to assume
the sensitivity is high, because when I have tried to tweak a result
after replicating it, it usually goes to shit. (I haven't tried to
replicate this particular result yet.)
Anyway, this paper has some cool ideas, and hopefully it can be extended to generating realistic-looking dialog.