The computation of neg_reward is wrong

This code uses the batch-averaged (sample_rouge - baseline rouge), which doesn't make sense mathematically. This term should be computed sample-wise, because what we really want to maximize is this:
$$J(\theta)=\frac{1}{N}\sum_{i=1}^{N}\left(r(y^i)-b_{y^i}\right)\sum_{t=1}^{T}\log p_{\theta}\left(y_t^i \mid y_{1...t}^i, x^i\right)$$
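
To illustrate the difference, here is a minimal NumPy sketch, not the repository's actual code. The function names, the toy values, and the exact form of the batch-averaged variant are my assumptions about what the current code does; `seq_log_prob` stands for the per-sample sum of token log-probabilities (the inner sum over t in the objective).

```python
import numpy as np

def batch_averaged_loss(sample_rouge, baseline_rouge, seq_log_prob):
    # Assumed current behaviour: the reward difference is averaged over the
    # batch first, so every sample is weighted by the same scalar.
    advantage = np.mean(sample_rouge - baseline_rouge)
    return -advantage * np.mean(seq_log_prob)

def sample_wise_loss(sample_rouge, baseline_rouge, seq_log_prob):
    # Proposed behaviour: each sample's log-probability is weighted by its
    # own (reward - baseline) before averaging over the batch.
    advantage = sample_rouge - baseline_rouge
    return -np.mean(advantage * seq_log_prob)

# Hypothetical toy batch of two samples.
sample_rouge = np.array([0.5, 0.2])
baseline_rouge = np.array([0.3, 0.4])
seq_log_prob = np.array([-1.0, -3.0])

loss_batch = batch_averaged_loss(sample_rouge, baseline_rouge, seq_log_prob)
loss_sample = sample_wise_loss(sample_rouge, baseline_rouge, seq_log_prob)
```

In this toy batch the two reward differences cancel in the batch mean, so the batch-averaged loss is approximately zero and carries no learning signal, while the sample-wise loss is nonzero and still rewards the first sample and penalizes the second.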