Efficient event understanding in videos and language