Causal Distillation for Language Models