Human-centric video understanding with weak supervision