Natural language evaluation with humans in the loop and statistical estimators