Fine-grained image and video analysis with limited supervision