When do gradient methods work well in non-convex learning problems?