Bubbles up rate limit information so that we can retry after a certain
duration if needed higher up in the stack.
Also caps the number of concurrent evals running at once to also help.
Release Notes:
- N/A
Replace hardcoded 0.10 threshold with configurable parameter and set
0.05 default for most tests, with 0.2 for from_pixels_constructor
eval that produces more mismatched tags.
Release Notes:
- N/A
We run the unit evals once a day in the middle of the night, and trigger
a Slack post if it fails.
Release Notes:
- N/A
---------
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>