Testing functionality isn’t the same as correctness.
Oh excuse me then, what is correctness?
int add(int a, int b) { return a + b; }
This code is clearly functional; it’ll compile and execute.
However, the customer actually needs the code to do a saturating add.
With that knowledge, we can clearly see that the code is not correct. It will not saturate; it will wrap around instead.
Without that knowledge, an LLM will happily write some basic unit tests that won’t cover the saturation edge case, and the bug will live on until it’s hit in prod.
If you’re lucky, and your function doco is good, the LLM might spot the bug, and notify you.
My personal preference for how to generate tests is to ask the agent to write specific tests. E.g: “write a test for add that demonstrates that it saturates”.
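To make that concrete, here is a rough sketch of what a saturating version and its targeted test might look like. It assumes the snippet is C and that "saturate" means clamping to INT_MAX / INT_MIN; neither is stated in the thread, and test_add_saturates is a made-up name. Note that in C, signed overflow is formally undefined behaviour (it only wraps in practice on most platforms), so the sketch checks the bounds before adding.

```c
#include <assert.h>
#include <limits.h>

/* Saturating add (assumed requirement): clamp to INT_MAX / INT_MIN
 * instead of overflowing. The bounds are checked before the addition,
 * because signed overflow is undefined behaviour in C. */
int add(int a, int b) {
    if (b > 0 && a > INT_MAX - b) return INT_MAX;  /* would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b) return INT_MIN;  /* would go below INT_MIN */
    return a + b;
}

/* Hypothetical test: "demonstrate that add saturates". */
void test_add_saturates(void) {
    assert(add(INT_MAX, 1) == INT_MAX);        /* upper bound clamps */
    assert(add(INT_MIN, -1) == INT_MIN);       /* lower bound clamps */
    assert(add(INT_MAX, INT_MAX) == INT_MAX);
    assert(add(2, 3) == 5);                    /* normal case unchanged */
    assert(add(INT_MAX, INT_MIN) == -1);       /* mixed signs never clamp */
}
```

Run against the original one-liner, the first two assertions are exactly the ones that a generic "write some unit tests" prompt would likely never produce.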
IMO this is a bad example as in theory, testers test code against requirements, and if there is no such req stating anything about saturation then how should the testers or in this case the LLM know?
It is oversimplified, but there are often implicit requirements that a human would be aware of from the broader context that the LLM may not be, i.e. add is used to increment a health bar, so wraparound doesn’t make sense.
Yeah, I had testers that tested the functionality of a delay… but had set the delay parameter to zero. Well, good thing this one case worked, but you didn’t check anything beyond that for correctness at all.
Timing and tests, name a better migraine duo :D.
We continuously create tests that ensure a process completes in a set amount of time, and every time, we don’t give them enough leeway, so the tests fail randomly whenever the CI runner gets overloaded.
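One way to picture the leeway problem is a timing check with an explicit slack factor on top of the nominal budget. This is only a sketch under assumptions: it uses POSIX clock_gettime with CLOCK_MONOTONIC, and finishes_within, quick_work and the slack parameter are all hypothetical names, not anything from the tests described above. A slack factor makes the check more tolerant of a busy runner; it does not remove the flakiness entirely.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdbool.h>
#include <time.h>

/* Milliseconds elapsed between two monotonic timestamps. */
static double elapsed_ms(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) * 1000.0
         + (end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Runs work() once and reports whether it finished within
 * budget_ms * slack. The slack factor is the explicit leeway for
 * overloaded CI runners (an assumed design, not a known fix). */
bool finishes_within(void (*work)(void), double budget_ms, double slack) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    work();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ms(start, end) <= budget_ms * slack;
}

/* Hypothetical stand-in for the process under test. */
void quick_work(void) {
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000; ++i) sum += i;
}
```

A test would then call something like finishes_within(quick_work, 50.0, 4.0), keeping the nominal budget and the CI tolerance as separate, visible numbers instead of one hand-tuned threshold.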