• slevinkelevra@sh.itjust.works
    arrow-up
    14
    ·
    2 days ago

    That is just the thing: developer and tester should never be the same person, let alone the same AI model. IMO testing is never taken seriously enough; it is seen as an unnecessary step and merged together with dev testing. From my years of experience, I know that everything testers find is explained away rather than properly addressed, and then with all of the obvious stuff in the way you never see the real issues.

    • nymnympseudonym@piefed.social
      English
      arrow-up
      4
      arrow-down
      5
      ·
      2 days ago

      Interested in how much actual experience you have with AI-generated test suites.

      My code was never tested this well.

      • CameronDev@programming.dev
        arrow-up
        16
        arrow-down
        1
        ·
        2 days ago

        I have experience with AI-generated test suites, and while it’s good for generating coverage, it isn’t so good for actually ensuring correctness, which is the actual point.

        I’ve watched the robot happily introduce bugs to pass broken tests, and also break tests to match code, and everything in between.

        I don’t want lots of tests, I want good tests.

        • mermella@piefed.social
          English
          arrow-up
          2
          arrow-down
          1
          ·
          2 days ago

          You have to prompt for that; I do it regularly, along with refactors: ‘Examine all tests to ensure they are testing functionality and not just passing a test.’ It finds them and will work on it. I think the problem continues to be engineering discipline. People are lazy with AI on multiple levels, not just copy-pasta slop.

              • CameronDev@programming.dev
                arrow-up
                1
                ·
                1 day ago

                int add(int a, int b) {
                    return a + b;
                }

                This code is clearly functional, it’ll compile and execute.

                However, the customer actually needs the code to do a saturating add.

                With that knowledge, we can clearly see that the code is not correct. It will not saturate, it will wrap around instead.


                Without that knowledge, an LLM will happily write some basic unit tests that won’t cover the saturation edge case, and the bug would live on until it’s hit in prod.

                If you’re lucky, and your function doco is good, the LLM might spot the bug, and notify you.

                My personal preference for how to generate tests is to ask the agent to write specific tests, e.g.: “write a test for add that demonstrates that it saturates”.
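For illustration, here is a rough sketch of what such a targeted test might look like in C, assuming the requirement is to clamp at INT_MAX and INT_MIN. The names saturating_add and test_add_saturates are made up for this example and are not from any real codebase.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical saturating variant of add: clamps to INT_MAX / INT_MIN
 * instead of overflowing. The bounds are checked before adding, so the
 * addition itself never overflows. */
int saturating_add(int a, int b) {
    if (b > 0 && a > INT_MAX - b) {
        return INT_MAX;  /* result would exceed INT_MAX */
    }
    if (b < 0 && a < INT_MIN - b) {
        return INT_MIN;  /* result would fall below INT_MIN */
    }
    return a + b;
}

/* The specific test asked for: it demonstrates saturation at both ends,
 * plus one ordinary case to show normal addition is unaffected. */
void test_add_saturates(void) {
    assert(saturating_add(INT_MAX, 1) == INT_MAX);
    assert(saturating_add(INT_MIN, -1) == INT_MIN);
    assert(saturating_add(INT_MAX, INT_MAX) == INT_MAX);
    assert(saturating_add(2, 3) == 5);
}
```

Run against the plain add shown earlier, the first three assertions would be expected to fail, which is the point: the test encodes the requirement rather than the current behaviour.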

                • slevinkelevra@sh.itjust.works
                  arrow-up
                  2
                  ·
                  18 hours ago

                  IMO this is a bad example: in theory, testers test code against requirements, and if no requirement says anything about saturation, then how should the testers, or in this case the LLM, know?

                  • CameronDev@programming.dev
                    arrow-up
                    1
                    ·
                    17 hours ago

                    It is oversimplified, but there are often implicit requirements that a human would be aware of from the broader context that the LLM may not be.

                    E.g., add is used to increment a health bar, so wraparound doesn’t make sense.

            • slevinkelevra@sh.itjust.works
              arrow-up
              2
              arrow-down
              1
              ·
              1 day ago

              Yeah, I had testers who tested the functionality of a delay… but had set the delay parameter to zero. Well, good thing this one case worked, but you didn’t check anything beyond that for correctness at all.
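A minimal sketch of the difference, assuming a POSIX system: the test below exercises non-zero delays as well as zero and checks that at least the requested time has passed. The names delay_ms, measure_delay_ms and test_delay_waits_at_least_requested are hypothetical stand-ins, not the actual code those testers were working with.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical function under test: blocks for about `ms` milliseconds. */
void delay_ms(long ms) {
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };
    /* nanosleep can return early on a signal; resume with the remainder. */
    while (nanosleep(&req, &req) == -1 && errno == EINTR) {
    }
}

/* Whole milliseconds elapsed on the monotonic clock while delay_ms runs. */
long measure_delay_ms(long ms) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    delay_ms(ms);
    clock_gettime(CLOCK_MONOTONIC, &end);
    long long ns = (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    return (long)(ns / 1000000LL);
}

/* Zero alone proves almost nothing, so non-zero values are included. */
void test_delay_waits_at_least_requested(void) {
    const long cases_ms[] = { 0, 10, 50 };
    for (size_t i = 0; i < sizeof cases_ms / sizeof cases_ms[0]; i++) {
        assert(measure_delay_ms(cases_ms[i]) >= cases_ms[i]);
    }
}
```

Only the lower bound is asserted here; a loaded machine can always make a sleep take longer, so an upper bound needs separate care.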

              • CameronDev@programming.dev
                arrow-up
                1
                ·
                1 day ago

                Timing and tests, name a better migraine duo :D.

                We continuously create tests that ensure a process completes in a set amount of time, and every time, we don’t give them enough leeway, and the test will fail randomly if the CI runner gets overloaded.
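One possible way to build that leeway in, sketched in C under assumptions: the time budget is scaled by a factor that CI can raise on slow or shared runners. The environment variable TEST_TIMING_LEEWAY and the helper names are invented for this sketch and are not part of any real test framework.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Parse a leeway multiplier; missing, malformed or sub-1.0 values fall
 * back to 1.0 so the budget is never tightened by accident. */
double parse_leeway(const char *s) {
    if (s == NULL) {
        return 1.0;
    }
    double factor = strtod(s, NULL);
    return factor >= 1.0 ? factor : 1.0;
}

/* Hypothetical knob: CI could export TEST_TIMING_LEEWAY=5 on busy runners. */
double timing_leeway(void) {
    return parse_leeway(getenv("TEST_TIMING_LEEWAY"));
}

/* True if the measured time fits the budget after applying the leeway. */
bool within_budget(double elapsed_ms, double budget_ms, double leeway) {
    return elapsed_ms <= budget_ms * leeway;
}
```

A timing test would then assert within_budget(elapsed, budget, timing_leeway()) instead of comparing against a hard-coded limit, which keeps the strict budget for local runs while letting an overloaded runner loosen it.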