100 Days of Gen AI: Day 27
Today I did a few things:
- Verified incoming requests are actually from Slack
- Wrote out several unit tests
- Evaluated the quality of LLM responses
I used ChatGPT 4o Canvas for the first two tasks. It handled the first task perfectly, producing code that ran correctly on the first attempt (though I did some minor refactoring afterwards). The unit testing, however, took several rounds of back and forth, and it completely failed to write the last (more complicated) test no matter how many times I tried. Every unit test it wrote failed with some error on its first run; sometimes feeding the error back to ChatGPT was enough to fix it, sometimes not.
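For reference, verifying that a request really comes from Slack boils down to recomputing Slack's documented v0 request signature with the app's signing secret and comparing it to the `X-Slack-Signature` header. Here's a minimal sketch of that check; the function and variable names are mine, not the bot's actual code:

```python
import hashlib
import hmac
import time

def verify_slack_request(signing_secret: str, headers: dict, body: bytes) -> bool:
    """Return True if the request matches Slack's v0 signing scheme."""
    timestamp = headers.get("X-Slack-Request-Timestamp", "")
    signature = headers.get("X-Slack-Signature", "")

    # Reject missing or stale timestamps (Slack suggests ~5 minutes) to block replays.
    if not timestamp.isdigit() or abs(time.time() - int(timestamp)) > 60 * 5:
        return False

    # Recompute the signature: HMAC-SHA256 over "v0:<timestamp>:<raw request body>".
    basestring = f"v0:{timestamp}:{body.decode('utf-8')}"
    expected = "v0=" + hmac.new(
        signing_secret.encode("utf-8"),
        basestring.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()

    # Constant-time comparison so the check doesn't leak timing information.
    return hmac.compare_digest(expected, signature)
```

The simplest tests for something like this sign a known body with a dummy secret and assert it is accepted, then tamper with the body or timestamp and assert it is rejected.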
I also spent some time evaluating the quality of the Slack bot's responses and found some of them lacking. The bot answers questions in two steps (a rough sketch follows the list):
- Retrieval
- Response generation
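Here's a minimal sketch of that pipeline, assuming a Bedrock knowledge base queried via the boto3 `bedrock-agent-runtime` and `bedrock-runtime` clients; the IDs, model choice, and prompt wording are placeholders, not necessarily what the bot uses:

```python
import boto3

kb_client = boto3.client("bedrock-agent-runtime")
llm_client = boto3.client("bedrock-runtime")

def answer_question(question: str, kb_id: str, model_id: str) -> str:
    # Step 1: retrieval - pull the top-matching chunks from the knowledge base.
    retrieval = kb_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    context = "\n\n".join(r["content"]["text"] for r in retrieval["retrievalResults"])

    # Step 2: response generation - hand the retrieved chunks to the model as context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = llm_client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```

The quality of the final answer is bounded by what retrieval returns in step 1, which is where the problems turned out to be.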
I traced the issues to poor-quality data coming out of the retrieval step. I had left chunking for the vector database at the Bedrock default of 300 tokens per chunk (roughly a paragraph). However, where a chunk breaks is arbitrary. Some questions have documented answers in our FAQ that run multiple paragraphs, so the retrieval step can never return the full answer; and even when the answer is short, the chunk boundary can still fall midway through it. In both cases the generation step is working from an incomplete answer.
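For context, that 300-token setting is the chunking configuration applied when documents are ingested into the knowledge base. Bedrock's default looks roughly like this (a sketch of the `chunkingConfiguration` inside the data source's `vectorIngestionConfiguration`):

```python
# Bedrock's default fixed-size chunking: ~300-token chunks with 20% overlap.
# Chunk boundaries fall wherever the token budget runs out, regardless of whether
# that's mid-answer, mid-list, or mid-sentence.
default_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": 300,
            "overlapPercentage": 20,
        },
    }
}
```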
In one clear example, our documented solution had eight numbered steps. The retrieval step returned only steps 1 and 2, so the generated response was confident but badly incomplete.
I did some research into different retrieval chunking strategies and found one that seemed best suited to our use case: hierarchical chunking. With this strategy, when retrieval matches a chunk (the child), it also returns the larger chunk of context around it (the parent). I set these to the defaults (child: 300 tokens, parent: 1,500 tokens), and it improved the quality of the answers. I'm still not fully satisfied, though: even a 1,500-token parent chunk can split in the middle of a documented answer and return a partial one, and the generation step is now working with far more text, much of it irrelevant, which increases the likelihood of a misleading answer.
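For reference, here's roughly how the switch looks when setting up the Bedrock data source with boto3, using the default parent/child sizes mentioned above. The knowledge base ID, data source name, bucket ARN, and overlap value are placeholders, and since chunking is applied at ingestion time, changing the strategy means re-syncing the documents:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

bedrock_agent.create_data_source(
    knowledgeBaseId="KB_ID",   # placeholder
    name="faq-docs",           # placeholder
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-faq-bucket"},  # placeholder
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # Child chunks are embedded and matched at query time; when a
                # 300-token child matches, its surrounding 1,500-token parent
                # is what gets returned to the generation step.
                "levelConfigurations": [
                    {"maxTokens": 1500},  # parent
                    {"maxTokens": 300},   # child
                ],
                "overlapTokens": 60,
            },
        }
    },
)
```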