100 Days of Gen AI: Day 14

I started a new project to build an AI chatbot that can answer common questions about how to use Terraform Cloud. The project is inspired by my work at Benchling, since our team commonly fields these kinds of questions.

Our team has already published a detailed internal FAQ, but a chatbot could be a quicker way to sort through that lengthy document and get an answer.

My plan is to build atop either GPT-2 or BERT. After looking into both, it seems like BERT might fit this use case best, since the need is not to generate creative text responses but rather to provide answers to a discrete set of questions that are known beforehand. The google-bert/bert-base-uncased model seems like a good one to work with.

(As an aside, while I was looking at some of the 1,700+ fine-tunes of google-bert/bert-base-uncased, I came across this tabularisai/robust-sentiment-analysis model.)

The plan would be to fine-tune Hugging Face’s BERT model on a prepared dataset like this:

[
  {
    "question": "I get a 500 error when accessing the API.",
    "answer": "A 500 error indicates an internal server error. Check logs for stack traces."
  },
  {
    "question": "How do I reset my password?",
    "answer": "You can reset your password by visiting the [reset page]."
  }
]

And a fine-tuning process like:

from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load the FAQ data (same JSON structure as the example above)
dataset = load_dataset("json", data_files={"train": "faq_data.json"})

# One way to frame this: treat each FAQ entry as its own class, so the model
# learns to map an incoming question to the index of its canned answer.
num_faq_entries = dataset["train"].num_rows

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_faq_entries)

# Tokenize the questions and attach each entry's index as its label
def tokenize_function(examples, indices):
    tokenized = tokenizer(examples["question"], truncation=True, padding="max_length", max_length=512)
    tokenized["labels"] = indices
    return tokenized

tokenized_datasets = dataset.map(tokenize_function, batched=True, with_indices=True)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./bert-faq-finetuned",
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
)

# Fine-tune the model
trainer.train()

# Save the model and tokenizer
trainer.save_model("./bert-faq-finetuned")
tokenizer.save_pretrained("./bert-faq-finetuned")
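
Once fine-tuned, answering a user question could be as simple as classifying it and looking up the matching entry. Here is a rough sketch of what inference might look like, assuming the label indices line up with the entries in faq_data.json as in the training code above:

import json
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the fine-tuned model plus the original FAQ entries for answer lookup
tokenizer = BertTokenizer.from_pretrained("./bert-faq-finetuned")
model = BertForSequenceClassification.from_pretrained("./bert-faq-finetuned")
with open("faq_data.json") as f:
    faq_entries = json.load(f)

def answer(question: str) -> str:
    # Classify the incoming question and return the matching canned answer
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_index = logits.argmax(dim=-1).item()
    return faq_entries[predicted_index]["answer"]

print(answer("Why am I seeing a 500 error from the API?"))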

However, before embarking on the work of creating a dataset and fine-tuning the model, I wanted to see how effective ChatGPT could be if I just fed it the full FAQ and told it to answer questions as they came up. It was surprisingly good, so much so that I don’t think training a model is actually necessary. At a dozen or so pages, our FAQ easily fits within ChatGPT’s context window, and this would be a much faster way to build such a chatbot.
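
If I wire this up programmatically rather than pasting the FAQ into the ChatGPT UI, a minimal sketch with the OpenAI Python client might look like the following. The faq.md path and the model name here are placeholders rather than the actual setup:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the full FAQ text; at a dozen or so pages it fits comfortably in the context window
with open("faq.md") as f:
    faq_text = f.read()

def answer(question: str) -> str:
    # Stuff the entire FAQ into the system prompt and ask the model to answer from it
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer questions using only the FAQ below. "
                                          "If the FAQ does not cover the question, say so.\n\n" + faq_text},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I reset my password?"))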

I’ll implement this in the future as a Benchling work project. I’m sensitive to the proprietary nature of this FAQ, so I want to constrain any work to Benchling equipment. I also won’t be posting any details regarding the actual FAQ contents here, though I will share the general outline of the process I follow.