Patronus AI x Databricks: Training Models for Hallucination Detection

Hallucinations in large language models (LLMs) occur when models produce responses that don’t align with factual reality or the provided context. This problem is challenging for LLM practitioners developing RAG applications, where the LLM has access to user-provided documents. For example, if LLMs used for financial question-answering or medical diagnosis produce responses that deviate from source documents, users are exposed to misinformation with significant negative consequences.

The LLM-as-a-judge paradigm has grown in popularity for detecting inaccuracies in generative AI application responses, due to its flexibility and ease of use. However, even when using top-performing models like GPT-4, LLM-as-a-judge frequently fails to evaluate responses to complex reasoning tasks accurately. Furthermore, there are concerns about the quality, transparency and cost of closed-source LMs. At the same time, there is a significant gap in performance between open source and closed-source models used for evaluation tasks, due to the lack of challenging and domain-specific publicly available datasets.
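As a rough illustration of the LLM-as-a-judge pattern described above, the sketch below builds a judge prompt and parses the verdict. The prompt wording, the `PASS`/`FAIL` JSON format, and the function names are assumptions made for this example, not the prompt used by Lynx or any other evaluator; the call to the judge model itself is deliberately left out.

```python
import json

# Hypothetical judge prompt; the wording is an assumption for illustration only.
JUDGE_TEMPLATE = """\
Given the QUESTION, DOCUMENT and ANSWER below, decide whether the ANSWER is
faithful to the DOCUMENT. The ANSWER must not add information beyond the
DOCUMENT and must not contradict it.

QUESTION: {question}
DOCUMENT: {document}
ANSWER: {answer}

Reply with JSON of the form {{"REASONING": "<why>", "SCORE": "PASS"}} or
{{"REASONING": "<why>", "SCORE": "FAIL"}}, where FAIL marks a hallucination.
"""


def build_judge_prompt(question: str, document: str, answer: str) -> str:
    """Fill the judge template with one (question, document, answer) triple."""
    return JUDGE_TEMPLATE.format(question=question, document=document, answer=answer)


def parse_verdict(raw_output: str) -> bool:
    """Return True if the judge output flags a hallucination (SCORE == FAIL)."""
    # Judges often wrap the JSON in extra text, so extract the outermost braces.
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(raw_output[start : end + 1])
    score = str(verdict.get("SCORE", "")).strip().upper()
    if score not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected SCORE value: {score!r}")
    return score == "FAIL"
```

The prompt returned by `build_judge_prompt` would be sent to whichever judge model is under evaluation, and `parse_verdict` turns its raw reply into a boolean that can be compared against the human annotation.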

At Patronus AI, we recognized the need for an automated LLM evaluation platform to instill confidence in enterprises deploying GenAI models. That’s why we built Lynx, a SOTA hallucination detection model that’s capable of using complex reasoning to identify conflicting outputs. In experiments, we found that Lynx outperformed all existing LLM-as-a-judge evaluators using closed and open source models. In domain-specific tasks, this difference was even more pronounced, with a 7.5% difference in medical question-answering.

*Responses of GPT-4o, Claude-3-Sonnet and Lynx on an example from HaluBench. The human annotation was that the example contained a hallucination.*

In this blog, we describe the process of training a SOTA hallucination detection LM with LLM Foundry, Composer and Mosaic AI Model Training.

Lynx-70B-Instruct is a finetuned Llama-3-70B-Instruct model. (In our experiments, we finetuned several additional open source models and show full results in our paper.) We chose Databricks Mosaic AI tools, including the LLM Foundry, Composer, and training cluster, because they offered more customization options and support for a wide range of language models.

We first constructed our training and evaluation datasets for a hallucination identification task using a perturbation process (see our paper for more details). To create a fine-tuning job on the Databricks Mosaic AI training infrastructure, we create a config similar to the following:

command: |
  pip install peft
  cd llm-foundry/scripts
  composer train/train.py /mnt/config/parameters.yaml
image: mosaicml/llm-foundry:2.3.0_cu121_flash2-latest
name: llama-3-70B-Instruct-${experiment_name}

compute:
  gpus: 32  # Number of GPUs to use

parameters:
  tokenizer_name: meta-llama/Meta-Llama-3-70B-Instruct
  max_seq_len: 8000
  global_seed: 17

  # Run Name
  run_name: ${run_name}

  max_split_size_mb: 512

  # Model
  model:
    name: hf_causal_lm
    init_device: mixed
    pretrained_model_name_or_path: meta-llama/Meta-Llama-3-70B-Instruct
    pretrained: true
    use_auth_token: true
    use_flash_attention_2: true

  # Tokenizer
  tokenizer:
    name: ${tokenizer_name}
    kwargs:
      model_max_length: ${max_seq_len}

  loggers:
    wandb: {"project": "hallucination-finetuning", "entity":"patronusai"}
  
save_folder:  ${save_path}

We then scheduled training jobs using the Databricks Mosaic AI CLI:

mcli run -f train_config.yaml
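The config above contains placeholders such as `${experiment_name}`, `${run_name}` and `${save_path}`. How those were filled in is not described here; one possible approach, sketched below with only the Python standard library, is to render the file from a template before submitting it. The helper name `render_config` and the example values are hypothetical. Placeholders that are not supplied (such as `${max_seq_len}`) are left untouched, on the assumption that the training script resolves them from the `parameters` block at run time.

```python
from string import Template


def render_config(template_text: str, values: dict) -> str:
    """Substitute known ${placeholders}; leave unknown ones in place."""
    # safe_substitute does not raise on missing keys, so references that are
    # meant to be resolved later (e.g. ${max_seq_len}) survive unchanged.
    return Template(template_text).safe_substitute(values)


# Shortened stand-in for the full config shown above.
template_text = """\
name: llama-3-70B-Instruct-${experiment_name}
parameters:
  max_seq_len: 8000
  run_name: ${run_name}
  tokenizer:
    kwargs:
      model_max_length: ${max_seq_len}
save_folder: ${save_path}
"""

# Example values are made up for illustration.
rendered = render_config(
    template_text,
    {
        "experiment_name": "demo",
        "run_name": "demo-run",
        "save_path": "/tmp/checkpoints",
    },
)
print(rendered)
```

The rendered text could then be written to `train_config.yaml` and submitted with the `mcli run` command shown above.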

For supervised finetuning on 70B models, we trained on 32 NVIDIA H100 GPUs, for an effective batch size of 256. To enhance performance, we used native optimizations in Composer, including FSDP and flash attention.
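To make the batch-size arithmetic concrete, the sketch below splits an effective batch across GPUs and derives the number of gradient accumulation steps. It assumes the effective batch size of 256 is the global batch summed over all 32 GPUs; the per-device microbatch size of 2 is a made-up value for illustration, not a setting reported for the Lynx runs.

```python
def batch_plan(global_batch_size: int, num_gpus: int, device_microbatch_size: int) -> dict:
    """Split a global batch across GPUs and microbatches."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the number of GPUs")
    per_device = global_batch_size // num_gpus
    if per_device % device_microbatch_size != 0:
        raise ValueError("per-device batch must be divisible by the microbatch size")
    return {
        "per_device_batch_size": per_device,
        # Number of forward/backward passes accumulated before each optimizer step.
        "grad_accum_steps": per_device // device_microbatch_size,
    }


# 256 samples over 32 GPUs; the microbatch size of 2 is hypothetical.
plan = batch_plan(global_batch_size=256, num_gpus=32, device_microbatch_size=2)
print(plan)  # {'per_device_batch_size': 8, 'grad_accum_steps': 4}
```

Under these assumptions each GPU processes 8 samples per optimizer step, and a smaller microbatch trades extra accumulation steps for lower peak memory.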

To view results in real time, we used the WandB integration with LLM Foundry to log training results to the WandB dashboard. The Mosaic AI Training console makes it easy to monitor run status, including completion status and job history from teammates.

Training Run Logs

Mosaic AI’s training platform abstracts away the complexities of deploying training runs across multiple clusters and compute providers. A training run can be launched on a GPU cluster on one cloud provider (e.g., AWS) and shifted to another provider (e.g., GCP) with no additional effort. Clusters are monitored for network and GPU faults within the training console, and faulty hardware is automatically cordoned to mitigate downtime.

Our results on HaluBench show that our finetuned model outperforms both closed-source and open source LLMs when used as judge evaluator LMs across different tasks. Lynx outperformed GPT-4o by almost 1% in accuracy averaged across all tasks, and is the best-performing open-source model by a wide margin.

HaluBench Results
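For readers who want to reproduce an "accuracy averaged across all tasks" figure on their own data, the sketch below computes per-task accuracy and then takes an unweighted mean over tasks. Whether the reported numbers use this macro average or a sample-weighted one is an assumption here, and the task names and records are toy placeholders, not HaluBench data or results.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """records: iterable of (task, predicted_hallucination, gold_hallucination)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def macro_average(per_task):
    """Unweighted mean of the per-task accuracies."""
    return sum(per_task.values()) / len(per_task)


# Toy records for illustration only.
records = [
    ("task_a", True, True),
    ("task_a", False, True),
    ("task_b", True, True),
    ("task_b", False, False),
]
per_task = accuracy_by_task(records)
overall = macro_average(per_task)
print(per_task, overall)  # {'task_a': 0.5, 'task_b': 1.0} 0.75
```

A macro average gives each task equal weight regardless of how many examples it contains, which keeps a large task from dominating the headline number.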