Language model hallucinations are a big problem. Can we build LMs w/ factuality & correctness guarantees?
Conformal factuality is a simple, practical modification to any LM that uses conformal prediction to give exact high-prob. correctness guarantees
Instruction-following models are now ubiquitous, but API-only access limits research.
Today, we’re releasing info on Alpaca (solely for research use), a small but capable 7B model based on LLaMA that often behaves like OpenAI’s text-davinci-003.
Demo:
We are releasing AlpacaFarm, a simulator enabling everyone to run and study the full RLHF pipeline at a fraction of the time (<24h) and cost (<$200) w/ LLM-simulated annotators. Starting w/ Alpaca, we show RLHF gives big 10+% winrate gains vs davinci003 ()
I've been using a GPT4 paper assistant that reads the daily ArXiv feed and makes personalized recommendations in Slack. It's worked pretty well for me (today's paper demo ). If this sounds helpful, you can set up your own bot here .
I'm excited to share that I'll be joining
@Stanford
CS as an Assistant Professor starting Sept 2020 and spending the next year at Semantic Machines.
I'm incredibly grateful for the support I received from friends and colleagues and excited to continue my work at Stanford!
We know that language models (LMs) reflect opinions - from internet pre-training, to developers and crowdworkers, and even user feedback. But whose opinions actually appear in the outputs? We make LMs answer public opinion polls to find out:
How can we trust LM evals when LMs might be pretraining on the test set? We show you can prove suspected test set contamination on black-box models with false positive rate guarantees. An audit of 5 open LMs shows little evidence of strong contamination.
Congrats to
@YonatanOren_
,
@nicole__meister
,
@niladrichat
,
@faisalladhak
on getting a best paper honorable mention for their work on provably detecting test set contamination for LLMs! If you’re interested in contamination, or fun statistical tests, their talk is Thu 4-4:15pm!
Interested in differential privacy (DP) or private NLP?
Our preprint has something for both interests:
We found that privacy-preserving NLP can be painless (code below), and DP-SGD works surprisingly well on extremely large models.
(contd)
Alpaca is an instruction-tuned version of LLaMA 7B, where our 52k demonstrations are based on the self-instruct method of Wang et al w/ text-davinci-003.
Combining a small tuning dataset with a small model lets us train Alpaca quickly (3hrs on 8xA100).
Data:
We release information needed to replicate Alpaca, and we await Meta’s guidance on releasing its weights. We hope releasing Alpaca will let us better understand model failures and facilitate academic research with a strong instruction-following model.
How can we be robust to changes in unmeasured variables such as confounders?
@megha_byte
shows that we can leverage human commonsense causality to annotate data with potential unmeasured variables.
Come by our
#ICML2020
Q&A at Jul 14, 9am and 10pm PDT ()!
New work with
@daniel_d_kang
on improving training losses for more reliable natural language generation ().
Large-scale corpora are often noisy and contain undesirable content like hallucinated facts. The ubiquitous log-loss amplifies these problems. 1/2
Interested in evaluating generation? Want rigorous evaluations of model plagiarism and underdiversity? Come see "Unifying Human and Statistical Evaluation for Natural Language Generation" (w/ Percy Liang and
@hughbzhang
) on Tuesday 9:18 at Northstar A. ()
Alpaca has many flaws and open release can have negative effects, but we believe the benefits of open research outweigh the drawbacks. We discuss the decision for release on our blog and mitigate demo misuse through content filters and watermarks.
Blog:
Come see my students' ICML talks on SSL/LMs!
Studying what happens when models train on their own outputs (Oral B4,)
Quantifying opinions reflected by LMs (Oral B1,)
A new risk decomposition for SSL (Oral B5,)
NLG data can be noisy, and training on such data makes LMs replicate these issues. Can we trace and remove these examples?
Come to the contrastive error attribution poster at 11. I'll be there for
@faisalladhak
+
@esindurmusnlp
who couldn't make it (arxiv )
This was a really fun project - the observation that the conditional mutual information learned by BERT can be used directly for unsupervised dependency parsing was very neat and surprising.
Sharing my NAACL 2021 paper (w/
@tatsu_hashimoto
):
Why does random masking work so well in language model pre-training?
We show that MLM can capture the statistical dependencies between tokens and these dependencies closely mirror syntactic dependencies.
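The paper works with conditional mutual information estimated from the MLM itself; as a toy illustration of the underlying idea (not the paper's estimator), here is a minimal sketch of the plain mutual-information computation between two token positions — a high value flags a statistical dependency edge of the kind that often mirrors a syntactic one. The function name and the dictionary-based joint distribution are illustrative assumptions.

```python
import math

def mutual_information(joint):
    """I(X;Y) from a joint distribution given as {(x, y): p}.
    A high MI between two token positions indicates a statistical
    dependency edge between them."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal over first position
        py[y] = py.get(y, 0.0) + p   # marginal over second position
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
```

On an independent joint this returns 0; on a perfectly correlated pair of binary tokens it returns log 2, the maximum for that alphabet.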
Posting model-generated content on the internet can end up degrading the quality of future datasets collected on the internet.
@rtaori13
did some neat work studying when this will / won't be a big problem.
🎉 The last few weeks have seen the release of
#StableDiffusion
,
#OPT
, and other large models.
⚠️ But should we be concerned about an irreversible influx of AI content on the internet?
⚙️ Will this make it harder to collect clean training data for future AI models?
🧵👇 1/6
Is ChatGPT now in a steady state rather than explosive growth? The Google Trends chart (US, 1-year, "ChatGPT" search term) is surprising and interesting. I expected a return to rapid growth once school was back in session.
🚀Thrilled to launch the Workshop on the Next Generation of AI Safety at
#ICML2024
! Dive into the future of AI safety. CFP & more details 👉
#NextGenAISafety
#ICML2024
@srush_nlp
We analyzed some of these data contamination and stability questions in a paper last year . Roughly, if the generated data are indistinguishable from the original distribution (in a total variation sense) and you make stability assumptions on the learner, the dynamics are stable and not too bad.
Please help disseminate! Flexible post-doc position at SAIL working with research groups of your choice. Hopefully a useful opportunity for people waiting out the hiring freeze.
AI postdocs available! The Stanford AI Lab is trying to help in the current
#COVID19
pandemic. Some of that is via research but another need is jobs for great young people. We’re opening positions for 2 years of innovative research with Stanford AI Faculty
We also find that newer human-feedback tuned models are not only more left-leaning than base LMs, but often collapse onto the dominant liberal viewpoint (e.g., 99% approval for Joe Biden) and attempts to steer LMs towards specific groups lead to only modest improvements.
Happy to share our paper on aligning LLaMA 7B with LoRA-based RLHF!
LoRA uses 2 A100s compared to 8 for full model tuning, and yields higher win rate on AlpacaFarm with only 10h training ✅
More details below:
We create a new dataset of opinion polls, OpinionQA, and compare LM responses to those of 60 demographic groups in the US. With this, we can quantitatively and comprehensively characterize who current LMs are aligned to.
The core idea is a reduction from LM factuality to conformal prediction. We do this by associating each LM output with an 'entailment set' where the set containing any true fact implies correctness. Conformal prediction can then provide the necessary containment guarantees.
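At the heart of this reduction is the standard split-conformal calibration step. The sketch below is a minimal, generic illustration of that step, not the paper's implementation — the scoring function, the claim-filtering direction, and all names here are illustrative assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split conformal prediction: given scores on a held-out calibration
    set, return the cutoff at the conformal (1 - alpha) quantile, which
    gives >= 1 - alpha coverage for exchangeable test points."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))   # conformal quantile index
    return sorted(cal_scores)[min(k, n) - 1]

def keep_claims(claims, scores, tau):
    """Retain only the sub-claims whose confidence clears the calibrated
    cutoff; the filtered output inherits the coverage guarantee."""
    return [c for c, s in zip(claims, scores) if s >= tau]
```

With five calibration scores `[1, 2, 3, 4, 5]` and `alpha = 0.5`, the cutoff lands at 3; claims scoring below the cutoff are dropped rather than asserted.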
OpinionQA is now a part of HELM and you can get OpinionQA here (). We hope our work helps improve the broader discourse on opinions and values that LMs do or should reflect.
w/
@ShibaniSan
,
@esindurmusnlp
,
@faisalladhak
, cinoo lee, and
@percyliang
Neat way to incorporate unlabeled data into distributionally robust optimization (along with ). The duals work out surprisingly nicely (though convergence rates are probably still nonparametric).
Our foray into “robust learning” (and messy duality proofs!). Charlie noticed issues with DRL using transport (fig1) and developed a model/algorithm to address using additional unlabeled data. Algorithm is built on a hard-fought dual (thm2) worked out with Ed and
@sebastianclaici
Replicating experiments in CS can be hard, but it turns out it's nothing compared to what can happen in chemistry. Experiments that consistently work can *one day suddenly stop working, and never work again*
Our approach lets users target (almost) any probability level for correctness and factuality, and the LM attains nearly exactly this target correctness level (marginally over the calibration and test sets).
Why did I do that? 🤔 Today in lab meeting we discussed "Rationalization is rational" (). Thanks
@fierycushman
for a thought-provoking and beautifully written paper!
We find the RLHF simulator to be very accurate.
The simulated annotators are close to humans in agreement rate (65% vs 66%) at 1/45th the cost, and rankings of methods trained in simulation agree with rankings of methods trained on real human feedback.
You should also check out the many other cool contamination-related papers at this ICLR like Shi et al (), Golchin et al (), and Roberts et al (). Each has a different take on this problem!
Finally, w/ the AlpacaFarm release, we are releasing easy-to-run code for the simulator and RLHF methods, as well as all 40k human and simulated preference data (data: , code: )
After talking to many students about their grad school experience I compiled this blog post on "How to pick your grad school". I discuss all the important factors and details from contrasting but complementary perspectives. I hope it will be helpful!
This one is a fun one to read, and the intro is refreshingly upfront about the limitations of this approach (log-n factors and high probability bounds).
A fantastic new paper by Thomas Steinke and Lydia Zakynthinou (
@shortstein
and
@zakynthinou
). They use Conditional Mutual Information as a perspective to understand generalization, capturing VC dimension, compression schemes, differential privacy, & more.
The resulting system is practical. Here are some random, non-cherry-picked outputs on FactScore, NQ, and MATH for a system providing 80% (FactScore) and 90% (NQ, MATH) guarantees.
We find that on topics from abortion to automation, there is substantial misalignment between the opinions of LMs and those of these groups - as divisive as the Democrat-Republican divide on climate change. Some groups (65+, Mormon, widowed) are poorly reflected in all current LMs.
On tasks like FactScore, NaturalQuestions, and MATH, we can take low to moderate base factuality levels (<30% for FactScore, <80% for the others) and boost them to 80-90% factuality while keeping most of the LM outputs.
Please see our paper for other details, like how it’s necessary to emulate inter- and intra-annotator variability to build a simulator that captures important phenomena like overoptimization ()
This is because log-loss prevents models from assigning zero probability to noisy test sequences.
Adaptively truncating the loss solves this by optimizing for model distinguishability. Empirically, we find improvements in factuality on Gigaword summarization. 2/2
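The actual method adapts the truncation level during training; as a minimal sketch of the core step only, the idea is to drop the highest-loss fraction of a batch (disproportionately noisy references that log-loss would otherwise force the model to cover) and average the rest. The function name and fixed `drop_frac` are illustrative assumptions.

```python
def truncated_loss(example_losses, drop_frac=0.1):
    """Loss-truncation sketch: discard the top drop_frac fraction of
    per-example losses in a batch and average the remainder, so a few
    noisy references can't dominate the training signal."""
    n = len(example_losses)
    keep = max(1, n - int(drop_frac * n))   # number of examples retained
    kept = sorted(example_losses)[:keep]    # lowest-loss examples
    return sum(kept) / len(kept)
```

A single outlier loss of 100 in a batch of otherwise unit losses is simply excluded, leaving the average at 1.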
Additionally, to enable fast and reproducible evals, we define a new automatic eval for instruction following and combine many existing eval datasets. This aggregated eval set agrees very well with the simple but real human instructions from the Alpaca live demo.
Our reference RLHF implementations give substantial improvements. The best method (PPO) provides major gains over Alpaca on win rate vs davinci003 (41->55%), and we find it gives much more detailed explanations for answers.
@BlancheMinerva
@AiEleuther
Funny story: we started with the Pile, but didn't find test sets (at least large, exchangeable ones), so we trained our own positive control with contaminants. The Pile and Pythia are great for this kind of work, but we needed a lower-quality, contaminated dataset for our experiments
Our insight is that models memorize the order of examples seen in training. Since most datasets are exchangeable (order doesn't matter), a preference for a canonical ordering may indicate prior exposure: the model could only know the ordering if it saw the data during training.
Running this test on open models, we find little evidence that these LMs strongly memorize benchmark test sets. The exceptions are Mistral on ARC, and both LLaMA and Mistral on MMLU, but these could be due to multiple hypothesis testing. See the paper for a detailed discussion.
Our test can reliably detect benchmarks that are included more than once in a controlled experiment where we pretrained models with known contamination, with 100% success rate at duplication rates above 10, and 50% success at rates above 2.
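The paper's actual test uses sharded likelihood comparisons for statistical power; purely as an illustration of the rank-based logic it builds on, here is a toy sketch assuming you can already score each ordering's log-likelihood under the model. The function name is an illustrative assumption.

```python
def ordering_pvalue(canonical_ll, shuffled_lls):
    """Exchangeability-test sketch: if the model never saw the test set,
    the canonical ordering is just one random ordering, so the rank of
    its log-likelihood among shuffled orderings yields a valid p-value.
    A tiny p-value means the model prefers the canonical order."""
    better = sum(ll >= canonical_ll for ll in shuffled_lls)
    return (better + 1) / (len(shuffled_lls) + 1)
```

If the canonical ordering beats all three shuffles, the p-value is 1/4; with more shuffles, a memorized ordering drives the p-value toward zero.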
It's a few lines of code () to make huggingface transformers (BERT, GPT2, etc) differentially private.
The library also has nice memory tricks by
@lxuechen
to scale DP-SGD to large language models.
(contd)
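This is not the library's API — just a minimal, self-contained sketch of the generic DP-SGD mechanism (per-example clipping plus Gaussian noise) that such libraries wrap for transformers, on toy gradient vectors represented as Python lists. All names and defaults here are illustrative assumptions.

```python
import random

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_mult=1.0):
    """Core DP-SGD update: clip each per-example gradient to clip_norm,
    sum the clipped gradients, add Gaussian noise scaled to the clipping
    norm, then average over the batch."""
    clipped = []
    for g in per_example_grads:
        norm = sum(x * x for x in g) ** 0.5
        scale = min(1.0, clip_norm / max(norm, 1e-12))  # clip, never amplify
        clipped.append([x * scale for x in g])
    dim = len(per_example_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + random.gauss(0.0, noise_mult * clip_norm) for s in summed]
    return [x / len(per_example_grads) for x in noisy]
```

With the noise multiplier set to zero, a gradient of norm 5 under `clip_norm=1.0` is rescaled to norm 1, which makes the clipping step easy to sanity-check in isolation.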
@soldni
@SemanticScholar
The author API is a real gem. Very easy to match on desired authors or filter for basic author stats. I wish I could use the
@SemanticScholar
API for the ArXiv feed update part too, instead of hitting the ArXiv RSS endpoint.
@haileysch__
Our cost estimates of naively running this on text-davinci-003 across all the benchmarks we wanted were a bit terrifying. We do have ideas on dealing with this, and hopefully will have more positive things to say soon.
On the DP side: we surprisingly observe no 'curse of dimensionality' with large pretrained models, and DP-SGD improves as models get larger!
Often the baseline of DP-SGD over all parameters beats parameter-efficient methods.
(contd)
This was work with
@lxuechen
,
@florian_tramer
,
@percyliang
.
You should also check out (thanks to
@thegautamkamath
for an earlier shoutout to our work)
They show that you can also get high-performance private models using low-rank tuning methods.
On the private NLP side, we show provable privacy is easy:
Fine-tuning language models with DP-SGD nearly matches nonprivate performance for a wide range of tasks, spanning classification, table2text generation, and dialog generation.
Want to try it? (contd)
@gabemukobi
Human (Chris) generated. The correctness annotation comes from the claim being technically true (correct but irrelevant). Our guarantees are always w.r.t. the annotator, so technically all guarantees are "Chris would judge 90% of this as correct"
@BlancheMinerva
@haileysch__
The naive (non-shared) version was something like 10k per dataset... but I think we just have a fundamentally better design for this now, so I may reach out if we manage to find something a bit better and viable.
@lreyzin
@ben_golub
This seems like a variant of the Reichenbach common cause principle, which is pretty interesting stuff. It has some pretty extensive discussion and purported counterexamples here -
@srush_nlp
I think it's nuanced. You might have learning algorithms that don't have the right stability properties, and a second risk is the human data distribution changes from exposure to LLMs, and/or humans stop producing content, removing the stabilizing effect of human data.
@srush_nlp
@sebgehr
@fernandaedi
@mrdrozdov
Is the film a lot nicer than writing normally on the tablet? I can never get used to doing math on the tablet because of how the pen glides on the screen.
@ml_angelopoulos
@Eric_Wallace_
I don't think Chris is twitter active but I'll send him this thread. Section 3.3 in conformal risk control was quite nice and a good inspiration for us.