🍷 FineWeb technical report is out and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.
Technical report:
Dataset:
Debugging large scale pretraining is hard and expensive. We noticed that our CodeParrot training performed significantly worse than the equivalent Megatron training pipeline. Time for some investigation! 🕵️♀️ (1/n)
Last week we released StarCoder, a state-of-the-art code LLM with 15B parameters. With the release of StarCoder we also conducted a comprehensive evaluation of code LLMs.
Some interesting findings and how we got there ✨:
Today is my first day at
@huggingface
as a Machine Learning Research intern. I'm thrilled to be joining such an amazing team and to contribute to democratizing ML 🤗
In the past two weeks, we've seen 4 new code models drop: StableCode, OctoCoder, OctoGeeX, and DeciCoder 🚀
Everyone's talking about "HumanEval" – so how does code evaluation work & what makes reproducibility challenging?
A thread 🧵:
This is my second week as an intern at
@huggingface
. I am working on code models and as a first project I did some exploration of a large multilingual code dataset 💻. A thread:
Since the introduction of Codex a lot has happened in the ML for code space with many large code models open-sourced. We built an interactive blog to compare all of these models and explain how they are trained and evaluated ✨:
🌌 Cosmopedia: The largest open synthetic dataset of textbooks, blogposts and stories generated by Mixtral with a total of 25B tokens and 30M files 🚀
A little backstory to this "cosmic" journey: Two weeks ago I started experimenting with some cool web
You probably heard of Megatron-LM for large model training, if you haven't tried it yet, we made this blog to guide you step by step.
You can train language models fast and convert them to 🤗 Transformers.
🧵 Here's how we're tackling text de-identification to remove personal information from code datasets at
@BigCodeProject
:
- An annotated benchmark 📑
- A pipeline for PII detection and anonymization 🚀
- A demo to visualize anonymized samples 🔍
(1/n)
Last week we released StarChatβ, an instruction-tuned version of 💫StarCoder+.
In this thread, I will share some cool examples we generated with StarChat and show how it can be used to help you with coding questions.
A thread 🧵
After the successful launch of 💫StarCoder, our new 15B LLM for code, I'm happy to present our previous paper "SantaCoder: don't reach for the stars!" aka smol StarCoder this afternoon at the ICLR
@DL4Code
workshop.
#ICLR2023
Here's a Colab I shared internally at HF for running synthetic data generation at scale with llm-swarm
It's very easy to use (once you have a Slurm cluster 🙈): just provide your million-prompt dataset and a model, and let it run 🐝!
Would people be
@nvidia
@DBahdanau
Debugging, although complex, pays off: fixing small bugs is worthwhile even when they don't solve the initial problem ✨. It can be tricky for training scripts, as some changes need more time for their impact to show. We are now ready and training more models! Stay tuned 🚀 (14/14)
🚀 Fine-tune SantaCoder on code generation datasets with this repo:
A Google Colab by
@mrm8488
is also available.
✨ Bonus: we fine-tuned SantaCoder on Jupyter Notebooks to make it explain code
Here's the full list of the resources and links:
- Cosmopedia dataset:
- Cosmo-1B model:
- GitHub code:
⚡ Other libraries we used:
- llm-swarm for large scale synthetic data generation:
The much-debated Phind and WizardCoder Models have joined the BigCode Leaderboard, evaluated across 10+ languages! 🌍
It took some time to get the correct evaluation results due to some evaluation subtleties. Dive into our exploration🕵️♂️🧵:
With many great resources for code models scattered around, it is hard to keep track. We’ve added several code-related datasets, models and metrics to the 🤗 hub for downstream tasks. Want to learn how to train models to estimate algorithmic complexity or explain code? A thread:
Inspired by the Open LLM LeaderBoard, and with several strong code models released, we created a Multilingual Code Leaderboard:
📊 10+ programming languages
⚡Throughput measurement
🔬 Fully reproducible
✉️ Open for submission of results
🔥 Do you want an open and versatile code assistant? Today, we are delighted to introduce CodeQwen1.5-7B and CodeQwen1.5-7B-Chat, two specialized code LLMs built upon the Qwen1.5 language model!
🔋 CodeQwen1.5 has been pretrained with 3T tokens of code-related data and exhibits
We are releasing all the tools we developed under open-access, and we hope they will advance the code generation space. Releasing a model is not just releasing a checkpoint ✨
You can find all relevant links at:
Paper:
@nvidia
@DBahdanau
🥁 And indeed when training with all these fixes a bit longer we noticed that the discrepancy was gone and we now match Megatron's performance 🥳! (13/n)
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training.
Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover
@nvidia
@DBahdanau
Next we tried the data loader from Megatron. Initially we had a shuffling problem (credits to
@DBahdanau
), the files were shuffled but not the sequences, so sequences from a long file could fill up a single batch. This improved the training considerably but wasn't enough. (7/n)
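A toy sketch of what sequence-level shuffling looks like (illustrative chunking code, not Megatron's actual loader):

```python
import random

def sequence_level_batches(tokenized_files, seq_len, batch_size, seed=0):
    """Toy sketch: shuffle at the *sequence* level, not just the file level.

    If only the files are shuffled, the sequences cut from one long file
    stay contiguous and can fill an entire batch with correlated data.
    (Plain token lists stand in for tokenized files here.)
    """
    sequences = []
    for tokens in tokenized_files:
        # chunk every file into fixed-length training sequences
        for i in range(0, len(tokens) - seq_len + 1, seq_len):
            sequences.append(tokens[i : i + seq_len])
    random.Random(seed).shuffle(sequences)  # the crucial extra shuffle
    return [
        sequences[i : i + batch_size]
        for i in range(0, len(sequences) - batch_size + 1, batch_size)
    ]
```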
When testing the model, we were struck by its conversational and reasoning abilities when we added a series of dialogues to the context: It can act as a tech assistant. Evaluation on HELM reasoning tasks:
@nvidia
@DBahdanau
At this stage, we thought we might’ve missed something in the points above. We found there was a difference between the frameworks in the scaling of the attention weights in mixed precision. We fixed the difference and finally got a slight improvement! (9/n)
Based on popular demand, we trained smaller versions of 💫 StarCoder.
Check this leaderboard for more details on their performance compared to other base code models:
🌌 News from the StarCoder cosmos!
We trained smaller versions of StarCoder: 1B, 3B and 7B models.
1T tokens, 80+ programming languages with 8k context window, MQA & FIM.
Last week, I gave a keynote about
@BigCodeProject
and
@huggingface
ecosystem to 1500 attendants at the
@KubeCon_
&
@CloudNativeFdn
summit and GOSIM conference in Shanghai. It was a great chance to meet the Open Source community and discuss AI!
Slides:
Another cool feature for Gradio. We used these API endpoints in our Code Generation blog to call 3 other spaces in parallel threads without needing to load many large models in one space.
Code Generation Blog:
CodeGen space:
Really stoked to share Gradio's new "Use via API" page
1⃣ Build a
@Gradio
app (or find one on Spaces)
2⃣ Click on the "Use via API" link in the footer
3⃣ See the expected payload and try it out immediately
4⃣ View handy code snippets in Python or JS
Embed ML everywhere!
Introducing: StarCoder2 and The Stack v2 ⭐️
StarCoder2 is trained with a 16k token context and repo-level information for 4T+ tokens. All built on The Stack v2 - the largest code dataset with 900B+ tokens.
All code, data and models are fully open!
Another architectural change that feels like a must for every new language model is Multi-Query Attention: you can process larger batches, and faster!
If you've ever evaluated a code model, you know how necessary that is
A very underrated architecture tweak to GPT is multi-query attention (MQA): sharing a single key/value head across all attention heads saves a lot of memory in the KV-cache.
Max generation batch size on a Colab GPU with a 1B model:❗️512❗️ vs 32 (vanilla GPT)
Test it here:
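A quick back-of-the-envelope for the memory saving (the config numbers below are illustrative, not any specific model's):

```python
# Rough KV-cache accounting: multi-head vs multi-query attention.
# (All sizes below are illustrative, not any particular model's config.)
n_layers, n_heads, head_dim, seq_len, batch = 24, 16, 128, 2048, 32
bytes_per_value = 2  # fp16

def kv_cache_bytes(n_kv_heads: int) -> int:
    # K and V tensors per layer, each [batch, n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * bytes_per_value

mha = kv_cache_bytes(n_heads)  # vanilla multi-head: one K/V pair per head
mqa = kv_cache_bytes(1)        # multi-query: a single K/V pair shared by all heads
print(f"MHA: {mha / 2**30:.2f} GiB, MQA: {mqa / 2**30:.2f} GiB ({n_heads}x smaller)")
```

The cache shrinks by exactly the head count, which is what lets generation batch sizes grow by an order of magnitude.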
@nvidia
@DBahdanau
First, we observed that our loss had more noise than Megatron's. We used distributed training and were plotting the training loss of the main worker only; plotting the average over the workers made the loss way smoother! (3/n)
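The logging fix can be sketched with torch.distributed (a generic sketch, not our actual training script):

```python
import torch
import torch.distributed as dist

def logged_loss(local_loss: torch.Tensor) -> float:
    """Average the loss over all data-parallel workers before logging.

    Plotting only rank 0's loss shows the noise of a single data shard;
    the all-reduced mean across workers is much smoother. Falls back to
    the local loss when not running distributed.
    """
    loss = local_loss.detach().clone()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss /= dist.get_world_size()
    return loss.item()
```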
@nvidia
@DBahdanau
Then we thought maybe the optimizer was the issue: the transformers implementation of AdamW is slightly different from PyTorch's and will be deprecated. Switching to PyTorch's AdamW shows better behavior after the warmup stage, but then the performance becomes similar. (11/n)
@nvidia
@DBahdanau
The only thing left to check was the rest of the training script. We found a bug in the weight decay! However, it didn't seem to impact the training in the short run. We also used 🤗 Trainer to replace our training loop in case there was another bug, but there wasn't. (8/n)
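A common guard against that class of bug is explicit optimizer parameter groups; a generic PyTorch sketch (not our actual script, and `weight_decay=0.1` is just a placeholder value):

```python
import torch

def make_param_groups(model: torch.nn.Module, weight_decay: float = 0.1):
    """Apply weight decay only where it belongs: not to biases or to
    LayerNorm weights (all of which are 1-D parameters)."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # biases and LayerNorm weights are 1-D; matrices get decayed
        (no_decay if param.ndim == 1 else decay).append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# optimizer = torch.optim.AdamW(make_param_groups(model), lr=2e-4)
```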
@Grady_Booch
We also have a lot of incorrect & inappropriate information on the web that is used to train today's LLMs. At least with synthetic data you have some control over what you generate. Hallucination remains an issue for sure; there are methods to attenuate it like
Glad to see Cosmopedia as the top trending dataset on the HF hub since we released it 3 days ago 🚀
Cosmo-1b model:
This is our attempt at reproducing the dataset used to train Phi models but with an Open Source model. It’s a
@nvidia
@DBahdanau
Naturally we compared the architectures of GPT2 in both frameworks to look for any differences. The only difference we found was in the GELU activation function, they use slightly different implementations, but this didn’t impact the training. (5/n)
A LOT of data curation.🕵️♀️ We manually inspected 50-100 files for all the extensions in the selected programming languages and chose adequate filters.
We also added GitHub Issues, Git Commits and Jupyter Notebooks.
Overview of some of the spaces we built:
@nvidia
@DBahdanau
But we got excited too early, after some discussions with the authors of this change they explained that it doesn't impact the training in the long run. And indeed, going beyond 2000 steps makes the gap go away. (10/n)
Megatron is a framework developed by
@Nvidia
for training large transformer models.
@DBahdanau
observed a training gap for CodeParrot, a GPT2 model for code generation, between Megatron and our script in transformers. (2/n)
@nvidia
@DBahdanau
Another candidate for the difference was the optimizer, we used AdamW from transformers, while in Megatron they used AdamW from Apex. We trained the model with the latter but it didn’t seem to solve the problem 😓 (6/n)
@nvidia
@DBahdanau
Then we suspected that the initialization weights could be different between the two frameworks, but it turned out they followed the same distributions. (4/n)
⚖️ StarCoder is not just a strong model: it sets a new standard for data governance. We trained only on permissively licensed data, with an opt-out mechanism and no PII.
We also implemented tools for code attribution, like a membership test and a search index.
Here's a nice tool for visualizing and clustering your dataset into topics 🔍
We used it to understand topic coverage and filter web samples when building Cosmopedia:
Text clustering at home? Yes, with text-clustering, a tiny smol repo:
The pipeline is fully built on open tools:
1⃣embed with sentence-transformers
2⃣project to 2d with UMAP
3⃣run DBSCAN clustering
4⃣label clusters with Mixtral
Runs in 5-10min and tada:
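The core of steps 1⃣-3⃣ fits in a few lines; here's a minimal sketch with the embedding step injected as a callable (so it could be sentence-transformers' `encode`, optionally followed by a UMAP projection), while step 4⃣'s labeling would prompt an LLM with samples from each cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_texts(texts, embed, eps=0.5, min_samples=3):
    """Embed the texts, then density-cluster the (projected) embeddings.

    `embed` is any callable mapping a list of strings to a 2-D array,
    e.g. SentenceTransformer("all-MiniLM-L6-v2").encode, optionally
    followed by umap.UMAP(n_components=2).fit_transform for the 2-D
    projection. Returns one cluster label per text (-1 = noise).
    """
    points = np.asarray(embed(texts))
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
```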
Is your code in 📑 The Stack? Check whether your repositories are in the dataset that large language models for code will learn from!
You don't want your code to be part of The Stack? Follow the opt-out instruction and we'll remove it!
We trained for multiple epochs without performance degradation. The scaling laws indicate that a 15B model trained on the 300B tokens we had would be very undertrained.
=> We repeat data until the magical 1T tokens ✨
The loss kept going down!
Code generation with language models is cool. Complexity prediction is even cooler!
This Gradio space predicts the complexity of Java programs using a code model ⏱️ :
Can't wait to hear about StarCoder from
@LoubnaBenAllal1
! ICYMI, StarCoder is a code LLM from Hugging Face with 15B params & 8k context, trained on 1T tokens of permissive data in 80+ programming languages.
Starts in 24 hrs: May 16, 9am PST. RSVP below.
🇲🇦 Proud to see this platform we developed with fellow Moroccans help in earthquake relief efforts. To contribute please visit:
We have a map for coordination + forms and a WhatsApp bot for victims, witnesses & NGOs:
Architecture-wise, we used FlashAttention to increase our context window to 8k 🚀
This can help when large context is needed for example to support repository-level information in IDE integrations.
StarCoder outperforms OpenAI's code-cushman-001 and all open code generation models on HumanEval. On other benchmarks like DS-1000 the gap is even larger.
DS-1000 includes more diverse and realistic data science problems spanning 7 libraries.
You can reproduce most of these numbers, with our code evaluation harness:
It includes MultiPL-E (HumanEval in 18 languages), HumanEval, DS-1000, and PaL-GSM8k, among others, with a multi-GPU setup & Docker containers for execution.
Code Llama with
@huggingface
🤗 Yesterday,
@MetaAI
released Code Llama, a family of open-access code LLMs!
Today, we release the integration in the Hugging Face ecosystem🔥
Models:
👉
blog post:
👉
Blog post covers how to use it!
Next Tuesday, I will give a webinar hosted by
@AnalyticsVidhya
on the training of LLMs for code, like StarCoder.
I will also discuss how to leverage these models using open-source libraries such as transformers, datasets and PEFT.
Register here: .
Honored to have given a talk at
@KTHuniversity
about Machine Learning for Code at
@Huggingface
with CodeParrot & BigCode.
🦜 For educational tools about code models:
🌸 For some state-of-the-art code datasets and models:
Introducing 🎃🦇 the AI Halloween Photobooth! 🦇🎃
Turn into a Spooky Skeleton💀✨ or a PS1 style vampire 🎮🧛
From
@linoy_tsaban
and I, powered by LEDITS 🎨: spooky iteration of what we had
@ICCVConference
/
@huggingface
Paris event🕸️🕷
Go play! ▶️
🔍 What's pass@k?
We generate k slightly different solutions for each problem with sampling at a fixed temperature. We consider the problem solved if any of the solutions passes the tests.
With code models getting better, we usually report pass@1: one chance to solve the problem.
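For reference, the standard unbiased pass@k estimator (from the Codex paper), given n generations of which c pass the tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (c of them correct) passes the tests.
    Computes 1 - C(n-c, k) / C(n, k) as a numerically stable product."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill a k-draw
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```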
The power of building tools, datasets, and models in the open: the community can build on top of it and everyone profits!
Exhibit A: since the release of 📑The Stack and ⭐️StarCoder, research groups from academia and industry have trained models on top of BigCode's releases.
Thrilled to announce the success of our recent
#webinar
on Generative Models with Loubna Ben Allal.
If you missed it, watch the recording here:
Take our survey to improve future events, & stay tuned for more!
#MoroccoAI
#AI
#NLP
Introducing 📑 The Stack - a 3TB dataset of permissively licensed code in 30 programming languages.
You want your code excluded from the model training? There is an opt-out form and data governance plan:
Let's take a tour🧵
DS-1000 is a realistic Python benchmark of Data Science use cases based on Stack-Overflow questions. It consists of 1000 problems spanning 7 widely-used libraries, and it was developed by
@HKUniversity
NLP Group.
📣 Introducing ⭐ StarCoder+ & StarChat Beta!
We trained StarCoder on the Falcon model's English web dataset and instruction-tuned it. Both models rank high on the LLM leaderboard, with strong natural language performance and coding capabilities.
📈 Significant progress was made in the code evaluation space, but there's more to tackle such as:
• Evaluating repo level & multi-file changes
• Testing for class implementations
• Improving test coverage
MultiPL-E is the translation of HumanEval to 18 programming languages by
@northeasterm
Programming Research Lab.
It powers the Multilingual Code Evaluation leaderboard
@d_aumiller
Indeed it can be expensive, which is why we trained for a few steps to test the changes (though this can also lead to false conclusions). I think it's important to split the problem into small pieces and keep track of everything that was tested and the order of the tests.
Join us on the BigCode journey 🚀 and contribute to the next language model for code.
Together we will address the challenges of this field in an open and responsible way 🌸.
print("Hello world! 🎉")
Excited to announce the BigCode project led by
@ServiceNowRSRCH
and
@huggingface
! In the spirit of BigScience we aim to develop large language models for code in an open and responsible way.
Join here:
A thread with our goals🧵
In NLP, model generations are often compared to reference solutions using metrics like BLEU 📜↔️🔍
But for code, these metrics don't capture the large and complex space of possible solutions.
Let's go back to HumanEval: It's less than 200 problems, with only function implementations in Python. Is that all you expect a code model to do? 🤔
The answer is no, which is why researchers have developed other benchmarks such as DS-1000, MultiPL-E, APPS...
@Jakewk
It's textbooks generated by an LLM, so there's some hallucination for sure, but the performance of the model we trained on the dataset suggests a large chunk is accurate. You can inspect some samples here:
We built this demo using 🤗 Spaces, an easy tool to deploy free apps; more than 14B parameters are hosted in this single Space. If you have any questions or feedback you can use our new 🤗 hub feature, the community tab, or even open a PR!
These examples were generated by deploying a local version of the amazing Hugging Face Chat-UI:
🔍 If you want to dive into the TS source code of how the prompt is built, StarChat can help!
Although HumanEval appears to be correlated with performance on other code completion benchmarks, it may not effectively capture model nuances in various scenarios.
Therefore, it's necessary to evaluate code models across multiple tasks. From StarCoder's Model Card:
That's where functional correctness shines: we test model generations against unit tests, like humans would. And we report a score called pass@k.
➡️ HumanEval = 164 Python programs with 7.7 tests per problem on average.
We've applied a variant of the Zephyr recipe to create StarChat2 🌟!
It balances the code and math capabilities of BigCode's StarCoder2 with those of chat models to produce a capable programming assistant 👩💻
🚀 Demo:
🧑🍳 Recipe:
@GrantDeLozier
@Thom_Wolf
It was actually a bug in the weight decay, not the LR: because of a typo it was also applied to LayerNorms, which are normally excluded. And yes, we do plot the learning rate curves; it's good practice to follow how they change.
We now host APPS (Hendrycks et al), a popular benchmark for evaluating code generation models with 10000 problems of three difficulty levels. We added the dataset as well as the metric.
Dataset:
Metric:
Each HumanEval prompt is a function signature with a docstring.
Instruction-tuned models are more chatty (GPT-3 vs ChatGPT). They can either be evaluated with this prompt, or with an instruction-friendly format to better align with their fine-tuning.
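For concreteness, a toy prompt in the HumanEval style (adapted, not the verbatim benchmark problem) plus one possible completion, checked against unit tests the way the harness would:

```python
# The model receives only the signature + docstring and must write the body:
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # one possible model completion:
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1 :]
    )

# Functional correctness: run the hidden unit tests against the completion.
assert has_close_elements([1.0, 2.8, 3.0], 0.3)
assert not has_close_elements([1.0, 2.0, 3.0], 0.5)
```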
The first code model has officially joined the ggml library 🚀 VSCode extensions / On-device code generation.. The possibilities are endless. Let's level up the coding game with the power of ggml!
👉
👀 A glimpse of our latest mystery model's performance. Not just acing the coding tasks, but also mastering natural language! Intrigued yet?
Join us at our StarCoder webinar this Thursday to find out:
@thukeg
Great work! We added HumanEval-X to the Hugging Face hub and we can transfer it to your HF organization. It would be great to have the models there too!
🔍 Reproducibility is also a big challenge due to variance in:
• Generation parameters (n_samples, temperature...)
• Evaluation sets (HumanEval vs MultiPL-E-Python)
• Prompts (prefixes, base vs instruction)
➡️A leaderboard is essential for clarity in this space.
There are also other benchmarks for testing tasks like Program Repair and Code Explanation within HumanEvalPack thanks to
@Muennighoff
& team's work in
@BigCodeProject
.
StarChat Beta might hallucinate and generate problematic output. After all it's still a Beta version, but it shows we’re on a good path for code models that are open-access but also strong ✨