<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Refact AI</title>
    <description>The latest articles on Forem by Refact AI (@refact_ai).</description>
    <link>https://forem.com/refact_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1065157%2F8a22a585-cb15-403e-a549-d71eea27ad33.jpg</url>
      <title>Forem: Refact AI</title>
      <link>https://forem.com/refact_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/refact_ai"/>
    <language>en</language>
    <item>
      <title>Open-source Fine-Tuning on Codebase with Refact</title>
      <dc:creator>Refact AI</dc:creator>
      <pubDate>Tue, 05 Sep 2023 10:18:54 +0000</pubDate>
      <link>https://forem.com/refact/open-source-fine-tuning-on-codebase-with-refact-3po1</link>
      <guid>https://forem.com/refact/open-source-fine-tuning-on-codebase-with-refact-3po1</guid>
      <description>&lt;p&gt;Code completion has become increasingly popular, thanks to tools like GitHub Copilot and open-source Large Language Models (LLMs). However, both Copilot and open models often fall short when it comes to working effectively on your specific codebase. This is because these models have never been exposed to your unique code patterns and conventions.&lt;br&gt;
In order to improve the quality of suggestions and tailor them to your codebase, there's a technique called fine-tuning. By fine-tuning a pre-trained model on your codebase, you can improve its ability to understand and generate code that aligns with your requirements.&lt;br&gt;
In this blog post, we will delve into the concept of fine-tuning and its technical details, and show how you can start self-hosting your fine-tuned model in Refact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;


&lt;p&gt;In this video, the same simple function is generated by Copilot, the base Refact 3b model, and the fine-tuned Refact 3b model.&lt;br&gt;
All three can look through the surrounding code, find the variables they need, and help you with typing, but only the fine-tuned version knows how to work with DatasetOpts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Exactly Does Fine-tuning Work?
&lt;/h2&gt;

&lt;p&gt;Large language models work by predicting the next token. This simple objective allows LLMs to learn syntax, code patterns, and even high-level concepts.&lt;br&gt;
The code you write is probably different from all the other projects on the internet. It might be similar - that's why code LLMs are already useful - but you probably have your own established way to do things.&lt;br&gt;
One simple example is coding style. Predicting the next token in a certain way defines how a model writes code, including variable names, spaces, etc.&lt;br&gt;
Fine-tuning has the same objective as pre-training: predict the next token. By adjusting the parameters in a clever way (it needs only one GPU to train!), the model starts to predict the next token according to your coding style, as well as patterns, your typical API usage, etc.&lt;br&gt;
That's why you'll see more useful suggestions if you are using a fine-tuned model.&lt;/p&gt;
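&lt;p&gt;A toy illustration of that shared objective (all numbers are made up): fine-tuning shifts probability mass toward the next tokens your codebase actually uses, which lowers the loss on them.&lt;/p&gt;

```python
import math

# Made-up predicted probabilities for the token after "return ", before
# and after fine-tuning on a codebase that always names things "result".
before = {"result": 0.2, "value": 0.5, "x": 0.3}
after = {"result": 0.8, "value": 0.1, "x": 0.1}

target = "result"  # the token the codebase actually uses next
loss_before = -math.log(before[target])  # next-token cross-entropy
loss_after = -math.log(after[target])
print(round(loss_before, 3), round(loss_after, 3))  # 1.609 0.223
```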

&lt;h2&gt;
  
  
  What Data Can I Use for Fine-tuning the Model?
&lt;/h2&gt;

&lt;p&gt;In the Refact UI, you upload your source code as an archive (.zip, .tar.gz, .bz2), point it at a git repository (private repositories work too, though you'll need to generate an SSH key), or upload an individual file. Refact then slices your source code into pieces that a model can actually train on.&lt;br&gt;
It's a good idea to give the model the current code of your projects. However, it's NOT a good idea to feed in the 3rd-party libraries that you use, as the model may learn to generate code similar to the internals of those libraries.&lt;/p&gt;
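&lt;p&gt;The slicing step can be sketched roughly like this. This is a hypothetical illustration; the window and overlap sizes are invented, not Refact's actual values.&lt;/p&gt;

```python
def slice_source(text, window=2048, overlap=256):
    """Split one file's text into overlapping character windows
    that a model can train on."""
    chunks = []
    step = window - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + window])
    return chunks

# 200 copies of a tiny function stand in for a real source file
code = "def add(a, b):\n    return a + b\n" * 200
chunks = slice_source(code)
print(len(chunks), len(chunks[0]))  # 4 2048
```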

&lt;h2&gt;
  
  
  Test Loss
&lt;/h2&gt;

&lt;p&gt;In order to measure how well the model adapts to your code, you can take one or two of your files and make them a test set. To be a meaningful measurement, these files should use your coding style, your libraries, and your APIs.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;img src="https://refact.ai/images/blog/refact-finetune/sources-code.png"&amp;gt;
&amp;lt;span&amp;gt;Picture: shows &amp;lt;code&amp;gt;vllm&amp;lt;/code&amp;gt; github repository as a training set, and a single file &amp;lt;code&amp;gt;benchmark_serving.py&amp;lt;/code&amp;gt; as a fixed test set&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If test files are also present in the train set, they will be automatically subtracted from it.&lt;br&gt;
If you don't specify any test set, it will pick several random files for you.&lt;/p&gt;
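&lt;p&gt;The bookkeeping described above amounts to something like this hedged sketch; the file names and the number of randomly picked test files are invented.&lt;/p&gt;

```python
import random

def split_train_test(all_files, test_files=None, n_random=2, seed=0):
    """Subtract the test files from the train set; if no test set is
    given, pick a few random files instead."""
    if not test_files:
        rng = random.Random(seed)
        test_files = rng.sample(sorted(all_files), n_random)
    train = sorted(set(all_files) - set(test_files))
    return train, sorted(test_files)

files = {"model.py", "dataset.py", "train.py", "utils.py"}
train, test = split_train_test(files, test_files={"train.py"})
print(train, test)
```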

&lt;h2&gt;
  
  
  Technical Details
&lt;/h2&gt;

&lt;p&gt;It's possible to fine-tune all parameters (called "full fine-tune"), but recently PEFT methods became popular. PEFT stands for Parameter-Efficient Fine-Tuning. There are several methods available, the most popular so far is LoRA (&lt;a href="https://arxiv.org/abs/2106.09685"&gt;2106.09685&lt;/a&gt;) that can train less than 1% of the original weights.&lt;br&gt;
LoRA has one important parameter -- tensor size, called &lt;code&gt;lora_r&lt;/code&gt;. It defines how much information LoRA can add to the network. If your codebase is small, the fine-tuning process will see the same data over and over again, many times in a loop. We found that for smaller codebases small LoRA tensors work best, because they won't overfit as much -- the tensors simply don't have the capacity to fit the limited training set exactly.&lt;br&gt;
As the codebase gets bigger, tensors should become bigger as well. We also unfreeze token embeddings at a certain codebase size.&lt;br&gt;
To pick all the parameters automatically, we have developed a heuristic that calculates a score based on the source files it sees. This score is then used to determine the appropriate LoRA size, number of finetuning steps, and other parameters. We have tested this heuristic on several beta test clients, small codebases of several files, and large codebases like the Linux kernel (consisting of about 50,000 useful source files).&lt;br&gt;
If the heuristic doesn't work for you for whatever reason, you can set all the parameters yourself.&lt;/p&gt;
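&lt;p&gt;In spirit, the heuristic looks something like the sketch below. The thresholds and ranks are invented for illustration only -- they are not Refact's actual tuning -- but they capture the rule of thumb: small codebase, small &lt;code&gt;lora_r&lt;/code&gt;; bigger codebase, bigger tensors.&lt;/p&gt;

```python
def pick_lora_r(n_source_files):
    """Map codebase size to a LoRA rank (illustrative thresholds)."""
    for threshold, rank in [(10, 4), (100, 8), (1_000, 16), (10_000, 32)]:
        if n_source_files <= threshold:
            return rank
    return 64  # Linux-kernel scale: biggest adapters

print(pick_lora_r(5), pick_lora_r(500), pick_lora_r(50_000))  # 4 16 64
```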

&lt;h2&gt;
  
  
  How to Test If It Worked?
&lt;/h2&gt;

&lt;p&gt;After the fine-tuning process finishes (which should take several hours), you can dynamically turn it on and off and observe the difference it makes for code suggestions. You can do this using this switch:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;img src="https://refact.ai/images/blog/refact-finetune/lora-select.png"&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;There's a catch: both the VS Code and JetBrains plugins cache responses. To force the model to produce a new suggestion (rather than immediately returning a cached one), you can change the text a few lines above, for example in a comment.&lt;br&gt;
Alternatively, you can use the Manual Suggestion Trigger (a key combination), which always produces a new suggestion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self Hosting
&lt;/h2&gt;

&lt;p&gt;You can use your own GPU to host and fine-tune LLMs with &lt;a href="https://github.com/smallcloudai/refact/"&gt;Refact self-hosting server&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Q: Maybe models can guess code better if they have more context, especially from other files?&lt;br&gt;
A: For the best results, you need both. Fine-tuning gives you the coding style, and if the model can also see relevant snippets of code from other files, it will work better for calling functions and using types defined outside the current file. We are currently working on that, too. Join our Discord server and be the first to know when we release it!&lt;br&gt;
Q: I only want to imitate the coding style of certain experts on my team. Is this possible?&lt;br&gt;
A: Certainly! You can achieve this by selectively uploading the files that represent the desired coding style and excluding any old or low-quality code. The model will then generate code that aligns with the chosen style. This approach can be valuable for transferring expert knowledge within your company, as the coding assistant will consistently suggest good coding practices.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>opensource</category>
      <category>selfhost</category>
    </item>
    <item>
      <title>🤖We trained a small 1.6b code model that reaches 32% HumanEval🤖</title>
      <dc:creator>Refact AI</dc:creator>
      <pubDate>Tue, 05 Sep 2023 10:15:23 +0000</pubDate>
      <link>https://forem.com/refact/we-trained-16b-code-model-and-you-can-use-it-as-a-personal-copilot-in-refact-for-free-12io</link>
      <guid>https://forem.com/refact/we-trained-16b-code-model-and-you-can-use-it-as-a-personal-copilot-in-refact-for-free-12io</guid>
      <description>&lt;p&gt;Today we're introducing Refact LLM: 1.6B code model with infill real-time code completion (including fill-in-the-middle(FIM) capability) and chat.&lt;br&gt;
Refact LLM achieves the state-of-the-art performance among the code LLMs, coming closer to  HumanEval as Starcoder, being 10x smaller in size, and it beats other code models such as StableCode, CodeGen and ReplitCode on HumanEval metric. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;1.6b parameters&lt;/li&gt; 
    &lt;li&gt;20 programming languages&lt;/li&gt; 
    &lt;li&gt;4096 tokens context&lt;/li&gt; 
    &lt;li&gt;code completion and chat capabilities&lt;/li&gt; 
    &lt;li&gt;SoTA on HumanEval benchmark among similar code models&lt;/li&gt; 
    &lt;li&gt;pre-trained on permissive licensed code and available for commercial use&lt;/li&gt; 
&lt;/ul&gt;


    &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
        &lt;thead&gt;
            &lt;tr&gt;
                &lt;th&gt;Model&lt;/th&gt;
                &lt;th&gt;Model Size&lt;/th&gt;
                &lt;th&gt;HumanEval pass@1&lt;/th&gt;
            &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
            &lt;tr&gt;
                &lt;td&gt;DeciCoder-1b&lt;/td&gt;
                &lt;td&gt;1b&lt;/td&gt;
                &lt;td&gt;19.1%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;Refact-1.6-fim&lt;/td&gt;
                &lt;td&gt;1.6b&lt;/td&gt;
                &lt;td&gt;32.0%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;StableCode&lt;/td&gt;
                &lt;td&gt;3b&lt;/td&gt;
                &lt;td&gt;20.2%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;ReplitCode v1&lt;/td&gt;
                &lt;td&gt;3b&lt;/td&gt;
                &lt;td&gt;21.9%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;CodeGen2.5-multi&lt;/td&gt;
                &lt;td&gt;7b&lt;/td&gt;
                &lt;td&gt;28.4%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;CodeLlama&lt;/td&gt;
                &lt;td&gt;7b&lt;/td&gt;
                &lt;td&gt;33.5%&lt;/td&gt;
            &lt;/tr&gt;
            &lt;tr&gt;
                &lt;td&gt;StarCoder&lt;/td&gt;
                &lt;td&gt;15b&lt;/td&gt;
                &lt;td&gt;33.6%&lt;/td&gt;
            &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;&lt;/div&gt;


&lt;p&gt;The base model was trained on our own set of code with permissive licenses only and open text datasets (the text to code ratio was 50:50). In total, we trained our base model on 1.2T tokens of code on our cluster.&lt;/p&gt;

&lt;p&gt;The model was then fine-tuned with open code instruction-following datasets filtered for quality, plus a synthetic dataset based on &lt;a href="https://huggingface.co/datasets/bigcode/the-stack-dedup" rel="noopener noreferrer"&gt;The Stack dedup v1.1&lt;/a&gt;, to improve FIM and boost the base model's performance. &lt;/p&gt;

&lt;p&gt;You can read more about the architecture decisions that we made in the &lt;a href="https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We aim for the model to be accessible to everyone, so we're releasing it for commercial use under the BigScience OpenRAIL-M license and making the weights available on &lt;a href="https://huggingface.co/smallcloudai/Refact-1_6B-fim" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;While the recent trend has been toward ever-larger models, we wanted to lower the barriers to entry and make Refact LLM a versatile tool for developers with varying hardware setups. At this smaller size, running the model is faster and more affordable than ever: it can be served on most modern GPUs, requiring just 3 GB of memory, and works great for real-time code completion tasks.&lt;/p&gt;
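&lt;p&gt;The ~3 GB figure is easy to sanity-check: 1.6B parameters at 16-bit precision take about 2 bytes each (weights only, ignoring activations and the KV cache):&lt;/p&gt;

```python
params = 1.6e9            # model parameters
bytes_per_param = 2       # fp16/bf16 weights
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)  # 3.2
```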

&lt;p&gt;Refact LLM can be easily integrated into existing developer workflows with &lt;a href="https://github.com/smallcloudai/refact/" rel="noopener noreferrer"&gt;an open-source docker container&lt;/a&gt; and &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; and &lt;a href="https://plugins.jetbrains.com/plugin/20647-codify" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt; plugins. With Refact's intuitive user interface, developers can easily use the model for a variety of coding tasks. Fine-tuning is available in the self-hosted (Docker) and Enterprise versions, making suggestions more relevant for your private codebase.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Fintroducing-refact-code-llm%2Fpalindrome.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Fintroducing-refact-code-llm%2Fpalindrome.gif"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Refact 1.6B LLM is the third model in the family of our code models, with &lt;a href="https://huggingface.co/smallcloudai/codify_3b_multi" rel="noopener noreferrer"&gt;CodeContrast 3b&lt;/a&gt; and &lt;a href="https://huggingface.co/smallcloudai/codify_medium_multi" rel="noopener noreferrer"&gt;CodeContrast 0.3b&lt;/a&gt; released previously. We aim to continue with our research and future updates to improve the LLM's performance and capabilities. We would love to get community contributions and feedback to enhance the model further. For any questions and ideas, please visit our &lt;a href="https://smallcloud.ai/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>githubcopilot</category>
      <category>programming</category>
    </item>
    <item>
      <title>How To Train a Code Model Using Recent AI Innovations</title>
      <dc:creator>Refact AI</dc:creator>
      <pubDate>Fri, 11 Aug 2023 13:51:20 +0000</pubDate>
      <link>https://forem.com/refact/applying-all-recent-innovations-to-train-a-code-model-4hj5</link>
      <guid>https://forem.com/refact/applying-all-recent-innovations-to-train-a-code-model-4hj5</guid>
      <description>&lt;p&gt;ML is changing fast!&lt;/p&gt;

&lt;p&gt;Recently Meta &lt;a href="https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/" rel="noopener noreferrer"&gt;released the LLaMA&lt;/a&gt; model, which surprised many people - it packed a lot of magic into a small size. The 13B version was comparable in quality with OpenAI's largest 175B GPT-3 model.&lt;/p&gt;

&lt;p&gt;MosaicML released the &lt;a href="https://www.mosaicml.com/blog/mpt-7b" rel="noopener noreferrer"&gt;MPT-7B&lt;/a&gt; model, whose StoryWriter variant handles a context of 65k tokens, thanks to the ALiBi position encoding.&lt;/p&gt;

&lt;p&gt;BigCode released the &lt;a href="https://huggingface.co/blog/starcoder" rel="noopener noreferrer"&gt;StarCoder&lt;/a&gt; model that hits 30.4% on HumanEval pass@1, and they also released a code dataset cleaned of personally identifiable information, called &lt;a href="https://huggingface.co/datasets/bigcode/the-stack" rel="noopener noreferrer"&gt;The Stack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Replit recently released the &lt;a href="https://huggingface.co/replit/replit-code-v1-3b" rel="noopener noreferrer"&gt;replit-code-v1-3b&lt;/a&gt; model, which follows some of the LLaMA innovations and shows great metrics, but it has no fill-in-the-middle capability, no diffs, and has seen no data other than code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We plan to make the model publicly available.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LLaMA Innovations
&lt;/h2&gt;

&lt;p&gt;The number one thing about LLaMA is that it was trained for 1T tokens (and the larger models for 1.4T tokens). But that alone is not enough: the transformer architecture and hyperparameters must be right to continue training for that long.&lt;/p&gt;

&lt;p&gt;Architecture: LLaMA doesn't have the bias terms in self-attention and in MLP - that probably allows weight decay to work better. Self-attention runs independently from MLP, not sequentially - this makes calculations a bit faster because they don't have to wait for each other. LLaMA also uses RMSNorm instead of LayerNorm, but that shouldn't be important.&lt;/p&gt;

&lt;p&gt;Hyperparameters: the most interesting is the batch size of 4M tokens. Early in training, many tokens are surprising for the model, and it gets interesting updates. But to run for longer, each batch needs to have diverse data that is not yet predictable, that's why it should be so big.&lt;/p&gt;
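&lt;p&gt;In practice such a batch is assembled with gradient accumulation. A hedged sketch of the arithmetic -- the GPU count and per-GPU micro-batch below are invented, not LLaMA's actual setup:&lt;/p&gt;

```python
target_tokens = 4_000_000          # LLaMA's batch size in tokens
context = 2048                     # LLaMA's training context length
seqs_per_batch = target_tokens // context

micro_batch = 4                    # sequences per GPU per step (assumption)
gpus = 64                          # (assumption)
accum_steps = -(-seqs_per_batch // (micro_batch * gpus))  # ceiling division
print(seqs_per_batch, accum_steps)  # 1953 8
```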

&lt;p&gt;Figure 1: LLaMA loss and metrics get monotonically better for 1T tokens and beyond.&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage7.png"&gt;&lt;/a&gt;
  &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage1.png"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  ALiBi Position Encoding
&lt;/h2&gt;

&lt;p&gt;Transformers traditionally had absolute position encoding, which means each position in the context of 2048 or so tokens has its own trainable vector. It's horrible, of course, because the same tokens moved left or right will produce different activations! But some still use it, notably the StarCoder model.&lt;/p&gt;

&lt;p&gt;There are three widely used solutions to this:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;Relative Attention, introduced in the Transformer XL paper (with a not-very-clear explanation)&lt;/li&gt;
    &lt;li&gt;&lt;a href="https://arxiv.org/abs/2104.09864"&gt;Rotary Embeddings&lt;/a&gt; (LLaMA uses this one)&lt;/li&gt;
    &lt;li&gt;&lt;a href="https://arxiv.org/abs/2108.12409v2"&gt;ALiBi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relative Attention has a big disadvantage: it adds trainable parameters. That means the initialization must be right, gradients must be right. We tried wavelets some time ago instead of trainable parameters and it worked just as well, proving there's no need for trainable parameters here, really.&lt;/p&gt;

&lt;p&gt;Both Rotary Embeddings and ALiBi are great, but ALiBi has an additional advantage: the context size can be extended beyond what was used in pre-training. Not immediately, in our experience, but after a bit of fine-tuning - still a big advantage.&lt;/p&gt;
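&lt;p&gt;ALiBi is simple enough to sketch in a few lines: each head adds a fixed, non-trainable linear penalty of -slope * distance to its attention scores, with the slopes forming a geometric sequence across heads (as in the ALiBi paper):&lt;/p&gt;

```python
def alibi_slopes(n_heads):
    """Geometric sequence 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: row i holds -slope * (i - j) for keys j = 0..i."""
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])        # 0.5 0.00390625
print(alibi_bias(4, slopes[0])[3])  # [-1.5, -1.0, -0.5, -0.0]
```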

&lt;p&gt;But let's compare the latter two directly on a short 15B-token run:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage9.png" alt="ALiBi"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    &amp;lt;img src="https://refact.ai/images/blog/recent-innovations/image5.png"&amp;gt;
    &amp;lt;span&amp;gt;ALiBi&amp;lt;/span&amp;gt;


    &amp;lt;img src="https://refact.ai/images/blog/recent-innovations/image2.png"&amp;gt;
    &amp;lt;span&amp;gt;Rotary&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;So ALiBi even works better for our setup!&lt;/p&gt;

&lt;h2&gt;
  
  
  Early Dropout
&lt;/h2&gt;

&lt;p&gt;Researchers at Meta proposed using &lt;a href="https://arxiv.org/abs/2303.01500" rel="noopener noreferrer"&gt;dropout in early training&lt;/a&gt; to improve underfitting (not overfitting). The way it works is this: put dropout layers in many places in the transformer, then gradually turn the drop rate down from 10-15% to zero over the first 20% of the training run.&lt;/p&gt;
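&lt;p&gt;That schedule is essentially a one-liner. A sketch using the numbers quoted above (p=0.1 annealed linearly to zero over the first 20% of training):&lt;/p&gt;

```python
def early_dropout_p(step, total_steps, p0=0.1, warmdown_frac=0.2):
    """Linearly anneal the drop rate to zero over the first
    warmdown_frac of training; zero afterwards."""
    cutoff = total_steps * warmdown_frac
    if step >= cutoff:
        return 0.0
    return p0 * (1 - step / cutoff)

total = 100_000
print(early_dropout_p(0, total), early_dropout_p(10_000, total),
      early_dropout_p(20_000, total))  # 0.1 0.05 0.0
```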

&lt;p&gt;According to the paper, it can give a couple of percent on test-set metrics for free. Trying this on the same short training run, we've got:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage3.png" alt="Early Dropout"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The red run with early dropout has a clear advantage on the training loss (never mind the little drop).&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Query Attention
&lt;/h2&gt;

&lt;p&gt;One of the ways to get a large context size with small memory usage is Multi-Query Attention, used at scale in the PaLM models. A short explanation is this: in &lt;strong&gt;Multi-Head&lt;/strong&gt; Attention, a self-attention layer produces K, V and Q (keys, values and queries) for each head. But in &lt;strong&gt;Multi-Query&lt;/strong&gt; Attention, keys and values are produced just once (not for each head); only the queries are different for each attention head. See &lt;a href="https://arxiv.org/abs/2204.02311"&gt;2204.02311&lt;/a&gt; for a detailed explanation. &lt;/p&gt;

&lt;p&gt;This allows for a smaller KV cache while sampling and improved inference speed.&lt;/p&gt;
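&lt;p&gt;The saving is easy to quantify. With shapes roughly like a large code model's (the numbers below are illustrative, not an exact config), the KV cache shrinks by a factor of the head count:&lt;/p&gt;

```python
n_layers, n_heads, head_dim, ctx = 40, 48, 128, 8192
bytes_fp16 = 2

# MHA stores K and V per head at every layer and position;
# MQA stores one shared K and V per layer and position.
mha_cache = n_layers * ctx * 2 * n_heads * head_dim * bytes_fp16
mqa_cache = n_layers * ctx * 2 * 1 * head_dim * bytes_fp16

print(mha_cache / 1e9, mqa_cache / 1e9, mha_cache // mqa_cache)
```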

&lt;p&gt;It was recently used in the StarCoder models, where it helps to handle a big context size of 8192.&lt;/p&gt;

&lt;h2&gt;
  
  
  LiON
&lt;/h2&gt;

&lt;p&gt;Another recent development is &lt;a href="https://arxiv.org/abs/2302.06675" rel="noopener noreferrer"&gt;LiON&lt;/a&gt;, an optimizer that makes a bold claim: that it can replace Adam. Adam has ruled the world of deep models since its introduction in 2014 - nearly a decade!&lt;/p&gt;

&lt;p&gt;Various people are trying LiON on their projects, with varying degrees of success. A good starting point is &lt;a href="https://github.com/lucidrains/lion-pytorch" rel="noopener noreferrer"&gt;lion-pytorch&lt;/a&gt; on GitHub from Phil Wang aka lucidrains (thank you man!).&lt;/p&gt;

&lt;p&gt;The main problem is the hyperparameters: they are well established for Adam, but still a bit of guesswork for LiON. There are three: betas, weight decay, and learning rate. We took β1=0.95, β2=0.98 without checking, and tested LR and WD:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LR: gray is Adam with lr=20e-5; the others are LiON runs from 2e-5 to 5e-5 (the best).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fblog%2Frecent-innovations%2Fimage4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WD: a higher weight decay (green wd=0.8) is slightly worse in the middle (within error bars?) for this short run but it's just as good at the end, compared to wd=0.6 and wd=0.4. &lt;/p&gt;

&lt;p&gt;We took lr=5e-5 (four times lower than the Adam learning rate) and wd=0.8 (eight times higher than in Adam).&lt;/p&gt;

&lt;p&gt;By the way, the low effect of weight decay on the final result is consistent with the LiON paper: they changed WD from 0.5 to 2.0 to a very little effect on final performance, especially with a higher learning rate.&lt;/p&gt;
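&lt;p&gt;For reference, the LiON update itself is tiny. A single-scalar sketch of the rule from the paper, using the hyperparameters above (a sketch, not a drop-in optimizer):&lt;/p&gt;

```python
def sign(x):
    return (x > 0) - (x < 0)

def lion_step(p, g, m, lr=5e-5, wd=0.8, beta1=0.95, beta2=0.98):
    """One LiON update on a scalar parameter p with gradient g and
    momentum m: step in the sign of the interpolated momentum, with
    decoupled weight decay."""
    update = sign(beta1 * m + (1 - beta1) * g)
    p = p - lr * (update + wd * p)
    m = beta2 * m + (1 - beta2) * g   # momentum is updated after the step
    return p, m

p, m = lion_step(1.0, 0.5, 0.0)
print(p, m)
```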

&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;We use Red Pajama as a text dataset, Stack Dedup 1.2 for plain code, and our own internal dataset for diffs.&lt;/p&gt;

&lt;p&gt;We use fill-in-the-middle training almost exactly as in &lt;a href="https://arxiv.org/abs/2207.14255"&gt;2207.14255&lt;/a&gt; (but we limit the "middle" part size to 4k chars).&lt;/p&gt;
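&lt;p&gt;A hedged sketch of the FIM transform: the sentinel strings are placeholders, and the cut points are fixed here for illustration (in training they are random).&lt;/p&gt;

```python
MAX_MIDDLE_CHARS = 4000  # the 4k-char cap on the "middle" part

def fim_transform(doc, cut1, cut2):
    """Reorder a document into prefix/suffix/middle with sentinels,
    so the model learns to fill in the middle."""
    prefix, middle, suffix = doc[:cut1], doc[cut1:cut2], doc[cut2:]
    middle = middle[:MAX_MIDDLE_CHARS]
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>" + middle

sample = fim_transform("def add(a, b):\n    return a + b\n", 15, 26)
print(repr(sample))
```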

&lt;h2&gt;
  
  
  Hyperparameters
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;td&gt;Optimizer&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;LiON β1=0.95, β2=0.98&lt;/b&gt;&lt;/td&gt;
    &lt;td&gt;Batch size&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;2M tokens&lt;/b&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;LR&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;5e-5&lt;/b&gt;&lt;/td&gt;
    &lt;td&gt;Context size&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;4096&lt;/b&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;LR schedule&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;linear to zero with warmup&lt;/b&gt;&lt;/td&gt;
    &lt;td&gt;Dropout&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;p=0.1&lt;/b&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Weight Decay&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;0.8&lt;/b&gt;&lt;/td&gt;
    &lt;td&gt;Dropout schedule&lt;/td&gt;
    &lt;td&gt;&lt;b&gt;to zero at 20% of training&lt;/b&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The model is called "202305-refact2b-mqa-lion"; it has 1.6b parameters, and we will release the weights for everyone to check out!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>llama</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Introducing Refact: Open-source alternative to Github Copilot</title>
      <dc:creator>Refact AI</dc:creator>
      <pubDate>Wed, 19 Apr 2023 19:16:19 +0000</pubDate>
      <link>https://forem.com/refact/a-self-hosted-alternative-to-github-copilot-for-vs-code-and-jetbrains-47hg</link>
      <guid>https://forem.com/refact/a-self-hosted-alternative-to-github-copilot-for-vs-code-and-jetbrains-47hg</guid>
      <description>&lt;p&gt;We've just launched Refact.ai, the AI coding assistant that combines code autocompletion, refactoring, and chat inside your favorite IDE.&lt;/p&gt;

&lt;p&gt;You can download our plugin for &lt;a href="https://plugins.jetbrains.com/plugin/20647-codify" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt; or &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt;. It's currently free for everyone while we're in the technical preview; we plan to introduce more pricing tiers soon. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Refact?
&lt;/h2&gt;

&lt;p&gt;We believe a next-gen developer assistant can benefit greatly from different AI models working in harmony, which is why we decided to power Refact with a combination of models.&lt;/p&gt;

&lt;p&gt;Our proprietary AI completion model is fast, smart, and state-of-the-art for its size and latency. For each language group, we have fine-tuned a specific model to provide both speed and accuracy. The models are hosted in our data center, ensuring performance and precision for quick boilerplate and basic code refactoring. Plus, by hosting them ourselves, we can ensure the highest level of security and reliability for our users.&lt;/p&gt;

&lt;p&gt;On top of that, we use the powerful GPT-3.5-Turbo and GPT-4 models, which make it possible to chat in natural language and apply code-improvement and explanation functions like “Find/Fix Bugs” and “Explain Complex Code”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fscheme.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frefact.ai%2Fimages%2Fscheme.svg" alt="Refact.ai scheme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Refact allows you to restrict access to particular files or projects, ensuring that your private code or confidential files are protected. And we don't collect datasets on the server side.&lt;br&gt;
If you have an NVIDIA GPU, you can self-host our model using our &lt;a href="https://refact.ai/docker/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; image, or contact us for an on-prem enterprise version.&lt;/p&gt;

&lt;p&gt;Refact offers a complete developer experience by making multiple functions available inside one IDE. &lt;/p&gt;

&lt;h2&gt;
  
  
  Autocomplete
&lt;/h2&gt;

&lt;p&gt;At the core of Refact is our autocomplete feature, which works with 20+ programming languages, including Python, JavaScript, Java, Go, Rust, C++, Ruby, and more. As you type, the model automatically generates suggestions by looking at the context above and below your cursor.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Toolbox
&lt;/h2&gt;

&lt;p&gt;To improve your existing code quickly and easily, the Refact AI Toolbox lets you highlight the area you want to improve and apply one of its functions to find and fix bugs, make code more readable, add console logs, or explain complex code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrated AI Chat
&lt;/h2&gt;

&lt;p&gt;Finally, you can use natural language prompts in the AI Chat to refine, explain, and generate new code, as well as get hints on API usage and documentation links. Your code is automatically part of the conversation's context, and generated code is pasted back directly into the IDE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;To get started, simply download Refact.ai on &lt;a href="https://plugins.jetbrains.com/plugin/20647-codify" rel="noopener noreferrer"&gt;JetBrains&lt;/a&gt; or &lt;a href="https://marketplace.visualstudio.com/items?itemName=smallcloud.codify" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; for free. We're currently in a technical preview, but we're working hard to introduce different pricing tiers and would love to hear your feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Join our &lt;a href="https://www.smallcloud.ai/discord" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt; to help shape the future of independent AI coding assistants.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>code</category>
      <category>vscode</category>
      <category>jetbrains</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
