Zephyrnet Logo

Beginner’s Guide to Build Your Own Large Language Models from Scratch

Date:

Introduction

Be it twitter or Linkedin, I encounter numerous posts about Large Language Models(LLMs) each day. Perhaps I wondered why there’s such an incredible amount of research and development dedicated to these intriguing models. From ChatGPT to BARD, Falcon, and countless others, their names swirl around, leaving me eager to uncover their true nature. How are these models created? How to build large language models? How do they possess the ability to answer virtually any question you throw at them? These burning questions have lingered in my mind, fueling my curiosity. This insatiable curiosity has ignited a fire within me, propelling me to dive headfirst into the realm of LLMs.

LLMs

Join me on an exhilarating journey as we will discuss the current state of the art in LLMs. Together, we’ll unravel the secrets behind their development, comprehend their extraordinary capabilities, and shed light on how they have revolutionized the world of language processing.

Learning Objectives

  1. Learn about LLMs and their current state of the art.
  2. Understand different LLMs available and approaches to training these LLMs from scratch
  3. Explore best practices to train and evaluate LLMs

Buckle up and let’s start our journey to mastering LLMs.

Table of contents

A Brief History of Large Language Models

The history of Large Language Models goes back to the 1960s. In 1967, a professor at MIT built the first ever NLP program Eliza to understand natural language. It uses pattern matching and substitution techniques to understand and interact with humans. Later, in 1970, another NLP program was built by the MIT team to understand and interact with humans known as SHRDLU.

In 1988, RNN architecture was introduced to capture the sequence information present in the text data. During the 2000s, there was extensive research in NLP using RNNs. Language models using RNNs were state-of-the-art architectures to date. But RNNs could work well with only shorter sentences but not with long sentences. Hence, LSTM was introduced in 2013. During this period, huge developments emerged in LSTM-based applications. Simultaneously, research began in attention mechanisms as well.

There were 2 major concerns with LSTM. LSTM solved the problem of long sentences to some extent but it could not really excel while working with really long sentences. Training LSTM models cannot be parallelized. Due to this, the training of these models took longer time.

Large Language Models

In 2017, there was a breakthrough in the research of NLP through the paper Attention Is All You Need. This paper revolutionized the entire NLP landscape. The researchers introduced the new architecture known as Transformers to overcome the challenges with LSTMs. Transformers essentially were the first LLM developed containing a huge no. of parameters. Transformers emerged as state-of-the-art models for LLMs. Even today, the development of LLM remains influenced by transformers.

Building large language models

Over the next five years, there was significant research focused on building better LLMs compared to transformers. The size of LLM exponentially increased over time. The experiments proved that increasing the size of LLMs and datasets improved the knowledge of LLMs. Hence, LLMs like BERT, GPT, and their variants like GPT-2, GPT-3, GPT 3.5, and XLNet were introduced with an increase in the size of parameters and training datasets.

In 2022, there was another breakthrough in NLP, ChatGPT. ChatGPT is a dialogue-optimized LLM that is capable of answering anything you want it to. In a couple of months, Google introduced BARD as a competitor to ChatGPT.

In the last 1 year, there have been hundreds of Large Language Models developed. You can get the list of open-source LLMs along with their performance rank here. The state-of-the-art LLM to date is Falcon 40B Instruct.

What are Large Language Models?

Simply put this way- Large Language Models are deep learning models trained on huge datasets to understand human languages. Its core objective is to learn and understand human languages precisely. Large Language Models enable the machines to interpret languages just like the way we, as humans, interpret them.

Large Language Models learn the patterns and relationships between the words in the language. For example, it understands the syntactic and semantic structure of the language like grammar, order of the words, and meaning of the words and phrases. It gains the capability to grasp the whole language itself.

But how exactly is language models different from Large Language Models?

Language models and Large Language models learn and understand the human language but the primary difference is the development of these models.

Language models are generally statistical models developed using HMMs or probabilistic-based models whereas Large Language Models are deep learning models with billions of parameters trained on a very huge dataset.

But why do we need Large Language Models in the first place?

Why Large Language Models?

The answer to this question is simple. LLMs are task-agnostic models. Literally, these models have the capability to solve any task. For example, ChatGPT is a classical example of this. Every time you ask ChatGPT something, it amazes you.

And one more astonishing feature about these LLMs is that you don’t have to actually fine-tune the models like any other pretrained model for your task. All you need do is to prompt the model. It does the job for you. Hence, LLMs provide instant solutions to any problem that you are working on. Moreover, it’s just one model for all your problems and tasks. Hence, these models are known as the Foundation models in NLP.

Different Kinds of LLMs

LLMs can be broadly classified into 2 types depending on their task:

  1. Continuing the text
  2. Dialogue optimized

Continuing the Text

These LLMs are trained to predict the next sequence of words in the input text. Their task at hand is to continue the text.

For example, given the text “How are you”, these LLMs might complete the sentence with “How are you doing? or “How are you? I am fine.

The list of LLMs falling under this category are Transformers, BERT, XLNet, GPT, and its variants like GPT-2, GPT-3, GPT-4, etc.

Now, the problem with these LLMs is that its very good at completing the text rather than answering. Sometimes, we expect the answer rather than completion.

As discussed above, given How are you? as an input, LLM tries to complete the text with doing? or I am fine. The response can be either of them: completion or an answer. This is exactly why the dialogue-optimized LLMs were introduced.

2. Dialogue Optimized

These LLMs respond back with an answer rather than completing it. Given the input “How are you?”, these LLMs might respond back with an answer “I am doing fine.” rather than completing the sentence.

The list of dialogue-optimized LLMs is InstructGPT, ChatGPT, BARD, Falcon-40B-instruct, etc.

Now, we will see the challenges involved in training LLMs from scratch.

What are the Challenges of Training LLM?

Training LLMs from scratch are really challenging because of 2 main factors: Infrastructure and Cost.

Infrastructure

LLMs are trained on a massive text corpus ranging at least in the size of 1000 GBs. The models used to train on these datasets are very large containing billions of parameters. In order to train such large models on the massive text corpus, we need to set up an infrastructure/hardware supporting multiple GPUs. Can you guess the time taken to train GPT-3 – 175 billion parameter model on a single GPU?

It would take 355 years to train GPT-3 on a single NVIDIA Tesla V100 GPU.

This clearly shows that training LLM on a single GPU is not possible at all. It requires distributed and parallel computing with thousands of GPUs.

Just to give you an idea, here is the hardware used for training popular LLMs-

  1. Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.
  2. Researchers calculated that OpenAI could have trained GPT-3 in as little as 34 days on 1,024x A100 GPUs
  3. PaLM (540B, Google): 6144 TPU v4 chips used in total.

Cost

It’s very obvious from the above that GPU infrastructure is much needed for training LLMs from scratch. Setting up this size of infrastructure is highly expensive. Companies and research institutions invest millions of dollars to set it up and train LLMs from scratch.

It is estimated that GPT-3 cost around $4.6 million dollars to train from scratch

On average, the 7B parameter model would cost roughly $25000 to train it from scratch.

Now, we will see how to train LLMs from scratch

How Do you Train LLMs from Scratch?

The training process of LLMs is different for the kind of LLM you want to build whether it’s continuing the text or dialogue optimized. The performance of LLMs mainly depends upon 2 factors: Dataset and Model Architecture. These 2 are the key driving factors behind the performance of LLMs.

Let’s discuss the now different steps involved in training the LLMs.

1. Continuing the Text

The training process of the LLMs that continue the text is known as pretraining LLMs. These LLMs are trained in self-supervised learning to predict the next word in the text. We will exactly see the different steps involved in training LLMs from scratch.

a. Dataset Collection

The first step in training LLMs is collecting a massive corpus of text data. The dataset plays the most significant role in the performance of LLMs. Recently, OpenChat is the latest dialog-optimized large language model inspired by LLaMA-13B. It achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation. Do you know the reason behind its success? It’s high-quality data. It has been finetuned on only ~6K data.

The training data is created by scraping the internet, websites, social media platforms, academic sources, etc. Make sure that training data is as diverse as possible.

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models

What does it say? Let me explain.

You might have come across the headlines that “ChatGPT failed at JEE” or “ChatGPT fails to clear the UPSC” and so on. What can be the possible reasons? The reason being it lacked the necessary level of intelligence. This is heavily dependent on the dataset used for training. Hence, the demand for diverse dataset continues to rise as high-quality cross-domain dataset has a direct impact on the model generalization across different tasks.

Unlock the potential of LLMs with the high quality data!

Previously, Common Crawl was the go-to dataset for training LLMs. The Common Crawl contains the raw web page data, extracted metadata, and text extractions since 2008. The size of the dataset is in petabytes (1 petabyte=1e6 GB). It’s proven that the Large Language Models trained on this dataset showed effective results but failed to generalize well across other tasks. Hence, a new dataset called Pile was created from 22 diverse high-quality datasets. It’s a combination of existing data sources and new datasets in the range of 825 GB. In recent times, the refined version of the common crawl was released in the name of RefinedWeb Dataset. The datasets used for GPT-3 and GPT-4 have not been open-sourced in order to maintain a competitive advantage over the others.

b. Dataset Preprocessing

The next step is to preprocess and clean the dataset. As the dataset is crawled from multiple web pages and different sources, it is quite often that the dataset might contain various nuances. We must eliminate these nuances and prepare a high-quality dataset for the model training.

The specific preprocessing steps actually depend on the dataset you are working with. Some of the common preprocessing steps include removing HTML Code, fixing spelling mistakes, eliminating toxic/biased data, converting emoji into their text equivalent, and data deduplication. Data deduplication is one of the most significant preprocessing steps while training LLMs. Data deduplication refers to the process of removing duplicate content from the training corpus.

It’s obvious that the training data might contain duplicate or nearly the same sentences since it’s collected from various data sources. We need data deduplication for 2 primary reasons: It helps the model not to memorize the same data again and again. It helps us to evaluate LLMs better because the training and test data contain non-duplicated information. If it contains duplicated information, there is a very chance that the information it has seen in the training set is provided as output during the test set. As a result, the numbers reported may not be true. You can read more about data deduplication techniques here.

c. Dataset Preparation

The next step is to create the input and output pairs for training the model. During the pretraining phase, LLMs are trained to predict the next token in the text. Hence, input and output pairs are created accordingly.

For example, let’s take a simple corpus-

  • Example 1: I am a DHS Chatbot.
  • Example 2: DHS stands for DataHack Summit.
  • Example 3: I can provide you with information about DHS

In the case of example 1, we can create the input-output pairs as per below-

LLM

Similarly, in the case of example 2, the following is a list of input and output pairs-

LLM

Each input and output pair is passed on to the model for training. Now, what next? Let’s define the model architecture.

d. Model Architecture

The next step is to define the model architecture and train the LLM.

As of today, there are a huge no. of LLMs being developed. You can get an overview of different LLMs here. There is a standard process followed by the researchers while building LLMs. Most of the researchers start with an existing Large Language Model architecture like GPT-3  along with the actual hyperparameters of the model. And then tweak the model architecture and hyperparameters to come up with a state-of-the-art model architecture.

For example,

  • Falcon is a state-of-the-art LLM. It ranks first on the open-source LLM leaderboard. Falcon is inspired by GPT-3 architecture with a couple of tweaks.

e. Hyperparameter Search

Hyperparameter tuning is a very expensive process in terms of time and cost as well. Just imagine running this experiment for the billion-parameter model. It’s not feasible right? Hence, the ideal method to go about is to use the hyperparameters of current research work, for example, use the hyperparameters of GPT-3 while working with the corresponding architecture and then find the optimal hyperparameters on the small scale and then interpolate them for the final model.

The experiments can involve any or all of the following: weight initialization, positional embeddings, optimizer, activation, learning rate, weight decay, loss function, sequence length, number of layers, number of attention heads, number of parameters, dense vs. sparse layers, batch size, and drop out.

Let’s discuss the best practices for popular hyperparameters now-

  • Batch size: Ideally choose the large batch size that fits the GPU memory.
  • Learning Rate Scheduler: The better way to go about this is to decrease the learning rate as the training progress. This will overcome the local minima and improves the model stability. Some of the commonly used Learning Rate Schedulers are Step Decay and Exponential Decay.
  • Weight Initialization: The model convergence highly depends on the weights initialized before training. Initializing the proper weights leads to faster convergence. The commonly used weight initialization for transformers is T-Fixup.
  • Regularization: It is proven that LLMs are prone to overfitting. Hence, it’s necessary to use the techniques Using batch normalization, dropout, l1/l2 regularization will help the model overcome overfitting.

2. Dialogue-optimized LLMs

In the dialogue-optimized LLMs, the first step is the same as the pretraining LLMs discussed above. After pretraining, these LLMs are now capable of completing the text. Now, to generate an answer for a specific question, the LLM is finetuned on a supervised dataset containing questions and answers. By the end of this step, your model is now capable of generating an answer to a question.

ChatGPT is a dialogue-optimized LLM. The training method of ChatGPT is similar to the steps discussed above. Just that it includes an additional step known as RLHF apart from pretraining and supervised fine tuning.

LLM

But recently, there has been a paper known as LIMA: Less Is for More Alignment. It reveals that you don’t need RLHF at all in the first place. All you need is pretraining on the huge amount of dataset and supervised fine-tuning on a high quality data as less as 1000 data.

As of today, OpenChat is the latest dialog-optimized large language model inspired by LLaMA-13B. It achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation. It’s been finetuned on only 6k high-quality data.

How Do you Evaluate LLMs?

The evaluation of LLMs cannot be subjective. It has to be a logical process to evaluate the performance of LLMs.

In the case of classification or regression problems, we have the true labels and predicted labels and then compare both of them to understand how well the model is performing. We look at the confusion matrix for this right? But what about large language models? They just generate the text.

There are 2 ways to evaluate LLMs: Intrinsic and extrinsic methods.

Intrinsic Methods

Traditional Language models were evaluated using intrinsic methods like perplexity, bits per character, etc. These metrics track the performance on the language front i.e. how well the model is able to predict the next word.

Extrinsic Methods

With the advancements in LLMs today, extrinsic methods are preferred to evaluate their performance. The recommended way to evaluate LLMs is to look at how well they are performing at different tasks like problem-solving, reasoning, mathematics, computer science, and competitive exams like MIT, JEE, etc.

EleutherAI released a framework called as Language Model Evaluation Harness to compare and evaluate the performance of LLMs. Hugging face integrated the evaluation framework to evaluate open source LLMs developed by community.

The proposed framework evaluates LLMs across 4 different datasets. Final score is aggregation of score from each datasets.

  • AI2 Reasoning Challenge: A collection of science questions designed for elementary school students.
  • HellaSwag: A test that challenges state-of-the-art models to make common-sense inferences, which are relatively easy for humans (about 95% accuracy).
  • MMLU: A comprehensive test that evaluates the multitask accuracy of a text model. It includes 57 different tasks covering subjects like basic math, U.S. history, computer science, law, and more.
  • TruthfulQA: A test specifically created to assess a model’s tendency to generate accurate answers and avoid reproducing false information commonly found online.

Also Read: 10 Exciting Projects on Large Language Models(LLM)

End Notes

Hope you are now ready to build your own large language models!

Any thoughts? Comment below.

spot_img

Latest Intelligence

spot_img