Tiny Story Generator: Fine-tuning “Small” Language Models with PEFT

In the realm of natural language processing, language models have emerged as powerful tools for generating human-like text. While larger models (such as GPT-3, GPT-4, Llama-2, Llama-3, Gemini, and Mistral) have garnered most of the attention, it has also been demonstrated that relatively small models can produce high-quality output when fine-tuned on a specific domain.

The Tiny Story Generator project (github page) aims to leverage this finding by fine-tuning the smaller GPT-2 variants, namely GPT-2 small (117M), GPT-2 medium (345M), GPT-2 large (774M), and GPT-2 XL (1.5B), on a dataset of short stories. The goal is to enable these models to generate stories comparable in quality to those of their much larger counterparts, but with significantly reduced computational requirements.

The Dataset

The project uses the TinyStories dataset from Hugging Face, which consists of short, self-contained narratives. Each story is formatted to start with <|startoftext|> and end with <|endoftext|>. Here’s an example:

<|startoftext|> Once upon a time, there was a little boy named Tim. Tim was
a happy boy who liked to play with his toys. One day, Tim saw a pretty mirror on
the wall... <|endoftext|>

The dataset is preprocessed into a CSV file, with each row containing a single story in the specified format. This preprocessing step ensures that the data is ready for fine-tuning the language models.

Data Preprocessing

To convert the original TinyStories data from Hugging Face into this format, run the following command:

python3 src/processes/dataprep.py -c config/dataprep.toml

The configuration file config/dataprep.toml contains parameters such as the number of data points, train/test split ratio, random seed, input file path, and output file path.
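For orientation, the preprocessing step boils down to sampling stories, wrapping them with the start/end tokens, and writing them out as CSV. The sketch below illustrates the idea in Python; the config keys, dataset identifier, and column name are assumptions for illustration, and the actual src/processes/dataprep.py may be organized differently.

# Illustrative sketch of the dataprep step, NOT the actual src/processes/dataprep.py.
# Config keys (n_samples, test_ratio, seed, output_train, output_test) are assumed.
import csv
import random
import tomllib  # Python 3.11+; older versions can use the `toml` package

from datasets import load_dataset

with open("config/dataprep.toml", "rb") as f:
    cfg = tomllib.load(f)

random.seed(cfg["seed"])

# Load TinyStories from Hugging Face (the real script may instead read a local
# input file specified in the config) and take a random sample.
stories = load_dataset("roneneldan/TinyStories", split="train")
indices = random.sample(range(len(stories)), cfg["n_samples"])

# Wrap each story with the start/end tokens expected by the fine-tuning step.
rows = [f"<|startoftext|> {stories[i]['text']} <|endoftext|>" for i in indices]

# Split into train/test and write one story per CSV row.
cut = int(len(rows) * (1 - cfg["test_ratio"]))
for path, subset in [(cfg["output_train"], rows[:cut]), (cfg["output_test"], rows[cut:])]:
    with open(path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["text"])
        writer.writerows([[s] for s in subset])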

Fine-tuning the Models

The fine-tuning process trains the selected language model on the preprocessed dataset. The project leverages PEFT (Parameter-Efficient Fine-Tuning), which keeps fine-tuning efficient by updating only a small fraction of the model’s parameters while the rest stay frozen.
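To make this concrete, one widely used PEFT method is LoRA, which freezes the pre-trained weights and trains small low-rank adapter matrices instead. The sketch below shows what wrapping GPT-2 with a LoRA adapter looks like using the Hugging Face peft library; the hyperparameters are placeholders rather than the project’s actual settings, and the project’s config may select a different adapter type.

# Illustrative LoRA example with the Hugging Face peft library; the rank,
# alpha, and target modules below are placeholders, not the project's settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # GPT-2 small (117M)

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the weights are trainable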

To fine-tune the model, run the following command:

python3 src/modeling/finetune.py -c config/model_finetune.toml

The configuration file config/model_finetune.toml specifies parameters such as the input file, model name, device (CPU or GPU), and whether to use PEFT or full fine-tuning.
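Under the hood, a fine-tuning run of this kind usually amounts to tokenizing the CSV and handing it to a standard causal-language-modeling training loop. The following sketch uses the Hugging Face Trainer and assumes the train file path and column name from the preprocessing sketch above; the actual src/modeling/finetune.py and its config-driven options (device, PEFT vs. full fine-tuning) may look different.

# Minimal fine-tuning sketch with the Hugging Face Trainer; file paths and
# hyperparameters are placeholders, not values from config/model_finetune.toml.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token                       # GPT-2 has no pad token
tokenizer.add_special_tokens({"bos_token": "<|startoftext|>"})  # match the dataset format

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))
# model = get_peft_model(model, lora_config)  # optionally wrap with PEFT as shown above

dataset = load_dataset("csv", data_files={"train": "data/train.csv"})  # assumed path

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()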

By fine-tuning the models on the TinyStories dataset, the project aims to imbue them with the ability to generate high-quality, coherent, and engaging short stories. The fine-tuned models can then be used for various applications, such as creative writing, storytelling, or even educational purposes.

Inference and Generation

Once the models are fine-tuned, they can be used for inference and text generation. The project provides a simple interface for generating new stories based on a given prompt. To generate a new story, run the following command:

python3 src/modeling/inference.py \
    -c config/model_inference.toml \
    -q "Once upon a time, in an ancient house, there lived a girl named Lily. \
She loved to decorate her room with pretty things. One day, she found a box" \
    -n 1 \
    -rnd 876543

The -q flag specifies the prompt, -n sets the number of stories to generate, and -rnd provides a random seed for reproducibility.

The configuration file config/model_inference.toml contains parameters such as the output file path, model name, device, and generation hyperparameters (temperature, top-k, top-p, repetition penalty, and bad words).
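For reference, the generation step maps almost directly onto the transformers generate() API, with the config file’s hyperparameters passed as sampling arguments. The sketch below is illustrative only: the checkpoint path and hyperparameter values are placeholders, and a PEFT adapter checkpoint would instead be loaded with peft’s PeftModel.from_pretrained.

# Illustrative generation sketch; the checkpoint path and sampling values are
# placeholders, not the contents of config/model_inference.toml.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "checkpoints"  # assumed location of the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

torch.manual_seed(876543)  # plays the role of the -rnd flag

prompt = ("Once upon a time, in an ancient house, there lived a girl named Lily. "
          "She loved to decorate her room with pretty things. One day, she found a box")
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    # bad_words_ids expects lists of token ids, e.g.
    # tokenizer(word_list, add_special_tokens=False).input_ids
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))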

The generated stories can be saved to a file or displayed directly, allowing users to explore the creative potential of the fine-tuned language models.

Broader Applications

While the Tiny Story Generator project focuses on the specific use case of generating short stories, the underlying techniques and principles can be applied to a wide range of text generation tasks. The project serves as a practical example of how smaller language models can be fine-tuned to achieve high-quality outputs in specific domains, potentially reducing the computational requirements and making language models more accessible to a broader audience.

Whether you’re a writer seeking inspiration, an educator looking for engaging educational materials, or a researcher exploring the capabilities of language models, the Tiny Story Generator project offers an exciting glimpse into the future of natural language generation.