TinyStories
Phi-3, the Small Language Model from Microsoft, was trained using a novel dataset called TinyStories.
In Short
The Small Language Model (SLM) Phi-3 was trained on synthetic data generated by GPT-3.5 and GPT-4.
LLM-generated training data is often repetitive and homogeneous, lacking diversity in verbs, nouns and adjectives.
The corpus needed to combine all the qualitative elements found in natural language, such as grammar, vocabulary, facts, and reasoning, while being smaller, less diverse, and more restricted in terms of its content.
I find the principle of defining a framework, or data topology, for the LLM to follow when creating the synthetic training data particularly interesting.
The study shows that generative models can typically be trained on TinyStories in less than a day on a single GPU, and still exhibit many behaviours similar to those observed in LLMs.
Creating TinyStories
Instead of training on raw web data alone, the creators of Phi-3 looked for high-quality data.
Microsoft researchers decided to create a discrete dataset, basing the training data on a list of 3,000 words comprising roughly equal numbers of nouns, verbs, and adjectives.
They then asked a large language model to create a children’s story using one noun, one verb, and one adjective from the list — a prompt repeated millions of times over several days, generating millions of tiny children’s stories.
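As a rough sketch of that generation loop, the snippet below samples one word of each type and assembles a story prompt. The word lists and prompt wording here are illustrative assumptions, not the researchers' actual vocabulary or template.

```python
import random

# Illustrative stand-ins for the curated word list described above;
# the actual TinyStories vocabulary is far larger.
NOUNS = ["dog", "ball", "tree", "cake", "boat"]
VERBS = ["run", "jump", "find", "share", "build"]
ADJECTIVES = ["happy", "tiny", "red", "brave", "sleepy"]

def build_story_prompt() -> str:
    """Sample one noun, one verb and one adjective, then wrap them in a story prompt."""
    noun = random.choice(NOUNS)
    verb = random.choice(VERBS)
    adjective = random.choice(ADJECTIVES)
    return (
        "Write a short story for a 3-year-old child. "
        f"The story must use the noun '{noun}', the verb '{verb}' "
        f"and the adjective '{adjective}'."
    )

# Repeating the sampling step millions of times yields millions of distinct prompts.
for _ in range(3):
    print(build_story_prompt())
```

Because the three words are drawn independently on every iteration, each prompt forces a different combination of concepts into the story, which is what pushes the generated corpus towards diversity.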
Small language models are designed to excel at simpler tasks, making them more accessible and easier to use for organisations with limited resources. They can also be fine-tuned more easily to meet specific needs.
Data Design Elements
The idea behind the TinyStories dataset is to create a corpus that combines all the qualitative elements found in natural language, such as grammar, vocabulary, facts, and reasoning. However, it is designed to be smaller, less diverse, and more restricted in terms of its content.
Diverse Data
To this end, the researchers relied on the latest text generation models by OpenAI (GPT-3.5 and GPT-4) to produce large amounts of synthetic content according to specific instructions.
They specifically instructed the models to generate content using vocabulary that a typical 3-year-old child would understand, and restricted the output to the format of short stories in English.
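A minimal sketch of how such an instruction could be passed to a model via the OpenAI chat completions API is shown below; the system message, model name, and temperature are assumptions for illustration, not the prompt used in the TinyStories work.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_tiny_story(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model for one short, simply-worded story in English."""
    response = client.chat.completions.create(
        model=model,
        temperature=1.0,  # a high temperature alone does not guarantee diversity
        messages=[
            {
                "role": "system",
                "content": (
                    "You write very short stories in English, using only vocabulary "
                    "that a typical 3-year-old child would understand."
                ),
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```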
The main challenge in using large language models for producing training data is generating a dataset that is sufficiently diverse.
Prompting these models to produce stories, even with a high generation temperature, often results in a repetitive dataset that lacks the diversity needed to train a language model with an understanding of language comparable to that of children.
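One simple way to make that lack of diversity visible is to compare the number of distinct n-grams in a generated corpus to its total size. The sketch below is an illustrative check under that assumption, not a measure taken from the paper.

```python
from collections import Counter

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a set of generated stories.

    Values near 1.0 suggest varied text; values near 0.0 suggest the
    generations keep repeating the same phrases.
    """
    ngrams = Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i : i + n])] += 1
            total += 1
    return len(ngrams) / total if total else 0.0

# Near-duplicate stories yield a low distinct-2 score.
stories = [
    "the happy dog found a red ball",
    "the happy dog found a red ball in the park",
    "the happy cat found a red ball",
]
print(f"distinct-2: {distinct_n(stories):.2f}")
```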
When prompted to create its own stories, the small language model trained on TinyStories generated fluent narratives with perfect grammar.
In Conclusion
The hope with TinyStories is to facilitate the development, analysis, and research of language models, especially for low-resource or specialised domains, and to shed light on the emergence of language capabilities in these models.
A general question arising from this work is whether synthesising a refined dataset can be beneficial for training networks for practical uses. For example, it might be possible to train a customer service chatbot by synthesising a large dataset of hypothetical calls.
I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.