- shuangma
The art of data in the LLM era
Content
Data collection
Data curation
Data creation
Human in the loop
Harmlessness / Ethical consideration
Evaluation
Open problems and brainstorming
data evaluation
close the data-model loop
other modalities
Data collection
Text: web crawls, crowdsourced data
<image, text>: alt text, caption, description…
Data curation
Usually, the base model is pretrained on a massive corpus collected from web resources. However, raw web text is typically noisy and requires cleaning and preprocessing before training an LLM.
Previous techniques:
removing HTML tags, handling punctuation and capitalization, tokenizing text into sentences or words, normalizing or lemmatizing words, removing stop words, and handling special cases like URLs or numerical values.
standard curation also includes filtering out noisy or irrelevant data, removing duplicates, handling special characters or encoding issues, and addressing copyright or licensing considerations.
Case: set up criteria according to the specific target scenario, for example (a toy sketch of such rule-based filtering follows this list):
discard samples that are too short or too long;
remove samples that contain 'harmful words';
remove non-meaningful content, e.g. code fragments, placeholders, JavaScript, etc.;
remove duplicates using simple statistics (e.g. exact or near-duplicate matching)
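A toy Python sketch of the rule-based filters above; the length thresholds, keyword list, and regex are placeholders that would be tuned for the actual target scenario:

```python
import hashlib
import re

# Placeholder thresholds / lists -- tune for the actual target scenario.
MIN_CHARS, MAX_CHARS = 200, 20_000
HARMFUL_WORDS = {"harmful_word_1", "harmful_word_2"}
CODE_LIKE = re.compile(r"(</?script>|function\s*\(|var\s+\w+\s*=|\{\{.*?\}\})")

def keep_sample(text: str, seen_hashes: set) -> bool:
    """Return True if a raw sample passes the simple rule-based filters."""
    # discard samples that are too short or too long
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False
    lowered = text.lower()
    # remove samples containing 'harmful words' (naive keyword match)
    if any(w in lowered for w in HARMFUL_WORDS):
        return False
    # remove code / placeholder / JavaScript-like fragments
    if CODE_LIKE.search(text):
        return False
    # remove exact duplicates via content hashing
    digest = hashlib.md5(lowered.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

corpus = ["a long enough example document ... " * 20, "<script>var x = 1</script>"]
seen: set = set()
cleaned = [t for t in corpus if keep_sample(t, seen)]
```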
Advanced techniques:
The techniques above can filter out obviously noisy / dirty data, but when the data has no obvious or statistical flaws we need more advanced techniques. These include, but are not limited to, data that:
is not self-contained
does not contain meaningful content
lacks logical structure or contains logical errors
has overly similar content / style
contains harmful content
has misaligned responses
One possible solution is to use an LLM as a grader (e.g. the GPT series):
Case: we want to filter out data that does not meet our specific expectations.
design an appropriate prompt, such that GPT-4 (or another model) can grade samples accordingly, and use the grades as annotations.
annotate a small amount of data
use the labeled data to train a classifier, and then automatically filter out the ‘unexpected’ data.
However, coming up with an appropriate prompt depends on our expectations or evaluation criteria, which are very task-specific. In some cases the expectations are not even clear, and extensive study and investigation are needed to formulate standard, effective criteria. A toy sketch of the grade-then-classify pipeline follows.
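A minimal sketch of that pipeline, assuming a hypothetical llm_grade() wrapper around the grader model (mocked here so the example runs) and a lightweight TF-IDF classifier as the cheap filter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

GRADER_PROMPT = (
    "Rate the following sample from 1 (noisy, not self-contained, no meaningful "
    "content) to 5 (clean, self-contained, logically sound). Reply with the "
    "number only.\n\n{sample}"
)

def llm_grade(sample: str) -> int:
    # Placeholder for a real GPT-4 (or similar) call using GRADER_PROMPT;
    # mocked with a length heuristic so the sketch runs end to end.
    return 5 if len(sample.split()) > 8 else 1

# 1) annotate a small seed set with the LLM grader
seed = [
    "A self-contained paragraph explaining how rainbows form from refraction.",
    "click here", "lorem ipsum",
    "A complete worked example of solving a quadratic equation step by step.",
]
labels = [1 if llm_grade(s) >= 4 else 0 for s in seed]   # 1 = keep, 0 = filter out

# 2) train a lightweight classifier on the LLM-labeled seed data
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(seed, labels)

# 3) apply the classifier to the full corpus to filter 'unexpected' data cheaply
corpus = ["Another detailed, self-contained explanation of photosynthesis in plants.",
          "buy now!!!"]
kept = [s for s in corpus if clf.predict([s])[0] == 1]
```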
<image, text> pairs data:
use a contrastively pretrained model as a filter, e.g. CLIP: keep only pairs whose image-text similarity exceeds a threshold (sketched below)
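A minimal sketch of CLIP-score filtering with the Hugging Face transformers CLIP model; the 0.25 cutoff and the example pair are assumptions to be tuned / replaced:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

THRESHOLD = 0.25                                 # assumed cutoff; tune on a held-out subset
pairs = [(Image.new("RGB", (224, 224), "white"), "a blank white square")]  # toy <image, text> pair
filtered = [(im, cap) for im, cap in pairs if clip_score(im, cap) >= THRESHOLD]
```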
Data creation
Why?
Data augmentation
Dataset Balancing
In-context learning / Instruction tuning
Efficiency: reduce the burden of expensive human labeling
What and How? Use ChatGPT / GPT-4 to generate the synthetic data we expect.
Diversity: ideally the generated data should be very diverse, e.g. covering different concepts, skills, scenarios, and different levels of difficulty, complexity, style, etc. However, keeping generative models from producing similar content is itself a research problem; for example, variational models can suffer from mode collapse. In LLMs, although we can alleviate this issue with sampling strategies (top-p sampling) or by tuning the temperature, the output still tends to fall into similar modes. So we need strategies that induce the model to be more creative and diverse in its output.
Inject randomness into the prompt in a way that gives rise to a diverse generated dataset (see the sketch after this list):
e.g. providing constraints on topics and target audience.
e.g. create a set of root words, and randomly sample a subset of the root words for generation
fundamentally, we could investigate novel methodologies to improve the diversity of the generated content, though this is left as an open research problem. Potential directions include formulating LLMs as variational models, e.g. via a diffusion process, injecting random variables, etc.
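A toy sketch of the prompt-randomization idea; the topic, audience, and root-word pools below are made-up placeholders:

```python
import random

# Hypothetical pools used to inject randomness into the generation prompt.
TOPICS = ["gardening", "space travel", "cooking", "ancient history", "robotics"]
AUDIENCES = ["a 5-year-old", "a high-school student", "a domain expert"]
ROOT_WORDS = ["bridge", "whisper", "engine", "harvest", "mirror", "signal", "drift"]

PROMPT_TEMPLATE = (
    "Write a short, self-contained passage about {topic} for {audience}. "
    "It must naturally use the words: {words}. "
    "Vary the style and level of difficulty."
)

def sample_prompt(rng: random.Random) -> str:
    """Build one randomized generation prompt from the pools above."""
    words = ", ".join(rng.sample(ROOT_WORDS, k=3))
    return PROMPT_TEMPLATE.format(
        topic=rng.choice(TOPICS),
        audience=rng.choice(AUDIENCES),
        words=words,
    )

rng = random.Random(0)
prompts = [sample_prompt(rng) for _ in range(5)]   # feed these to ChatGPT / GPT-4
```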
Consistency
<query, response> pairs generation
adding constraints to force the response to be closely related to the query
Reasoning
While <query, response> pairs are the most widely used form for generating / labeling data, models finetuned or distilled from such data can lack strong reasoning and complex-process understanding. So, beyond coarse-level <query, response> data, we can generate more fine-grained data. In the reasoning case, one solution can be:
ask the model to also generate the intermediate reasoning steps (a toy sketch follows after this list). This can be combined with techniques such as:
carefully designed prompts
CoT, ToT, inner-monologue, hierarchical walking memory (west-world), etc.
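A minimal sketch of turning a CoT-style model response into a fine-grained <query, steps, answer> record; the prompt wording and output format are assumptions, and the raw response is mocked so the example runs:

```python
import json

COT_PROMPT = (
    "Solve the problem below. First write the reasoning as numbered steps under "
    "'Steps:', then write the final answer under 'Answer:'.\n\nProblem: {query}"
)

def parse_cot(raw: str) -> dict:
    """Split a model response into intermediate steps and a final answer."""
    steps_part, _, answer_part = raw.partition("Answer:")
    steps = [s.strip() for s in steps_part.replace("Steps:", "").splitlines() if s.strip()]
    return {"steps": steps, "answer": answer_part.strip()}

def make_record(query: str, raw_response: str) -> str:
    """Build one fine-grained <query, steps, answer> training record (JSONL line)."""
    return json.dumps({"query": query, **parse_cot(raw_response)})

# Mocked model output so the sketch runs; in practice this comes from GPT-4
raw = "Steps:\n1. 12 * 3 = 36\n2. 36 + 4 = 40\nAnswer: 40"
print(make_record("What is 12 * 3 + 4?", raw))
```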
Human in the loop
Training with human feedback via RLHF is one of the critical techniques in LLM alignment, so designing a mature human labeling process and criteria is important. Factors we could consider (a sketch of a possible labeling record follows this list):
how to design the grading criteria, especially for 'subtle / hard' samples
for some domain-specific data, scoring requires educational / domain expertise
for harmful content, red-teaming may be needed to provide adversarial samples
to improve reasoning ability, we may need to factorize samples and give intermediate feedback, e.g. for math problems, though factorizing samples and giving appropriate feedback is not trivial.
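One possible shape for such labels is a pairwise preference record with per-aspect scores; the schema and field names below are an illustrative assumption, not a standard format:

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical per-aspect rubric, so 'subtle / hard' samples and intermediate
# reasoning steps can receive targeted feedback instead of a single binary label.
ASPECTS = ["helpfulness", "correctness", "reasoning_steps", "harmlessness"]

@dataclass
class PreferenceRecord:
    prompt: str
    response_a: str
    response_b: str
    preferred: str                                       # "a", "b", or "tie"
    aspect_scores: dict = field(default_factory=dict)    # e.g. {"reasoning_steps": {"a": 2, "b": 5}}
    annotator: str = ""                                  # track expertise for domain-specific data
    notes: str = ""                                      # e.g. which intermediate step is wrong

record = PreferenceRecord(
    prompt="Solve: 3x + 5 = 20",
    response_a="x = 5",
    response_b="3x = 15, so x = 15 / 3 = 5",
    preferred="b",
    aspect_scores={"reasoning_steps": {"a": 2, "b": 5}},
    annotator="math_expert_01",
    notes="A gives only the final answer; B shows the intermediate step.",
)
print(json.dumps(asdict(record), indent=2))
```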
Harmlessness / Ethical consideration
For real-world applications, we need to carefully ensure the content is harmless. Some things we can consider:
Pre-training on curated data
Fine-tuning with human reviewers
Bias detection and mitigation
Controlling output behavior (e.g. adding filtering on the sampling stage)
Openness, transparency, and accountability
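For the 'controlling output behavior' point above, one simple sampling-stage control is to suppress blocklisted token sequences during generation. A sketch using Hugging Face transformers' bad_words_ids option with GPT-2 and a made-up blocklist; in practice a safety taxonomy or a classifier-based filter would be used instead:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical blocklist of terms we never want to sample.
BLOCKED = ["some_harmful_term", "another_term"]
bad_words_ids = [tok(w, add_special_tokens=False).input_ids for w in BLOCKED]

inputs = tok("The user asked:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    bad_words_ids=bad_words_ids,   # blocklisted token sequences are suppressed at sampling time
)
print(tok.decode(out[0], skip_special_tokens=True))
```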
Evaluation
In the LLM era, evaluation of LLMs is itself a new research problem. As [1] states, conventional ML evaluation metrics and static benchmarks are no longer appropriate. For huge models that have already seen almost 'all the data', new benchmarks and evaluation metrics are NEEDED!
New benchmarks could be:
New challenging / unconventional tasks can be designed.
human-level tasks can be used, e.g. SAT, GRE, etc.
New evaluation could be:
fine-grained and meaningful feedback, e.g. use GPT-4 as a grader to give reflections and/or fine-grained evaluation (along different aspects depending on the scenario; for story-telling it could be grammar, creativity, consistency, etc.). Compared with binary scores, e.g. success or fail, good or bad, such evaluation provides more informative feedback.
data pruning for unbiased performance evaluation. For example, contamination experiments can be performed (a toy n-gram check is sketched below):
n-gram overlap
embedding-based (pretrained model) and syntax-based similarity analysis, e.g. via abstract syntax trees (AST)
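A toy sketch of the n-gram overlap check between training documents and evaluation samples; the choice of n and the naive word-level tokenization are simplifying assumptions:

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams of a document (n = 13 is a commonly used value)."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_doc: str, eval_samples: list, n: int = 13) -> float:
    """Fraction of eval samples sharing at least one n-gram with a training document."""
    train_grams = ngrams(train_doc, n)
    hits = sum(1 for s in eval_samples if ngrams(s, n) & train_grams)
    return hits / max(len(eval_samples), 1)

# toy example
train_doc = " ".join(["the quick brown fox jumps over the lazy dog"] * 3)
evals = ["the quick brown fox jumps over the lazy dog and runs away",
         "an entirely unrelated evaluation question"]
print(contamination_rate(train_doc, evals, n=5))   # -> 0.5
```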
Open problems and brainstorming
Data evaluation. How do we determine whether a dataset is good or not? Typically we would need to wait for model evaluation to finish and see whether the data helps improve model performance. However, with current large models such a cycle is very time-consuming and computationally expensive, so evaluation metrics for the data itself are needed. We can pre-evaluate a dataset based on intuitions such as diversity, quality, and consistency, but these are very task/benchmark-specific. A systematic and generalized way to perform data evaluation would be meaningful. Several aspects could be considered:
definition and formulation of 'good quality'; this needs comprehensive consideration of the involved application scenarios, and further investigation
comprehensive and deep study of scaling laws and emergent abilities along the 'data quality' axis
Close the dataGen-modelTrain loop
"online learning": during the training process, as the model itself is optimized, the data distribution can shift, and performance can be hurt by the resulting distribution discrepancy. In [2], an iterative training and data-labeling process is proposed to mimic an 'online learning' scenario. Leveraging high-level RL concepts to perform training and data generation is an interesting direction [3].
"active learning": in one of my previous works, 'active-contrast' [4], we were inspired by active learning to design an efficient data sampling strategy that selects only the most informative and diverse data within a limited budget. The high-level intuition generalizes well to our scenario, i.e. joint model training and data generation: the gradient space or the Hessian space can provide effective feedback about the model's optimization direction, which can be cycled back to the data-generation module to tell it what kind of data is helpful and what is not. In such a manner, the model training and data-generation modules interact with each other like a collaborative game (a toy selection sketch follows).
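A rough sketch of the selection idea only (not the method of [4]): score candidate generated samples by a difficulty proxy (here a per-sample loss standing in for gradient/Hessian information) and greedily pick samples that are both difficult and diverse in embedding space:

```python
import numpy as np

def select_informative(embeddings: np.ndarray, losses: np.ndarray, k: int) -> list:
    """Greedily pick k samples that are high-loss (informative) and far from the
    already-selected ones in embedding space (diverse)."""
    selected = [int(np.argmax(losses))]            # start from the most "difficult" sample
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(embeddings[:, None, :] - embeddings[selected][None, :, :], axis=-1),
            axis=1,
        )
        score = losses * dists                     # informative AND far from the chosen set
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

# toy candidate pool produced by the data-generation module
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))                   # e.g. model embeddings of candidates
loss = rng.uniform(size=100)                       # e.g. current model's per-sample loss
chosen = select_informative(emb, loss, k=10)       # send these back for training / relabeling
print(chosen)
```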
[2] https://arxiv.org/abs/2212.08073
[3] https://arxiv.org/pdf/2303.11366.pdf
[4] https://arxiv.org/abs/2009.09805
Other modalities
visual
RGB
3rd-person view, ego-view
other visual perception signals: optical flow, depth, segmentation, etc.
auditory
speech
audio
action
decision-making for control
planning
user activities