- shuangma
The art of data in the LLM era
Content
Data collection
Data curation
Data creation
Human in the loop
Harmlessness / Ethical consideration
Evaluation
Open problems and brainstorming
data evaluation
close the data-model loop
other modalities
Data collection
Text: web crawls, crowdsourced data
<image, text>: alt text, caption, description…
Data curation
Usually, the base model is pretrained on a massive corpus collected from web resources. However, raw web text is typically noisy and requires cleaning and preprocessing before training an LLM.
Previous techniques:
removing HTML tags, handling punctuation and capitalization, tokenizing text into sentences or words, normalizing or lemmatizing words, removing stop words, and handling special cases like URLs or numerical values.
standard curation also includes filtering out noisy or irrelevant data, removing duplicates, handling special characters or encoding issues, and addressing copyright or licensing considerations.
Case: set up criteria according to the specific target scenario, for example (a toy sketch of such rule-based filtering follows this list):
discard samples that are too short or too long;
remove samples that contain 'harmful words';
remove non-meaningful content, e.g. code fragments, placeholders, JavaScript, etc.;
remove duplicates using simple statistics (e.g. exact or near-duplicate matching)
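A toy Python sketch of the rule-based filters above; the length thresholds, keyword list, and regex are placeholders that would be tuned for the actual target scenario:

```python
import hashlib
import re

# Placeholder thresholds / lists -- tune for the actual target scenario.
MIN_CHARS, MAX_CHARS = 200, 20_000
HARMFUL_WORDS = {"harmful_word_1", "harmful_word_2"}
CODE_LIKE = re.compile(r"(</?script>|function\s*\(|var\s+\w+\s*=|\{\{.*?\}\})")

def keep_sample(text: str, seen_hashes: set) -> bool:
    """Return True if a raw sample passes the simple rule-based filters."""
    # discard samples that are too short or too long
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False
    lowered = text.lower()
    # remove samples containing 'harmful words' (naive keyword match)
    if any(w in lowered for w in HARMFUL_WORDS):
        return False
    # remove code / placeholder / JavaScript-like fragments
    if CODE_LIKE.search(text):
        return False
    # remove exact duplicates via content hashing
    digest = hashlib.md5(lowered.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

corpus = ["a long enough example document ... " * 20, "<script>var x = 1</script>"]
seen: set = set()
cleaned = [t for t in corpus if keep_sample(t, seen)]
```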
Advanced techniques:
The techniques above can filter out obviously noisy / dirty data, but when the data has no obvious or statistical flaws we need more advanced techniques. These include, but are not limited to, data that:
is not self-contained
does not contain meaningful content
lacks logical structure or contains logical errors
has overly similar content / style
contains harmful content
has misaligned responses
One possible solution is to use an LLM as a grader (e.g. the GPT series):
Case: we want to filter out data that does not meet our specific expectations.
design an appropriate prompt, such that GPT-4 (or another model) can grade samples accordingly, and use the grades as annotations.
annotate a small amount of data
use the labeled data to train a classifier, and then automatically filter out the ‘unexpected’ data.
However, coming up with an appropriate prompt depends on our expectations or evaluation criteria, which are very task-specific. In some cases the expectations are not even clear, and extensive study and investigation are needed to formulate standard, effective criteria. A toy sketch of the grade-then-classify pipeline follows.
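A minimal sketch of that pipeline, assuming a hypothetical llm_grade() wrapper around the grader model (mocked here so the example runs) and a lightweight TF-IDF classifier as the cheap filter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

GRADER_PROMPT = (
    "Rate the following sample from 1 (noisy, not self-contained, no meaningful "
    "content) to 5 (clean, self-contained, logically sound). Reply with the "
    "number only.\n\n{sample}"
)

def llm_grade(sample: str) -> int:
    # Placeholder for a real GPT-4 (or similar) call using GRADER_PROMPT;
    # mocked with a length heuristic so the sketch runs end to end.
    return 5 if len(sample.split()) > 8 else 1

# 1) annotate a small seed set with the LLM grader
seed = [
    "A self-contained paragraph explaining how rainbows form from refraction.",
    "click here", "lorem ipsum",
    "A complete worked example of solving a quadratic equation step by step.",
]
labels = [1 if llm_grade(s) >= 4 else 0 for s in seed]   # 1 = keep, 0 = filter out

# 2) train a lightweight classifier on the LLM-labeled seed data
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(seed, labels)

# 3) apply the classifier to the full corpus to filter 'unexpected' data cheaply
corpus = ["Another detailed, self-contained explanation of photosynthesis in plants.",
          "buy now!!!"]
kept = [s for s in corpus if clf.predict([s])[0] == 1]
```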
<image, text> pairs data:
use a contrastively pretrained model as a filter, e.g. CLIP: keep only pairs whose image-text similarity exceeds a threshold (sketched below)
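A minimal sketch of CLIP-score filtering with the Hugging Face transformers CLIP model; the 0.25 cutoff and the example pair are assumptions to be tuned / replaced:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

THRESHOLD = 0.25                                 # assumed cutoff; tune on a held-out subset
pairs = [(Image.new("RGB", (224, 224), "white"), "a blank white square")]  # toy <image, text> pair
filtered = [(im, cap) for im, cap in pairs if clip_score(im, cap) >= THRESHOLD]
```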
Data creation
Why?
Data augmentation
Dataset Balancing
In-context learning / Instruction tuning
Efficiency: reduce the burden of expensive human labeling
What and How? Use ChatGPT / GPT-4 to generate the synthetic data we expect.
Diversity: ideally the generated data should be very diverse, e.g. covering different concepts, skills, scenarios, and different levels of difficulty, complexity, style, etc. However, keeping generative models from producing similar content is itself a research problem; for example, variational models can suffer from mode collapse. In LLMs, although we can alleviate this issue with sampling strategies (top-p sampling) or by tuning the temperature, the output still tends to fall into similar modes. So we need strategies that induce the model to be more creative and diverse in its output.
Inject randomness into the prompt in a way that gives rise to a diverse generated dataset (see the sketch after this list):
e.g. providing constraints on topics and target audience.
e.g. create a set of root words, and randomly sample a subset of the root words for generation
fundamentally, we could investigate novel methodologies to improve the diversity of the generated content, though this is left as an open research problem. Potential directions include formulating LLMs as variational models, e.g. via a diffusion process, injecting random variables, etc.
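A toy sketch of the prompt-randomization idea; the topic, audience, and root-word pools below are made-up placeholders:

```python
import random

# Hypothetical pools used to inject randomness into the generation prompt.
TOPICS = ["gardening", "space travel", "cooking", "ancient history", "robotics"]
AUDIENCES = ["a 5-year-old", "a high-school student", "a domain expert"]
ROOT_WORDS = ["bridge", "whisper", "engine", "harvest", "mirror", "signal", "drift"]

PROMPT_TEMPLATE = (
    "Write a short, self-contained passage about {topic} for {audience}. "
    "It must naturally use the words: {words}. "
    "Vary the style and level of difficulty."
)

def sample_prompt(rng: random.Random) -> str:
    """Build one randomized generation prompt from the pools above."""
    words = ", ".join(rng.sample(ROOT_WORDS, k=3))
    return PROMPT_TEMPLATE.format(
        topic=rng.choice(TOPICS),
        audience=rng.choice(AUDIENCES),
        words=words,
    )

rng = random.Random(0)
prompts = [sample_prompt(rng) for _ in range(5)]   # feed these to ChatGPT / GPT-4
```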
Consistency
<query, response> pairs generation
adding constraints to force the response to be closely related to the query
Reasoning
While <query, response> pairs are the most widely used form for generating / labeling data, models finetuned or distilled from such data can lack strong reasoning and complex-process understanding. So, beyond coarse-level <query, response> data, we can generate more fine-grained data. In the reasoning case, one solution can be:
ask the model to also generate the intermediate reasoning steps (a toy sketch follows after this list). This can be combined with techniques such as:
carefully designed prompts
CoT, ToT, inner-monologue, hierarchical walking memory (west-world), etc.
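A minimal sketch of turning a CoT-style model response into a fine-grained <query, steps, answer> record; the prompt wording and output format are assumptions, and the raw response is mocked so the example runs:

```python
import json

COT_PROMPT = (
    "Solve the problem below. First write the reasoning as numbered steps under "
    "'Steps:', then write the final answer under 'Answer:'.\n\nProblem: {query}"
)

def parse_cot(raw: str) -> dict:
    """Split a model response into intermediate steps and a final answer."""
    steps_part, _, answer_part = raw.partition("Answer:")
    steps = [s.strip() for s in steps_part.replace("Steps:", "").splitlines() if s.strip()]
    return {"steps": steps, "answer": answer_part.strip()}

def make_record(query: str, raw_response: str) -> str:
    """Build one fine-grained <query, steps, answer> training record (JSONL line)."""
    return json.dumps({"query": query, **parse_cot(raw_response)})

# Mocked model output so the sketch runs; in practice this comes from GPT-4
raw = "Steps:\n1. 12 * 3 = 36\n2. 36 + 4 = 40\nAnswer: 40"
print(make_record("What is 12 * 3 + 4?", raw))
```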
Human in the loop
Training with human feedback via RLHF is one of the critical techniques in LLM alignment, so designing a mature human labeling process and criteria is important. Factors we could consider (a sketch of a possible labeling record follows this list):
how to design the grading criteria, especially for 'subtle / hard' samples
for some domain-specific data, scoring requires educational / domain expertise
for harmful content, red-teaming may be needed to provide adversarial samples
to improve reasoning ability, we may need to factorize samples and give intermediate feedback, e.g. for math problems, though factorizing samples and giving appropriate feedback is not trivial.
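One possible shape for such labels is a pairwise preference record with per-aspect scores; the schema and field names below are an illustrative assumption, not a standard format:

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical per-aspect rubric, so 'subtle / hard' samples and intermediate
# reasoning steps can receive targeted feedback instead of a single binary label.
ASPECTS = ["helpfulness", "correctness", "reasoning_steps", "harmlessness"]

@dataclass
class PreferenceRecord:
    prompt: str
    response_a: str
    response_b: str
    preferred: str                                       # "a", "b", or "tie"
    aspect_scores: dict = field(default_factory=dict)    # e.g. {"reasoning_steps": {"a": 2, "b": 5}}
    annotator: str = ""                                  # track expertise for domain-specific data
    notes: str = ""                                      # e.g. which intermediate step is wrong

record = PreferenceRecord(
    prompt="Solve: 3x + 5 = 20",
    response_a="x = 5",
    response_b="3x = 15, so x = 15 / 3 = 5",
    preferred="b",
    aspect_scores={"reasoning_steps": {"a": 2, "b": 5}},
    annotator="math_expert_01",
    notes="A gives only the final answer; B shows the intermediate step.",
)
print(json.dumps(asdict(record), indent=2))
```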
Harmlessness / Ethical consideration
For real-world applications, we need to carefully ensure the content is harmless. Some things we can consider:
Pre-training on curated data
Fine-tuning with human reviewers
Bias detection and mitigation
Controlling output behavior (e.g. adding filtering on the sampling stage)
Openness, transparency, and accountability
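For the 'controlling output behavior' point above, one simple sampling-stage control is to suppress blocklisted token sequences during generation. A sketch using Hugging Face transformers' bad_words_ids option with GPT-2 and a made-up blocklist; in practice a safety taxonomy or a classifier-based filter would be used instead:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical blocklist of terms we never want to sample.
BLOCKED = ["some_harmful_term", "another_term"]
bad_words_ids = [tok(w, add_special_tokens=False).input_ids for w in BLOCKED]

inputs = tok("The user asked:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    bad_words_ids=bad_words_ids,   # blocklisted token sequences are suppressed at sampling time
)
print(tok.decode(out[0], skip_special_tokens=True))
```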
Evaluation
In the LLM era, evaluation of LLMs is itself a new research problem. As [1] states, conventional ML evaluation metrics and static benchmarks are no longer appropriate. For huge models that have already seen almost 'all the data', new benchmarks and evaluation metrics are NEEDED!
New benchmarks could be:
New challenging / unconventional tasks can be designed.
human-level tasks can be used, e.g. SAT, GRE, etc.
New evaluation could be:
fine-grained and meaningful feedback, e.g. use GPT-4 as a grader to give reflections and/or fine-grained evaluation (along different aspects depending on the scenario; for story-telling it could be grammar, creativity, consistency, etc.). Compared with binary scores, e.g. success or fail, good or bad, such evaluation provides more informative feedback.
data pruning for unbiased performance evaluation. For example, contamination experiments can be performed (a toy n-gram check is sketched below):
n-gram overlap
embedding-based (pretrained model) and syntax-based similarity analysis, e.g. via abstract syntax trees (AST)
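A toy sketch of the n-gram overlap check between training documents and evaluation samples; the choice of n and the naive word-level tokenization are simplifying assumptions:

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams of a document (n = 13 is a commonly used value)."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_doc: str, eval_samples: list, n: int = 13) -> float:
    """Fraction of eval samples sharing at least one n-gram with a training document."""
    train_grams = ngrams(train_doc, n)
    hits = sum(1 for s in eval_samples if ngrams(s, n) & train_grams)
    return hits / max(len(eval_samples), 1)

# toy example
train_doc = " ".join(["the quick brown fox jumps over the lazy dog"] * 3)
evals = ["the quick brown fox jumps over the lazy dog and runs away",
         "an entirely unrelated evaluation question"]
print(contamination_rate(train_doc, evals, n=5))   # -> 0.5
```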
Open problems and brainstorming
Data evaluation. How do we determine whether a dataset is good or not? Typically we would need to wait for model evaluation to finish and see whether the data helps improve model performance. However, with current large models such a cycle is very time-consuming and computationally expensive, so evaluation metrics for the data itself are needed. We can pre-evaluate a dataset based on intuitions such as diversity, quality, and consistency, but these are very task/benchmark-specific. A systematic and generalized way to perform data evaluation would be meaningful. Several aspects could be considered:
definition and formulation of 'good quality'; this needs comprehensive consideration of the involved application scenarios, and further investigation
comprehensive and deep study of scaling laws and emergent abilities along the 'data quality' axis
Close the dataGen-modelTrain loop
"online learning": during the training process, as the model itself is optimized, the data distribution can shift, and performance can be hurt by the resulting distribution discrepancy. In [2], an iterative training and data-labeling process is proposed to mimic an 'online learning' scenario. Leveraging high-level RL concepts to perform training and data generation is an interesting direction [3].
"active learning": in one of my previous works, 'active-contrast' [4], we were inspired by active learning to design an efficient data sampling strategy that selects only the most informative and diverse data within a limited budget. The high-level intuition generalizes well to our scenario, i.e. joint model training and data generation: the gradient space or the Hessian space can provide effective feedback about the model's optimization direction, which can be cycled back to the data-generation module to tell it what kind of data is helpful and what is not. In such a manner, the model training and data-generation modules interact with each other like a collaborative game (a toy selection sketch follows).
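A rough sketch of the selection idea only (not the method of [4]): score candidate generated samples by a difficulty proxy (here a per-sample loss standing in for gradient/Hessian information) and greedily pick samples that are both difficult and diverse in embedding space:

```python
import numpy as np

def select_informative(embeddings: np.ndarray, losses: np.ndarray, k: int) -> list:
    """Greedily pick k samples that are high-loss (informative) and far from the
    already-selected ones in embedding space (diverse)."""
    selected = [int(np.argmax(losses))]            # start from the most "difficult" sample
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(embeddings[:, None, :] - embeddings[selected][None, :, :], axis=-1),
            axis=1,
        )
        score = losses * dists                     # informative AND far from the chosen set
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

# toy candidate pool produced by the data-generation module
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))                   # e.g. model embeddings of candidates
loss = rng.uniform(size=100)                       # e.g. current model's per-sample loss
chosen = select_informative(emb, loss, k=10)       # send these back for training / relabeling
print(chosen)
```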
[2] https://arxiv.org/abs/2212.08073
[3] https://arxiv.org/pdf/2303.11366.pdf
[4] https://arxiv.org/abs/2009.09805
Other modalities
visual
RGB
3rd-person view, ego-view
other visual perception signals: optical flow, depth, segmentation, etc.
auditory
speech
audio
action
decision-making for control
planning
user activities