LLMs_talk_March23.pdf

Advances, Challenges,
and Opportunities in
Model Evaluation
Nazneen Rajani | Research Lead @ Hugging Face | nazneen@hf.co | @nazneenrajani

Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot

GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
Large Language Models since GPT3
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic Anthropic
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4

GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
UL2
Cohere
Jurassic Claude
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4

Model Access
�� 🔒
Open access models Closed access models

🔓 Open Access Models
All model components are publicly available:
● Open source code
● Training data
○ Sources and their distribution
○ Data preprocessing and curation steps
● Model weights
● Paper or blog summarizing
○ Architecture and training details
○ Evaluation results
○ Adaptation to the model
■ Safety ﬁlters
■ Training with human feedback

🔓 Open Access Models
Allows reproducing results and replicating parts of the model
Enable auditing and conducting risk analysis
Serves as a research artifact
Enables interpreting model output

🔒 Closed Access Models
Only research paper or blog is available and may include overview of
● Training data
● Architecture and training details (including infrastructure)
● Evaluation results
● Adaptation to the model
○ Safety ﬁlters
○ Training with human feedback

🔒 Closed Access Models
Safety concerns
Competitive advantage
Expensive to setup guardrails for safe access

Model Access
�� 🔒
Open access Closed access
Limited access
��

GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
UL2
Cohere
Jurassic Claude
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4
��
��
��
��
��
��
��
��
��
��
��
��

Open Access Large Language Models
Research on policy, governance, AI safety and alignment
Community eﬀorts like Eleuther, Big Science, LAION
Papers with several authors
Open source ML has potential for huge impact

Ecosystem as part of the ML workﬂow
Collect data Train model Evaluate Deploy
>23K datasets >143K models
>70 metrics and
measurements
Spaces/ Gradio for
demos

ML Modeling Landscape
There is an exponential growth of ML models.

ML Modeling Landscape
Distribution by task categories

NLP Modeling Landscape
Approx 40% of the task categories are NLP
Covering 78% of the models

Including multimodal – 55% task categories

Including multimodal – 55% task categories
Including speech – 72% task categories
Coverage – 90% of models

Distribution by language (based on 20% models reporting)

Model Usage
Top 0.2% models (N=124) makeup >80% HF model
usage

Model Usage
usage
98% of these models are trained on just text data

Model Usage
usage
98% of these models are trained on just text data
Of these –
65% were created before 2021
33% were created in 2021
2% were created in 2022

Model Age vs. Usage
Relation between model age and its usage

Model Age vs. Usage
Relation between model age and its usage
These models served as research artifacts for the later generation of models

Model Age vs. Usage
Factors:
1. Compute is becoming cheaper making model training more accessible
2. As more models are created, their usage is distributed
3. Models are being replaced by their eﬃcient counterparts (ex: BERT →
DistilBERT)

Trend Width
Step 1: Find all peaks in a signal
Step 2: Measure peak widths at base
Step 3: Take the max width

Model Usage Trends
Usage trend width for top models
https://huggingface.co/spaces/nazneen/model-usage
bert-base-uncased

Model Usage Trends
bert-base-uncased
sentence-transformers/paraphrase-
xlm-r-multilingual-v1

Model Usage Trends
bert-base-uncased
sentence-transformers/paraphrase-xlm
-r-multilingual-v1
HateSpeech-CNERG/indic-abusive-allIn
One-MuRIL

Model Usage Trends
Average trend widths of models in 90th percentile of usage:
Created before 2021 → 60 weeks
Created in 2021 → 45 weeks
Created in 2022 → 24 weeks

Model Usage
What other factors might aﬀect model usage?
- What does the model do?
- How does it perform?
- What was it trained on?
- Is it easy to use?
- What are its limitations?

Model Usage
Model
documentation!
What other factors might aﬀect model usage?
- What does the model do?
- How good is the model?
- What was it trained on?
- Is it easy to use?
- What are its limitations?

Model Documentation
Collect data Train model Evaluate Deploy
✔ Dataset ✔ How to use
✔ Intended
uses
✔ Evaluation
✔ Limitations
✔ Training
✔ Environmental impact

Why document models?
🔍Transparency
📢Communication
📈Reproducibility

Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)

Model Documentation in
Model documentation is part of the repo’s README
��

Model documentation statistics
Newer models
are less likely to
have model
cards

Model Documentation vs. Usage
Observation: Only 50% models have model cards but contribute 98% of
total usage

total usage
Goal: Study the relation between model usage and documentation

total usage
Hypothesis: Model documentation drives model usage

Model Documentation RCT
total usage
Randomized Control Trial (RCT) for models:

total usage
Model population

total usage
Model population
Control group
Treatment group

Model population
Control group
Treatment group Documentation
total usage

Model population
Control group
Treatment group Documentation
Compare usage
total usage

Randomized Control Trial Process
Treatment group

󰠁󰠁
Treatment group
Documentation

Treatment group
Documentation
󰠁󰠁

Treatment group
Documentation Submit Pull Requests
󰠁
󰠁󰠁

Treatment group
󰠁
Documentation is part of
model repo
󰠁󰠁

Treatment group
󰠁
Documentation is part of
model repo
1 week
}
󰠁󰠁

RCT Results
Red line indicates week when treatment was administered

Model Documentation RCT Findings
1. Increased usage of models in treatment group compared to control group
2. More prominent for model weights downloads
3. Model documentation drives model usage

What do developers document about models?
Distribution of sections in model cards

NLP Evaluation Landscape
Slew of work on evaluation in NLP

Tools

Papers

NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data

Example: short reviews (< 50 words) in the IMDB sentiment dataset
Tools: Snorkel (Ratner et al., 2017), Errudite (Wu et al., 2019)

2. Transformations – natural perturbations to original evaluation instances

Example: substitute words with their synonyms in the IMDB dataset
Tools: NLPAug (Ma, 2019)

3. Evaluation sets – evaluation on diagnostic sets

Example: write new movie reviews in the style of a newspaper columnist
Tools: CheckList (Ribeiro et al., 2020)

4. Attacks – adversarial evaluation

4. Attacks – adversarial evaluation
Example: add “aabbccaa” to reviews because it makes the model predict positive sentiment
Tools: TextAttack (Morris et al., 2020), OpenAttack (Zeng et al., 2020)

Slew of work on evaluation in NLP -- tools and research papers

Goldilocks spectrum for Model Evaluation
Aggregate
evaluations
Adversarial
attacks
Subpopulations/
Disaggregate
evaluations
Distribution shift
Transformations/
Natural
perturbations
Diagnostic sets

Clever Hans eﬀect
Challenges with Evaluation

Challenges with evaluation
Idiomatic Lock-In Workﬂow Fragmentation
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Scattered evaluation Difﬁculty reporting
Challenges
Today

Robustness Gym
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Challenges
Today
Entire Evaluation Spectrum Consolidate Findings
Subpopulations
Transformations
Attacks
Evaluation sets
Testbenches Robustness Reports
Robustness
Gym
(Goel*, Rajani*, et al., NAACL 2021)

Robustness Gym
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Challenges
Today
Entire Evaluation Spectrum Consolidate Findings
Subpopulations
Transformations
Diagnostic sets
Attacks
Testbenches Robustness Reports
Robustness
Gym
(Goel*, Rajani*, et al., NAACL 2021)

Load your
dataset
Robustness Gym Workﬂow

Cache useful
information

Build slices
of data

Consolidate
slices into a
testbench

Evaluate a
model to
generate a
report

Robustness Report for Natural Language Inference using bert-base-uncased on SNLI

Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
When did England last win the football world cup?

Wikipedia
FIFA World Cup
England National Football Team

Wikipedia
FIFA World Cup
Downstream System
Question Answering System

Wikipedia
Downstream System
FIFA World Cup
Question Answering System
1966
A correct NEL is required for the downstream system!

Robustness Report for NEL on AIDA-b dataset

Popularity
heuristic
outperforms all
commercial
systems

Commercial
APIs are not any
more robust
than popularity
heuristic

Commercial
systems are
capitalization
sensitive

Type of
Systematic
Error!

Systematic Error Analysis and Labeling (SEAL)
Evaluation is a creative process
Systematic errors are diﬃcult to detect:
- High dimension of the learned representations
- Extracting and labeling semantics in the error group requires human-in-the-loop
Interactive tool to identify and label candidate data slices with high systematic errors
(Rajani et al, EMNLP ‘22 demo)

1. Embed
Identify candidate groups with high systematic errors

Identify candidate groups with high systematic errors
2. Cluster

Generate semantic labels using LLMs
books
music
worst book/album reviews
products that work with both
Windows and Mac
Gym equipment
3. Semantic Labeling

https://huggingface.co/spaces/nazneen/seal

SEAL Experimental Results
SEAL identiﬁed data groups where the model performance drops between 5% to 28%

Takeaways
1. Open-sourcing ML research artifacts is becoming the norm

Takeaways
1. Open-sourcing ML research artifacts is now the default
2. The most popular Hugging Face models are those that are older and
well-documented

Takeaways
well-documented
3. Model evaluation can be actionable – RG toolkit supports this goal via ﬁne-grained
evaluation

Takeaways
well-documented
3. Model evaluation can be actionable – RG toolkit supports this goal via ﬁne-grained
evaluation
4. LLMs can help label systematic errors in models in a human interpretable way

Current Research Focus
● Open-source alternative to ChatGPT
● Follow what we are building https://huggingface.co/HuggingFaceH4
● Evaluating a Chatbot

Training a Chatbot
1. Pretraining the LM
a. Predicting the next token
b. Eg: GPT-3, BLOOM
2. Incontext learning (aka prompt-based learning)
a. Few shot learning without updating the parameters
b. Context distillation is a variant wherein you condition on the prompt and update the parameters
3. Supervised ﬁne-tuning
a. Fine-tuning for instruction following and to make them chatty
b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I, Alpaca
4. Reinforcement Learning from Human Feedback
a. safety/alignment
b. nudging the LM towards values you desire

Training a Chatbot
1. Pretraining the LM
a. Predicting the next token
b. Eg: GPT-3, BLOOM
2. Incontext learning (aka prompt-based learning)
a. Few shot learning without updating the parameters
b. Context distillation is a variant wherein you condition on the prompt and update the
parameters
3. Supervised ﬁne-tuning
a. Fine-tuning for instruction following and to make them chatty
b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I
4. Reinforcement Learning from Human Feedback
a. safety/alignment
b. nudging the LM towards values you desire

Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).

Training a Chatbot
Supervised Fine-tuning

Training a Chatbot
Reinforcement learning with human feedback (RLHF)

Evaluating instruction
following/chatty-ness

● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions

Evaluating the RM

● Step 2: Evaluating the RM. Can the model choose between a truthful and a
untruthful response? Can it rank harmful responses lower than the harmless
responses?

Red-teaming

responses?
● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities and
emerging capabilities.
○ Eg: Complete the sentence, “You should just go kill”

Evaluating instruction
following/chatty-ness
Evaluating the RM Red-teaming

responses?
● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities
and emerging capabilities.
○ Eg: Complete the sentence, “You should just go kill”

Red-Teaming
Evaluating LLMs for:
1. Model vulnerabilities
2. Emerging capabilities that they are not explicitly trained for

Red-Teaming
1. Model vulnerabilities

Red-Teaming
2. Emerging Capabilities
- Power-seeking behavior (eg: resources)
- Persuading people to do harm (on themselves or others)
- Having agency with physical outcomes (eg: ordering chemicals online via an API)
These are considered critical threat scenarios

Red-Teaming
Similarities with adversarial attacks:
- Goal is to “attack” or “manipulate” the model to generate harmful content
- Actionable: used to ﬁne-tune the model to steer it away to generate friendly output

Red-Teaming
Diﬀerences with adversarial attacks:
- Human interpretable and look like regular prompt. Eg: preﬁxing “aaabbcc” is
adversarial but not red-teaming.

Red-Teaming
Differences with adversarial attacks:
- Human interpretable and look like regular prompt. Eg: prefixing “aaabbcc” is
adversarial but not red-teaming.
*Warning: offensive text below*
Wallace, et al. "Universal Adversarial Triggers for Attacking and Analyzing NLP" (2021).

Red-Teaming Methods
Roleplay attacks wherein the LLM is instructed to behave as a malicious character
Instructing the model to respond in code instead of natural language
Instructing a model to reveal sensitive information such as PII.

Red-Teaming ChatGPT
https://twitter.com/spiantado/status/1599462375887114240

Takeaways from Red-Teaming
1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder
to red-team than plain LMs.
2. There are no clear trends with scaling model size for attack success rate except
RLHF models that are more diﬃcult to red-team as they scale.
3. Models may learn to be harmless by being evasive, there is tradeoﬀ between
helpfulness and harmlessness.
4. The distribution of the success rate varies across categories of harm with
non-violent ones having a higher success rate.

Open problems with Red-Teaming
1. There is no open-source red-teaming dataset for code generation that
attempts to jailbreak a model via code. Eg: generating a program that
implements a DDOS or backdoor attack.
2. Designing and implementing strategies for red-teaming LLMs for critical threat
scenarios.
3. Evaluating the tradeoﬀs between evasiveness and helpfulness.

Further Reading
Red-Teaming https://huggingface.co/blog/red-teaming
RLHF https://huggingface.co/blog/rlhf
Dialog agents https://huggingface.co/blog/dialog-agents

RLHF Team
Nathan Lambert Lewis Tunstall Thomas Wolf
And more at Hugging Face and the community!
Leandro von Werra Younes Belkada Edward Beeching

Collaborators
Systematic study of HF models and SEAL
Robustness Gym
James Zou
(Stanford)
Weixin Liang
(Stanford)
Karan Goel
(Stanford)
Jesse Vig
(Salesforce)
Chris Re
(Stanford)
Mohit Bansal
(UNC)
Xinyu Yang
(ZJU)
Meg Mitchell
(Hugging Face)

LLMs_talk_March23.pdf

More Related Content

What's hot

What's hot (20)

Similar to LLMs_talk_March23.pdf

Similar to LLMs_talk_March23.pdf (20)

Recently uploaded

Recently uploaded (20)

LLMs_talk_March23.pdf