(Go: >> BACK << -|- >> HOME <<)

SlideShare a Scribd company logo
Advances, Challenges,
and Opportunities in
Model Evaluation
Nazneen Rajani | Research Lead @ Hugging Face | nazneen@hf.co | @nazneenrajani
Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
Large Language Models since GPT3
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic Anthropic
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4
GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
Large Language Models since GPT3
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic Claude
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4
Model Access
�� 🔒
Open access models Closed access models
🔓 Open Access Models
All model components are publicly available:
● Open source code
● Training data
○ Sources and their distribution
○ Data preprocessing and curation steps
● Model weights
● Paper or blog summarizing
○ Architecture and training details
○ Evaluation results
○ Adaptation to the model
■ Safety filters
■ Training with human feedback
🔓 Open Access Models
Allows reproducing results and replicating parts of the model
Enable auditing and conducting risk analysis
Serves as a research artifact
Enables interpreting model output
🔒 Closed Access Models
Only research paper or blog is available and may include overview of
● Training data
● Architecture and training details (including infrastructure)
● Evaluation results
● Adaptation to the model
○ Safety filters
○ Training with human feedback
🔒 Closed Access Models
Safety concerns
Competitive advantage
Expensive to setup guardrails for safe access
Model Access
�� 🔒
Open access Closed access
Limited access
��
GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
Large Language Models since GPT3
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic Claude
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4
�� �� ��
��
�� ��
�� ��
�� ��
��
�� ��
��
�� ��
�� ��
�� ��
��
GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
Large Language Models since GPT3
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic Claude
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4
Open Access Large Language Models
Research on policy, governance, AI safety and alignment
Community efforts like Eleuther, Big Science, LAION
Papers with several authors
Open source ML has potential for huge impact
Ecosystem as part of the ML workflow
Collect data Train model Evaluate Deploy
>23K datasets >143K models
>70 metrics and
measurements
Spaces/ Gradio for
demos
ML Modeling Landscape
There is an exponential growth of ML models.
ML Modeling Landscape
Distribution by task categories
NLP Modeling Landscape
Approx 40% of the task categories are NLP
Covering 78% of the models
NLP Modeling Landscape
Including multimodal – 55% task categories
NLP Modeling Landscape
Including multimodal – 55% task categories
Including speech – 72% task categories
Coverage – 90% of models
NLP Modeling Landscape
Distribution by language (based on 20% models reporting)
Model Usage
Top 0.2% models (N=124) makeup >80% HF model
usage
Model Usage
Top 0.2% models (N=124) makeup >80% HF model
usage
98% of these models are trained on just text data
Model Usage
Top 0.2% models (N=124) makeup >80% HF model
usage
98% of these models are trained on just text data
Of these –
65% were created before 2021
33% were created in 2021
2% were created in 2022
Model Age vs. Usage
Relation between model age and its usage
Model Age vs. Usage
Relation between model age and its usage
Model Age vs. Usage
Relation between model age and its usage
These models served as research artifacts for the later generation of models
Model Age vs. Usage
Relation between model age and its usage
Model Age vs. Usage
Factors:
1. Compute is becoming cheaper making model training more accessible
2. As more models are created, their usage is distributed
3. Models are being replaced by their efficient counterparts (ex: BERT →
DistilBERT)
Trend Width
Step 1: Find all peaks in a signal
Step 2: Measure peak widths at base
Step 3: Take the max width
Model Usage Trends
Usage trend width for top models
https://huggingface.co/spaces/nazneen/model-usage
bert-base-uncased
Model Usage Trends
Usage trend width for top models
https://huggingface.co/spaces/nazneen/model-usage
bert-base-uncased
sentence-transformers/paraphrase-
xlm-r-multilingual-v1
Model Usage Trends
Usage trend width for top models
https://huggingface.co/spaces/nazneen/model-usage
bert-base-uncased
sentence-transformers/paraphrase-xlm
-r-multilingual-v1
HateSpeech-CNERG/indic-abusive-allIn
One-MuRIL
Model Usage Trends
Model Usage Trends
Model Usage Trends
Model Usage Trends
Model Usage Trends
Average trend widths of models in 90th percentile of usage:
Created before 2021 → 60 weeks
Created in 2021 → 45 weeks
Created in 2022 → 24 weeks
Model Usage
What other factors might affect model usage?
- What does the model do?
- How does it perform?
- What was it trained on?
- Is it easy to use?
- What are its limitations?
Model Usage
Model
documentation!
What other factors might affect model usage?
- What does the model do?
- How good is the model?
- What was it trained on?
- Is it easy to use?
- What are its limitations?
Model Documentation
Collect data Train model Evaluate Deploy
✔ Dataset ✔ How to use
✔ Intended
uses
✔ Evaluation
✔ Limitations
✔ Training
✔ Environmental impact
Why document models?
🔍Transparency
📢Communication
📈Reproducibility
Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
Model Documentation in
Model documentation is part of the repo’s README
��
Model Documentation for GPT2
Model Documentation for GPT2
Model Documentation for GPT2
Model Documentation for GPT2
Model documentation statistics
Newer models
are less likely to
have model
cards
Model Documentation vs. Usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
Model Documentation vs. Usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Model Documentation vs. Usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Model Documentation RCT
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
Model Documentation RCT
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
Model population
Model Documentation RCT
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
Model population
Control group
Treatment group
Model Documentation RCT
Model population
Control group
Treatment group Documentation
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
Model Documentation RCT
Model population
Control group
Treatment group Documentation
Compare usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
Randomized Control Trial Process
Treatment group
Randomized Control Trial Process
󰠁󰠁
Treatment group
Documentation
Randomized Control Trial Process
Treatment group
Documentation
󰠁󰠁
Randomized Control Trial Process
Treatment group
Documentation Submit Pull Requests
󰠁
󰠁󰠁
Randomized Control Trial Process
Treatment group
Documentation Submit Pull Requests
󰠁
Documentation is part of
model repo
󰠁󰠁
Randomized Control Trial Process
Treatment group
Documentation Submit Pull Requests
󰠁
Documentation is part of
model repo
1 week
}
󰠁󰠁
RCT Results
Red line indicates week when treatment was administered
RCT Results
Red line indicates week when treatment was administered
Model Documentation RCT Findings
1. Increased usage of models in treatment group compared to control group
2. More prominent for model weights downloads
3. Model documentation drives model usage
What do developers document about models?
Distribution of sections in model cards
What do developers document about models?
Distribution of sections in model cards
Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
NLP Evaluation Landscape
Slew of work on evaluation in NLP
NLP Evaluation Landscape
Slew of work on evaluation in NLP
Tools
NLP Evaluation Landscape
Slew of work on evaluation in NLP
Papers
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
Example: short reviews (< 50 words) in the IMDB sentiment dataset
Tools: Snorkel (Ratner et al., 2017), Errudite (Wu et al., 2019)
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
Example: substitute words with their synonyms in the IMDB dataset
Tools: NLPAug (Ma, 2019)
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
Example: write new movie reviews in the style of a newspaper columnist
Tools: CheckList (Ribeiro et al., 2020)
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
4. Attacks – adversarial evaluation
NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
4. Attacks – adversarial evaluation
Example: add “aabbccaa” to reviews because it makes the model predict positive sentiment
Tools: TextAttack (Morris et al., 2020), OpenAttack (Zeng et al., 2020)
NLP Evaluation Landscape
Slew of work on evaluation in NLP -- tools and research papers
Goldilocks spectrum for Model Evaluation
Aggregate
evaluations
Adversarial
attacks
Subpopulations/
Disaggregate
evaluations
Distribution shift
Transformations/
Natural
perturbations
Diagnostic sets
Challenges with Evaluation
Clever Hans effect
Challenges with Evaluation
Challenges with evaluation
Idiomatic Lock-In Workflow Fragmentation
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Scattered evaluation Difficulty reporting
Challenges
Today
Challenges with evaluation
Idiomatic Lock-In Workflow Fragmentation
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Scattered evaluation Difficulty reporting
Challenges
Today
Challenges with evaluation
Idiomatic Lock-In Workflow Fragmentation
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Scattered evaluation Difficulty reporting
Challenges
Today
Robustness Gym
Idiomatic Lock-In Workflow Fragmentation
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Scattered evaluation Difficulty reporting
Challenges
Today
Entire Evaluation Spectrum Consolidate Findings
Subpopulations
Transformations
Attacks
Evaluation sets
Testbenches Robustness Reports
Robustness
Gym
(Goel*, Rajani*, et al., NAACL 2021)
Robustness Gym
Idiomatic Lock-In Workflow Fragmentation
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Scattered evaluation Difficulty reporting
Challenges
Today
Entire Evaluation Spectrum Consolidate Findings
Subpopulations
Transformations
Diagnostic sets
Attacks
Testbenches Robustness Reports
Robustness
Gym
(Goel*, Rajani*, et al., NAACL 2021)
Robustness Gym
Idiomatic Lock-In Workflow Fragmentation
Tool A Tool B
Subpopulations
Transformations
Attacks
Evaluation sets
Scattered evaluation Difficulty reporting
Challenges
Today
Entire Evaluation Spectrum Consolidate Findings
Subpopulations
Transformations
Attacks
Evaluation sets
Testbenches Robustness Reports
Robustness
Gym
(Goel*, Rajani*, et al., NAACL 2021)
Robustness Gym Workflow
Load your
dataset
Robustness Gym Workflow
Cache useful
information
Robustness Gym Workflow
Build slices
of data
Robustness Gym Workflow
Consolidate
slices into a
testbench
Robustness Gym Workflow
Evaluate a
model to
generate a
report
Robustness Gym Workflow
Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
When did England last win the football world cup?
Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
When did England last win the football world cup?
FIFA World Cup
England National Football Team
When did England last win the football world cup?
Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
When did England last win the football world cup?
FIFA World Cup
England National Football Team
When did England last win the football world cup?
Downstream System
Question Answering System
Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
Downstream System
FIFA World Cup
England National Football Team
Question Answering System
When did England last win the football world cup?
1966
A correct NEL is required for the downstream system!
Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Popularity
heuristic
outperforms all
commercial
systems
Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Commercial
APIs are not any
more robust
than popularity
heuristic
Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Commercial
systems are
capitalization
sensitive
Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Type of
Systematic
Error!
Systematic Error Analysis and Labeling (SEAL)
Evaluation is a creative process
Systematic errors are difficult to detect:
- High dimension of the learned representations
- Extracting and labeling semantics in the error group requires human-in-the-loop
Interactive tool to identify and label candidate data slices with high systematic errors
(Rajani et al, EMNLP ‘22 demo)
Systematic Error Analysis and Labeling (SEAL)
1. Embed
Identify candidate groups with high systematic errors
(Rajani et al, EMNLP ‘22 demo)
Systematic Error Analysis and Labeling (SEAL)
Identify candidate groups with high systematic errors
2. Cluster
(Rajani et al, EMNLP ‘22 demo)
Systematic Error Analysis and Labeling (SEAL)
Generate semantic labels using LLMs
books
music
worst book/album reviews
products that work with both
Windows and Mac
Gym equipment
3. Semantic Labeling
(Rajani et al, EMNLP ‘22 demo)
Systematic Error Analysis and Labeling (SEAL)
https://huggingface.co/spaces/nazneen/seal
Systematic Error Analysis and Labeling (SEAL)
https://huggingface.co/spaces/nazneen/seal
Systematic Error Analysis and Labeling (SEAL)
https://huggingface.co/spaces/nazneen/seal
Systematic Error Analysis and Labeling (SEAL)
https://huggingface.co/spaces/nazneen/seal
Systematic Error Analysis and Labeling (SEAL)
https://huggingface.co/spaces/nazneen/seal
SEAL Experimental Results
SEAL Experimental Results
SEAL identified data groups where the model performance drops between 5% to 28%
Takeaways
1. Open-sourcing ML research artifacts is becoming the norm
Takeaways
1. Open-sourcing ML research artifacts is now the default
2. The most popular Hugging Face models are those that are older and
well-documented
Takeaways
1. Open-sourcing ML research artifacts is becoming the norm
2. The most popular Hugging Face models are those that are older and
well-documented
3. Model evaluation can be actionable – RG toolkit supports this goal via fine-grained
evaluation
Takeaways
1. Open-sourcing ML research artifacts is becoming the norm
2. The most popular Hugging Face models are those that are older and
well-documented
3. Model evaluation can be actionable – RG toolkit supports this goal via fine-grained
evaluation
4. LLMs can help label systematic errors in models in a human interpretable way
Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
Current Research Focus
● Open-source alternative to ChatGPT
● Follow what we are building https://huggingface.co/HuggingFaceH4
● Evaluating a Chatbot
Evaluating a Chatbot
Training a Chatbot
1. Pretraining the LM
a. Predicting the next token
b. Eg: GPT-3, BLOOM
2. Incontext learning (aka prompt-based learning)
a. Few shot learning without updating the parameters
b. Context distillation is a variant wherein you condition on the prompt and update the parameters
3. Supervised fine-tuning
a. Fine-tuning for instruction following and to make them chatty
b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I, Alpaca
4. Reinforcement Learning from Human Feedback
a. safety/alignment
b. nudging the LM towards values you desire
Training a Chatbot
1. Pretraining the LM
a. Predicting the next token
b. Eg: GPT-3, BLOOM
2. Incontext learning (aka prompt-based learning)
a. Few shot learning without updating the parameters
b. Context distillation is a variant wherein you condition on the prompt and update the
parameters
3. Supervised fine-tuning
a. Fine-tuning for instruction following and to make them chatty
b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I
4. Reinforcement Learning from Human Feedback
a. safety/alignment
b. nudging the LM towards values you desire
Evaluating a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Training a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Supervised Fine-tuning
Training a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Reinforcement learning with human feedback (RLHF)
Evaluating a Chatbot
Evaluating instruction
following/chatty-ness
Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
Evaluating a Chatbot
Evaluating the RM
Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
● Step 2: Evaluating the RM. Can the model choose between a truthful and a
untruthful response? Can it rank harmful responses lower than the harmless
responses?
Evaluating a Chatbot
Red-teaming
Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
● Step 2: Evaluating the RM. Can the model choose between a truthful and a
untruthful response? Can it rank harmful responses lower than the harmless
responses?
● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities and
emerging capabilities.
○ Eg: Complete the sentence, “You should just go kill”
Evaluating a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Evaluating instruction
following/chatty-ness
Evaluating the RM Red-teaming
Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
● Step 2: Evaluating the RM. Can the model choose between a truthful and a
untruthful response? Can it rank harmful responses lower than the harmless
responses?
● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities
and emerging capabilities.
○ Eg: Complete the sentence, “You should just go kill”
Red-Teaming
Evaluating LLMs for:
1. Model vulnerabilities
2. Emerging capabilities that they are not explicitly trained for
Red-Teaming
1. Model vulnerabilities
Red-Teaming
2. Emerging Capabilities
- Power-seeking behavior (eg: resources)
- Persuading people to do harm (on themselves or others)
- Having agency with physical outcomes (eg: ordering chemicals online via an API)
These are considered critical threat scenarios
Red-Teaming
Similarities with adversarial attacks:
- Goal is to “attack” or “manipulate” the model to generate harmful content
- Actionable: used to fine-tune the model to steer it away to generate friendly output
Red-Teaming
Differences with adversarial attacks:
- Human interpretable and look like regular prompt. Eg: prefixing “aaabbcc” is
adversarial but not red-teaming.
Red-Teaming
Differences with adversarial attacks:
- Human interpretable and look like regular prompt. Eg: prefixing “aaabbcc” is
adversarial but not red-teaming.
*Warning: offensive text below*
Wallace, et al. "Universal Adversarial Triggers for Attacking and Analyzing NLP" (2021).
Red-Teaming Methods
Roleplay attacks wherein the LLM is instructed to behave as a malicious character
Instructing the model to respond in code instead of natural language
Instructing a model to reveal sensitive information such as PII.
Red-Teaming ChatGPT
https://twitter.com/spiantado/status/1599462375887114240
Red-Teaming ChatGPT
Takeaways from Red-Teaming
1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder
to red-team than plain LMs.
2. There are no clear trends with scaling model size for attack success rate except
RLHF models that are more difficult to red-team as they scale.
3. Models may learn to be harmless by being evasive, there is tradeoff between
helpfulness and harmlessness.
4. The distribution of the success rate varies across categories of harm with
non-violent ones having a higher success rate.
Open problems with Red-Teaming
1. There is no open-source red-teaming dataset for code generation that
attempts to jailbreak a model via code. Eg: generating a program that
implements a DDOS or backdoor attack.
2. Designing and implementing strategies for red-teaming LLMs for critical threat
scenarios.
3. Evaluating the tradeoffs between evasiveness and helpfulness.
Further Reading
Red-Teaming https://huggingface.co/blog/red-teaming
RLHF https://huggingface.co/blog/rlhf
Dialog agents https://huggingface.co/blog/dialog-agents
RLHF Team
Nathan Lambert Lewis Tunstall Thomas Wolf
And more at Hugging Face and the community!
Leandro von Werra Younes Belkada Edward Beeching
Collaborators
Systematic study of HF models and SEAL
Robustness Gym
James Zou
(Stanford)
Weixin Liang
(Stanford)
Karan Goel
(Stanford)
Jesse Vig
(Salesforce)
Chris Re
(Stanford)
Mohit Bansal
(UNC)
Xinyu Yang
(ZJU)
Meg Mitchell
(Hugging Face)
Thanks for listening

More Related Content

What's hot

Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
David Talby
 
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
Numenta
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
Fiza987241
 
Generative Models and ChatGPT
Generative Models and ChatGPTGenerative Models and ChatGPT
Generative Models and ChatGPT
Loic Merckel
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMs
Loic Merckel
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
SaiPragnaKancheti
 
Build an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdfBuild an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdf
AnastasiaSteele10
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGI
SynaptonIncorporated
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
Leon Dohmen
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
Suman Debnath
 
LangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AILangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AI
OzgurOscarOzkan
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Mihai Criveti
 
Generative AI and ChatGPT - Scope of AI and advance Generative AI
Generative AI and ChatGPT - Scope of AI and advance Generative AIGenerative AI and ChatGPT - Scope of AI and advance Generative AI
Generative AI and ChatGPT - Scope of AI and advance Generative AI
Kumaresan K
 
Generative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfGenerative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdf
Liming Zhu
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
Julien SIMON
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
DianaGray10
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptx
Colleen Farrelly
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
Julien SIMON
 
Using the power of Generative AI at scale
Using the power of Generative AI at scaleUsing the power of Generative AI at scale
Using the power of Generative AI at scale
Maxim Salnikov
 

What's hot (20)

Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
 
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
 
Generative Models and ChatGPT
Generative Models and ChatGPTGenerative Models and ChatGPT
Generative Models and ChatGPT
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMs
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
 
Build an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdfBuild an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdf
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGI
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
LangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AILangChain Intro by KeyMate.AI
LangChain Intro by KeyMate.AI
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
 
Generative AI and ChatGPT - Scope of AI and advance Generative AI
Generative AI and ChatGPT - Scope of AI and advance Generative AIGenerative AI and ChatGPT - Scope of AI and advance Generative AI
Generative AI and ChatGPT - Scope of AI and advance Generative AI
 
Generative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfGenerative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdf
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptx
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Using the power of Generative AI at scale
Using the power of Generative AI at scaleUsing the power of Generative AI at scale
Using the power of Generative AI at scale
 

Similar to LLMs_talk_March23.pdf

Trustworthy Generative AI_ ICML'23 Tutorial.pptx
Trustworthy Generative AI_ ICML'23 Tutorial.pptxTrustworthy Generative AI_ ICML'23 Tutorial.pptx
Trustworthy Generative AI_ ICML'23 Tutorial.pptx
sylvioneto11
 
NLP in 2020
NLP in 2020NLP in 2020
NLP in 2020
Grigory Sapunov
 
Farmers Protest - Stance Detection
Farmers Protest - Stance DetectionFarmers Protest - Stance Detection
Farmers Protest - Stance Detection
IRJET Journal
 
Aglr Tf
Aglr TfAglr Tf
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
Dr. Haxel Consult
 
Machine Learning and Deep Learning from Foundations to Applications Excel, R,...
Machine Learning and Deep Learning from Foundations to Applications Excel, R,...Machine Learning and Deep Learning from Foundations to Applications Excel, R,...
Machine Learning and Deep Learning from Foundations to Applications Excel, R,...
Narendra Ashar
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
Yuriy Guts
 
M2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparencyM2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparency
BoPeng76
 
Strategic Management – MGT 451 Final Exam Your final.docx
Strategic Management – MGT 451 Final Exam  Your final.docxStrategic Management – MGT 451 Final Exam  Your final.docx
Strategic Management – MGT 451 Final Exam Your final.docx
florriezhamphrey3065
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET Journal
 
Roadmap Composite Simulation - Summary 2015
Roadmap Composite Simulation - Summary 2015Roadmap Composite Simulation - Summary 2015
Roadmap Composite Simulation - Summary 2015
Virtual Dimension Center (VDC) Fellbach
 
Towards a harmonization of metadata application profiles for agricultural lea...
Towards a harmonization of metadata application profiles for agricultural lea...Towards a harmonization of metadata application profiles for agricultural lea...
Towards a harmonization of metadata application profiles for agricultural lea...
Gauri Salokhe
 
Medinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling WorshopMedinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling Worshop
Koray Atalag
 
Book Recommendation System Using Deep Learning (GPT3)
Book Recommendation System Using Deep Learning (GPT3)Book Recommendation System Using Deep Learning (GPT3)
Book Recommendation System Using Deep Learning (GPT3)
IRJET Journal
 
1 1leanthinking
1 1leanthinking1 1leanthinking
1 1leanthinking
Utku Orçun GEZİCİ
 
Developing_a_knowledge-reuse_tool_for_automatic_to.pdf
Developing_a_knowledge-reuse_tool_for_automatic_to.pdfDeveloping_a_knowledge-reuse_tool_for_automatic_to.pdf
Developing_a_knowledge-reuse_tool_for_automatic_to.pdf
Haji Abu
 
Metadata Quality Issues in Learning Repositories
Metadata Quality Issues in Learning RepositoriesMetadata Quality Issues in Learning Repositories
Metadata Quality Issues in Learning Repositories
Nikos Palavitsinis, PhD
 
Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
Robert Grossman
 
Algorithms 14-00122
Algorithms 14-00122Algorithms 14-00122
Algorithms 14-00122
DrSafikureshiMondal
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
Vivian S. Zhang
 

Similar to LLMs_talk_March23.pdf (20)

Trustworthy Generative AI_ ICML'23 Tutorial.pptx
Trustworthy Generative AI_ ICML'23 Tutorial.pptxTrustworthy Generative AI_ ICML'23 Tutorial.pptx
Trustworthy Generative AI_ ICML'23 Tutorial.pptx
 
NLP in 2020
NLP in 2020NLP in 2020
NLP in 2020
 
Farmers Protest - Stance Detection
Farmers Protest - Stance DetectionFarmers Protest - Stance Detection
Farmers Protest - Stance Detection
 
Aglr Tf
Aglr TfAglr Tf
Aglr Tf
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
 
Machine Learning and Deep Learning from Foundations to Applications Excel, R,...
Machine Learning and Deep Learning from Foundations to Applications Excel, R,...Machine Learning and Deep Learning from Foundations to Applications Excel, R,...
Machine Learning and Deep Learning from Foundations to Applications Excel, R,...
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
M2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparencyM2 l10 fairness, accountability, and transparency
M2 l10 fairness, accountability, and transparency
 
Strategic Management – MGT 451 Final Exam Your final.docx
Strategic Management – MGT 451 Final Exam  Your final.docxStrategic Management – MGT 451 Final Exam  Your final.docx
Strategic Management – MGT 451 Final Exam Your final.docx
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
 
Roadmap Composite Simulation - Summary 2015
Roadmap Composite Simulation - Summary 2015Roadmap Composite Simulation - Summary 2015
Roadmap Composite Simulation - Summary 2015
 
Towards a harmonization of metadata application profiles for agricultural lea...
Towards a harmonization of metadata application profiles for agricultural lea...Towards a harmonization of metadata application profiles for agricultural lea...
Towards a harmonization of metadata application profiles for agricultural lea...
 
Medinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling WorshopMedinfo 2010 openEHR Clinical Modelling Worshop
Medinfo 2010 openEHR Clinical Modelling Worshop
 
Book Recommendation System Using Deep Learning (GPT3)
Book Recommendation System Using Deep Learning (GPT3)Book Recommendation System Using Deep Learning (GPT3)
Book Recommendation System Using Deep Learning (GPT3)
 
1 1leanthinking
1 1leanthinking1 1leanthinking
1 1leanthinking
 
Developing_a_knowledge-reuse_tool_for_automatic_to.pdf
Developing_a_knowledge-reuse_tool_for_automatic_to.pdfDeveloping_a_knowledge-reuse_tool_for_automatic_to.pdf
Developing_a_knowledge-reuse_tool_for_automatic_to.pdf
 
Metadata Quality Issues in Learning Repositories
Metadata Quality Issues in Learning RepositoriesMetadata Quality Issues in Learning Repositories
Metadata Quality Issues in Learning Repositories
 
Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
 
Algorithms 14-00122
Algorithms 14-00122Algorithms 14-00122
Algorithms 14-00122
 
Using Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York TimesUsing Machine Learning to aid Journalism at the New York Times
Using Machine Learning to aid Journalism at the New York Times
 

Recently uploaded

Applications of Data Science in Various Industries
Applications of Data Science in Various IndustriesApplications of Data Science in Various Industries
Applications of Data Science in Various Industries
IABAC
 
bcme welcome and ground rule required for bcme course (1).pptx
bcme welcome and ground rule required for bcme course (1).pptxbcme welcome and ground rule required for bcme course (1).pptx
bcme welcome and ground rule required for bcme course (1).pptx
BINITADASH3
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
dipti singh$A17
 
BIGPPTTTTTTTTtttttttttttttttttttttt.pptx
BIGPPTTTTTTTTtttttttttttttttttttttt.pptxBIGPPTTTTTTTTtttttttttttttttttttttt.pptx
BIGPPTTTTTTTTtttttttttttttttttttttt.pptx
RajdeepPaul47
 
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
manjukaushik328
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
sanjay singh
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
gargtinna79
 
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdfAWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
Miguel Ángel Rodríguez Anticona
 
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers
Amazon Web Services Korea
 
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
67n7f53
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
taqyea
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript
taqyea
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
NEW THYROID DISEASES CLASSIFICATION USING ML.docx
NEW THYROID DISEASES CLASSIFICATION USING ML.docxNEW THYROID DISEASES CLASSIFICATION USING ML.docx
NEW THYROID DISEASES CLASSIFICATION USING ML.docx
dharugayu13475
 
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any TimeBangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
adityaroy0215
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
Jyotishko Biswas
 
@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here
@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here
@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here
SARITA PANDEY
 

Recently uploaded (20)

Applications of Data Science in Various Industries
Applications of Data Science in Various IndustriesApplications of Data Science in Various Industries
Applications of Data Science in Various Industries
 
bcme welcome and ground rule required for bcme course (1).pptx
bcme welcome and ground rule required for bcme course (1).pptxbcme welcome and ground rule required for bcme course (1).pptx
bcme welcome and ground rule required for bcme course (1).pptx
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
BIGPPTTTTTTTTtttttttttttttttttttttt.pptx
BIGPPTTTTTTTTtttttttttttttttttttttt.pptxBIGPPTTTTTTTTtttttttttttttttttttttt.pptx
BIGPPTTTTTTTTtttttttttttttttttttttt.pptx
 
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdfAWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
 
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers
 
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 
NEW THYROID DISEASES CLASSIFICATION USING ML.docx
NEW THYROID DISEASES CLASSIFICATION USING ML.docxNEW THYROID DISEASES CLASSIFICATION USING ML.docx
NEW THYROID DISEASES CLASSIFICATION USING ML.docx
 
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any TimeBangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
 
@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here
@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here
@Call @Girls Bandra phone 9920874524 You Are Serach A Beautyfull Dolle come here
 

LLMs_talk_March23.pdf

  • 1. Advances, Challenges, and Opportunities in Model Evaluation Nazneen Rajani | Research Lead @ Hugging Face | nazneen@hf.co | @nazneenrajani
  • 2. Outline Part 1: NLP Modeling landscape Systematic study of 75,000 models on HF Part 2: NLP Evaluation landscape Challenges and opportunities in model evaluation and documentation Part 3: Opensource alternative to ChatGPT Evaluating a Chatbot
  • 3. Outline Part 1: NLP Modeling landscape Systematic study of 75,000 models on HF Part 2: NLP Evaluation landscape Challenges and opportunities in model evaluation and documentation Part 3: Opensource alternative to ChatGPT Evaluating a Chatbot
  • 4. GPT-3 2021 Jun Oct PaLM Chinchilla OPT BLOOM Gopher 2022 Megatron TNLG Dec May Apr Jul Jul GPT-J Large Language Models since GPT3 ChatGPT Nov Dec Galactica GPT-Neo Jun GPT-NeoX Feb Flan-T5 Oct *only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1 UL2 Cohere Jurassic Anthropic 2023 Feb LLaMA Flan-UL2 March Alpaca GPT-4
  • 5. GPT-3 2021 Jun Oct PaLM Chinchilla OPT BLOOM Gopher 2022 Megatron TNLG Dec May Apr Jul Jul GPT-J Large Language Models since GPT3 ChatGPT Nov Dec Galactica GPT-Neo Jun GPT-NeoX Feb Flan-T5 Oct *only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1 UL2 Cohere Jurassic Claude 2023 Feb LLaMA Flan-UL2 March Alpaca GPT-4
  • 6. Model Access �� 🔒 Open access models Closed access models
  • 7. 🔓 Open Access Models All model components are publicly available: ● Open source code ● Training data ○ Sources and their distribution ○ Data preprocessing and curation steps ● Model weights ● Paper or blog summarizing ○ Architecture and training details ○ Evaluation results ○ Adaptation to the model ■ Safety filters ■ Training with human feedback
  • 8. 🔓 Open Access Models Allows reproducing results and replicating parts of the model Enable auditing and conducting risk analysis Serves as a research artifact Enables interpreting model output
  • 9. 🔒 Closed Access Models Only research paper or blog is available and may include overview of ● Training data ● Architecture and training details (including infrastructure) ● Evaluation results ● Adaptation to the model ○ Safety filters ○ Training with human feedback
  • 10. 🔒 Closed Access Models Safety concerns Competitive advantage Expensive to setup guardrails for safe access
  • 11. Model Access �� 🔒 Open access Closed access Limited access ��
  • 12. GPT-3 2021 Jun Oct PaLM Chinchilla OPT BLOOM Gopher 2022 Megatron TNLG Dec May Apr Jul Jul GPT-J Large Language Models since GPT3 ChatGPT Nov Dec Galactica GPT-Neo Jun GPT-NeoX Feb Flan-T5 Oct *only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1 UL2 Cohere Jurassic Claude 2023 Feb LLaMA Flan-UL2 March Alpaca GPT-4 �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� �� ��
  • 13. GPT-3 2021 Jun Oct PaLM Chinchilla OPT BLOOM Gopher 2022 Megatron TNLG Dec May Apr Jul Jul GPT-J Large Language Models since GPT3 ChatGPT Nov Dec Galactica GPT-Neo Jun GPT-NeoX Feb Flan-T5 Oct *only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1 UL2 Cohere Jurassic Claude 2023 Feb LLaMA Flan-UL2 March Alpaca GPT-4
  • 14. Open Access Large Language Models Research on policy, governance, AI safety and alignment Community efforts like Eleuther, Big Science, LAION Papers with several authors Open source ML has potential for huge impact
  • 15. Ecosystem as part of the ML workflow Collect data Train model Evaluate Deploy >23K datasets >143K models >70 metrics and measurements Spaces/ Gradio for demos
  • 16. ML Modeling Landscape There is an exponential growth of ML models.
  • 18. NLP Modeling Landscape Approx 40% of the task categories are NLP Covering 78% of the models
  • 19. NLP Modeling Landscape Including multimodal – 55% task categories
  • 20. NLP Modeling Landscape Including multimodal – 55% task categories Including speech – 72% task categories Coverage – 90% of models
  • 21. NLP Modeling Landscape Distribution by language (based on 20% models reporting)
  • 22. Model Usage Top 0.2% models (N=124) makeup >80% HF model usage
  • 23. Model Usage Top 0.2% models (N=124) makeup >80% HF model usage 98% of these models are trained on just text data
  • 24. Model Usage Top 0.2% models (N=124) makeup >80% HF model usage 98% of these models are trained on just text data Of these – 65% were created before 2021 33% were created in 2021 2% were created in 2022
  • 25. Model Age vs. Usage Relation between model age and its usage
  • 26. Model Age vs. Usage Relation between model age and its usage
  • 27. Model Age vs. Usage Relation between model age and its usage These models served as research artifacts for the later generation of models
  • 28. Model Age vs. Usage Relation between model age and its usage
  • 29. Model Age vs. Usage Factors: 1. Compute is becoming cheaper making model training more accessible 2. As more models are created, their usage is distributed 3. Models are being replaced by their efficient counterparts (ex: BERT → DistilBERT)
  • 30. Trend Width Step 1: Find all peaks in a signal Step 2: Measure peak widths at base Step 3: Take the max width
  • 31. Model Usage Trends Usage trend width for top models https://huggingface.co/spaces/nazneen/model-usage bert-base-uncased
  • 32. Model Usage Trends Usage trend width for top models https://huggingface.co/spaces/nazneen/model-usage bert-base-uncased sentence-transformers/paraphrase- xlm-r-multilingual-v1
  • 33. Model Usage Trends Usage trend width for top models https://huggingface.co/spaces/nazneen/model-usage bert-base-uncased sentence-transformers/paraphrase-xlm -r-multilingual-v1 HateSpeech-CNERG/indic-abusive-allIn One-MuRIL
  • 38. Model Usage Trends Average trend widths of models in 90th percentile of usage: Created before 2021 → 60 weeks Created in 2021 → 45 weeks Created in 2022 → 24 weeks
  • 39. Model Usage What other factors might affect model usage? - What does the model do? - How does it perform? - What was it trained on? - Is it easy to use? - What are its limitations?
  • 40. Model Usage Model documentation! What other factors might affect model usage? - What does the model do? - How good is the model? - What was it trained on? - Is it easy to use? - What are its limitations?
  • 41. Model Documentation Collect data Train model Evaluate Deploy ✔ Dataset ✔ How to use ✔ Intended uses ✔ Evaluation ✔ Limitations ✔ Training ✔ Environmental impact
  • 43. Model Documentation Landscape Robustness Report (Goel*, Rajani*, et al., NAACL 2021) Model Card (Mitchell et al., 2019) Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022) Method Card (Adkins et al., 2022)
  • 44. Model Documentation Landscape Robustness Report (Goel*, Rajani*, et al., NAACL 2021) Model Card (Mitchell et al., 2019) Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022) Method Card (Adkins et al., 2022)
  • 45. Model Documentation Landscape Robustness Report (Goel*, Rajani*, et al., NAACL 2021) Model Card (Mitchell et al., 2019) Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022) Method Card (Adkins et al., 2022)
  • 46. Model Documentation Landscape Robustness Report (Goel*, Rajani*, et al., NAACL 2021) Model Card (Mitchell et al., 2019) Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022) Method Card (Adkins et al., 2022)
  • 47. Model Documentation in Model documentation is part of the repo’s README ��
  • 52. Model documentation statistics Newer models are less likely to have model cards
  • 53. Model Documentation vs. Usage Observation: Only 50% models have model cards but contribute 98% of total usage
  • 54. Model Documentation vs. Usage Observation: Only 50% models have model cards but contribute 98% of total usage Goal: Study the relation between model usage and documentation
  • 55. Model Documentation vs. Usage Observation: Only 50% models have model cards but contribute 98% of total usage Goal: Study the relation between model usage and documentation Hypothesis: Model documentation drives model usage
  • 56. Model Documentation RCT Observation: Only 50% models have model cards but contribute 98% of total usage Goal: Study the relation between model usage and documentation Hypothesis: Model documentation drives model usage Randomized Control Trial (RCT) for models:
  • 57. Model Documentation RCT Observation: Only 50% models have model cards but contribute 98% of total usage Goal: Study the relation between model usage and documentation Hypothesis: Model documentation drives model usage Randomized Control Trial (RCT) for models: Model population
  • 58. Model Documentation RCT Observation: Only 50% models have model cards but contribute 98% of total usage Goal: Study the relation between model usage and documentation Hypothesis: Model documentation drives model usage Randomized Control Trial (RCT) for models: Model population Control group Treatment group
  • 59. Model Documentation RCT Model population Control group Treatment group Documentation Observation: Only 50% models have model cards but contribute 98% of total usage Goal: Study the relation between model usage and documentation Hypothesis: Model documentation drives model usage Randomized Control Trial (RCT) for models:
  • 60. Model Documentation RCT Model population Control group Treatment group Documentation Compare usage Observation: Only 50% models have model cards but contribute 98% of total usage Goal: Study the relation between model usage and documentation Hypothesis: Model documentation drives model usage Randomized Control Trial (RCT) for models:
  • 61. Randomized Control Trial Process Treatment group
  • 62. Randomized Control Trial Process 󰠁󰠁 Treatment group Documentation
  • 63. Randomized Control Trial Process Treatment group Documentation 󰠁󰠁
  • 64. Randomized Control Trial Process Treatment group Documentation Submit Pull Requests 󰠁 󰠁󰠁
  • 65. Randomized Control Trial Process Treatment group Documentation Submit Pull Requests 󰠁 Documentation is part of model repo 󰠁󰠁
  • 66. Randomized Control Trial Process Treatment group Documentation Submit Pull Requests 󰠁 Documentation is part of model repo 1 week } 󰠁󰠁
  • 67. RCT Results Red line indicates week when treatment was administered
  • 68. RCT Results Red line indicates week when treatment was administered
  • 69. Model Documentation RCT Findings 1. Increased usage of models in treatment group compared to control group 2. More prominent for model weights downloads 3. Model documentation drives model usage
  • 70. What do developers document about models? Distribution of sections in model cards
  • 71. What do developers document about models? Distribution of sections in model cards
  • 72. Outline Part 1: NLP Modeling landscape Systematic study of 75,000 models on HF Part 2: NLP Evaluation landscape Challenges and opportunities in model evaluation and documentation Part 3: Opensource alternative to ChatGPT Evaluating a Chatbot
  • 73. NLP Evaluation Landscape Slew of work on evaluation in NLP
  • 74. NLP Evaluation Landscape Slew of work on evaluation in NLP Tools
  • 75. NLP Evaluation Landscape Slew of work on evaluation in NLP Papers
  • 76. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
  • 77. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data Example: short reviews (< 50 words) in the IMDB sentiment dataset Tools: Snorkel (Ratner et al., 2017), Errudite (Wu et al., 2019)
  • 78. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data 2. Transformations – natural perturbations to original evaluation instances
  • 79. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data 2. Transformations – natural perturbations to original evaluation instances Example: substitute words with their synonyms in the IMDB dataset Tools: NLPAug (Ma, 2019)
  • 80. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data 2. Transformations – natural perturbations to original evaluation instances 3. Evaluation sets – evaluation on diagnostic sets
  • 81. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data 2. Transformations – natural perturbations to original evaluation instances 3. Evaluation sets – evaluation on diagnostic sets Example: write new movie reviews in the style of a newspaper columnist Tools: CheckList (Ribeiro et al., 2020)
  • 82. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data 2. Transformations – natural perturbations to original evaluation instances 3. Evaluation sets – evaluation on diagnostic sets 4. Attacks – adversarial evaluation
  • 83. NLP Evaluation Idioms 1. Subpopulations – disaggregate evaluation on slice or subpopulation of data 2. Transformations – natural perturbations to original evaluation instances 3. Evaluation sets – evaluation on diagnostic sets 4. Attacks – adversarial evaluation Example: add “aabbccaa” to reviews because it makes the model predict positive sentiment Tools: TextAttack (Morris et al., 2020), OpenAttack (Zeng et al., 2020)
  • 84. NLP Evaluation Landscape Slew of work on evaluation in NLP -- tools and research papers
  • 85. Goldilocks spectrum for Model Evaluation Aggregate evaluations Adversarial attacks Subpopulations/ Disaggregate evaluations Distribution shift Transformations/ Natural perturbations Diagnostic sets
  • 88. Challenges with evaluation Idiomatic Lock-In Workflow Fragmentation Tool A Tool B Subpopulations Transformations Attacks Evaluation sets Scattered evaluation Difficulty reporting Challenges Today
  • 89. Challenges with evaluation Idiomatic Lock-In Workflow Fragmentation Tool A Tool B Subpopulations Transformations Attacks Evaluation sets Scattered evaluation Difficulty reporting Challenges Today
  • 90. Challenges with evaluation Idiomatic Lock-In Workflow Fragmentation Tool A Tool B Subpopulations Transformations Attacks Evaluation sets Scattered evaluation Difficulty reporting Challenges Today
  • 91. Robustness Gym Idiomatic Lock-In Workflow Fragmentation Tool A Tool B Subpopulations Transformations Attacks Evaluation sets Scattered evaluation Difficulty reporting Challenges Today Entire Evaluation Spectrum Consolidate Findings Subpopulations Transformations Attacks Evaluation sets Testbenches Robustness Reports Robustness Gym (Goel*, Rajani*, et al., NAACL 2021)
  • 92. Robustness Gym Idiomatic Lock-In Workflow Fragmentation Tool A Tool B Subpopulations Transformations Attacks Evaluation sets Scattered evaluation Difficulty reporting Challenges Today Entire Evaluation Spectrum Consolidate Findings Subpopulations Transformations Diagnostic sets Attacks Testbenches Robustness Reports Robustness Gym (Goel*, Rajani*, et al., NAACL 2021)
  • 93. Robustness Gym Idiomatic Lock-In Workflow Fragmentation Tool A Tool B Subpopulations Transformations Attacks Evaluation sets Scattered evaluation Difficulty reporting Challenges Today Entire Evaluation Spectrum Consolidate Findings Subpopulations Transformations Attacks Evaluation sets Testbenches Robustness Reports Robustness Gym (Goel*, Rajani*, et al., NAACL 2021)
  • 99. Evaluate a model to generate a report Robustness Gym Workflow
  • 100. Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
  • 101. Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
  • 102. Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
  • 103. Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
  • 104. Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
  • 105. Robustness Report for Natural Language Inference using bert-base-uncased on SNLI
  • 106. Named Entity Linking map “strings” to “things” in a knowledge base like Wikipedia Experiments with Commercial APIs for Named Entity Linking When did England last win the football world cup?
  • 107. Named Entity Linking map “strings” to “things” in a knowledge base like Wikipedia Experiments with Commercial APIs for Named Entity Linking When did England last win the football world cup? FIFA World Cup England National Football Team When did England last win the football world cup?
  • 108. Named Entity Linking map “strings” to “things” in a knowledge base like Wikipedia Experiments with Commercial APIs for Named Entity Linking When did England last win the football world cup? FIFA World Cup England National Football Team When did England last win the football world cup? Downstream System Question Answering System
  • 109. Named Entity Linking map “strings” to “things” in a knowledge base like Wikipedia Experiments with Commercial APIs for Named Entity Linking Downstream System FIFA World Cup England National Football Team Question Answering System When did England last win the football world cup? 1966 A correct NEL is required for the downstream system!
  • 110. Experiments with Commercial APIs for Named Entity Linking Robustness Report for NEL on AIDA-b dataset
  • 111. Experiments with Commercial APIs for Named Entity Linking Robustness Report for NEL on AIDA-b dataset Popularity heuristic outperforms all commercial systems
  • 112. Experiments with Commercial APIs for Named Entity Linking Robustness Report for NEL on AIDA-b dataset Commercial APIs are not any more robust than popularity heuristic
  • 113. Experiments with Commercial APIs for Named Entity Linking Robustness Report for NEL on AIDA-b dataset Commercial systems are capitalization sensitive
  • 114. Experiments with Commercial APIs for Named Entity Linking Robustness Report for NEL on AIDA-b dataset Type of Systematic Error!
  • 115. Systematic Error Analysis and Labeling (SEAL) Evaluation is a creative process Systematic errors are difficult to detect: - High dimension of the learned representations - Extracting and labeling semantics in the error group requires human-in-the-loop Interactive tool to identify and label candidate data slices with high systematic errors (Rajani et al, EMNLP ‘22 demo)
  • 116. Systematic Error Analysis and Labeling (SEAL) 1. Embed Identify candidate groups with high systematic errors (Rajani et al, EMNLP ‘22 demo)
  • 117. Systematic Error Analysis and Labeling (SEAL) Identify candidate groups with high systematic errors 2. Cluster (Rajani et al, EMNLP ‘22 demo)
  • 118. Systematic Error Analysis and Labeling (SEAL) Generate semantic labels using LLMs books music worst book/album reviews products that work with both Windows and Mac Gym equipment 3. Semantic Labeling (Rajani et al, EMNLP ‘22 demo)
  • 119. Systematic Error Analysis and Labeling (SEAL) https://huggingface.co/spaces/nazneen/seal
  • 120. Systematic Error Analysis and Labeling (SEAL) https://huggingface.co/spaces/nazneen/seal
  • 121. Systematic Error Analysis and Labeling (SEAL) https://huggingface.co/spaces/nazneen/seal
  • 122. Systematic Error Analysis and Labeling (SEAL) https://huggingface.co/spaces/nazneen/seal
  • 123. Systematic Error Analysis and Labeling (SEAL) https://huggingface.co/spaces/nazneen/seal
  • 125. SEAL Experimental Results SEAL identified data groups where the model performance drops between 5% to 28%
  • 126. Takeaways 1. Open-sourcing ML research artifacts is becoming the norm
  • 127. Takeaways 1. Open-sourcing ML research artifacts is now the default 2. The most popular Hugging Face models are those that are older and well-documented
  • 128. Takeaways 1. Open-sourcing ML research artifacts is becoming the norm 2. The most popular Hugging Face models are those that are older and well-documented 3. Model evaluation can be actionable – RG toolkit supports this goal via fine-grained evaluation
  • 129. Takeaways 1. Open-sourcing ML research artifacts is becoming the norm 2. The most popular Hugging Face models are those that are older and well-documented 3. Model evaluation can be actionable – RG toolkit supports this goal via fine-grained evaluation 4. LLMs can help label systematic errors in models in a human interpretable way
  • 130. Outline Part 1: NLP Modeling landscape Systematic study of 75,000 models on HF Part 2: NLP Evaluation landscape Challenges and opportunities in model evaluation and documentation Part 3: Opensource alternative to ChatGPT Evaluating a Chatbot
  • 131. Current Research Focus ● Open-source alternative to ChatGPT ● Follow what we are building https://huggingface.co/HuggingFaceH4 ● Evaluating a Chatbot
  • 133. Training a Chatbot 1. Pretraining the LM a. Predicting the next token b. Eg: GPT-3, BLOOM 2. Incontext learning (aka prompt-based learning) a. Few shot learning without updating the parameters b. Context distillation is a variant wherein you condition on the prompt and update the parameters 3. Supervised fine-tuning a. Fine-tuning for instruction following and to make them chatty b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I, Alpaca 4. Reinforcement Learning from Human Feedback a. safety/alignment b. nudging the LM towards values you desire
  • 134. Training a Chatbot 1. Pretraining the LM a. Predicting the next token b. Eg: GPT-3, BLOOM 2. Incontext learning (aka prompt-based learning) a. Few shot learning without updating the parameters b. Context distillation is a variant wherein you condition on the prompt and update the parameters 3. Supervised fine-tuning a. Fine-tuning for instruction following and to make them chatty b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I 4. Reinforcement Learning from Human Feedback a. safety/alignment b. nudging the LM towards values you desire
  • 135. Evaluating a Chatbot Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
  • 136. Training a Chatbot Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022). Supervised Fine-tuning
  • 137. Training a Chatbot Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022). Reinforcement learning with human feedback (RLHF)
  • 138. Evaluating a Chatbot Evaluating instruction following/chatty-ness
  • 139. Evaluating a Chatbot ● Step 1: Evaluating instruction following. Does the model generate useful responses on the topic? Are they open-ended? ○ Eg: Brainstorm a list of New Year’s resolutions
  • 141. Evaluating a Chatbot ● Step 1: Evaluating instruction following. Does the model generate useful responses on the topic? Are they open-ended? ○ Eg: Brainstorm a list of New Year’s resolutions ● Step 2: Evaluating the RM. Can the model choose between a truthful and a untruthful response? Can it rank harmful responses lower than the harmless responses?
  • 143. Evaluating a Chatbot ● Step 1: Evaluating instruction following. Does the model generate useful responses on the topic? Are they open-ended? ○ Eg: Brainstorm a list of New Year’s resolutions ● Step 2: Evaluating the RM. Can the model choose between a truthful and a untruthful response? Can it rank harmful responses lower than the harmless responses? ● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities and emerging capabilities. ○ Eg: Complete the sentence, “You should just go kill”
  • 144. Evaluating a Chatbot Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022). Evaluating instruction following/chatty-ness Evaluating the RM Red-teaming
  • 145. Evaluating a Chatbot ● Step 1: Evaluating instruction following. Does the model generate useful responses on the topic? Are they open-ended? ○ Eg: Brainstorm a list of New Year’s resolutions ● Step 2: Evaluating the RM. Can the model choose between a truthful and a untruthful response? Can it rank harmful responses lower than the harmless responses? ● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities and emerging capabilities. ○ Eg: Complete the sentence, “You should just go kill”
  • 146. Red-Teaming Evaluating LLMs for: 1. Model vulnerabilities 2. Emerging capabilities that they are not explicitly trained for
  • 148. Red-Teaming 2. Emerging Capabilities - Power-seeking behavior (eg: resources) - Persuading people to do harm (on themselves or others) - Having agency with physical outcomes (eg: ordering chemicals online via an API) These are considered critical threat scenarios
  • 149. Red-Teaming Similarities with adversarial attacks: - Goal is to “attack” or “manipulate” the model to generate harmful content - Actionable: used to fine-tune the model to steer it away to generate friendly output
  • 150. Red-Teaming Differences with adversarial attacks: - Human interpretable and look like regular prompt. Eg: prefixing “aaabbcc” is adversarial but not red-teaming.
  • 151. Red-Teaming Differences with adversarial attacks: - Human interpretable and look like regular prompt. Eg: prefixing “aaabbcc” is adversarial but not red-teaming. *Warning: offensive text below* Wallace, et al. "Universal Adversarial Triggers for Attacking and Analyzing NLP" (2021).
  • 152. Red-Teaming Methods Roleplay attacks wherein the LLM is instructed to behave as a malicious character Instructing the model to respond in code instead of natural language Instructing a model to reveal sensitive information such as PII.
  • 155. Takeaways from Red-Teaming 1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs. 2. There are no clear trends with scaling model size for attack success rate except RLHF models that are more difficult to red-team as they scale. 3. Models may learn to be harmless by being evasive, there is tradeoff between helpfulness and harmlessness. 4. The distribution of the success rate varies across categories of harm with non-violent ones having a higher success rate.
  • 156. Open problems with Red-Teaming 1. There is no open-source red-teaming dataset for code generation that attempts to jailbreak a model via code. Eg: generating a program that implements a DDOS or backdoor attack. 2. Designing and implementing strategies for red-teaming LLMs for critical threat scenarios. 3. Evaluating the tradeoffs between evasiveness and helpfulness.
  • 157. Further Reading Red-Teaming https://huggingface.co/blog/red-teaming RLHF https://huggingface.co/blog/rlhf Dialog agents https://huggingface.co/blog/dialog-agents
  • 158. RLHF Team Nathan Lambert Lewis Tunstall Thomas Wolf And more at Hugging Face and the community! Leandro von Werra Younes Belkada Edward Beeching
  • 159. Collaborators Systematic study of HF models and SEAL Robustness Gym James Zou (Stanford) Weixin Liang (Stanford) Karan Goel (Stanford) Jesse Vig (Salesforce) Chris Re (Stanford) Mohit Bansal (UNC) Xinyu Yang (ZJU) Meg Mitchell (Hugging Face)