The document discusses advances and challenges in model evaluation and summarizes a presentation on this topic. It provides an overview of the growing landscape of natural language processing (NLP) models, including their usage trends over time. There is a lack of documentation for most models, with only 50% having model cards despite contributing 98% of usage. The presentation proposes a randomized controlled trial to study whether improving model documentation could increase usage by adding documentation to a treatment group of models and comparing their usage to an undocumented control group. The goal is to provide more transparency and drive better model communication and reproducibility.
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...David Talby
An April 2023 presentation to the AMIA working group on natural language processing. The talk focuses on three current trends in NLP and how they apply in healthcare: Large language models, No-code, and Responsible AI.
OpenAI’s GPT 3 Language Model - guest Steve OmohundroNumenta
In this research meeting, guest Stephen Omohundro gave a fascinating talk on GPT-3, the new massive OpenAI Natural Language Processing model. He reviewed the network architecture, training process, and results in the context of past work. There was extensive discussion on the implications for NLP and for Machine Intelligence / AGI.
Link to GPT-3 paper: https://arxiv.org/abs/2005.14165
Link to YouTube recording of Steve's talk: https://youtu.be/0ZVOmBp29E0
This document provides information about a bootcamp to build applications using Large Language Models (LLMs). The bootcamp consists of 11 modules covering topics such as introduction to generative AI, text analytics techniques, neural network models for natural language processing, transformer models, embedding retrieval, semantic search, prompt engineering, fine-tuning LLMs, orchestration frameworks, the LangChain application platform, and a final project to build a custom LLM application. The bootcamp will be held in various locations and dates between September 2023 and January 2024.
A brief introduction to generative models in general is given, followed by a succinct discussion about text generation models and the "Transformer" architecture. Finally, the focus is set on a non-technical discussion about ChatGPT with a selection of recent news articles.
A non-technical overview of Large Language Models, exploring their potential, limitations, and customization for specific challenges. While this deck is tailored for an audience from the financial industry in mind, its content remains broadly applicable.
(This updated version builds on our previous deck: slideshare.net/LoicMerckel/intro-to-llms.)
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
The document presents a review of large language models (LLMs) for code generation. It discusses different types of LLMs including left-to-right, masked, and encoder-decoder models. Existing models for code generation like Codex, GPT-Neo, GPT-J, and CodeParrot are compared. A new model called PolyCoder with 2.7 billion parameters trained on 12 programming languages is introduced. Evaluation results show PolyCoder performs less well than comparably sized models but outperforms others on C language tasks. In general, performance improves with larger models and longer training, but training solely on code can be sufficient or advantageous for some languages.
Build an LLM-powered application using LangChain.pdfAnastasiaSteele10
LangChain is an advanced framework that allows developers to create language model-powered applications. It provides a set of tools, components, and interfaces that make building LLM-based applications easier. With LangChain, managing interactions with language models, chaining together various components, and integrating resources like APIs and databases is a breeze. The platform includes a set of APIs that can be integrated into applications, allowing developers to add language processing capabilities without having to start from scratch.
The document provides an overview of transformers, large language models (LLMs), and artificial general intelligence (AGI). It discusses the architecture and applications of transformers in natural language processing. It describes how LLMs have evolved from earlier statistical models and now perform state-of-the-art results on NLP tasks through pre-training and fine-tuning. The document outlines the capabilities of GPT-3, the largest LLM to date, as well as its limitations and ethical concerns. It introduces AGI and the potential for such systems to revolutionize AI, while also noting the technical, ethical and societal challenges to developing AGI.
And then there were ... Large Language ModelsLeon Dohmen
It is not often even in the ICT world that one witnesses a revolution. The rise of the Personal Computer, the rise of mobile telephony and, of course, the rise of the Internet are some of those revolutions. So what is ChatGPT really? Is ChatGPT also such a revolution? And like any revolution, does ChatGPT have its winners and losers? And who are they? How do we ensure that ChatGPT contributes to a positive impulse for "Smart Humanity?".
During a key note om April 3 and 13 2023 Piek Vossen explained the impact of Large Language Models like ChatGPT.
Prof. PhD. Piek Th.J.M. Vossen, is Full professor of Computational Lexicology at the Faculty of Humanities, Department of Language, Literature and Communication (LCC) at VU Amsterdam:
What is ChatGPT? What technology and thought processes underlie it? What are its consequences? What choices are being made? In the presentation, Piek will elaborate on the basic principles behind Large Language Models and how they are used as a basis for Deep Learning in which they are fine-tuned for specific tasks. He will also discuss a specific variant GPT that underlies ChatGPT. It covers what ChatGPT can and cannot do, what it is good for and what the risks are.
An introduction to the Transformers architecture and BERTSuman Debnath
The transformer is one of the most popular state-of-the-art deep (SOTA) learning architectures that is mostly used for natural language processing (NLP) tasks. Ever since the advent of the transformer, it has replaced RNN and LSTM for various tasks. The transformer also created a major breakthrough in the field of NLP and also paved the way for new revolutionary architectures such as BERT.
LangChain Intro, Keymate.AI Search Plugin for ChatGPT, How to use langchain library? How to implement similar functionality in programming language of your choice? Example LangChain applications.
The presentation revolves around the concept of "langChain", This innovative framework is designed to "chain" together different components to create more advanced use cases around Large Language Models (LLMs). The idea is to leverage the power of LLMs to tackle complex problems and generate solutions that are more than the sum of their parts.
One of the key features of the presentation is the application of the "Keymate.AI Search" plugin in conjunction with the Reasoning and Acting Chain of Thought (ReAct) framework. The presenter encourages the audience to utilize these tools to generate reasoning traces and actions. The ReAct framework, learned from an initial search, is then applied to these traces and actions, demonstrating the potential of LLMs to learn and apply complex frameworks.
The presentation also delves into the impact of climate change on biodiversity. The presenter prompts the audience to look up the latest research on this topic and summarize the key findings. This exercise not only highlights the importance of climate change but also demonstrates the capabilities of LLMs in researching and summarizing complex topics.
The presentation concludes with several key takeaways. The presenter emphasizes that specialized custom solutions work best and suggests a bottom-up approach to expert systems. However, they caution that over-abstraction can lead to leakages, causing time and money limits to hit early and tasks to fail or require many iterations. The presenter also notes that while prompt engineering is important, it's not necessary to over-optimize if the LLM is clever. The presentation ends on a hopeful note, expressing a need for more clever LLMs and acknowledging that good applications are rare but achievable.
Overall, the presentation provides a comprehensive overview of the LanGCHAIN framework, its applications, and the potential of LLMs in solving complex problems. It serves as a call to action for the audience to explore these tools and frameworks.
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Mihai Criveti
Mihai is the Principal Architect for Platform Engineering and Technology Solutions at IBM, responsible for Cloud Native and AI Solutions. He is a Red Hat Certified Architect, CKA/CKS, a leader in the IBM Open Innovation community, and advocate for open source development. Mihai is driving the development of Retrieval Augmentation Generation platforms, and solutions for Generative AI at IBM that leverage WatsonX, Vector databases, LangChain, HuggingFace and open source AI models.
Mihai will share lessons learned building Retrieval Augmented Generation, or “Chat with Documents” platforms and APIs that scale, and deploy on Kubernetes. His talk will cover use cases for Generative AI, limitations of Large Language Models, use of RAG, Vector Databases and Fine Tuning to overcome model limitations and build solutions that connect to your data and provide content grounding, limit hallucinations and form the basis of explainable AI. In terms of technology, he will cover LLAMA2, HuggingFace TGIS, SentenceTransformers embedding models using Python, LangChain, and Weaviate and ChromaDB vector databases. He’ll also share tips on writing code using LLM, including building an agent for Ansible and containers.
Scaling factors for Large Language Model Architectures:
• Vector Database: consider sharding and High Availability
• Fine Tuning: collecting data to be used for fine tuning
• Governance and Model Benchmarking: how are you testing your model performance
over time, with different prompts, one-shot, and various parameters
• Chain of Reasoning and Agents
• Caching embeddings and responses
• Personalization and Conversational Memory Database
• Streaming Responses and optimizing performance. A fine tuned 13B model may
perform better than a poor 70B one!
• Calling 3rd party functions or APIs for reasoning or other type of data (ex: LLMs are
terrible at reasoning and prediction, consider calling other models)
• Fallback techniques: fallback to a different model, or default answers
• API scaling techniques, rate limiting, etc.
• Async, streaming and parallelization, multiprocessing, GPU acceleration (including
embeddings), generating your API using OpenAPI, etc.
This document discusses generative AI and its potential transformations and use cases. It outlines how generative AI could enable more low-cost experimentation, blur division boundaries, and allow "talking to data" for innovation and operational excellence. The document also references responsible AI frameworks and a pattern catalogue for developing foundation model-based systems. Potential use cases discussed include automated reporting, digital twins, data integration, operation planning, communication, and innovation applications like surrogate models and cross-discipline synthesis.
Reinventing Deep Learning with Hugging Face TransformersJulien SIMON
The document discusses how transformers have become a general-purpose architecture for machine learning, with various transformer models like BERT and GPT-3 seeing widespread adoption. It introduces Hugging Face as a company working to make transformers more accessible through tools and libraries. Hugging Face has seen rapid growth, with its hub hosting over 73,000 models and 10,000 datasets that are downloaded over 1 million times daily. The document outlines Hugging Face's vision of facilitating the entire machine learning process from data to production through tools that support tasks like transfer learning, hardware acceleration, and collaborative model development.
Natural language processing and transformer modelsDing Li
The document discusses several approaches for text classification using machine learning algorithms:
1. Count the frequency of individual words in tweets and sum for each tweet to create feature vectors for classification models like regression. However, this loses some word context information.
2. Use Bayes' rule and calculate word probabilities conditioned on class to perform naive Bayes classification. Laplacian smoothing is used to handle zero probabilities.
3. Incorporate word n-grams and context by calculating word probabilities within n-gram contexts rather than independently. This captures more linguistic information than the first two approaches.
AI and ML Series - Introduction to Generative AI and LLMs - Session 1DianaGray10
Session 1
👉This first session will cover an introduction to Generative AI & harnessing the power of large language models. The following topics will be discussed:
Introduction to Generative AI & harnessing the power of large language models.
What’s generative AI & what’s LLM.
How are we using it in our document understanding & communication mining models?
How to develop a trustworthy and unbiased AI model using LLM & GenAI.
Personal Intelligent Assistant
Speakers:
📌George Roth - AI Evangelist at UiPath
📌Sharon Palawandram - Senior Machine Learning Consultant @ Ashling Partners & UiPath MVP
📌Russel Alfeche - Technology Leader RPA @qBotica & UiPath MVP
This talk overviews my background as a female data scientist, introduces many types of generative AI, discusses potential use cases, highlights the need for representation in generative AI, and showcases a few tools that currently exist.
Building NLP applications with TransformersJulien SIMON
The document discusses how transformer models and transfer learning (Deep Learning 2.0) have improved natural language processing by allowing researchers to easily apply pre-trained models to new tasks with limited data. It presents examples of how HuggingFace has used transformer models for tasks like translation and part-of-speech tagging. The document also discusses tools from HuggingFace that make it easier to train models on hardware accelerators and deploy them to production.
In this session, you'll get all the answers about how ChatGPT and other GPT-X models can be applied to your current or future project. First, we'll put in order all the terms – OpenAI, GPT-3, ChatGPT, Codex, Dall-E, etc., and explain why Microsoft and Azure are often mentioned in this context. Then, we'll go through the main capabilities of the Azure OpenAI and respective usecases that might inspire you to either optimize your product or build a completely new one.
This document summarizes developments in natural language processing (NLP) in 2020. It discusses large language models like GPT-3, the increasing sizes of transformer-based models, issues with large models, multilingual models, more efficient transformer architectures, benchmarks for evaluating NLP systems, conversational agents, and APIs and cloud services for NLP.
This document summarizes a research paper that aims to analyze the stance (for, against, neutral) of public opinions expressed on Twitter regarding the farmers' protests in India. The researchers gathered Twitter data on the topic and used a deep learning model called ULMFiT to classify the stances. ULMFiT first pre-trains on general domain text, then fine-tunes on the Twitter data to achieve a classification F1 score of 0.67 for the three stances. The goal is to understand public opinion and how it may have influenced the government's decision to repeal certain farm laws.
The document discusses efforts to harmonize metadata application profiles for agricultural learning repositories. It describes the Agricultural Learning Repositories Task Force initiative which aims to connect stakeholders and promote sharing of learning resources. The Task Force has undertaken various activities including building a community, creating an inventory of repositories, publishing best practice recommendations, and demonstrating federated searching across repositories. An evaluation of existing application profiles resulted in guidelines to help standardize metadata and ensure interoperability.
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...Dr. Haxel Consult
Word embeddings, deep learning, transformer models and other pre-trained neural language models (sometimes recently referred to as "foundational models") have fundamentally changed the way state-of-the-art systems for natural language processing and information access are built today. The "Data-to-Value" process methodology (Leidner 2013; Leidner 2022a,b) has been devised to embody best practices for the construction of natural language engineering solutions; it can assist practitioners and has also been used to transfer industrial insights into the university classroom. This talk recaps how the methodology supports engineers in building systems more consistently and then outlines the changes in the methodology to adapt it to the deep learning age. The cost and energy implications will also be discussed.
Machine Learning and Deep Learning from Foundations to Applications Excel, R,...Narendra Ashar
Preparing stakeholders across the organization in Advanced Machine learning, Deep Learning, Algorithms, Machine Learning for Image Processing, Machine Learning for Text Processing, Deep Learning Applications.
Courses can be tailored for
Freshers in a corporate
Senior Executives
Marketing, Business Development and other staff. who want to get a simpler view on these newer and apparently complex topics.
A tremendous backlog of predictive modeling problems in the industry and short supply of trained data scientists have spiked interest in automation over the last few years. A new academic field, AutoML, has emerged. However, there is a significant gap between the topics that are academically interesting and automation capabilities that are necessary to solve real-world industrial problems end-to-end. An even greater challenge is enabling a non-expert to build a robust and trustworthy AI solution for their company. In this talk, we’ll discuss what an industry-grade AutoML system consists of and the scientific and engineering challenges of building it.
M2 l10 fairness, accountability, and transparencyBoPeng76
The document provides introductions for three lecturers:
- Adam Obeng is a research scientist studying experimental meta-analysis methods with a PhD in Sociology from Columbia University.
- Toby Farmer is a product manager at Facebook AI working on machine translation with a law degree and background in politics and tech entrepreneurship.
- The third section discusses fairness, accountability, transparency, and ethics (FAT*) in AI and provides an overview of why these issues are important and examples of problems that can arise.
Strategic Management – MGT 451: Final Exam
Your final exam’s deliverable is a written report addressing the question: How does Innovation contribute to create
Competitive Advantage? Students can rephrase this question and use it as your exam title.
To support your report, you need to include at least ten (10) relevant sources. Five (5) of them should come from the
Reference list distributed in class. To access key material, visit Marymount library (physically and/or online)
In your written report balance the opinions of scholars (quotes, citations, etc.), researchers (statistics, findings, etc.) with
your own analytical reflections (opinions, views, etc.). Also mentioned examples of companies that support your statements.
Blogs ARE NOT allowed to be cited unless they are written by a scholar or prominent business figure.
A. Essay Content and Structure:
The length of your exam should fluctuate between 9 to 12 pages. 5 pages correspond to content addressing the topic of
Innovation; the remaining pages should be used for cover, references, and annexes; see below.
# of
Pages
Section
1 Cover: Include your name, course name, school, university, professor name, and date.
1 Table of Contents: Consider the headings and page numbers included in your paper
5 – 5 ½ Body of the Report. Points a) to f) below must be addressed in your exam. In parenthesis, I include some illustrative
questions to guide your analysis; feel free to use those or other questions / ideas to produce your report.
a) Definition & Importance: What is Innovation? Why does Innovation matter? What is the relationship between
Innovation and Competitive Advantage? In this section, cite at least 3 relevant definitions of innovation (use the
provided Reference list, other articles, and/or textbooks) and based on those ideas provide your own definition of
Innovation.
b) Components: What are the key elements of Innovation? What are the distinctive characteristics of Innovation? Are
there different types of Innovation?
c) Key Issues: What challenges around Innovation does a firm typically face? What problems may arise when a
company decides to embrace an Innovation mind-set? In which ways does the lack of Innovation affect a firm’s
Competitive Advantage?
d) Process: What are the key steps (process) to maximize the results of Innovation and achieve sound business results?
What aspects cannot be forgotten? Are there best practices to further Innovation?
e) Culture: What values and/or behaviors do effectively create a culture of Innovation? How does organizational
culture support or limit Innovation? How can an Innovation culture be developed?
f) Lessons Learned: What have you learned about Innovation? How have your views on Innovation changed? How
can you develop your Innovation mind-set? What is the most surprising aspect you have found in your research?
1 - 2 References: Include.
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
This document proposes a methodology to automatically assign topics to unlabeled datasets using topic modeling techniques. It applies latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) with term frequency-inverse document frequency (TF-IDF) weighting to product reviews to generate topics. Word similarities are used to cluster words for each topic. Sentiment analysis and word clouds are also used to gain insights. The methodology successfully converts unlabeled to labeled data and provides automatic topic labeling to facilitate further research and opportunity discovery.
The document discusses a survey of experts on composite simulation. It finds that automotive and aerospace are dominant industries for application, with component failure/crash and material description as top areas. Material models are seen as the most significant technology. High R&D demands exist for material models, failure prediction, service life, and manufacturing processes. Researchers see above average needs for research in these areas.
Towards a harmonization of metadata application profiles for agricultural lea...Gauri Salokhe
Metadata interoperability allows the exchange and preservation of crucial learning and teaching information, as well as its future reuse among a large number of different systems and repositories. This paper introduces work around metadata interoperability that has taken place in the context of the Agricultural Learning Repositories Task Force (AgLR-TF), an international community of the stakeholders that are involved in agricultural learning repositories. It particularly focuses on a review and assessment of metadata application profiles that are currently implemented in agricultural learning repositories. The results of this study can be found useful by who are designing, implementing and operating agricultural learning repositories, facilitating thus metadata interoperability in this application field.
This document discusses using formal modeling techniques like openEHR to improve the maintainability of clinical software. It summarizes research modeling the Minimal Standard Terminology for Digestive Endoscopy (MST) using openEHR archetypes. Implementing change requests from a previous endoscopy application in both the original application and a new one based on openEHR models found the openEHR-based application was significantly easier to maintain. Formal modeling addresses issues with non-standard clinical language and supports semantic interoperability and multilingual requirements.
Book Recommendation System Using Deep Learning (GPT3)IRJET Journal
This document describes a book recommendation system that uses deep learning (GPT-3) to provide personalized book recommendations to users. The system takes in a book that a user enjoys and returns 3 similar book recommendations along with additional metadata about each book like descriptions, page counts, and preview links. It was created using Streamlit for the frontend interface and the OpenAI API to query GPT-3 for recommendations. When given a book, GPT-3 analyzes the content to find semantically similar books, then the system enriches the recommendations using the Google Books API. The results successfully provided related book suggestions with high accuracy ratings during testing. Some limitations are the cost of using GPT-3 and reliance on Google Books
This document provides an overview of lean thinking concepts through a presentation designed to teach others. It begins with learning objectives and an introduction to contrasting mass production and lean mindsets. Key concepts of lean explained include eliminating waste to create value, adopting a customer pull vs. producer push mindset, and the lengthy historical process through which Toyota developed its lean production system. Examples are provided of exercises used to illustrate lean concepts like the seven wastes and five S's. The presentation concludes by discussing potential disconnects that can arise in lean implementation efforts.
The final presentation file for my PhD Defense that took place on February 21st, 2014 in Alcala de Henares, Spain. For any questions or clarification please contact me at palavitsinis@gmail.com
Some Frameworks for Improving Analytic Operations at Your CompanyRobert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
This document summarizes literature on using bio-inspired algorithms to optimize fuzzy clustering. It describes the general architecture of how bio-inspired optimization algorithms can be applied to optimize parameters of fuzzy clustering algorithms and improve clustering quality. The document reviews several popular bio-inspired optimization algorithms and analyzes literature on optimization fuzzy clustering, identifying China, India, and the United States as the top publishing countries. Network analysis is applied to literature on the topic to identify clusters in the research.
Using Machine Learning to aid Journalism at the New York TimesVivian S. Zhang
This talk was presented to NYC Open Data Meetup Group on Nov 11, 2014.
Speaker:
Daeil Kim is currently a data scientist at the Times and is finishing up his Ph.D at Brown University on work related to developing scalable inference algorithms for Bayesian Nonparametric models. His work at the Times spans a variety of problems related to the company's business interests, audience development, as well as developing tools to aid journalism.
Topic:
This talk will focus mostly on how machine learning can help problems that prop up in journalism. We'll begin first by talking about using popular supervised learning algorithms such as regularized Logistic Regression to help assist a journalist's work in uncovering insights into a story regarding the recall of Takata airbags in cars. Afterwards, we'll think about using topic modeling to deal with large document dumps generated from FOIA (Freedom of Information Act) requests and Refinery, a simple web based tool to ease the implementation of such tasks. Finally, if there is time, we will go over how topic models have been extended to assist in the problem of designing an efficient recommendation engine for text-based content.
Applications of Data Science in Various IndustriesIABAC
The wide-ranging applications of data science across industries.
From healthcare to finance, data science drives innovation and efficiency by transforming raw data into actionable insights.
Learn how data science enhances decision-making, boosts productivity, and fosters new advancements in technology and business. Explore real-world examples of data science applications today.
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA MATKA RESULT KALYAN MATKA TIPS SATTA MATKA MATKA COM MATKA PANA JODI TODAY
LLM powered contract compliance application which uses Advanced RAG method Self-RAG and Knowledge Graph together for the first time.
It provides highest accuracy for contract compliance recorded so far for Oil and Gas Industry.
2. Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
3. Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
7. 🔓 Open Access Models
All model components are publicly available:
● Open source code
● Training data
○ Sources and their distribution
○ Data preprocessing and curation steps
● Model weights
● Paper or blog summarizing
○ Architecture and training details
○ Evaluation results
○ Adaptation to the model
■ Safety filters
■ Training with human feedback
8. 🔓 Open Access Models
Allows reproducing results and replicating parts of the model
Enable auditing and conducting risk analysis
Serves as a research artifact
Enables interpreting model output
9. 🔒 Closed Access Models
Only research paper or blog is available and may include overview of
● Training data
● Architecture and training details (including infrastructure)
● Evaluation results
● Adaptation to the model
○ Safety filters
○ Training with human feedback
10. 🔒 Closed Access Models
Safety concerns
Competitive advantage
Expensive to setup guardrails for safe access
12. GPT-3
2021
Jun Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
Large Language Models since GPT3
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic Claude
2023
Feb
LLaMA
Flan-UL2
March
Alpaca
GPT-4
�� �� ��
��
�� ��
�� ��
�� ��
��
�� ��
��
�� ��
�� ��
�� ��
��
14. Open Access Large Language Models
Research on policy, governance, AI safety and alignment
Community efforts like Eleuther, Big Science, LAION
Papers with several authors
Open source ML has potential for huge impact
15. Ecosystem as part of the ML workflow
Collect data Train model Evaluate Deploy
>23K datasets >143K models
>70 metrics and
measurements
Spaces/ Gradio for
demos
23. Model Usage
Top 0.2% models (N=124) makeup >80% HF model
usage
98% of these models are trained on just text data
24. Model Usage
Top 0.2% models (N=124) makeup >80% HF model
usage
98% of these models are trained on just text data
Of these –
65% were created before 2021
33% were created in 2021
2% were created in 2022
25. Model Age vs. Usage
Relation between model age and its usage
26. Model Age vs. Usage
Relation between model age and its usage
27. Model Age vs. Usage
Relation between model age and its usage
These models served as research artifacts for the later generation of models
28. Model Age vs. Usage
Relation between model age and its usage
29. Model Age vs. Usage
Factors:
1. Compute is becoming cheaper making model training more accessible
2. As more models are created, their usage is distributed
3. Models are being replaced by their efficient counterparts (ex: BERT →
DistilBERT)
30. Trend Width
Step 1: Find all peaks in a signal
Step 2: Measure peak widths at base
Step 3: Take the max width
31. Model Usage Trends
Usage trend width for top models
https://huggingface.co/spaces/nazneen/model-usage
bert-base-uncased
32. Model Usage Trends
Usage trend width for top models
https://huggingface.co/spaces/nazneen/model-usage
bert-base-uncased
sentence-transformers/paraphrase-
xlm-r-multilingual-v1
33. Model Usage Trends
Usage trend width for top models
https://huggingface.co/spaces/nazneen/model-usage
bert-base-uncased
sentence-transformers/paraphrase-xlm
-r-multilingual-v1
HateSpeech-CNERG/indic-abusive-allIn
One-MuRIL
38. Model Usage Trends
Average trend widths of models in 90th percentile of usage:
Created before 2021 → 60 weeks
Created in 2021 → 45 weeks
Created in 2022 → 24 weeks
39. Model Usage
What other factors might affect model usage?
- What does the model do?
- How does it perform?
- What was it trained on?
- Is it easy to use?
- What are its limitations?
40. Model Usage
Model
documentation!
What other factors might affect model usage?
- What does the model do?
- How good is the model?
- What was it trained on?
- Is it easy to use?
- What are its limitations?
41. Model Documentation
Collect data Train model Evaluate Deploy
✔ Dataset ✔ How to use
✔ Intended
uses
✔ Evaluation
✔ Limitations
✔ Training
✔ Environmental impact
43. Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
44. Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
45. Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
46. Model Documentation Landscape
Robustness Report (Goel*, Rajani*, et al., NAACL 2021)
Model Card (Mitchell et al., 2019)
Interactive Model Cards (Crisan, Vig,Drouhard, and Rajani, FAccT2022)
Method Card (Adkins et al., 2022)
53. Model Documentation vs. Usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
54. Model Documentation vs. Usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
55. Model Documentation vs. Usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
56. Model Documentation RCT
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
57. Model Documentation RCT
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
Model population
58. Model Documentation RCT
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
Model population
Control group
Treatment group
59. Model Documentation RCT
Model population
Control group
Treatment group Documentation
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
60. Model Documentation RCT
Model population
Control group
Treatment group Documentation
Compare usage
Observation: Only 50% models have model cards but contribute 98% of
total usage
Goal: Study the relation between model usage and documentation
Hypothesis: Model documentation drives model usage
Randomized Control Trial (RCT) for models:
69. Model Documentation RCT Findings
1. Increased usage of models in treatment group compared to control group
2. More prominent for model weights downloads
3. Model documentation drives model usage
70. What do developers document about models?
Distribution of sections in model cards
71. What do developers document about models?
Distribution of sections in model cards
72. Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
76. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
77. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
Example: short reviews (< 50 words) in the IMDB sentiment dataset
Tools: Snorkel (Ratner et al., 2017), Errudite (Wu et al., 2019)
78. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
79. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
Example: substitute words with their synonyms in the IMDB dataset
Tools: NLPAug (Ma, 2019)
80. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
81. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
Example: write new movie reviews in the style of a newspaper columnist
Tools: CheckList (Ribeiro et al., 2020)
82. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
4. Attacks – adversarial evaluation
83. NLP Evaluation Idioms
1. Subpopulations – disaggregate evaluation on slice or subpopulation of data
2. Transformations – natural perturbations to original evaluation instances
3. Evaluation sets – evaluation on diagnostic sets
4. Attacks – adversarial evaluation
Example: add “aabbccaa” to reviews because it makes the model predict positive sentiment
Tools: TextAttack (Morris et al., 2020), OpenAttack (Zeng et al., 2020)
106. Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
When did England last win the football world cup?
107. Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
When did England last win the football world cup?
FIFA World Cup
England National Football Team
When did England last win the football world cup?
108. Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
When did England last win the football world cup?
FIFA World Cup
England National Football Team
When did England last win the football world cup?
Downstream System
Question Answering System
109. Named Entity Linking
map “strings” to “things”
in a knowledge base like
Wikipedia
Experiments with Commercial APIs for Named Entity Linking
Downstream System
FIFA World Cup
England National Football Team
Question Answering System
When did England last win the football world cup?
1966
A correct NEL is required for the downstream system!
111. Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Popularity
heuristic
outperforms all
commercial
systems
112. Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Commercial
APIs are not any
more robust
than popularity
heuristic
113. Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Commercial
systems are
capitalization
sensitive
114. Experiments with Commercial APIs for Named Entity Linking
Robustness Report for NEL on AIDA-b dataset
Type of
Systematic
Error!
115. Systematic Error Analysis and Labeling (SEAL)
Evaluation is a creative process
Systematic errors are difficult to detect:
- High dimension of the learned representations
- Extracting and labeling semantics in the error group requires human-in-the-loop
Interactive tool to identify and label candidate data slices with high systematic errors
(Rajani et al, EMNLP ‘22 demo)
116. Systematic Error Analysis and Labeling (SEAL)
1. Embed
Identify candidate groups with high systematic errors
(Rajani et al, EMNLP ‘22 demo)
117. Systematic Error Analysis and Labeling (SEAL)
Identify candidate groups with high systematic errors
2. Cluster
(Rajani et al, EMNLP ‘22 demo)
118. Systematic Error Analysis and Labeling (SEAL)
Generate semantic labels using LLMs
books
music
worst book/album reviews
products that work with both
Windows and Mac
Gym equipment
3. Semantic Labeling
(Rajani et al, EMNLP ‘22 demo)
127. Takeaways
1. Open-sourcing ML research artifacts is now the default
2. The most popular Hugging Face models are those that are older and
well-documented
128. Takeaways
1. Open-sourcing ML research artifacts is becoming the norm
2. The most popular Hugging Face models are those that are older and
well-documented
3. Model evaluation can be actionable – RG toolkit supports this goal via fine-grained
evaluation
129. Takeaways
1. Open-sourcing ML research artifacts is becoming the norm
2. The most popular Hugging Face models are those that are older and
well-documented
3. Model evaluation can be actionable – RG toolkit supports this goal via fine-grained
evaluation
4. LLMs can help label systematic errors in models in a human interpretable way
130. Outline
Part 1:
NLP Modeling landscape
Systematic study of 75,000 models on HF
Part 2:
NLP Evaluation landscape
Challenges and opportunities in model evaluation and documentation
Part 3:
Opensource alternative to ChatGPT
Evaluating a Chatbot
131. Current Research Focus
● Open-source alternative to ChatGPT
● Follow what we are building https://huggingface.co/HuggingFaceH4
● Evaluating a Chatbot
133. Training a Chatbot
1. Pretraining the LM
a. Predicting the next token
b. Eg: GPT-3, BLOOM
2. Incontext learning (aka prompt-based learning)
a. Few shot learning without updating the parameters
b. Context distillation is a variant wherein you condition on the prompt and update the parameters
3. Supervised fine-tuning
a. Fine-tuning for instruction following and to make them chatty
b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I, Alpaca
4. Reinforcement Learning from Human Feedback
a. safety/alignment
b. nudging the LM towards values you desire
134. Training a Chatbot
1. Pretraining the LM
a. Predicting the next token
b. Eg: GPT-3, BLOOM
2. Incontext learning (aka prompt-based learning)
a. Few shot learning without updating the parameters
b. Context distillation is a variant wherein you condition on the prompt and update the
parameters
3. Supervised fine-tuning
a. Fine-tuning for instruction following and to make them chatty
b. Eg: InstructGPT, LaMDA, Sparrow, OPT-IML, LLaMA-I
4. Reinforcement Learning from Human Feedback
a. safety/alignment
b. nudging the LM towards values you desire
135. Evaluating a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
136. Training a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Supervised Fine-tuning
137. Training a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Reinforcement learning with human feedback (RLHF)
139. Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
141. Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
● Step 2: Evaluating the RM. Can the model choose between a truthful and a
untruthful response? Can it rank harmful responses lower than the harmless
responses?
143. Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
● Step 2: Evaluating the RM. Can the model choose between a truthful and a
untruthful response? Can it rank harmful responses lower than the harmless
responses?
● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities and
emerging capabilities.
○ Eg: Complete the sentence, “You should just go kill”
144. Evaluating a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Evaluating instruction
following/chatty-ness
Evaluating the RM Red-teaming
145. Evaluating a Chatbot
● Step 1: Evaluating instruction following. Does the model generate useful responses
on the topic? Are they open-ended?
○ Eg: Brainstorm a list of New Year’s resolutions
● Step 2: Evaluating the RM. Can the model choose between a truthful and a
untruthful response? Can it rank harmful responses lower than the harmless
responses?
● Step 3: Red-teaming. Crafting prompts that would surface model vulnerabilities
and emerging capabilities.
○ Eg: Complete the sentence, “You should just go kill”
148. Red-Teaming
2. Emerging Capabilities
- Power-seeking behavior (eg: resources)
- Persuading people to do harm (on themselves or others)
- Having agency with physical outcomes (eg: ordering chemicals online via an API)
These are considered critical threat scenarios
149. Red-Teaming
Similarities with adversarial attacks:
- Goal is to “attack” or “manipulate” the model to generate harmful content
- Actionable: used to fine-tune the model to steer it away to generate friendly output
150. Red-Teaming
Differences with adversarial attacks:
- Human interpretable and look like regular prompt. Eg: prefixing “aaabbcc” is
adversarial but not red-teaming.
151. Red-Teaming
Differences with adversarial attacks:
- Human interpretable and look like regular prompt. Eg: prefixing “aaabbcc” is
adversarial but not red-teaming.
*Warning: offensive text below*
Wallace, et al. "Universal Adversarial Triggers for Attacking and Analyzing NLP" (2021).
152. Red-Teaming Methods
Roleplay attacks wherein the LLM is instructed to behave as a malicious character
Instructing the model to respond in code instead of natural language
Instructing a model to reveal sensitive information such as PII.
155. Takeaways from Red-Teaming
1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder
to red-team than plain LMs.
2. There are no clear trends with scaling model size for attack success rate except
RLHF models that are more difficult to red-team as they scale.
3. Models may learn to be harmless by being evasive, there is tradeoff between
helpfulness and harmlessness.
4. The distribution of the success rate varies across categories of harm with
non-violent ones having a higher success rate.
156. Open problems with Red-Teaming
1. There is no open-source red-teaming dataset for code generation that
attempts to jailbreak a model via code. Eg: generating a program that
implements a DDOS or backdoor attack.
2. Designing and implementing strategies for red-teaming LLMs for critical threat
scenarios.
3. Evaluating the tradeoffs between evasiveness and helpfulness.
158. RLHF Team
Nathan Lambert Lewis Tunstall Thomas Wolf
And more at Hugging Face and the community!
Leandro von Werra Younes Belkada Edward Beeching
159. Collaborators
Systematic study of HF models and SEAL
Robustness Gym
James Zou
(Stanford)
Weixin Liang
(Stanford)
Karan Goel
(Stanford)
Jesse Vig
(Salesforce)
Chris Re
(Stanford)
Mohit Bansal
(UNC)
Xinyu Yang
(ZJU)
Meg Mitchell
(Hugging Face)