The document discusses learning classifier systems (LCS) for addressing class imbalance problems in datasets. It aims to enhance the applicability of LCS to knowledge discovery from real-world datasets that often exhibit class imbalance, where one class is represented by significantly fewer examples than other classes. The author proposes adapting parameters of the XCS learning classifier system, such as learning rate and genetic algorithm threshold, based on estimated class imbalance ratios within classifiers' niches in order to minimize bias towards majority classes and better handle small disjuncts representing minority classes.
Learning Classifier Systems for Class Imbalance Problems
1. Learning Classifier Systems for Class Imbalance Problems
Ester Bernadó-Mansilla
Research Group in Intelligent Systems
Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
Barcelona, Spain
2. Aim
Enhance the applicability of LCSs to knowledge discovery from datasets:
Classification problems
Real-world domains
Learning Classifier Systems for Class Imbalance Problems Ester Bernadó-Mansilla
3. Framework
[Diagram: Dataset → LCS → model + estimated performance]
• Representativity of the target concept
• Evolutionary pressures
• Interpretability
• Geometrical complexity
• Domain of applicability
• Class imbalance
• Noise
4. Class Imbalance
When one class is represented by a small number of examples, compared to the other class/es.
Usually the class that describes the circumscribed concept (the positive class) is the minority class.
Where?
Rare medical diagnoses
Fraud detection
Oil spills in satellite images
5. Class Imbalance and Classifiers
Is there a bias towards the majority class?
Probably, because most classifier schemes are trained to minimize the global error.
As a result:
They classify accurately the examples of the majority class.
They tend to misclassify the examples of the minority class, which are often those representing the target concept.
6. Measures of Performance
Confusion matrix

             Predicted A           Predicted B
Actual A     true positive (TP)    false negative (FN)
Actual B     false positive (FP)   true negative (TN)

Accuracy = (TP+TN)/(TP+FN+FP+TN)
TN rate = TN/(TN+FP)
TP rate = TP/(TP+FN)
ROC curves
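For concreteness, these measures can be computed directly from the four confusion-matrix counts; a small sketch (the counts below are illustrative, echoing the 15-vs-150 split of Dataset 1 on a later slide):

```python
# Accuracy, TP rate and TN rate from confusion-matrix counts.
def rates(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tp_rate = tp / (tp + fn)   # sensitivity: recall on the positive class
    tn_rate = tn / (tn + fp)   # specificity: recall on the negative class
    return accuracy, tp_rate, tn_rate

# A degenerate classifier that labels everything as the majority class
# (15 positive vs. 150 negative examples):
acc, tpr, tnr = rates(tp=0, fn=15, fp=0, tn=150)
print(acc, tpr, tnr)  # ~0.909, 0.0, 1.0
```

This is exactly the bias the following slides discuss: global accuracy stays high while the TP rate collapses to zero.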
7. The Higher the Class Imbalance, the Higher the Bias?
Dataset 1: concept 15, counterpart 150, ratio 10:1
Dataset 2: concept 15, counterpart 45, ratio 3:1
8. XCS
[Diagram: XCS interacts with the Environment (the Dataset): it receives an
input, its Set of Rules predicts a class, Reinforcement Learning updates the
rules from the reward, and Genetic Algorithms search for new rules.]
9. Our Approach with XCS
Bounding XCS’s parameters for unbalanced datasets
Online identification of small disjuncts
Adaptation of parameters for the discovery of small
disjuncts
10. XCS’s Behavior in Unbalanced Datasets
Unbalanced 11-multiplexer problem
[Plots: performance at ir=16:1, ir=32:1 and ir=64:1]
11. XCS’s Population
Most numerous rules, ir=128:1:

Classifier       P          Error    F     Num
###########:0    1000       0.12     0.98  385
###########:1    1.2·10⁻⁴   0.074    0.98  366

Both rules are overgeneral, yet their prediction and error are poorly
estimated (the expected values are P = 992.24 and 7.75, with error 15.38),
their fitness is too high, and their numerosity is high.
Test examples are classified as belonging to the majority class
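The values 992.24, 15.38 and 7.75 quoted on this slide can be reproduced from the prediction and error formulas of the next slides, assuming Rmax = 1000, Rmin = 0 and Pc = ir/(ir+1) at ir = 128:1 (a sketch, not the talk's code; rounding differs slightly from the slide):

```python
# Expected prediction and error of the fully general classifiers at ir = 128:1.
Rmax, Rmin, ir = 1000.0, 0.0, 128
pc = ir / (ir + 1)                      # fraction of majority-class examples

P0 = pc * Rmax + (1 - pc) * Rmin        # prediction of ###########:0
eps = abs(P0 - Rmax) * pc + abs(P0 - Rmin) * (1 - pc)   # its error
P1 = (1 - pc) * Rmax + pc * Rmin        # prediction of ###########:1

print(round(P0, 2), round(eps, 2), round(P1, 2))  # 992.25 15.38 7.75
```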
12. How Imbalance Affects XCS
Classifier’s error
Stability of prediction and error estimates
Occurrence-based reproduction
13. Classifier’s Error in Unbalanced
Datasets
Will an overgeneral classifier be detected as inaccurate if the
imbalance ratio is high?
Bound for an inaccurate classifier: ε > ε0
Given the estimated prediction and error:
P = Pc(cl)·Rmax + (1 − Pc(cl))·Rmin
ε = |P − Rmax|·Pc(cl) + |P − Rmin|·(1 − Pc(cl))
with Pc(cl) = p/(p+1) and Rmin = 0 for the overgeneral majority
classifier, the condition ε ≥ ε0 becomes:
−ε0·p² + 2p·(Rmax − ε0) − ε0 ≥ 0
where p is the imbalance ratio.
For Rmax = 1000 and ε0 = 1 we get the maximum imbalance ratio:
irmax = 1998
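The bound irmax = 1998 is the larger root of that quadratic in p; a quick numerical check (a sketch assuming Rmax = 1000, ε0 = 1 and Rmin = 0, as on the slide):

```python
import math

# Larger root of e0*p**2 - 2*(Rmax - e0)*p + e0 = 0, i.e. the highest
# imbalance ratio at which the overgeneral classifier's error exceeds e0.
Rmax, e0 = 1000.0, 1.0
irmax = ((Rmax - e0) + math.sqrt((Rmax - e0) ** 2 - e0 ** 2)) / e0
print(round(irmax))  # 1998
```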
14. Prediction and Error Estimates and Learning Rate
[Plots: prediction and error estimates of classifier ###########:0 at
ir=128:1, for learning rates β=0.2 and β=0.002]
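The effect of β can be illustrated with a tiny simulation of the Widrow-Hoff update p ← p + β(R − p) for the overgeneral ###########:0 (my own sketch, not the talk's code; the reward model follows the assumptions of the previous slide):

```python
import random

def final_estimate(beta, ir=128, steps=50000, seed=0):
    """Widrow-Hoff prediction estimate of the overgeneral ###########:0:
    reward is 1000 with probability ir/(ir+1), 0 otherwise."""
    rng = random.Random(seed)
    p = 10.0  # arbitrary initial prediction
    for _ in range(steps):
        reward = 1000.0 if rng.random() < ir / (ir + 1) else 0.0
        p += beta * (reward - p)
    return p

# beta = 0.2 averages over a ~5-example window, so the estimate sits near
# 1000 most of the time; beta = 0.002 stays close to the true value 992.2.
print(final_estimate(0.2), final_estimate(0.002))
```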
15. Occurrence-based Reproduction
Probability of occurrence (pocc)
Given ir = maj/min:

Classifier       pocc,B   pocc,I
###########:0    1/2      1/2
###########:1    1/2      1/2
0000#######:0    1/32     (see plot)
0001#######:1    1/32     (see plot)

[Plot: probability of occurrence vs. imbalance ratio (1 to 256) for
00001######:1, 00000######:0, ###########:0 and ###########:1]
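These occurrence probabilities are driven by the class frequencies, which follow directly from the ratio ir = maj/min; a tiny illustrative helper:

```python
# Class frequencies as a function of the imbalance ratio ir = maj/min:
# f_maj = ir/(ir+1), f_min = 1/(ir+1).
def class_frequencies(ir):
    return ir / (ir + 1), 1 / (ir + 1)

for ir in [1, 2, 4, 8, 16, 32, 64, 128, 256]:  # the plot's x-axis values
    f_maj, f_min = class_frequencies(ir)
    print(ir, round(f_maj, 3), round(f_min, 3))
```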
16. Occurrence-based Reproduction
Probability of reproduction (pGA):
pGA = 1/TGA
where TGA ≈ θGA if Tocc < θGA, and TGA ≈ Tocc otherwise
With θGA = 20:
TGA(###########:0) ≈ θGA, since the overgeneral classifier occurs
almost every time step (Tocc < θGA)
TGA(0000#######:0) ≈ Tocc ¹, since the specific niche occurs rarely
(Tocc > θGA)
¹ Assuming non-overlapping classifiers
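The case analysis above can be sketched as a small function (the function name and the example Tocc values are illustrative, not from the talk):

```python
# Effective GA period of a niche, given its mean time between
# occurrences (t_occ) and the GA threshold theta_GA.
def ga_period(t_occ, theta_ga=20):
    # Frequent niches (Tocc < theta_GA) reproduce roughly every theta_GA
    # steps; rare niches reproduce at most once per occurrence.
    return theta_ga if t_occ < theta_ga else t_occ

# The overgeneral ###########:0 matches almost every step; a specific
# minority niche occurs only rarely (illustrative values):
print(ga_period(1))     # 20   -> reproduces every ~20 steps
print(ga_period(2000))  # 2000 -> reproduces ~100x less often
```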
17. Guidelines for Parameter Tuning
Rmax and ε0 determine the threshold between negligible noise and the
imbalance ratio.
β determines the size of the moving window. The window should be large
enough to include examples from both classes:
β = k · f_min / f_maj
θGA can counterbalance the reproduction opportunities of the most
frequent (majority) and least frequent (minority) niches:
θGA = k' / f_min
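Both tuning rules can be written down directly; in the sketch below, f_min and f_maj are the empirical class frequencies, and k, k' are free constants whose values here are illustrative assumptions:

```python
# Tuning beta and theta_GA from the class frequencies: beta shrinks and
# theta_GA grows as the imbalance ratio increases.
def tuned_parameters(f_min, f_maj, k=0.2, k_prime=1.0):
    beta = k * f_min / f_maj        # window ~1/beta spans both classes
    theta_ga = k_prime / f_min      # rare niches keep GA opportunities
    return beta, theta_ga

# ir = 64:1  ->  f_maj = 64/65, f_min = 1/65
beta, theta_ga = tuned_parameters(f_min=1 / 65, f_maj=64 / 65)
print(round(beta, 6), round(theta_ga, 1))  # 0.003125 65.0
```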
18. XCS with Parameters Tuning
[Plots comparing XCS with parameter tuning against XCS with standard
settings, at ir=16:1, 32:1, 64:1, 64:1 and 256:1]
19. XCS Tuning for Real-world Datasets
How can we estimate the niche frequency?
Estimate it from the ratio of majority-class to minority-class instances
Problem:
• This may not be related to the distribution of niches in the feature
space
Instead, approach it as a small-disjuncts problem
20. Online Identification of Small Disjuncts
We search for regions that promote overgeneral classifiers.
Estimate ircl based on the classifier’s experience on each class:
ircl = exp_max / exp_min
Adapt β and θGA according to ircl.
Example: ircl = 20/4 = 5
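A sketch of this per-classifier estimate (the class name, its fields and the guard against zero experience are my own assumptions):

```python
# Per-classifier imbalance estimate: the ratio of the classifier's
# experience on each class, updated online as examples are matched.
class NicheClassifier:
    def __init__(self):
        self.exp = {0: 0, 1: 0}   # matched examples seen per class

    def update(self, label):
        self.exp[label] += 1

    def ir_estimate(self):
        lo, hi = sorted(self.exp.values())
        return hi / max(lo, 1)    # exp_max / exp_min

cl = NicheClassifier()
for label in [0] * 20 + [1] * 4:  # the slide's example: 20 vs. 4 matches
    cl.update(label)
print(cl.ir_estimate())  # 5.0
```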
21. Online Parameter Adaptation
[Plot: performance with online parameter adaptation at ir=256:1]
22. What about UCS?
Supervised XCS:
Needs less exploration
Avoids XCS’s fitness dilemma
More robust to parameter settings
Overgeneral classifiers also tend to take over the population
Their probability of occurrence depends on the imbalance ratio
This is partially mitigated by fitness sharing
23. What about UCS?
[Plots: UCS at ir=256:1 and ir=512:1]
25. How can we Minimize the Effects of
Small Disjuncts?
Resampling the dataset:
Classical methods:
• Random oversampling
• Random undersampling
Heuristic methods:
• Tomek links
• CNN
• One-sided selection
• SMOTE
Cluster-based oversampling:
• Addresses small disjuncts
• Assumes that clusterization will find the small disjuncts and match
the classifier’s approximation
Cost-sensitive classifiers
Could XCS benefit from the online identification of small disjuncts?
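As an illustration of the simplest of the classical methods above, a naive random-oversampling sketch (my own, not from the talk):

```python
import random

def random_oversample(X, y, seed=0):
    """Naive random oversampling: duplicate randomly chosen minority
    examples until all classes reach the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())
    X2, y2 = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        X2 += items + extra
        y2 += [label] * target
    return X2, y2

# Toy 10-vs-2 dataset: after oversampling, both classes have 10 examples.
X, y = [[i] for i in range(12)], [0] * 10 + [1] * 2
X2, y2 = random_oversample(X, y)
print(y2.count(0), y2.count(1))  # 10 10
```

Note that duplicating minority points does not create new information; this is why the slide also lists heuristic and cluster-based alternatives.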
26. Domains of Applicability
Should we use some counterbalancing scheme?
Which learning scheme should we use?
Is there a combination of counterbalancing
scheme+learner that beats all others?
How can we know the presence of small
disjuncts?
Are there other complexity factors mixed up with
the small disjuncts problem?
27. Domains of Applicability
[Diagram: a Dataset undergoes dataset characterization, which yields a
prediction of the suggested approach: learn it directly, resampling, a
particular classifier, or resampling + classifier. Where are LCSs placed?]
Type of dataset:
Geometrical distribution of classes
Possible presence of small disjuncts
Other complexity factors
28. Future Directions
Potential benefit of XCS to discover small disjuncts
…and learn from them online
Further analyze UCS
How do LCSs perform w.r.t. other classifiers on unbalanced datasets?
Measures for small-disjunct identification
…and for other possible complexity factors
What is noise and what is a small disjunct?
In which cases is an LCS applicable?
29. Learning Classifier Systems
for Class Imbalance
Problems
Ester Bernadó-Mansilla
Research Group in Intelligent Systems
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Barcelona, Spain