Data-intensive computing has positioned itself as a valuable programming paradigm for efficiently tackling problems that require processing very large volumes of data. This paper presents a pilot study on applying the data-intensive computing paradigm to evolutionary computation algorithms. Two representative cases (selectorecombinative genetic algorithms and estimation of distribution algorithms) are presented, analyzed, and discussed. The study shows that equivalent data-intensive computing evolutionary computation algorithms can be easily developed, providing robust and scalable algorithms for the multicore-computing era. Experimental results show how such algorithms scale with the number of available cores without further modification.
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre
1. Data-Intensive Computing for
Competent Genetic Algorithms:
A Pilot Study using Meandre
Xavier Llorà
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Urbana, Illinois, 61801
xllora@ncsa.illinois.edu
http://www.ncsa.illinois.edu/~xllora
2. Outline
• Data-intensive computing and HPC?
• Is this related at all to evolutionary computation?
• Data-intensive computing with Meandre
• GAs and competent GAs
• Data-intensive computing for GAs
3. 2 Minute HPC History
• The eighties and early nineties picture
• Commodity hardware rare, slow, and costly
• Supercomputers were extremely expensive
• Most of them were hand crafted, with only a few units built
• Two competing families
• CISC (e.g. Cray C90 with up to 16 processors)
• RISC (e.g. Connection Machine CM-5 with up to 4,096 processors)
• Late nineties, commodity hardware hit the mainstream
• Started becoming popular, cheaper, and faster
• Economy of scale
• Massively parallel computers built from commodity components became a viable option
4. Two Visions
• C90-like supercomputers were like a comfy pair of trainers
• Oriented to scientific computing
• Complex vector oriented supercomputers
• Shared memory (lots of them)
• Multiprocessing enabled via interconnection networks
• Single system image
• CM-5-like computers did not get massive traction, but some
• General purpose (as long as you can chop the work into simple units)
• Lots of simple processors available
• Distributed memory pushed new programming models (message passing)
• Complex interconnection networks
• NCSA hosts shared-memory, distributed-memory, and GPGPU-based systems
5. Miniaturization Building Bridges
• Multicores and GPGPUs are reviving the C90 flavor
• The CM-5 flavor now survives as distributed clusters of not-so-simple units
6. Control Models of Parallelization in EC
[Diagram: three classic control models — independent parallel runs (Run 1 … Run 11), a master–slave scheme where a master farms out individual evaluation to slaves, and island models exchanging individuals via migration]
7. But Data is also Part of the Equation
• Google and Yahoo! revived an old route
• Usually refers to:
• Infrastructure
• Programming techniques/paradigms
• Google made it mainstream with their MapReduce model
• Yahoo! provides an open source implementation
• Hadoop (MapReduce)
• HDFS (Hadoop distributed filesystem)
• Store petabytes reliably on commodity hardware (fault tolerant)
• Programming model
• Map: Equivalent to the map operation on functional programming
• Reduce: The reduction phase after maps are computed
8. A Simple Example
\(\sum_{i=0}^{n} x_i^2 \;\rightarrow\; \mathrm{reduce}(\mathrm{map}(x, \mathrm{sqr}), \mathrm{sum})\)

[Diagram: each element of x passes through a map stage computing x², and reduce stages fold the squared values into the final sum]
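In plain Python, the same pattern can be sketched with the built-in map and reduce (a single-machine illustration of the idea, not Hadoop code):

```python
from functools import reduce

def sqr(x):
    # Map step: square one element, independently of all others.
    return x * x

def add(a, b):
    # Reduce step: fold two partial results into one.
    return a + b

xs = [1, 2, 3, 4]
print(reduce(add, map(sqr, xs)))  # 1 + 4 + 9 + 16 = 30
```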
9. Is This Related to EC?
• How can we benefit from the current core race painlessly?
• NCSA’s Blue Waters is estimated to top 100K cores
• Yes on several facets
• Large optimization problems need to deal with large population sizes (Sastry, Goldberg & Llorà, 2007)
• Large-scale data mining using genetics-based machine learning (Llorà et al., 2007)
• Competent GAs’ model building is extremely costly and data rich (Pelikan et al., 2001)
• The goal?
• Rethink parallelization as data flow processes
• Show that traditional models can be mapped to data-intensive computing models
• Foster your curiosity
11. The Meandre Infrastructure Challenges
• NCSA’s infrastructure effort on data-intensive computing
• Transparency
• From a single laptop to a HPC cluster
• Not bound to a particular computation fabric
• Allow heterogeneous development
• Intuitive programming paradigm
• Modular Components assembled into Flows
• Foster Collaboration and Sharing
• Open Source
• Service-Oriented Architecture (SOA)
12. Basic Infrastructure Philosophy
• Dataflow execution paradigm
• Semantic-web driven
• Web oriented
• Facilitate distributed computing
• Support publishing services
• Promote reuse, sharing, and collaboration
• More information at http://seasr.org/meandre
13. Data Flow Execution in Meandre
• A simple example c ← a+b
• A traditional control-driven language
a = 1
b = 2
c = a+b
• Execution following the sequence of instructions
• One step at a time
• a+b+c+d requires 3 steps
• Could be easily parallelized
14. Data Flow Execution in Meandre
• Data flow execution is driven by data
• The previous example may have 2 possible data flow versions
[Diagram: stateless data flow — value(a) and value(b) flow into the + operator, producing value(c); state-based data flow — the + operator buffers whichever of value(a) and value(b) arrives first, firing once both are present to produce value(c)]
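A minimal Python sketch of the state-based firing rule (a hypothetical component, not Meandre’s actual API): the component buffers inputs and fires only once data is available on every port.

```python
class AddComponent:
    """Fires when data is available on both input ports."""
    def __init__(self):
        self.inputs = {"a": None, "b": None}

    def push(self, port, value):
        # Buffer the incoming value, then fire if every port holds data.
        self.inputs[port] = value
        if all(v is not None for v in self.inputs.values()):
            return self.fire()
        return None  # still waiting for the other input

    def fire(self):
        a, b = self.inputs["a"], self.inputs["b"]
        self.inputs = {"a": None, "b": None}  # reset state after firing
        return a + b

add = AddComponent()
print(add.push("a", 1))  # None: value(b) has not arrived yet
print(add.push("b", 2))  # 3: both inputs present, the component fires
```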
15. The Basic Building Blocks: Components
[Diagram: a component couples an RDF descriptor of the component’s behavior with the component implementation]
16. Go with the Flow: Creating Complex Tasks
• Directed multigraph of components creates a flow
[Diagram: two Push Text components feed a Concatenate Text component; its output flows through To Upper Case Text into Print Text]
17. Automatic Parallelization: Speed and Robustness
• The Meandre ZigZag language allows automatic parallelization
[Diagram: the same flow with the To Upper Case Text component automatically replicated into three parallel instances between Concatenate Text and Print Text]
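A rough single-machine analogue of this replication, using a thread pool to stand in for the three component instances (illustrative only; ZigZag expresses this declaratively):

```python
from concurrent.futures import ThreadPoolExecutor

def to_upper_case(text):
    # The replicated component: stateless, so instances can run in parallel.
    return text.upper()

chunks = ["hello ", "meandre ", "world"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Three workers stand in for the three component instances.
    results = list(pool.map(to_upper_case, chunks))
print("".join(results))  # HELLO MEANDRE WORLD
```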
19. Selectorecombinative GAs
1. Initialize the population with random individuals
2. Evaluate the fitness value of the individuals
3. Select good solutions by using s-wise tournament selection
without replacement (Goldberg, Korb & Deb, 1989)
4. Create new individuals by recombining the selected population
using uniform crossover (Syswerda, 1989)
5. Evaluate the fitness value of all offspring
6. Repeat steps 3-5 until convergence criteria are met
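A compact Python sketch of steps 1–6 (OneMax as a stand-in fitness function; population size and string length are arbitrary):

```python
import random

def onemax(ind):
    # Stand-in fitness: number of ones in the bit string.
    return sum(ind)

def tournament_without_replacement(pop, s=4):
    # Shuffle the population and run disjoint s-wise tournaments,
    # refilling the pool until a full mating set is selected.
    selected, pool = [], []
    while len(selected) < len(pop):
        if len(pool) < s:
            pool = pop[:]
            random.shuffle(pool)
        group, pool = pool[:s], pool[s:]
        selected.append(max(group, key=onemax))
    return selected

def uniform_crossover(p1, p2):
    # Each gene is inherited from either parent with probability 0.5.
    return [random.choice(pair) for pair in zip(p1, p2)]

n_genes, pop_size = 32, 100
pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
for generation in range(100):
    mates = tournament_without_replacement(pop)
    pop = [uniform_crossover(random.choice(mates), random.choice(mates))
           for _ in range(pop_size)]
    if max(map(onemax, pop)) == n_genes:  # convergence criterion
        break
```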
20. Extended Compact Genetic Algorithm
• Harik et al. 2006
• Initialize the population (usually random initialization)
• Evaluate the fitness of individuals
• Select promising solutions (e.g., tournament selection)
• Build the probabilistic model
• Optimize structure & parameters to best fit selected individuals
• Automatic identification of sub-structures
• Sample the model to create new candidate solutions
• Effective exchange of building blocks
• Repeat steps 2–7 till some convergence criteria are met
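A sketch of the sampling step alone, assuming the gene partition and its marginal probabilities have already been learned (the data below are made up for illustration):

```python
import random

def sample_individual(partition, marginals):
    # partition: gene groups, e.g. [[0, 1], [2]]
    # marginals: one dict per group mapping gene configurations to probabilities
    n_genes = sum(len(group) for group in partition)
    ind = [0] * n_genes
    for group, probs in zip(partition, marginals):
        configs, weights = zip(*probs.items())
        chosen = random.choices(configs, weights=weights)[0]
        for gene, allele in zip(group, chosen):
            ind[gene] = allele
    return ind

# Example: genes 0 and 1 are linked; gene 2 is independent.
partition = [[0, 1], [2]]
marginals = [{(0, 0): 0.10, (1, 1): 0.80, (0, 1): 0.05, (1, 0): 0.05},
             {(0,): 0.3, (1,): 0.7}]
print(sample_individual(partition, marginals))
```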
21. eCGA Model Building Process
• Use model-building procedure of extended compact GA
• Partition genes into (mutually) independent groups
• Start with the lowest complexity model
• Search for a least-complex, most-accurate model
Model Structure                                                    Metric
[X0] [X1] [X2] [X3] [X4] [X5] [X6] [X7] [X8] [X9] [X10] [X11]     1.0000
[X0] [X1] [X2] [X3] [X4X5] [X6] [X7] [X8] [X9] [X10] [X11]        0.9933
[X0] [X1] [X2] [X3] [X4X5X7] [X6] [X8] [X9] [X10] [X11]           0.9819
[X0] [X1] [X2] [X3] [X4X5X6X7] [X8] [X9] [X10] [X11]              0.9644
…
[X0] [X1] [X2] [X3] [X4X5X6X7] [X8X9X10X11]                        0.9273
…
[X0X1X2X3] [X4X5X6X7] [X8X9X10X11]                                 0.8895
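The greedy search can be sketched as below, scoring each candidate partition with eCGA’s combined MDL complexity (model complexity plus compressed population complexity). This is a minimal illustration in raw bits, not the code behind the table above, whose metric appears to be a normalized variant:

```python
import math
from itertools import combinations

def mdl(partition, pop):
    # eCGA metric: model complexity + compressed population complexity.
    n = len(pop)
    model_c, data_c = 0.0, 0.0
    for group in partition:
        model_c += math.log2(n + 1) * (2 ** len(group) - 1)
        counts = {}
        for ind in pop:
            key = tuple(ind[g] for g in group)
            counts[key] = counts.get(key, 0) + 1
        data_c += sum(-c * math.log2(c / n) for c in counts.values())
    return model_c + data_c

def build_model(pop, n_genes):
    # Start with the lowest-complexity model: all genes independent.
    partition = [[g] for g in range(n_genes)]
    while True:
        best, best_score = None, mdl(partition, pop)
        for a, b in combinations(range(len(partition)), 2):
            cand = [grp for i, grp in enumerate(partition) if i not in (a, b)]
            cand.append(partition[a] + partition[b])
            score = mdl(cand, pop)
            if score < best_score:
                best, best_score = cand, score
        if best is None:  # no merge lowers the metric: stop
            return partition
        partition = best
```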
27. eCGA Model Building Speedup
• Intel 2.8 GHz quad-core, 4 GB RAM. Average of 20 runs.
• Speedup against original eCGA model building
[Plot: speedup vs. the original eCGA model building grows roughly linearly with the number of cores, from 1 on a single core to roughly 4–5 on 4 cores]
28. Scalability on NUMA Systems
• Run on NCSA’s SGI Altix Cobalt
• 1,120 processors and up to 5 TB of RAM
• SGI NUMAlink
• NUMA architecture
• Test for speedup behavior
• Average of 20 independent runs
• Automatic parallelization of the partition evaluation
• Results still show the linear trend (despite the NUMA architecture)
• 16 processors, speedup = 14.01
• 32 processors, speedup = 27.96
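• In other words, parallel efficiency stays essentially flat: 14.01 / 16 ≈ 0.876 at 16 processors versus 27.96 / 32 ≈ 0.874 at 32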
30. Summary
• Evolutionary computation is data rich
• Data-intensive computing can provide to EC:
• Tap into parallelism quite painlessly
• Provide a simple programming and modeling paradigm
• Boost reusability
• Tackle otherwise intractable problems
• Shown that equivalent data-intensive computing versions of
traditional algorithms exist
• Linear parallelism can be tapped transparently
31. Data-Intensive Computing for
Competent Genetic Algorithms:
A Pilot Study using Meandre
Xavier Llorà
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Urbana, Illinois, 61801
xllora@ncsa.illinois.edu
http://www.ncsa.illinois.edu/~xllora