This talk is a primer on Machine Learning. I will provide a brief introduction to what ML is and how it works, and walk you through the Machine Learning pipeline: data gathering, data normalization and feature engineering, common supervised and unsupervised algorithms, model training, and delivering results to production. I will also recommend tools that support a good ML workflow, including programming languages and libraries.
If there is time at the end of the talk, I will walk through two coding examples using the RMS Titanic passenger list: one in Python with scikit-learn, using a random forest to check whether ML can correctly predict passenger survival, and one in R for feature engineering of the same dataset.
Note to data-scientists and programmers: If you sign up to attend, plan to visit my Github repository! I have many Machine Learning coding examples in Python scikit-learn, GNU Octave, and R Programming.
https://github.com/jefftune/gitw-2017-ml
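A minimal sketch of the kind of scikit-learn workflow the abstract describes, using a tiny synthetic stand-in for the Titanic passenger list (the column layout and every value below are illustrative, not the real data):

```python
# Toy random forest for "did this passenger survive?", in the spirit of
# the talk's scikit-learn example. Data here is made up for illustration.
from sklearn.ensemble import RandomForestClassifier

# Features: [passenger class, sex (0 = male, 1 = female), age]
X = [
    [1, 1, 38], [3, 0, 22], [1, 1, 35], [3, 0, 35],
    [2, 1, 27], [3, 0, 20], [1, 0, 54], [3, 1, 14],
]
# Labels: 1 = survived, 0 = did not survive (synthetic)
y = [1, 0, 1, 0, 1, 0, 0, 1]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Predict survival for a hypothetical first-class female passenger, age 30
print(model.predict([[1, 1, 30]]))
```

The real passenger list (available on Kaggle) needs the feature-engineering step the R half of the talk covers; the fit/predict calls stay the same.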
The document provides an introduction to machine learning with Mahout. It discusses how machine learning can be used to analyze large datasets and extract useful patterns and insights. Specifically, it covers topics like supervised vs unsupervised learning, recommendation systems, classification, and clustering algorithms like k-means. As an example, it shows how k-means clustering could be used to group pizza delivery locations to optimally locate new stores.
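The pizza-delivery example above is a classic use of k-means. A plain NumPy sketch of the algorithm itself (not the Mahout implementation) on a synthetic set of delivery coordinates:

```python
# Toy k-means: cluster delivery locations (x, y) to suggest store sites.
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Two obvious neighbourhoods of delivery addresses (synthetic)
pts = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centers, labels = kmeans(pts, k=2)
```

Each of the two resulting centers lands in the middle of one neighbourhood, which is exactly where the document's example would place a new store.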
Machine Learning for Non-technical People - indico data
Machine learning is one of the most promising and most difficult to understand fields of the modern age. Here are the slides from Slater Victoroff's (CEO of indico) talk at General Assembly Boston for non-technical folks on how to separate the signal from the noise -- stay tuned for the next time he speaks:
https://generalassemb.ly/education/machine-learning-for-non-technical-people
The newest buzzword after Big Data is AI. From Google Search to Facebook Messenger bots, AI is everywhere.
• Machine learning has gone mainstream. Organizations are trying to build competitive advantage with AI and Big Data.
• But, what does it take to build Machine Learning applications? Beyond the unicorn data scientists and PhDs, how do you build on your big data architecture and apply Machine Learning to what you do?
• This talk will discuss technical options to implement machine learning on big data architectures and how to move forward.
This document provides an introduction to machine learning. It discusses key machine learning concepts like supervised learning, unsupervised learning, reinforcement learning, batch learning, online learning, instance-based learning, and model-based learning. It also discusses applications of machine learning like spam filtering, clustering, and anomaly detection. Machine learning algorithms like artificial neural networks and deep learning are also introduced. The document aims to explain machine learning concepts and techniques in a clear and intuitive manner using examples.
An overview of TensorFlow, followed by a walkthrough of how to use this library within the H2O platform. TensorFlow is an open-source deep learning framework used by Google and DeepMind. #h2ony
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
These presentation slides were delivered by Imron Zuhri at the Seminar & Workshop on the Introduction and Potential of Big Data & Machine Learning, organized by KUDO on 14 May 2016.
Mapping (big) data science (15 Dec 2014) - for university and graduate students - Han Woo PARK
This document discusses big data mapping and issues. It begins with definitions and characteristics of big data, including volume, velocity, variety, variability and complexity. It then covers the background of data science and trends in big data research and development. Finally, it addresses social issues and implications related to big data, including potential divides between developed and developing countries, academic and commercial researchers, and those with and without computational skills.
Materials for getting started with data science - ihansel
This document provides an overview of free resources for learning data science. It recommends the Data Science Handbook, DataCamp for learning coding skills, and An Introduction to Statistical Learning for algorithms and concepts. It also recommends Kaggle for data science projects and competitions and notes that it allows working on datasets in a virtual machine without installing software. The resources are aimed at beginners and provide reviews of books, websites, MOOCs and podcasts for learning data science.
This document provides an introduction to machine learning. It discusses machine learning background, including the differences between artificial intelligence, machine learning, and deep learning. It also covers machine learning algorithms, applications, and how machine learning works. Example machine learning techniques discussed include classification using k-nearest neighbors, naive Bayes, and decision trees, as well as clustering with k-means.
The document discusses putting "magic" into data science. It provides several tricks or techniques for data science, including collecting novel data sources, dimensionality reduction, Bayesian methods, bootstrapping statistics, and matrix factorizations. It also emphasizes the importance of reliability, latency/interactivity, simplicity/modularity, and unexpectedness to solve the "last mile" problem of getting people to actually use data science tools and models. Specific Facebook tools like Planout, Deltoid, ClustR, Prophet, and Hive/Presto/Scuba are presented as examples.
Human in the loop: a design pattern for managing teams working with ML - Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but refers exceptions to human experts, whose decisions help improve new iterations of the models.
Introduction to Data Science and Large-scale Machine Learning - Nik Spirin
This document is a presentation about data science and artificial intelligence given by James G. Shanahan. It provides an outline that covers topics such as machine learning, data science applications, architecture, and future directions. Shanahan has over 25 years of experience in data science and currently works as an independent consultant and teaches at UC Berkeley. The presentation provides background on artificial intelligence and machine learning techniques as well as examples of their successful applications.
AI, Machine Learning and Deep Learning - The Overview - Spotle.ai
The deck takes you on a fascinating journey through Artificial Intelligence, Machine Learning and Deep Learning, dissecting how they are connected and how they differ. Supported by illustrative case studies, the deck is your ready reckoner on the fundamental concepts of AI, ML and DL.
Explore more videos, masterclasses with global experts, projects and quizzes on https://spotle.ai/learn
HackerEarth is pleased to announce its next session to help you understand what it really takes to become a data scientist.
Agenda of this session will include answers to the following questions:
- Why is it the best time to take up Data Science as a career?
- How can you take the first step in Data Science? (After all, the first step is always the hardest!)
- How can you become better and progress fast?
- How is life after becoming a Data Scientist?
Speaker:
Jesse Steinweg-Woods is soon to be a Senior Data Scientist at tronc, where he will work on recommender systems for articles and on understanding customer behavior. Previously, he worked at Argo Group Insurance on new pricing models that took advantage of machine learning techniques. He received his PhD in Atmospheric Science from Texas A&M University, where his research focused on numerical weather and climate prediction.
Mahout and Distributed Machine Learning 101 - John Ternent
The document provides an introduction to machine learning with Mahout. It discusses machine learning concepts and algorithms like clustering, classification, and recommendation. It introduces Hadoop as a framework for distributed processing of big data and Mahout as an open-source library for machine learning algorithms on Hadoop. The document demonstrates how to run recommendation algorithms and clustering algorithms using Mahout on local machines or cloud platforms like Amazon EC2 and EMR. It also discusses preprocessing text data and classifiers.
This was part of my inaugural lecture for the Summer Internship on Machine Learning at NMAM Institute of Technology, Nitte, on 7th June 2018. A lot more than what is in this presentation was discussed: the ethics of the choices we make as developers, the socio-cultural impact of AI and ML, and the political repercussions of deploying them.
This document provides an introduction to machine learning and data science. It discusses key concepts like supervised vs. unsupervised learning, classification algorithms, overfitting and underfitting data. It also addresses challenges like having bad quality or insufficient training data. Python and MATLAB are introduced as suitable software for machine learning projects.
This document provides an overview of deep learning, machine learning, and artificial intelligence. It defines artificial intelligence as efforts to automate intellectual tasks normally performed by humans. Machine learning involves training systems using examples rather than explicit programming. Deep learning uses successive layers of representations in neural networks to transform input data into more useful representations. It has achieved near-human level performance on tasks like image classification and speech recognition. While popular, deep learning is not always the best approach and other machine learning methods exist.
This document provides an overview of deep learning and neural networks. It begins with definitions of machine learning, artificial intelligence, and the different types of machine learning problems. It then introduces deep learning, explaining that it uses neural networks with multiple layers to learn representations of data. The document discusses why deep learning works better than traditional machine learning for complex problems. It covers key concepts like activation functions, gradient descent, backpropagation, and overfitting. It also provides examples of applications of deep learning and popular deep learning frameworks like TensorFlow. Overall, the document gives a high-level introduction to deep learning concepts and techniques.
Machine Learning is a fascinating field that has been making headlines for its incredible advancements in recent years. Whether you're a tech enthusiast or just curious about how machines can learn, this article will provide you with a simple and easy-to-understand overview of some key Machine Learning concepts. Think of it as your first step towards a Machine Learning Complete Course!
The document provides an overview of concepts and topics to be covered in the MIS End Term Exam for AI and A2 on February 6th 2020, including: decision trees, classifier algorithms like ID3, CART and Naive Bayes; supervised and unsupervised learning; clustering using K-means; bias and variance; overfitting and underfitting; ensemble learning techniques like bagging and random forests; and the use of test and train data.
In a world of data explosion, where the rate of data generation and consumption keeps increasing, comes the buzzword: Big Data.
Big Data is the concept of fast-moving, large-volume data arriving in varying dimensions and from highly unpredictable sources.
The 4Vs of Big Data
● Volume - Scale of Data
● Velocity - Analysis of Streaming Data
● Variety - Different forms of Data
● Veracity - Uncertainty of Data
With increasing data availability, the new trend in the industry demands not just collecting data but making sense of the acquired data - thereby, the concept of Data Analytics.
Taking it a step further, to make futuristic predictions and realistic inferences - the concept of Machine Learning.
A blend of both gives a robust analysis of data for the past, the present and the future.
There is a thin line between Data Analytics and Machine Learning which becomes very obvious when you dig deep.
This document provides an introduction to machine learning, including definitions, examples of tasks well-suited to machine learning, and different types of machine learning problems. It discusses how machine learning algorithms learn from examples to produce a program or model, and contrasts this with hand-coding programs. It also briefly covers supervised vs. unsupervised vs. reinforcement learning, hypothesis spaces, regularization, validation sets, Bayesian learning, and maximum likelihood learning.
The document discusses machine learning and learning agents in three main points:
1. It defines machine learning and discusses different types of machine learning tasks like supervised, unsupervised, and reinforcement learning.
2. It explains the key differences between traditional machine learning approaches and learning agents, noting that learning is one of many goals for agents and must be integrated with other agent functions.
3. It discusses different challenges of integrating machine learning into intelligent agents, such as balancing learning with recall of existing knowledge and addressing time constraints on learning from the environment.
Machine learning is a technology designed to build intelligent systems. These systems have the ability to learn from past experience or to analyze historical data, and they provide results according to that experience.
Alpaydin defines Machine Learning as:
“Optimizing a performance criterion using example data and past experience”.
Data is the key concept of machine learning. We apply its algorithms to data to identify hidden patterns and gain insights. These patterns and this gained knowledge help systems learn and improve their performance.
Machine learning involves both statistics and computer science. Statistics allows one to draw inferences from the given data, while computer science provides the means to implement efficient algorithms, represent the required model, and evaluate its performance.
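Alpaydin's definition above ("optimizing a performance criterion using example data") in its simplest concrete form: fit a line to example points by minimizing squared error. The data below is synthetic and NumPy-only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0  # examples generated by a hidden rule the system must learn

# The performance criterion: sum((y - (a*x + b))**2), minimized over a and b
a, b = np.polyfit(x, y, deg=1)
print(round(a, 2), round(b, 2))  # recovers slope 2.0 and intercept 1.0
```

Every method in this document, from decision trees to deep networks, is a more elaborate version of this optimize-a-criterion-over-examples loop.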
How data science works and how customers can help - Danko Nikolic
The document discusses how CSC creates specialized models for customers through data science. It explains that textbooks oversimplify real-world data modeling, and that data scientists create customized models rather than just applying existing ones. Specialized model architectures require less data and training than general ones. The customer can help data scientists develop specialized architectures by understanding their business needs, explaining the data generation process, formulating hypotheses, and providing domain experts for consultation. CSC provides data science expertise to develop specialized models that can achieve excellent results for customers.
Hacking Predictive Modeling - RoadSec 2018 - HJ van Veen
This document provides an overview of machine learning and predictive modeling techniques for hackers and data scientists. It discusses foundational concepts in machine learning like functionalism, connectionism, and black box modeling. It also covers practical techniques like feature engineering, model selection, evaluation, optimization, and popular Python libraries. The document encourages an experimental approach to hacking predictive models through techniques like brute forcing hyperparameters, fuzzing with data permutations, and social engineering within data science communities.
The document discusses machine learning and data science concepts. It begins with an introduction to machine learning and the machine learning process. It then provides an overview of select machine learning algorithms and concepts like bias/variance, generalization, underfitting and overfitting. It also discusses ensemble methods. The document then shifts to discussing time series, functions for manipulating time series, and laying the foundation for time series prediction and forecasting. It provides examples of applying techniques like median filtering to smooth time series data. Overall, the document provides a high-level introduction and overview of key machine learning and time series concepts.
This document provides an overview of machine learning. It begins with an introduction and definitions, explaining that machine learning allows computers to learn without being explicitly programmed by exploring algorithms that can learn from data. The document then discusses the different types of machine learning problems including supervised learning, unsupervised learning, and reinforcement learning. It provides examples and applications of each type. The document also covers popular machine learning techniques like decision trees, artificial neural networks, and frameworks/tools used for machine learning.
Week 4: advanced labeling, augmentation and data preprocessing - Ajay Taneja
This document provides an overview of advanced machine learning techniques for data labeling, augmentation, and preprocessing. It discusses semi-supervised learning, active learning, weak supervision, and various data augmentation strategies. For data labeling, it describes how semi-supervised learning leverages both labeled and unlabeled data, while active learning intelligently samples data and weak supervision uses noisy labels from experts. For data augmentation, it explains how existing data can be modified through techniques like flipping, cropping, and padding to generate more training examples. The document also introduces the concepts of time series data and how time ordering is important for modeling sequential data.
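The simple augmentation operations mentioned above (flipping, cropping, padding) applied to a toy 4x4 "image", using only NumPy; the array stands in for real image data:

```python
import numpy as np

img = np.arange(16).reshape(4, 4)  # toy 4x4 image

flipped = np.fliplr(img)                                            # horizontal flip
cropped = img[1:3, 1:3]                                             # central 2x2 crop
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)  # 1-pixel border

print(flipped.shape, cropped.shape, padded.shape)
```

Each transform yields a new, label-preserving training example from an existing one, which is the whole point of augmentation.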
Lecture related to machine learning. Here you can read multiple things.
This document provides an overview of machine learning concepts from the first lecture of an introduction to machine learning course. It discusses what machine learning is, examples of tasks that can be solved with machine learning, and key concepts like supervised vs. unsupervised learning, hypothesis spaces, searching hypothesis spaces, generalization, and model complexity.
This document provides an overview of machine learning. It begins with an introduction and discusses the basics, types (supervised, unsupervised, reinforcement learning), technologies, applications, and vision for the next few years. Key points covered include definitions of machine learning, examples of applications (search engines, spam filters, personalized recommendations), and descriptions of different problem types (classification, regression, clustering) and learning approaches (decision trees, neural networks, Bayesian methods).
This document provides an overview of machine learning. It discusses the history of machine learning beginning in 1957 with the first neural network. It defines machine learning as using algorithms and statistical models to perform tasks without explicit instructions by learning from patterns in data. Supervised learning uses labeled training data to guide model training, and is used for classification and regression problems. Unsupervised learning finds patterns in unlabeled data using clustering. Reinforcement learning involves an agent interacting with an environment and receiving rewards or penalties to learn the best outcomes. Popular machine learning software libraries and applications are also mentioned.
Presentation given at the Stockholm R useR Group (SRUG) meetup on Dec 6, 2016. Contains a general overview of deep learning, material on using Tensorflow in R etc.
Data analysis & integration challenges in genomics - mikaelhuss
Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
The document discusses issues and pitfalls with using publicly available RNA-seq data. It was presented by Mikael Huss from the SciLifeLab and Stockholm University at RNA-Seq Europe in Basel. Huss works with a team of bioinformaticians at SciLifeLab to analyze new sequencing data and put it into context with existing information to ensure the data makes sense. The presentation addresses how to evaluate RNA-seq data quality and compare it to array data.
Lecture given for the Data Mining course at Uppsala university in October 2013. The presentation talks about data analysis in the context of genomics, next-generation sequencing, metagenomics etc.
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ... - weiwchu
We recently discovered that models trained with large-scale speech datasets sourced from the web could achieve superior accuracy and potentially lower cost than traditionally human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, it automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling cost.
The Rise of Python in Finance, Automating Trading Strategies: _.pdf - Riya Sen
In the dynamic realm of finance, where every second counts, the integration of technology has become indispensable. Aspiring traders and seasoned investors alike are turning to coding as a powerful tool to unlock new avenues of financial success. In this blog, we delve into the world of Python live trading strategies, exploring how coding can be the key to navigating the complexities of the market and securing your path to prosperity.
I’m excited to finally share my research from last year on the hypnotic effects of mass media and digital platformization. This study explores how our attention is influenced through YouTube’s audio-visual content. Key points:
- **Objective:** Examine the hypnotic side effects of media on attention.
- **Focus:** Sound and visual experiences on YouTube.
- **Methodology:** Mixed digital approach with quantitative and qualitative analysis.
- **Findings:** Observations on techniques in attention-based economies and their cognitive impact.
- **Implications:** Considerations for future research in media and mind interactions, especially within OSINT-oriented communities.
Curious about the details? Check out my slide deck and let’s discuss the future possibilities.
#Research #AttentionEconomy #YouTube #DigitalMedia #MediaStudies #VisualNetworkAnalysis #HypnodelicMedia
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408 - Grant McAlister
With an innovative architecture that decouples compute from storage and advanced features like Global Database and low-latency read replicas, Amazon Aurora reimagines what it means to be a relational database. Aurora is a modern database service offering unparalleled performance and high availability at scale with full open source MySQL and PostgreSQL compatibility. In this session, dive deep into the most exciting new features Aurora offers, including Aurora I/O-Optimized, Aurora zero-ETL integration with Amazon Redshift, and Aurora Serverless v2. Learn how the addition of the pgvector extension allows for the storage of vector embeddings and support of vector similarity searches for generative AI.
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataSamuel Jackson
We present our work to improve data accessibility and performance for data-intensive tasks within the fusion research community. Our primary goal is to develop services that facilitate efficient access for data-intensive applications while ensuring compliance with FAIR principles [1], as well as adoption of interoperable tools, methods and standards.
The major outcome of our work is the successful creation and deployment of a data service for the MAST (Mega Ampere Spherical Tokamak) experiment [2], leading to substantial enhancements in data discoverability, accessibility, and overall data retrieval performance, particularly in scenarios involving large-scale data access. Our work follows the principles of Analysis-Ready, Cloud Optimised (ARCO) data [3] by using cloud optimised data formats for fusion data.
Our system consists of a query-able metadata catalogue, complemented with an object storage system for publicly serving data from the MAST experiment. We will show how our solution integrates with the Pandata stack [4] to enable data analysis and processing at scales that would have previously been intractable, paving the way for data-intensive workflows running routinely with minimal pre-processing on the part of the researcher. By using a cloud-optimised file format such as zarr [5] we can enable interactive data analysis and visualisation while avoiding large data transfers. Our solution integrates with common python data analysis libraries for large, complex scientific data such as xarray [6] for complex data structures and dask [7] for parallel computation and lazily working with larger that memory datasets.
The incorporation of these technologies is vital for advancing simulation, design, and enabling emerging technologies like machine learning and foundation models, all of which rely on efficient access to extensive repositories of high-quality data. Relying on the FAIR guiding principles for data stewardship not only enhances data findability, accessibility, and reusability, but also fosters international cooperation on the interoperability of data and tools, driving fusion research into new realms and ensuring its relevance in an era characterised by advanced technologies in data science.
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016) https://doi.org/10.1038/sdata.2016.18
[2] M Cox, The Mega Amp Spherical Tokamak, Fusion Engineering and Design, Volume 46, Issues 2–4, 1999, Pages 397-404, ISSN 0920-3796, https://doi.org/10.1016/S0920-3796(99)00031-9
[3] Stern, Charles, et al. "Pangeo forge: crowdsourcing analysis-ready, cloud optimized data production." Frontiers in Climate 3 (2022): 782909.
[4] Bednar, James A., and Martin Durant. "The Pandata Scalable Open-Source Analysis Stack." (2023).
[5] Alistair Miles (2024) ‘zarr-developers/zarr-python: v2.17.1’. Zenodo. doi: 10.5281/zenodo.10790679
[6] Hoyer, S. & Hamman, J., (20
2. Questions to answer
1. What is meant by “machine learning” and “deep learning”?
2. Is deep learning with neural networks the best solution for most problems nowadays, or what else is there to use?
3. How much theory do I need to get started with my app or service?
4. How do I get from idea to trained and deployed model? … and things to consider
4. Machine learning is a way to make advanced statistical models using math. It’s a way to make a computer guess.
Machine learning models are fantastic with access to good data. However, machine learning models can’t perform magic.
Garbage in = Garbage out
5. Types of machine learning (type of learning / description / example uses)
- Supervised learning: modelling a specified target/output variable; each example is “labelled”. Classification (into categories) or regression (on a numerical target). Examples: predict if a person will default on their loan; model the sale price of an apartment.
- Unsupervised learning: no target/control signal; find structure in data. Clustering. Example: divide movies or bands into genres from user data.
- Semi-supervised learning: a mix of the above; only part of the input data is labelled. Example: as for supervised learning but with incomplete labelling.
- Reinforcement learning: a system explores its environment, takes actions and receives rewards; no explicit control signal. Learning by doing. Example: teach software to play a game (checkers, Atari Breakout, …).
- Active learning: techniques to select the training examples that the algorithm could learn the most from at a given time. Example: as for supervised learning but trying to optimize the learning process.
6. Deep learning is a subset of ML
“Deep learning” is largely a rebranding of “neural networks”! These in turn are just systems for fitting outputs to inputs by repeatedly applying linear transformations, each followed by a simple nonlinearity, until the transformed inputs match the outputs.
7. Basic idea of neural networks
The input data, your data points, are assigned to input “nodes” in an input “layer”. These are connected with weights to a “hidden” layer, which in turn is connected to the next hidden layer, or the output layer.
The values in the input layer are multiplied by a weight and the resulting products are summed to give an “activation value” in the hidden (or output) layer.
In the output layer, the activations are compared to the desired activations, and the weights are adjusted based on how big the mismatch is. This is the “learning” part.
Deep learning just means many layers like this, plus maybe more complicated patterns in how the weights are connected to inputs.
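The forward/compare/adjust loop described in this slide can be sketched in a few lines of NumPy. This is a tiny one-hidden-layer network learning XOR (a classic toy problem); the architecture and learning rate are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, which a single layer cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 8 units
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
losses = []
for _ in range(5000):
    # Forward pass: weighted sums followed by a nonlinearity
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(((out - y) ** 2).mean())
    # Backward pass: adjust weights in proportion to the mismatch
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))
```

With these settings the network typically ends up close to the target outputs [0, 1, 1, 0]; the point is only to show the mismatch-driven weight adjustment the slide describes.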
9. Elaborations on classical ML methods
- Random forests (decision tree ensembles)
- Gaussian process regression (good at handling uncertainty)
- Lasso regression (looking for simple models)
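A quick sketch of one method from the list above: lasso regression, which drives the weights of uninformative inputs toward zero and so tends to pick out a simple model. The data is synthetic; only the first two of five columns actually matter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two columns influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))  # the last three weights shrink to ~0
```

Reading the fitted coefficients immediately tells you which inputs the model considers relevant, which is exactly the "looking for simple models" property noted above.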
17. Applications where you probably don’t need deep learning
• “Tabular data”, i.e., you have a table with rows and columns, where the variables (columns) are a mix of numerical and categorical variables – often standard methods are enough
• You have a small number of training examples (e.g. a couple of hundred or less)
• When you want to create an easily interpretable model or just make a quick sanity check
• Often, end users are more interested in understanding which variables are important than in the model’s accuracy
… in other words, you will quite often not need it.
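For the tabular situations above, a small interpretable model often answers the real question: which variables matter? A sketch with a shallow decision tree on synthetic tabular data (column names and the generating rule are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300
income = rng.normal(50, 15, n)    # informative
age = rng.normal(40, 10, n)       # informative
shoesize = rng.normal(42, 3, n)   # irrelevant
X = np.column_stack([income, age, shoesize])
y = ((income > 50) & (age > 40)).astype(int)  # hidden rule to recover

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
for name, imp in zip(["income", "age", "shoesize"], tree.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

The feature importances expose the variables driving the prediction, and a depth-3 tree can be drawn on one slide for end users, which is usually what they actually want.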
19. How much math do I need?
Of course it is better to know some math/theory but, frankly, it is probably sufficient to have some intuition about how each method works.
If you know your maths, it is easier to implement models from papers or your own models, but existing frameworks are enough to do a lot already. There has never been a better time to get into machine learning!
- Incredible amounts of tutorials on e.g. Github and Medium
- MOOCs: Andrew Ng (Coursera and deeplearning.ai), Jeremy Howard (Fast.ai), cognitiveclass.ai
- Software frameworks such as scikit-learn for Python
- DSX tutorials & articles: https://apsportal.ibm.com/community
21. A way to practice: online contests
§ Largest online predictive modeling competition platform
§ Founded 2010. Acquired by Google 2016
§ Companies or organizations define problems and provide data; users compete for the best score. The winner gets a money prize or in some cases a job offer
22. • The leaderboard is motivating
• You can learn a lot from the discussion board
• Useful to learn and try out new techniques
• Learn not to overfit
24. 4 - How to go from idea to trained & deployed model?
25. Checklist for a machine learning idea
Understand the goal. What do you want to be able to predict or understand? Can it be measured in a good way? Do you have the data necessary to model it?
Actionability. What is the next step if you get a good predictive model? Can you use it? Are the variables that you use such that they can be easily adjusted? Will end users be able to act on the results?
Data quality. Can you extract the data in a good way? Are the data complete? Are there missing/suspicious values?
Training data size and shape. Do you have enough examples for training compared to the number of variables (dimensions)? Do you have “wide” or “long” data?
27. Tools: open-source vs. proprietary
Categories compared: data science tools, project collaboration, notebooks, model deployment.
Open-source examples: scikit-learn, Tensorflow, Keras, caret, mlbench, Shiny.
28. Deploying machine learning models
Easiest way? – Watson ML (today’s demo), or equivalents on Azure (Microsoft), CloudML (Google), ECS (Amazon) …
Tensorflow (as some others) has built-in serving capabilities (Tensorflow Serving).
Do-it-yourself web servers – often done using Flask (Python web server framework), or for language-independent model deployment, OpenScoring (uses PMML).
For non-production-grade deployment, you can use Shiny (R web app library), or Python equivalents: (Plotly) Dash, Bokeh, (IBM) PixieDust.
Yhat (https://www.yhat.com/products/scienceops) - a commercial model deployment solution that hooks directly into R or Python.
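The "do-it-yourself web server" route above is typically done with Flask; the same idea in miniature using only the Python standard library: wrap a model's predict function behind a small JSON endpoint. The linear scorer here is a hand-written stand-in for a real trained model.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stand-in for a real trained model: a hand-written linear scorer
    return {"score": 0.5 * features[0] + 0.25 * features[1]}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        payload = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Act as a client: POST features, get a prediction back as JSON
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}",
    data=json.dumps({"features": [2.0, 4.0]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result)

server.shutdown()
```

In Flask the handler shrinks to a decorated function, but the contract is identical: JSON features in, JSON prediction out, which is all a model-serving endpoint has to do.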
29. What was not covered in this talk
• Visualisation and exploratory data analysis (including dynamic data exploration apps)
• Details on how different ML models work
• Case studies
Maybe next time? :)