A presentation given at AWS re:Invent on how Netflix induces failure to validate and harden production systems. Technologies discussed include the Simian Army (Chaos Monkey, Gorilla, Kong) and our next-generation Failure Injection Test framework (FIT).
GameDay - Achieving resilience through Chaos Engineering (DiUS)
http://dius.com.au/resources/game-day/
Agility has brought us iterative software development, independent feature teams, nimble architectures and distributed, scalable infrastructure. But how do you maintain confidence in these systems in the face of this emergent complexity and fast-paced change? The answer is to anticipate and practice failure!
In this session we explore GameDays, a collaborative exercise where teams safely introduce chaos into their systems, in order to make them better.
Manual Monitoring Slows Deployment and Introduces Risk
How often do you update your applications?
“We deploy multiple times per day” seems to be the new badge of honor for DevOps.
But what you don’t often hear about are the problems caused by process acceleration as a result of continuous integration and continuous deployment (CI/CD).
• Rapid introduction of performance problems and errors
• Rapid introduction of new endpoints causing monitoring issues
• Lengthy root cause analysis as the number of services expands
When implementing CI/CD, ANY manual intervention slows down the entire pipeline. You can’t achieve complete CI/CD without automating your monitoring processes (just like you did for integration, testing, and deployment).
The document discusses concepts related to game days and chaos engineering on AWS. It provides examples of chaos experiments that can be conducted, such as resource exhaustion, network unreliability, and datastore saturation. It also discusses tools for chaos engineering like Chaos Toolkit and Simian Army. The goal of game days and chaos engineering is to test system resilience by simulating failures and disasters, gaining insight into how to improve system reliability.
Jan De Coster's Micro Focus Software Delivery and Testing presentation on the journey to DevOps, given at the recent Micro Focus #DevDay Copenhagen.
Micro Focus enables enterprise software organizations to build innovative software and accelerate application delivery to meet the needs of the business. Whatever the challenges and infrastructures, our core principle—of reusing what already works to minimize business risk while supporting modern software practices—has positioned our customers to be better prepared to support the digital transformation of the business.
Build, test and deliver innovative software faster with less risk.
April 2017.
The document discusses best practices for DevOps culture. It outlines 5 topics: 1) Train everyone on new DevOps tools and workflows, 2) Share and speak openly about projects, 3) Collaborate between development and operations teams and automate processes, 4) Prioritize building trust between teams with a focus on business services, and 5) Build a diverse project team with different skills including development, deployment, and testing. The document provides an overview of DevOps and examples of how companies like Amazon, Facebook, and Etsy implement DevOps practices.
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W... (ITSM Academy, Inc.)
Presenter: Perry Statham
SRE Squad Leader with IBM Cloud DevOps Services
In this presentation, the IBM DevOps Services SRE team will give a brief introduction to Site Reliability Engineering, then show how they adopted its principles in their existing enterprise organization.
SRE (service reliability engineer) on big DevOps platform running on the clou... (DevClub_lv)
SRE (service reliability engineer). This talk explains the SRE philosophy and the principles of production engineering and operations in the cloud.
(Language – English)
Pavlo is the ADOP (Accenture DevOps Platform) Service Reliability Team Lead and an SRE practitioner. He has more than 18 years of IT experience in Ops and Dev.
DevOps architecture involves three main categories of infrastructure: IT infrastructure (version control, issue tracking, etc.), build infrastructure (build servers with access to source code), and test infrastructure (deployment, acceptance, and functional testing). Continuous integration involves automating the integration of code changes, while continuous delivery ensures code is always releasable but actual deployment is manual. Continuous deployment automates deployment so that any code passing tests is immediately deployed to production. The document discusses infrastructure hosting options, automation approaches, common CI/CD workflows, and provides examples of low and medium-cost DevOps tooling setups using open source and proprietary software.
The document discusses modern software development approaches and best practices. It covers topics like popular programming languages, architectural patterns, team structures, and practices. Key points include: JavaScript, Python, and Java are the most popular languages; architectural approaches like monoliths, microservices, and serverless all have tradeoffs; modern teams include roles like SREs and benefit from practices like living documentation and mob programming; and factors like team maturity impact project success rates more than project size alone.
The document discusses effective test automation in DevOps. It begins with an introduction of the speaker and an overview of the topics to be covered. These include test automation in DevOps, common obstacles to automation success, and the pillars of effective test automation regarding scope, approach, and test environment and data management. The document emphasizes that continuous testing requires reliable automated tests, stopping production when tests fail, and developing in small batches. It also outlines challenges around test environments and data availability hindering automation goals.
The Next Wave of Reliability Engineering (Michael Kehoe)
In 2018, Site Reliability Engineering (SRE) will turn 15 years old. Since Google's inception of the term SRE, companies across the world have adopted a new operations mindset along with automation, deployment and monitoring principles. Most of what SRE does now is well established throughout the industry, so what is the next wave of reliability principles and automation frameworks?
This session will dive into what the future holds for reliability engineering as a field and what will be the next areas of investment and improvement for reliability teams.
Overcoming Common Product Backlog Management Traps — David Pereira at the 54....Stefan Wolpers
How teams manage their Product Backlog often makes or breaks their value creation chances. Poor backlog management leads to a feature factory trap, while a mindful strategy enables the team to drive value steadily.
Over the years, David has identified common traps teams often face and learned how to overcome them the hard way. Let David help you identify such challenges and help you overcome them, too.
This document discusses site reliability engineering (SRE) for growing organizations. SRE focuses on production automation, resiliency and scalability, similar to devops but with more emphasis on keeping systems running. As companies grow, expectations often outpace capacity and complexity increases, requiring more automation rather than personnel to maintain high uptime levels. A dedicated SRE team can improve reaction times, learn from incidents, raise awareness of system behaviors, and focus on forward-looking improvements rather than just keeping existing systems running. Key SRE practices include automated monitoring, log indexing, health checks, establishing service level objectives and agreements, and implementing self-healing systems and runbooks.
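The service level objectives mentioned in this summary are often paired with an "error budget", the amount of unavailability an SLO tolerates over a window. The following is a minimal arithmetic sketch, not taken from the presentation; the function name `error_budget_minutes` and the 30-day window are assumptions made for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO.

    `slo` is a fraction such as 0.999; `window_days` is an assumed
    30-day rolling window, not a value from the presentation.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


budget = error_budget_minutes(0.999)
print(f"A 99.9% SLO over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions a 99.9% objective leaves roughly 43 minutes per 30 days, which gives a team a concrete number to weigh against planned changes and experiments.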
The document discusses the growth of Site Reliability Engineering (SRE) at Squarespace from a team of 2 people in New York to a global organization with teams in New York, Portland, and Dublin. It describes how the initial SRE team focused on three pillars: monitoring and alerting, configuration management, and builds and deploys. It then explains how the SRE organization expanded to include additional teams focused on areas like provisioning, release engineering, developer productivity, and observability while also embedding SREs within product teams.
This document discusses observability and its three pillars: logs, metrics, and traces. It introduces common observability tools like Elastic Stack, Prometheus, and Jaeger. Logs should be aggregated and indexed, metrics can use recording rules and alerting, and traces enable root cause analysis. Best practices include monitoring components, testing configurations, and retaining sufficient log data. Observability provides insight into systems from external outputs and context about internal states.
The document introduces the Agile Testing Manifesto, which was created in 2013 to summarize key learnings from a conference talk about Agile Testing. It presents principles of the manifesto including that testing is an ongoing activity rather than a separate phase, preventing bugs rather than just finding them, testers helping build the best system rather than trying to break it, and quality being a team responsibility rather than just the tester's job. The document then asks questions to check understanding and provides contact information for any other questions.
The document discusses chaos engineering and building system resiliency. It defines chaos engineering as experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions in production. It advocates breaking systems on purpose through failure injection experiments to discover weaknesses before they cause problems. The document provides examples of chaos engineering experiments at different levels including application, host, network, and region failures. It also covers principles for ensuring system reliability through approaches like infrastructure as code, immutable infrastructure, auto-scaling, and non-blocking architectures.
Whether you are a professional trainer or trying to bring agility to your organization or team, you no doubt have encountered the difficulty in conveying agile values and principles. Learning practices and techniques is easy in comparison; you learn by doing. But how do you teach a philosophy or mindset? How do you 'do' a value?
Through trial and error, through hundreds of classes, through training thousands of agile practitioners, we have put together a set of best practices (and not-so-best practices) for delivering powerful agile learning experiences. Participants in this session will walk away with a toolkit they can put to use the next day. The toolkit will include scenario simulations, learning games, discussion generators, reinforcement exercises, student patterns, common pitfalls, and other activities to help you get out of the way and let the learning happen.
Links to the activities we did during the presentation:
'What Were They Thinking?' game: http://tastycupcakes.org/2009/06/what-were-they-thinking/
'Pocket-Sized Principles' activity: http://tastycupcakes.org/2010/01/pocket-sized-principles/
'Presto Manifesto' activity: http://tastycupcakes.org/2009/06/presto-manifesto/
Application Networks: Microservices and APIs at Netflix (MuleSoft)
Who better to talk about microservices — one of the hottest technology trends for 2016 — than Netflix? This streaming-entertainment giant began adopting them in 2009, years before the exact term even existed. Join MuleSoft and Netflix as they co-present the value that a microservices architecture can bring to your business, and see first hand the real-world implementation of APIs at Netflix. Then learn from MuleSoft’s CTO how APIs and DevOps are two important pillars of microservices and discover how they can become part of your application network.
Mastering Chaos - A Netflix Guide to Microservices (Josh Evans)
QConSF 2016 Abstract:
By embracing the tension between order and chaos and applying a healthy mix of discipline and surrender Netflix reliably operates microservices in the cloud at scale. But every lesson learned and solution developed over the last seven years was born out of pain for us and our customers. Even today we remain vigilant as we evolve our service architecture. For those just starting the microservices journey these lessons and solutions provide a blueprint for success.
In this talk we’ll explore the chaotic and vibrant world of microservices at Netflix. We’ll start with the basics - the anatomy of a microservice, the challenges around distributed systems, and the benefits realized when integrated operational practices and technical solutions are properly leveraged. Then we’ll build on that foundation exploring the cultural, architectural, and operational methods that lead to microservice mastery.
Fred George describes his personal journey discovering microservice architecture over 15 years working on large software projects. He details how his projects evolved from monolithic 1 million line applications to small, independent services. This allowed for improved agility, with services being short-lived and able to deploy several times a day. George also discusses challenges faced and lessons learned around loosely coupling services, managing data across services, and establishing practices for a "living software" system with continuous deployment of services.
Just about all of my current technical content in one 364 slide mega-deck. Source files at https://github.com/adrianco/slides
Sections on:
Scene Setting
State of the Cloud
What Changes?
Product Processes
Microservices
State of the Art
Segmentation
What’s Missing?
Monitoring
Challenges
Migration
Response Times
Serverless
Lock-In
Teraservices
Wrap-Up
Netflix receives 2 billion requests per day to its API from users and makes 12 billion outbound requests from its personalization engine to power recommendations. The personalization engine uses data on users, movies, ratings, reviews, and similar movies to conduct A/B tests and has experienced 30 times growth over two years. The document requests feedback on the presentation and conference.
MicroServices at Netflix - challenges of scale (Sudhir Tonse)
Microservices at Netflix have evolved over time from a single monolithic application to hundreds of fine-grained services. While this provides benefits like independent delivery, it also introduces complexity and challenges around operations, testing, and availability. Netflix addresses these challenges through tools like Hystrix for fault tolerance, Eureka for service discovery, Ribbon for load balancing, and RxNetty for asynchronous communication between services.
Beyond DevOps - How Netflix Bridges the Gap (Josh Evans)
Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. Simultaneously improving service quality and enabling rapid, continuous change seems impossible on the surface.
At Netflix, Operations Engineering is a centralized organization whose charter is to accomplish just that by applying high-leverage software engineering practices like continuous delivery, real-time analytics, and automation to solve operational problems. It's well established that many traditional IT Operations teams struggle to bridge the gap with software engineering. Operations Engineering is no exception. And while DevOps as a construct seeks to address this gap, it doesn't go far enough. It does not explain how to bridge the gap or even why it's important to do so.
In this talk we’ll use Netflix Operations Engineering as a case study to address these questions. We'll explore common challenges faced by operational teams and strategies to overcome them.
This document discusses patterns for scaling unstable systems. It outlines several principles: embrace failures and useful complexity; understand horizontal and vertical scaling patterns; know how to measure performance; have full stack developers; destroy statefulness; use asynchronous communication and event sourcing; and plan edge architectures. The goal is to establish a culture and technologies that enable systems to scale in unstable conditions.
Slide deck from my talk about "Principles of Chaos Engineering" at the first ever Chaos Engineering Hamburg meetup.
Come join us at http://www.meetup.com/Chaos-Engineering-Hamburg/ and stay up to date with new events and other news.
An overview of historical and current fault-injection techniques used to test operating systems.
Additionally, questions that might trigger future work are presented.
Web Scale Applications using NetflixOSS Cloud Platform (Sudhir Tonse)
Web Scale Applications using NetflixOSS Cloud Platform. Infographics on IaaS, PaaS, SaaS. Commandments of developing a cloud-based distributed application.
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S... (Dianne Marsh)
Netflix uses continuous delivery practices powered by open source tools to deploy code rapidly and reliably across multiple AWS regions. Teams deploy their own code using tools like Nebula/Gradle and Jenkins Job DSL for automated builds. The Aminator creates AMIs and Asgard deploys them using red/black deployment. Simian Army monkeys like Chaos Monkey test resiliency. Self-service, awareness of regions, and rollback ability are key to Netflix's approach.
This document summarizes Netflix's journey to building a globally ubiquitous and failure-resilient architecture. It describes how Netflix evolved from a single data center architecture to a multi-region active-active design using microservices, Cassandra for data storage, EVCache for caching, and virtual DNS regions for traffic management. The architecture is designed to reliably serve customers from any region by replicating data and traffic across regions and implementing failover mechanisms.
How Netflix tests in production to augment more traditional testing methods. This talk covers the Simian Army (Chaos Monkey & friends), code coverage in production, and canary testing.
Engineering Netflix Global Operations in the Cloud (Josh Evans)
Delivered at re:Invent 2015.
Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever-increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, operations engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth and how they integrate and create competitive advantages.
The Journey of Chaos Engineering Begins with a Single Step (Bruce Wong)
PagerDuty Summit 2016
Presenters: Bruce Wong, James Burns
https://www.pagerduty.com/pagerduty-summit-2016/
Heard of Netflix's Chaos Engineering & the Simian Army? Google's legendary DiRT exercises? Hear about how Twilio is getting started on its journey with Chaos Engineering. This talk is the story of how Twilio got started with Chaos Engineering, lessons learned, and the impact on our engineering culture.
Release the Monkeys! Testing in the Wild at Netflix (Gareth Bowles)
This document discusses Netflix's use of "chaos monkeys" to deliberately cause failures in their systems to test resiliency. The chaos monkeys include Chaos Monkey which terminates instances, Chaos Gorilla which simulates an availability zone outage, and Chaos Kong which simulates a full region outage. The monkeys help validate redundancy, improve designs to avoid failures, and ensure systems can handle degradation without affecting other services. The chaos testing is released as open source and helps Netflix understand how systems will behave during random failures.
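The core idea of an instance-terminating monkey can be sketched in a few lines. The example below is a toy simulation against an in-memory fleet, not Netflix's implementation; `Fleet`, `release_monkey`, and the instance ids are hypothetical names, and a real tool would call a cloud provider API instead of mutating a set.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a group of service instances."""
    instances: set = field(default_factory=set)

    def terminate(self, instance_id):
        # Simulates losing one instance; nothing real is shut down here.
        self.instances.discard(instance_id)


def release_monkey(fleet, rng):
    """Terminate one randomly chosen instance and return its id.

    Returns None when the fleet is already empty.
    """
    if not fleet.instances:
        return None
    # Sort first so the choice depends only on the seeded generator.
    victim = rng.choice(sorted(fleet.instances))
    fleet.terminate(victim)
    return victim


fleet = Fleet({"i-a", "i-b", "i-c"})
victim = release_monkey(fleet, random.Random(7))
print(victim, "terminated;", len(fleet.instances), "instances remain")
```

Passing the random generator in as a parameter keeps the sketch reproducible in tests while leaving the victim unpredictable in normal use.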
This document summarizes Ajay Vaddadi's work on developing an automated fault-tolerance testing tool called ScrewDriver at Groupon. It discusses (1) the need for fault tolerance testing due to economic losses from outages, (2) ScrewDriver's main components like the Controller, Capsule, and Topology Translation Engine, and (3) next steps to test ScrewDriver extensively on Groupon services and eventually open source it.
DevOps for the Enterprise: Automated Testing and Monitoring (Amazon Web Services)
This document summarizes an episode of a DevOps webinar series about enabling business agility through automated testing and monitoring. It discusses using AWS CloudFormation to automatically create test environments and AWS OpsWorks for automated deployments. This allows for on-demand test environments. It also discusses using CloudWatch alarms to monitor for failures, simulating extreme situations for crisis preparation, and replaying network activity and failures to test systems. The importance of validation, debriefing, and testing assumptions in a production-like environment is emphasized.
Isep m2 m - iot - course 1 - update 2013 - 09122013 - part 3 - v(0.7) (Thierry Lestable)
This document provides a 3-part summary of Internet of Things (IoT) topics, including market trends, technology roadmaps and standards, and cloud computing applications. It discusses convergence of WiFi and cellular networks, smart grid and smart vehicle use cases, and cloud-based services like gaming, TV, and storage. Standardization efforts by groups like ETSI, 3GPP, and the GSC are reviewed. Open issues regarding architecture, governance, interoperability, and neutrality are also covered.
(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:... (Amazon Web Services)
Complex distributed systems fail. They fail more frequently, and in different ways, as they scale and evolve over time. In this session, you learn how Netflix embraces failure to provide high service availability. Netflix discusses their motivations for inducing failure in production, the mechanics of how Netflix does this, and the lessons they learned along the way. Come hear about the Failure Injection Testing (FIT) framework and suite of tools that Netflix created and currently uses to induce controlled system failures in an effort to help discover vulnerabilities, resolve them, and improve the resiliency of their cloud environment.
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02~Eric Principe
This document discusses Netflix's approach to embracing failure through fault injection testing. Netflix has over 50 million members across 50+ countries and streams over 1 billion hours of content per month. To ensure high availability, Netflix designs its complex distributed systems for failure by implementing exception handling, fault tolerance, fallbacks, auto-scaling, and redundancy. However, testing failures at such a large scale is challenging. Netflix developed several "monkey" tools that randomly inject failures like instance reboots or availability zone outages to validate system resilience. More advanced tools like Failure Injection Testing (FIT) allow simulating specific failure scenarios by injecting errors or latency at various points. This helps Netflix continuously validate assumptions and discover issues to further improve availability.
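Injecting errors or latency at a call site, and checking that a fallback absorbs the failure, can be illustrated with a small wrapper. This is a sketch under stated assumptions and does not reflect the FIT framework's actual API; `with_fault_injection`, the `scenario` keys `delay_s` and `fail`, and `fetch_ratings` are all hypothetical.

```python
import time


class InjectedFailure(Exception):
    """Raised when a simulated failure is injected."""


def with_fault_injection(func, scenario):
    """Wrap func so a scenario dict can add latency or force an error.

    Hypothetical scenario keys: 'delay_s' (seconds of added latency)
    and 'fail' (True to raise instead of calling func).
    """
    def wrapper(*args, **kwargs):
        delay = scenario.get("delay_s", 0.0)
        if delay:
            time.sleep(delay)
        if scenario.get("fail", False):
            raise InjectedFailure("injected failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_ratings(user_id):
    # Stand-in for a call to a downstream dependency.
    return {"user": user_id, "ratings": [5, 4]}


def ratings_with_fallback(call, user_id):
    # Fallback path: degrade to an empty result instead of propagating.
    try:
        return call(user_id)
    except InjectedFailure:
        return {"user": user_id, "ratings": []}


scenario = {"fail": True}
faulty_fetch = with_fault_injection(fetch_ratings, scenario)
result = ratings_with_fallback(faulty_fetch, 42)
print(result)
```

Because the wrapper reads `scenario` on every call, an experiment can be switched on and off at runtime without redeploying the wrapped function.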
(ISM301) Engineering Netflix Global Operations In The Cloud (Amazon Web Services)
- Netflix faced two operational challenges of accelerating innovation while sustaining quality at growing scale and complexity.
- Netflix adopted an approach of operational excellence through continuous improvement of operations management, design, and function to achieve greater quality and velocity.
- Netflix practices operations engineering by applying software engineering practices to operations to achieve operational excellence through automation, modular components, tools, and services.
If you need to build a highly performant, mission-critical, microservice-based system following DevOps best practices, you should definitely check out Service Fabric!
Service Fabric is one of the most interesting services Azure offers today. It provides unique capabilities that outperform competitor products.
We are seeing global companies start to use Service Fabric for their mission critical solutions.
In this talk we explore the current state of Service Fabric and dive deeper to highlight best practices and design patterns.
We will cover the following topics:
• Service Fabric Core Concepts
• Cluster Planning and Management
• Stateless Services
• Stateful Services
• Actor Model
• Availability and reliability
• Scalability and performance
• Diagnostics and Monitoring
• Containers
• Testing
• IoT
Live broadcast on https://www.youtube.com/watch?v=Zuxfhpab6xo
Arc305 how netflix leverages multiple regions to increase availability an i... (Ruslan Meshenberg)
Learn how to make your services more resilient and available by embracing principles of isolation and redundancy. See details of 2 projects - Isthmus and Active/Active to learn how Netflix architects for availability in a multi-regional environment.
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |... (Amazon Web Services)
This session explains how Netflix is using the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows for rapid independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance managing change to (or subtraction from) the customer experience, while aggressively scraping barnacle features that add complexity for little value.
Migrating IBM i Systems to the Cloud: Exploring the Pros and Cons (Precisely)
In today's dynamic IT landscape, businesses running IBM Power Systems are continuously seeking ways to optimize their infrastructure and leverage cutting-edge technologies. Migrating IBM i and AIX workloads to the cloud has emerged as a compelling option for many organizations, offering a host of potential benefits. However, it is crucial to carefully weigh the pros and cons of cloud migration before making a decision.
In this webinar, we'll delve into the intricate world of IBM i cloud migration, equipping you with the knowledge to make an informed choice.
Join us for this webcast to hear about:
• The compelling advantages of migrating to the cloud
• The potential challenges of migration
• Recommended best practices for undertaking a migration
Expect the unexpected: Prepare for failures in microservices (Bhakti Mehta)
My talk at Confoo 2016 Montreal
It is well said that "The more you sweat on the field, the less you bleed in war". Failures are an inevitable part of complex systems. Accepting that failures happen will help you design the system's reactions to specific failures.
This talk covers best practices for building resilient, stable and predictable services: preventing cascading failures, the timeouts pattern, the retry pattern, circuit breakers, and many more techniques in microservices.
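Of the patterns listed in this abstract, the circuit breaker is the easiest to misread, so a minimal sketch may help. It is not taken from the talk; the class name, the `max_failures` and `reset_after_s` parameters, and the injectable `clock` are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of piling load on a failing dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Reset window elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        return result
```

After `max_failures` consecutive errors the breaker rejects calls immediately until `reset_after_s` has passed, then lets a single trial call through; one more failure reopens it, while a success restores normal operation. This is what keeps one slow dependency from cascading into its callers.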
“Microservices” have become a trendy development strategy. Hosting and running such services used to be pretty painful... but here comes Service Fabric! Let’s take a closer look at this platform, its different development models and all the features it offers, and not only for microservices!
A brief intro to microservice patterns and strategies.
This is a presentation from the series "by Developer for Developers" powered by eSolutions Grup.
You can find the practical example at https://github.com/eSolutionsGrup/microshop
Service Stampede: Surviving a Thousand Services (Anil Gursel)
How many services do you have? 5, 10, 100? How do you even run a large number of services? A microservice may be relatively simple. But services also mean distributed systems, which are inherently complex. 5 services are complex. A thousand services across many generations are at least 200 times as complex. How do we deal with such complexity?
This talk discusses service architecture at Internet scale, the need for larger transaction density, larger horizontal and vertical scale, more predictable latencies under stress, and the need for standardization and visibility. We’ll dive into how we build our latest generation service infrastructure based on Scala and Akka to serve the needs of such a large scale ecosystem.
Lastly, have the cake and eat it too. No, we’re not keeping all the goodies only to ourselves. They are all there for you in open source.
There are two main types of scaling: vertical scaling involves upgrading hardware on a single server, while horizontal scaling involves adding more servers. Before scaling, strategies like caching and optimizing queries can improve performance. When scaling databases, options include sharding data across multiple servers and using master-slave replication. Server scaling with microservices breaks applications into independent, communicating components. General tips for scaling include monitoring resources, auto-scaling with metrics, and killing instances to reduce costs when possible.
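Sharding data across multiple servers needs a rule that maps each key to a shard. One simple option, shown here as a sketch rather than as the document's method, is to hash the key and take it modulo the shard count; `shard_for` and the example keys are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard index in [0, num_shards).

    Uses a stable hash (SHA-256) so the mapping is the same across
    processes and restarts, unlike Python's built-in hash() for str.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Group a few example keys by the shard they would land on.
placement = {}
for key in ["user:1", "user:2", "user:3", "user:4"]:
    placement.setdefault(shard_for(key, 4), []).append(key)
print(placement)
```

The trade-off of modulo sharding is that changing `num_shards` remaps most keys; schemes such as consistent hashing exist to limit that movement.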
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au... (Amazon Web Services)
Running your Amazon EC2 instances in Auto Scaling groups allows you to improve your application's availability right out of the box. Auto Scaling replaces impaired or unhealthy instances automatically to maintain your desired number of instances (even if that number is one). You can also use Auto Scaling to automate the provisioning of new instances and software configurations as well as to keep track of usage and costs by app, project, or cost center. Of course, you can also use Auto Scaling to adjust capacity as needed - on demand, on a schedule, or dynamically based on demand. In this session, we show you a few of the tools you can use to enable Auto Scaling for the applications you run on Amazon EC2. We also share tips and tricks we've picked up from customers such as Netflix, Adobe, Nokia, and Amazon.com about managing capacity, balancing performance against cost, and optimizing availability.
This tutorial gives a brief and interesting introduction to modern stream computing technologies. Participants can learn the essential concepts and methodologies for designing and building an advanced stream processing system. The tutorial unveils the key fundamentals behind various kinds of design choices. Some forecasts of technology developments in this domain are also introduced in the last section of the tutorial.
Moderator:
Chris Grundemann, Network Automation Forum
Speakers:
Jeff Loughridge, Konekti Systems
Mark Ciecior, Carrier Access IT
William Collins, Alkira
Netflix uses a microservices architecture and immutable infrastructure approach. It loads content across multiple AWS regions for high availability and scales services dynamically. Netflix employs techniques like caching, adaptive streaming, and content delivery networks to optimize the user experience of streaming video globally to over 140 million subscribers.
High Availability in the Cloud - Architectural Best PracticesRightScale
RightScale Webinar: The April 21st Amazon service disruption in the US East Region caused many to revisit application architectures to better withstand failures. With cloud infrastructure as a level playing field, we all have effectively the same building blocks and it’s up to each of us to balance cost and complexity against the risk of outages. Fortunately, there are many simple approaches in the cloud that dramatically improve application scalability and availability with little incremental cost.
Embracing Failure - Fault Injection and Service Resilience at Netflix
2. Netflix Ecosystem
• ~50 million members, ~50 countries
• > 1 billion hours per month
• > 1000 device types
• 3 AWS Regions, hundreds of services
• Hundreds of thousands of requests/second
• CDN serves petabytes of data at terabits/second
[Diagram: Internet; Static Content (Akamai); Netflix CDN; Service Partners; AWS/Netflix Control Plane]
5. Failures can happen any time
• Disks fail
• Power outages
• Natural disasters
• Software bugs
• Human error
6. We design for failure
• Exception handling
• Fault tolerance and isolation
• Fall-backs and degraded experiences
• Auto-scaling clusters
• Redundancy
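As a rough illustration of the "fall-backs and degraded experiences" bullet above, the sketch below wraps a call in a timeout and returns a degraded result when the call is slow or raises. The function name `call_with_fallback` and the thread-pool approach are assumptions made for this example, not something the slides describe.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool for the guarded calls (size is an arbitrary choice).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(primary, fallback, timeout_s):
    """Run primary(); if it times out or raises, return fallback() instead."""
    future = _pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both the timeout and any error raised by primary().
        # Note: a timed-out primary keeps running in its worker thread.
        return fallback()
```

A caller might use `call_with_fallback(fetch_personalized_rows, default_rows, 0.2)`, where both callables are hypothetical, so that a slow dependency yields a generic result rather than an error.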
7. Testing for failure is hard
• Web-scale traffic
• Massive, changing data sets
• Complex interactions and request patterns
• Asynchronous, concurrent requests
• Complete and partial failure modes
• Constant innovation and change
8. What if we regularly inject failures into our systems under controlled circumstances?
10. Blast Radius
• Unit of isolation
• Scope of an outage
• Scope a chaos exercise
[Diagram: Instance, Zone, Region, Global]
11. An Instance Fails
[Diagram: Edge Cluster, Cluster A, Cluster B, Cluster C, Cluster D]
12. Chaos Monkey
• Monkey loose in your DC
• Run during business hours
• What we learned
– Auto-replacement works
– State is problematic
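The idea on this slide can be sketched in a few lines: pick one instance at random, but only during business hours when engineers are around to respond. This is a minimal illustration, not the actual Chaos Monkey code; the 9:00-17:00 weekday window, the function names, and the `terminate` callback are all assumptions.

```python
import random
from datetime import datetime, time
from typing import Callable, List, Optional

# Hypothetical window; the slide only says "business hours".
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on weekdays between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(instances: List[str], now: datetime,
                rng: random.Random) -> Optional[str]:
    """Return one instance id to terminate, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


def run_once(instances: List[str], now: datetime, rng: random.Random,
             terminate: Callable[[str], None]) -> Optional[str]:
    """Terminate at most one randomly chosen instance; return its id."""
    victim = pick_victim(instances, now, rng)
    if victim is not None:
        terminate(victim)  # caller supplies the real termination call
    return victim
```

Passing the clock, the random source, and the termination call as arguments keeps the sketch testable without touching real infrastructure.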
13. A State of Xen - Chaos Monkey & Cassandra
Out of our 2700+ Cassandra nodes
• 218 rebooted
• 22 did not reboot successfully
• Automation replaced failed nodes
• 0 downtime due to reboot
15. Chaos Gorilla
Simulate an availability zone outage
• 3-zone configuration
• Eliminate one zone
• Ensure that others can handle the load and nothing breaks
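Before eliminating a zone, it can help to check on paper whether the remaining zones could carry peak load. The sketch below does only that arithmetic; the capacity and load figures are made-up units, and `survives_zone_loss` is a hypothetical helper, not part of the Simian Army.

```python
from typing import Dict


def survives_zone_loss(capacity_by_zone: Dict[str, float],
                       peak_load: float) -> Dict[str, bool]:
    """For each zone, report whether the other zones can absorb peak_load."""
    total = sum(capacity_by_zone.values())
    return {
        zone: (total - capacity) >= peak_load
        for zone, capacity in capacity_by_zone.items()
    }


# Three equal zones sized so that any two can serve the peak.
report = survives_zone_loss({"zone-a": 100, "zone-b": 100, "zone-c": 100}, 180)
```

A static check like this only covers raw capacity; the exercise itself is what surfaces the connection-draining and configuration problems.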
16. Challenges
• Rapidly shifting traffic
– LBs must expire connections quickly
– Lingering connections to caches must be addressed
• Service configuration
– Not all clusters auto-scaled or pinned
– Services not configured for cross-zone calls
– Mismatched timeouts – fallbacks prevented fail-over
18. Chaos Kong
[Diagram: geo-located customer devices routed to two regions, each with Regional Load Balancers, Zuul (Traffic Shaping/Routing), AZ1/AZ2/AZ3 and Data]
19. Challenges
● Rapidly shifting traffic
○ Auto-scaling configuration
○ Static configuration/pinning
○ Instance start time
○ Cache fill time
20. Challenges
● Service Configuration
○ Timeout configurations
○ Fallbacks fail or don’t provide the desired experience
● No minimal (critical) stack
○ Any service may be critical!
22. Services Slow Down and Fail
Latency Monkey: simulate latent/failed service calls
• Inject arbitrary latency and errors at the service level
• Observe for effects
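Injecting latency and errors at the service level can be pictured as a wrapper around a service call. The decorator below is an illustrative sketch only; `inject`, `InjectedFailure`, and the parameter names are assumptions and do not reflect how Latency Monkey is implemented.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised in place of the real call when an error is injected."""


def inject(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a call so it is delayed by latency_s and fails at error_rate."""
    rng = rng or random.Random()

    def decorator(call):
        @wraps(call)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulated slow dependency
            if rng.random() < error_rate:
                raise InjectedFailure(f"injected failure in {call.__name__}")
            return call(*args, **kwargs)
        return wrapper
    return decorator
```

Decorating a client method with `@inject(latency_s=0.5, error_rate=0.1)` would then let callers be observed under a slow, flaky dependency.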
24. Challenges
• Startup resiliency is an issue
• Service owners don’t know all dependencies
• Fallbacks can fail too
• Second order effects not easily tested
• Dependencies are in constant flux
• Latency Monkey tests function and scale
– Not a staged approach
– Lots of opt-outs
31. FIT Details
● Common Simulation Syntax
● Single Simulation Interface
● Transported via HTTP request header
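One way to picture header-based transport: the failure scenario travels with the request, and each injection point checks the header before doing real work. In the sketch below the header name `fit.failure` comes from the slides, but the plain-JSON encoding and every function and class name are assumptions for illustration; FIT's actual serializer format is not reproduced here.

```python
import json

FIT_HEADER = "fit.failure"  # header name as shown on the slides


class SimulatedFailure(Exception):
    """Raised when the request's scenario targets this injection point."""


def encode_scenario(name, injection_points):
    """Serialize a scenario as JSON (a stand-in for FIT's own format)."""
    return json.dumps({"name": name, "injectionPoints": sorted(injection_points)})


def check_injection_point(headers, point):
    """Fail the call if the request header names this injection point."""
    raw = headers.get(FIT_HEADER)
    if raw is None:
        return  # ordinary request, nothing to inject
    scenario = json.loads(raw)
    if point in scenario.get("injectionPoints", []):
        raise SimulatedFailure(f"{scenario['name']}: failing {point}")


def call_downstream(headers, point, call):
    """Check the injection point, then forward the same headers downstream."""
    check_injection_point(headers, point)
    return call(headers)
```

Because the scenario rides on the request itself, only requests carrying the header are affected, which keeps the blast radius of a test small.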
32. Integrating Failure
request
[sendRequestHeader] >>fit.failure: 1|fit.Serializer|2|[[{"name":"failSocial","whitelist":false,"injectionPoints":["SocialService"]},{}]],{"Id":"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
[Diagram: request from Service A (Filter, Service, Ribbon) to Service B (Filter, Service, Ribbon), then response; labels ServerRcv, ClientSend, ServerRcv]
33. Failure Scenarios
● Set of injection points to fail
● Defined based on
○ Past outages
○ Specific dependency interactions
○ Whitelist of a set of critical services
○ Dynamic tracing of dependencies
34. FIT Insights: Salp
● Distributed tracing inspired by Dapper paper
● Provides insight into dependencies
● Helps define & visualize scenarios
35. Dialing Up Failure
Functional Validation
● Isolated synthetic transactions
○ Set of devices
Validation at Scale
● Dial up customer traffic - % based
● Simulation of full service failure
Chaos!
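Percentage-based dial-up is often done by hashing a stable identifier into a bucket, so the same customers stay in the experiment as the percentage grows. The sketch below shows that general technique; it is an assumption about how such a dial could work, not a description of FIT's implementation, and `in_experiment` and `salt` are hypothetical names.

```python
import hashlib


def in_experiment(customer_id: str, percent: float, salt: str = "fit") -> bool:
    """Deterministically place roughly percent% of customers in the test."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=1 selects buckets 0..99
```

Raising `percent` from 1 to 5 only adds customers; nobody already in the experiment drops out, which makes results comparable across steps.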
38. Take-aways
• Don’t wait for random failures
– Cause failure to validate resiliency
– Remove uncertainty by forcing failures regularly
– Better to fail at 2pm than 2am
• Test design assumptions by stressing them
Embrace Failure
39. The Simian Army is part of the Netflix open source cloud platform
http://netflix.github.com
40. Netflix talks at re:Invent
Talk | Time | Title
BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix
PFC-306 | Wednesday, 3:30pm | Performance Tuning EC2
DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale
PFC-304 | Wednesday, 4:30pm | Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems
APP-310 | Friday, 9:00am | Scheduling using Apache Mesos in the Cloud