Posts Tagged ‘cloud’

Chaos Engineering: Breaking Things on Purpose

Written by Brooks Canavesi on November 5, 2017. Posted in Blog, Mobile App Development, Software & App Sales, Technology trends

Modern distributed systems, especially within the realm of cloud computing, have become so complex and unpredictable that it’s no longer feasible to reliably identify all the things that can go wrong. From bad configuration pushes to hardware failures to sudden surges in traffic with unexpected results, the number of possible failures is too large for flawless distributed systems to exist. If perfection is unattainable, what else is there to strive for? Resiliency.

“The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100 percent uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link,” explains Netflix.

To know with certainty that a failure of an individual component won’t affect the availability of the entire system, it’s necessary to experience the failure in practice, preferably in a realistic and fully automated manner. When a system has been tested against a sufficient number of failures, and all the discovered weaknesses have been addressed, such a system is very likely resilient enough to survive use in production.

“This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact,” says Netflix, whose early experiments with resiliency testing in production have given birth to a new discipline in software engineering: Chaos Engineering.

What Is Chaos Engineering?

The core idea behind Chaos Engineering is to break things on purpose to discover and fix weaknesses. Netflix, a pioneer in the field of automated failure testing and the company that originally formalized the discipline in the Principles of Chaos Engineering, defines Chaos Engineering as the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos Engineering acknowledges that we live in an imperfect world where things break unexpectedly and often catastrophically. Knowing this, the most productive decision we can make is to accept this reality and focus on creating quality products and services that are resilient to failures.

Mathias Lafeldt, a professional infrastructure developer who’s currently working remotely for Gremlin Inc., says, “Building resilient systems requires experience with failure. Waiting for things to break in production is not an option. We should rather inject failures proactively in a controlled way to gain confidence that our production systems can withstand those failures. By simulating potential errors in advance, we can verify that our systems behave as we expect—and to fix them if they don’t.”

In doing so, we’re building systems that are antifragile, which is a term borrowed from Nassim Nicholas Taleb’s 2012 book titled “Antifragile: Things That Gain from Disorder.” Taleb, a Lebanese-American essayist, scholar, statistician, former trader, and risk analyst, introduces the book by saying, “Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”

On his blog, Lafeldt gives another example of antifragility, “Take the vaccine—we inject something harmful into a complex system (an organism) in order to build an immunity to it. This translates well to our distributed systems where we want to build immunity to hardware and network failures, our dependencies going down, or anything that might go wrong.”

Just like with vaccination, the exposure of a system to volatility, randomness, disorder, and stressors must be carried out in a well-thought-out manner that won’t wreak havoc on the system should something go wrong. Automated failure testing should ideally start with the smallest possible impact that can still teach something and gradually become more impactful as the tested system becomes more resilient.

The Five Principles of Chaos Engineering

“The term ‘chaos’ evokes a sense of randomness and disorder. However, that doesn’t mean Chaos Engineering is something that you do randomly or haphazardly. Nor does it mean that the job of a chaos engineer is to induce chaos. On the contrary: we view Chaos Engineering as a discipline. In particular, we view Chaos Engineering as an experimental discipline,” state Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”

In their book, the authors propose the following five principles of Chaos Engineering:

Hypothesize About Steady State

The Systems Thinking community uses the term “steady state” to refer to a property that the system tends to maintain within a certain range or pattern. In terms of failure testing, the normal operation of the tested system is the system’s steady state, and we can determine what constitutes normal based on a number of metrics, including CPU load, memory utilization, network I/O, how long it takes to service web requests, and how much time is spent in various database queries.

“Once you have your metrics and an understanding of their steady state behavior, you can use them to define the hypotheses for your experiment. Think about how the steady state behavior will change when you inject different types of events into your system. If you add requests to a mid-tier service, will the steady state be disrupted or stay the same? If disrupted, do you expect the system output to increase or decrease?” ask the authors.
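To make the idea of a steady-state hypothesis more concrete, here is a minimal Python sketch. It is not taken from the book: the function names, the three-standard-deviation tolerance, and the latency samples are all assumptions chosen for illustration. It derives a “normal” range from baseline measurements and checks whether the samples observed during an experiment stay inside it.

```python
from statistics import mean, stdev

def steady_state_band(baseline, tolerance_sigmas=3.0):
    """Return the (low, high) range treated as normal for one metric."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - tolerance_sigmas * spread, centre + tolerance_sigmas * spread

def hypothesis_holds(baseline, observed, tolerance_sigmas=3.0):
    """True if every sample observed during the experiment stays in the band."""
    low, high = steady_state_band(baseline, tolerance_sigmas)
    return all(low <= sample <= high for sample in observed)

# Hypothetical request latencies in milliseconds (made-up numbers).
baseline_latency_ms = [102, 98, 105, 99, 101, 97, 103, 100]
during_experiment_ms = [104, 101, 99, 106, 102]
during_outage_ms = [104, 180, 240, 310, 290]

print(hypothesis_holds(baseline_latency_ms, during_experiment_ms))  # True
print(hypothesis_holds(baseline_latency_ms, during_outage_ms))      # False
```

A real setup would read these values from a monitoring system and would probably watch several metrics at once, but the principle stays the same: define “normal” first, then check whether the injected event pushes the system outside of it.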

Vary Real-World Events

Suitable events for a chaos experiment include all events that are capable of disrupting steady state. This includes hardware failures, functional bugs, state transmission errors (e.g., inconsistency of states between sender and receiver nodes), network latency and partitions, large fluctuations in input (up or down) and retry storms, resource exhaustion, unusual or unpredictable combinations of inter-service communication, Byzantine failures (e.g., a node believing it has the most current data when it actually does not), race conditions, malfunctioning downstream dependencies, and others.
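Two of the events listed above, added latency and a failing downstream dependency, are easy to simulate in code. The following Python sketch is only an illustration under stated assumptions: `inject_faults`, `InjectedFailure`, and `lookup_price` are hypothetical names, not part of any real chaos tool.

```python
import random
import time

class InjectedFailure(Exception):
    """Stands in for a real downstream error during an experiment."""

def inject_faults(call, rng, failure_rate=0.0, added_latency_s=0.0, sleep=time.sleep):
    """Wrap `call` so that invocations are delayed and some of them fail."""
    def wrapped(*args, **kwargs):
        if added_latency_s > 0:
            sleep(added_latency_s)  # simulated network latency
        if rng.random() < failure_rate:
            raise InjectedFailure("injected dependency failure")
        return call(*args, **kwargs)
    return wrapped

def lookup_price(item):
    # Hypothetical downstream dependency.
    return {"apple": 3, "pear": 4}[item]

# Roughly 30% of calls to the wrapped dependency will now fail.
flaky_lookup = inject_faults(lookup_price, random.Random(7), failure_rate=0.3)

failures = 0
for _ in range(1000):
    try:
        flaky_lookup("apple")
    except InjectedFailure:
        failures += 1
print(failures)
```

The interesting part of an experiment is not the wrapper itself but what the calling code does when the wrapped dependency misbehaves: whether it retries, falls back, or fails along with it.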

“Only induce events that you expect to be able to handle! Induce real-world events, not just failures and latency. While the examples provided have focused on the software part of systems, humans play a vital role in resiliency and availability. Experimenting on the human-controlled pieces of incident response (and their tools!) will also increase availability,” warn the authors.

Run Experiments in Production

Chaos Engineering prefers to experiment directly on production traffic to guarantee both authenticity of the way in which the system is exercised and relevance to the currently deployed system. This goes against the commonly held tenet of classical testing, which strives to identify problems as far away from production as possible. Naturally, one needs to have a lot of confidence in the tested system’s resiliency to the injected events. Known weaknesses indicate that the system is not yet mature enough, and they need to be addressed before conducting any Chaos Engineering experiments.

“When we do traditional software testing, we’re verifying code correctness. We have a good sense about how functions and methods are supposed to behave, and we write tests to verify the behaviors of these components. When we run Chaos Engineering experiments, we are interested in the behavior of the entire overall system. The code is an important part of the system, but there’s a lot more to our system than just code. In particular, state and input and other people’s systems lead to all sorts of system behaviors that are difficult to foresee,” write the authors.

Automate Experiments to Run Continuously

Automation is a critical pillar of Chaos Engineering. Chaos engineers automate the execution of experiments, the analysis of experimental results, and sometimes even aspire to automate the creation of new experiments. That said, one-off manual experiments are a good place to start with failure testing. After a few batches of carefully designed manual experiments, the next natural level we can aspire to is their automation.

“The challenge of designing Chaos Engineering experiments is not identifying what causes production to break, since the data in our incident tracker has that information. What we really want to do is identify the events that shouldn’t cause production to break, and that have never before caused production to break, and continuously design experiments that verify that this is still the case,” write the authors, emphasizing what to pay attention to when designing automated experiments.

Minimize Blast Radius

It’s important to realize that each chaos experiment has the potential to cause real damage. The difference between a badly designed chaos experiment and a well-designed chaos experiment is in the blast radius. The most basic way to minimize the blast radius of any chaos experiment is to always have an emergency stop mechanism in place to instantly shut down the experiment in case it goes out of control. Chaos experiments should build upon each other by taking careful, measured risks that gradually escalate the overall scope of the testing without causing unnecessary harm.

“The entire purpose of Chaos Engineering is undermined if the tooling and instrumentation of the experiment itself cause an undue impact on the metric of interest. We want to build confidence in the resilience of the system, one small and contained failure at a time,” caution the authors in the book.
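The combination of gradual escalation and an emergency stop can be sketched in a few lines of Python. This is an illustration only: the function name, the traffic fractions, the 2% abort threshold, and the simulated error rates are assumptions, not values recommended by the authors.

```python
def run_with_escalation(stages, measure_error_rate, abort_threshold, emergency_stop):
    """Widen the blast radius stage by stage and stop at the first breach.

    stages: fractions of traffic to expose, smallest first.
    measure_error_rate: callable that takes a fraction and returns the
        error rate observed while that fraction of traffic is affected.
    emergency_stop: callable that shuts the experiment down.
    Returns (completed_stages, aborted).
    """
    completed = []
    for fraction in stages:
        if measure_error_rate(fraction) > abort_threshold:
            emergency_stop()  # halt immediately, do not escalate further
            return completed, True
        completed.append(fraction)
    return completed, False

# Hypothetical system that copes until more than 5% of traffic is affected.
def fake_error_rate(fraction):
    return 0.001 if fraction <= 0.05 else 0.08

stop_calls = []
completed, aborted = run_with_escalation(
    stages=[0.001, 0.01, 0.05, 0.25],
    measure_error_rate=fake_error_rate,
    abort_threshold=0.02,
    emergency_stop=lambda: stop_calls.append("stopped"),
)
print(completed, aborted)  # [0.001, 0.01, 0.05] True
```

The experiment learns something at every stage, and the stage that would have hurt a quarter of all traffic is cut short as soon as the error rate crosses the threshold.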

Chaos at Netflix

Netflix has been practicing some form of resiliency testing in production ever since the company began moving out of data centers into the cloud in 2008. The first Chaos Engineering tool to gain fame outside Netflix’s offices was Chaos Monkey, which is currently in version 2.0.

“Years ago, we decided to improve the resiliency of our microservice architecture. At our scale, it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning. If we don’t have proper redundancy and automation, these disappearing servers could cause service problems. The Freedom and Responsibility culture at Netflix doesn’t have a mechanism to force engineers to architect their code in any specific way. Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward. We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours,” explains Netflix.

The rate at which Chaos Monkey turns off servers is higher than the rate at which server outages happen normally, and Chaos Monkey is configured to turn off servers during business hours. Thus, engineers are forced to build resilient services through automation, redundancy, fallbacks, and other best practices of resilient design.
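The core behavior described here, picking a random server and turning it off only during business hours, can be illustrated with a short Python sketch. This is not Chaos Monkey’s actual code or configuration: the function names, the 9-to-15 weekday window, the instance IDs, and the rule of sparing groups with a single instance are all assumptions made for the example.

```python
import random
from datetime import datetime

def in_business_hours(now, start_hour=9, end_hour=15):
    """Weekdays only, from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def pick_victim(instances_by_group, now, rng):
    """Choose one (group, instance) to terminate, or None if nothing qualifies.

    Assumption of this sketch: only groups with at least two instances are
    eligible, so the last remaining instance of a service is never removed.
    """
    if not in_business_hours(now):
        return None
    eligible = {g: ids for g, ids in instances_by_group.items() if len(ids) >= 2}
    if not eligible:
        return None
    group = rng.choice(sorted(eligible))
    return group, rng.choice(eligible[group])

# Hypothetical fleet with made-up instance IDs.
fleet = {"api": ["i-01", "i-02", "i-03"], "billing": ["i-10"]}
rng = random.Random(42)

monday_morning = datetime(2017, 11, 6, 10, 30)
sunday_morning = datetime(2017, 11, 5, 10, 30)
print(pick_victim(fleet, monday_morning, rng))  # one of the "api" instances
print(pick_victim(fleet, sunday_morning, rng))  # None
```

Restricting terminations to business hours means that engineers are at their desks when something breaks, which is the point: the pain is brought forward to a time when it can be handled calmly.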

While previous versions of Chaos Monkey were additionally allowed to perform actions like burning up CPU and taking storage devices offline, Netflix uses Chaos Monkey 2.0 only to terminate instances. Chaos Monkey 2.0 is fully integrated with Netflix’s open source multi-cloud continuous delivery platform, Spinnaker, which is intended to make it easy to extend and enhance cloud deployment models. The integration with Spinnaker allows service owners to set their Chaos Monkey 2.0 configs through the Spinnaker apps, and it allows Chaos Monkey 2.0 to get information about how services are deployed from Spinnaker.

Once Netflix realized the enormous potential of breaking things on purpose to rebuild them better, the company decided to take things to the next level and move from the small scale to the very large scale with the release of Chaos Kong in 2013, a tool capable of testing how their services behave when a zone or an entire region is taken down. According to Nir Alfasi, a Netflix engineer, the company practices region outages using Kong almost every month.

“What we need is a way to limit the impact of failure testing while still breaking things in realistic ways. We need to control the outcome until we have confidence that the system degrades gracefully, and then increase it to exercise the failure at scale. This is where FIT (Failure Injection Testing) comes in,” stated Netflix in early 2014, after realizing that they needed a finer degree of control when deliberately breaking things than their existing tools allowed for at the time. FIT is a platform designed to simplify the creation of failure within Netflix’s ecosystem with a greater degree of precision. FIT also allows Netflix to propagate injected failures across its entire ecosystem in a consistent and controlled manner. “FIT has proven useful to bridge the gap between isolated testing and large-scale chaos exercises, and make such testing self-service.”

Once the Chaos Engineering team at Netflix believed that they had a good story at small scale (Chaos Monkey) and large scale (Chaos Kong) and in between (FIT), it was time to formalize Chaos Engineering as a practice, which happened in mid-2015 with the publication of the Principles of Chaos Engineering. “With this new formalization, we pushed Chaos Engineering forward at Netflix. We had a blueprint for what constituted chaos: we knew what the goals were, and we knew how to evaluate whether or not we were doing it well. The principles provided us with a foundation to take Chaos Engineering to the next level,” write Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”

The latest notable addition to Netflix’s Chaos Engineering family of tools is ChAP (Chaos Automation Platform), which was launched in late 2016. “We are excited to announce ChAP, the newest member of our chaos tooling family! Chaos Monkey and Chaos Kong ensure our resilience to instance and regional failures, but threats to availability can also come from disruptions at the microservice level. FIT was built to inject microservice-level failure in production, and ChAP was built to overcome the limitations of FIT so we can increase the safety, cadence, and breadth of experimentation,” said Netflix when introducing its new failure testing automation tool.

Although Netflix isn’t the only company interested in Chaos Engineering, their willingness to develop in the open and share with others has had a profound influence on the industry. Netflix engineers regularly speak at various industry events, and the company’s GitHub page contains a wealth of interesting open source projects that are ready for adoption.

Chaos Engineering is also being embraced by Etsy, Microsoft, Jet, Gremlin, Google, and Facebook, just to name a few. These and other companies have developed a comprehensive range of open source tools for different use cases. The tools include Simoorg (LinkedIn’s own failure inducer framework), Pumba (a chaos testing and network emulation tool for Docker), Chaos Lemur (a self-hostable application to randomly destroy virtual machines in a BOSH-managed environment), and Blockade (a Docker-based utility for testing network failures and partitions in distributed applications), among others.

Learn to Embrace Chaos

If you now feel inspired to embrace the above-described principles and tools to create your own Chaos Engineering experiments, you may want to adhere to the following Chaos Engineering experiment design process, as outlined in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”

Pick a hypothesis
- Decide what hypothesis you’re going to test and don’t forget that your system includes the humans that are involved in maintaining it.
Choose the scope of the experiment
- Strive to run experiments in production and minimize blast radius. The closer your test is to production, the more you’ll learn from the results.
Identify the metrics you’re going to watch
- Try to operationalize your hypothesis using your metrics as much as possible. Be ready to abort early if the experiment has a more serious impact than you expected.
Notify the organization
- Inform members of your organization about what you’re doing and coordinate with the teams who are interested in the outcome or are nervous about the impact of the experiment.
Run the experiment
- The next step is to run the experiment while keeping an eye on your metrics in case you need to abort it.
Analyze the results
- Carefully analyze the result of the experiment and feed the outcome of the experiment to all the relevant teams.
Increase the scope
- Once you gain confidence running smaller-scale experiments, you may want to increase the scope of an experiment to reveal systemic effects that aren’t noticeable with smaller-scale experiments.
Automate
- The more regularly you run your chaos experiments, the more value you can get out of them.
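Several of the steps above (pick a hypothesis, choose the scope, watch a metric, notify, run, abort if needed, analyze) can be tied together in a small Python sketch. It is a simplified illustration and not the authors’ tooling: the `Experiment` class, its field names, and the made-up metric readings are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Experiment:
    hypothesis: str
    scope: str
    inject: Callable[[], None]        # starts the failure
    revert: Callable[[], None]        # stops the failure again
    read_metric: Callable[[], float]  # metric that operationalizes the hypothesis
    abort_above: float                # abort when the metric exceeds this value
    log: List[str] = field(default_factory=list)

    def run(self, checks=5):
        """Run the experiment and return True if the hypothesis held."""
        self.log.append(f"notify: testing '{self.hypothesis}' on {self.scope}")
        self.inject()
        try:
            for _ in range(checks):
                value = self.read_metric()
                if value > self.abort_above:
                    self.log.append(f"abort: metric {value} exceeded {self.abort_above}")
                    return False
            self.log.append("result: hypothesis held")
            return True
        finally:
            self.revert()  # always undo the failure, even after an abort

# Hypothetical experiment with made-up error-rate readings (in percent).
state = {"failing": False}
readings = iter([0.4, 0.5, 0.6, 0.5, 0.4])

experiment = Experiment(
    hypothesis="error rate stays below 1% when one cache node is down",
    scope="one cache node",
    inject=lambda: state.update(failing=True),
    revert=lambda: state.update(failing=False),
    read_metric=lambda: next(readings),
    abort_above=1.0,
)
held = experiment.run()
print(held, state["failing"])  # True False
```

Increasing the scope and automating the process then amount to running the same experiment against a larger target and on a schedule, with the log being the material that gets fed back to the relevant teams.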

Since some degree of chaos and unpredictability is inevitable, why not embrace it? “The next step is to institutionalize chaos, perhaps by embracing Netflix’s open source Simian Army. But really [embracing Chaos Engineering] is not so much a matter of technology as it is culture. Telling your developers to expect and foster failure as a way to drive resilience into your cloud systems is a big step on the path to engineering in the 21st Century. Time to get started,” concludes Matt Asay in his article on the subject.

Conclusion

Chaos Engineering is a remarkably valuable discipline and practice that can help any business or organization build a resilient distributed system capable of withstanding the challenges and adversities it might face. Chaos Engineering can be performed at any scale and any level of automation. Despite its young age, Chaos Engineering has already changed how we think about failure testing, and thanks to companies such as Netflix there’s also a sizable range of Chaos Engineering tools available to anyone who would like to experience first-hand what Chaos Engineering has to offer.

Microsoft Azure and Xamarin: The Big Picture

Written by Brooks Canavesi on July 26, 2016. Posted in Uncategorized

The development of mobile and IoT (Internet of Things) applications often involves a lot of moving parts that need to be tightly integrated with one another for the whole system to perform at a sufficiently high level. Microsoft is set to help developers create scalable, performant, highly available, and cross-platform IoT services and applications with Azure and Xamarin – two names we are likely to hear a lot more about in the near future.

Microsoft Azure

First introduced by Microsoft in October 2008, Azure is a collection of integrated cloud services. It is part of a cloud services market that is expected to reach $555 billion in 2020, according to a report by Allied Market Research. The original name of the platform was Windows Azure, but Microsoft rebranded it as Microsoft Azure in April 2014 to emphasize its central position within the company.

It competes with other public cloud platforms, such as Amazon Web Services (AWS) and Google Cloud Platform, by providing a range of cloud services, including computing, analytics, storage, mobile, database, the web, and networking. The beauty of the platform is that everyone can pick and choose which services to use for development and deployment of a new application or as a support for existing applications and infrastructure. As such, some organizations use Azure as their data backup solution, and others as an alternative to their own data center.

Microsoft Azure has several advantages over investing in local servers and storage. Microsoft’s data centers are located in 22 regions across the globe, and, through their service level agreements, Microsoft guarantees at least 99.9% availability of the Azure Active Directory Basic and Premium services. The other important advantage is that Azure primarily uses a utility pricing model that charges customers based on what they actually use – just like an electricity supplier charges, for example, only when you turn on a light in your room.
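To put the 99.9% figure into perspective, it helps to convert it into the amount of downtime it still permits. The short Python sketch below does the arithmetic; the function name and the choice of a 30-day month and a 365-day year are assumptions made for the example, and an actual service level agreement defines its own measurement period.

```python
def allowed_downtime_minutes(availability_percent, days):
    """Minutes of downtime that still satisfy the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

per_month = allowed_downtime_minutes(99.9, days=30)
per_year = allowed_downtime_minutes(99.9, days=365)

print(f"{per_month:.1f} minutes per 30-day month")  # 43.2 minutes per 30-day month
print(f"{per_year / 60:.2f} hours per year")        # 8.76 hours per year
```

In other words, a service can be unavailable for roughly three quarters of an hour every month and still meet a 99.9% guarantee.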

However, there are also subscription-based models, with discounts for customers who are willing to commit to six months of use, and volume licensing models for enterprise customers. Azure compute costs 12 cents per service hour, and the company’s storage service costs 15 cents per GB of data per month.
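The utility pricing model is easy to illustrate with the two rates quoted above. The following Python sketch is a back-of-the-envelope calculation only: the usage figures are made up, and real Azure bills depend on many more factors (instance size, region, data transfer, and so on).

```python
# Rates quoted in the text, kept in whole cents to avoid rounding surprises.
COMPUTE_CENTS_PER_HOUR = 12
STORAGE_CENTS_PER_GB_MONTH = 15

def monthly_cost_cents(instance_hours, stored_gb):
    """Pay-per-use cost of one month of compute and storage, in cents."""
    return (instance_hours * COMPUTE_CENTS_PER_HOUR
            + stored_gb * STORAGE_CENTS_PER_GB_MONTH)

# Hypothetical usage: one instance running around the clock for 30 days
# versus the same instance running 8 hours on each of 22 working days.
always_on = monthly_cost_cents(instance_hours=24 * 30, stored_gb=100)
office_hours = monthly_cost_cents(instance_hours=8 * 22, stored_gb=100)

print(f"${always_on / 100:.2f}")     # $101.40
print(f"${office_hours / 100:.2f}")  # $36.12
```

The difference between the two totals is the electricity-bill effect in action: the instance that is switched off outside working hours simply stops costing money.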

There are 12 main categories of Azure services:

Compute – these services include virtual machines, large-scale parallel and batch computing jobs, containers, remote application access, and highly scalable cloud applications and APIs.
Web & Mobile – allows developers to create and deploy web and mobile applications for any platform and any device. Included are API management, scalable push notification infrastructure, reporting, and mobile engagement management.
Data Storage – takes care of SQL and NoSQL databases, as well as unstructured and cached cloud storage.
Intelligence – revolves around the Cortana Intelligence Suite, which is designed to help companies collect and manage huge chunks of data, extend applications with predictive and cognitive insights, and operationalize data science pipelines for iterative learning.
Analytics – Azure’s analytics services are an umbrella for many smaller parts that all deal with big data and insight generation. They include data lake analytics, HDInsight, machine learning, stream analytics, data factory, and others.
Networking – gives customers a way to easily provision private networks, route incoming traffic for high performance and availability, host a DNS domain in Azure, establish secure connectivity through VPN gateways, or take advantage of dedicated private network fiber connections to Azure.
Media & Content Delivery Network (CDN) – with the Azure Media Player, all audio and video files stored in the cloud can be automatically played on most popular devices. The content can be delivered securely with DRM technology, through Microsoft PlayReady and Widevine, or with Advanced Encryption Standard (AES) 128-bit clear-key encryption.
Hybrid Integration – provides a way to extend on-premises systems to the cloud for hybrid integration. Businesses can connect across private and public cloud environments and easily back up their data to the cloud.
Identity & Access Management (IAM) – is centered on the Azure Active Directory, which enables single sign-on to any cloud and on-premises web app and is preintegrated with Salesforce.com, Office 365, Box, and others.
Internet of Things (IoT) – helps capture, monitor and analyze IoT data from sensors and other devices. Real-time data streams from millions of IoT devices can be effortlessly processed and powerful cloud-based predictive analytics tools enable predictive maintenance.
Development – Azure lets developers build apps with JavaScript, Python, .NET, PHP, Java and Node.js. It can also be used to build back ends for iOS, Android, and Windows devices. Furthermore, Visual Studio Team Services lets teams share code, track work, and ship software in a single package.
Management & Security – these products are designed to help cloud administrators manage Azure deployment, schedule and run jobs, and facilitate automation.

Xamarin

The origin of Xamarin goes back to 2011, when Miguel de Icaza, the founder of Mono, announced on his blog that further development of Mono would be supported by a new company that planned to release a new suite of mobile products – that company was Xamarin.

Microsoft announced that they signed an agreement to acquire Xamarin on February 24, 2016. Although specific terms weren’t disclosed, the acquisition probably cost Microsoft between $400 million and $500 million, according to the Wall Street Journal.

“As the role of mobile devices in people’s lives expands even further, mobile app developers have become a driving force for software innovation. … As part of this commitment I am pleased to announce today that Microsoft has signed an agreement to acquire Xamarin, a leading platform provider for mobile app development,” said Scott Guthrie, an Executive Vice President of the Cloud and Enterprise group at Microsoft, in his blog post.

Guthrie described Xamarin as a rich mobile development platform that enables developers to build mobile apps using C# and deliver fully native mobile app experiences to all major devices – including iOS, Android, and Windows. The platform consists of a number of elements that allow you to develop applications for iOS and Android: the C# language, the Mono .NET framework, a compiler, and IDE tools.

Consequently, developers can write the same C# code for all platforms and offer a seamless experience despite the differences under the hood. Xamarin takes advantage of native UI toolkits but abstracts them, which makes the development process very similar to the early years of Java programming. Xamarin development can be done in either Xamarin Studio or Visual Studio, but developing iOS applications requires a Mac computer running Mac OS X.

The Significance of Azure and Xamarin for the IoT and Mobile Computing

Why are Azure and Xamarin important for the IoT and mobile computing? Because companies are looking for ways to improve their businesses by employing scalable, cost-effective solutions that meet the growing demand for virtualization services and multi-platform deployment. In the short term, this could lead to $200 billion worth of growth by 2018, according to market research firm Infonetics Research. IaaS is expected to grow from about $23 billion in 2014 to $34 billion in 2015, and PaaS is expected to grow from 13% of the total cloud revenue in 2013 to 16% in 2018. It’s no wonder that Microsoft and other companies see the tremendous opportunity right in front of them.

“The Xamarin acquisition will ensure people put Microsoft in to the equation,” says Hammond. “The reality is that 90 percent or more of the mobile market is iOS and Android. So Microsoft needs dev tool to target those platforms and grow its developer base. Now there is a single stream for Windows 10, tablet, iOS, Android and Windows Phone,” says Wes Miller, an analyst at Directions on Microsoft.

Microsoft is simply continuing the trend of making their applications accessible to as many users as possible. They recognize that the time when Windows was synonymous with computing of any kind is long gone. Xamarin allows developers to target any major current platform – all they need is to adopt Microsoft’s development tools and infrastructure.