A New Hope for IoT Security?
With the constantly rising number of connected devices also rises the number of IoT-based cyber-attacks, such as when the Mirai botnet launched one of the largest and most powerful distributed denial of service (DDoS) attacks on DNS provider Dyn and its customers, temporarily rendering services like Twitter, Reddit, and Spotify inaccessible.The situation has become so bleak that the US Congress even proposed an Internet of Things Cybersecurity Improvement Act to force the manufacturers of connected devices, such as webcams, printers, light bulbs, or home routers, to comply with what the regulators call minimal cybersecurity operational standards for IoT devices.
Currently, statistics show that there are approximately 20 billion connected devices worldwide, and some, such as ARM and SoftBank Chairman Masayoshi Son, expect a trillion connected devices by 2035. But unless IoT security fundamentally improves, we could be headed toward what can only be described as IoT apocalypse.
“These attacks have highlighted the very real need for better security measures to be implemented, throughout the value chain of connected devices, covering high-level infrastructure, such as energy supply and connected vehicles to low-cost devices, such as webcams and smart lighting. Breaches in security present a host of issues for those operating in the IoT. Leaks in confidential information, theft of personal data, a loss of control of connected systems and the shutting down of critical infrastructure, all represent major areas at risk,” states ARM on its community blog.
Considering how many IoT devices are built on the ARM architecture, which is known for its remarkable efficiency, and considering that the British multinational semiconductor manufacturer expects to have shipped 200 billion ARM-based chips by 2021, it’s easy to see why ARM might be interested in taking IoT security into their own hands to ease some of the concerns legislators and the general public already have.
Recently, almost exactly a year after Masayoshi Son announced his vision for a trillion connected devices by 2035 at Arm TechCon, ARM announced its open source Platform Security Architecture (PSA), which is described as an holistic set of threat models, security analyses, hardware and firmware architecture specifications intended to serve as a secure foundation for connected devices.
Some of the biggest names in the industry are already supporting PSA, including Google, Microsoft, Cisco, Vodafone, Symantec, SoftBank, and Alibaba, just to name a few.
According to Paul Williamson, vice president and general manager of IoT Device IP at ARM, “The growing number of devices being connected to the internet need to be secure without sacrificing the very diversity which make them innovative and unique. ARM chief system architect Andy Rose and his team made sure this was top of mind when developing PSA through analysis of devices and best practices for securing them.”
As such, PSA delivers hardware and firmware architecture specifications, built on key security principles, defining a best practice approach for designing endpoint devices and a reference open source implementation of the firmware specification, called Trusted Firmware-M, which is designed to work with the company’s ARMv8-M processor architecture. Trusted Firmware-M is scheduled for release in early 2018.
According to Naked Security, Trusted Firmware-M makes possible:
- A proper root of trust.
- A protected crypto keystore.
- Software isolation between trusted and untrusted processes.
- A way of securely updating firmware.
- Easy debugging down to chip level.
- A reliable cryptographic random number generator.
- On-chip acceleration to make crypto run smoothly.
Considering how many major industry players already stand behind ARM’s effort, it seems that the release of Trusted Firmware-M in early 2018 could be the tipping point that so many of those who have been preaching about the growing need for improved IoT security have been waiting for.
The last few years proved that IoT vendors cannot be relied on when it comes to securing their products as the entire world witnessed the consequences of poor security practices such as including weak default passwords in hardware or never releasing security updates to patch critical vulnerabilities.
ARM’s bottom-up approach to IoT security seems like the only reasonable way to go at this point, providing a strong incentive for IoT vendors to build their products using ARM’s cost-effective, scalable, easy-to-implement security framework.
“The value of the ARM ecosystem is to provide diversity and choice to end-customers, and this benefit extends to the IoT and its broad range of technologies and providers. ARM recognizes this potential, alongside the risks that threaten the devices, systems, and infrastructures operating within the IoT. PSA provides the common framework for the ecosystem, from chip designers and device developers, to cloud and network infrastructure providers and software vendors,” states ARM.
The Emerging Category of Invisible Devices
According to the International Data Corporation (IDC), 102.4 million wearable devices were shipped in 2016, and the wearable market is expected to double by 2021. But one wouldn’t have guessed that the popularity of wearables is on the rise after attending the 2017 GSMA Mobile World Congress, which is a combination of the world’s largest exhibition for the mobile industry and a conference featuring prominent executives representing mobile operators, device manufacturers, technology providers, vendors, and content owners from across the world.The winner of the Best Wearable category at MWC 2017 was the Huawei Watch 2, which is a sporty smartwatch aimed predominantly at men. While excellent in many ways, the Huawei Watch 2 isn’t exactly a groundbreaking departure from all other bulky sports smartwatches, which have been dominating the wearable world for the past few years.
The Huawei Watch 2 feels like a wearable piece of technology first and a watch second. Even people who are used to wearing a watch all the time might find it difficult to get used to the extra weight or the daily charging, not to mention the complicated synchronization with other devices. That might explain why fewer consumers use their wearables daily over time, as stated in a report from PwC. Wearables simply still feel too unnatural to become habitual or part of our daily routines.
“I don’t know about you, but when I’m adorning myself with wearable technology, I like it to feel natural. By that, I mean seamlessly integrated, so instead of being conscious that I’m wearing tech, I ideally want to reap the added bonus that it brings while not wearing any additional item: invisible, perhaps,” summarizes the problem Forbes contributor Lee Bell.
Thankfully, it seems that the wearable industry has recognized the need for seamlessly integrated wearables because a new category of smart gadgets is emerging, and the name of this category is “invisibles.”
As the name suggests, invisibles are so well-integrated that the users are not even aware of them unless they are actively taking advantage of the features they provide. Even though most invisibles that are currently available still focus mainly on the fitness market, there have already been several releases of invisible wearables that do other things besides heart rate monitoring.
Under Armour Gemini 3 RE Smart Shoes
The Under Armour Gemini 3 RE are smart shoes with a built-in motion sensor that tracks, analyzes, and stores virtually every running metric, synchronizing wirelessly with the UA MapMyRun smartphone application. The shoes can take advantage of a smartphone’s GPS sensor for even more comprehensive tracking, but they can also work on their own. Besides being packed with advanced electronic components, the Under Armour Gemini 3 RE also feature the innovative UA SpeedForm construction molds to the foot for a precision fit and the Threadborne midfoot panel for distinct style and enhanced ventilation. The shoes are available for $119.99 in two colors through Under Armour’s official website.Nokia Steel HR
Most modern smartwatches have one thing in common: they don’t look like traditional watches. They either fall into the same category of fitness-oriented products as the aforementioned Huawei Watch 2, or they go for the elegantly futuristic look of the Apple Watch. The Nokia Steel HR smartwatch is different, though. Featuring a traditional watch movement along with a small LCD display, the Steel HR shows all the information you might expect a smartwatch to show, including heart rate, number of steps, distance, calories burned, and alarm time, but without the annoying need for daily charging. The watch is also water resistant up to 50 meters and comes with a stylish black watchband that makes it look just as good when worn with jeans as it does with dress pants.JBL UA Sport Wireless Heart Rate Headphones
Under Armour have teamed up with audio experts JBL to produce a pair of wireless fitness-oriented headphones with a built-in heart rate monitoring functionality. Thanks to the ability of the JBL UA Sport headphones to monitor heart rate, users can receive updates for things like pace, distance, and heart-rate zones without looking at the display of a smartphone or smartwatch, which can be distracting and sometimes even impossible when in the zone. The heart rate monitoring functionality aside, the JBL UA Sport headphones offer excellent sound quality, good seal, and maximum comfort for daily listening. The headphones are water and sweat resistant, as fitness headphones should be, and they last up to 5 hours with audio and heart rate enabled on a charge.Myontec Mbody Connected Shorts
The Myontec Mbody connected shorts boast the most comprehensive and advanced training system available, featuring an array of sensors that capture heart rate, cadence, speed, distance, and other conventional performance. The sensors can also perform muscle overload analysis, determine training readiness based on warm-up monitoring, and analyze for imbalance and injury prevention. The shorts have been designed with the needs of cyclists, duathletes, and triathletes in mind, which is reflected in the use of 3D elastic compression textile by Carvico Revolutional, a leader in the national and international textile market. Unlike conventional sports shorts, the Myontec Mbody can be connected to a smartphone via Bluetooth and paired with the Mbody Live app, which is available for iOS and Android.Coros LINX Smart Bicycle Helmet
Modern cycling helmets already do many things to make cycling safer and more enjoyable, but they don’t help cyclists maintain a sharp focus on the road ahead. Every year, thousands of cyclists suffer serious and sometimes even fatal injuries because they use a smartphone for GPS navigation and communication. The Coros LINX smart helmet features wireless connectivity, allowing cyclists to wirelessly connect their helmet to their smartphone to listen to their own music, take phone calls, talk to fellow riders, and hear navigation and ride data through the helmet’s open-ear Bone Conduction Technology and a precision wind-resistant microphone. The helmet is available in four colors and two sizes.Levi’s Commuter Trucker Jacket
Levi’s Commuter Trucker Jacket is the first mass-produced article of clothing to feature Jacquard by Google, the first full-scale digital platform created for smart clothing that allows clothing manufacturers to add a new layer of connectivity and interactivity to everything from jackets to shoes to bags. In the case of the Levi’s Commuter Trucker Jacket, touch-sensitive fabric is woven into the sleeve, and the remaining electronic components are incorporated into the design of the jacket in such a way that the jacket is washable and virtually indistinguishable from ordinary jackets. When paired with a smartphone, the touch-sensitive sleeve makes it possible to control music, navigation, or phone calls.Conclusion
Together, all the invisible wearables described above demonstrate that consumers are interested in modern technology and its numerous benefits, but want to get them in a form that more seamlessly fits into their daily lives. As companies continue to develop new, more advanced textiles and sensors, we will likely see a boom of invisibles and the emergence of many never-before-seen smart products.The State of Mobile Gaming and Razer’s New Gaming Phone
The year 2017 came to an end and the number of available app in the Google Play Store climbed to 3.3 million. Not too far behind is Apple’s App Store with 2.2 million available apps. Combined, the global mobile gaming market generated $36.9 billion in revenue in 2016, overtaking the PC gaming market for the first time in history. According to the latest quarterly update by Newzoo, a leading provider of market intelligence covering the global games, e-sports, and mobile markets, mobile gaming will represent just more than half of the total games market by 2020.A study by PayPal and SuperData revealed that 78 percent of US respondents prefer gaming on smartphones instead of tablets, and recent smash hits such as Niantic’s Pokémon Go, which was downloaded over 100 million times in its first month alone, or Clash Royal, which generates over $2 million in daily revenue from in-app purchases, demonstrate that mobile gamers have a huge appetite for entertaining games and are willing to spend a lot of money on them.
According to Tapjoy’s report The Changing Face of Mobile Gamers: What Brands Need to Know, mobile gamers are most interested in puzzle games (59 percent), followed by strategy games (38 percent), trivia games (33 percent), and casino/card games (27 percent). Perhaps surprisingly, player-versus-player (15) and sports (11 percent) games are among the least popular categories of mobile games.
The lack of interest in more graphically intense mobile games with complex gameplay mechanics and cinematic storytelling could be, at least in part, attributed to the small number of smartphones created with gaming in mind.
Razer Phone: A Game Changer?
On November 1st, Razer, a company known for its gaming peripherals and computers, announced its grand entry into the smartphone market with the reveal of the Razer Phone, a smartphone created exclusively to satisfy the needs of mobile gamers.Razer’s announcement didn’t come as a big surprise because, earlier this year, the company acquired fledgling smartphone maker Nextbit, who has given us the Robin, a smartphone that bears a striking resemblance to the Razer Phone.
Made from solid aluminum with a black matte finish, the Razer Phone purposefully goes against the biggest trend of this smartphone generation: bezel-less design. Unlike the iPhone X, the Samsung Galaxy S8, or the already largely forgotten Essential Phone, the Razer Phone sports unusually thick top and bottom bezels, which hide stereo speakers with support for Dolby Atmos for Mobile.
Thanks to the thick bezels, the Razor Phone can be comfortably held in the landscape mode, and the large speakers hidden inside deliver an immersive audio experience with minimal distortion even at higher volume levels. Realizing that one of the biggest attractions of mobile gaming is that it can happen anywhere, Razer didn’t forget about gamers who prefer or are forced to use headphones, including a USB-C audio adapter with a THX-certified 24-bit DAC right in the box.
The Razor Phone is powered by a Snapdragon 835 processor and Adreno 540 GPU, and it features 64 GB of expandable storage and 8 GB of RAM. Such a large amount of RAM allows Razor to use a less aggressive memory management strategy to keep more processes running in the background at the same time, which can significantly reduce the time it takes to switch between apps. The smartphone’s large 4,000 mAh battery promises a full day of intense gaming and extremely quick charging thanks to the support for Qualcomm Quick Charge 4.0+, which includes all the benefits of Quick Charge 4 plus Dual Charge, intelligent thermal balancing, and several additional advanced safety features.
The phone’s 16:9 5.7-inch QHD IPS Sharp IGZO display isn’t the largest mobile display on the market nor the most color-accurate, but it delivers the highest refresh rates of all Android smartphones currently available: up to 120 Hz. “This means, zero lag or stuttering—just fluid, buttery smooth motion content for you to enjoy,” claims Razor.
Because refreshing everything 120 times per second takes its toll on the battery, Razer made the refresh rate adaptive, meaning that it dynamically adjusts itself for the best viewing experience as well as the best battery life. On the home screen, the refresh rate can be as low as 10 Hz, but it can instantly shoot up to 120 Hz when a user enters a game and stay at 120 Hz for as long as needed. Besides relying only on the automatic refresh rate adjustment to prolong the smartphone’s battery life, you can also manually set a frame rate cap of 60, 90, or 120 Hz as well as change the display resolution to 720p, 1080p, or the native QHD.
Even though only a small handful of popular games take advantage of either the front- or the rear-facing camera, the Razor Phone still comes with a decent dual 12 MP camera system: one camera taking wide-angle pictures and the other one being zoomed. The wide-angle camera has f/1.75 aperture, while the telephoto lens offers f/2.6 aperture. The front-facing camera has 8 MP.
The Razor Phone ships with Android 7.1.1 Nougat, and Razer plans to upgrade to Android 8 Oreo in spring 2018.
So far, Razor’s first entry into the smartphone market has been met with unanimously favorable reviews and positive reception from smartphone enthusiasts and mobile gamers alike. If the Razor Phone turns out to be a success, it might inspire other smartphone manufacturers to follow suit and release more smartphones aimed exclusively at mobile gamers, which could even reshape the mobile gaming market itself by increasing the demand for high-fidelity games that can push cutting-edge hardware to its limits.
Chaos Engineering: Breaking Things on Purpose
Modern distributed systems, especially within the realm of cloud computing, have become so complex and unpredictable that it’s no longer feasible to reliably identify all the things that can go wrong. From bad configuration pushes to hardware failures to sudden surges in traffic with unexpected results, the number of possible failures is too large for flawless distributed systems to exist. If perfection is unattainable, what else is there to strive for? Resiliency.“The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100 percent uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. In effect, we have to be stronger than our weakest link,” explains Netflix.
To know with certainty that a failure of an individual component won’t affect the availability of the entire system, it’s necessary to experience the failure in practice, preferably in a realistic and fully automated manner. When a system has been tested for a sufficient number of failures, and all the discovered weaknesses have been addressed, such system is very likely resilient enough to survive use in production.
“This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact,” says Netflix, whose early experiments with resiliency testing in production have given birth to a new discipline in software engineering: Chaos Engineering.
What Is Chaos Engineering?
The core idea behind Chaos Engineering is to break things on purpose to discover and fix weaknesses. Chaos Engineering is defined by Netflix, a pioneer in the field of automated failure testing and the company that originally formalized Chaos Engineering as a discipline in the Principles of Chaos Engineering, as the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.Chaos Engineering acknowledges that we live in an imperfect world where things break unexpectedly and often catastrophically. Knowing this, the most productive decision we can make is to accept this reality and focus on creating quality products and services that are resilient to failures.
Mathias Lafeldt, a professional infrastructure developer who’s currently working remotely for Gremlin Inc., says, “Building resilient systems requires experience with failure. Waiting for things to break in production is not an option. We should rather inject failures proactively in a controlled way to gain confidence that our production systems can withstand those failures. By simulating potential errors in advance, we can verify that our systems behave as we expect—and to fix them if they don’t.”
In doing so, we’re building systems that are antifragile, which is a term borrowed from Nassim Nicholas Taleb’s 2012 book titled “Antifragile: Things That Gain from Disorder.” Taleb, a Lebanese-American essayist, scholar, statistician, former trader, and risk analyst, introduces the book by saying, “Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”
On his blog, Lafeldt gives another example of antifragility, “Take the vaccine—we inject something harmful into a complex system (an organism) in order to build an immunity to it. This translates well to our distributed systems where we want to build immunity to hardware and network failures, our dependencies going down, or anything that might go wrong.”
Just like with vaccination, the exposition of a system to volatility, randomness, disorder, and stressors must be executed in a well-thought-out manner that won’t wreak havoc on it should something go wrong. Automated failure testing should ideally start with the smallest possible impact that can still teach something and gradually become more impactful as the tested system becomes more resilient.
The Five Principles of Chaos Engineering
“The term ‘chaos’ evokes a sense of randomness and disorder. However, that doesn’t mean Chaos Engineering is something that you do randomly or haphazardly. Nor does it mean that the job of a chaos engineer is to induce chaos. On the contrary: we view Chaos Engineering as a discipline. In particular, we view Chaos Engineering as an experimental discipline,” state Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”In their book, the authors propose the following five principles of Chaos Engineering:
Hypothesize About Steady State
The Systems Thinking community uses the term “steady state” to refer to a property where the system tends to maintain that property within a certain range or pattern. In terms of failure testing, the normal operation of the tested system is the system’s steady state, and we can determine what constitutes as normal based on a number of metrics, including CPU load, memory utilization, network I/O, how long it takes to service web requests, or how much time is spent in various database queries, and so on.“Once you have your metrics and an understanding of their steady state behavior, you can use them to define the hypotheses for your experiment. Think about how the steady state behavior will change when you inject different types of events into your system. If you add requests to a mid-tier service, will the steady state be disrupted or stay the same? If disrupted, do you expect the system output to increase or decrease?” ask the authors.
Vary Real-World Events
Suitable events for a chaos experiment include all events that are capable of disrupting steady state. This includes hardware failures, functional bugs, state transmission errors (e.g., inconsistency of states between sender and receiver nodes), network latency and partition, large fluctuations in input (up or down) and retry storms, resource exhaustion, unusual or unpredictable combinations of inter-service communication, Byzantine failures (e.g., a node believing it has the most current data when it actually does not), race conditions, downstream dependencies malfunction, and others.“Only induce events that you expect to be able to handle! Induce real-world events, not just failures and latency. While the examples provided have focused on the software part of systems, humans play a vital role in resiliency and availability. Experimenting on the human-controlled pieces of incident response (and their tools!) will also increase availability,” warn the authors.
Run Experiments in Production
Chaos Engineering prefers to experiment directly on production traffic to guarantee both authenticity of the way in which the system is exercised and relevance to the currently deployed system. This goes against the commonly held tenet of classical testing, which strives to identify problems as far away from production as possible. Naturally, one needs to have a lot of confidence in the tested system’s resiliency to the injected events. The knowledge of existing weaknesses indicates a lack of maturity of the system, which needs to be addressed before conducting any Chaos Engineering experiments.“When we do traditional software testing, we’re verifying code correctness. We have a good sense about how functions and methods are supposed to behave, and we write tests to verify the behaviors of these components. When we run Chaos Engineering experiments, we are interested in the behavior of the entire overall system. The code is an important part of the system, but there’s a lot more to our system than just code. In particular, state and input and other people’s systems lead to all sorts of system behaviors that are difficult to foresee,” write the authors.
Automate Experiments to Run Continuously
Automation is a critical pillar of Chaos Engineering. Chaos engineers automate the execution of experiments, the analysis of experimental results, and sometimes even aspire to automate the creation of new experiments. That said, one-off manual experiments are a good place where to start with failure testing. After a few batches of carefully designed manual experiments, the next natural level we can aspire to is their automation.“The challenge of designing Chaos Engineering experiments is not identifying what causes production to break, since the data in our incident tracker has that information. What we really want to do is identify the events that shouldn’t cause production to break, and that have never before caused production to break, and continuously design experiments that verify that this is still the case,” the authors emphasize what to pay attention to when designing automated experiments.
Minimize Blast Radius
It’s important to realize that each chaos experiment has the potential to cause real damage. The difference between a badly designed chaos experiment and a well-designed chaos experiment is in the blast radius. The most basic way how to minimize the blast radius of any chaos experiment is to always have an emergency stop mechanism in place to instantly shut down the experiment in case it goes out of control. Chaos experiments should be built upon each other by taking careful, measured risks that gradually escalate the overall scope of the testing without causing unnecessary harm.“The entire purpose of Chaos Engineering is undermined if the tooling and instrumentation of the experiment itself cause an undue impact on the metric of interest. We want to build confidence in the resilience of the system, one small and contained failure at a time,” caution the authors in the book.
Chaos at Netflix
Netflix has been practicing some form of resiliency testing in production ever since the company began moving out of data centers into the cloud in 2008. The first Chaos Engineering tool to gain fame outside Netflix’s offices was Chaos Monkey, which is currently in version 2.0.“Years ago, we decided to improve the resiliency of our microservice architecture. At our scale, it is guaranteed that servers on our cloud platform will sometimes suddenly fail or disappear without warning. If we don’t have proper redundancy and automation, these disappearing servers could cause service problems. The Freedom and Responsibility culture at Netflix doesn’t have a mechanism to force engineers to architect their code in any specific way. Instead, we found that we could build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward. We created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours,” explains Netflix.
The rate at which Chaos Monkey turns off servers is higher than the rate at which server outages happen normally, and Chaos Monkey is configured to turn off servers during production hours. Thus, engineers are forced to build resilient services through automation, redundancy, fallbacks, and other best practices of resilient design.
While previous versions of Chaos Monkey were additionally allowed to perform actions like burning up CPU and taking storage devices offline, Netflix uses Chaos Monkey 2.0 to only terminate instances. Chaos Monkey 2.0 is fully integrated with Netflix’s open source multi-cloud continuous delivery platform, Spinnaker, which is intended to make it easy to extend and enhance cloud deployment models. The integration with Spinnaker allows service owners to set their Chaos Monkey 2.0 configs through the Spinnaker apps, and Chaos Monkey 2.0 to get information about how services are deployed from Spinnaker.
Once Netflix realized the enormous potential of breaking things on purpose to rebuild them better, the company decided to take things to the next level and move from the small scale to the very large scale with the release of Chaos Kong in 2013, a tool capable of testing how their services behave when a zone or an entire region is taken down. According to Nir Alfasi, a Netflix engineer, the company practices region outages using Kong almost every month.
“What we need is a way to limit the impact of failure testing while still breaking things in realistic ways. We need to control the outcome until we have confidence that the system degrades gracefully, and then increase it to exercise the failure at scale. This is where FIT (Failure Injection Testing) comes in,” stated Netflix in early 2014, after realizing that they need a finer degree of control when deliberately breaking things than their existing tool allowed for at the time. FIT is a platform designed to simplify the creation of failure within Netflix’s ecosystem with a greater degree of precision. FIT also allows Netflix to propagate its failures across the entirety of Netflix in a consistent and controlled manner. “FIT has proven useful to bridge the gap between isolated testing and large-scale chaos exercises, and make such testing self-service.”
Once the Chaos Engineering team at Netflix believed that they had a good story at small scale (Chaos Monkey) and large scale (Chaos Kong) and in between (FIT), it was time to formalize Chaos Engineering as a practice, which happened in mid-2015 with the publication of the Principles of Chaos Engineering. “With this new formalization, we pushed Chaos Engineering forward at Netflix. We had a blueprint for what constituted chaos: we knew what the goals were, and we knew how to evaluate whether or not we were doing it well. The principles provided us with a foundation to take Chaos Engineering to the next level,” write Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”
The latest notable addition to Netflix’s Chaos Engineering family of tools is ChAP (Chaos Automation Platform), which was launched in late 2016. “We are excited to announce ChAP, the newest member of our chaos tooling family! Chaos Monkey and Chaos Kong ensure our resilience to instance and regional failures, but threats to availability can also come from disruptions at the microservice level. FIT was built to inject microservice-level failure in production, and ChAP was built to overcome the limitations of FIT so we can increase the safety, cadence, and breadth of experimentation,” introduced Netflix their new failure testing automation tool.
Although Netflix isn’t the only company interested in Chaos Engineering, their willingness to develop in the open and share with others has had a profound influence on the industry. Besides regularly speaking at various industry events, Netflix’s GitHub page contains a wealth of interesting open source projects that are ready for adoption.
Chaos Engineering is also being embraced by Etsy, Microsoft, Jet, Gremlin, Google, and Facebook, just to name a few. These and other companies have developed a comprehensive range of open source tools for different use cases. The tools include Simoorg (LinkedIn’s own failure inducer framework), Pumba (a chaos testing and network emulation tool for Docker), Chaos Lemur (self-hostable application to randomly destroy virtual machines in a BOSH-managed environment), and Blockade (a Docker-based utility for testing network failures and partitions in distributed applications), just to name a few.
Learn to Embrace Chaos
If you now feel inspired to embrace the above-described principles and the tool to create your own Chaos Engineering experiments, you may want to adhere to the following Chaos Engineering experiment design process, as outlined in “Chaos Engineering: Building Confidence in System Behavior through Experiments.”- Pick a hypothesis
- Decide what hypothesis you’re going to test and don’t forget that your system includes the humans that are involved in maintaining it.
- Choose the scope of the experiment
- Strive to run experiments in production and minimize blast radius. The closer your test is to production, the more you’ll learn from the results.
- Identify the metrics you’re going to watch
- Try to operationalize your hypothesis using your metrics as much as possible. Be ready to abort early if the experiment has a more serious impact than you expected.
- Notify the organization
- Inform members of your organization about what you’re doing and coordinate with multiple teams who are interested in the outcome and are nervous about the impact of the experiment.
- Run the experiment
- The next step is to run the experiment while keeping an eye on your metrics in case you need to abort it.
- Analyze the results
- Carefully analyze the result of the experiment and feed the outcome of the experiment to all the relevant teams.
- Increase the scope
- Once you gain confidence running smaller-scale experiments, you may want to increase the scope of an experiment to reveal systemic effects that aren’t noticeable with smaller-scale experiments.
- Automate
- The more regularly you run your Chaos Experiments, the more value you can get out of them.