Don’t Brick Your Devices: 5 Problems and Solutions for IoT Over-the-Air Updates

March 30, 2024

One of the most important features of an IoT product is over-the-air updates. In fact, over-the-air updates, also known as OTAs, are likely the most important feature.

What are OTAs? In a nutshell, they are software updates that are sent from a central server to remote IoT devices. These updates can include bug fixes, security updates, and/or new customer-facing features. 

Getting OTAs right is critical to the long-term success of a connected product. In our opinion, selling an IoT device without the ability to send OTAs is tantamount to malfeasance: you’re selling a device that has no way to receive critical updates, which could result in hacked devices and all sorts of bad outcomes for customers.

In this blog post, we explore what OTAs are, and share a few horror stories about what can happen when OTAs go wrong. To ensure you don’t share a similar fate, we’ll review 5 of the most common problems you might encounter, as well as their solutions.

What Exactly are Over-the-Air Updates?

OTAs are firmware and/or software updates that are delivered to a remote device. They are especially prevalent in the Internet of Things (IoT) ecosystem, where devices often operate in hard-to-reach or remote locations, or where sending technicians out to perform updates would be prohibitively expensive.

For example, consider a smart home thermostat receiving an update to improve its efficiency algorithms without requiring an on-site visit. If you’ve sold tens of thousands of these devices, you can’t reasonably expect to send a support truck to each and every customer. 

While OTA updates are primarily wireless, don’t get too hung up on the “air” part. Updates are often delivered via radio technology, such as Wi-Fi, cellular or Bluetooth. But they can also be delivered over a wired connection. For this blog post, we’re ignoring the delivery mechanism, and focusing on the common problems and solutions for all OTAs.

At a high level, an OTA process looks like this:

  1. The device receives a notification that a new update is ready for download.
  2. The update is downloaded to the device, and checks are run to confirm it is legitimate and undamaged. 
  3. The device installs the update, then reboots to load the new software.
  4. Another set of checks is run to confirm everything is in working order. If not, the device reverts to a previous good state.
  5. The device notifies the update service it is now running the most recent version of the firmware.
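To make the five steps concrete, here is a minimal Python sketch of the flow. It is an in-memory illustration only: the `OtaDevice` class, its `run_update` method, and the log strings are hypothetical names we made up for this post, not part of any real SDK, and the download check and health check are passed in as simple stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OtaDevice:
    """Hypothetical in-memory stand-in for a device (not a real SDK)."""
    running_version: str
    log: List[str] = field(default_factory=list)

    def run_update(self, new_version: str, image_ok: bool,
                   health_check: Callable[[str], bool]) -> str:
        previous = self.running_version
        # Step 1: the device learns that an update is available.
        self.log.append(f"notified:{new_version}")
        # Step 2: download, then confirm the image is legitimate and undamaged.
        if image_ok:
            # Step 3: install and reboot into the new firmware.
            self.running_version = new_version
            self.log.append("installed_and_rebooted")
            # Step 4: post-reboot checks; revert to the previous version on failure.
            if not health_check(new_version):
                self.running_version = previous
                self.log.append("reverted")
        else:
            self.log.append("download_rejected")
        # Step 5: report whichever version is actually running.
        self.log.append(f"reported:{self.running_version}")
        return self.running_version
```

Note that the device reports its version in every case, including failures. That way the update service always knows what is really running in the field.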

Easy as pie, right? What could go wrong?

Over-the-Air Update Horror Stories

As it turns out, lots of things can go wrong with over-the-air updates. 

Below are two true stories that convey the pain that broken OTAs can cause. We don’t share the names of the organizations involved, and while we didn’t cause the problems, we were involved in the cleanup.

Story #1: 

Early in our journey with IoT, we helped an organization deploy a new cloud for their IoT products. To do this, we assisted our customer in pushing out a new firmware update to a fleet of roughly 120,000 devices.

Unfortunately, there were two problems. First, the devices did not have enough storage for both the old firmware and the new firmware. So, if anything went wrong, a device would be “bricked”. That is, it would cease to operate completely, and basically become an expensive brick.

This was compounded by the second problem: the old firmware did not include a verification system. The update was broken into roughly one thousand 1 KB chunks. The device “glued” these chunks together before rebooting. But this glue process was error-prone and occasionally did the wrong thing, and nothing checked the assembled image before the device rebooted into it.

As a result, a little over 3% of devices were bricked, resulting in 4,000 very upset customers. Not only did the company need to send out 4,000 new devices, but they had a huge backlash from their customer base, resulting in years of lost sales.

That leads us to story #2:

Fast forward a few years: we were in the middle of architecting a second-generation IoT platform for a Fortune 500 company. One day, we got a very panicked email: could we drop everything and help them with a problem?

It turns out, the customer had built their first IoT device using a third-party IoT platform provided by their chip vendor. The vendor had been purchased, and the IoT platform was being shut down. 

Being new to IoT, our customer’s leadership deprioritized the issue, waiting until it was too late to act. They lost the chance to send out an OTA to redirect the devices to a new platform. As a result, the devices stopped working.

Normally, a platform shutdown would be a nuisance. But the customer didn’t own a few critical parts of the infrastructure, namely the root device certificate and the DNS records.

Because of these oversights, the customer needed to send out 25,000 new devices to customers, and recall 75,000 products from store shelves. All at a cost of a few million dollars. 

Five Common Problems with OTAs

Thankfully, you can sidestep your own horror stories. With a little bit of planning, you can create an OTA system that is secure, reliable, and scalable. 

To that end, here are five OTA problems we often see and the solutions to overcome them.

Problem 1: Insecure Over-The-Air Updates

The Challenge: Unsecured OTA systems can be a gateway for attackers, leading to compromised devices, data breaches, or ransomware attacks. The ultimate prize for an IoT hacker is to completely control the device and install their own software. 

The Solution: 

There are three interlocking solutions to this problem.

  1. Signing and Encrypting Firmware: Think of firmware signing as putting a tamper-evident seal on software updates, ensuring that only updates from trusted sources can be installed. Encrypting the firmware is like putting it in a secure box that only the intended device can open, safeguarding against interception and tampering. These steps are critical in preventing unauthorized or malicious updates.
  2. Leveraging a Secure Boot Process: Secure boot acts as a bouncer for your device, only letting in software that has been verified as safe. This prevents malicious code from running, adding an essential layer of security to the update process.
  3. Owning the DNS Records and Security Certificate Keys: As we explained in the second horror story, owning your own records is critical to successful OTAs. It is especially important that you own the security certificate keys, and ideally keep them air-gapped in a physical safe with very limited access.

Although no device can be 100% secure, these are the bare minimum steps needed to ensure you can deliver secure over-the-air updates.
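As a rough illustration of the signing idea, the sketch below uses only Python’s standard library. Please read it with one big caveat: it uses HMAC-SHA256, which is a symmetric scheme where the build server and the device share the same secret key. We chose it only because it needs no third-party packages. Production firmware signing typically uses asymmetric signatures, so the device holds only a public key and cannot be used to forge updates. The function names `sign_image` and `verify_image` are our own, not from any particular platform.

```python
import hashlib
import hmac


def sign_image(image: bytes, key: bytes) -> bytes:
    """Build-server side: compute an authentication tag over the image.

    Assumption: HMAC-SHA256 stands in for a real asymmetric signature.
    """
    return hmac.new(key, image, hashlib.sha256).digest()


def verify_image(image: bytes, tag: bytes, key: bytes) -> bool:
    """Device side: accept the image only if the tag matches.

    compare_digest avoids leaking information through timing differences.
    """
    expected = hmac.new(key, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


# Example usage with made-up values.
key = b"example-key-do-not-use-in-prod!!"
image = b"\x7fFIRMWARE-v1.1.0"
tag = sign_image(image, key)
print(verify_image(image, tag, key))            # True
print(verify_image(image + b"\x00", tag, key))  # False: image was altered
```

The important design point is that the check runs on the complete image before anything is installed, so a damaged or altered download is rejected rather than booted.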

Problem 2: Not Thoroughly Testing OTA Updates

The Challenge: Broken firmware or software is at best a hassle, and at worst a cause of significant downtime or other issues for a customer. However, it can be a challenge to test firmware quickly and cost-effectively.

This challenge grows significantly as the diversity of connected products grows. A mature IoT fleet might include several generations of hardware, different model numbers, and varying operating conditions, complicating comprehensive testing.

The Solution:

It’s important to set yourself up for success, and make testing as easy (and therefore as cheap) as possible. Some of the keys to this include the following:

  • Comprehensive Testing Procedures: Establish rigorous testing protocols that simulate real-world environments and device configurations, ensuring updates work seamlessly across the entire fleet before rollout.
  • Follow Version Control Best Practices: Implement strict version control to manage and track different firmware versions effectively, ensuring a smooth and reliable update process.
  • Automate as Much as You Can: Building and delivering over-the-air updates to your test devices should be simple. The “nirvana” here is a single button push that automatically sends any version of any firmware to a test device and runs a detailed battery of tests against it.
  • Use Hardware-in-the-Loop Testing When Possible: Also known as HIL testing, this is a specialized form of automation in which a battery of tests is run automatically against real hardware. This is typically performed by connecting the device under test to a dedicated testing rig.
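One way to picture the automation described above is as a test matrix: every hardware revision paired with every firmware version you still support. The Python sketch below builds that matrix and runs a callback for each combination. The names `build_test_matrix`, `run_matrix`, `failures`, and the `flash_and_test` callback are all hypothetical; in a real setup the callback would drive your HIL rig, which we only simulate here.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Combo = Tuple[str, str]  # (hardware revision, firmware version)


def build_test_matrix(hardware_revs: Sequence[str],
                      firmware_versions: Sequence[str]) -> List[Combo]:
    """Pair every hardware revision with every firmware version."""
    return list(product(hardware_revs, firmware_versions))


def run_matrix(matrix: Sequence[Combo],
               flash_and_test: Callable[[str, str], bool]) -> Dict[Combo, bool]:
    """Run the (hypothetical) rig callback once per combination."""
    return {(hw, fw): flash_and_test(hw, fw) for hw, fw in matrix}


def failures(results: Dict[Combo, bool]) -> List[Combo]:
    """Return the combinations that did not pass, in matrix order."""
    return [combo for combo, passed in results.items() if not passed]


# Example usage with a simulated rig that fails one combination.
matrix = build_test_matrix(["revA", "revB"], ["1.0.0", "1.1.0"])
results = run_matrix(matrix, lambda hw, fw: (hw, fw) != ("revA", "1.1.0"))
print(failures(results))  # [('revA', '1.1.0')]
```

Even a simple matrix like this makes gaps visible: if a hardware generation is missing from the list, it is not being tested.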

Problem 3: No Roll-Back Strategy

The Challenge: Even with thorough testing, updates can introduce problems. Without a method to revert to a previous version, devices might become broken.

The Solution:

As we saw in our first horror story, having a roll-back strategy is critical to ensuring you don’t accidentally brick a device. This involves two steps:

  1. Dual-Slot Firmware Storage: Design the device with two storage slots. The first slot is for the current, working version of the software. The second slot is reserved for updates. Imagine updating your device like changing clothes. One slot holds the outfit you’re wearing (the current firmware), and the other holds a new outfit (the new firmware). If the new outfit doesn’t fit right, you can quickly switch back to the comfortable one you were wearing before, ensuring your device always works.
  2. Robust Post-Reboot Tests: Dual-slot storage is no good if you don’t provide robust post-reboot testing. These tests ensure the new firmware a) reboots correctly, and b) passes some “smoke tests” to confirm it is operating properly. If these checks fail, the device is rebooted into the previous firmware.
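The two steps above can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not real bootloader code: the `DualSlotBootloader` class, the slot names “A” and “B”, and the `smoke_test` callback are all hypothetical. The key behavior to notice is that the active slot only changes after the new firmware passes its checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class DualSlotBootloader:
    """Hypothetical model of a two-slot device; slot name -> firmware version."""
    slots: Dict[str, Optional[str]]
    active: str = "A"

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def stage(self, version: str) -> None:
        """Write new firmware to the inactive slot; the active slot is untouched."""
        self.slots[self.inactive()] = version

    def boot_staged(self, smoke_test: Callable[[str], bool]) -> Optional[str]:
        """Trial-boot the staged firmware and commit only if the smoke test passes."""
        candidate = self.inactive()
        version = self.slots[candidate]
        if version is not None and smoke_test(version):
            self.active = candidate  # commit: the new firmware becomes active
        # Otherwise the active slot is unchanged, which is the roll-back.
        return self.slots[self.active]


# Example usage: a failing update leaves the device on the old firmware.
bootloader = DualSlotBootloader({"A": "1.0.0", "B": None})
bootloader.stage("1.1.0")
print(bootloader.boot_staged(lambda version: False))  # 1.0.0
print(bootloader.boot_staged(lambda version: True))   # 1.1.0
```

Because the old firmware is never overwritten until the next update is staged, a bad image costs you a reboot instead of a device.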

On a few rare occasions, we’ve seen organizations use three slots. In addition to the dual-slot architecture, they include a hard-coded “gold” slot, containing a read-only version of the firmware. If either of the writable slots becomes damaged, the device can reboot into the gold slot. This is overkill for most products, but is useful in high-security or safety-critical situations.

Problem 4: Sending OTAs to Many Devices

The Challenge: An IoT fleet might range from a few dozen devices during a pilot phase, to millions in a global deployment. This presents a significant challenge in ensuring all devices receive the correct firmware in a timely manner. 

The Solution: 

Again, automation is a key to the solution: 

  1. Automated, asynchronous updates: Once you approve a firmware version for release, the process should be 100% automated. Notifying devices, downloading updates, and recording status should all be managed by your IoT platform. This is important because not all IoT devices are online 100% of the time. It might take days, weeks, or even months to ensure every device receives an update.
  2. Incremental rollouts: Rarely do you want to send an update to all devices immediately. Instead, begin with a canary deployment, releasing the update to a small, controlled group of devices to catch potential issues early. Following a successful canary phase, proceed with a staged deployment to the rest of the fleet. If you have 10,000 devices, you might send out updates to 250 or 500 at a time. Monitor each wave for anomalies, pausing or rolling back if you detect issues.
  3. Documentation and Training: Provide comprehensive documentation and training for teams to fully understand the OTA process and potential pitfalls, fostering a knowledgeable and prepared workforce.
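To show how an incremental rollout might be planned, here is a small Python sketch using the 10,000-device example from above. The function names `plan_waves` and `roll_out`, the canary size of 100, and the `wave_is_healthy` callback are our own assumptions for illustration. A real IoT platform would push the update and collect device reports where the comment indicates.

```python
from typing import Callable, List, Sequence


def plan_waves(device_ids: Sequence[str], canary_size: int,
               wave_size: int) -> List[List[str]]:
    """Split a fleet into one canary group followed by fixed-size waves."""
    if canary_size < 0 or wave_size <= 0:
        raise ValueError("canary_size must be >= 0 and wave_size must be > 0")
    ids = list(device_ids)
    canary, rest = ids[:canary_size], ids[canary_size:]
    waves = [canary] if canary else []
    for start in range(0, len(rest), wave_size):
        waves.append(rest[start:start + wave_size])
    return waves


def roll_out(waves: Sequence[List[str]],
             wave_is_healthy: Callable[[List[str]], bool]) -> int:
    """Return how many waves passed their health check before the rollout paused."""
    completed = 0
    for wave in waves:
        # A real platform would send the update here and wait for reports.
        if not wave_is_healthy(wave):
            break  # pause so the team can investigate or roll back
        completed += 1
    return completed


# Example usage: 10,000 devices, a 100-device canary, then waves of 500.
fleet = [f"device-{n:05d}" for n in range(10_000)]
waves = plan_waves(fleet, canary_size=100, wave_size=500)
print(len(waves), len(waves[0]), len(waves[-1]))  # 21 100 400
```

The rollout stops at the first unhealthy wave, which caps the damage from a bad release at one wave of devices rather than the whole fleet.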

Believe it or not, we’ve seen organizations that manually send OTAs to thousands of devices. This is both error-prone and costly in terms of man hours. Don’t be that guy.

Problem 5: Poor User Feedback 

The Challenge: Users may become frustrated or confused if their device behaves unexpectedly ahead of, during, or after an update, leading to a poor experience.

An example of a bad experience could be a smartwatch restarting during a workout without warning, disrupting the user’s exercise and data tracking.

The Solution: 

Finally, don’t forget your users! Offer clear communication before, during and after the update process:

  1. User Notifications and Automation: Keeping users in the loop with notifications like “Software update available: Install now or schedule for later?” ensures transparency and minimizes disruptions, enhancing trust and satisfaction. Possible messages include update progress notifications, scheduling options, and confirmation requests for restarts, offering users control over the update process.
  2. Clear Retry Notifications: If an update fails, it can be extremely helpful to explain how to resolve the issue. “Sorry, something went wrong” is a horrible message. Instead, enable a user to manually retry the update process, or point them to the proper customer support phone number or email address.

Product-Grade OTAs: Complex, but Worth It

There’s much more to say about developing a rock-solid OTA process. But if you follow our advice, you’ll be well on your way.

And remember: the hard work is well worth it. Delivering timely, secure and fully tested over-the-air updates will give you a leg up on your competition, and keep your customers happy for years to come. 

If you want to learn more about designing a great OTA experience, drop us a note at hello@spindance.com.

And don’t forget to check out CallBox, SpinDance’s software framework for building production-grade IoT products. It includes a rock-solid OTA system.