Meta-support: Tools for supporting IoT
In the IoT world, a common headache for customers is supporting their product after it is released. At SpinDance, one of our core values is to Deliver Every Day; this includes supporting both our internal tools and our customers’ infrastructure. To do this with excellence, we rely on the intelligent use of meta-support tools to amplify the support team’s work. In this post, we will highlight some of these tools.
The Tools of the Support Trade
Datadog is an excellent monitoring service that can be extended and hooked into in some incredible ways. This system keeps 24/7 surveillance on our cloud infrastructure, dedicated hosts, websites, and databases.
The feature we use most is monitors. A monitor can be a simple smoke-test-style check, where a tool requests a URL and makes sure it returns the expected HTTP status code. A monitor can also be a complex, carefully designed check that passes input to the tool or a query to a database. For example, we have one check that queries a Redshift database’s rows by time, averages the number of entries made in the past hour, and raises a warning if the result falls below a specified threshold.
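To make that last example concrete, here is a minimal Python sketch of the threshold logic behind such a check. It is not our actual monitor: the function names, the per-minute averaging, and the sample threshold are all assumptions for illustration, and in practice the timestamps would come from the database query rather than an in-memory list.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def entries_per_minute(
    timestamps: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> float:
    """Average number of entries per minute in the window ending at `now`."""
    start = now - window
    # Count only entries that fall inside (start, now].
    count = sum(1 for ts in timestamps if start < ts <= now)
    return count / (window.total_seconds() / 60)


def check_ingest_rate(
    timestamps: Iterable[datetime],
    now: datetime,
    min_per_minute: float,
) -> str:
    """Return "WARNING" when the average rate drops below the threshold."""
    rate = entries_per_minute(timestamps, now)
    return "WARNING" if rate < min_per_minute else "OK"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical data: one entry every two minutes over the past hour.
    sample = [now - timedelta(minutes=2 * i) for i in range(30)]
    print(check_ingest_rate(sample, now, min_per_minute=1.0))
```

A pure function like this is easy to test with fixed timestamps, which is handy when tuning the threshold before wiring it up to a real alert.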
Datadog has excellent first-class support for all major cloud vendors, and if there isn’t a pre-made tool, the platform is customizable enough to let you build it yourself!
In theory, this could be replicated by a master server pinging other servers or by a free tool (like the excellent Cockpit Project). However, Datadog adds some secret sauce of its own: its visualization of monitors and metrics is fantastic out of the box, and it has incredibly granular controls to, say, mute individual monitors once they have been triaged. When it is necessary to alert a support team member, Datadog hooks into our Slack channel to notify the respective support role.
Certain critical-level events also send an automated text to the on-call team member, using Twilio (more on that later).
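The escalation idea described here (chat notification for ordinary alerts, plus a text message for critical ones) can be sketched in a few lines of Python. This is only an illustration: the severity levels and channel names below are hypothetical, and in our setup this routing is configured inside Datadog rather than written by hand.

```python
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    """Hypothetical severity levels, ordered from least to most urgent."""
    INFO = 0
    WARNING = 1
    CRITICAL = 2


def channels_for(severity: Severity) -> List[str]:
    """Pick notification channels for an alert of the given severity."""
    channels: List[str] = []
    if severity >= Severity.WARNING:
        channels.append("slack")  # chat notification for the support role
    if severity >= Severity.CRITICAL:
        channels.append("sms")    # additional text to the on-call team member
    return channels


if __name__ == "__main__":
    for level in Severity:
        print(level.name, channels_for(level))
```

Keeping the text message as an addition to the chat notification, rather than a replacement for it, means the rest of the team still sees critical alerts in the channel.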
We have multiple redundant Datadog monitoring instances, which we orchestrate and deploy automatically using the Chef tool.
Chef Software is an extensive suite of tools built around configuration management. Paired with Datadog, this enables us to quickly respond to issues and keep everything we’re responsible for in a working, consistently-reproducible state.
While creating a Chef “cookbook” of recipes and using it to deploy infrastructure may be a bit more work the first time than just doing it yourself in a terminal, once the infrastructure has been captured as code, further work is much easier. Keeping our servers specified in Chef cookbooks enables us to, for example, apply Linux kernel security patches on a dev machine, automatically run tests against it, and, when comfortable, apply the update to every relevant machine and know exactly what has changed.
Chef also makes recovering from a disaster much more manageable. Ransomware? No problem, just spin up a new EC2 instance and execute your Chef cookbooks! Combined with other backup tools, a complete server replacement is possible with a simple invocation of the chef-client command.
To round things off, the Chef community keeps a public list of cookbooks up-to-date in the Chef “supermarket”. For almost any common task you can think of, somebody has a cookbook to help implement it. We also run our own private supermarket instance, in case we need to tweak a public cookbook or make one from scratch easily.
The only major downside of Chef is the unfortunate fact that its tools share names with actual cooking and shopping terms, which can test any developer’s Google-fu. Example: Chef has a command-line tool called “knife” for working with cookbooks and the Chef server. If you Google “chef knife not working,” you get many results, very few of which will help you set up a server.
Fortunately for us, we have found Chef’s documentation, supplemented with a healthy seasoning of StackOverflow, answers 99.9% of questions.
AWS – CodeCommit/CodeBuild
Listing AWS as a meta-tool is a bit like listing “a computer” as a developer’s tech stack. It is technically accurate but, well, a bit general. AWS can do a lot (just ask anyone who has studied for one of their exams), but today I will focus on the bits we use specifically for our support pipeline: CodeCommit and CodeBuild.
CodeCommit isn’t too exciting on its own: it is a private git host. Its usefulness comes, as with many AWS products, from its tight integration with other AWS tools. Storing code in CodeCommit allows us to trigger CloudWatch events on commits with certain messages and to run automatic linting, CI/CD, and builds. We aren’t too fancy; our support team has a nice but straightforward workflow where we keep all of our Chef cookbooks as individual CodeCommit repositories, from which CodeBuild produces artifacts. The Chef infrastructure then checks daily for any updates and applies them to all our relevant machines.
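As a rough illustration of acting on “commits with certain messages,” the Python sketch below decides whether a commit message should kick off a build. The bracketed tag convention and the function name are hypothetical, chosen only for this example; they are not our actual rules or anything built into CodeCommit.

```python
import re

# Hypothetical convention: a commit opts in to a build by including
# a "[build]" or "[release]" tag anywhere in its message.
TRIGGER_PATTERN = re.compile(r"\[(build|release)\]", re.IGNORECASE)


def should_trigger_build(commit_message: str) -> bool:
    """Return True when the commit message carries a trigger tag."""
    return TRIGGER_PATTERN.search(commit_message) is not None


if __name__ == "__main__":
    for message in ("Fix nginx recipe [build]", "Update README"):
        print(repr(message), should_trigger_build(message))
```

A filter like this would typically live in whatever handles the commit event, so that routine documentation commits do not start a build.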
So far, this blog post has walked through the primary “loop” of support: infrastructure, coded for Chef, deployed with AWS, and monitored by Datadog. Of course, these are not all the tools, but I sadly don’t have the time to write (nor do most of our dear readers have the time to read) the dozens of pages it would take to fill in every detail. Instead, I will briefly mention some of the other tools outside of this primary support pipeline.
For some legacy systems and tools, CrashPlan is a dead-simple automated system backup tool. It can perform a complete system recovery, but we often find ourselves using it to help a customer recover an individual file that was modified or deleted. It is a great tool that you hope to never use, but boy will you be glad when you have to use it.
To make sure we can respond to emergencies at any time of the day, we have Datadog hooked into a Twilio connection, allowing us to text the on-call support team member under certain circumstances. In this instance, Twilio is just the pipe; Datadog and some cron jobs do the bulk of the work, but without the texts, the support response could be lacking outside business hours.
For documentation, we rely on a wiki; realistically, a wide variety of wiki tools could fill this role. Atlassian’s Confluence product is polished, has every possible knob and button you could imagine, and ties in with other Atlassian tools. Documenting our solutions and pipeline in the wiki is crucial for raising our team’s “bus factor”: the hypothetical number of people who could, for one reason or another, disappear (be “hit by a bus”) without our support suffering.
Long-term, our goal is to make the support role as unnecessary as possible. While we understand that a human support specialist role is always going to be needed, by magnifying the effectiveness of our team, we avoid burnout and wasted time on micro-managing issues. Instead, we can focus on the important part of our work: bringing our customers’ ideas and solutions to fruition with IoT!
Some exciting and promising tools we are looking into include Ansible as a potential complement to or replacement for Chef, an ACME client to replace all manual SSL certificate renewal with automatic renewal, and hybrid-cloud solutions using Terraform to allow cross-cloud fallback.
We hope that this short dive into our support tooling inspires or encourages you in your own support endeavors. Helping others across the industry increase the quality and longevity of their support for IoT products and services ultimately helps everyone, customers and companies alike.
SpinDance relentlessly focuses on quality: this means caring about both the products and companies we work with for an entire IoT journey. To accomplish this, an efficient and effective support structure is essential, and these tools comprise the platform we stand on to reach this goal.
If you or your organization want to step up your product, bring an idea to reality, or just get up to speed on IoT, we would love to discuss solutions with you. SpinDance offers guided IoT introductory classes, as well as world-class software development services.
If you are an individual interested in an IoT career, we’d love to hear from you as well! SpinDance is an agile-focused development environment using the Scrum methodology. Our main presence is in Holland, Michigan, but our remote-friendly culture includes employees from coast to coast.
About the Author:
Quentin Baker, Software Engineer
As a Software Engineer at SpinDance, Quentin is passionate about product quality and assisting our customers in their IoT quest.