Monday, February 27, 2017

Home Automation - Part 2

This post turns the program from Part 1 into a 12 factor app. It is therefore a digression from home automation, focusing more on software architecture and design.

December 2020 Update:

A few things have changed since this article was published in February 2017.
  • Weatherunderground ceased to be an option, so we switched to Openweather.
  • mLab merged into MongoDB proper, so we needed to do a DB migration. That was painless.
  • ubidots changed its API endpoint, so we had to adjust our code.
Generally speaking, the system has worked fine. None of the dependency changes required significant code changes.


A recap of what we have from Part 1:

  • Node.js scripts running on a Raspberry Pi (Model B). If the Pi goes, the program goes.
  • All parameters hard-coded, including URLs, sampling frequency, username, password, etc.
  • The output written to stdout and redirected to a log file, which is later parsed for analysis.
  • A cron job to kick off the measurement script, and another cron job (at a different frequency) to post to ubidots for dashboarding.

We try to use free online services as much as possible, but without hard dependencies, so they can be swapped out easily.

Journey to the 12 factor app

  1. One revision-controlled code base: This was achieved by moving the code to gitlab. We decided to use gitlab (instead of github) since gitlab allows private repos. There is nothing secret here, but the program is also of little interest to other people.
  2. Explicitly declare and isolate dependencies: We use package.json to declare dependencies and 'npm shrinkwrap' for runtime isolation. With shrinkwrap, we have a deterministic set of runtime packages that is exactly the same as in development.
  3. Store config in the environment: We store all configuration information in environment variables (see the sketch after this list). This approach is definitely not without risks, as an env dump will reveal all the credentials.
  4. Backing services as attached resources: Backing service locators/credentials are stored in the config. These include MongoDB (storing the data; we could also use MariaDB or another MySQL variant, but mLab offers MongoDB for free, an offer hard to beat), Redis (distributed cache and locking), ubidots (dashboarding), the Particle API endpoint (for checking the sensor readings), and the Wunderground API endpoint (for local weather).
  5. Build, Release, Run: The app can run by itself at the command line. We also packaged the app into a self-contained Docker image. By packaging the configs into the container, we have a ready-to-run release.
  6. Stateless, share-nothing processes: Stateless processes with no local dependencies, no sticky sessions, no assumptions about the number of instances, and no knowledge of which process is running where.
  7. Port binding: The listening port is defined through an environment variable with a local default (also shown in the sketch after this list). Express is used as a dependency to provide the HTTP service.
  8. Concurrency through the process model: While not strictly necessary, we did refactor the app to allow concurrency.
    1. Use distributed locks (redlock) so only a single process is allowed to take measurements or post to the dashboard. If that process dies, another process will acquire the lock once it expires.
    2. Use redis pub/sub messaging to make measurement results available for posting asynchronously. Part of the reason for doing so is to separate the work of taking measurements from posting to the dashboard or serving HTTP requests. When a new measurement is taken, the result is published to a redis pub/sub channel, and every process receives the new data, so any of them can serve HTTP requests. This method has scalability limitations inherent to redis.
    3. Now we can comfortably and confidently run as many instances of the app as we like, without worrying about over-using the backing services (and being charged for the overage).
  9. Disposability: The processes can be started and stopped easily, and the app is crash-safe. If a process owning a lock crashes, another process will automatically acquire the lock and perform the work.
  10. Dev/prod parity: Since we use the same types of backing services (like redis for pub/sub) in every environment, the environments match "mostly". It turned out that their versions also matter.
  11. Logs as event streams: We log all output to stdout. Concurrency makes logging a first-class concern: when there are more than 3 or 4 instances of the application running, the logs must be accessible in one central place for monitoring and troubleshooting distributed, parallel programs. The Synology DS station turns out to be very handy as a syslog server, with a visual dashboard that refreshes automatically.
  12. Admin tasks as one-off processes: We ended up needing this for data import.
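
To make factors 3 and 7 concrete, here is a minimal sketch of a config module that reads everything from the environment, with a local default for the port. The variable names (MONGO_URL, REDIS_URL, UBIDOTS_TOKEN, SAMPLE_MS, PORT) are illustrative, not necessarily the ones our app uses.

    // config.js -- all settings come from the environment (factor 3); nothing is hard-coded.
    // Variable names are illustrative; adjust them to your deployment.
    module.exports = {
      mongoUrl:     process.env.MONGO_URL,                         // backing data store (factor 4)
      redisUrl:     process.env.REDIS_URL,                         // cache, locks, and pub/sub (factor 4)
      ubidotsToken: process.env.UBIDOTS_TOKEN,                     // dashboard API token
      sampleMs:     parseInt(process.env.SAMPLE_MS, 10) || 60000,  // sampling interval, 1-minute default
      port:         parseInt(process.env.PORT, 10) || 3000         // HTTP port (factor 7), local default
    };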

Design Considerations

Take measurements at defined intervals

The asynchronous nature of JavaScript is both a blessing and a curse. To carry out a periodic task, either setInterval() or recursive setTimeout() will do. If a task may take a long time, recursive setTimeout() has the benefit of preventing the tasks from overlapping each other. But both functions return immediately since they are asynchronous, so if the next functional block depends on the task, it needs to be wrapped into the callback. Asynchronous programming takes new thinking for most people coming from imperative programming backgrounds.
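
A minimal sketch of the recursive setTimeout() pattern, assuming a hypothetical takeMeasurement() that returns a Promise and the env-driven config module sketched earlier:

    const config = require('./config');   // env-driven config, with sampleMs default

    // Take one measurement, then schedule the next run only after this one has finished,
    // so a slow measurement can never overlap with the next one.
    function measureLoop() {
      takeMeasurement()                    // hypothetical async measurement
        .catch(err => console.error('measurement failed:', err))
        .then(() => setTimeout(measureLoop, config.sampleMs));
    }
    measureLoop();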

Another side effect of asynchronous programming is that the main program exits after the asynchronous calls return, even before the callback logic is executed. Therefore, we need a mechanism to keep the main program from exiting without consuming resources (as an infinite loop would). This is where Express comes in handy. In addition to providing a web interface and a REST API endpoint, it keeps the main program's event loop alive efficiently.
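
A sketch of how Express keeps the process alive while also exposing the data; the /readings route and the latestReading variable are illustrative, not the app's actual names:

    const express = require('express');
    const config = require('./config');
    const app = express();

    let latestReading = null;   // updated elsewhere when a new measurement arrives

    // Serve the most recent measurement over HTTP.
    app.get('/readings', (req, res) => res.json(latestReading));

    // listen() keeps the event loop busy, so the main program does not exit
    // while timers and callbacks are still pending.
    app.listen(config.port, () => console.log('listening on port', config.port));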

Parallelism

One concern with multiple parallel processes is race conditions -- the processes stepping on each other and corrupting data. Another concern is wasted resources -- there is no benefit to measuring the temperature 20 times within a second. The solution is a distributed locking mechanism, so each process must acquire the lock before taking the given action. The lock must have an expiration time, so a crashed process doesn't hold the lock indefinitely.
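
A sketch of that locking step using the redlock package against the redis instance from the config (pre-v5 redlock API; the key name and 10-second TTL are illustrative, and takeMeasurement() is hypothetical):

    const Redis = require('ioredis');
    const Redlock = require('redlock');
    const config = require('./config');

    // retryCount: 0 means "don't wait for the lock"; a loser simply skips this round.
    const redlock = new Redlock([new Redis(config.redisUrl)], { retryCount: 0 });

    // Only the instance that wins 'locks:measure' takes the measurement this round.
    // The 10-second TTL guarantees a crashed holder cannot block the others forever.
    async function measureIfLeader() {
      let lock;
      try {
        lock = await redlock.lock('locks:measure', 10000);
      } catch (err) {
        return;                       // another instance holds the lock
      }
      try {
        await takeMeasurement();      // hypothetical async measurement
      } finally {
        await lock.unlock();
      }
    }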

With parallelism, we also need to consider the granularity of computation. Does a single lock cover all the functions the program can perform? If so, only one process is active, and all the others stand by waiting for the lock. We would normally prefer to break the main logic into smaller units (or microservices), so they can be executed by other instances. With microservices, orchestrating the execution becomes an issue of greater scale: messaging and locks, familiar from inter-process communication (IPC) and multi-threaded programming, now need to function across multiple machines, or across the wide area network.
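
A sketch of the redis pub/sub hand-off between those units (the channel name is illustrative); note that a subscribed redis connection is dedicated to pub/sub, so the publisher and subscriber need separate connections:

    const Redis = require('ioredis');
    const config = require('./config');

    const pub = new Redis(config.redisUrl);
    const sub = new Redis(config.redisUrl);   // separate connection: a subscriber can do nothing else

    // Measurement side: announce a new reading to every running instance.
    function publishReading(reading) {
      return pub.publish('measurements', JSON.stringify(reading));
    }

    // All instances: keep a local copy of the latest reading, so any of them
    // can serve HTTP requests or post to the dashboard.
    let latestReading = null;
    sub.subscribe('measurements');
    sub.on('message', (channel, message) => {
      latestReading = JSON.parse(message);
    });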

Runtime Environment

Once we have achieved the goal of making our application a 12factor app, the execution becomes more predictable, since we have a clearly defined and managed contract with the execution environment. We experimented with the following mechanisms.
  • Init: We still use the Raspberry Pi for the ongoing runtime. But instead of cron, we use /etc/inittab as the process monitor to keep the process running. We wrote a "run" script to define the environment variables. 
  • Docker: We also tested by creating a Docker image and launching about 5 instances. This was extremely straightforward thanks to the refactoring. All we need to do is define the environment variables in the Dockerfile and use -p to expose the HTTP port to the host (see the sketch after this list).
  • Cloud Foundry: We initially attempted to bring the Docker image into Cloud Foundry, but found that CF expects the Docker image to come from Docker Hub or a trusted Docker registry; a private registry does not work. It then turned out that using the buildpack is very straightforward, a 4-step process:
    1. Download the code with "git clone ..".
    2. Push the app to Cloud Foundry with cf push my-app-name -c "node index.js".
    3. Generate the deployment manifest with cf create-app-manifest app_name -o manifest.yml.
    4. Edit manifest.yml to add the ENV variables and do a "cf push" to update the app.
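
As an illustration of the Docker route above, a minimal Dockerfile along these lines; the base image tag, variable names, and values are placeholders, not our real configuration:

    # Sketch of the self-contained image; baking config into the image makes it a
    # ready-to-run release, at the cost of having to protect the image itself.
    FROM node:6
    WORKDIR /app
    COPY package.json npm-shrinkwrap.json ./
    RUN npm install --production
    COPY . .
    ENV REDIS_URL=redis://redis.example.com:6379 \
        PORT=3000
    EXPOSE 3000
    CMD ["node", "index.js"]

An instance can then be launched with something like docker run -d -p 8080:3000 <image>, mapping the container's HTTP port to the host.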

At one time, we attempted to launch 10+1 instances of the app (1 under init, 5 in Cloud Foundry, and 5 in Docker), but one kept failing. It turned out we had exceeded the connection limit on our free redis service.

Summary

Refactoring the application, as small and trivial as it is, was a very good exercise. Through the process, we gained more confidence that we can bring the application up and running in multiple but consistent ways, as the 12factor principles proclaim.

In a loosely coupled world, coordinated through locking and messaging, things are fine most of the time, but not as tight as "never miss a beat." For example, the lock expiration time MUST be longer than the time it takes for a task to complete. If the process owning a lock crashes while executing the task, that task execution will be missed until the next period, when it is retried. Missing a "beat" is perfectly acceptable when taking the attic temperature, but totally unacceptable in mission-critical situations. Designing mission-critical systems in a loosely coupled environment is very different from designing a system that works most of the time.

There are other concerns to battle beyond code. For example, how do we automate testing for a distributed system, where pre-existing conditions are needed, such as a lock being in a certain state?

Lastly, there are security concerns. For example, redis is very convenient, but its default security is lacking: as long as I know the password, I can connect to the cache, inspect cached K/V pairs, and listen on pub/sub channels. Another example is the ENV configuration. If it is packed into containers, the containers need to be secured. In the case of Cloud Foundry, the admin console can examine all the ENVs with the cf env command.
