One of the perks of working in multidisciplinary teams for the past few years has been that there I have continuously been surrounded with people with more tech chops than me who would tackle the really hard problems™. While working on a little thing for the Human Tide project, I was reminded how tricky some of those problems can be sometimes.
In a nutshell, I was building a web service that uses the HTML5 Geolocation API to track the location and movement of people as they walk along the shoreline of Herne Bay. The plan was that a group of people – around a hundred – will each carry a mobile phone that sends its GPS coordinates to a server, which stores them in a database. It was a simple enough job, were it not that each phone continuously sends its coordinates back in real time – once per second – which quickly begins to look like a small-scale distributed denial of service attack. Furthermore, as each request needs to be stored in a database, it was not a matter that can easily be solved using caching. I needed a backend that could deal with 100+ requests per second without breaking a sweat. Just the use case node.js, with its asynchronous event-driven model, was supposed to be good at!
Building the little web service went well enough. Within a few hours, I had a small node.js app running on a free Heroku plan. Using the expressjs framework and the node-postgres library, the whole thing was done in under 100 lines of code. It was easy. Too easy. Unfortunately, it performed absolutely terrible.
To simulate the effect of a hundred phones hammering away at the service and measure its performance, I unleashed ApacheBench on my little app. Simulating 5,000 requests from 100 concurrent users is as easy as running:
ab -n 5000 -c 100 my.herokuapp.com
The results were devastating:
Concurrency Level: 100 Time taken for tests: 565.061 seconds Complete requests: 5000 Failed requests: 0 Write errors: 0 Total transferred: 3385800 bytes HTML transferred: 2655000 bytes Requests per second: 8.85 [#/sec] (mean) Time per request: 11301.229 [ms] (mean) Time per request: 113.012 [ms] (mean, across all concurrent requests) Transfer rate: 5.85 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 17 18 0.5 18 25 Processing: 31 11193 2975.6 11176 21911 Waiting: 31 11193 2975.6 11176 21911 Total: 49 11211 2975.5 11195 21928 Percentage of the requests served within a certain time (ms) 50% 11195 66% 11714 75% 12074 80% 12232 90% 13032 95% 14745 98% 16573 99% 17852 100% 21928 (longest request)
My state-of-the-art node.js app managed to handle all of the requests (mostly thanks to Heroku’s routing stack queuing them up), but at a rate of less than 10 requests per second and taking up to 20 (!) seconds per request. Granted, this was on a single Heroku “dyno”, but with these results, it would take at least 10 additional dynos (at a cost of almost $400 a month) to cope with the load. Not exactly what I had in mind. I figured it might have something to do with my app using Heroku’s dev postgresql database plans, but running the same test against a route that didn’t hit the database produced similar results. It seemed that the slowness was the request handling itself. Looking in Heroku’s logs, I saw a depressing number of lines like this, showing severe delays in the “service” step:
heroku[router]: … connect=0ms service=4506ms status=200 bytes=543
I found an article explaining how to use node cluster to better benefit from the multiple cores in each Heroku dyno, which didn’t appear to make much of a difference. It was still just too slow.
And then it hit me. From Heroku’s own tutorial on getting started with Node.js on their platform, I copied the following line to enable logging in Express.js:
app[web.1]: - - - [Mon, 05 Aug 2013 14:05:51 GMT] "GET / HTTP/1.1" 200 - "-" "ApacheBench/2.3"
Deleting that line improved performance… with a factor 50. I ran my little concurrency benchmark again and this time it completed in a fraction of the time, even though I gave it a lot more requests to process:
Concurrency Level: 100 Time taken for tests: 105.179 seconds Complete requests: 50000 Failed requests: 0 Write errors: 0 Total transferred: 33843447 bytes HTML transferred: 26550000 bytes Requests per second: 475.38 [#/sec] (mean) Time per request: 210.358 [ms] (mean) Time per request: 2.104 [ms] (mean, across all concurrent requests) Transfer rate: 314.23 [Kbytes/sec] received Connection Times (ms) min mean[+/-sd] median max Connect: 17 18 1.3 18 59 Processing: 23 192 160.6 135 1400 Waiting: 23 192 160.6 135 1400 Total: 41 210 160.6 153 1419 Percentage of the requests served within a certain time (ms) 50% 153 66% 280 75% 341 80% 366 90% 419 95% 458 98% 546 99% 656 100% 1419 (longest request)
From less than 10 requests per second, it can now handle almost 500 requests per second and rather than 10 seconds, requests now take around 200ms to complete. All because of a single line of logging. In fact, I found that any logging, even simple “console.log”-statements completely destroy Node.JS performance on Heroku.
What I’ve learned from this is that scaling is hard and finding out why things are not performing as expected when using new technology is too, even for very simple apps like this. I’m a bit surprised that a bit of logging has such a severe impact on performance on Heroku and even more surprised that they recommend you to enable logging in their Express tutorial. Turning logging off allowed my app to handle 53x as many requests per second on a single “dyno”. I would have had to pay for quite a few of those dynos, at $34.50 a piece, in order to deal with the same volume of traffic. In the end, my little app performed great and I highly recommend Heroku or similar cloud services for short-term high-availability purposes, but keep an eye out for little things that hurt performance before reaching out for that credit card.