Friday, April 25, 2014

Request Debouncer Pattern

The other day I was brainstorming with a client about a typical issue faced during service restarts in discrete service systems.

Consider a fairly typical architecture as shown below where ancillary services depend on some data generated by a common endpoint in the main Rails app. When the services are restarted, each service instance will send a request to the API server, causing a thundering herd scenario. Even with caching, it would cause equivalent number of the Rails API instances to be blocked serving the same request.

A Request Debouncer will reverse proxy calls to the common endpoint, blocking on all the ancillary service requests until the first call to the Rails API service returns.

Note that this makes sense only for internal GET api endpoints.  There is no support for authorization when using the typical cookie or param based session authentication. However if the auth process can be decoupled from the request, it may be possible to extend this pattern to such scenarios as well.

Possible choice of technology is node.js and simple implementation is shown below as well.


var httpProxy = require('http-proxy'),
connect = require('connect');

var proxy = httpProxy.createServer({target: 'http://localhost:9000'});

    require('./request-debouncer-middleware')(),  // last middleware
    function(req, res) {
        console.log('starting to proxy at %s', new Date());
        try { proxy.web(req, res) }
        catch (err) { res.statusCode = 500; res.end(); console.log('caught err:', err) }
        console.log('ended proxy at %s', new Date());


module.exports = function request_debouncer(cache_time){

    var outstanding = {};
    cache_time =  cache_time || 20 * 1000

    var nullify_promise_after_delay = function(key, delay) { setTimeout(function() { console.log('expiring cache at %s', new Date()); outstanding[key] = null; }, delay); }

    return function debouncer(req, res, next) {
        if (req.method != 'GET') { next() }
        else {
            // TODO - consider protocol and stringified query parameters in the key as well
            var key = req.method + req.url;
            var outstanding_promise = outstanding[key];
            if (outstanding_promise) {
                when(outstanding_promise, function(proxied_res_api_calls, code, headersSent) {
                    // call all the methods that were called on the debounced res
                    // TODO set other headers as well
                    res.statusCode  = code
                    for (var i=0;i < proxied_res_api_calls.length;i++) {
                        var api_call = proxied_res_api_calls[i];
                        res[api_call[0]].apply(res, api_call[1])
                }, function(err) {
                    res.statusCode = 502
                    res.end('Bad Gateway')
            } else {
                // create a promise
                var outstanding_promise = outstanding[key] = defer();

                var api_calls = new Array();

                // all calls that mutate the response are trapped so we can make the same changes to the waiting responses
                var _end = res.end, _write = res.write, _writeHead = res.writeHead, _sendHeader = res.sendHeader;
                res.end = function() { api_calls.push(['end', arguments]); return _end.apply(res, arguments)}
                res.write = function() { api_calls.push(['write', arguments]); return _write.apply(res, arguments)}
                res.writeHead = function() { api_calls.push(['writeHead', arguments]); return _writeHead.apply(res, arguments)}
                res.sendHeader = function() { api_calls.push(['sendHeader', arguments]); return _sendHeader.apply(res, arguments)}

                // resolve the promise when this req is finished
                res.on('finish', function() { outstanding_promise.resolve(api_calls, res.statusCode, res.headersSent); nullify_promise_after_delay(key, cache_time) })
                res.on('close', function() {  outstanding_promise.reject('socket closed before response was sent'); nullify_promise_after_delay(key, 0.1) } )

                // go ahead and proxy the request


Uploaded to github.

Friday, April 11, 2014

Ruby Concurrency Primer - Part 1

In today's multi-core cpu world, concurrent programming is required to scale and fully utilize shared resources. Writing concurrent programs is hard and the choice of concurrency model chosen can greatly affect the application scalability. 

Consider nginx vs Apache Http Server - nginx uses non-blocking event based concurrency to fully utilize a single core, while using process based concurrency to scale across multiple cores. The Apache Http Server on the other hand uses thread (via the worker mpm module) based concurrency to scale on a single core and process based concurrency to scale across multiple cores.  The higher resource consumption per thread incurred by the Apache Server leaves it far behind nginx when scaling to thousands of requests.

While building concurrent applications in Ruby, you will want to consider the following aspects:

- Task profileIs the task cpu bound? Does it make system calls or network i/o? synchronous / asynchronous i/o?

- Choice of concurrency models :  Ruby provides Process and Thread based concurrency model constructs as part of its Standard libraries, while various gems exist to add Event based (eventmachine, rev etc)  and Actor based (celluloid) concurrency constructs to the language.

- Choice of Ruby Interpreter  : multi-core support , GVM(GIL), Copy on Write (COW) support

* Choice of OS matters as well, but assuming you are constrained to one of the linux variants

Lets explore all of these aspects practically with some code examples:

Threads and GVM (GIL)

First, let us consider the case where we need to run some busy cpu work concurrently. 

  def busy_work
    a=i=0; loop { break if 10000000 < (i=i+1); a += (i << 32) }


without_threading = measure_time_taken do
  ITERATIONS.times.each do |i|

with_threading = measure_time_taken do
  ITERATIONS.times.each_slice(NUM_THREADS) do |i|
    NUM_THREADS.times.collect { { busy_work } }.each {|t| t.join}


(YARV 1.9.3)

 RUBY_PLATFORM: x86_64-darwin13.0.0  RUBY_VERSION: 1.9.3 RUBY_INSTALL_NAME: ruby
                                                    user     system      total        real
busy work (10 iterations)                      19.050000   0.010000  19.060000 ( 19.059950)
threaded busy work (10 iters / 2 thrds)        19.690000   0.020000  19.710000 ( 19.709079)
Percentage increase in time taken with Thread concurrency: 3.41

(JRuby 1.9.3)

                                                    user     system      total        real
busy work (10 iterations)                      18.610000   0.830000  19.440000 ( 17.609000)
threaded busy work (10 iters / 2 thrds)        21.910000   0.980000  22.890000 ( 11.353000)
Percentage decrease in time taken with Thread concurrency: 35.53

Right off the bat, we see the choice of interpreter implementation makes a difference in our concurrent implementation. The JRuby version ran 30% faster, whereas YARV 1.9.3 version ran 22% slower. Ruby 2.0 didn't fare any better either.

In this code snippet, you saw YARV's GVM (GIL) come into play for cpu bound tasks. Even though YARV maps its thread implementation to native OS threads, only a single thread gets scheduled to run at a time as it has to acquire and hold the GIL for the duration of execution.

If you are stuck with YARV due to gem requirements etc, the only way to scale pure ruby work is by using process based concurrency,a heavier form of concurrency.

ITERATIONS.times.each_slice(num_threads) do |i|
   pids = num_threads.times.collect { fork { busy_work } }

 RUBY_PLATFORM: x86_64-darwin13.0.0  RUBY_VERSION: 1.9.3 RUBY_INSTALL_NAME: ruby
                                                    user     system      total        real
busy work (10 iterations)                      18.730000   0.010000  18.740000 ( 18.734536)
threaded busy work (10 iters / 2 thrds)        18.810000   0.010000  18.820000 ( 18.816004)
process based busy work (10 iters / 2 processes)  0.000000   0.000000  18.900000 (  9.457859)
Percentage increase in time taken with Thread concurrency: 0.43
Percentage decrease in time taken with Process concurrency (2 processes): 49.52

Using process based concurrency made the whole thing go faster, but as concurrency is increased, it will do so with diminishing returns as forking related system costs start to dominate.

The good news though is that most often real world tasks are not as cpu bound as our test case. Whenever the task involves any system call or network i/o, the YARV VM thread will release the GIL, allowing other threads to be scheduled. So Thread based concurrency can still be used based on your task profile.

  def mixed_work
    start =
    while ( - start) < 5.0
      i=0; loop { break if 400000 < (i=i+1); dirs = Dir.glob(ENV['HOME']) }

 RUBY_PLATFORM: x86_64-darwin13.0.0  RUBY_VERSION: 1.9.3 RUBY_INSTALL_NAME: ruby
                                                    user     system      total        real
mixed work (10 iterations)                     47.600000  12.350000  59.950000 ( 59.942870)
threaded mixed work (10 iters / 2 thrds)       32.960000   8.690000  41.650000 ( 41.650272)
Percentage decrease in time taken with Thread concurrency: 30.52

This also works with synchronous I/O tasks like connecting to the database or Net::HTTP calls:

def sync_io_work{:dbname => 'dev', :user => 'me', :password => 'cccccc', :host => ''}).exec("SELECT pg_sleep(3)")

 RUBY_PLATFORM: x86_64-darwin13.0.0  RUBY_VERSION: 1.9.3 RUBY_INSTALL_NAME: ruby
                                                    user     system      total        real
sync_io work (10 iterations)                    0.000000   0.010000   0.010000 ( 30.161611)
threaded sync_io work (10 iters / 2 thrds)      0.010000   0.000000   0.010000 ( 15.029913)
Percentage decrease in time taken with Thread concurrency: 50.17

Unfortunately, if you are on an older implementation like MRI, results are not so rosy (MRI does not map its threads to OS native threads):

 RUBY_PLATFORM: i686-darwin13.0.0  RUBY_VERSION: 1.8.7 RUBY_INSTALL_NAME: ruby
                                                   user     system      total        real
sync_io work (10 iterations)                   0.000000   0.000000   0.000000 ( 30.067002)
threaded sync_io work (10 iters / 2 thrds)     0.010000   0.000000   0.010000 ( 30.071508)
Percentage increase in time taken with Thread concurrency: 0.02

MRI uses green threads that allow for portability, but at the cost of blocking all threads whenever there is any pure ruby work or synchronous i/o being done by any of the threads in the process. 

However, one can use non-blocking i/o to achieve some form of thread based concurrency in MRI.

  # ...
  # utils.rb

      def async_io_work
        conn ={:dbname => 'dev', :user => 'me', :password => 'cccccc', :host => ''})
        conn.send_query("SELECT pg_sleep(3)")

      def wait_for_results(conns)
        hsh = conns.inject({}) {|a,c| a[] = { :conn => c, :done => false}; a}
        loop do
          break if hsh.values.collect {|v| v[:done]}.all?
          res = select( {|k| !hsh[k][:done]}, nil, nil, 0.1)
          res.first.each {|s| hsh[s][:done] = process_non_blocking_event(hsh[s][:conn])} if res

      def process_non_blocking_event(conn)
        unless conn.is_busy
          res, data = 0, []
          while res != nil
            res = conn.get_result
            res.each {|d| data.push d} unless res.nil?
          return true
        return false

 #  ...
 #  driver.rb"async_io work (#{ITERATIONS} iterations)") do
      without_threading = measure_time_taken do
        wait_for_results(ITERATIONS.times.collect do |i|

 RUBY_PLATFORM: i686-darwin13.0.0  RUBY_VERSION: 1.8.7 RUBY_INSTALL_NAME: ruby
                                                   user     system      total        real
sync_io work (10 iterations)                   0.000000   0.000000   0.000000 ( 30.070065)
async_io work (10 iterations)                  0.010000   0.000000   0.010000 (  3.036897)

Process based concurrency and Pre-forking  


Process based concurrency model, though a heavy option due to the cost incurred in forking a process, stays an option of choice in many popular applications like Unicorn. As we saw above, GVM is one reason why we might want to continue to consider process based concurrency. 

One additional reason why these applications support this model is that thread safety may not be guaranteed when the user application code is loaded and a single thread of execution per process needs to be maintained. So if your application has thread unsafe code, you only option may be process based concurrency.

A common architecture for process based concurrency is a master process forking a worker process to handle each request. However this can be expensive as the process forking costs start to add up.
ITERATIONS.times.each_slice(num_processes) do |i|
  pids = num_processes.times.collect { fork { busy_work } }

Pre-forking is a common server side scaling technique that spins up a pool of persistent worker sub-processes which can then be dispatched the task that needs to be executed concurrently. This complicates the concurrency model implementation though, as some sort of bi-directional communication needs to be maintained between the master server process and the worker processes. The master brokers the request and decides which worker should be sent the work.

# ----------------------
# process_concurrency.rb
# -----------------------"with pre-forking workers, busy load (#{ITERATIONS} iters / #{num_processes} procs)") do
      pool = []
      time_taken = measure_time_taken do
        pool = create_worker_pool(num_processes)
        ITERATIONS.times.each { |i| send_task_to_free_worker(pool, :busy_work) }
        worker = send_task_to_free_worker(pool, :exit); pool.delete(worker)
      end while pool.length > 0

# ----------------------
# utils.rb
# -----------------------

  module ProcessPool
    def create_worker_pool(num_workers)
      num_workers.times.collect do |i|
        m_r,m_w, w_r,w_w= IO.pipe + IO.pipe
        if fork
          m_r.close; w_w.close
          { :r => w_r, :w => m_w }
          # worker code
          m_w.close; w_r.close; r = m_r; w = w_w
          loop { w.write "ready"; cmd_len =;; send(cmd.to_sym) }

    def send_task_to_free_worker(pool, task)
      free_worker = wait_for_free_worker(pool)
      free_worker[:w].write(((cmd_len = task.to_s.length) > 9) ? cmd_len.to_s : "0#{cmd_len}")
      free_worker[:w].write task.to_s

    def wait_for_free_worker(pool)
      read_handles = pool.collect {|v| v[:r]}
      ready_read_handle =;
      pool.detect {|v| v[:r] == ready_read_handle}

The results don't show a huge decrease, but over thousands of connections, it starts to add up.

 RUBY_PLATFORM: i686-darwin13.0.0  RUBY_VERSION: 1.8.7 RUBY_INSTALL_NAME: ruby
                                                                  user     system      total        real
no pre-forking workers, busy work (30 iters / 4 procs)        0.000000   0.000000 320.460000 ( 81.602852)
with pre-forking workers, busy load (30 iters / 4 procs)      0.000000   0.000000   0.000000 ( 77.597048)

That is a lot of code to digest in one sitting. We will continue with process based concurrency constructs and explore event based and actor based concurrency models in the next post in this series.

The code is available on github.

Wednesday, May 2, 2012

Twitter like Api routes for Rails 3

Twitter api routes follow the pattern ''.

The new Rails 3 routing makes it easy to implement this:

scope "/1", :constraints => { :subdomain => /^api(?:-dev)?$/, :format => :json }, :module => 'api/v1' do
  resources :projects, :only => [:index, :show]
  resources :tasks

which allows for routes such as

while neatly organizing your controllers under

For rspec request specs, use the following pattern:

describe "/1/projects" do
  let(:url) { "/1/projects"}
  [:html, :xml].each do |fmt|
    it "rejects explict #{fmt} format" do
      get "#{url}.#{fmt}", nil, {'HTTP_HOST' => ''}
      last_response.status.should be(404)


Sunday, February 6, 2011

Benchmarking a system command

time for((n=0;n<100;n++)); do convert CIMG0022.jpg -rotate 90 CIMG0022_90.jpg; done

real 0m42.438s
user 0m33.934s
sys 0m8.424s

Avg run time: ~ 424ms