
published by noreply@blogger.com (Patrick Lewis) on 2016-05-02 13:22:00 in the "dependencies" category
The third-party gem ecosystem is one of the biggest selling points of Rails development, but the addition of a single line to your project's Gemfile can introduce literally dozens of new dependencies. A compatibility issue in any one of those gems can bring your development to a halt, and the transition to a new major version of Rails requires even more caution when managing your gem dependencies.

In this post I'll illustrate this issue by showing the steps required to get rails_admin (one of the two most popular admin interface gems for Rails) up and running even partially on a freshly-generated Rails 5 project. I'll also identify some techniques for getting unreleased and forked versions of gems installed as stopgap measures to unblock your development while the gem ecosystem catches up to the new version of Rails.

After installing the current beta3 version of Rails 5 with gem install rails --pre and creating a Rails 5 project with rails new, I decided to address the first requirement of my application, an admin interface, by installing the popular Rails Admin gem. The RubyGems page for rails_admin shows that its most recent release, 0.8.1 from mid-November 2015, lists Rails 4 as a requirement. And indeed, trying to install rails_admin 0.8.1 in a Rails 5 app via bundler fails with a dependency error:

Resolving dependencies...
Bundler could not find compatible versions for gem "rails":
In snapshot (Gemfile.lock):
rails (= 5.0.0.beta3)

In Gemfile:
rails (< 5.1, >= 5.0.0.beta3)

rails_admin (~> 0.8.1) was resolved to 0.8.1, which depends on
rails (~> 4.0)
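
For reference, the Gemfile at this point contained little more than the generated Rails entry and the rails_admin line; a minimal sketch matching the constraints shown in the output above:

gem 'rails', '>= 5.0.0.beta3', '< 5.1'
gem 'rails_admin', '~> 0.8.1'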

I took a look at the GitHub page for rails_admin and noticed that recent commits make reference to Rails 5, which is an encouraging sign that its developers are working on adding compatibility with Rails 5. Looking at the gemspec in the master branch on GitHub shows that the rails_admin gem dependency has been broadened to include both Rails 4 and 5, so I updated my app's Gemfile to install rails_admin directly from the master branch on GitHub:

gem 'rails_admin', github: 'sferik/rails_admin'

This solved the above dependency of rails_admin on Rails 4 but revealed some new issues with gems that rails_admin itself depends on:

Resolving dependencies...
Bundler could not find compatible versions for gem "rack":
In snapshot (Gemfile.lock):
rack (= 2.0.0.alpha)

In Gemfile:
rails (< 5.1, >= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
actionmailer (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
actionpack (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
rack (~> 2.x)

rails_admin was resolved to 0.8.1, which depends on
rack-pjax (~> 0.7) was resolved to 0.7.0, which depends on
rack (~> 1.3)

rails (< 5.1, >= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
actionmailer (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
actionpack (= 5.0.0.beta3) was resolved to 5.0.0.beta3, which depends on
rack-test (~> 0.6.3) was resolved to 0.6.3, which depends on
rack (>= 1.0)

rails_admin was resolved to 0.8.1, which depends on
sass-rails (< 6, >= 4.0) was resolved to 5.0.4, which depends on
sprockets (< 4.0, >= 2.8) was resolved to 3.6.0, which depends on
rack (< 3, > 1)

This bundler output shows a conflict where Rails 5 depends on rack 2.x while rails_admin's rack-pjax dependency depends on rack 1.x. I ended up resorting to a Google search which led me to the following issue in the rails_admin repo: https://github.com/sferik/rails_admin/issues/2532

Installing rack-pjax from GitHub:

gem 'rack-pjax', github: 'afcapel/rack-pjax', branch: 'master'

resolves the rack dependency conflict, and bundle install now completes without error. Things are looking up! At least until you run rails g rails_admin:install and are presented with this mess:

/Users/patrick/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/actionpack-5.0.0.beta3/lib/action_dispatch/middleware/stack.rb:108:in `assert_index': No such middleware to insert after: ActionDispatch::ParamsParser (RuntimeError)
from /Users/patrick/.rbenv/versions/2.3.0/lib/ruby/gems/2.3.0/gems/actionpack-5.0.0.beta3/lib/action_dispatch/middleware/stack.rb:80:in `insert_after'

This error is more difficult to understand, especially given the fact that the culprit (the remotipart gem) is not actually mentioned anywhere in the error. Thankfully, commenters on the above-mentioned rails_admin issue #2532 were able to identify the remotipart gem as the source of this error and provide a link to a forked version of that gem which allows rails_admin:install to complete successfully (albeit with some functionality still not working).

In the end, my Gemfile looked something like this:

gem 'rails_admin', github: 'sferik/rails_admin'
# Use github rack-pjax to fix dependency versioning issue with Rails 5
# https://github.com/sferik/rails_admin/issues/2532
gem 'rack-pjax', github: 'afcapel/rack-pjax'
# Use forked remotipart until following issues are resolved
# https://github.com/JangoSteve/remotipart/issues/139
# https://github.com/sferik/rails_admin/issues/2532
gem 'remotipart', github: 'mshibuya/remotipart', ref: '3a6acb3'

That's a total of three unreleased gem versions, including the forked remotipart gem that breaks some functionality, just to get rails_admin installed and running well enough to start working with. It also leaves some technical debt in the form of comments about follow-up tasks to revisit the various gems as Rails 5-compatible versions are released.

This process has been a reminder that when working in a Rails 4 app it's easy to take for granted the ability to install gems and have them 'just work' in your application. When dealing with pre-release versions of Rails, don't be surprised when you have to do some investigative work to figure out why gems are failing to install or work as expected.

My experience has also underscored the importance of understanding all of your application's gem dependencies and having some awareness of their developers' intentions when it comes to keeping their gems current with new versions of Rails. As a developer it's in your best interest to minimize the number of dependencies in your application, because adding just one gem (which may turn out to have a dozen dependencies of its own) can greatly increase the potential for encountering incompatibilities.

published by noreply@blogger.com (Greg Sabino Mullane) on 2016-04-29 00:04:00 in the "postgres" category

Postgres has a wonderful feature called concurrent indexes. It allows you to create indexes on a table without blocking reads OR writes, which is quite a handy trick. There are a number of circumstances in which one might want to use concurrent indexes, the most common one being not blocking writes to production tables. There are a few other use cases as well, including:


Photograph by Nicholas A. Tonelli

  • Replacing a corrupted index
  • Replacing a bloated index
  • Replacing an existing index (e.g. better column list)
  • Changing index parameters
  • Restoring a production dump as quickly as possible

In this article, I will focus on that last use case: restoring a database as quickly as possible. We recently upgraded a client from a very old version of Postgres to the current version (9.5 as of this writing). The fact that pg_upgrade was not an option should give you a clue as to just how old the "very old" version was!

Our strategy was to create a new 9.5 cluster, get it optimized for bulk loading, import the globals and schema, stop write connections to the old database, transfer the data from old to new, and bring the new one up for reading and writing.

The goal was to reduce the application downtime as much as reasonably possible. To that end, we did not want to wait until all the indexes were created before letting people back in, as testing showed that the index creations were the longest part of the process. We used the "--section" flags of pg_dump to create pre-data, data, and post-data sections. All of the index creation statements appeared in the post-data file.

Because the client determined that it was more important for the data to be available, and the tables writable, than it was for them to be fully indexed, we decided to try using CONCURRENT indexes. In this way, writes to the tables could happen at the same time that they were being indexed - and those writes could occur as soon as the table was populated. That was the theory anyway.

The migration went smoothly - the data was transferred over quickly, the database was restarted with a new postgresql.conf (e.g. to turn fsync back on), and clients were able to connect, albeit with some queries running slower than normal. We parsed the post-data file and created a new file in which all the CREATE INDEX commands were changed to CREATE INDEX CONCURRENTLY (a sketch of that rewrite is below). We kicked that off, but after a certain amount of time, it seemed to freeze up.
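
The rewrite itself only takes a few lines of Ruby. This is just a sketch of the idea - the file names are illustrative and the actual script likely handled more edge cases:

# Sketch: turn the CREATE INDEX statements from the post-data dump
# into CREATE INDEX CONCURRENTLY statements.
sql = File.read('post-data.sql')
File.write('post-data-concurrent.sql',
           sql.gsub(/^CREATE( UNIQUE)? INDEX/, 'CREATE\1 INDEX CONCURRENTLY'))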


The frogurt is also cursed.

Looking closer showed that the CREATE INDEX CONCURRENTLY statement was waiting, and waiting, and never able to complete - because other transactions were not finishing. This is why concurrent indexing is both a blessing and a curse. The concurrent index creation is so polite that it never blocks writers, but this means processes can charge ahead and be none the wiser that the create index statement is waiting on them to finish their transaction. When you also have a misbehaving application that stays "idle in transaction", it's a recipe for confusion. (Idle in transaction is what happens when your application keeps a database connection open without doing a COMMIT or ROLLBACK). A concurrent index can only completely finish being created once any transaction that has referenced the table has completed. The problem was that because the create index did not block, the app kept chugging along, spawning new processes that all ended up in idle in transaction.

At that point, the only way to get the concurrent index creation to complete was to forcibly kill all the other idle in transaction processes, forcing them to rollback and causing a lot of distress for the application. In contrast, a regular index creation would have caused other processes to block on their first attempt to access the table, and then carried on once the creation was complete, and nothing would have to rollback.

Another business decision was made - the concurrent indexes were nice, but we needed the indexes, even if some had to be created as regular indexes. Many of the indexes were able to be completed (concurrently) very quickly - and they were on not-very-busy tables - so we plowed through the index creation script, and simply canceled any concurrent index creations that were being blocked for too long. This only left a handful of uncreated indexes, so we simply dropped the "invalid" indexes (these appear when a concurrent index creation is interrupted), and reran with regular CREATE INDEX statements.

The lesson here is that nothing comes without a cost. The overly polite concurrent index creation is great at letting everyone else access the table, but it also means that large complex transactions can chug along without being blocked, and have to have all of their work rolled back. In this case, things worked out as we did 99% of the indexes as CONCURRENT, and the remaining ones as regular. All in all, the use of concurrent indexes was a big win, and they are still an amazing feature of Postgres.


published by noreply@blogger.com (Elizabeth Garrett) on 2016-04-27 18:59:00 in the "community" category

We all love a good ending. I was happy to hear that one of End Point's clients, Cybergenetics, was involved in a case this week to free a falsely imprisoned man, Darryl Pinkins.

Darryl was convicted of a crime in Indiana in 1991. In 1995 Pinkins sought the help of the Innocence Project. His attorney Frances Watson and her students turned to Cybergenetics and their DNA interpretation technology called TrueAllele® Casework. The TrueAllele DNA identification results exonerated Pinkins. The Indiana Court of Appeals dropped all charges against Pinkins earlier this week and he walked out of jail a free man after fighting for 24 years to clear his name.

TrueAllele can separate out the people who contributed their DNA to a mixed DNA evidence sample. It then compares the separated out DNA identification information to other reference or evidence samples to see if there is a DNA match.

End Point has worked with Cybergenetics since 2003 and consults with them on security, database infrastructure, and website hosting. End Point congratulates Cybergenetics on their success in being part of the happy ending for Darryl Pinkins and his family!

More of the story is available at Cybergenetics' Newsroom or the Chicago Tribune.


published by noreply@blogger.com (Yaqi Chen) on 2016-04-27 17:49:00 in the "Liquid Galaxy" category

Nowadays, virtual reality is one of the hottest topics in tech, with VR enabling users to enter immersive environments built up by computer technology. I attended Mobile World Congress 2016 a few weeks ago, and it was interesting to see people sit next to one another and totally ignore one another while they were individually immersed in their own virtual reality worlds.

When everyone is so addicted to their little magic boxes, they tend to lose their connections with people around them. End Point has developed a new experience in which users can watch and share their virtually immersive world together. This experience is called the Liquid Galaxy.

When a user stands in front of Liquid Galaxy and is surrounded by a multitude of huge screens arranged in a semicircle, he puts not only his eyes but his whole body into an unprecedented 3D space. These screens are big enough to cover the audience's entire peripheral vision and bring great visual stimulation from all directions. When using the Liquid Galaxy system, the users become fully immersed in the system and the imagery they view.


Movie Night at End Point

This digital chamber can be considered a sort of VR movie theater, where an audience can enjoy the same content, and probably the same bucket of popcorn! While this setup makes the Liquid Galaxy a natural fit for any sort of exhibit, many End Point employees have also watched full-length feature movies on the system during our monthly Movie Night at our headquarters office in Manhattan. This sort of shared experience is not possible with typical VR, because unlike VR the Liquid Galaxy serves a larger audience and presents stories in a more interactive way.


For most meetings, exhibitions, and other special occasions, the Liquid Galaxy helps to provide an amazing and impactful experience to the audience. Any scenario can be built for users to explore, and geospatial data sets can be presented immersively.

With the ability to serve a group of people simultaneously, Liquid Galaxy increases the impact of content presentation and brings a revolutionary visual experience to its audiences. If you'd like to learn more, you can call us at 212-929-6923, or contact us here.


published by noreply@blogger.com (Ben Witten) on 2016-04-22 21:52:00 in the "Liquid Galaxy" category

The Liquid Galaxy, an immersive and panoramic presentation tool, is the perfect fit for any time you want to grab the attention of your audience and leave a lasting impression. The system has applications in a variety of industries (which include museums and aquariums, hospitality and travel, research libraries at universities, events, and real estate, to name a few) but no industry's demand rivals the popularity seen in real estate.

The Liquid Galaxy provides an excellent tool for real estate brokerages and land use agencies to showcase their properties with multiple large screens showing 3D building models and complete Google Earth data. End Point can configure the Liquid Galaxy to highlight specific buildings, areas on the map, or any set of correlated land use data, which can then be shown in a dazzling display that forms the centerpiece of a conference room or lobby. We can program the Liquid Galaxy to show floor plans, panoramic interior photos, and even Google Street View "walking tours" around a given property.

A Liquid Galaxy in your office will provide your firm with a sophisticated and cutting edge sales tool. You will depart from the traditional ways of viewing, presenting, and even managing real estate sites by introducing your clients to multiple prime locations and properties in a wholly unique, professional and visually stunning manner. We can even highlight amenities such as mass transit, road usage, and basic demographic data for proper context.

The Liquid Galaxy allows your clients an in-depth contextual tour of multiple listings in the comfort of your office without having to travel to multiple locations. Liquid Galaxy brings properties to the client instead of taking the client to every property. This saves time and energy for both you and your prospective clients, and sets your brokerage apart as a technology leader in the market.

If you'd like to learn more about the Liquid Galaxy, you can call us at 212-929-6923, or contact us here.


published by noreply@blogger.com (Peter Hankiewicz) on 2016-04-21 23:00:00 in the "AngularJS" category

Introduction

The current state of web browser development is still problematic. We have multiple browsers, each with plenty of versions, running on multiple operating systems and devices. All of this unfortunately makes it impossible to be sure that our code will work on every possible browser and system. With proper testing we can make our product stable and good enough for production, but we can't expect that everything will go smoothly - it won't. Somewhere out there is a user sitting in a small office running outdated software, Internet Explorer 6 for example. Usually you want to support as many users as possible, so here I will explain how to find the ones hitting errors. Then you just need to decide whether an issue is worth fixing for them.

Browser error logging

What can really help us, and is really simple to do, is browser error logging. Every time an error occurs on the client side (the browser will generate an error that the user most likely won't see), we can log this error on the server side, even with a stack trace. Let's see an example:

window.onerror = function (errorMsg, url, lineNumber, column, errorObj) {
    // Send the error details to the server so they can be logged there
    $.post('//your.domain/client-logs', {
        errorMsg: errorMsg,
        url: url,
        lineNumber: lineNumber,
        column: column,
        errorObj: errorObj
    });

    // Returning false lets the browser run its own error handler as well
    return false;
};

What do we have here? We bind a function to the window.onerror event. Every time an error occurs, this function will be called with several arguments:

  • errorMsg - this is an error message, usually describing why an error occurred (for example: "Uncaught ReferenceError: heyyou is not defined"),
  • url - current url location,
  • lineNumber - script line number where an error happened,
  • column - the same as above, but the column number,
  • errorObj - the most important part here, an error object with a stack trace included.

What to do with this data? You will probably want to send it to a server and save it, to be able to go through this log from time to time like we do in our example:

$.post('//your.domain/client-logs', {
    errorMsg: errorMsg,
    url: url,
    lineNumber: lineNumber,
    column: column,
    errorObj: errorObj
});
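
On the server side, the endpoint receiving these POSTs can be very simple. Assuming a Rails backend (the route, controller, and parameter handling below are illustrative, not from any particular library), a minimal sketch might look like this:

# config/routes.rb:
#   post '/client-logs', to: 'client_logs#create'

class ClientLogsController < ApplicationController
  # The request comes from our own JavaScript, not a form,
  # so skip the CSRF check for this endpoint.
  skip_before_action :verify_authenticity_token

  def create
    Rails.logger.error(
      "[client] #{params[:errorMsg]} " \
      "at #{params[:url]}:#{params[:lineNumber]}:#{params[:column]}"
    )
    head :no_content
  end
end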

This is very helpful. With proper unit and functional testing the errors collected here are usually minor, but sometimes you may catch a critical issue before a larger number of clients actually runs into it. That is a big win.

JSNLog

JSNLog is a library that helps with client error logging. You can find it here: http://jsnlog.com/. I can fully recommend it; it also handles AJAX calls, timeouts, and much more.

Client error notification

If you want to be serious and professional, every issue should be reported to the user in some way. On the other hand, this can backfire if the user ends up spammed with notifications about minor errors. It's not easy to find the best approach, because it's not easy to judge an error's priority.

From experience, if you have a system where users are logged in, you can create a simple script that emails the affected user with a question about the issue, as sketched below. You can set a limit to avoid sending too many messages. If the user is interested, they can always reply and explain the issue; usually the user will appreciate this attention.
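
A minimal sketch of that idea, assuming a Rails app with ActionMailer; the mailer class, counter column, and limit are hypothetical names, not from any existing library:

MAX_ERROR_EMAILS = 3

def notify_user_about_error(user, error_msg)
  # Stop once we have already contacted this user a few times.
  return if user.error_emails_sent >= MAX_ERROR_EMAILS

  # ErrorMailer and error_emails_sent are illustrative names.
  ErrorMailer.client_error(user, error_msg).deliver_later
  user.increment!(:error_emails_sent)
end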

Error logging in Angular

It's worth mentioning how we can handle error logging in the Angular framework, with useful stack traces and error descriptions. See an example below:

First we need to override the default logging service in Angular:

angular.module('logToServer', [])
  .service('$log', function () {
    this.log = function (msg) {
      JL('Angular').trace(msg);
    };
    this.debug = function (msg) {
      JL('Angular').debug(msg);
    };
    this.info = function (msg) {
      JL('Angular').info(msg);
    };
    this.warn = function (msg) {
      JL('Angular').warn(msg);
    };
    this.error = function (msg) {
      JL('Angular').error(msg);
    };
  });

Then we override the exception handler to use our logger:

factory('$exceptionHandler', function () {
    return function (exception, cause) {
      JL('Angular').fatalException(cause, exception);
      throw exception;
    };
  });

We also need an interceptor to handle AJAX call errors. It uses the $q service to propagate rejections, like this:

factory('logToServerInterceptor', ['$q', function ($q) {
    var myInterceptor = {
      'request': function (config) {
          config.msBeforeAjaxCall = new Date().getTime();

          return config;
      },
      'response': function (response) {
        if (response.config.warningAfter) {
          var msAfterAjaxCall = new Date().getTime();
          var timeTakenInMs = msAfterAjaxCall - response.config.msBeforeAjaxCall;

          if (timeTakenInMs > response.config.warningAfter) {
            JL('Angular.Ajax').warn({ 
              timeTakenInMs: timeTakenInMs, 
              config: response.config, 
              data: response.data
            });
          }
        }

        return response;
      },
      'responseError': function (rejection) {
        var errorMessage = "timeout";
        if (rejection && rejection.status && rejection.data) {
          errorMessage = rejection.data.ExceptionMessage;
        }
        JL('Angular.Ajax').fatalException({ 
          errorMessage: errorMessage, 
          status: rejection.status, 
          config: rejection.config }, rejection.data);
        
          return $q.reject(rejection);
      }
    };

    return myInterceptor;
  }]);

Here is how it all looks together:

angular.module('logToServer', [])
  .service('$log', function () {
    this.log = function (msg) {
      JL('Angular').trace(msg);
    };
    this.debug = function (msg) {
      JL('Angular').debug(msg);
    };
    this.info = function (msg) {
      JL('Angular').info(msg);
    };
    this.warn = function (msg) {
      JL('Angular').warn(msg);
    };
    this.error = function (msg) {
      JL('Angular').error(msg);
    };
  })
  .factory('$exceptionHandler', function () {
    return function (exception, cause) {
      JL('Angular').fatalException(cause, exception);
      throw exception;
    };
  })
  .factory('logToServerInterceptor', ['$q', function ($q) {
    var myInterceptor = {
      'request': function (config) {
          config.msBeforeAjaxCall = new Date().getTime();

          return config;
      },
      'response': function (response) {
        if (response.config.warningAfter) {
          var msAfterAjaxCall = new Date().getTime();
          var timeTakenInMs = msAfterAjaxCall - response.config.msBeforeAjaxCall;

          if (timeTakenInMs > response.config.warningAfter) {
            JL('Angular.Ajax').warn({ 
              timeTakenInMs: timeTakenInMs, 
              config: response.config, 
              data: response.data
            });
          }
        }

        return response;
      },
      'responseError': function (rejection) {
        var errorMessage = "timeout";
        if (rejection && rejection.status && rejection.data) {
          errorMessage = rejection.data.ExceptionMessage;
        }
        JL('Angular.Ajax').fatalException({ 
          errorMessage: errorMessage, 
          status: rejection.status, 
          config: rejection.config }, rejection.data);
        
          return $q.reject(rejection);
      }
    };

    return myInterceptor;
  }]);

This should handle most of the errors that could happen in the Angular framework. Here I used the JSNLog library to handle sending logs to a server.

Almost the end

There are multiple techniques for logging errors on the client side. It does not really matter which one you choose; it only matters that you do it, especially since it takes very little time to set up and pays off handsomely in the end.


published by noreply@blogger.com (Phin Jensen) on 2016-04-19 17:21:00 in the "Conference" category

Another talk from MountainWest RubyConf that I enjoyed was How to Build a Skyscraper by Ernie Miller. This talk was less technical and instead focused on teaching principles and ideas for software development by examining some of the history of skyscrapers.

Equitable Life Building

Constructed from 1868 to 1870 and considered by some to be the first skyscraper, the Equitable Life Building was, at 130 feet, the tallest building in the world at the time. An interesting problem arose when designing it: it was too tall for stairs. If a lawyer's office was on the seventh floor of the building, he wouldn't want his clients to walk up six flights of stairs to meet with him.

Elevators and hoisting systems existed at the time, but they had one fatal flaw: there were no safety systems if the rope broke or was cut. While working on converting a sawmill to a bed frame factory, a man named Elisha Otis had the idea for a system to stop an elevator if its rope was cut. He and his sons designed the system and implemented it at the factory. At the time, he didn't think much of the design, and didn't patent it or try to sell it.

Otis' invention became popular when he showcased it at the 1854 New York World's Fair with a live demo. Otis stood in front of a large crowd on a platform and ordered the rope holding it to be cut. Instead of plummeting to the ground, the platform was caught by the safety system after falling only a few inches.

Having a way to safely and easily travel up and down many stories literally flipped the value proposition of skyscrapers upside down. Where lower floors had been more desirable because they were easy to access, higher floors became more coveted, since they were now just as easy to reach but offered the advantages that come with height, such as better air, more light, and less noise. A solution that seems unremarkable to you might just change everything for others.

When the Equitable Life Building was first constructed, it was described as fireproof. Unfortunately, it didn't work out quite that way. On January 9, 1912, the timekeeper for a cafe in the building started his day by lighting the gas in his office. Instead of disposing properly of the match, he distractedly threw it into the trashcan. Within 10 minutes, the entire office was engulfed in flame, which spread to the rest of the building, completely destroying it and killing six people.

Never underestimate the power of people to break what you build.

Home Insurance Building

The Home Insurance Building, constructed in 1884, was the first building to use a fireproof metal frame to bear the weight of the building, as opposed to using load-bearing masonry. The building was designed by William LeBaron Jenney, who was struck by inspiration when his wife placed a heavy book on top of a birdcage. From Wikipedia:

"
According to a popular story, one day he came home early and surprised his wife who was reading. She put her book down on top of a bird cage and ran to meet him. He strode across the room, lifted the book and dropped it back on the bird cage two or three times. Then, he exclaimed: “It works! It works! Don't you see? If this little cage can hold this heavy book, why can't an iron or steel cage be the framework for a whole building?”
"

With this idea, he was able to design and build the Home Insurance Building to be 10 stories and 138 feet tall while weighing only a third of what the same building in stone would have weighed, because he was able to find inspiration from unexpected places.

Monadnock Building

The Monadnock Building was designed by Daniel Burnham and John Wellborn Root. Burnham preferred simple and functional designs and was known for his stinginess, while Root was more artistically inclined and known for his detailed ornamentation on building designs. Despite their philosophical differences, they were one of the world's most successful architectural firms.

One of the initial sketches (shown) for the building included Ancient Egyptian-inspired ornamentation with slight flaring at the top. Burnham didn't like the design, as illustrated in a letter he wrote to the property manager:

"
My notion is to have no projecting surfaces or indentations, but to have everything flush .... So tall and narrow a building must have some ornament in so conspicuous a situation ... [but] projections mean dirt, nor do they add strength to the building ... one great nuisance [is] the lodgment of pigeons and sparrows.
"

While Root was on vacation, Burnham worked to re-design the building to be straight up-and-down with no ornamentation. When Root returned, he initially objected to the design but eventually embraced it, declaring that the heavy lines of the Egyptian pyramids captured his imagination. We can learn a simple lesson from this: Learn to embrace constraints.

When construction was completed in 1891, the building was a total of 17 stories (including the attic) and 215 feet tall. At the time, it was the tallest commercial structure in the world. It is also the tallest load-bearing brick building constructed. In fact, to support the weight of the entire building, the walls at the bottom had to be six feet (1.8 m) wide.

Because of the soft soil of Chicago and the weight of the building, it was designed to settle 8 inches into the ground. By 1905, it had settled that much and several inches more, which led to the reconstruction of the first floor. By 1948, it had settled 20 inches, making the entrance a step down from the street. If you only focus on profitability, don't be surprised when you start sinking.

Fuller Flatiron Building

The Flatiron building, constructed in 1902, was also designed by Daniel Burnham, although Root had died of pneumonia during the construction of the Monadnock building. The Flatiron building presented an interesting problem because it was to be built on an odd triangular plot of land. In fact, the building was only 6 and a half feet wide at the tip, which obviously wouldn't work with the load-bearing masonry design of the Monadnock building.

So the building was constructed using a steel-frame structure that would keep the walls to a practical size and allow them to fully utilize the plot of land. The space you have to work with should influence how you build and you should choose the right materials for the job.

During construction of the Flatiron building, New York locals called it "Burnham's Folly" and began to place bets on how far the debris would fall when a wind storm came and knocked it over. However, an engineer named Corydon Purdy had designed a steel bracing system that would protect the building from wind four times as strong as it would ever feel. During a 60-mph windstorm, tenants of the building claimed that they couldn't feel the slightest vibration inside the building. This gives us another principle we can use: Testing makes it possible to be confident about what we build, even when others aren't.

40 Wall Street v. Chrysler Building


40 Wall Street
Photo by C R, CC BY-SA 2.0

The stories of 40 Wall Street and the Chrysler Building start with two architects, William Van Alen and H. Craig Severance. Van Alen and Severance established a partnership together in 1911 which became very successful. However, as time went on, their personal differences caused strain in the relationship and they separated on unfriendly terms in 1924. Soon after the partnership ended, they found themselves to be in competition with one another. Severance was commissioned to design 40 Wall Street while Van Alen would be designing the Chrysler Building.

The Chrysler Building was initially announced in March of 1929, planned to be built 808 feet tall. Just a month later, Severance was one-upping Van Alen by announcing his design for the building, coming in at 840 feet. By October, Van Alen announced that the steel work of the Chrysler Building was finished, putting it as the tallest building in the world, over 850 feet tall. Severance wasn't particularly worried, as he already had plans in motion to build higher. Even after reports came in that the Chrysler Building had a 60-foot flagpole at the top, Severance made more changes for 40 Wall Street to be taller than the Chrysler Building. These plans were enough for the press to announce that 40 Wall Street had won the race to build highest since construction of the Chrysler Building was too far along to be built any higher.


The Chrysler Building
Photo by Chris Parker, CC BY-ND 2.0

Unfortunately for Severance, the 60-foot flagpole wasn't a flagpole at all. Instead, it was part of a 185-foot steel spire which Van Alen had designed and had built and shipped to the construction site in secret. On October 23rd, 1929, the pieces of the spire were hoisted to the top of the building and installed in just 90 minutes. The spire was initially mistaken for a crane, and it wasn't until 4 days after it was installed that the spire was recognized as a permanent part of the building, making it the tallest in the world. When all was said and done, 40 Wall Street came in at 927 feet, with a cost of $13,000,000, while the Chrysler Building finished at 1,046 feet and cost $14,000,000.

There are two morals we can learn from this story: There is opportunity for great work in places nobody is looking and big buildings are expensive, but big egos are even more so.

Empire State Building

The Empire State Building was built in just 13 months, from March 17, 1930, to April 11, 1931. Its primary architects were Richmond Shreve and William Lamb, who were part of the team assembled by Severance to design 40 Wall Street. They were joined by Arthur Harmon to form Shreve, Lamb, & Harmon. Lamb's partnership with Shreve was not unlike that of Van Alen and Severance or Burnham and Root. Lamb was more artistic in his architecture, but he was also pragmatic, using his time and design constraints to shape the design and characteristics of the building.

Lamb completed the building drawings in just two weeks, designing from the top down, which was a very unusual method. When designing the building, Lamb made sure that even when he was making concessions, using the building would be a pleasant experience for those who mattered. Lamb was able to complete the design so quickly because he reused previous work, specifically the Reynolds Building in Winston-Salem, NC, and the Carew Tower in Cincinnati, Ohio.

In November of 1929, Al Smith, who commissioned the building as head of Empire State, Inc., announced that the company had purchased land next to the plot where the construction would start, in order to build higher. Shreve, Lamb, and Harmon were opposed to this idea since it would force tenants of the top floors to switch elevators on the way up, and they were focused on making the experience as pleasant as possible.

John Raskob, one of the main people financing the building, wanted the building to be taller. While looking at a small model of the building, he reportedly said "What this building needs is a hat!" and proposed his idea of building a 200-foot mooring tower for a zeppelin at the top of the building, despite several problems such as high winds making the idea unfeasible. But Raskob felt that he had to build the tallest building in the world, despite all of the problems and the higher cost that a taller building would introduce, because people can rationalize anything.

There are two more things we should note about the story of the Empire State Building. First, despite the fact that it was designed top-to-bottom, it wasn't built like that. No matter how something is designed, it needs to be built from the bottom up. Second, the Empire State Building was a big accomplishment in architecture and construction, but at no small cost. Five people died during the construction of the building, and that may seem like a small number considering the scale of the project, but we should remember that no matter how important speed is, it's not worth losing people over.

United Nations Headquarters

The United Nations Headquarters was constructed between 1948 and 1952. It wasn't built to be particularly tall (less than half the height of the Empire State Building), but it came with its own set of problems. As you can see in the picture, the building had a lot of windows. The wide faces of the building are almost completely covered in windows. These windows offer great lighting and views, but when the sun shines on them, they generate a lot of heat, not unlike a greenhouse. Unless you're building a greenhouse, you probably don't want that. It doesn't matter how pretty your building is if nobody wants to occupy it.

The solution to the problem was created years before by an engineer named Willis Carrier, who created an "Apparatus for Treating Air" (now called an air conditioner) to keep the paper in a printing press from being wrinkled. By creating this air conditioner, Carrier didn't just make something cool. He made something cool that everyone can use. Without it, buildings like the UNHQ could never have been built.

Willis (or Sears) Tower


Willis Tower, left

The Willis Tower was built between 1970 and 1973. Fazlur Rahman Khan was hired as the structural engineer for the Willis Tower, which needed to be very tall in order to house all of the employees of Sears. A steel frame design wouldn't work well in Chicago (also known as the Windy City), since such frames tend to bend and sway in heavy winds, which can cause discomfort for people on higher floors, even causing sea-sickness in some cases.

To solve the problem, Khan invented a "bundled tube structure", which put the structure of a building on the outside as a thin tube. Using the tube structure not only allowed Khan to build a higher tower, but it also increased floor space and cost less per unit area. But these innovations only came because Khan realized that the higher you build, the windier it gets.

Taipei 101

Taipei 101 was constructed from 1999 to 2004 near the Pacific Ring of Fire, which is the most seismically active part of the world. Earthquakes present very different problems from the wind since they affect a building at its base, instead of the top. Because of the location of the building it needed to be able to withstand both typhoon-force winds (up to 130 mph) and extreme earthquakes, which meant that it had to be designed to be both structurally strong and flexible.

To accomplish this, the building was constructed with high-performance steel, 36 columns, and 8 "mega-columns" packed with concrete connected by outrigger trusses which acted similarly to rubber bands. During the construction of the building, Taipei was hit by a 6.8-magnitude earthquake which destroyed smaller buildings around the skyscraper, and even knocked cranes off of the incomplete building, but when the building was inspected it was found to have no structural damage. By being rigid where it has to be and flexible where it can afford to be, Taipei 101 is one of the most stable buildings ever constructed.

Of course, being flexible introduces the problem of discomfort for people in higher parts of the building. To solve this problem, Taipei 101 was built with a massive 728-ton (1,456,000 lb) tuned mass damper, which helps to offset the swaying of the building in strong winds. We can learn from this damper: When the winds pick up, it's good to have someone (or something) at the top pulling for you.

Burj Khalifa

The newest and tallest building on our list, the Burj Khalifa was constructed from 2004 to 2009. With the Burj Khalifa, design problems came with incorporating adequate safety features. After the terrorist attacks of September 11, 2001, the problem of evacuation became more prominent in construction and design of skyscrapers. When it comes to an evacuation, stairs are basically the only way to go, and going down many flights of stairs can be as difficult as going up them, especially if the building is burning around you. The Burj Khalifa is nearly twice as tall as the old World Trade Center, and in the event of an emergency, walking down nearly half a mile of stairs won't work out.

So how do the people in the Burj Khalifa get out in an emergency? Well, they don't. Instead, the Burj Khalifa is designed with periodic safe rooms protected by reinforced concrete and fireproof sheeting that will protect the people inside for up to two hours during a fire. Each room has a dedicated supply of air, which is delivered through fire-resistant pipes. These safe rooms are placed every 25 floors or so, because a safe space won't do any good if it can't be reached.

You may know that the most common cause of death in a fire is actually smoke inhalation, not the fire itself. To deal with this, the Burj Khalifa has a network of high powered fans throughout which will blow clean air from outside into the building and keep the stairwells leading to the safe rooms clear of smoke. A very important part of this is pushing the smoke out of the way, eliminating the toxic elements.

It's important to remember that these safe spaces, as useful as they may be, are not a substitute for rescue workers coming to aid the people trapped in the building. The safe rooms are only there to protect people who can't help themselves until help can come. Because, after all, what we build is only important because of the people who use it.

Thanks to Ernie Miller for a great talk! The video is also available on YouTube.


published by noreply@blogger.com (Kamil Ciemniewski) on 2016-04-12 11:16:00 in the "classifiers" category

Previous in series:

In my last article I presented an approach that simplifies computations of very complex probability models. It makes these complex models viable by shrinking the amount of needed memory and improving the speed of computing probabilities. The approach we were exploring is called the Naive Bayes model.

The context was the e-commerce feature in which a user is presented with the promotion box. The box shows the product category the user is most likely to buy.

Though the results we got were quite good, I promised to present an approach that gives much better ones. While the Naive Bayes approach may not be acceptable in some scenarios due to the gap between approximated and real values, the approach presented in this article will make this distance much, much smaller.

Naive Bayes as a simple Bayesian Network

When exploring the Naive Bayes model, we said that there is a probabilistic assumption the model makes in order to simplify the computations. In the last article I wrote:

"

The Naive Bayes assumption says that the distribution factorizes the way we did it only if the features are conditionally independent given the category.

"

Expressing variable dependencies as a graph

Let's imagine the visual representation of the relations between the random variables in the Naive Bayes model. Let's make it into a directed acyclic graph. Let's mark the dependence of one variable on another as a graph edge from the parent node pointing to its dependent node.

Because of the assumption the Naive Bayes model enforces, its structure as a graph looks like the following:

You can notice there are no lines between all the "evidence" nodes. The assumption says that knowing the category, we have all needed knowledge about every single evidence node. This makes category the parent of all the other nodes. Intuitively, we can say that knowing the class (in this example, the category) we know everything about all features. It's easy to notice that this assumption doesn't hold in this example.

In our fake data generator, we made it so that e.g. relationship status depends on age. We've also made the category depend on sex and age directly. This way we can't say that knowing the category we know everything about e.g. age. The random variables age and sex are not independent even if we know the value of category. It is clear that the above graph does not model the dependency relationships between these random variables.

Let's draw a graph that represents our fake data model better:

The combination of a graph like the one above and the probability distribution that follows the independencies it describes are known as a Bayesian Network.

Using the graph representation in practice - the chain rule for Bayesian Networks

The fact that our distribution is part of a Bayesian Network allows us to use a formula to simplify the distribution itself. The formula is called the chain rule for Bayesian Networks, and for our particular example it looks like the following:

p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * p(rel | age) * p(cat | sex, age)

You can notice that the equation is just a product of a number of factors. There's one factor for each random variable. The factors for variables that have no parents in the graph are expressed as p(var), while those that do are expressed as p(var | par) or p(var | par1, par2, ...).

Notice that the Naive Bayes model fits perfectly into this equation. If you were to take the first graph presented in this article (the one for the Naive Bayes model) and use the above equation, you'd get exactly the formula we used in the last article.
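
Concretely, applying the chain rule to the Naive Bayes graph, where category is the only parent of every feature node, yields the familiar Naive Bayes factorization:

p(cat, sex, age, rel, loc) = p(cat) * p(sex | cat) * p(age | cat) * p(rel | cat) * p(loc | cat)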

Coding the updated probabilistic model

Before going further, I strongly advise you to make sure you read the previous article - about the Naive Bayes model - to fully understand the classes used in the code in this section.
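
If you don't have that article handy, here is a rough sketch of what those two classes could look like. It is only an approximation so the examples below can run, not the original implementation:

class RandomVariable
  attr_reader :name, :values

  def initialize(name, values)
    @name   = name
    @values = values
  end
end

class Factor
  def initialize(variables)
    @variables = variables
    @counts    = Hash.new(0)
    @total     = 0
  end

  # Count one observed assignment, e.g. observe!(age: :teens, sex: :male)
  def observe!(assignment)
    @counts[key_for(assignment)] += 1
    @total += 1
  end

  # Relative frequency of an assignment, used as its probability estimate
  def value_for(assignment)
    return 0.0 if @total.zero?
    @counts[key_for(assignment)].to_f / @total
  end

  private

  def key_for(assignment)
    @variables.map { |variable| assignment.fetch(variable.name) }
  end
end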

Let's take our chain rule equation and simplify it:

p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * p(rel | age) * p(cat | sex, age)

Again, a conditional distribution can be expressed as:

p(a | b) = p(a, b) / p(b)

This gives us:

p(cat, sex, age, rel, loc) = p(sex) * p(age) * p(loc) * (p(rel, age)/ p(age)) * (p(cat, sex, age) / p(sex, age))

We can easily factor out the p(age) with:

p(cat, sex, age, rel, loc) = p(sex) * p(loc) * p(rel, age) * (p(cat, sex, age) / p(sex, age))

Let's define needed random variables and factors:

category = RandomVariable.new :category, [ :veggies, :snacks, :meat, :drinks, :beauty, :magazines ]
age      = RandomVariable.new :age,      [ :teens, :young_adults, :adults, :elders ]
sex      = RandomVariable.new :sex,      [ :male, :female ]
relation = RandomVariable.new :relation, [ :single, :in_relationship ]
location = RandomVariable.new :location, [ :us, :canada, :europe, :asia ]

loc_dist     = Factor.new [ location ]
sex_dist     = Factor.new [ sex ]
rel_age_dist = Factor.new [ relation, age ]
cat_age_sex_dist = Factor.new [ category, age, sex ]
age_sex_dist = Factor.new [ age, sex ]

full_dist = Factor.new [ category, age, sex, relation, location ]

The learning part is as trivial as in the Naive Bayes case. The only difference is the set of distributions involved:

Model.generate(1000).each do |user|
  user.baskets.each do |basket|
    basket.line_items.each do |item|
      loc_dist.observe! location: user.location
      sex_dist.observe! sex: user.sex
      rel_age_dist.observe! relation: user.relationship, age: user.age
      cat_age_sex_dist.observe! category: item.category, age: user.age, sex: user.sex
      age_sex_dist.observe! age: user.age, sex: user.sex
      full_dist.observe! category: item.category, age: user.age, sex: user.sex,
        relation: user.relationship, location: user.location
    end
  end
end

The inference part is also very similar to the one from the previous article. Here too, the only difference is the set of distributions involved:

infer = -> (age, sex, rel, loc) do
  all = category.values.map do |cat|
    pl  = loc_dist.value_for location: loc
    ps  = sex_dist.value_for sex: sex
    pra = rel_age_dist.value_for relation: rel, age: age
    pcas = cat_age_sex_dist.value_for category: cat, age: age, sex: sex
    pas = age_sex_dist.value_for age: age, sex: sex
    { category: cat, value: (pl * ps * pra * pcas) / pas }
  end

  all_full = category.values.map do |cat|
    val = full_dist.value_for category: cat, age: age, sex: sex,
            relation: rel, location: loc
    { category: cat, value: val }
  end

  win      = all.max      { |a, b| a[:value] <=> b[:value] }
  win_full = all_full.max { |a, b| a[:value] <=> b[:value] }

  puts "Best match for #{[ age, sex, rel, loc ]}:"
  puts "   #{win[:category]} => #{win[:value]}"
  puts "Full pointed at:"
  puts "   #{win_full[:category]} => #{win_full[:value]}nn"
end

The results

Now let's run the inference procedure with the same set of examples as in the previous post to compare the results:

infer.call :teens, :male, :single, :us
infer.call :young_adults, :male, :single, :asia
infer.call :adults, :female, :in_relationship, :europe
infer.call :elders, :female, :in_relationship, :canada

Which yields:

Best match for [:teens, :male, :single, :us]:
   snacks => 0.020610837341908994
Full pointed at:
   snacks => 0.02103999999999992

Best match for [:young_adults, :male, :single, :asia]:
   meat => 0.001801062449999991
Full pointed at:
   meat => 0.0010700000000000121

Best match for [:adults, :female, :in_relationship, :europe]:
   beauty => 0.0007693377820183494
Full pointed at:
   beauty => 0.0008300000000000074

Best match for [:elders, :female, :in_relationship, :canada]:
   veggies => 0.0024346445741176875
Full pointed at:
   veggies => 0.0034199999999999886

Just as with the Naive Bayes model, we got correct values for all cases. When you look closer though, you can notice that the resulting probability values are much closer to the ones from the original, full distribution. The approach we took here makes the values differ by only a couple of parts in 10,000. That could make a real difference for the e-commerce shop from the example if it were visited by millions of customers each month.


published by noreply@blogger.com (Phin Jensen) on 2016-04-09 02:14:00 in the "Conference" category

On March 21 and 22, I had the opportunity to attend the 10th and final MountainWest RubyConf at the Rose Wagner Performing Arts Center in Salt Lake City.

One talk that I really enjoyed was Writing a Test Framework from Scratch by Ryan Davis, author of MiniTest. His goal was to teach the audience how MiniTest was created, by explaining the what, why and how of decisions made throughout the process. I learned a lot from the talk and took plenty of notes, so I'd like to share some of that.

The first thing a test framework needs is an assert function, which will simply check if some value or comparison is true. If it is, great, the test passed! If not, the test failed and an exception should be raised. Here is our first assert definition:

def assert test
  raise "Failed test" unless test
end

This function is the bare minimum you need to test an application; however, it won't be easy or enjoyable to use. The first step to improve this is to make the error messages clearer. This is what the current assert function will return for an error:

path/to/microtest.rb:2:in `assert': Failed test (RuntimeError)
        from test.rb:5:in `<main>'

To make this more readable, we can change the raise statement a bit:

def assert test
  raise RuntimeError, "Failed test", caller unless test
end

A failed assert will now throw this error, which does a better job of explaining where things went wrong:

test.rb:5:in `<main>': Failed test (RuntimeError)

Now we're ready to create another assertion function, assert_equal. A test framework can have many different types of assertions, but when testing real applications, the vast majority will be tests for equality. Writing this assertion is easy:

def assert_equal a, b
  assert a == b
end

assert_equal 4, 2+2 # this will pass
assert_equal 5, 2+2 # this will raise an error

Great, right? Wrong! Unfortunately, the error messages have gone right back to being unhelpful:

path/to/microtest.rb:6:in `assert_equal': Failed test (RuntimeError)
        from test.rb:9:in `<main>'

There are a couple of things we can do to improve these error messages. First, we can filter the backtrace to make it more clear where the error is coming from. Second, we can add a parameter to assert which will take a custom message.

def assert test, msg = "Failed test"
  unless test then
    bt = caller.drop_while { |s| s =~ /#{__FILE__}/ }
    raise RuntimeError, msg, bt
  end
end

def assert_equal a, b
  assert a == b, "Failed assert_equal #{a} vs #{b}"
end

#=> test.rb:9:in `<main>': Failed assert_equal 5 vs 4 (RuntimeError)

This is much better! We're ready to move on to another assert function, assert_in_delta. Because of the way floating point numbers are represented, comparing them for equality won't work. Instead, we will check that they are within a certain range of each other. We can do this with a simple calculation: (a-b).abs <= delta, where delta is a very small number, like 0.001 (in reality, you will probably want a smaller delta than that). Here's the function in Ruby:

def assert_in_delta a, b
  assert (a-b).abs <= 0.001, "Failed assert_in_delta #{a} vs #{b}"
end

assert_in_delta 0.0001, 0.0002 # pass
assert_in_delta 0.5000, 0.6000 # raise

We now have a solid base for our test framework. We have a few assertions and the ability to easily write more. Our next logical step would be to make a way to put our assertions into separate tests. Organizing these assertions allows us to refactor more easily, reuse code more effectively, avoid problems with conflicting tests, and run multiple tests at once.

To do this, we will wrap our assertions in functions and those functions in classes, giving us two layers of compartmentalization.

class XTest
  def first_test
    a = 1
    assert_equal 1, a # passes
  end

  def second_test
    a = 1
    a += 1
    assert_equal 2, a # passes
  end

  def third_test
    a = 1
    assert_equal 1, a # passes
  end
end

That adds some structure, but how do we run the tests now? It's not pretty:

XTest.new.first_test
XTest.new.second_test
XTest.new.third_test

Each test function needs to be called specifically, by name, which will become very tedious once there are 5, or 10, or 1000 tests. This is obviously not the best way to run tests. Ideally, the tests would run themselves, and to do that we'll start by adding a method to run our tests to the class:

class XTest
  def run name
    send name
  end

  # ...test methods?
end

XTest.new.run :first_test
XTest.new.run :second_test
XTest.new.run :third_test

This is still very cumbersome, but it puts us in a better position, closer to our goal of automation. Using Class.public_instance_methods, we can find which methods are tests:

XTest.public_instance_methods
# => %w[some_method one_test two_test ...]

XTest.public_instance_methods.grep(/_test$/)
# => %w[one_test two_test red_test blue_test]

And run those automatically.

class XTest
  def self.run
    public_instance_methods.grep(/_test$/).each do |name|
      self.new.run name
    end
  end
  # def run...
  # ...test methods...
end

XTest.run # => All tests run

This is much better now, but we can still improve our code. If we try to make a new set of tests, called YTest for example, we would have to copy these run methods over. It would be better to move the run methods into a new abstract class, Test, and inherit from that.

class Test
  # ...run & assertions...
end 

class XTest < Test
  # ...test methods...
end 

XTest.run

This improves our code structure significantly. However, when we have multiple classes, we get that same tedious repetition:

XTest.run
YTest.run
ZTest.run # ...ugh

To solve this, we can have the Test class create a list of classes which inherit it. Then we can write a method in Test which will run all of those classes.

class Test
  TESTS = []

  def self.inherited x
    TESTS << x
  end 

  def self.run_all_tests
    TESTS.each do |klass|
      klass.run
    end 
  end 
  # ...self.run, run, and assertions...
end 

Test.run_all_tests # => We can use this instead of XTest.run; YTest.run; etc.

We're really making progress now. The most important feature our framework is missing is some way of reporting test success and failure. A common way to do this is to simply print a dot when a test successfully runs.

def self.run_all_tests
  TESTS.each do |klass|
    klass.run
  end
  puts
end 

def self.run
  public_instance_methods.grep(/_test$/).each do |name|
    self.new.run name 
    print "."
  end 
end

Now, when we run the tests, it will look something like this:

% ruby test.rb
...

Indicating that we had three successful tests. But what happens if a test fails?

% ruby test.rb
.test.rb:20:in `test_assert_equal_bad': Failed assert_equal 5 vs 4 (RuntimeError)
  [...tons of blah blah...]
  from test.rb:30:in `<main>'

The very first error we come across will stop the entire test run. Instead of letting the error be printed naturally, we can catch it and print the error message ourselves, letting the other tests continue:

def self.run
  public_instance_methods.grep(/_test$/).each do |name|
    begin
      self.new.run name
      print "."
    rescue => e
      puts
      puts "Failure: #{self}##{name}: #{e.message}"
      puts "  #{e.backtrace.first}"
    end 
  end 
end

# Output

% ruby test.rb
.
Failure: Class#test_assert_equal_bad: Failed assert_equal 5 vs 4  
  test.rb:20:in `test_assert_equal'
.

That's better, but it's still ugly. We have failures interrupting the visual flow and getting in the way. We can improve on this. First, we should reexamine our code and try to organize it more sensibly.

def self.run
  public_instance_methods.grep(/_test$/).each do |name|
    begin
      self.new.run name
      print "."
    rescue => e
      puts
      puts "Failure: #{self}##{name}: #{e.message}"
      puts "  #{e.backtrace.first}"
    end
  end
end

Currently, this one function is doing 4 things:

  1. Line 2 is selecting and filtering tests.
  2. The begin clause is handling errors.
  3. `self.new.run name` runs the tests.
  4. The various puts and print statements print results.

This is too many responsibilities for one function. Test.run_all_tests should simply run classes, Test.run should run multiple tests, Test#run should run a single test, and result reporting should be done by... something else. We'll get back to that. The first thing we can do to improve this organization is to push the exception handling into the individual test running method.

class Test
  def run name
    send name
    false
  rescue => e
    e
  end

  def self.run
    public_instance_methods.grep(/_test$/).each do |name|
      e = self.new.run name
      
      unless e then
        print "."
      else
        puts
        puts "Failure: #{self}##{name}: #{e.message}"
        puts " #{e.backtrace.first}"
      end
    end
  end
end

This is a little better, but Test.run is still handling all the result reporting. To improve on that, we can move the reporting into another function, or better yet, its own class.

class Reporter
  def report e, name
    unless e then
      print "."
    else
      puts
      puts "Failure: #{self}##{name}: #{e.message}"
      puts " #{e.backtrace.first}"
    end
  end

  def done
    puts
  end
end

class Test
  def self.run_all_tests
    reporter = Reporter.new

    TESTS.each do |klass|
      klass.run reporter
    end
   
    reporter.done
  end
 
  def self.run reporter
    public_instance_methods.grep(/_test$/).each do |name|
      e = self.new.run name
      reporter.report e, name
    end
  end

  # ...
end

By creating this Reporter class, we move all IO out of the Test class. This is a big improvement, but there's a problem with this class. It takes too many arguments to get the information it needs, and it's not even getting everything it should have! See what happens when we run tests with Reporter:

.
Failure: #<Reporter:0x...>#test_assert_bad:
Failed test
 test.rb:9:in `test_assert_bad'
.
Failure: #<Reporter:0x...>#test_assert_equal_bad: Failed
assert_equal 5 vs 4
 test.rb:17:in `test_assert_equal_bad'
.
Failure: #<Reporter:0x...>#test_assert_in_delta_bad: Failed
assert_in_delta 0.5 vs 0.6
 test.rb:25:in `test_assert_in_delta_bad'

Instead of reporting what class has the failing test, it's saying what reporter object is running it! The quickest way to fix this would be to simply add another argument to the report function, but that just creates a more tangled architecture. It would be better to make report take a single argument that contains all the information about the error. The first step to do this is to move the error object into a Test class attribute:

class Test
  # ...
  attr_accessor :failure
  
  def initialize
    self.failure = false
  end

  def run name
    send name
    false
  rescue => e
    self.failure = e
    self
  end
end

After moving the failure into an attribute, we're ready to get rid of the name parameter. We can do this by adding a name attribute to the Test class, like we did with the failure attribute:

class Test
  attr_accessor :name
  attr_accessor :failure
  def initialize name
    self.name = name
    self.failure = false
  end

  def self.run reporter
    public_instance_methods.grep(/_test$/).each do |name|
      e = self.new(name).run
      reporter.report e
    end
  end
  # ...
end

This new way of calling the Test#run method requires us to change that a little bit:

class Test
  def run
    send name
    false
  rescue => e
    self.failure = e
    self
  end
end

We can now make our Reporter class work with a single argument:

class Reporter
  def report e
    unless e then
      print "."
    else
      puts
      puts "Failure: #{e.class}##{e.name}: #{e.failure.message}"
      puts " #{e.failure.backtrace.first}"
    end
  end
end

We now have a much better Reporter class, and we can turn our attention to a new problem in Test#run: it can return two completely different kinds of values, false for a successful test and a Test object for a failure. Tests know if they fail, so we can know when they succeed without that false value.

class Test
  # ...
  attr_accessor :failure
  alias failure? failure
  # ...
  
  def run
    send name
  rescue => e
    self.failure = e
  ensure
    return self
  end
end

class Reporter
  def report e
    unless e.failure? then
      print "."
    else
      # ...
    end
  end
end

It would now be more appropriate for the argument to Reporter#report to be named result instead of e.

class Reporter
  def report result
    unless result.failure? then
      print "."
    else
      failure = result.failure
      puts
      puts "Failure: #{result.class}##{result.name}: #{failure.message}"
      puts " #{failure.backtrace.first}"
    end
  end
end

Now, we have one more step to improve reporting. As of right now, errors will be printed with the dots. This can make it difficult to get an overview of how many tests passed or failed. To fix this, we can move failure printing and progress reporting into two different sections. One will be an overview made up of dots and "F"s, and the other a detailed summary, for example:

...F..F..F

Failure: TestClass#test_method1: failure message 1
 test.rb:1:in `test_method1'

Failure: TestClass#test_method2: failure message 2
 test.rb:5:in `test_method2'

... and so on ...

To get this kind of output, we can store failures while running tests and modify the done function to print them at the end of the tests.

class Reporter
  attr_accessor :failures
  def initialize
    self.failures = []
  end

  def report result
    unless result.failure? then
      print "."
    else
      print "F"
      failures << result
    end
  end

  def done
    puts

    failures.each do |result|
      failure = result.failure
      puts
      puts "Failure: #{result.class}##{result.name}: #{failure.message}"
      puts " #{failure.backtrace.first}"
    end
  end
end

One last bit of polishing on the Reporter class: we'll rename the report method to << and the done method to summary.

class Reporter
  # ...
  def << result
    # ...
  end

  def summary
    # ...
  end
end

class Test
  def self.run_all_tests
    # ...
    reporter.summary
  end
 
  def self.run reporter
    public_instance_methods.grep(/_test$/).each do |name|
      reporter << self.new(name).run
    end
  end
end

We're almost done now! We've got one more step. Tests should be able to run in any order, so we want to make them run in a random order every time. This is as simple as adding `.shuffle` to our Test.run function, but we'll make it a little more readable by moving the public_instance_methods.grep statement into a new function:

class Test
  def self.test_names
    public_instance_methods.grep(/_test$/)
  end
  
  def self.run reporter
    test_names.shuffle.each do |name|
      reporter << self.new(name).run
    end
  end
end

And we're done! This may not be the most feature-rich test framework, but it's very simple, small, well written, and gives us a base which is easy to extend and build on. The entire framework is only about 70 lines of code.
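For reference, here is one way all the pieces from this walkthrough can sit together in a single file. This is only a sketch I assembled from the snippets above (with the assertions folded into Test, as the "# ...run & assertions..." comments suggested), not code taken verbatim from the talk; it runs a bit longer than 70 lines here because of comments and spacing.

class Reporter
  attr_accessor :failures

  def initialize
    self.failures = []
  end

  def << result
    unless result.failure? then
      print "."
    else
      print "F"
      failures << result
    end
  end

  def summary
    puts

    failures.each do |result|
      failure = result.failure
      puts
      puts "Failure: #{result.class}##{result.name}: #{failure.message}"
      puts " #{failure.backtrace.first}"
    end
  end
end

class Test
  TESTS = []

  attr_accessor :name
  attr_accessor :failure
  alias failure? failure

  # Keep track of every subclass so run_all_tests can find them later
  def self.inherited x
    TESTS << x
  end

  def self.run_all_tests
    reporter = Reporter.new

    TESTS.each do |klass|
      klass.run reporter
    end

    reporter.summary
  end

  def self.test_names
    public_instance_methods.grep(/_test$/)
  end

  def self.run reporter
    test_names.shuffle.each do |name|
      reporter << self.new(name).run
    end
  end

  def initialize name
    self.name = name
    self.failure = false
  end

  # Run a single test method; always return self so the reporter
  # can ask the test for its name and failure
  def run
    send name
  rescue => e
    self.failure = e
  ensure
    return self
  end

  # Assertions
  def assert test, msg = "Failed test"
    unless test then
      bt = caller.drop_while { |s| s =~ /#{__FILE__}/ }
      raise RuntimeError, msg, bt
    end
  end

  def assert_equal a, b
    assert a == b, "Failed assert_equal #{a} vs #{b}"
  end

  def assert_in_delta a, b
    assert (a-b).abs <= 0.001, "Failed assert_in_delta #{a} vs #{b}"
  end
end

With this in place, a test class only needs to inherit from Test, name its test methods with an _test suffix, and call Test.run_all_tests at the bottom of the script.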

Thanks to Ryan Davis for an excellent talk! Also check out the code and slides from the talk.


published by Eugenia on 2016-04-09 00:15:58 in the "General" category
Eugenia Loli-Queru

For the kind of illustration I’m interested in, the style requires some very smooth, matte, single-color backgrounds. Traditionally with watercolor people would do large washes of 2 to 3 colors (e.g. for a sky), but for the kind of illustration I do, which has a lot of details, traditional washes are not the way to go. I could not find a single article or YouTube video that shows how to do large, non-square areas of matte, smooth painting, so after a lot of tries, I found this technique:

– Get some paint on a plastic palette. About the size of a raisin for a small area.
– In a separate well of the palette, add about three times as much water as that raisin-sized amount of paint.
– Use a size 8 “pointed-round” soft brush (Kolinsky sounds good).
– Mix the paint with some Titanium White.
– With the tip of the brush, get some paint (just a little bit, maybe about 1/6th of it), and mix it well with the water. It will create a very pale color, but it will still have a color.
– Strain away as much water as possible from the brush. It should not be full of water when you lay it on paper.
– Start laying the pale color on your paper. Use as large brush strokes as possible, and move the pools of paint towards a single direction.
– Let it dry for a minute or so.
– Add 2/6ths of the paint (basically, twice as much as before) to a bit more water than before (maybe about 1.5 times as much).
– Mix well, strain the brush, and paint over, the same way as before.
– Let it dry for 3 minutes or so.
– Add the rest of the paint to about twice as much water as in the beginning, strain the brush, and paint over again. The consistency should be that of melted ice cream.
– Let it dry for 5 minutes before you decide if you need yet another coat on top, or want to add details on it.

That’s it. Basically, you need multiple layers to get a smooth, matte finish.


My illustration “Divorce Papers”

Another way to do it with gouache is to lay gesso + medium on the paper before painting, just as if you were using acrylics. The 2-3 coats of gesso would then serve the same purpose as the multiple coats of paint. Personally, I prefer the first method.


published by noreply@blogger.com (Szymon Lipiński) on 2016-04-08 13:42:00 in the "bash" category

Bash has quite a nice feature: you can type a command in a console and then press <TAB> twice. This shows you all the possible arguments you can use with this command.

In our Liquid Galaxy software stack we have a script which allows us to connect with ssh to one of our installations using a special tunnel. This script is complicated; however, the main usage is simple. The command below will connect me to my Liquid Galaxy machine through a special proxy server.

lg-ssh szymon

The szymon part is the name of my LG, and it is taken from one of our chef node definition files.

This script also takes a huge number of arguments, like:

lg-ssh --chef-directory --ssh-identity --ssh-tunnel-port

There are two kinds of arguments: one kind is a simple string, and the other begins with --.

To implement the bash completion on double <TAB>, I first wrote a simple Python script which builds a list of all the node names:

#!/usr/bin/env python

from sys import argv
import os
import json

if __name__ == "__main__":
    pattern = ""
    if len(argv) == 2:
        pattern = argv[1]

    chef_dir = os.environ.get('LG_CHEF_DIR', None)
    if not chef_dir:
        exit(0)
    node_dirs = [os.path.join(chef_dir, "nodes"),
                 os.path.join(chef_dir, "dev_nodes")]
    node_names = []

    for nodes_dir in node_dirs:
        for root, dirs, files in os.walk(nodes_dir):
            for f in files:
                try:
                    with open(os.path.join(root, f), 'r') as nf:
                        data = json.load(nf)
                        node_names.append(data['normal']['liquid_galaxy']['support_name'])
                except:
                    pass

    for name in node_names:
        print name

Another thing was to get a list of all the program options. We used this simple one-liner:

$LG_CHEF_DIR/repo_scripts/lg-ssh.py --help | grep '  --' | awk {'print $1'}

The last step to make all this work was writing a simple bash script which uses the Python script and the one-liner above.

_lg_ssh()
{
    local cur prev opts node_names
    COMPREPLY=()
    cur="${COMP_WORDS[COMP_CWORD]}"
    prev="${COMP_WORDS[COMP_CWORD-1]}"
    opts=`$LG_CHEF_DIR/repo_scripts/lg-ssh.py --help | grep '  --' | awk {'print $1'}`
    node_names=`python $LG_CHEF_DIR/repo_scripts/node_names.py`

    if [[ ${cur} == -* ]] ; then
        COMPREPLY=( $(compgen -W "${opts}" -- ${cur}) )
        return 0
    fi

    COMPREPLY=( $(compgen -W "${node_names}" -- ${cur}) )
}

complete -F _lg_ssh lg-ssh
complete -F _lg_ssh lg-scp
complete -F _lg_ssh lg-ssh.py

Now I just need to source this file in my current bash session, so I've added the line below in my ~/.bashrc.

source $LG_CHEF_DIR/repo_scripts/lg-ssh.bash-completion

And now pressing the <TAB> twice in a console shows a nice list of completion options:

$ lg-ssh 
Display all 129 possibilities? (y or n)
... and here go all 129 node names ...
$ lg-ssh h
... and here go all node names beginning with 'h' ...
$ lg-ssh --
.. and here go all the options beginning with -- ...

The great feature of this implementation is that when someone changes any of the script's options, or changes a chef node name, the completion mechanism automatically picks up all the changes.


published by Eugenia on 2016-04-07 23:40:31 in the "Entertainment" category
Eugenia Loli-Queru

I’m almost shocked by the Pitchfork review of Yeasayer’s new album, “Amen & Goodbye”. To me, over the years, it was baffling why Pitchfork originally endorsed Yeasayer in 2007 but then killed them in reviews of their subsequent albums (which in my opinion were more interesting). This was answered in the first paragraph of their latest album review. Basically, Pitchfork hated the fact that Yeasayer weren’t writing lyrics about things they truly believe in, that they were in fact, trend-hoppers.

Wait a second, so did Pitchfork truly believe back in 2007 that a bunch of kids from Brooklyn would ever want to leave the city and become “handsome farmers”, as their lyrics claimed? Are their writers that gullible? Or do they live in a fantasy world that the first Yeasayer album reinforced in their heads, only to be deflated by the clearly urban sound of the albums that followed?

Why blame Yeasayer for it? Why blame a bunch of musicians who want to make it in the industry? Why would anyone think that art is only about what the artist believes and not what the masses want to see/hear? Because let me tell you, if you’re a professional artist, by definition you have to make art that people want to see or hear. Only a part of it could coincide with what the artist actually truly likes/believes. Why? Because that’s what “professional” means. It’s not about “selling out”, it’s about literally being able to sell.

The artists who create only what THEY want to create, they’re by definition either not professionals, or they can’t live off their craft (and need a second job). It is EXTREMELY RARE that an artist creates only what they want to, and have commercial success at the same time. And even when that happens, it also means that they will be out of favor within 3-5 years, as trends naturally change. Tough luck after that time passes.

I know a lot of people would like to make art sound special, but art today is no different than anything else. It’s democratized immensely, and that also means that it’s been commoditized. And anything that is a commodity is bound to trends. Even trendsetters have to build on top of existing trends; nothing happens in a vacuum. Everything is connected.

So yeah, going back to that Pitchfork review, I have trouble understanding how they can call Yeasayer “trend hoppers” but also at the same time “out-of-step with current trends”, and judge their music on their character or how they do business, and not on the music itself. In fact, a lot of Pitchfork reviews are like that: they judge the people themselves, not their work. A lot of bands have been destroyed just because Pitchfork didn’t think they were hipster enough, or for being hipsters in disguise.

In my opinion, the album itself is rather “blah” (not as interesting as their 2010 “Odd Blood”), but I try to judge the music itself as music and what it does to my synesthetic brain. Does it turn it On, does it transport me to another dimension? Does it make me feel something, or make me see something that wasn’t there, as true psychedelics do? If yes, it gets more points, if not, it gets fewer. I care not about lyrics, because I almost never care about what others think about stuff. To me, especially as a non-native English speaker, it’s only about the music.

But I won’t judge music or art in general based on the creator’s character, or what my own beliefs expect that creator’s character to be. This raises the philosophical question: “is the art separate from the artist?”. And the answer to this depends on your point of view, how you consume art. From the point of view of the artist, the art and the artist are not separate. But for all third parties, it depends: if you can only understand art by understanding the artist, then yes, judging the artist himself, might make sense. But if you make the art your own by separating it from the artist (as I usually do), then I don’t need to know about the artist’s convictions. Because at that point, his/her art and me, are one. And by proxy, that makes myself and the artist one. So it’s a synergistic/symbiotic way of consuming art, rather than a conditional one (e.g. “I might like that art if its artist is in agreement with my beliefs”.)

My score for their new album: 5/10 (lower than Pitchfork’s score in fact, but without a cultural bias attached to it)


published by noreply@blogger.com (Szymon Lipiński) on 2016-04-07 20:12:00 in the "postgres" category

The new PostgreSQL 9.5 release has a bunch of great features. I describe below the ones I find most interesting.

Upsert

UPSERT is simply a combination of INSERT and UPDATE. It works like this: if a row exists, update it; if it doesn't exist, create it.

Before Postgres 9.5 when I wanted to insert or update a row, I had to write this:

INSERT INTO test(username, login)
SELECT 'hey', 'ho ho ho'
WHERE NOT EXISTS (SELECT 42 FROM test WHERE username='hey');

UPDATE test SET login='ho ho ho' WHERE username='hey' AND login <> 'ho ho ho';

This was a little problematic: you need to run two queries, and both can have quite complicated WHERE clauses.

In PostgreSQL 9.5 there is a much simpler version:

INSERT INTO test(username, login) VALUES ('hey', 'ho ho ho')
ON CONFLICT (username)
DO UPDATE SET login='ho ho ho';

The only requirement is that there be a UNIQUE constraint on the column that would otherwise cause the INSERT to fail.

The version above makes the UPDATE when the INSERT fails. There is also another form of the UPSERT query, which I used in this blog post. You can just ignore the INSERT failure:

INSERT INTO test(username, login) VALUES ('hey', 'ho ho ho')
ON CONFLICT (username)
DO NOTHING;

Switching Tables to Logged and Unlogged

PostgreSQL keeps a transaction write-ahead log, which helps restore the database after a crash and is used in replication, but it comes with some overhead, as additional information must be stored on disk.

In PostgreSQL 9.5 you can simply switch a table between logged and unlogged. The unlogged version can be much faster when filling it with data, processing it, etc. However, at the end of such operations it might be good to make it a normal logged table again. Now it is simple:

ALTER TABLE barfoo SET LOGGED;

JSONB Operators and Functions

This is the binary JSON type, and these new functions allow us to perform more operations without having to convert our data first to the slower, non-binary JSON alternative.

Now you can remove a key from a JSONB value:

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb || '{"x": 1, "y": 2, "c": 42}'::jsonb;

     ?column?
??????????????????
 {"b": 2, "c": 3}

And merge JSONB values (the second value's keys overwrite the first's):

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb || '{"x": 1, "y": 2, "c": 42}'::jsonb;

                 ?column?
--------------------------------------------
 {"a": 1, "b": 2, "c": 42, "x": 1, "y": 2}

And we have the nice jsonb_pretty() function which instead of this:

SELECT jsonb_set('{"name": "James", "contact": {"phone": "01234 567890",
                   "fax": "01987 543210"}}'::jsonb,
                   '{contact,phone}', '"07900 112233"'::jsonb);

                                   jsonb_set
--------------------------------------------------------------------------------
 {"name": "James", "contact": {"fax": "01987 543210", "phone": "07900 112233"}}

prints this:

SELECT jsonb_pretty(jsonb_set('{"name": "James", "contact": {"phone": "01234 567890",
                   "fax": "01987 543210"}}'::jsonb,
                   '{contact,phone}', '"07900 112233"'::jsonb));


         jsonb_pretty
---------------------------------
  {                              +
      "name": "James",           +
      "contact": {               +
          "fax": "01987 543210", +
          "phone": "07900 112233"+
      }                          +
  }

More Information

There are more nice features in the new PostgreSQL 9.5. You can read the full list at https://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.5


published by noreply@blogger.com (Elizabeth Garrett) on 2016-03-29 16:53:00

I recently did some research for one of End Point's ecommerce clients on their PCI compliance and wanted to share some basic information for those of you who are new to this topic.

TLS

TLS (Transport Layer Security) is a standard for secure communications between applications. TLS is the current version of what used to be called SSL, the Secure Sockets Layer. In the case of a financial transaction, this is the communication between the website selling a product and the end user. TLS works by encrypting data between two endpoints to ensure any sensitive data (such as financial details and private customer information) is exchanged securely. As security measures increase, new versions of TLS are released. TLS 1.2 is currently the most up-to-date version, with TLS 1.1 being considered safe, and TLS 1.0 being phased out. For details about OS versions supporting the latest TLS standards, please see Jon Jensen's write-up here.

Compliance with PCI DSS

As all online retailers know, becoming and staying compliant with PCI DSS (Payment Card Industry Data Security Standard) is a big job. PCI is THE ecommerce security standard and in order to accept payment with Visa, MasterCard, American Express, and Discover, you must comply with their security standards.

As the Internet security landscape changes, PCI DSS standards are updated to reflect new risks and adjustments in security protections. As of today, PCI is requiring vendors to upgrade to TLS 1.1 or above by June of 2016, with an optional extension until June 2018.

Compliance Assessors

Here's where things get tricky. PCI does not actually do its own compliance checks; instead, each merchant must have a neutral third party help them fulfill their PCI requirements. These are called "assessors", and there are a large number of companies that offer this service along with help for other security-related tasks.

In preparation for the new requirements, many of the assessor companies are including the new TLS standards in their current compliance protocols.

What does that mean? Well, it means that even though PCI might not be requiring you to have TLS 1.1 until June of 2018, your compliance assessor might require you to do it right now.

Bite the Bullet

So, now, given that you know this change is coming AND you need it to get your compliance done, you might as well get your site updated. So where's the catch?

Unsupported browsers

The big catch is that some browsers do not support TLS 1.1 or 1.2. In those cases, some of your users will not be able to complete a payment transaction and will instead hit an error screen and be unable to continue. The affected browsers are:

  • Internet Explorer on Windows XP
  • Internet Explorer older than version 11 on any version of Windows
  • the stock Android browser on versions of Android before 5.0
  • Safari 6 or older on Mac OS X 10.8 (Mountain Lion) or older
  • Safari on iOS 4 or older
  • very, very old versions of Firefox or Chrome that have been set not to auto-update

Okay, so how many people still use those old browsers? We'll take a look at some of the breakdowns here:

http://www.w3schools.com/browsers/browsers_explorer.asp

You might be thinking, "That doesn't seem like very many people". And that's true. However, every site has a different customer base, and browser use varies widely by demographics. So where can you go to find out what kinds of browsers your customers use?

Google Analytics, your old friend

If you have Google Analytics set up, you can go through the Audience/Technology/Browser&OS screens to find out what kind of impact this might have.

Plan for the Worst

Now armed with your information, you will probably want to go ahead and get your website on the newest TLS version. The change is coming anyway, but help your staff and web team plan for the worst by making sure everyone knows about the browser limitations and can help your customers through the process.

Server Compatibility Notes

For many ecommerce sites, enabling TLS 1.1 and 1.2 is easy: just change a configuration setting and restart the web server. But on older operating systems, such as the still supported and very popular Red Hat Enterprise Linux 5 and CentOS Linux 5, TLS 1.0 is the newest supported version. Various workarounds might be possible, but the only real solution is to migrate to a newer version of the operating system. There can be cost and time factors to consider, so it's best to plan ahead. Ask us or your in-house developers whether a migration will be necessary!

Need Help?

As End Point's client liaison, I'm happy to chat with anyone who needs answers or advice about PCI DSS and your ecommerce site.


published by noreply@blogger.com (Kamil Ciemniewski) on 2016-03-23 08:41:00 in the "classifiers" category

Have you ever wondered what the machinery is behind some of the algorithms for doing seemingly very intelligent tasks? How is it possible that a computer program can recognize faces in photos, turn an image into text, or even classify some emails as legitimate and others as spam?

Today, I'd like to present one of the simplest models for performing classification tasks. The model enables extremely fast execution, making it very practical in many use cases. The example I'll choose will also let us extend the discussion about the optimal approach into another blog post.

The problem

Imagine that you're working on an e-commerce store for your client. One of the requirements is to present the currently logged-in user with a "promotion box" somewhere on the page. The goal is to maximize our chances of having the user put the product from the box into the basket. There's one promotional box and a couple of different categories of products to choose the actual product from.

Thinking about the solution: using probability theory

One of the obvious directions we may want to turn towards is probability theory. If we can collect data about the user's previous choices and his or her characteristics, we can use probability to select the product category best suited to the current user. We would then choose a product from this category that currently has an active promotion.

Quick theory refresher for programmers

As we'll be exploring the probability approaches using Ruby code, I'd like to very quickly walk you through some of the basic concepts we will be using from now on.

Random variables

The simplest probability scenario many of us are already accustomed to is the coin toss results distribution. Here we're throwing a coin, noting whether we get heads or tails. In this experiment, we call "got heads" and "got tails" probability events. We can also shift the terminology a bit by calling them two values of the "toss result" random variable.

So in this case we'd have a random variable, let's call it T (for "toss"), that can take the values "heads" or "tails". We then define the probability distribution P(T) as a function from the random variable's value to a real number between 0 and 1, inclusive. In the real world, the probability values after e.g. 10000 tosses might look like the following:

+-------+---------------------+
| toss  | value               |
+-------+---------------------+
| heads | 0.49929999999999947 |
| tails |   0.500699999999998 |
+-------+---------------------+

These values get closer and closer to 0.5 as the number of tosses grows.
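As a quick illustration (this snippet is mine, not part of the original post), a few lines of Ruby reproduce the experiment; the exact frequencies will differ slightly on every run:

tosses = 10_000.times.map { [:heads, :tails].sample }

[:heads, :tails].each do |side|
  puts "#{side}: #{tosses.count(side) / tosses.size.to_f}"
end
# e.g.
# heads: 0.5012
# tails: 0.4988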

Factors and probability distributions

We've shown a simple probability distribution. To ease the comprehension of the Ruby code we'll be working with, let me introduce the notion of a factor. We called the "table" from the last example a probability distribution. The table represented a function from a random variable's value to a real number in [0, 1]. The factor is a generalization of that notion: it's a function from the same domain, but returning any real number. We'll explore the usefulness of this notion in some of our next articles.

The probability distribution is a factor that adds two constraints:

  • its values are always in the range [0, 1] inclusively
  • the sum of all its values is exactly 1
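For instance (an example of mine, following the definition above), the raw counts behind the toss table form a factor but not a probability distribution, because the values are not confined to [0, 1] and do not sum to 1:

+-------+-------+
| toss  | value |
+-------+-------+
| heads |  4993 |
| tails |  5007 |
+-------+-------+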

Simple Ruby modeling of random variables and factors

We need to have some ways of computing probability distributions. Let's define some simple tools we'll be using in this blog series:

# Let's define a simple version of the random variable
# - one that will hold discrete values
class RandomVariable
  attr_accessor :values, :name

  def initialize(name, values)
    @name = name
    @values = values
  end
end

# The following class really represents here a probability
# distribution. We'll adjust it in the next posts to make
# it match the definition of a "factor". We're naming it this
# way right now as every probability distribution is a factor
# too.
class Factor
  attr_accessor :_table, :_count, :variables

  def initialize(variables)
    @_table = {}
    @_count = 0.0
    @variables = variables
    initialize_table
  end

  # We're choosing to represent the factor / distribution
  # here as a table with value combinations in one column
  # and probability values in another. Technically, we're using
  # Ruby's Hash. The following method builds the initial hash
  # with all the possible keys and values assigned to 0:
  def initialize_table
    variables_values = @variables.map do |var|
      var.values.map do |val|
        { var.name.to_sym => val }
      end.flatten
    end # [ [ { name: value } ] ]   
    @_table = variables_values[1..(variables_values.count)].inject(variables_values.first) do |all_array, var_arrays|
      all_array = all_array.map do |ob|
        var_arrays.map do |var_val|
          ob.merge var_val
        end
      end.flatten
      all_array
    end.inject({}) { |m, item| m[item] = 0; m }
  end

  # The following method adjusts the factor by adding information
  # about observed combination of values. This in turn adjusts probability
  # values for all the entries:
  def observe!(observation)
    if !@_table.has_key? observation
      raise ArgumentError, "Doesn't fit the factor - #{@variables} for observation: #{observation}"
    end

    @_count += 1

    @_table.keys.each do |key|
      observed = key == observation
      @_table[key] = (@_table[key] * (@_count == 0 ? 0 : (@_count - 1)) + 
       (observed ? 1 : 0)) / 
         (@_count == 0 ? 1 : @_count)
    end

    self
  end

  # Helper method for getting all the possible combinations
  # of random variable assignments
  def entries
    @_table.each
  end

  # Helper method for testing purposes. Sums the values for the whole
  # distribution - it should return 1 (close to 1 due to how computers
  # handle floating point operations)
  def sum
    @_table.values.inject(:+)
  end

  # Returns a probability of a given combination happening
  # in the experiment
  def value_for(key)
    if @_table[key].nil?
      raise ArgumentError, "Doesn't fit the factor - #{@varables} for: #{key}"
    end
    @_table[key]
  end

  # Helper method for testing purposes. Returns a table object
  # ready to be printed to stdout. It shows the whole distribution
  # as a table with some columns being random variables values and
  # the last one being the probability value
  def table
    rows = @_table.keys.map do |key|
      key.values << @_table[key]
    end
    table = Terminal::Table.new rows: rows, headings: ( @variables.map(&:name) << "value" )
    table.align_column(@variables.count, :right)
    table
  end

  protected

  def entries=(_entries)
    _entries.each do |entry|
      @_table[entry.keys.first] = entry.values.first
    end
  end

  def count
    @_count
  end

  def count=(_count)
    @_count = _count
  end
end

Notice that we're using the terminal-table gem here as a helper for printing out the factors in an easy-to-grasp fashion. You'll need the following requires:

require 'rubygems'
require 'terminal-table'

The scenario setup

Let's imagine that we have the following categories to choose from:

category = RandomVariable.new :category, [ :veggies, :snacks, :meat, :drinks, :beauty, :magazines ]

And the following user features on each request:

age      = RandomVariable.new :age,      [ :teens, :young_adults, :adults, :elders ]
sex      = RandomVariable.new :sex,      [ :male, :female ]
relation = RandomVariable.new :relation, [ :single, :in_relationship ]
location = RandomVariable.new :location, [ :us, :canada, :europe, :asia ]

Let's define the data model that resembles logically the one we could have in our real e-commerce application:

class LineItem
  attr_accessor :category

  def initialize(category)
    self.category = category
  end
end

class Basket
  attr_accessor :line_items

  def initialize(line_items)
    self.line_items = line_items
  end
end

class User
  attr_accessor :age, :sex, :relationship, :location, :baskets

  def initialize(age, sex, relationship, location, baskets)
    self.age = age
    self.sex = sex
    self.relationship = relationship
    self.location = location
    self.baskets = baskets
  end
end

We want to utilize a user's baskets in order to infer the most probable value for a category, given a set of the user's features. In our example, we can imagine that we're offering authentication via Facebook. We can grab info about a user's sex, location, and age, and whether she/he is in a relationship or not. We want to find the category that's chosen the most by users with a given set of features.

As we don't have any real data to play with, we'll need a generator to create fake data with certain characteristics. Let's first define a helper class with a method that will allow us to choose a value out of a given list of options along with their weights:

class Generator
  def self.pick(options)
    items = options.inject([]) do |memo, keyval|
      key, val = keyval
      memo << Array.new(val, key)
      memo
    end.flatten
    items.sample
  end
end
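As a quick sanity check (my own snippet, not from the original post), the weights translate directly into sampling frequencies; with weights of 4 and 6, :male should come up roughly 40% of the time:

picks = 10_000.times.map { Generator.pick male: 4, female: 6 }
puts picks.count(:male) / picks.size.to_f # => roughly 0.4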

With all the above we can define a random data generation model:

class Model

  # Let's generate `num` users (1000 by default)
  def self.generate(num = 1000)
    num.times.to_a.map do |user_index|
      gen_user
    end
  end

  # Returns a user with randomly selected traits and baskets
  def self.gen_user
    age = gen_age
    sex = gen_sex
    rel = gen_rel(age)
    loc = gen_loc
    baskets = gen_baskets(age, sex)

    User.new age, sex, rel, loc, baskets
  end

  # Randomly select a sex with 40% chance for getting a male
  def self.gen_sex
    Generator.pick male: 4, female: 6
  end

  # Randomly select an age with 50% chance for getting a teen
  # (among other options and weights)
  def self.gen_age
    Generator.pick teens: 5, young_adults: 2, adults: 2, elders: 1
  end

  # Randomly select a relationship status.
  # Make the chance of getting a given option depend on the user's age
  def self.gen_rel(age)
    case age
      when :teens        then Generator.pick single: 7, in_relationship: 3
      when :young_adults then Generator.pick single: 4, in_relationship: 6
      else                    Generator.pick single: 8, in_relationship: 2
    end
  end

  # Randomly select a location with 40% chance for getting the United States
  # (among other options and weights)
  def self.gen_loc
    Generator.pick us: 4, canada: 3, europe: 1, asia: 2
  end

  # Randomly select 20 basket line items.
  # Make the chance of getting a given option depend on the user's age and sex
  def self.gen_items(age, sex)
    num = 20

    num.times.to_a.map do |i|
      if (age == :teens || age == :young_adults) && sex == :female
        Generator.pick veggies: 1, snacks: 3, meat: 1, drinks: 1, beauty: 9, magazines: 6
      elsif age == :teens  && sex == :male
        Generator.pick veggies: 1, snacks: 6, meat: 4, drinks: 1, beauty: 1, magazines: 4
      elsif (age == :young_adults || age == :adults) && sex == :male
        Generator.pick veggies: 1, snacks: 4, meat: 6, drinks: 6, beauty: 1, magazines: 1
      elsif (age == :young_adults || age == :adults) && sex == :female
        Generator.pick veggies: 4, snacks: 4, meat: 2, drinks: 1, beauty: 6, magazines: 3
      elsif age == :elders && sex == :male
        Generator.pick veggies: 6, snacks: 2, meat: 2, drinks: 2, beauty: 1, magazines: 1
      elsif age == :elders && sex == :female
        Generator.pick veggies: 8, snacks: 1, meat: 2, drinks: 1, beauty: 4, magazines: 1
      else
        Generator.pick veggies: 1, snacks: 1, meat: 1, drinks: 1, beauty: 1, magazines: 1
      end
    end.map do |cat|
      LineItem.new cat
    end
  end

  # Randomly select 5 baskets, with the contents of each basket depending
  # on the user's age and sex
  def self.gen_baskets(age, sex)
    num = 5

    num.times.to_a.map do |i|
      Basket.new gen_items(age, sex)
    end
  end
end

Where is the complexity?

The approach described above doesn't seem that exciting or complex. Usually, reading about probability theory applied in the field of machine learning requires going through quite a dense set of mathematical notions. The field is also being actively worked on by researchers. This implies a huge complexity, certainly not the simple definition of probability that we got used to in high school.

The problem becomes a bit more complex if you consider the efficiency of computing the probabilities. In our example, the joint probability distribution needed to fully describe the scenario has to specify probability values for 383 cases:

p(:veggies, :teens, :male, :single, :us) # one of 384 combinations

Given that probability distributions have to sum up to 1, the last case can be fully inferred from the sum of all the others. This means that we need 6 * 4 * 2 * 2 * 4 - 1 = 383 parameters in the model: 6 categories, 4 age classes, 2 sexes, 2 relationship kinds and 4 locations. Imagine adding one additional, 4-valued feature (a season). This would grow our number of parameters to 1535. And this is a very simple training example. We could have a model with close to 100 different features. The number of parameters would clearly be unmanageable even on the biggest servers we could put them on. This approach would also make it very painful to add additional features to the model.

Very simple but powerful optimization: The Naive Bayes model

In this section I'm going to present you with the equation we'll be working with when optimizing our example. I'm not going to explain the mathematics behind it, as you can easily read about them on, e.g., Wikipedia.

The approach is called the Naive Bayes model. It is used, e.g., in spam filters. It has also been used in the past in the medical diagnosis field.

It allows us to present the full probability distribution as a product of factors:

p(cat, age, sex, rel, loc) == p(cat) * p(age | cat) * p(sex | cat) * p(rel | cat) * p(loc | cat)

Where, e.g., p(age | cat) represents the probability of a user being a certain age, given that this user selects cat products most frequently. In Bayes' rule terminology, conditionals like this play the role of likelihoods of the features given the class. The above equation states that we can simplify the distribution to be a product of a small number of much more easily manageable factors.

The category from our example is often called a class, and the rest of the random variables in the distribution are often called features.

In our example, the number of parameters we'll need to manage when presenting the distribution in this form drops to:

(6 - 1) + (6 * 4 - 1) + (6 * 2 - 1) + (6 * 2 - 1) + (6 * 4 - 1) == 73

That's just around 19% of the original amount! Also, adding another variable (season) would only add 23 new parameters (compared to 1152 in the full distribution case).
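These counts are easy to verify with a few lines of Ruby arithmetic (a sanity check of mine, not part of the original post):

full             = 6 * 4 * 2 * 2 * 4 - 1     # => 383
full_with_season = 6 * 4 * 2 * 2 * 4 * 4 - 1 # => 1535
naive_bayes      = (6 - 1) + (6 * 4 - 1) + (6 * 2 - 1) + (6 * 2 - 1) + (6 * 4 - 1) # => 73

puts naive_bayes.to_f / full # => ~0.19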

The Naive Bayes model limits the number of parameters we have to manage, but it comes with very strong assumptions about the variables involved: in our example, that the user features are conditionally independent given the resulting category. Later on I'll show why this isn't true in this case, even though the results will still be quite okay.

Implementing the Naive Bayes model

As we now have all the tools we need, let's get back to the probability theory to figure out how best to model the Naive Bayes in terms of the Ruby blocks we now have.

The approach says that under the assumptions we discussed we can approximate the original distribution to be the product of factors:

p(cat, age, sex, rel, loc) = p(cat) * p(age | cat) * p(sex | cat) * p(rel | cat) * p(loc | cat)

Given the definition of the conditional probability we have that:

p(a | b) = p(a, b) / p(b)

Thus, we can express the approximation as:

p(cat, age, sex, rel, loc) = p(cat) * ( p(age, cat) / p(cat) ) * ( p(sex, cat) / p(cat) ) * ( p(rel, cat) / p(cat) ) * ( p(loc, cat) / p(cat) )

And then simplify it even further as:

p(cat, age, sex, rel, loc) = p(age, cat) * ( p(sex, cat) / p(cat) ) * ( p(rel, cat) / p(cat) ) * ( p(loc, cat) / p(cat) )

Let's define all the factors we will need:

cat_dist     = Factor.new [ category ]
age_cat_dist = Factor.new [ age, category ]
sex_cat_dist = Factor.new [ sex, category ]
rel_cat_dist = Factor.new [ relation, category ]
loc_cat_dist = Factor.new [ location, category ]

Also, we want a full distribution to compare the results:

full_dist = Factor.new [ category, age, sex, relation, location ]

Let's generate 1000 random users and, looping through them and their baskets, adjust the probability distributions for combinations of product categories and user traits:

Model.generate(1000).each do |user|
  user.baskets.each do |basket|
    basket.line_items.each do |item|
      cat_dist.observe! category: item.category
      age_cat_dist.observe! age: user.age, category: item.category
      sex_cat_dist.observe! sex: user.sex, category: item.category
      rel_cat_dist.observe! relation: user.relationship, category: item.category
      loc_cat_dist.observe! location: user.location, category: item.category
      full_dist.observe! category: item.category, age: user.age, sex: user.sex,
        relation: user.relationship, location: user.location
    end
  end
end

We can now print the distributions as tables to get some insight into the data:

[ cat_dist, age_cat_dist, sex_cat_dist, rel_cat_dist, 
  loc_cat_dist, full_dist ].each do |dist|
    puts dist.table
    # Let's print out the sum of all entries to ensure the
    # algorithm works well:
    puts dist.sum
    puts "nn"
end

Which yields the following to the console (the full distribution is truncated due to its size):

+-----------+---------------------+
| category  | value               |
+-----------+---------------------+
| veggies   |             0.10866 |
| snacks    | 0.19830999999999863 |
| meat      |             0.14769 |
| drinks    | 0.10115999999999989 |
| beauty    |             0.24632 |
| magazines | 0.19785999999999926 |
+-----------+---------------------+
0.9999999999999978

+--------------+-----------+----------------------+
| age          | category  | value                |
+--------------+-----------+----------------------+
| teens        | veggies   |  0.02608000000000002 |
| teens        | snacks    |  0.11347999999999969 |
| teens        | meat      |  0.06282999999999944 |
| teens        | drinks    |   0.0263200000000002 |
| teens        | beauty    |   0.1390699999999995 |
| teens        | magazines |  0.13322000000000103 |
| young_adults | veggies   | 0.010250000000000023 |
| young_adults | snacks    |  0.03676000000000003 |
| young_adults | meat      |  0.03678000000000005 |
| young_adults | drinks    |  0.03670000000000045 |
| young_adults | beauty    |  0.05172999999999976 |
| young_adults | magazines | 0.035779999999999916 |
| adults       | veggies   | 0.026749999999999927 |
| adults       | snacks    |  0.03827999999999962 |
| adults       | meat      | 0.034600000000000505 |
| adults       | drinks    | 0.028190000000000038 |
| adults       | beauty    |  0.03892000000000036 |
| adults       | magazines |  0.02225999999999998 |
| elders       | veggies   |  0.04558000000000066 |
| elders       | snacks    | 0.009790000000000047 |
| elders       | meat      | 0.013480000000000027 |
| elders       | drinks    | 0.009949999999999931 |
| elders       | beauty    | 0.016600000000000226 |
| elders       | magazines | 0.006600000000000025 |
+--------------+-----------+----------------------+
1.0000000000000013

+--------+-----------+----------------------+
| sex    | category  | value                |
+--------+-----------+----------------------+
| male   | veggies   |  0.03954000000000044 |
| male   | snacks    |   0.1132499999999996 |
| male   | meat      |  0.10851000000000031 |
| male   | drinks    |                0.073 |
| male   | beauty    | 0.023679999999999857 |
| male   | magazines |  0.05901999999999993 |
| female | veggies   |  0.06911999999999997 |
| female | snacks    |  0.08506000000000069 |
| female | meat      |  0.03918000000000006 |
| female | drinks    |  0.02816000000000005 |
| female | beauty    |  0.22264000000000062 |
| female | magazines |  0.13884000000000046 |
+--------+-----------+----------------------+
1.000000000000002

+-----------------+-----------+----------------------+
| relation        | category  | value                |
+-----------------+-----------+----------------------+
| single          | veggies   |  0.07722000000000082 |
| single          | snacks    |  0.13090999999999794 |
| single          | meat      |  0.09317000000000061 |
| single          | drinks    | 0.059979999999999915 |
| single          | beauty    |  0.16317999999999971 |
| single          | magazines |  0.13054000000000135 |
| in_relationship | veggies   | 0.031440000000000336 |
| in_relationship | snacks    |  0.06740000000000032 |
| in_relationship | meat      | 0.054520000000000006 |
| in_relationship | drinks    |  0.04118000000000009 |
| in_relationship | beauty    |  0.08314000000000002 |
| in_relationship | magazines |  0.06732000000000182 |
+-----------------+-----------+----------------------+
1.000000000000003

+----------+-----------+----------------------+
| location | category  | value                |
+----------+-----------+----------------------+
| us       | veggies   |  0.04209000000000062 |
| us       | snacks    |  0.07534000000000109 |
| us       | meat      | 0.055059999999999984 |
| us       | drinks    |  0.03704000000000108 |
| us       | beauty    |  0.09879000000000099 |
| us       | magazines |  0.07867999999999964 |
| canada   | veggies   | 0.027930000000000062 |
| canada   | snacks    |  0.05745999999999996 |
| canada   | meat      |  0.04288000000000003 |
| canada   | drinks    |  0.03078999999999948 |
| canada   | beauty    |  0.06397999999999997 |
| canada   | magazines | 0.053959999999999675 |
| europe   | veggies   | 0.013110000000000132 |
| europe   | snacks    |   0.0223200000000001 |
| europe   | meat      |  0.01730000000000005 |
| europe   | drinks    | 0.011859999999999964 |
| europe   | beauty    | 0.025490000000000183 |
| europe   | magazines | 0.020920000000000164 |
| asia     | veggies   |  0.02552999999999989 |
| asia     | snacks    |  0.04319000000000044 |
| asia     | meat      |  0.03244999999999966 |
| asia     | drinks    |  0.02147000000000005 |
| asia     | beauty    |  0.05805999999999953 |
| asia     | magazines |   0.0442999999999999 |
+----------+-----------+----------------------+
1.0000000000000029

+-----------+--------------+--------+-----------------+----------+------------------------+
| category  | age          | sex    | relation        | location | value                  |
+-----------+--------------+--------+-----------------+----------+------------------------+
| veggies   | teens        | male   | single          | us       |  0.0035299999999999936 |
| veggies   | teens        | male   | single          | canada   |  0.0024500000000000073 |
| veggies   | teens        | male   | single          | europe   |  0.0006999999999999944 |
| veggies   | teens        | male   | single          | asia     |  0.0016699999999999899 |
| veggies   | teens        | male   | in_relationship | us       |   0.001340000000000006 |
| veggies   | teens        | male   | in_relationship | canada   |  0.0010099999999999775 |
| veggies   | teens        | male   | in_relationship | europe   |  0.0006499999999999989 |
| veggies   | teens        | male   | in_relationship | asia     |   0.000819999999999994 |

(... many rows ...)

| magazines | elders       | male   | in_relationship | asia     | 0.00012000000000000163 |
| magazines | elders       | female | single          | us       |  0.0007399999999999966 |
| magazines | elders       | female | single          | canada   |  0.0007000000000000037 |
| magazines | elders       | female | single          | europe   |  0.0003199999999999965 |
| magazines | elders       | female | single          | asia     |  0.0005899999999999999 |
| magazines | elders       | female | in_relationship | us       |  0.0004899999999999885 |
| magazines | elders       | female | in_relationship | canada   | 0.00027000000000000114 |
| magazines | elders       | female | in_relationship | europe   | 0.00012000000000000014 |
| magazines | elders       | female | in_relationship | asia     | 0.00012000000000000014 |
+-----------+--------------+--------+-----------------+----------+------------------------+
1.0000000000000004

Let's define a Proc for inferring categories based on user traits as evidence:

infer = -> (age, sex, rel, loc) do

  # Let's map through the possible categories and the probability
  # values the distributions assign to them:
  all = category.values.map do |cat|
    pc  = cat_dist.value_for category: cat
    pac = age_cat_dist.value_for age: age, category: cat
    psc = sex_cat_dist.value_for sex: sex, category: cat
    prc = rel_cat_dist.value_for relation: rel, category: cat
    plc = loc_cat_dist.value_for location: loc, category: cat

    { category: cat, value: (pac * psc/pc * prc/pc * plc/pc) }
  end

  # Let's do the same with the full distribution to be able to compare
  # the results:
  all_full = category.values.map do |cat|
    val = full_dist.value_for category: cat, age: age, sex: sex,
            relation: rel, location: loc

    { category: cat, value: val }
  end

  # Here we're getting the most probable categories based on the
  # Naive Bayes distribution approximation model and based on the full
  # distribution:
  win      = all.max      { |a, b| a[:value] <=> b[:value] }
  win_full = all_full.max { |a, b| a[:value] <=> b[:value] }

  puts "Best match for #{[ age, sex, rel, loc ]}:"
  puts "   #{win[:category]} => #{win[:value]}"
  puts "Full pointed at:"
  puts "   #{win_full[:category]} => #{win_full[:value]}nn"
end

The results

We're ready now to use the model and see how well the Naive Bayes model performs in this particular scenario:

infer.call :teens, :male, :single, :us
infer.call :young_adults, :male, :single, :asia
infer.call :adults, :female, :in_relationship, :europe
infer.call :elders, :female, :in_relationship, :canada

This gave the following results on the console:

Best match for [:teens, :male, :single, :us]:
   snacks => 0.016252573282200262
Full pointed at:
   snacks => 0.01898999999999971

Best match for [:young_adults, :male, :single, :asia]:
   meat => 0.0037455794492659757
Full pointed at:
   meat => 0.0017000000000000016

Best match for [:adults, :female, :in_relationship, :europe]:
   beauty => 0.0012287311061725868
Full pointed at:
   beauty => 0.0003000000000000026

Best match for [:elders, :female, :in_relationship, :canada]:
   veggies => 0.002156365730474441
Full pointed at:
   veggies => 0.0013500000000000022

That's quite impressive! Even though we're using a simplified model to approximate the original distribution, the algorithm managed to infer the correct values in all cases. You can notice also that the results differ only by a couple of cases in 1000.

An approximation like that would certainly be very useful in a more complex e-commerce scenario, where the number of evidence variables would be big enough to be unmanageable using the full distribution. There are use cases, though, where a couple of errors in 1000 cases would be too many; the traditional example is medical diagnosis. There are also cases where the number of errors would be much greater, simply because the Naive Bayes assumption of conditional independence of variables is not always a fair assumption. Is there a way to improve?

The Naive Bayes assumption says that the distribution factorizes the way we did it only if the features are conditionally independent given the category. The notion of conditional independence (apart from the formal mathematical definition) says that if some variables a and b are conditionally independent given c, then once we know the value of c, no additional information about b can alter our knowledge about a. In our example, knowing the category, let's say :beauty, doesn't mean that, e.g., sex is independent of age. In real-world examples, it's often very hard to find a use case for Naive Bayes that follows the assumption in all cases.
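In the notation used earlier in this post (this restatement is mine, not the original author's), the assumption requires that for any two features a and b and the category c:

p(a, b | c) = p(a | c) * p(b | c)

which is equivalent to saying that p(a | b, c) = p(a | c): once c is known, learning b tells us nothing new about a.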

There are alternative approaches that allow us to apply assumptions that more closely follow the chosen data set. We will explore these in the next articles, building on top of what we saw here.