Lately, Ron, Ethan, and I have been blogging about several of our CakePHP learning experiences, such as incrementally migrating to CakePHP, using the CakePHP Security component, and creating CakePHP fixtures for HABTM relationships. This week, I came across another blog-worthy topic while troubleshooting for JackThreads that involved auto login, requests that were forced to be secure, and infinite redirects.
Ack! Users were experiencing infinite redirects!
Some users were seeing infinite redirects. The following use cases identified the problem:
- Auto login true, click on link to secure or non-secure homepage => Whammy: Infinite redirect!
- Auto login false, click on link to secure or non-secure homepage => No Whammy!
- Auto login true, type in secure or non-secure homepage in new tab => No Whammy!
- Auto login false, type in secure or non-secure homepage in new tab => No Whammy!
So, the problem boiled down to an infinite redirect when auto login customers clicked to the site through a referer, such as a promotional email or a link to the site.
Identifying the Cause of the Problem
After I applied initial surface-level debugging without success, I decided to add excessive debugging to the code. I added debug statements throughout:
- the CakePHP Auth object
- the CakePHP Session object
- the app's app_controller beforeFilter that completed the auto login
- the app's component that forced a secure redirect on several pages (login, checkout, home)
I output the session id and request location with the following debug statement:
$this->log($this->Session->id().':'.$this->here.':'.'/*relevant message about whatsup*/', LOG_DEBUG);
With the debug statement shown above, I was able to compare the normal and infinite redirect output and identify a problem immediately:
Normal output:
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/: User does not exist!
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/: Success in auto login!
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/: redirecting to /sale
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/sale: User exists!
2009-12-09 11:44:55 Debug: d3c2297ddea9b76605cb7a459f45965b:/sale: calling action!
Infinite redirect output:
2009-12-09 11:43:30 Debug: 65cb23e4ca358b7270513cca4a52e9b7:/: User does not exist!
2009-12-09 11:43:30 Debug: 65cb23e4ca358b7270513cca4a52e9b7:/: Success in auto login!
2009-12-09 11:43:30 Debug: 65cb23e4ca358b7270513cca4a52e9b7:/: redirecting to /sale
2009-12-09 11:43:30 Debug: 397f099790347716e0bc58c73f23358d:/sale: User does not exist!
2009-12-09 11:43:30 Debug: 397f099790347716e0bc58c73f23358d:/sale: redirecting to /login
2009-12-09 11:43:30 Debug: 0dfee15a4295b26aad115ae37d470d30:/login: User does not exist!
2009-12-09 11:43:30 Debug: 0dfee15a4295b26aad115ae37d470d30:/login: Success in auto login!
2009-12-09 11:43:30 Debug: 0dfee15a4295b26aad115ae37d470d30 /login: redirecting to /sale
2009-12-09 11:43:31 Debug: 3f23b7f7bead5d23fd006b6d91b1d195:/sale: User does not exist!
2009-12-09 11:43:31 Debug: 3f23b7f7bead5d23fd006b6d91b1d195:/sale: redirecting to /login
...
What I immediately noticed was that sessions were dropped at every redirect on the infinite redirect path. So I researched a bit and found the following resources:
- http://groups.google.com/group/cake-php/browse_thread/thread/4d7807465be56b03: A CakePHP Google Group message about lost sessions.
- http://book.cakephp.org/view/42/The-Configuration-Class: CakePHP documentation on the Security.level setting.
- http://www.php.net/manual/en/session.configuration.php#ini.session.referer-check: PHP documentation on referer_check.
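The dropped sessions in the log output above can also be spotted mechanically. The following is a minimal sketch in Ruby, not part of the original debugging: it assumes the "timestamp Debug: session_id:path: message" layout produced by the debug statement, and the session_changes helper is a hypothetical name.

```ruby
# Sketch: report every point where the session id differs from the
# previous log line. Assumes the log layout shown in the output above.
LOG_LINE = /\A(\S+ \S+) Debug: (\h+):(\S+): (.*)\z/

def session_changes(lines)
  changes = []
  previous = nil
  lines.each do |line|
    match = LOG_LINE.match(line.strip)
    next unless match # skip lines that do not follow the layout
    session_id = match[2]
    path = match[3]
    if previous && previous != session_id
      changes << { path: path, from: previous, to: session_id }
    end
    previous = session_id
  end
  changes
end

# A few lines copied from the infinite redirect output.
sample = [
  '2009-12-09 11:43:30 Debug: 65cb23e4ca358b7270513cca4a52e9b7:/: redirecting to /sale',
  '2009-12-09 11:43:30 Debug: 397f099790347716e0bc58c73f23358d:/sale: User does not exist!',
  '2009-12-09 11:43:30 Debug: 397f099790347716e0bc58c73f23358d:/sale: redirecting to /login',
  '2009-12-09 11:43:30 Debug: 0dfee15a4295b26aad115ae37d470d30:/login: User does not exist!'
]

session_changes(sample).each do |change|
  puts "session changed at #{change[:path]}: #{change[:from]} -> #{change[:to]}"
end
```

Run against the normal output, the same helper would report no changes, since every line there carries the same session id.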
As it turns out, the Security.level configuration affected the referer check for redirects. The CakePHP Session object set the referer_check to HTTP_HOST if Security.level was equal to 'high' or 'medium'. A couple of the resources mentioned above recommend adjusting the Security.level to 'low', which sounded like a potential solution. But I wasn't certain that this was the cause of the redirect, so I tested several changes to verify the problem.
First, I tested the Security.level set to 'high', 'medium', and 'low'. With the Security.level set to 'low', the infinite redirect did not happen and the debug log showed a consistent session id. Next, I commented out the code in the CakePHP Session object that set the referer_check and set the Security.level to 'high'. This also seemed to fix the infinite redirect, although it wasn't ideal to make changes to the core CakePHP code. Finally, I changed $this->host to HTTPS_HOST instead of HTTP_HOST in the CakePHP Session object, so that the referer would be checked against the secure host rather than the non-secure host. This also fixed the infinite redirect, but again, it wasn't ideal to change the core CakePHP code.
I concluded that the secure redirect to the homepage or login page coupled with the auto login caused this infinite redirect. As pages were redirected between /login and /sale, the session (that stored the auto logged in user) was dropped since the referer check against HTTP_HOST failed.
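To make the referer check concrete, here is a simplified model in Ruby of how PHP's session.referer_check behaves according to the PHP documentation linked above: the session id is rejected only when a Referer header is sent and does not contain the configured substring. This is a sketch, not CakePHP's or PHP's actual code; the function name and the example hosts are hypothetical.

```ruby
# Simplified model of PHP's session.referer_check (an illustration only).
# A request keeps its session id unless it sends a Referer that does not
# contain the configured substring.
def session_id_accepted?(referer, referer_check)
  return true if referer_check.nil? || referer_check.empty? # check disabled
  return true if referer.nil? || referer.empty?             # no Referer sent
  referer.include?(referer_check)
end

check = 'www.example.com' # hypothetical host standing in for HTTP_HOST

# URL typed into a new tab: no Referer is sent
puts session_id_accepted?(nil, check)
# Link clicked in a promotional email hosted elsewhere
puts session_id_accepted?('http://mail.example.org/inbox', check)
# Navigation from a page on the same host
puts session_id_accepted?('https://www.example.com/login', check)
# No referer_check configured, as with Security.level 'low'
puts session_id_accepted?('http://mail.example.org/inbox', '')
```

Under this model, only the request arriving from an outside referer loses its session, which is consistent with the use cases listed earlier: typing the URL in a new tab worked, while clicking through from a link did not.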
The Solution
In an ideal world, I would like to see both HTTP_HOST and HTTPS_HOST included in the CakePHP referer check. But because we didn't want to edit the CakePHP core, I investigated the effect of changing the Security.level on the app:
- Security.level == high
- Security.level == medium
- *Security.level == low
- Security.level is not set
I provided this information to the client and let them decide which scenario met their business needs. For this situation, I recommended commenting out the Security.level configuration so that the session timeout would stay the same, but the cookie lifetime and inactiveMins values would increase.
This was an interesting learning experience that helped me understand a bit more about how CakePHP handles sessions. It also gave me exposure to referer checks in PHP, which I haven't dealt with much in the past.
Kiel and I had a fun time tracking down a client's networking problem the other day. Their scp transfers from their application servers behind a Cisco PIX firewall failed after a few seconds, consistently, with a connection reset.
The problem was easily reproducible with packet sizes of 993 bytes or more, not just with TCP but also ICMP (bloated ping packets, generated with ping -s 993 $host). That raised the question of how this problem could go undetected for their heavy web traffic. We determined that their HTTP load balancer avoided the problem as it rewrote the packets for HTTP traffic on each side.
Kiel narrowed the connection resets down to iptables' state tracking, which considered the packets INVALID rather than ESTABLISHED or RELATED as they should have been.
Then he found via tcpdump that the problem was easily visible in scp connections when TCP window scaling adjustments were made by either side of the connection. We tried disabling window scaling but that didn't help.
We tried having iptables allow packets in state INVALID when they were also ESTABLISHED or RELATED, and that reduced the frequency of terminated connections, but still didn't eliminate them entirely. (And it was a kludge we weren't eager to keep in place anyway.)
We wanted to avoid some unpleasant possibilities: (1) turn off stateful firewalling or (2) perform risky updates or configuration changes on the Cisco PIX, which may or may not fix the problem, in the middle of the busy holiday ecommerce season.
Finally, Kiel found this netfilter mailing list post which describes how to enable a Linux kernel workaround for the mangled packets the Cisco generates:
echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
Of course, we also saved that setting in /etc/sysctl.conf (as net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 1) so that it persists after a reboot.
So we have reliable long-running scp connections with TCP window scaling working and iptables doing its job. I love it when a plan comes together.
As Steph noted, we recently embarked on an adventure with a client who had a legacy PHP app. The app was initially developed in rapid fashion, with changing business goals along the way. Some effort was made at the outset with this vanilla PHP app to put key business logic in classes, but as often happens over time the cleanliness of those classes degraded. While much of the business rules and state management (i.e. database manipulation, session wrangling, authentication/access-control, etc.) were kept separate from the "views" (the PHP entry pages), the classes themselves became tightly coupled, overburdened with myriad responsibilities, etc.
This was a far cry from the stereotypical spaghetti PHP app, but nevertheless it needed some reorganization; all but the smallest changes inevitably required touching a wide range of classes and pages, and the code would only grow more brittle unless some serious refactoring took place.
We determined at the outset that getting the application moved into an established MVC framework would be of great benefit, and further determined that CakePHP would be a good choice. (This is the point where anybody reading will inevitably ask in comments "Why CakePHP instead of My Preferred Awesome Framework?" Sigh.) The client agreed. The question became: how do we get there from here?
I spent some time investigating and inevitably came across the well-regarded three-part blog series:
(The author of that series has a book out on the subject, as well.)
For somebody new to MVC application design, especially in the PHP space, the series (and presumably the related book) probably makes for pretty good reading. It presents a decent approach to how the refactoring of legacy code can be accomplished. However, the series also appears to operate under the assumption that you're in a scrap-and-rebuild situation: the legacy app can essentially go nowhere for a few weeks while it gets gutted into CakePHP.
As noted in a review of the related book, the rebuild-it-all assumption doesn't apply to many real world situations. The more money your application makes, the more users it affects, the larger the feature set, the more likely it is that the business cannot afford to have an application sit in a code freeze while an entire rewrite takes place.
We ultimately opted for a different approach: iteratively migrate to CakePHP. The simplicity of the basic PHP paradigm makes this remarkably easy.
The basic steps:
- Rearrange the legacy application so it runs "within" CakePHP, with the CakePHP dispatcher handling the request but ultimately invoking the original legacy view
- Make adjustments to the legacy code such that it gets its database handle(s) from CakePHP rather than internally, it uses CakePHP's session, etc.
- New development can proceed within CakePHP; legacy logic can be refactored into CakePHP over time as the opportunity presents itself (or the situation demands)
Getting the application to run within CakePHP in this manner does not require that much effort. Of course, this would depend on your situation, but in the traditional model of presentation-oriented code relying on some business objects and a database, it works out. For the initial step:
- Prepare a basic CakePHP application
- Pull the legacy code into the CakePHP webroot, with the legacy pages moved under a new legacy/ subdirectory
- Prepare a "legacy" action in the default PageController that maps the requested URI path to a path relative to the legacy/ directory, then invokes the file living at that path
- Set up a new catch-all route that invokes this legacy action
function includeLegacyPage($path = null) {
    // map the path passed in or from the request to the legacy/ subdirectory
    $cakeRequestPath = $path ? $path : $this->controller->params['url']['url'];
    $path = WWW_ROOT . 'legacy/' . $cakeRequestPath;
    // This just maps input arguments to globals
    $this->prepareGlobals(array('cakeRequestPath' => $cakeRequestPath));
    // Resolve directories to an index.php page as necessary
    if (is_dir($path)) {
        if (substr($path, -1) != '/') {
            $path .= '/';
        }
        $path .= 'index.php';
    }
    if (!file_exists($path)) {
        $this->controller->render('error');
        // stop here so we never try to include a missing file
        return;
    }
    try {
        // buffer PHP output
        ob_start();
        // this "invokes" the legacy page and gathers its content
        include $path;
        // pull in the buffered content
        $this->controller->output = ob_get_contents();
        // stop output buffering
        ob_end_clean();
    } catch (JackExceptionRedirect $e) {
        // discard any partial output the legacy page produced before redirecting
        ob_end_clean();
        // We adjusted the legacy app's redirect functions to throw a custom exception
        // class that we catch here, so we can use CakePHP's native redirection
        $this->controller->redirect($e->location, $e->getCode(), false);
    } catch (Exception $e) {
        // close the buffer, then let all other errors propagate up
        ob_end_clean();
        throw $e;
    }
    $this->controller->autoRender = false;
    $this->controller->autoLayout = false;
}
Our PageController's "legacy" action uses the above routine to pull in the legacy page.
The second step, of getting CakePHP to control the session, the database handle, etc., involves some minor hacks. They don't feel elegant. They go outside the MVC pattern. But they provide the crucial glue necessary to put CakePHP in charge.
- Make the controller's session available from a global; adjust legacy code to use it instead of direct use of the PHP session. This means that CakePHP controls the session configuration.
- Make the CakePHP database handle available from a global as well; adjust your legacy database initialization code so it simply uses the global handle from CakePHP. Now CakePHP controls your database configuration, and CakePHP and the legacy code will use the same handle in a given request.
- And so on and so forth.
App::import('ConnectionManager');
$standard_globals = array(
    'cakeDbh' => ConnectionManager::getDataSource('default')->connection,
    'cakeSession' => $this->Session
);
$this->prepareGlobals($standard_globals);
Up until now, CakePHP's introduction into the mix hasn't added value. Having reached this point, however, you're ready to start taking advantage of CakePHP. From here, we refactored our special "legacy" action logic into a new "LegacyPage" component so any controller/action could use the mechanism. Then we were able to:
- Refactor legacy user authentication logic to use CakePHP's Auth core component
- Refactor various legacy pages to be fronted by CakePHP controller actions, moving the high-level flow control (input validation, user validation, and associated redirects) out of the legacy page and into the controller. This simplifies the legacy page (making it more strictly limited to presentation) and puts flow control where it belongs.
- For a new feature involving new data structures, developed a new CakePHP component to implement the business operations, new controllers/actions for aspects of the new functionality, and adjusted some legacy code to get data from the new component rather than original direct database calls or legacy class calls
So, what are the advantages of this approach, versus a slash-and-burn rewrite-it-all approach?
- We get to a point in which we're tangibly benefitting from CakePHP with minimal investment of time/money; contrast that with the potential expense of rewriting the entire application before the business sees any benefit
- While we proceeded in this work, the client was actively developing their legacy system; there was no need for a code freeze, and reconciling their changes with our work was fairly trivial; one Git rebase took care of it (though I admittedly missed a couple things during the rebase, which we caught and fixed with some spot-checking).
- No repeating of oneself: by making the entire legacy application available within the context of the target framework, we don't need to spend cycles rewriting existing functionality; the do-it-from-scratch approach would, by contrast, require reimplementation of everything
- We can refactor the legacy code in a prioritized, iterative fashion: refactor the most important stuff first, and the less important stuff later.
- We can partially refactor specific pieces of legacy code, such as removing business/data logic from pages such that legacy pages become more like views in the MVC triad; we're not forced to redo an entire legacy subsystem to improve the code organization
- The legacy work that is solid and doesn't need much refactoring stays put, and is usable from the rest of the CakePHP application
We may well get to the point (in late 2010, perhaps) when all legacy code has been refactored into CakePHP's MVC architecture. Or perhaps not: the business has to balance competing priorities, and it may ultimately be that some aspects of the legacy code just don't get refactored because they aren't especially broken and the business need simply doesn't come up. That's part of the beauty here: we don't have to make that decision right now; we can let the real-world priorities make that decision for us over time.
It's easy to imagine an engineer finding this less attractive than a redo-it-in-my-favorite-framework-du-jour approach. It reeks of compromise. Yet, from a business standpoint the advantages are hard to dispute. From a technical standpoint, they're hard to dispute as well: faster, shorter cycles of development bring a higher likelihood of success, particularly for small teams (or lone individuals); the management of change is much simpler with iterative design; the iterative approach is arguably less prone to second-system effect than is a rewrite; etc.
This asks more of the engineer than does a ground-up rewrite in Framework X. So many modern frameworks positively shine with possibility; the engineer lusts for the opportunity to Do It Right, and falls prey to the fallacy that the framework will solve all their problems given that Done Right investment. But, whatever the features and community offerings may be, modern frameworks ultimately help us organize our code better; better organization of code is amongst the most obvious benefits one gets in moving into a modern framework. The iterative approach gets us there with far less risk and, in many cases, far more naturally than does the rewrite-it-all approach, but it asks us to have the patience to move in small steps. It asks that we have the mental room and rigor to envision what the Done Right system might look like, as well as a long chain of interim steps taking us from here to there. But it delivers value much faster, at lower risk, at lower cost, and crucially, reduces redundant work and gives us the opportunity to change direction as we go. Consequently, for many -- even most -- business situations the iterative transformation is the system Done Right.
A couple of months ago, I worked on a project for Survival International that required two dimensional product optioning. The shopping component of the site used Spree, an open source Rails ecommerce project that End Point previously sponsored and continues to support. Because this open source project is quickly evolving, we wanted to implement a custom solution that would "stand the test of time" and work with new Spree releases. I worked with the existing data structures and functionality as much as possible. The product optioning implementation discussed in this article should translate to other ecommerce platforms as well.

Here's what I mean when I say "two dimensional product optioning".
The first step to extending the core ecommerce functionality was to understand the data model. A single product "has many" option types (size, color). An option type "has many" option values (size: small, medium, large). Each product also "has many" variants. Each variant is tied to an option value for each product option type. For example, each variant requires a corresponding size and color option value in the example above. Ideally, each variant represents a unique size and color combination.

An *awesome* database dependency diagram.
Using the Spree demo data, I set up the Apache Baseball Jersey to have option types "PO_Size" and "PO_Color". PO_Size contains option values Small, Medium, and Large. PO_Color contains option values Red, Blue, and Green.

Variants assigned to the Apache Baseball Jersey
The second step to producing a two dimensional product option table was to generate the required data in a before_filter method in the controller. Below are the contents of the module that generates the hash in the before_filter method with color and size information. The module retrieves active variants first, then verifies that the required option types are tied to the product. Then, size, color, and variant ids are collected from the active variants producing the data structure described above.
def self.included(target)
  target.class_eval do
    before_filter :define_2d_option_matrix, :only => :show
  end
end

def define_2d_option_matrix
  variants = Spree::Config[:show_zero_stock_products] ?
    object.variants.active.select { |a| !a.option_values.empty? } :
    object.variants.active.select { |a| !a.option_values.empty? && a.in_stock }
  return if variants.empty? ||
    object.option_types.select { |a| a.presentation == 'PO_Size' }.empty? ||
    object.option_types.select { |a| a.presentation == 'PO_Color' }.empty?
  variant_ids = Hash.new
  sizes = []
  colors = []
  variants.each do |variant|
    active_size = variant.option_values.select { |a| a.option_type.presentation == 'PO_Size' }.first
    active_color = variant.option_values.select { |a| a.option_type.presentation == 'PO_Color' }.first
    variant_ids[active_size.id.to_s + '_' + active_color.id.to_s] = variant.id
    sizes << active_size
    colors << active_color
  end
  size_sort = Hash['S', 0, 'M', 1, 'L', 2]
  @sc_matrix = { 'sizes' => sizes.sort_by { |s| size_sort[s.presentation] }.uniq,
                 'colors' => colors.uniq,
                 'variant_ids' => variant_ids }
end
The code above produces a hash with three components:
- @sc_matrix['variant_ids']: a hash that maps size and color combinations to variant id
- @sc_matrix['sizes']: an array of sorted unique sizes of product variants
- @sc_matrix['colors']: an array of unique colors of product variants
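To see the shape of this hash without a running Spree install, here is a plain-Ruby sketch of the same idea. The Structs are hypothetical stand-ins for Spree's models, the sample ids are made up, and unlike the module above this version skips any variant that is missing a size or a color instead of raising an error.

```ruby
# Hypothetical stand-ins for Spree's OptionValue and Variant models.
OptionValue = Struct.new(:id, :type, :presentation)
Variant = Struct.new(:id, :option_values)

small = OptionValue.new(1, 'PO_Size', 'S')
large = OptionValue.new(2, 'PO_Size', 'L')
red   = OptionValue.new(3, 'PO_Color', 'Red')
blue  = OptionValue.new(4, 'PO_Color', 'Blue')

variants = [
  Variant.new(10, [large, red]),
  Variant.new(11, [small, red]),
  Variant.new(12, [small, blue])
]

# Build the sizes, colors, and "sizeid_colorid" => variant id lookup.
def build_matrix(variants)
  size_sort = { 'S' => 0, 'M' => 1, 'L' => 2 }
  variant_ids = {}
  sizes = []
  colors = []
  variants.each do |variant|
    size  = variant.option_values.find { |v| v.type == 'PO_Size' }
    color = variant.option_values.find { |v| v.type == 'PO_Color' }
    next unless size && color # skip variants missing either axis
    variant_ids["#{size.id}_#{color.id}"] = variant.id
    sizes << size
    colors << color
  end
  { 'sizes' => sizes.uniq.sort_by { |s| size_sort[s.presentation] },
    'colors' => colors.uniq,
    'variant_ids' => variant_ids }
end

matrix = build_matrix(variants)

# Print one row per color, with an X where no variant exists.
matrix['colors'].each do |c|
  row = matrix['sizes'].map { |s| matrix['variant_ids']["#{s.id}_#{c.id}"] || 'X' }
  puts "#{c.presentation}: #{row.join(' ')}"
end
```

The final loop mirrors what a table view does with the hash: iterate colors as rows and sizes as columns, and look up each combination by its "sizeid_colorid" key.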
In the view, the output of size and color arrays is used to generate a table. In this hardcoded view, sizes are displayed as the horizontal option across the top of the table, and colors as the vertical option along the left side of the table.
...
<% if @sc_matrix -%>
  <p>Choose your colour, size and quantity below.</p>
  <table id="option-matrix">
    <tr>
      <th></th>
      <% @sc_matrix['sizes'].each do |s| %>
        <th class="size"><%= s.presentation %></th>
        <td class="spacer"></td>
      <% end -%>
    </tr>
    <% @sc_matrix['colors'].each do |c| -%>
      <tr>
        <th class="color"><%= c.presentation %></th>
        <% @sc_matrix['sizes'].each do |s| -%>
          <td>
            <% if @sc_matrix['variant_ids'][s.id.to_s + '_' + c.id.to_s] -%>
              <input type="radio" value="<%= @sc_matrix['variant_ids'][s.id.to_s + '_' + c.id.to_s] %>" name="products[<%= @product.id %>]" />
            <% else -%>
              <img src="/images/radio-notavailable.png" alt="X" width="20" height="20" />
            <% end -%>
          </td>
          <td class="spacer"></td>
        <% end -%>
      </tr>
    <% end -%>
  </table>
<% elsif #check for other stuff
...
Here is a comparison of the current variant display method versus two dimensional variant display of the same product:
Current variant display method.
Two dimensional variant display method. Two variants shown here are out of stock.
And here is another example of two dimensional optioning in use at Survival International (more glamorous styling):
Spree extensions are similar to WordPress plugins or Drupal modules that do not typically require you to edit core code. The primary components of the extension are a module with the before_filter functionality and a custom view that overrides the core product view. An extension was created for this functionality and it lives at http://github.com/stephp/spree-product-options.
Possibilities for future work include making the extension more robust by eliminating the hard-coded option types of "PO_Size" and "PO_Color" and removing the hard-coded size ordering hash. It would be ideal to be able to assign the two option types (horizontal axis and vertical axis) in the Spree admin for each product or a set of products. Another possibility is extending the functionality to multi-dimensional product optioning that would allow you to select more than two option types per product (for example: size, color, and material), but this functionality is more complex and may depend on JavaScript to hide and show option types and values.
As a consultant, I'm often called to make changes on production systems - sometimes in a hurry. One of my rules is to document all changes I make, no matter how small or unimportant they may seem. In addition to local notes, I always check in any files I change, or might change in the future, into version control. In the past, I would always use RCS. However, Jon Jensen challenged me to rethink my automatic use of RCS and give git a try for this.
This makes sense on some levels. We use git for everything here at End Point, and it is our preferred version control system. I still use other systems: there are some clients and projects that require the use of subversion, mercurial, and even cvs. The advantage of git for quick one-off checkins is that, similar to RCS, there is no central repository, and setup is extremely easy.
As an example, one of the files I often check into version control is postgresql.conf, the main configuration file for the Postgres database. Before I even edit the file, I'll check it in, so the sequence of events looks like this:
mkdir RCS
ci -l postgresql.conf
edit postgresql.conf
The creation of the RCS directory is optional but recommended. RCS (which stands for Revision Control System) uses a very simple tracking mechanism. A new file that tracks all changes is created for each file. This new file takes the original name of the file and adds a ",v" to the end of it. However, it's annoying to have all those "comma vee" files lying around, so RCS has a nice trick: when a directory named RCS exists, all the comma vee files are placed into that directory. The "ci -l postgresql.conf" checks in (ci) the file, and the "-l" flag instructs RCS to immediately check it back out again and lock it (as the current user). This is an RCS-specific advisory lock, and only gets in the way if you try to check in the file as a different user. The final command above, "edit postgresql.conf", calls up my editor of choice so I can start modifying the file.
Once the file has been modified, checking in the changes made is as simple as once again doing:
ci -l postgresql.conf
Now that it has been checked in, I can perform other common version control tasks against it. To see the complete log of changes:
rlog postgresql.conf
To see the differences between the current version and the last checkin, or against a specific version:
rcsdiff postgresql.conf
rcsdiff -r1.3 postgresql.conf
To find a string in a specific previous version:
co -p -r1.3 postgresql.conf | grep foobar
Using git for this purpose is fairly similar. The first steps now become:
git init
git add postgresql.conf
git commit postgresql.conf
edit postgresql.conf
Technically, that is one more step than before, but not really a big deal. Note that we don't need to create a special directory to hold the versioning information: by default, git puts everything in a ".git" directory. Once we've made changes to the file, we can commit our changes with:
git commit postgresql.conf
To see the log of changes:
git log postgresql.conf
To see the differences between the current version and the last checkin, or against a specific version:
git diff postgresql.conf
git diff 11a049bc80fe4a2f4584465fe13d8bb4ee479f23 postgresql.conf
To find a string in a specific previous version:
git show 11a049bc80fe4a2f4584465fe13d8bb4ee479f23:postgresql.conf | grep foobar
With git, there is also quite a bit more that can be done now: easy branching, grepping, generating diffs, etc. However, most of it is overkill for the simple purpose of tracking local changes. On the downside, git does not have the simple version numbering that RCS has, and the syntax can be a bit trickier and non-intuitive.
So, did I make the switch? Well, yes and no. I've been trying to use git for simple checkins the last few weeks, and have had mixed results. Here's my breakdown of areas in which they differ:
Ease of use
RCS wins this one. All you really need to remember to use RCS is "ci -l filename". The only other commands you might possibly need are "rlog filename" and "rcsdiff filename". On the other hand, git requires a deeper understanding of objects, trees, add vs. commit, and the use of long, hard to type hexadecimal numbers. It's also not very intuitive, and the command arguments can be complex. To be fair, for this *particular* use case git is not really that much more complex, but the advantage still goes to RCS.
Availability
RCS wins this one as well. On many systems, RCS is already installed by default. Even when it is not, a "yum install rcs" or the equivalent works just fine 100% of the time. RCS has been around a long, long time, and it's solid, tested, and very available on any system you run into. In contrast, git is fairly new, does *not* come pre-installed on most systems, and is not even available via all packaging systems. This is one factor that would definitely prevent me from using it everywhere. Maybe years from now when it is a standard tool, this will change, but for now, RCS wins this one.
Diffs
The rcsdiff command is handy, but very limited. If all you want is the simplest of bare-bones diffs, all is good. However, git allows you to view diffs in different formats, add color, generate patches, and many other features that can be nice to have.
Fancy tricks
RCS is designed to be dirt simple and good at what it does: track single files. The design of git was for a large, distributed project with complex needs. This means that git has many tricks and features that the designers of RCS did not even dream of. While most of them are not needed when you are simply doing versioning of local files, there are definitely times when the full power of git is nice to have.
Grouping
RCS has no concept of projects or trees: everything is simply a file. This means that you cannot track relationships between files. The only possible way to do so is to compare the timestamps that two files were checked in. In contrast, git does not consider files at all, but simply treats everything as objects in a tree. This allows easy grouping of files together in a single logical commit. It also allows for things such as branching and merging.
Versioning
While git uses SHA1 checksums to name each object with a unique identity, RCS simply uses a "single dot" version number, and increments it for you. Thus, the first time you check in a file, it is set as version 1.1. The second version is 1.2, and so on. This is very useful when you are simply tracking a lone file: you know that version 1.20 is the 20th recorded change, and that comparing or viewing an earlier version is as simple as using the "-r x.y" option. Calling what git does "versioning" is somewhat of a misnomer: it has a completely different philosophy about how objects are tracked, which lends itself well to distributed and collaborative projects, but not so well to single files.
Blame
Here's one area where git wins hands down. For RCS, you do a checkin, and the file is locked as the current local user. There is no indication of the actual *person* doing the checkin (as opposed to the account name), unless you add it to the checkin comment each time, and that gets laborious and annoying. With git, you can set some standard environment variables (even on a shared account), and git will record who made the change. Not only can you see who made each commit and when, but you can use the awesome "git blame" command to view who made the last change to each line in a file.
As an aside, how do we do the assignment mentioned above in a shared account? Setting the author for git commits is as simple as setting environment variables like so:
$ export GIT_AUTHOR_NAME="Greg Sabino Mullane"
$ export GIT_AUTHOR_EMAIL="greg@endpoint.com"
On a shared account, just create an alias. For example:
cat > .gregs_stuff
export GIT_AUTHOR_NAME="Greg Sabino Mullane"
export GIT_AUTHOR_EMAIL="greg@endpoint.com"
cat >> .bashrc
alias greg='source ~/.gregs_stuff'
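As a variation on the alias-per-file approach, here is a minimal sketch of a single shell function that takes the name and email as arguments. The function name git_as is hypothetical. GIT_AUTHOR_NAME, GIT_AUTHOR_EMAIL, GIT_COMMITTER_NAME, and GIT_COMMITTER_EMAIL are the standard variables git reads; setting the committer pair as well is an addition not made in the example above.

```shell
# git_as NAME EMAIL: mark later git commits in this shell session as made by NAME.
# (git_as is a hypothetical helper name, not part of git.)
git_as() {
    if [ "$#" -ne 2 ]; then
        echo "usage: git_as NAME EMAIL" >&2
        return 1
    fi
    export GIT_AUTHOR_NAME="$1"
    export GIT_AUTHOR_EMAIL="$2"
    # Also set the committer so both fields of the commit agree.
    export GIT_COMMITTER_NAME="$1"
    export GIT_COMMITTER_EMAIL="$2"
}

# Per-person aliases in .bashrc can then be one-liners:
alias greg='git_as "Greg Sabino Mullane" "greg@endpoint.com"'
```

With this in place, each person on the shared account types their alias once per session before committing.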
Editor support
One of the nice things about RCS is that it has been around for so long that many editors have integrated support for it. For example, calling up a file in emacs that has been checked in via RCS shows a display in the status line at the bottom of the screen showing that the file is controlled by RCS, what the current version number is, and whether it is locked. While there is git support as well, it's only available in very new versions of emacs (and other editors). Advantage: RCS.
Bloat
Because git is a real version control system, and a complicated one at that, it carries a lot of setup baggage. Just creating a repository and checking in a single file creates about 37 files underneath the .git directory. This number grows sharply with every commit you do. By contrast, RCS creates a single file (and one additional for each file you track). This means you can easily ship around the "dot vee" files to other systems.
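To check the file count on your own system, a small helper like the following can be used. This is only a sketch: count_files is a hypothetical name, and the number you see will depend on your git version and on how many commits you have made.

```shell
# count_files DIR: print how many regular files live under DIR.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Usage, from the top of a git working tree:
#   count_files .git
```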
Final analysis
When looking at all the factors, RCS still wins. It's simple, gets the job done, and most important of all, is available on all systems. I may revisit this in a few years when git is more widespread.
Recently, Ron, Ethan, and I worked on a JackThreads project. We are in the process of moving JackThreads' legacy PHP application to the CakePHP framework in addition to introducing new functionality for this project.
Several of the pages require secure requests:
- the home page (where users log in or create accounts)
- the login page
- the "invite" page (where users create an account)
- the checkout page
We referred to this article that discusses using the Security component in CakePHP. Although the article covered the basics, we extended its concepts by creating a CakePHP component with custom security functionality that forces a secure request and preserves query string parameters. Below are the contents of the component:
class StephsSecurityComponent extends Object {
    var $components = array('Security');
    // Register the actions that must be requested over HTTPS.
    function forceSecure($args) {
        $this->Security->blackHoleCallback = 'forceSSL';
        $this->Security->requireSecure($args);
    }
    // Redirect to the HTTPS version of the current URL, keeping the query string.
    function forceSSL($controller) {
        $redirect_location = 'https://'.HTTPS_HOST.$controller->here;
        $params = $controller->params['url'];
        unset($params['url']);
        if (count($params) > 0) {
            // http_build_query() URL-encodes each key and value
            $redirect_location .= '?'.http_build_query($params);
        }
        $controller->redirect($redirect_location);
    }
}
This design required the following definition in the application's app_controller:
function forceSSL() {
    $this->StephsSecurity->forceSSL($this);
}
And any controller that required an action to be secure would call the forceSecure function in the beforeFilter:
function beforeFilter() {
    $this->StephsSecurity->forceSecure('my_action');
}
For the most part, the security redirect worked as expected. The before filter in each controller correctly registered the action that required a secure request, and logging statements in the CakePHP core security component verified that the secure component would call the blackHoleCallback if the request was not secure. But then, we came across a bug!
One of the controllers that included this new functionality was not working as expected. The controller had two actions, both of which accepted form input and did stuff with it. Only one of the actions required the force secure. One action received its input from the CakePHP form helper; the other received its input from a legacy PHP page. The action that received input from the legacy PHP page didn't do its stuff correctly. Below is a simplified version of this controller:
class ThisController extends AppController {
    ...
    // components are loaded through $components ($uses is for models)
    var $components = array('Security', 'StephsSecurity');
    function beforeFilter() {
        $this->StephsSecurity->forceSecure('action_one');
    }
    function action_one() {
        //receives inputs from a cakephp form helper
        //do stuff with $this->params
    }
    function action_two() {
        //receives inputs from a legacy php page
        //do stuff with $this->params -- FAIL
    }
}
We added debugging and found that $this->params (the form parameters) in action_two was empty. We added logging to the beforeFilter to determine if the parameters were deleted during the beforeFilter process. We found that the parameters were present at the conclusion of the beforeFilter. So, at some point between the end of the beforeFilter and the start of the action, our form parameters were deleted.
function beforeFilter() {
    $this->log($this->params, LOG_DEBUG);
    //some other unrelated before filtering
    $this->log($this->params, LOG_DEBUG);
    $this->StephsSecurity->forceSecure('action_one');
    $this->log($this->params, LOG_DEBUG); //parameters looked ok here!
}
After more troubleshooting, we determined that if the CakePHP core Security component wasn't included in the controller, the parameters were not deleted and the action did its stuff. A review of the CakePHP core Security component revealed that the component performs a validation on posts, which includes a check for a Token input. Because the post to this action originated from a legacy PHP page, it did not include the special hidden form variables that the CakePHP form helper adds (much like the Token inputs included via the Rails form helper):
<input type="hidden" value="POST" name="_method"/>
<input type="hidden" id="Token123123123" value="123123123131231231223" name="data[_Token][key]"/>
As a result, the black hole callback was called before action_two was reached, and action_two was then called with missing parameters. Ethan realized there was a simple fix for this post validation failure: setting Security->validatePost to false inside the controller's beforeFilter bypasses the _validatePost check in the Security component. With post validation disabled, action_two behaved as expected.
function beforeFilter() {
    $this->Security->validatePost = false;
    $this->StephsSecurity->forceSecure('index');
}
Unfortunately, there isn't a lot of documentation on the CakePHP Security component that would have helped us identify this issue quickly. Configuration of the CakePHP Security component, discussed here, fails to mention the validatePost value, but it is included in the CakePHP API documentation.
Fortunately, it wasn't too difficult to troubleshoot once we observed the undesired behavior originated from the inclusion of the Security component in the controller. We are now aware of this Security post validation as we continue to transition legacy PHP to CakePHP. I'm sure we'll come across situations where data is passed from legacy pages or 3rd party services that do not contain the required Token variables and will require bypassing the _validatePost check.
When setting up a login form in a controller other than the Users controller in CakePHP, don't forget to load the User model:
var $uses = array('User');
Surprisingly, within our view we were able to set up forms that worked with the User model. But when the Auth component checked the post for the user data, it did not find any and stopped processing the request. This was not a graceful way for the Auth component or CakePHP to handle the request; an error message would have helped track down the issue.
XZ is a new free compression file format that is starting to be more widely used. The LZMA2 compression method it uses first became popular in the 7-Zip archive program, with an analogous Unix command-line version called 7z.
We used XZ for the first time in the Interchange project in the Interchange 5.7.3 packages. Compared to gzip and bzip2, the file sizes were as follows:
interchange-5.7.3.tar.gz 2.4M
interchange-5.7.3.tar.bz2 2.1M
interchange-5.7.3.tar.xz 1.7M
That tighter compression comes at a cost: compressing with xz is about 4 times slower than with bzip2. A bonus is that xz decompresses about 3 times faster than bzip2. The combination of significantly smaller file sizes and faster decompression made it a clear win for distributing software packages, leading to it being the format used for packages in Fedora 12.
It's also easy to use on Ubuntu 9.10, via the standard xz-utils package. When you install that with apt-get, aptitude, etc., you'll get a scary warning about it replacing lzma, a core package, but this is safe to do because xz-utils provides compatible replacement binaries /usr/bin/lzma and friends (lzcat, lzless, etc.). There is also built-in support in GNU tar with the new --xz aka -J options.
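To repeat the size comparison on your own files, something like the following sketch can be used. The function name compare_sizes is hypothetical, each tool runs at its default compression level (the levels behind the Interchange numbers above are not stated), and the directory name in the tar comments is an assumption.

```shell
# compare_sizes FILE: print "<tool> <compressed bytes>" for each compressor
# that is installed, using each tool's default compression level.
compare_sizes() {
    for tool in gzip bzip2 xz; do
        if command -v "$tool" >/dev/null 2>&1; then
            bytes=$("$tool" -c "$1" | wc -c | tr -d ' ')
            echo "$tool $bytes"
        fi
    done
}

# Creating and unpacking an .xz tarball with GNU tar's -J option:
#   tar -cJf interchange-5.7.3.tar.xz interchange-5.7.3/
#   tar -xJf interchange-5.7.3.tar.xz
```

The -c option writes the compressed stream to standard output, so the original file is left untouched.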
This morning on the Interchange users list there was a post from Racke discussing a similar issue. His customer had the Ask.com toolbar installed and Interchange's robot matching code was mistakenly matching the Ask.com toolbar as a search spider. The user agent of the browser with the Ask.com toolbar installed appeared as follows:
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; msn OptimizedIE8;ENUS; AskTB5.6)"
A quick look at the current robots.cfg that Steven Graham linked showed that 'AskTB' had been added to the NotRobotUA directive which instructs Interchange to not consider AskTB a search spider, thus allowing proper use of sessions on the site.
Updating the robots.cfg on our client's site allowed users with the Ask.com toolbar to browse, log in and check out as expected. Those with Interchange sites who see reports of similar issues should consider a false positive spider match a possibility and update their robots.cfg.
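The idea of an exception list overriding a spider pattern can be sketched as follows. This is not Interchange code: the function name and both pattern lists are made up for illustration, and Interchange reads its real lists from the RobotUA and NotRobotUA directives in robots.cfg.

```shell
# classify_ua UA: print "robot" or "browser" for a User-Agent string.
# Both pattern lists are illustrative assumptions, not Interchange's lists.
ROBOT_PATTERNS='Googlebot Slurp Ask'
NOT_ROBOT_PATTERNS='AskTB'

classify_ua() {
    ua=$1
    # In this sketch the exceptions are checked first, so a toolbar
    # is never mistaken for a spider.
    for pat in $NOT_ROBOT_PATTERNS; do
        case $ua in *"$pat"*) echo browser; return 0 ;; esac
    done
    for pat in $ROBOT_PATTERNS; do
        case $ua in *"$pat"*) echo robot; return 0 ;; esac
    done
    echo browser
}
```

Without the exception entry, the toolbar's user agent would match the broader spider pattern and the visitor would be treated as a robot.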
A couple of months ago, I integrated Omniture SiteCatalyst into an Interchange site for one of End Point's clients, CityPass. Shortly after, the client added a blog to their site, which is a standalone WordPress instance that runs separately from the Interchange ecommerce application. I was asked to add SiteCatalyst tracking to the blog.
I've had some experience with WordPress plugin development, and I thought this was a great opportunity to develop a plugin to abstract the SiteCatalyst code from the WordPress theme. I was surprised that there were limited Omniture WordPress plugins available, so I'd like to share my experiences through a brief tutorial for building a WordPress plugin to integrate Omniture SiteCatalyst.
First, I created the base WordPress plugin file to append the code near the footer of the WordPress theme. This file must live in the ~/wp-content/plugins/ directory. I named the file omniture.php.
<?php /*
Plugin Name: SiteCatalyst for WordPress
Plugin URI: http://www.endpoint.com/
Version: 1.0
Author: Steph Powell
*/
function omniture_tag() {
}
add_action('wp_footer', 'omniture_tag');
?>
In the code above, the wp_footer is a specific WordPress hook that runs just before the </body> tag. Next, I added the base Omniture code inside the omniture_tag function:
...
function omniture_tag() {
?>
<script type="text/javascript">
<!--
var s_account = 'omniture_account_id';
-->
</script>
<script type="text/javascript" src="/path/to/s_code.js"></script>
<script type="text/javascript"><!--
s.pageName='' //page name
s.channel='' //channel
s.pageType='' //page type
s.prop1='' //traffic variable 1
s.prop2='' //traffic variable 2
s.prop3='' //traffic variable 3
s.prop4= '' //traffic variable 4
s.prop5= '' //traffic variable 5
s.campaign= '' //campaign variable
s.state= '' //user state
s.zip= '' //user zip
s.events= '' //user events
s.products= '' //user products
s.purchaseID= '' //purchase ID
s.eVar1= '' //conversion variable 1
s.eVar2= '' //conversion variable 2
s.eVar3= '' //conversion variable 3
s.eVar4= '' //conversion variable 4
s.eVar5= '' //conversion variable 5
/************* DO NOT ALTER ANYTHING BELOW THIS LINE ! **************/
var s_code=s.t();if(s_code)document.write(s_code)
--></script>
<?php
}
...
To test the footer hook, I activated the plugin in the WordPress admin. A blog refresh should yield the Omniture code (with no variables defined) near the </body> tag of the source code.
After verifying that the code was correctly appended near the footer in the source code, I determined how to track the WordPress traffic in SiteCatalyst. For our client, the traffic was to be divided into the home page, static pages, articles, tag pages, category pages and archive pages. The Omniture variables pageName, channel, pageType, prop1, prop2, and prop3 were modified to track these pages. The existing WordPress functions is_home, is_page, is_single, is_category, is_tag, is_month, the_title, get_the_category, single_cat_title, single_tag_title, and the_date were used.
...
<script type="text/javascript"><!--
<?php
$pageName = $channel = $pageType = $prop1 = $prop2 = $prop3 = ''; //defaults for pages matched by no branch
if (is_home()) { //WordPress functionality to check if page is home page
    $pageName = $channel = $pageType = $prop1 = 'Blog Home';
} elseif (is_page()) { //WordPress functionality to check if page is static page
    $pageName = $channel = the_title('', '', false);
    $pageType = $prop1 = 'Static Page';
} elseif (is_single()) { //WordPress functionality to check if page is article
    $categories = get_the_category();
    $pageName = $prop2 = the_title('', '', false);
    $channel = $categories[0]->name;
    $pageType = $prop1 = 'Article';
} elseif (is_category()) { //WordPress functionality to check if page is category page
    $pageName = $channel = single_cat_title('', false);
    $pageName = 'Category: ' . $pageName;
    $pageType = $prop1 = 'Category';
} elseif (is_tag()) { //WordPress functionality to check if page is tag page
    $pageName = $channel = single_tag_title('', false);
    $pageType = $prop1 = 'Tag';
} elseif (is_month()) { //WordPress functionality to check if page is month page
    //explode() replaces split(), which is deprecated as of PHP 5.3
    list($month, $year) = explode(' ', the_date('F Y', '', '', false));
    $pageName = 'Month Archive: ' . $month . ' ' . $year;
    $channel = $pageType = $prop1 = 'Month Archive';
    $prop2 = $year;
    $prop3 = $month;
}
echo "s.pageName = '$pageName' //page name\n";
echo "s.channel = '$channel' //channel\n";
echo "s.pageType = '$pageType' //page type\n";
echo "s.prop1 = '$prop1' //traffic variable 1\n";
echo "s.prop2 = '$prop2' //traffic variable 2\n";
echo "s.prop3 = '$prop3' //traffic variable 3\n";
?>
s.prop4 = '' //traffic variable 4
...
The plugin allows you to freely switch between WordPress themes without having to manage the SiteCatalyst code and to track the basic WordPress page hierarchy. Here are example outputs of the SiteCatalyst variables broken down by page type:
Homepage
s.pageName = 'Blog Home' //page name
s.channel = 'Blog Home' //channel
s.pageType = 'Blog Home' //page type
s.prop1 = 'Blog Home' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Tag Page
s.pageName = 'chocolate' //page name
s.channel = 'chocolate' //channel
s.pageType = 'Tag' //page type
s.prop1 = 'Tag' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Category Page
s.pageName = 'Category: Food' //page name
s.channel = 'Food' //channel
s.pageType = 'Category' //page type
s.prop1 = 'Category' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Static Page
s.pageName = 'About' //page name
s.channel = 'About' //channel
s.pageType = 'Static Page' //page type
s.prop1 = 'Static Page' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Archive
s.pageName = 'Month Archive: November 2009' //page name
s.channel = 'Month Archive' //channel
s.pageType = 'Month Archive' //page type
s.prop1 = 'Month Archive' //traffic variable 1
s.prop2 = '2009' //traffic variable 2
s.prop3 = 'November' //traffic variable 3
Article
s.pageName = 'Hello world!' //page name
s.channel = 'Test Category' //channel
s.pageType = 'Article' //page type
s.prop1 = 'Article' //traffic variable 1
s.prop2 = 'Hello world!' //traffic variable 2
s.prop3 = '' //traffic variable 3
A follow-up step for this plugin would be to use the wp_options table in WordPress to manage the Omniture account id, which would allow an admin to set the Omniture account id through the WordPress admin without editing the plugin code. I've uploaded the plugin to a github repository here.
CakePHP, a popular MVC framework for PHP, offers a pretty easy-to-use object-relational mapper, as well as a fairly straightforward fixture class for test data. Consequently, it's fairly easy to get into test-driven development with CakePHP, though this can take some acclimation if you're coming from Rails or Django or some such; the need to go through a web interface to navigate to and execute your test cases feels, to me, a little unnatural. Nevertheless, you can get writing tests pretty quickly, and the openness of the testing framework means that it won't get in your way. Indeed, compared to the overwhelming plethora of testing options one gets in the Ruby space -- and the accompanying sense that the choice of testing framework is akin to one's choice of religion, political party, or top 10 desert island album list -- CakePHP's straightforward testing feels a little liberating.
Which is why it was a little surprising to me that getting a test fixture going for the join table on a has-and-belongs-to-many (HABTM) association is -- at least in my experience -- not the clearest thing in the world.
One can presumably configure the fixture to merely use the table option in the fixture's $import attribute. However, as I was following the table and model naming conventions, I felt that I must be doing something wrong in my attempts to get a fixture going for a HABTM relationship, and consequently I eschewed the (potentially) easy way out to try to find a solution that ought to work.
So, let's say my relations were:
- Product model: some stuff to sell
- Sale model: individual "sale" events when particular products are promoted
- A products_sales join table establishes a many-to-many relationship (can we all acknowledge that "many-to-many" is much more convenient for meatspace communication than the horrendously awkward "has-and-belongs-to-many"?) between these two fabulous structures
You can go with the usual Cake-ish model definitions:
# in app/models/product.php
class Product extends AppModel {
    var $name = 'Product';
    var $hasAndBelongsToMany = array(
        'Sale' => array('className' => 'Sale')
    );
}
# in app/models/sale.php
class Sale extends AppModel {
    var $name = 'Sale';
    var $hasAndBelongsToMany = array(
        'Product' => array('className' => 'Product')
    );
}
Since we're following the naming conventions here (singular model name fronts pluralized table name, the join table for the HABTM relationship uses pluralized names for each relation joined, in alphabetical order), the above code should be all you need for the relationship to work.
Indeed, as explained in this helpful article on the HABTM-in-CakePHP subject, you should find that queries using these models will automatically include 'ProductsSale' model entries in their result sets, with that model being dynamically generated by the HABTM association.
So, that means you should be able to create a test fixture for the ProductsSale model, right?
# in app/tests/fixtures/products_sale_fixture.php
class ProductsSaleFixture extends CakeTestFixture {
    var $name = 'ProductsSale';
    var $import = 'ProductsSale';
    var $records = array(
        /* a buncha awesome stuff... */
    );
}
Unfortunately, at least with my experience on CakePHP 1.2.5, that doesn't work. When your test case attempts to load the fixture, you'll get SQL errors indicating that the test-prefixed version of your "products_sales" table doesn't exist.
I haven't done a sufficiently exhaustive analysis of the Cake innards to sort out why this is, and may yet do so. My guess based on nothing other than observation and intuition is that the auto-generated model is related only to the models involved in the HABTM relationship, through the bindModel method, and does not get generated in any global capacity such that it exists as a model in its own right. Consequently, while the testing code can guess the correct table name for the join table based on the naming conventions used for the fixture, since it doesn't relate to an extant model, it fails to go through the model-wrapping procedures that typically take place per test-case (setting up the test-space table per model, populating it from the fixture, etc.)
Fortunately, as illustrated by the aforementioned helpful article, we can front the join table with a full-fledged model class, and use that model class within the association definitions. This solves the problem of the broken fixture, as the fixture will now refer to a standard model and successfully set up the test table, data, etc.
That means the code becomes:
# in app/models/products_sale.php
class ProductsSale extends AppModel {
    /* the naming convention assumes singularized model name
       based on the entire table name; it does not make inner
       names singular. This feels a little unclean. If it
       really bothers you, recall the language you're using
       and I suspect you'll get over it. */
    var $name = 'ProductsSale';
    /* The join table belongs to both relations */
    var $belongsTo = array('Product', 'Sale');
}
# in app/models/product.php
class Product extends AppModel {
    var $name = 'Product';
    /* Use the 'with' option to join through the new model class */
    var $hasAndBelongsToMany = array(
        'Sale' => array('with' => 'ProductsSale')
    );
}
# in app/models/sale.php
class Sale extends AppModel {
    var $name = 'Sale';
    /* And again, the 'with' option */
    var $hasAndBelongsToMany = array(
        'Product' => array('with' => 'ProductsSale')
    );
}
No changes are necessary to the fixture for ProductsSale; once that join model is in place, it'll be good.
It is not uncommon for ORMs to provide magical intelligence for establishing HABTM relationships, and as a matter of convenience it's pretty handy. It is similarly common to allow for HABTM association through an explicitly-defined model class. While this ups the ceremony for setting up your ORM, there are benefits that come with it; a reduced reliance on magic can be distinctly advantageous if you ever get into hairy situations with ORM query wrangling, and it is reasonably common for a HABTM association to have annotations on the relationship itself. In each case, you'll be happy to have your join table fronted by a model class.
Hopefully this will save somebody else some trouble.
I'm back at work after last week's PubCon Vegas. I published several articles about specific sessions, but I wanted to provide some nuggets on recurring themes of the conference.
Google Caffeine Update
This year Google rolled out some changes referred to as the Google Caffeine update. This change increases the speed and size of the index, moves Google search to real-time, and improves search results relevancy and accuracy. It was a popular topic at the conference; however, not much light was shed on how the algorithm changes would affect your search results, if at all. I'll have to keep an eye on this to see if there are any significant changes in End Point's search performance.
Bing
Bing is gaining traction. They want to get [at least] 51% of the search market share.
Social media
Social media was a hot topic at the conference. An entire track was allocated to Twitter topics on the first day of the conference. However, it still pales in comparison to search. Of all referrals on the web, search still accounts for 98% and social media referrals account for less than 1% (view referral data here). Dr. Pete from SEOMoz nicely summarized the elephant in the room at PubCon regarding social media: it's important to measure social media response to determine if it provides business value.
Ecommerce Advice
I asked Rob Snell, author of Starting a Yahoo Business for Dummies, for the most important advice for ecommerce SEO he could provide. He explained the importance of content development and link building to target keywords based on keyword conversion. Basically, SEO efforts shouldn't be wasted on keywords that don't convert well. I typically don't have access to client keyword conversion data, but this is great advice.
Internal SEO Processes
Another recurring topic I observed at PubCon was that often internal SEO processes are a much bigger obstacle than the actual SEO work. It's important to get the entire team on your side. Alex Bennert of Wall Street Journal discussed understanding your audience when presenting SEO. Here are some examples of appropriate topics for a given audience:
- IT Folks: sitemaps, duplicate content (parameter issues, pagination, sorting, crawl allocation, dev servers), canonical link elements, 301 redirects, intuitive link structure
- Biz Dev & Marketing Folks: syndication of content, evaluation of vendor products & integration, assessing SEO value and link equity of partner sites, microsites, leveraging multiple assets
- Content Developers: on page elements best practices, linking, anchor text best practices, keyword research, keyword trends, analytics
- Management: progress, timelines, roadmaps
On the topic of internal processes, I was entertained by the various comments expressing the developer-marketer relationship, for example:
- "Don't ever let a developer control your URL structure."
- "Don't ever let a developer control your site architecture."
- "This site looks like it was designed by a developer."
Apparently developers are the most obvious scapegoat. Back to the point, though: it often takes more effort to get SEO understanding and support than to actually explain what needs to be done.
Search Engine Spam
Search engine spam detection is cool. During a couple of sessions with Matt Cutts, I became interested in writing code to detect search spam. For example:
- Crawling the web to detect links where the anchor text is '.'.
- Crawling the web to identify sites where robots.txt blocks ia_archiver.
- Crawling the web to detect pages with keyword stuffing.
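As a starting point for the second idea, here is a rough sketch of the per-site check, assuming the robots.txt has already been fetched. The function name is hypothetical, it only looks at groups that name ia_archiver explicitly (a blanket "User-agent: *" block is ignored), and it treats only the exact rule "Disallow: /" as blocking.

```shell
# blocks_ia_archiver: read a robots.txt on stdin; exit 0 if a group
# addressed to ia_archiver contains "Disallow: /", exit 1 otherwise.
blocks_ia_archiver() {
    awk '
        {
            sub(/#.*/, "")                 # drop comments
            line = tolower($0)
            gsub(/[ \t\r]/, "", line)      # drop all whitespace
        }
        line ~ /^user-agent:/ {
            # A User-agent line that follows a rule line starts a new group.
            if (!prev_was_agent) in_group = 0
            if (line == "user-agent:ia_archiver") in_group = 1
            prev_was_agent = 1
            next
        }
        line != "" {
            prev_was_agent = 0
            if (in_group && line == "disallow:/") blocked = 1
        }
        END { exit (blocked ? 0 : 1) }
    '
}

# Usage (curl and the URL are assumptions):
#   curl -s http://example.com/robots.txt | blocks_ia_archiver && echo "blocks ia_archiver"
```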
I've typically been involved in the technical side of SEO (duplicate content, indexation, crawlability), and haven't been involved in link building or content development, but these discussions provoked me to start looking at search spam from an engineer's perspective.
Google Parameter Masking
Apparently I missed the announcement of parameter masking in Google Webmaster Tools. I've helped battle duplicate content for several clients, and at PubCon I heard about parameter masking provided in Google Webmaster Tools. This functionality was announced in October of 2009 and allows you to provide suggestions to the crawler to ignore specific query parameters.
Parameter masking is yet another solution to managing duplicate content in addition to the rel="canonical" tag, creative uses of robots.txt, and the nofollow tag. The ideal solution for SEO would be to build a site architecture that doesn't require the use of any of these solutions. However, as developers we have all experienced how legacy code persists and sometimes a low effort-high return solution is the best short term option.
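To make the effect concrete, the sketch below shows how several duplicate URLs collapse into one once the ignorable parameters are dropped. It only illustrates the idea and is not how Google implements parameter masking; the function name, the example.com URLs, and the parameter names are all hypothetical, and URL fragments are not handled.

```shell
# strip_params URL NAME...: print URL with the named query parameters removed.
strip_params() {
    url=$1
    shift
    printf '%s\n' "$url" | awk -v drop="$*" '
        BEGIN {
            n = split(drop, names, " ")
            for (i = 1; i <= n; i++) skip[names[i]] = 1
        }
        {
            q = index($0, "?")
            if (q == 0) { print; next }    # no query string at all
            base = substr($0, 1, q - 1)
            m = split(substr($0, q + 1), pairs, "&")
            out = ""
            for (i = 1; i <= m; i++) {
                key = pairs[i]
                sub(/=.*/, "", key)        # keep only the parameter name
                if (key in skip) continue
                out = (out == "") ? pairs[i] : out "&" pairs[i]
            }
            if (out != "") base = base "?" out
            print base
        }
    '
}

# Both of these print http://example.com/shoes?color=red
#   strip_params 'http://example.com/shoes?color=red&sessionid=abc123' sessionid sort
#   strip_params 'http://example.com/shoes?color=red&sort=price' sessionid sort
```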
One of the best ways to secure your box against SSH attacks is the use of port knocking. Basically, port knocking seals off your SSH port, usually with firewall rules, such that nobody can even tell if you are running SSH until the proper "knock" is given, at which time the SSH port appears again to a specific IP address. In most cases, a "knock" simply means accessing specific ports in a specific order within a given time frame.
Let's step back a moment and see why this solution is needed. Before SSH there was telnet, which was a great idea way back at the start of the Internet when hosts trusted each other. However, it was (and is) extremely insecure, as it entails sending usernames and passwords "in the clear" over the internet. SSH, or Secure Shell, is like telnet on steroids. With a mean bodyguard. There are two common ways to log in to a system using SSH. The first way is with a password. You enter the username, then the password. Nice and simple, and similar to telnet, except that the information is not sent in the clear. The second common way to connect with SSH is by using public key authentication. This is what I use 99% of the time. It's very secure, and very convenient. You put the public copy of your SSH key on the server, and then use your local private SSH key to authenticate. Since you can cache your private key, this means only having to type in your key's passphrase once, and then you can ssh to many different systems with no password needed.
So, back to port knocking. It turns out that any system connected to the internet is basically going to come under attack. One common target is SSH - specifically, people connecting to the SSH port, then trying combinations of usernames and passwords in the hopes that one of them is right. The best prevention against these attacks is to have a good password. Because public key authentication is so easy, and makes typing in the actual account password such a rare event, you can make the password something very secure, such as:
gtsmef#3ZdbVdAebAS@9e[AS4fed';8fS14S0A8d!!9~d1aAQ5.81sa0'ed
However, this won't stop others from trying usernames and passwords anyway, which fills up your logs with their attempts and is generally annoying. Thus, the need to "hide" the SSH port, which by default is 22. One thing some people do is move SSH to a "non-standard" port, where non-standard means anything but 22. Typically, some random number that won't conflict with anything else. This will reduce and/or stop all the break-in attempts, but at a high cost: all clients connecting have to know to use that port. With the ssh client, it's adding a -p argument, or setting a "Port" line in the relevant section of your .ssh/config file.
All of which brings us to port knocking. What if we could run SSH on port 22, but not answer to just anyone, but only to people who knew the secret code? That's what port knocking allows us to do. There are many variants on port knocking and many programs that implement it. My favorite is "knockd", mostly because it's simple to learn and use, and is available in some distros' packaging systems. My port knocking discussion and examples will focus on knockd, unless stated otherwise.
knockd is a daemon that listens for incoming requests to your box, and reacts when a certain combination is reached. Once knockd is installed and running, you modify your firewall rules (e.g. iptables) to drop all incoming traffic to port 22. To the outside world, it's exactly as if you are not running SSH at all. No break-in attempts are possible, and your security logs stay nice and boring. When you want to connect to the box via SSH, you first send a series of knocks to the box. If the proper combination is received, knockd will open a hole in the firewall for your IP on port 22. From this point forward, you can SSH in as normal. The new firewall entry can get removed right away, cleared out at some time period later, or you can define another knock sequence to remove the firewall listing and close the hole again.
What exactly is the knock? It's a series of connections to TCP or UDP ports. I prefer choosing a few random TCP ports, so that I can simply use telnet calls to connect to the ports. Keep in mind that when you do connect, it will appear as if nothing happened - you cannot tell that knockd is logging your attempt, and possibly acting on it.
Here's a sample knockd configuration file:
[options]
    logfile = /var/log/knockd.log
[openSSH]
    sequence = 32144,21312,21120
    seq_timeout = 15
    command = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags = syn
[closeSSH]
    sequence = 32144,21312,21121
    seq_timeout = 15
    command = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    tcpflags = syn
In the above file, we've stated that any host that sends a TCP syn flag to ports 32144, 21312, and 21120, in that order, within 15 seconds, will cause the iptables command to be run. Note that iptables is not hard-coded into knockd at all. Any command can be run when the port sequence is triggered, which allows for all sorts of fancy tricks. To close it up, we do the same sequence, except the final port is 21121.
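A knock sequence like the one in this configuration can also be scripted. The following is a minimal sketch, assuming bash and the coreutils timeout command are available on the client; the function name knock is hypothetical and is not the client utility that ships with knockd.

```shell
# knock HOST PORT...: attempt one short TCP connection to each port, in order.
# Each attempt sends a SYN, which is what the tcpflags = syn setting matches;
# the connection itself is expected to be refused or to time out.
knock() {
    host=$1
    shift
    count=0
    for port in "$@"; do
        # bash's /dev/tcp pseudo-device opens a TCP connection; give up after 1s.
        timeout 1 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null || true
        count=$((count + 1))
    done
    echo "sent $count knocks to $host"
}

# Open the hole, then close it again, using the sample sequences:
#   knock example.com 32144 21312 21120
#   knock example.com 32144 21312 21121
```

Errors from the connection attempts are deliberately discarded, since a refused connection is the normal result of a knock.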
Once knockd is installed, and the configuration file is put in place, start it up and begin testing. Leave a separate SSH connection open to the box while you are testing! If you are really paranoid, you might want to open a second SSH daemon on a second port as well. First, check that the port knocking works by triggering the port combinations. knockd comes with a command-line utility for doing so, but I usually just use telnet like so:
    [greg@home ~] telnet example.com 32144
    Trying 123.456.789.000...
    telnet: connect to address 123.456.789.000: Connection refused
    [greg@home ~] telnet example.com 21312
    Trying 123.456.789.000...
    telnet: connect to address 123.456.789.000: Connection refused
    [greg@home ~] telnet example.com 21120
    Trying 123.456.789.000...
    telnet: connect to address 123.456.789.000: Connection refused

Note that we received a bunch of "Connection refused" messages, the same message we would get from any other random port, and the same message that people trying to connect to a port-knock-protected SSH will see. If you look in the logs for knockd (set as /var/log/knockd.log in the example file above), you'll see some lines like this if all went well:

    [2009-11-09 14:01] 100.200.300.400: openSSH: Stage 1
    [2009-11-09 14:01] 100.200.300.400: openSSH: Stage 2
    [2009-11-09 14:01] 100.200.300.400: openSSH: Stage 3
    [2009-11-09 14:01] 100.200.300.400: openSSH: OPEN SESAME
    [2009-11-09 14:01] openSSH: running command: /sbin/iptables -I INPUT -s 100.200.300.400 -p tcp --dport 22 -j ACCEPT
Voila! Your iptables should now contain a new line:
    $ iptables -L -n | grep 100.200
    ACCEPT  tcp  --  100.200.300.400  anywhere  tcp dpt:ssh
The next step is to lock everyone else out from the SSH port. Add a new rule to the firewall, but make sure it goes to the bottom:
    $ iptables -A INPUT -p tcp --dport ssh -j DROP
    $ iptables -L | grep DROP
    DROP  tcp  --  anywhere  anywhere  tcp dpt:ssh

You'll note that we used "A" to append the DROP to the bottom of the INPUT chain, and "I" to insert the exceptions at the top of the INPUT chain. At this point, you should try a new SSH connection and make sure you can still connect. If all is working, the final step is to make sure the knockd daemon starts up on boot, and that the DROP rule is added on boot as well. You can also add some hard-coded exceptions for boxes you know are secure, if you don't want to have to port knock from them every time.
One flaw in the above scheme the sharp reader may have spotted is that although the SSH port cannot be reached without a knock, the sequence of knocks used can easily be intercepted and played back. While this doesn't gain the potential bad guy too much, there is a way to overcome it. The knockd program allows the port knocking combinations to be stored inside of a file, and read from, one line at a time. Each successful knock will move the required knocks to the next line, so that even knowing someone else's knock sequence will not help, as it changes each time. To implement this, just replace the 'sequence' line as seen in the above configuration file with a line like this:
one_time_sequences = /etc/knockd.sequences.txt
In this case, the sequences will be read from the file named "/etc/knockd.sequences.txt". See the manpage for knockd for more details on one_time_sequences as well as other features not discussed here. For more on port knocking in general, visit portknocking.org.
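The sequence file has to be populated ahead of time. Below is a small Python sketch for generating random sequences; the function names, the three-port length, and the 1024 to 65535 port range are my own assumptions. As I read the knockd manpage, each line holds one sequence in the same format as the 'sequence' option, and a leading space is recommended because used lines are marked with a '#' in the first column. Treat both details as assumptions and check the manpage for your version.

```python
import secrets


def make_sequence(length=3, low=1024, high=65535):
    """Return `length` random ports between low and high, inclusive."""
    return [low + secrets.randbelow(high - low + 1) for _ in range(length)]


def make_sequence_lines(count, length=3):
    """Return `count` lines, one comma-separated port sequence per line.

    Each line starts with a space so that overwriting the first column
    with '#' does not clobber the first digit of the sequence.
    """
    lines = []
    for _ in range(count):
        ports = make_sequence(length)
        lines.append(" " + ",".join(str(port) for port in ports))
    return lines


if __name__ == "__main__":
    # Print 20 sequences; redirect the output to a file and keep a copy locally.
    print("\n".join(make_sequence_lines(20)))
```

The `secrets` module is used instead of `random` so the ports come from the operating system's cryptographic random source.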
While the one_time_sequences is a great idea, I'd like to see something a little different implemented someday. Specifically, having to pre-populate a fixed list of sequences is a drag. Not only do you have to make sure they are random, and that you have enough, but you have to keep the list with you locally. Lose that list, and you cannot get in! A better way would be to have your port knocking program generate the new port sequences on the fly. It would also encrypt the new port sequences to one or more public keys, and then put the file somewhere web accessible. Thus, one could simply grab the file from the server, decrypt it, and perform the port knocking based on the list of ports inside of it. Is all of that overkill for SSH? Almost certainly. :) However, there are many other uses for port knocking than simple SSH blocking and unblocking. Remember that many pieces of information can be used against your server, including what services are running on which ports, and which versions are in use.
On day 3 of PubCon Vegas, a great session I attended was Optimizing Forums For Search & Dealing with User Generated Content with Dustin Woodard, Lawrence Coburn, and Roger Dooley. User generated content is content generated by users in the form of message boards, customizable profiles, forums, reviews, wikis, blogs, article submission, question and answer, video media, or social networks.
Some good statistics were presented about why to tap into user generated content. Recently released Nielsen research showed that 1 out of every 11 minutes spent online is on a social network and 2/3rds of customer "touch points" are user-generated.
Dustin provided some interesting details about long tail traffic. He looked at HitWise's data of the top 10,000 search terms for a 3 month period. The top 100 terms accounted for 5.7% of all traffic, the top 1000 terms accounted for 10.6% of all traffic, and the entire 10,000 data set accounted for just 18.5% of all traffic. Drawn to scale, this data would look like a lizard with a one inch head and a tail 221 miles long, with the tail representing the long tail traffic.
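To see how quickly the share per term drops off, the cumulative figures above can be broken into buckets. This is a quick Python sketch of the arithmetic; the variable and function names are mine, and the only inputs are the three percentages quoted from the session.

```python
# (number of top terms, cumulative % of all search traffic)
CUMULATIVE = [(100, 5.7), (1000, 10.6), (10000, 18.5)]


def bucket_shares(cumulative):
    """Split cumulative shares into per-bucket rows.

    Each row is (terms_in_bucket, bucket_share_percent, share_per_term_percent).
    """
    rows = []
    prev_terms, prev_share = 0, 0.0
    for terms, share in cumulative:
        count = terms - prev_terms
        bucket = share - prev_share
        rows.append((count, round(bucket, 1), bucket / count))
        prev_terms, prev_share = terms, share
    return rows


def long_tail_share(cumulative):
    """Percent of traffic that falls outside the largest listed bucket."""
    return round(100.0 - cumulative[-1][1], 1)
```

By this arithmetic, the terms ranked 101 to 1000 add only 4.9 points, the terms ranked 1001 to 10,000 add 7.9 points, and 81.5% of all traffic comes from terms outside the top 10,000.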
Dustin gave the following steps for developing a user generated content community:
- Seed it with a few editors and really good initial content.
- Give them a voice.
- Make it easy to contribute.
- Make it cool or trendy.
- Provide ownership.
- Create competition with contests, ranking or by highlighting expertise.
- Build a sense of community or a sense of exclusivity.
- Give the community a purpose.
All SEO best practices apply to user generated content, but throughout the session, I learned several specific user generated content tips:
- Predefining keyword rich categories, topics and tags will go a long way with optimization. The better the topic structure created up front, the better the user generated content can be organized in the long run. Users are not inherently good at content organization, so content can be easily buried with poor information architecture.
- Developing automated cross-linking between user generated content helps improve authority, build clusters of content, and enrich the internal link structure. Dustin had experience with building a widget to automatically link to 5 pieces of user generated content and another widget to allow the user to select several pieces of user generated content from a set of related content.
- Examples of battling duplicate content include disallowing duplicate page titles and meta descriptions. Content that is moved, renamed or deleted should be managed well.
- Building a badge or widget to display user involvement helps increase external linking to your site, but this should be carefully managed to avoid appearing spammy. As best practices, widgets should have excellent accessibility, be simple with light branding, and always have fresh content.
- Developing your own tiny URL helps pass and keep intact external links to your site with user generated content. Lawrence suggested "gently tweeting" the highest quality user generated content.
Several of End Point's clients are either in the middle of or considering building a community with user generated content. In ecommerce, blogs, forums, reviews, and Q&A are the most prevalent types of user generated content that I've encountered. Many of the things mentioned in this session were good tips to consider throughout the development of user generated content for ecommerce.
Learn more about End Point's technical SEO services.
On the second day of PubCon Vegas, I attended several SEO track sessions including "SEO for Ecommerce", "International and European Site Optimization", "Mega Site SEO", and "SEO/SEM Tools". A mini-summary of several of the sessions is presented below.
Derrick Wheeler from Microsoft.com spoke on Mega Site SEO about "taming the beast". Microsoft has 1.2 billion URLs spread across thousands of web properties. For mega site SEO, Derrick highlighted:
- Content is NOT king. Structure is! Content is like the princess-in-waiting after structure has been mastered.
- Developing an overall SEO approach and an organization for getting structure, content, and authority SEO completed is more valuable than the actual SEO work itself. This was a common theme among many of the presentations at PubCon.
- Getting metrics set up at the beginning of SEO work is a very important step to measure and justify progress.
- Don't be afraid to say no to low priority items.
Most developers deal with a large amount of legacy code. Derrick discussed primary issues when working with legacy problems:
- Duplicate and undesirable pages. For Microsoft.com, managing and dealing with 1.2 billion pages results in a lot of duplicate and undesirable pages from the past.
- Multiple redirects.
- Improper error handling (error handling on 404s or 500s).
- International URL structure can be a problem for international sites. Having an appropriate TLD (top level domain) is the best solution, but if that's not possible, a process should be implemented to regulate the international urls.
- Low Quality Page Titles and Meta Tags. For large sites with hundreds of thousands of pages, it's really important to have unique page titles and meta descriptions or to have a template that forces uniqueness.
In summary, structure and internal processes are the areas to focus on for Mega Site SEO. Legacy problems are something to be aware of on a site so large that changes cannot be implemented as quickly as they can on a small site.
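As one example of an internal process for the page title problem, a crawl export can be checked for reused titles. The sketch below is in Python and assumes you already have (URL, title) pairs from some crawler or CMS export; the function name and the normalization rules (collapsing whitespace and ignoring case) are my own assumptions, not something presented in the session.

```python
from collections import defaultdict


def find_duplicate_titles(pages):
    """Group URLs by normalized page title and keep titles used more than once.

    `pages` is an iterable of (url, title) pairs.
    """
    by_title = defaultdict(list)
    for url, title in pages:
        # Collapse runs of whitespace and ignore case when comparing titles.
        key = " ".join(title.split()).lower()
        by_title[key].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}
```

The same function works for meta descriptions by passing (URL, description) pairs instead.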
In International and European Search Management, Michael Bonfils, Nelson James, and Andy Atkins-Krueger discussed international SEO and SEM tactics. Takeaways include:
- In international search marketing, it's important to incorporate culture into search optimization and marketing. What works in one country may not work in another, so take care not to offend a culture by failing to understand it. Some examples of content differences for targeting different cultures include emphasizing price points, focusing on product quality, and asserting authority or trust on a site.
- It's also important to understand how linguistics affects your keyword marketing. Automatic translation should not be used (all the speakers mentioned this). A good example of linguistics and search targeting is the use of the search term "soccer cleats", or "football boots". In England, the term "football boot" has a very small portion of the traffic share, but singular terms in other languages ("scarpe da calcio", "botas de fútbol") have a much larger percentage of the search market share. Andy shared many other examples of keywords for which a direct translation would not be the best term to target ("car insurance", "healthcare", "30% off", "cheap flights").
- Local hosting is important for metrics, linking, and developing trust. Nelson James shared research showing that 80% of the top 10 results for the top 30 keywords in China had a '.cn' top level domain, and the remaining top sites, which were '.com' sites, were all hosted in China.
- Other technical areas for international search that were mentioned are using the meta language tag, pinyin, charset, and language set. Duplicate content also will become a problem across sites of the same language.
- It's important to understand the search market share. In Russia, Google holds 35% of the search market and Yandex has 54%. In China, Baidu has 76% and Google has 22%. There are some reasons that explain these market share differences. Yandex was written to manage the large Russian vocabulary, which Google does not handle as well. Baidu handles search for media better than Google, and search traffic in China is much more entertainment driven, whereas in the US it is more business driven.
In the last session of the day, about 100 tools were discussed in SEO/SEM Tools. I'm planning on writing another blog post with a summary of these tools, but here's a short list of the tools mentioned by multiple speakers:
- SEMRush
- Google: Keyword Ad Tool, Webmaster Tools, Adplanner, SocialGraph API, Google Trends, Analytics, Google Insights
- SpyFu: Kombat, Domain Ad History, Smart Search, Keyword Ad History
- SEOBook
- SEOMoz: Linkscape, Mozbar, Top Pages, etc.
- MajesticSEO
- Raven SEO Tools: Website Analytics, Campaign Reports
Stay tuned for a day 3 and wrap up article!

