This week at Google I/O, Google talked a lot about the evolution of the technological capabilities of the web. HTML 5 is ushering in a new era of browser-based development and applications. Eric Schmidt, Google CEO, kicked things off with, “My message to you is that this is the beginning of the real win of cloud computing, of applications, of the internet, which is changing the paradigm that we’ve all grown up with so that it just works … regardless of platform or hardware you’re using.”
Which is great and all, but if “the web has won”, as Vic Gundotra, VP of Engineering at Google proclaimed, then it’s not just application development that’s moved to the web. The potential consumers of these applications have moved to the web too. And Google, more than any other company, knows that search has become the primary navigation point of the web. We’ve become a searching culture and if we don’t see something in the first 10 search results on Google, we may not realize it exists. (A 2008 PEW/Internet survey found that 49% of online Americans use search engines every day, and a 2008 iProspect/JupiterResearch study found that “68% of search engine users click a search result within the first page of results, and a full 92% of search engine users click a result within the first three pages of search results.”)
Web applications need more than technology to thrive. They also need customers. And more often than not these days, those customers are acquired through search. But how well can this new world of the web that Google is ushering in at Google I/O be crawled, indexed, and ranked by search engines such as Google?
If the search engines’ ability to handle the rich internet applications (RIAs) developers have been creating over the last few years is any indication, not very well. Google in particular has recently made strides in this area, as it benefits them in their quest to organize the world’s information and make it universally accessible and useful. But years after Flash and AJAX hit the market, their searchability still isn’t ideal.
When this year’s Google I/O conference was first announced, I talked with Tom Stocky, Director of Product Management for developer products at Google, and asked about the searchability issues with some of the Google Code APIs and products. He noted that Google I/O included a session about searchability: Search Friendly Development, presented by Maile Ohye. Having heard Maile speak about developer issues before, I have no doubt that her content was top notch, but that doesn’t answer my underlying question about the general searchability of the code Google offers to developers. After all, if Google is encouraging developers to “build a business” with their APIs, they should realize that building a business is about more than application construction — it’s about the ability to acquire customers as well.
For instance, Google’s AJAX APIs are created with, well, AJAX, which is notoriously difficult to index. A core issue with AJAX is that it dynamically changes the content of the page. The URL doesn’t change as the new content loads. Often, the URL is appended with a hash mark (#). Historically, a # in a URL has denoted a named anchor within a page, and thus search engines generally drop everything in a URL beginning with a # so as not to index the same page multiple times.
You can see this implementation on coldfusionbloggers.org, for instance.
Click the Next button, and the URL is appended as follows: http://www.coldfusionbloggers.org/#2
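To make the problem concrete, here is a minimal sketch of what a fragment-ignoring crawler effectively does with URLs like these. The helper name canonicalForCrawler and the #3 URL are my own assumptions for illustration; this is not how any particular search engine is implemented.

```javascript
// Sketch only: canonicalForCrawler is a hypothetical helper, not a real crawler API.
function canonicalForCrawler(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = ""; // drop everything from the "#" onward
  return url.toString();
}

// The "#3" entry is assumed for illustration; only "#2" appears in the example above.
const ajaxStates = [
  "http://www.coldfusionbloggers.org/",
  "http://www.coldfusionbloggers.org/#2",
  "http://www.coldfusionbloggers.org/#3",
];

// Every AJAX state collapses to the same indexable URL.
const distinctUrls = new Set(ajaxStates.map(canonicalForCrawler));
console.log([...distinctUrls]);
```

However many pages of content sit behind the Next button, a crawler that behaves like this has only one URL to attach them to.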
Other AJAX implementations don’t append anything to the URL, but simply dynamically load new content on the page based on clicks. Take a look at an example from the Google Code Playground.
In this example, each of the three tabs (Local, Web, and Blog) contains unique content. The URL of the page doesn’t change as the content reloads. You might expect one of two things to happen:
- search engines associate the content from the tab that appears when the page first loads with the URL
- search engines associate the content from all of the tabs with the URL.
Often, neither happens. You can see this in an implementation showcased on the Google AJAX APIs blog. The sample site uses a tabbed architecture to organize results, but all that shows up in Google search results is “loading”:
A similar thing happens with this page that uses AJAX to provide navigation through photos:
All Google sees is “loading photos, please wait…”
A dynamic menu such as the one described here has related but different issues. Each tab loads content from an external HTML file. Google doesn’t see that content as part of the page. Rather, it sees the content as part of those external files and indexes them separately.
There are ways to get the content indexed, of course.
- You can do away with AJAX in this instance and use CSS and divs instead.
You can tell that Google sees the version of the Yahoo page in which the tabs link to separate pages because this is the behavior found in the Google cache.
- You can implement a technique that Jeremy Keith describes as Hijax: return false from the onClick handler and include a crawlable URL in the href as shown below:
<a href="ajax.htm?foo=123" onClick="navigate('ajax.html#foo=123'); return false">123</a>
Of course, with an implementation like this, similar to the Yahoo home page experience, search engines may index each page individually or as variations of the same page (depending on what the AJAX code is meant to do), rather than as part of a single, comprehensive page. And when visitors enter your site through search, those individual pages will be how they first experience your site.
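If you generate links like this in bulk, the pattern is easy to template. The sketch below is one way it might look, assuming a hypothetical helper called hijaxLink, a single page name for both URLs, and trusted inputs (it does no HTML escaping). The navigate function is the one from the example link above.

```javascript
// Hypothetical helper: builds a Hijax-style link from one piece of state.
// Assumes trusted inputs; a real implementation would escape them for HTML.
function hijaxLink(page, params, label) {
  const query = new URLSearchParams(params).toString();
  const crawlableUrl = `${page}?${query}`; // what a crawler follows
  const ajaxState = `${page}#${query}`;    // what the script navigates to
  return `<a href="${crawlableUrl}" onclick="navigate('${ajaxState}'); return false">${label}</a>`;
}

const hijaxMarkup = hijaxLink("ajax.html", { foo: "123" }, "123");
console.log(hijaxMarkup);
```

The point of deriving both URLs from the same parameters is that the crawlable version and the AJAX version can't drift apart.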
Flash and Flex
Similar issues exist with Adobe technologies such as Flash and Flex. Search engines have historically had trouble crawling Flash. Last year, Adobe made a search crawler version of the Flash player available to Google so it could extract text and links, and more recently launched an SEO knowledge center, but problems remain.
If the Flash application is built with a single URL (rather than changing URLs for each interaction), then the site visitor coming from search will always enter the site at the home page, and have no way of knowing how to get to the content. Not to mention this style of web application is difficult to share. It’s much easier to copy and paste a link to a cute pair of shoes than it is to email a friend instructions like, “go to the home page, then click shoes, then click sandals, then brown, then scroll to the third page and look in the fifth row, second pair over from the left.” Seriously, no matter how good a friend you are to me, I am not following those instructions.
The Flex framework introduces challenges similar to AJAX. It creates new URLs by adding a hash mark (#). As with AJAX, this can make things faster because new pages aren’t loading with each click, but also as with AJAX, search engines drop everything in a URL beginning with the #. Depending on your infrastructure, you might be able to remap or rewrite the URLs, but that sort of defeats the purpose of using a coding framework to begin with.
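For what it's worth, the remapping itself can be small. This is a rough sketch, assuming a made-up convention where the application state lives in the fragment as a path (for example #/shoes/sandals) and the server can answer for the equivalent path-based URL. The function name and the example domain are hypothetical.

```javascript
// Hypothetical mapping: "#/shoes/sandals" becomes the path "/shoes/sandals".
// Assumes the server can serve the same content at the path-based URL.
function fragmentToPath(rawUrl) {
  const url = new URL(rawUrl);
  const state = url.hash.replace(/^#\/?/, ""); // empty when there is no fragment
  if (state === "") {
    return url.toString();
  }
  url.hash = "";
  // Make sure the existing path ends in "/" before appending the state.
  url.pathname = url.pathname.replace(/\/?$/, "/") + state;
  return url.toString();
}

console.log(fragmentToPath("http://shop.example.com/#/shoes/sandals"));
```

The catch is everything around it: something on the server has to serve those rewritten URLs, which is the extra work the framework was supposed to save you.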
Should Google Search and Google Code Have Coffee?
I (more than many people, having previously worked in Google search) understand that Google is a big company. And I know that both the search and code teams are working hard at their core goals. And believe me, I really love the people working on both teams. I’ve had the wonderful opportunity to work very closely with many of them, and I can vouch for the fact that any lack of integration between the two is not for lack of caring. Those teams care very much about putting out quality products that their customers are delighted by.
But as can happen in big companies where everyone is working hard, there seems to be a disconnect. Google APIs and other products for developers should be built search-engine friendly right out of the box not only because they’re being built by Googlers who should therefore know better, but because searchability is vital for a business to be successful on the web today. Code should be secure; it should function properly; it should be crawlable.
At Google I/O’s first keynote session, Vic Gundotra touted the White House’s use of Google Moderator earlier this year as an example of how useful Google tools are to developers. Sure, the AppEngine servers running Google Moderator held up, but none of the discussion could be crawled or indexed by search engines. Maybe this didn’t matter to the White House (although I would think that some American citizens might find it helpful to be able to search through the questions people had). But a small business using Google Moderator as its forum or support framework would almost certainly want that content to be found in Google searches.
Google is working on it
The Google Webmaster Central team has been providing a wealth of education around these issues to help developers build search-friendly web sites. For instance:
At Maile’s Search-Friendly Development session at Google I/O, Google announced two advances in their ability to crawl and index RIAs. While both of these advances are great efforts, they were driven entirely by the search team. And unfortunately, they don’t solve the issues with the Google Code APIs. Wouldn’t it be great if the new Web Elements they just announced were search-engine friendly by default?
Both of these are great improvements, but the advice in my original article remains. Don’t use Flash in situations where you really don’t need it. Make sure that all text and links are built into the Flash file as text and links (and not, for instance, in images), and create a new URL for each interaction.
Googlebot is now able to construct much of the page and can access the onClick event contained in most tags. For now, if the onClick event calls a function that then constructs the URL, Googlebot can only interpret it if the function is part of the page (rather than in an external script).
Some examples of code that Googlebot can now execute include:
<tr onclick="myfunction('index.html')">
<a href="#" onclick="myfunction()">new page</a>
These links pass both anchor text and PageRank.
The end of progressive enhancement?
For instance, for that Star Trek example above, the entire body code is as follows:
<body> <div id="searchcontrol">Loading</div> </body>
One issue worth investigating is how Google handles <noscript> content, as well as content reached through an onClick handler that returns false. If a developer uses the Hijax method described above as a workaround for URLs with hash marks in them, will Google now see only the non-search-friendly version of the URL? I asked Google about these issues and they told me regarding Hijax that “we try to mimic the browser’s behavior, but it’s still possible for us to discover URLs even though the function returns false.”
- Provide alt text that describes images for visitors with screen readers or images turned off in their browsers.
What about paid links?
What about Microsoft adCenter Content Ads? The example below is from an MSN page.
When I asked Google about this, they told me:
Our onclick processing is becoming more widespread, but keep in mind it’s still an area where we’re constantly improving. We already detect many ads generated by onclick events.
To prevent PR [PageRank] flow, it remains a good practice to do things like have the onclick-generated links in an area that’s blocked from robots, or to use a url redirector that’s robots.txt disallow’d. Penalties for spam techniques have been and will continue to be enforced, but as you know, we work extremely hard to minimize false positives.
Webmaster Tools Message Center already sends emails to developers to inform them when we believe that they are inadvertently violating our guidelines. Whether it’s through our blog or our tools, we’ll continue to find ways to communicate with webmasters, especially as we further innovate in our crawling capability. Processing onclicks is one step of many! :)
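The redirector approach Google mentions might look something like the sketch below. The /out/ path, the to parameter, and both function names are assumptions of mine, and the matching is deliberately simplified: plain prefix checks only, with no wildcards, Allow rules, or per-crawler groups.

```javascript
// Hypothetical robots.txt for a site that routes ad clicks through /out/.
const robotsTxt = [
  "User-agent: *",
  "Disallow: /out/",
].join("\n");

// Simplified check: prefix matching only; ignores wildcards, Allow rules,
// and separate User-agent groups.
function isDisallowed(robots, path) {
  return robots
    .split("\n")
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .some((prefix) => prefix !== "" && path.startsWith(prefix));
}

// The ad link points at the blocked redirector rather than at the advertiser.
function redirectorLink(target) {
  return "/out/?to=" + encodeURIComponent(target);
}

const adLink = redirectorLink("http://advertiser.example.com/");
console.log(adLink, isDisallowed(robotsTxt, adLink));
```

Because the onclick-generated link resolves to a URL under the disallowed path, a crawler that honors robots.txt has nothing to follow.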
I entirely understand their answer, even if I might not entirely like it. They want to crawl and index more of the web and they have to keep evolving to do that. Their aim is to only penalize those sites that intentionally violate their guidelines, but they’re not going to give away the secret sauce of how they detect that intention.
But the truth is that most people who have websites haven’t heard of the Google Webmaster Tools Message Center or Google Webmaster Central blog (or Search Engine Land!). The web contains substantially more site owners than the 52,000 who are subscribed to Google’s webmaster blog. Most site owners don’t know what SEO means. One could argue that anyone who has a web site should know about these things, but most small business owners don’t really know how to set up accounts payable either, yet they’re doing their accounting themselves because they can’t afford to get expert help. And while it’s not Google’s responsibility to ensure business owners know how to run their businesses properly, it is in Google’s best interest to index all of the web. And if they change the rules and inadvertently throw innocent businesses out of their search results, they’re not reaching that goal.
This isn’t a new issue
As technology on the web continues to advance, web developers will continue to confront these issues. New infrastructure and platforms will move faster than search engines, which, after all, were originally built on the concept of HTML-powered, text-based web pages. So developers will have to create workarounds and then dismantle those workarounds as search engines catch up.
This happened, for instance, with dynamic URLs. Originally, search engines had trouble with URLs that contained characters such as question marks (?) and ampersands (&). In fact, Google advised in its guidelines to avoid using &id= as a parameter until mid-2006.
To get around this, some sites encoded their URLs to appear static. This Sitepoint article on dynamic URLs in 2002 explained:
For example, the following URL contains both “?” and “&,” making it non-indexable: http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=3&txtCodeId=769
Below, it has been made search engine-friendly (all “?” and “&” and “=” characters replaced with alternate characters): http://www.planet-source-code.com/xq/ASP/txtCodeId.769/lngWId.3/qx/vb/scripts/ShowCode.htm
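The general idea behind that kind of rewrite can be sketched in a few lines. Note that this is not the scheme used in the Sitepoint example; it is a simpler, made-up convention that turns each parameter into a key.value path segment, and the function name is hypothetical.

```javascript
// Hypothetical scheme: each "key=value" parameter becomes a "key.value" path segment.
// This is NOT the exact format shown in the Sitepoint example above.
function toStaticLooking(dynamicUrl) {
  const url = new URL(dynamicUrl);
  const segments = [...url.searchParams].map(([key, value]) => `${key}.${value}`);
  url.search = ""; // remove the "?" and everything after it
  url.pathname = [url.pathname, ...segments].join("/");
  return url.toString();
}

const staticLooking = toStaticLooking(
  "http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=3&txtCodeId=769"
);
console.log(staticLooking);
```

Whatever convention you pick, the server also has to translate the static-looking URL back into the original parameters, which is exactly the kind of workaround that later has to be dismantled.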
Dennis Goedegebuure of eBay explains in his blog that eBay employed this technique:
In 2004 search engines were not smart enough to read dynamic URL’s. Especially those URL’s that had a lot of parameters in them to determine sort order or aspects of the product search for shopping sites were a problem to get these indexed. Replacing the dynamic parameters like & or ? with static delimiters was one technique back in the days to make a dynamic URL static for the search engines to crawl.
Now fast forward to 2009, Search Engines have become much smarter and are now able to understand dynamic URL with parameters much better. Last week they even announced their new canonical tag to help website owners to avoid duplicate content issues when it comes to sort order.
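As a rough illustration of how the canonical tag addresses the sort order problem, the sketch below strips presentation-only parameters before emitting the tag. The function name, the example shop URL, and the choice of sort as the parameter to ignore are all assumptions for the sake of the example.

```javascript
// Hypothetical helper: emits a canonical link tag that ignores chosen parameters.
function canonicalLink(pageUrl, ignoredParams) {
  const url = new URL(pageUrl);
  for (const name of ignoredParams) {
    url.searchParams.delete(name); // e.g. sort order doesn't change the content
  }
  url.hash = "";
  return `<link rel="canonical" href="${url.toString()}">`;
}

const canonicalTag = canonicalLink(
  "http://shop.example.com/shoes?color=brown&sort=price",
  ["sort"]
);
console.log(canonicalTag);
```

Every sort order of the same product listing then points search engines at one preferred URL.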
In fact, Google is now so good at interpreting dynamic URLs that use traditional patterns that Maile Ohye used eBay as an example of what not to do in a presentation at SMX West:
[At] SMX West a Google rep presented on URL structure. One part of her presentation was about MAVRICK URL’s, and in particular the long and complicated url’s you sometimes see on the Interwebs. i.e. used in her presentation:
Things on the web have evolved so much that an implementation built entirely to ensure that pages could be crawled by search engines is now being used as an example of what not to do if you want pages to be crawled by search engines!
Continuing the conversation
Clearly, Google and the other major search engines want to crawl and index all of the content on the web. That is, after all, why they continue to evolve their crawlers to adapt to changing technology. And clearly, they want to help site owners and web developers build sites that can be easily found in search. And while it may seem like I’m singling Google out, I’m focusing on them now only because of their messaging of building a business with Google APIs at Google I/O. In truth, Yahoo and Microsoft are also aggressively courting developers and their developer groups seem to be just as disconnected from their search teams. Microsoft’s IIS versions 5 and 6, for instance, are configured to implement redirects as 302s by default, when for search engine value, redirects should be 301s. (Newer versions provide more search-friendly configurations.)
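For developers who issue redirects from their own code rather than relying on server defaults, the difference is a single status code. Here is a minimal sketch, assuming a response object shaped like the one Node's http module provides (writeHead and end); the stand-in object and the example URL are mine, there only so the snippet runs without a server.

```javascript
// Sends a permanent (301) redirect rather than a temporary (302) one.
// Expects a response object with writeHead(status, headers) and end().
function redirectPermanently(res, location) {
  res.writeHead(301, { Location: location });
  res.end();
}

// Stand-in for a real response object, so the sketch runs without a server.
const recorded = {};
const fakeResponse = {
  writeHead(status, headers) {
    recorded.status = status;
    recorded.headers = headers;
  },
  end() {
    recorded.ended = true;
  },
};

redirectPermanently(fakeResponse, "http://www.example.com/new-page");
console.log(recorded.status, recorded.headers.Location);
```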
Two upcoming events will provide opportunities for us to continue the discussion about search-engine friendly web development with Google, Microsoft, Yahoo, and Adobe. Next week at SMX Advanced in Seattle, we’ve got a whole set of things for developers including lunch discussion tables and an after-hours Q&A with Adobe, and Matt Cutts will be doing a Q&A session where we can ask him all about the ripple effects of these advances. The following week in San Francisco, Jane and Robot is holding a search developer summit. Again, all the reps will be on hand for some in-depth technical discussion. We’ll also have beer, which we all just might need.