Google I/O: New Advances In The Searchability of JavaScript and Flash, But Is It Enough?



This week at Google I/O, Google talked a lot about the evolution of the technological capabilities of the web. HTML 5 is ushering in a new era of browser-based development and applications. Eric Schmidt, Google CEO, kicked things off with, “My message to you is that this is the beginning of the real win of cloud computing, of applications, of the internet, which is changing the paradigm that we’ve all grown up with so that it just works … regardless of platform or hardware you’re using.”

Which is great and all, but if “the web has won”, as Vic Gundotra, VP of Engineering at Google proclaimed, then it’s not just application development that’s moved to the web. The potential consumers of these applications have moved to the web too. And Google, more than any other company, knows that search has become the primary navigation point of the web. We’ve become a searching culture, and if we don’t see something in the first 10 search results on Google, we may not realize it exists. (A 2008 Pew Internet & American Life Project survey found that 49% of online Americans use search engines every day, and a 2008 iProspect/JupiterResearch study found that “68% of search engine users click a search result within the first page of results, and a full 92% of search engine users click a result within the first three pages of search results.”)

Web applications need more than technology to thrive. They also need customers. And more often than not these days, those customers are acquired through search. But how well can this new world of the web that Google is ushering in at Google I/O be crawled, indexed, and ranked by search engines such as Google?

If the search engines’ ability to handle the rich internet applications (RIAs) that developers have been creating over the last few years is any indication, not very well. Google in particular has recently made strides in this area, as it benefits them in their quest to organize the world’s information and make it universally accessible and useful. But years after Flash and AJAX hit the market, their searchability still isn’t ideal.

When this year’s Google I/O conference was first announced, I talked with Tom Stocky, Director of Product Management for developer products at Google, and asked about the searchability issues with some of the Google Code APIs and products. He noted that Google I/O included a session about searchability: Search Friendly Development, presented by Maile Ohye. Having heard Maile speak about developer issues before, I have no doubt that her content was top notch, but that doesn’t answer my underlying question about the general searchability of the code Google offers to developers. After all, if Google is encouraging developers to “build a business” with their APIs, they should realize that building a business is about more than application construction — it’s about the ability to acquire customers as well.

JavaScript and AJAX

For instance, Google’s AJAX APIs are created with, well, AJAX, which is notoriously difficult to index. A core issue with AJAX is that it dynamically changes the content of the page. The URL doesn’t change as the new content loads. Often, the URL is appended with a hash mark (#). Historically, a # in a URL has denoted a named anchor within a page, and thus search engines generally drop everything in a URL beginning with a # so as not to index the same page multiple times.

You can see this implementation on coldfusionbloggers.org, for instance.

Click the Next button, and the URL is appended as follows: https://www.coldfusionbloggers.org/#2
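To make that concrete, here is a minimal sketch of what this kind of hash-based pagination often looks like under the hood (the loadPage() function and the /entries?page= URL are hypothetical, not coldfusionbloggers.org’s actual code). The fragment changes in the browser, but it is never sent to the server, and a crawler that drops everything after the # only ever sees the base URL:

  <a href="#2" onclick="loadPage(2); return false;">Next</a>
  <div id="entries">First page of entries for visitors and crawlers without JavaScript.</div>

  <script type="text/javascript">
  function loadPage(pageNum) {
    // Update the fragment: the URL becomes something like https://www.example.com/#2
    window.location.hash = pageNum;
    // Fetch the new entries and swap them into the page without a reload
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/entries?page=" + pageNum, true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState == 4 && xhr.status == 200) {
        document.getElementById("entries").innerHTML = xhr.responseText;
      }
    };
    xhr.send(null);
  }
  </script>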

Other AJAX implementations don’t append anything to the URL, but simply dynamically load new content on the page based on clicks. Take a look at an example from the Google Code Playground.

Google Code Playground

In this example, each of the three tabs (Local, Web, and Blog) contains unique content. The URL of the page doesn’t change as the content reloads. You might expect one of two things to happen:

  • search engines associate the content from the tab that appears when the page first loads with the URL
  • search engines associate the content from all of the tabs with the URL.

Either of these scenarios could happen, depending on how the code is implemented. What actually happens in this case is less desirable than either of those options. Because the entire tab architecture is loaded in JavaScript, search engines can’t access the content at all.
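A stripped-down sketch of that kind of implementation might look like the following (the markup and the showTab() function are invented for illustration, not the Playground’s actual code). Every bit of tab content lives inside the script, so a crawler that doesn’t execute JavaScript finds an essentially empty page:

  <body onload="showTab('local')">
    <div id="tabs">
      <a href="#" onclick="showTab('local'); return false;">Local</a>
      <a href="#" onclick="showTab('web'); return false;">Web</a>
      <a href="#" onclick="showTab('blog'); return false;">Blog</a>
    </div>
    <div id="content"></div>

    <script type="text/javascript">
    // All of the "page content" exists only as JavaScript strings.
    var tabContent = {
      local: "<p>Local results go here...</p>",
      web:   "<p>Web results go here...</p>",
      blog:  "<p>Blog results go here...</p>"
    };
    function showTab(name) {
      document.getElementById("content").innerHTML = tabContent[name];
    }
    </script>
  </body>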

You can see this in action for an implementation showcased on the Google AJAX APIs blog. The sample site uses a tabbed architecture to organize results, but all that shows up in Google search results is “loading”:

Loading Sample

A similar thing happens with this page that uses AJAX to provide navigation through photos:

All Google sees is “loading photos, please wait…”

loading

A dynamic menu such as the one described here has related but different issues. Each tab loads content from an external HTML file. Google doesn’t see that content as part of the page; rather, it sees the content as part of those external files and indexes them separately.

There are ways to get the content indexed, of course.

  • You can do away with AJAX in this instance and use CSS and divs instead.
  • You can use the approach taken on the Yahoo home page. The Featured, Entertainment, Sports, and Life tabs load inline when you click on them, but if JavaScript isn’t enabled, the tabs are shown as links to separate pages. (For instance, the entertainment tab links to https://entertainment.yahoo.com/.)

Yahoo Home Page

You can tell that Google sees the version of the Yahoo page in which the tabs link to separate pages because this is the behavior found in the Google cache.

  • You can implement a technique that Jeremy Keith describes as Hijax: return false from the onClick handler and include a crawlable URL in the href as shown below:
<a href="ajax.html?foo=123" onClick="navigate('ajax.html#foo=123'); return false">123</a>

Of course, with an implementation like this, similar to the Yahoo home page experience, search engines may index each page individually or as variations of the same page (depending on what the AJAX code is meant to do), rather than as part of a single, comprehensive page. And when visitors enter your site through search, those individual pages will be how they first experience your site.
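Pulling the Hijax approach together, a bare-bones version might look something like this (the section.html URLs, the loadSection() function, and the content div are hypothetical). Each link points to a real, crawlable page, and JavaScript, when available, intercepts the click and loads the content inline instead:

  <ul id="nav">
    <li><a href="section.html?id=local" onclick="return loadSection('local');">Local</a></li>
    <li><a href="section.html?id=web" onclick="return loadSection('web');">Web</a></li>
  </ul>
  <div id="content">Default content for visitors and crawlers without JavaScript.</div>

  <script type="text/javascript">
  function loadSection(id) {
    // In practice the server might return just the content fragment for XHR requests
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "section.html?id=" + id, true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState == 4 && xhr.status == 200) {
        document.getElementById("content").innerHTML = xhr.responseText;
      }
    };
    xhr.send(null);
    return false;  // returning false keeps the browser from following the href
  }
  </script>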

Cloaking?

None of these JavaScript/AJAX workarounds are considered cloaking, which can get a site banned from Google, because they don’t present content based on user agent (such as Googlebot). Rather, they present “degraded” content to any user agent without JavaScript support (including screen readers, some mobile devices, and older browsers) and “enhanced” content to any user agent with JavaScript support.

Flash and Flex

Similar issues exist with Adobe technologies such as Flash and Flex. Search engines have historically had trouble crawling Flash. Last year, Adobe made a search crawler version of the Flash player available to Google so it could extract text and links, and more recently launched an SEO knowledge center, but problems remain.

If the Flash application is built with a single URL (rather than changing URLs for each interaction), then the site visitor coming from search will always enter the site at the home page, and have no way of knowing how to get to the content. Not to mention this style of web application is difficult to share. It’s much easier to copy and paste a link to a cute pair of shoes than it is to email a friend instructions like, “go to the home page, then click shoes, then click sandals, then brown, then scroll to the third page and look in the fifth row, second pair over from the left.” Seriously, no matter how good a friend you are to me, I am not following those instructions.

The Flex framework introduces challenges similar to AJAX. It creates new URLs by adding a hash mark (#). As with AJAX, this can make things faster because new pages aren’t loading with each click, but also as with AJAX, search engines drop everything in a URL beginning with the #. Depending on your infrastructure, you might be able to remap or rewrite the URLs, but that sort of defeats the purpose of using a coding framework to begin with.

Should Google Search and Google Code Have Coffee?

I (more than many people, having previously worked in Google search) understand that Google is a big company. And I know that both the search and code teams are working hard at their core goals. And believe me, I really love the people working on both teams. I’ve had the wonderful opportunity to work very closely with many of them, and I can vouch for the fact that any lack of integration between the two is not for lack of caring. Those teams care very much about putting out quality products that their customers are delighted by.

But as can happen in big companies where everyone is working hard, there seems to be a disconnect. Google APIs and other products for developers should be built search-engine friendly right out of the box, not only because they’re being built by Googlers who should therefore know better, but because searchability is vital for a business to be successful on the web today. Code should be secure; it should function properly; it should be crawlable.

At Google I/O’s first keynote session, Vic Gundotra touted the White House’s use of Google Moderator earlier this year as an example of how useful Google tools are to developers. Sure, the App Engine servers running Google Moderator held up, but none of the discussion could be crawled or indexed by search engines. Maybe this didn’t matter to the White House (although I would think that some American citizens might find it helpful to be able to search through the questions people had). But a small business using Google Moderator as its forum or support framework would almost certainly want that content to be found in Google searches.

Google is working on it

The Google Webmaster Central team has been providing a wealth of education around these issues to help developers build search-friendly web sites.

At Maile’s Search-Friendly Development session at Google I/O, Google announced two advances in their ability to crawl and index RIAs. While both of these advances are great efforts, they were driven entirely by the search team. And unfortunately, they don’t solve the issues with the Google Code APIs. Wouldn’t it be great if the new Web Elements they just announced were search-engine friendly by default?

Flash improvements

Google made improvements in crawling content in Flash files in July of 2008, but wasn’t able to extract content from external files (such as resource files). In addition, many sites loaded Flash via a JavaScript link on the home page, and since Google didn’t follow the JavaScript link, they couldn’t get to the Flash at all. Both of those things have changed as of this week. Google can now access content in external files and can follow JavaScript links to Flash applications.

Both of these are great improvements, but the advice in my original article remains. Don’t use Flash in situations where you really don’t need it. Make sure that all text and links are built into the Flash file as text and links (and not, for instance, in images), and create a new URL for each interaction.

JavaScript improvements

Google has also been crawling some JavaScript for a while. Primarily, they’ve been extracting very simply coded links. As of today, they’re able to execute JavaScript onClick events. They still recommend using progressive enhancement techniques, however, rather than relying on Googlebot’s ability to extract links from JavaScript (not just for search engine purposes, but for accessibility reasons as well).

Googlebot is now able to construct much of the page and can access the onClick event contained in most tags. For now, if the onClick event calls a function that then constructs the URL, Googlebot can only interpret it if the function is part of the page (rather than in an external script).

Some examples of code that Googlebot can now execute include:

  • <div onclick="document.location.href='https://foo.com/'">
  • <tr onclick="myfunction('index.html')"><a href="#" onclick="myfunction()">new page</a>
  • <a href="javascript:void(0)" onclick="window.open('welcome.html')">open new window</a>

These links pass both anchor text and PageRank.
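To illustrate that on-page limitation (the buildUrl() and goTo() functions, the URLs, and external.js are all invented for this example), the first link below builds its URL with a function defined in the page itself, which Googlebot can now work out; the second depends entirely on an external script, which it can’t yet:

  <!-- The URL-building function is part of the page, so the onclick can be resolved. -->
  <script type="text/javascript">
  function buildUrl(section) {
    document.location.href = "/products/" + section + ".html";
  }
  </script>
  <a href="#" onclick="buildUrl('shoes'); return false;">shoes</a>

  <!-- goTo() lives only in external.js, so the destination can't be worked out from the page. -->
  <script type="text/javascript" src="external.js"></script>
  <a href="#" onclick="goTo('shoes'); return false;">shoes</a>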

The end of progressive enhancement?

Does this mean that web developers no longer need to progressively enhance their JavaScript? I would recommend continuing this practice. Not only does it benefit accessibility and mobile users, but search engines other than Google aren’t yet crawling JavaScript. And in any case, we will have to watch and see what happens in the Google search results before we know exactly how JavaScript content is handled.

For instance, for that Star Trek sample above (the one whose search result snippet reads only “loading”), the entire body code is as follows:

<body>
  <div id="searchcontrol">Loading</div>
</body>

The JavaScript function in the head section of the page loads the Google AJAX API code externally, and even with these latest improvements, Googlebot is still only able to interpret code on the page.
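For reference, the head of a page like that is typically wired up roughly the way the AJAX Search API documentation showed (this is a generic sketch, not that page’s actual code); everything that produces visible content arrives in the externally loaded API:

  <head>
    <script type="text/javascript" src="https://www.google.com/jsapi"></script>
    <script type="text/javascript">
    google.load("search", "1");
    google.setOnLoadCallback(function () {
      // The control draws its tabs and results into the "searchcontrol" div at runtime
      var searchControl = new google.search.SearchControl();
      searchControl.addSearcher(new google.search.WebSearch());
      searchControl.draw(document.getElementById("searchcontrol"));
    });
    </script>
  </head>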

What about all those workarounds to accommodate Googlebot’s previous lack of support for JavaScript?

One issue that is worth investigating is how Google handles <noscript> content, and what happens when the onClick handler returns false. If a developer uses the Hijax method described above as a workaround for URLs with hash marks in them, will Google now see only the non-search-friendly version of the URL? I asked Google about these issues and they told me regarding Hijax that “we try to mimic the browser’s behavior, but it’s still possible for us to discover URLs even though the function returns false.”

As for <noscript>, Google told me that text outside of noscript is best, but they do process information inside of <noscript>. That statement is worth digging into a bit more [and I’ll update this post when I have more information]. Does Google prefer text outside of <noscript> because they can now easily crawl all the text inside of JavaScript? Or is it a technique that could be perceived as potentially manipulative? And what does “process information” mean exactly? [Updated with additional information from Google: as noted below in their help information, <noscript> is a technique that can be useful for graceful degradation, but HTML text is the best option, if possible. And as noted below, the <noscript> text should match what displays in JavaScript-enabled browsers.]

Conventional wisdom has generally been that using <noscript> to provide graceful degradation for those without JavaScript support was fine by the search engines as long as the text in the <noscript> matched the text in the JavaScript exactly. Google itself recommends it in their guidelines. Although, to be fair, there’s always been a bit of concern about this method. In Google’s words (emphasis mine):

If your site contains elements that aren’t crawlable by search engines (such as rich media files other than Flash, JavaScript, or images), you shouldn’t provide cloaked content to search engines. Rather, you should consider visitors to your site who are unable to view these elements as well. For instance:

  • Provide alt text that describes images for visitors with screen readers or images turned off in their browsers.
  • Provide the textual contents of JavaScript in a noscript tag.

Ensure that you provide the same content in both elements (for instance, provide the same text in the JavaScript as in the noscript tag). Including substantially different content in the alternate element may cause Google to take action on the site.

Sneaky JavaScript redirects

When Googlebot indexes a page containing JavaScript, it will index that page but it cannot follow or index any links hidden in the JavaScript itself. Use of JavaScript is an entirely legitimate web practice. However, use of JavaScript with the intent to deceive search engines is not. For instance, placing different text in JavaScript than in a noscript tag violates our webmaster guidelines because it displays different content for users (who see the JavaScript-based text) than for search engines (which see the noscript-based text). Along those lines, it violates the webmaster guidelines to embed a link in JavaScript that redirects the user to a different page with the intent to show the user a different page than the search engine sees. When a redirect link is embedded in JavaScript, the search engine indexes the original page rather than following the link, whereas users are taken to the redirect target. Like cloaking, this practice is deceptive because it displays different content to users and to Googlebot, and can take a visitor somewhere other than where they intended to go.

Note that placement of links within JavaScript is alone not deceptive. When examining JavaScript on your site to ensure your site adheres to our guidelines, consider the intent.

Keep in mind that since search engines generally can’t access the contents of JavaScript, legitimate links within JavaScript will likely be inaccessible to them (as well as to visitors without Javascript-enabled browsers). You might instead keep links outside of JavaScript or replicate them in a noscript tag.
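As a quick illustration of the “same content in both” advice quoted above (the snippet and its text are my own invented example, not Google’s), the <noscript> fallback simply mirrors what the script writes:

  <script type="text/javascript">
  document.write("<p>Our store is open Monday through Friday, 9am to 5pm.</p>");
  </script>
  <noscript>
    <p>Our store is open Monday through Friday, 9am to 5pm.</p>
  </noscript>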

What about paid links?

On the one hand, this is great news for the web. Google can access more content, which means searchers now have easier access as well. One potential issue is that historically, JavaScript was one of the common methods used for coding a link that was paid (an advertising link rather than an editorial one). Google has been recommending for some time that site owners either add a nofollow attribute to those links or use a URL redirector and block the redirect via robots.txt, but they haven’t explicitly said that the JavaScript method was no longer allowed.
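For those two recommendations, a hypothetical sketch might look like this (the advertiser URL and the /adclick/ redirector path are invented): either the link carries a nofollow attribute directly, or it runs through a redirector whose path is disallowed in robots.txt:

  <!-- Option 1: mark the paid link with rel="nofollow". -->
  <a href="https://advertiser.example.com/" rel="nofollow">Advertiser name</a>

  <!-- Option 2: route the paid link through a redirector, and block that path in
       robots.txt with a line like "Disallow: /adclick/". -->
  <a href="/adclick/?dest=https%3A%2F%2Fadvertiser.example.com%2F">Advertiser name</a>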

As late as 2007, Google’s Matt Cutts was espousing JavaScript as a valid option for indicating links that shouldn’t pass PageRank because they were paid for. It’s likely that many sites around the web use this method for “machine-readable” disclosure that the links are advertising. In fact, one of the advertising platforms that Matt showcased as coding links properly is Quigo (now owned by AOL), which appears to still enclose its ad links in JavaScript without adding the extra step of blocking the URL redirect with robots.txt. Look, for instance, at the Sports Illustrated home page:

Quigo Ads

You can see that the sponsored links are implemented in JavaScript (and when onClick returns false, the link is disabled). The links are redirected through Quigo’s ad server: redir.adsonar.com, but this subdomain isn’t blocked with robots.txt.

What about Microsoft adCenter Content Ads? The example below is from an MSN page.

Microsoft adCenter Content Ads

The ad links are run through a JavaScript redirect (a variation of r.msn.com, in this case, 0.r.msn.com). But this subdomain has no robots.txt. Is MSN in danger of violating the Google webmaster guidelines? Will Microsoft adCenter be penalized for selling links? Surely not. Surely Google is working out a way to crawl more of the web, while not inadvertently penalizing large portions of it. (Although this may not be an issue in any case. Both of the examples above use a 302 redirect, which likely satisfies Google’s guidelines about coding paid links in such a way that they don’t pass PageRank.)

When I asked Google about this, they told me:

Our onclick processing is becoming more widespread, but keep in mind it’s still an area where we’re constantly improving. We already detect many ads generated by onclick events.

To prevent PR [PageRank] flow, it remains a good practice to do things like have the onclick-generated links in an area that’s blocked from robots, or to use a url redirector that’s robots.txt disallow’d. Penalties for spam techniques have been and will continue to be enforced, but as you know, we work extremely hard to minimize false positives.

Webmaster Tools Message Center already sends emails to developers to inform them when we believe that they are inadvertently violating our guidelines. Whether it’s through our blog or our tools, we’ll continue to find ways to communicate with webmasters, especially as we further innovate in our crawling capability. Processing onclicks is one step of many! :)

I entirely understand their answer, even if I might not entirely like it. They want to crawl and index more of the web and they have to keep evolving to do that. Their aim is to only penalize those sites that intentionally violate their guidelines, but they’re not going to give away the secret sauce of how they detect that intention.

But the truth is that most people who have websites haven’t heard of the Google Webmaster Tools Message Center or the Google Webmaster Central blog (or Search Engine Land!). The web contains substantially more site owners than the 52,000 who are subscribed to Google’s webmaster blog. Most site owners don’t know what SEO means. One could argue that anyone who has a web site should know about these things, but most small business owners don’t really know how to set up accounts payable either, and they do their own accounting anyway because they can’t afford expert help. And while it’s not Google’s responsibility to ensure business owners know how to run their businesses properly, it is in Google’s best interest to index all of the web. And if they change the rules and inadvertently throw innocent businesses out of their search results, they’re not reaching that goal.

Since Google itself previously recommended JavaScript as a way to block paid links, it seems a bit much for them to now expect the entire web to modify their sites. Google should take it upon themselves to sort things out. And to be fair, they’re saying that they will. But I imagine some webmasters will be nervous about leaving things up to Google, when the cost of Google getting it wrong is being removed from the index.

This isn’t a new issue

As technology on the web continues to advance, web developers will continue to confront these issues. New infrastructure and platforms will move faster than search engines, which, after all, were originally built on the concept of HTML-powered, text-based web pages. So developers will have to create workarounds and then dismantle those workarounds as search engines catch up.

This happened, for instance, with dynamic URLs. Originally, search engines had trouble with URLs that contained characters such as question marks (?) and ampersands (&). In fact, Google advised in its guidelines to avoid using the &id= parameter until mid-2006.

To get around this, some sites encoded their URLs to appear static. This Sitepoint article on dynamic URLs in 2002 explained:

For example, the following URL contains both “?” and “&,” making it non-indexable:

https://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=3&txtCodeId=769

Below, it has been made search engine-friendly (all “?” and “&” and “=” characters replaced with alternate characters):

https://www.planet-source-code.com/xq/ASP/txtCodeId.769/lngWId.3/qx/vb/scripts/ShowCode.htm

Dennis Goedegebuure of eBay explains in his blog that eBay employed this technique:

In 2004 search engines were not smart enough to read dynamic URL’s. Especially those URL’s that had a lot of parameters in them to determine sort order or aspects of the product search for shopping sites were a problem to get these indexed. Replacing the dynamic parameters like & or ? with static delimiters was one technique back in the days to make a dynamic URL static for the search engines to crawl.

Now fast forward to 2009, Search Engines have become much smarter and are now able to understand dynamic URL with parameters much better. Last week they even announced their new canonical tag to help website owners to avoid duplicate content issues when it comes to sort order.

In fact, Google is now so good at interpreting dynamic URLs that use traditional patterns that Maile Ohye used eBay as an example of what not to do in a presentation at SMX West:

[At] SMX West a Google rep presented on URL structure. One part of her presentation was about MAVRICK URL’s, and in particular the long and complicated url’s you sometimes see on the Interwebs. i.e. used in her presentation:

https://shop.ebay.com/items/_W0QQ_nkwZipodQQ_armrsZ1QQ_fromZR40QQ_mdoZ

Things on the web have evolved so much that an implementation built entirely to ensure that pages could be crawled by search engines is now being used as an example of what not to do if you want pages to be crawled by search engines!

Continuing the conversation

Clearly, Google and the other major search engines want to crawl and index all of the content on the web. That is, after all, why they continue to evolve their crawlers to adapt to changing technology. And clearly, they want to help site owners and web developers build sites that can be easily found in search. And while it may seem like I’m singling Google out, I’m focusing on them now only because of their messaging at Google I/O about building a business with Google APIs. In truth, Yahoo and Microsoft are also aggressively courting developers, and their developer groups seem to be just as disconnected from their search teams. Microsoft’s IIS versions 5 and 6, for instance, are configured to implement redirects as 302s by default, when, for search engine value, redirects should be 301s. (Newer versions provide more search-friendly configurations.)

Two upcoming events will provide opportunities for us to continue the discussion about search-engine friendly web development with Google, Microsoft, Yahoo, and Adobe. Next week at SMX Advanced in Seattle, we’ve got a whole set of things for developers including lunch discussion tables and an after-hours Q&A with Adobe, and Matt Cutts will be doing a Q&A session where we can ask him all about the ripple effects of these advances. The following week in San Francisco, Jane and Robot is holding a search developer summit. Again, all the reps will be on hand for some in-depth technical discussion. We’ll also have beer, which we all just might need.


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.


About the author

Vanessa Fox
Contributor
Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.
