Google Now Crawling And Indexing Flash Content
Historically, search engines have been unable to extract content, such as text and links, from Flash (SWF) files. Consequently, much of the Flash-based content on the web has been unavailable in search results. This situation has been frustrating for web developers, who have had to devise workarounds to get search engines to index and rank their Flash pages.
This situation hasn’t been ideal for searchers either, as potentially great matches for their queries have been locked away in Flash files.
According to Adobe and Google, all of that is changing. Google is launching what they tell me is a “deep algorithmic change,” augmented by Flash reader technology supplied by Adobe, that enables them to “read” Flash files and extract text and links from them for better indexing and ranking. This could be great news for both site owners and searchers.
Below, more details about how it all works, as well as some caveats for those who see this development as a Flash panacea and think they no longer have to ensure their Flash applications are search engine friendly.
Google can now crawl and index Flash files
Google has been working on improving how they crawl and index rich content (such as Flash and JavaScript) for some time, and in fact have been able to extract some text and links from Flash files for a while. However, their methods weren’t perfect, and they tell me that this new technology from Adobe makes Google’s algorithms “less error prone” and enables them to access content created in any version of Flash in a variety of languages.
Adobe is providing a Flash player to (some) search engines
Adobe says they have developed an optimized Flash player for search engines and are collaborating with both Google and Yahoo!. Yahoo! has not yet implemented the technology, although they stated that “Yahoo! is committed to supporting webmaster needs with plans to support searchable SWF and is working with Adobe to determine the best possible implementation.” Adobe hasn’t made the technology available to Microsoft’s Live Search, although they say they are “exploring ways to make the technology more broadly available” to “help make all SWF content more easily searchable.” Adobe didn’t comment on whether Microsoft’s development of the competing Silverlight platform was a factor in the decision not to collaborate with Microsoft Live Search for this initial announcement.
A big step forward
Previously, Google’s help documentation has warned against the use of Flash-only sites:
“In general, search engines are text based. This means that in order to be crawled and indexed, your content needs to be in text format. This doesn’t mean that you can’t include images, Flash files, videos, and other rich media content on your site; it just means that any content you embed in these files should also be available in text format or it won’t be accessible to search engines.”
They have suggested using Flash sparingly or using a method such as Scalable Inman Flash Replacement (sIFR) to provide an HTML source that can be rendered as either Flash or non-Flash.
That seems to have changed. The help documentation hasn’t been updated, but the post on the Google Webmaster blog says that Googlebot can now extract textual content and links from Flash files, so Google can better crawl, index, and rank the sites that use them.
Both Google and Adobe stressed to me that this is a big win for both site owners and searchers and that it should improve relevancy in search results. They noted that Flash developers don’t have to do anything in their applications to make this new technology work for their sites.
This is certainly great news for the web, as it’s a sign that search engines, which are the primary method of navigating the web, are evolving beyond text to take into account newer web technologies.
Just how much will this change impact search relevance? It’s hard to say until we see the changes, which Google says may take time to percolate through the pipeline. In particular, they note that snippets, the descriptions that display under search results, will be improved. Before, Google often couldn’t extract any content from a Flash file, so the description for a Flash page was often empty or would consist of the only text available from the file, such as the Flash version or the word “loading.”
But although Adobe’s press release talks about “dramatic” improvements in search results and more relevant listings for “millions of RIAs” (rich internet applications), neither Adobe nor Google could give me numbers about how many more pages Google was now able to crawl and index or how much this has impacted search results.
A quick look at how SWF files are currently indexed shows that there’s a lot of room for improvement, so this may indeed be big news for search.
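One quick way to take that look yourself is to restrict a Google query to Flash files with the `filetype:` operator (the queries below are illustrative examples, not from the announcement):

```
filetype:swf loading            ← SWF pages whose only extracted text was "loading"
filetype:swf site:example.com   ← SWF files Google has indexed from a given site
```

At the time of writing, the first query surfaces many results with empty or near-empty snippets, which is exactly the problem this change is meant to address.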
Flash developers should still spend time on Search Engine Optimization
However, this isn’t the perfect solution that it may seem. Adobe assures developers that “RIA developers and rich Web content producers won’t need to amend existing and future content to make it searchable — they can be confident it can now be found by users around the globe.” But that’s not entirely true, particularly for Flash pages that have little textual content.
Only text and links are affected
As Danny Sullivan noted last year when word of Google’s work in this area first came up, most Flash content isn’t made up of primarily words. It’s made up of images, video, and animation, and none of that will be surfaced in search results with this advancement. Google’s new Flash algorithms extract text and links only. Everything else is still a black box.
Associate a unique URL with each unique piece of content
In addition, the searcher experience is better served by Flash implementations that provide a unique URL for each set of content. Some Flash implementations dynamically load text as the user interacts with the application, but the URL remains the same. In this scenario, Googlebot can now follow those interactions (in a limited way) and if the URL doesn’t change, then all content that is dynamically loaded as the interactions progress is associated with a single URL.
Adobe says the Flash player it is providing to search engines “allows their search spiders to introspect and navigate through a live SWF application as if they were virtual users. The Flash Player technology, optimized for search spiders, enables the ability to traverse and parse all of the different paths in a SWF-based RIA, similar to traversing multiple pages in a standard web application.”
This means that if the content that is dynamically loaded into the Flash application from the fifth interaction matches a searcher query, that Flash application may be served in the search results. But when the searcher clicks over to that result, the content won’t be found on the page. The searcher will have to interact with the application until that content is loaded. Searchers may instead feel frustrated and abandon the page. For the best user experience and higher conversion rates from search, Flash developers should be careful to avoid this situation by creating distinct URLs for each piece of content. This implementation helps the Flash site be more viral as well, as users can email, Digg, and otherwise share the content more easily.
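One common way to give each Flash content state its own address is to mirror the application’s internal section in the URL fragment, the pattern that libraries such as SWFAddress popularized. The sketch below is illustrative, not an official Google or Adobe recommendation; the function names are hypothetical:

```javascript
// Hypothetical sketch: map each section of a Flash application to a
// distinct, shareable URL fragment, and back again.

// Build the fragment for a given section, so each piece of content
// gets its own linkable address (e.g. "#/widget-a").
function hashForSection(sectionId) {
  return '#/' + encodeURIComponent(sectionId);
}

// Recover the section from a URL fragment when a visitor arrives via
// a shared or search-result link, so the page can tell the SWF to
// jump straight to that content.
function sectionFromHash(hash) {
  var match = /^#\/(.+)$/.exec(hash);
  return match ? decodeURIComponent(match[1]) : null;
}

// In the page, the SWF would call hashForSection (via ExternalInterface)
// whenever the user navigates, and on load the page would pass
// sectionFromHash(window.location.hash) back into the SWF.
```

This way a search result can land the visitor directly on the matching content rather than at the application’s start screen.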
Google acknowledges this scenario may not be an ideal searcher experience, but points out that other non-HTML file formats such as PDFs have the same limitations. When a searcher clicks through the Google search results to a PDF file, the content that matched the query may not be on the first page of that PDF and the searcher has to scroll through the file to find the desired content. Google notes that just as they flag PDF files to alert searchers that the result is non-HTML, they do the same with Flash files.
What this means for SEOs
Flash has often been a source of frustration for SEOs who argue that text should be in HTML, with Flash used for non-textual content, such as video illustrations. Can SEOs now remove the “review Flash implementation” line from their checklists? Probably not. However, it should be easier for SEOs to work with Flash-based sites going forward.
SEOs should keep in mind that these new algorithms don’t take into account any meta data or formatting markup in the Flash file and, for now, Google’s cache won’t show a representation of the extracted text so site owners can’t verify what is actually being crawled by viewing the cached copy. In addition, since Googlebot doesn’t execute most JavaScript, Google won’t crawl or index any Flash executed via JavaScript. Any external sources that the Flash file loads will be indexed separately, rather than as part of the Flash file. And as noted earlier, all non-textual content will remain uncrawled. This new Flash support covers all languages other than bidirectional ones (Hebrew and Arabic) and all versions of Flash.
What this means for accessibility and usability
Flash developers should continue to think about not only how well their applications can be found in search, but how usable and accessible they are.
Eric Wittman, director of platform distribution and business development at Adobe, told me that Flash web sites can be built for usability and accessibility. He noted that 98% of desktop computers have Flash support, although he acknowledged that he didn’t know how many have Flash blockers installed, and he didn’t provide numbers on the percentage of mobile devices that don’t support Flash.
He noted that screen reader support has been available since Flash Player 6 and that the newer Flex framework includes support for accessibility.
At the recent Developer Day at SMX Advanced, we talked a lot about making Flash applications accessible and search-engine friendly using graceful degradation, and both sIFR and SWFObject came up several times as good methods for ensuring this. I find that still to be good advice even in light of this announcement.
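The graceful-degradation approach works roughly like this: the page ships real, indexable HTML, and the Flash movie replaces it only when the visitor’s browser can display it. A simplified sketch using the SWFObject 2 pattern (the file names and element IDs are placeholders, not from the talk):

```html
<!-- This div contains real HTML that search engines and non-Flash
     visitors see; SWFObject swaps it for the movie when Flash 9+
     is detected. -->
<div id="product-tour">
  <h2>Product tour</h2>
  <p>Our widget does X, Y, and Z.
     <a href="/products/">Browse all products</a></p>
</div>

<script type="text/javascript" src="swfobject.js"></script>
<script type="text/javascript">
  // Replace #product-tour with tour.swf if Flash 9.0.0 or later is
  // available; otherwise the HTML above remains in place.
  swfobject.embedSWF("tour.swf", "product-tour", "550", "400", "9.0.0");
</script>
```

Even with Google now reading SWF text directly, this pattern keeps the content available to other search engines, screen readers, and visitors without Flash.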
Overall, this announcement is great news for both content owners and searchers. Web developers are becoming more focused on architecting their sites to ensure they can be found in search engines, as search has become one of the primary acquisition channels online. As web technologies evolve, it’s important for search engines to evolve as well to ensure they provide the most relevant results for searchers. I’m eager to see how substantial a change this proves to be, although web developers should continue keeping search-friendly practices in mind when developing sites.