Building a Bigger Search Engine

Web-search upstart LookSmart has ambitious plans to build a giant distributed supercomputer for indexing and searching the Net. Critics say the idea has potential flaws, but volunteers are already lining up to participate. By Leander Kahney.
Kord Campbell, a programmer from Oklahoma, is the brains behind Grub, a distributed computing project aiming to be the biggest, baddest search engine on the Web. Photo: Leander Kahney

Web-search company LookSmart has ambitious plans to do for Web searching what SETI@Home did for the hunt for E.T.

Last week, LookSmart released a screensaver that harnesses volunteers' spare computing power to index the Web.

Like SETI@Home, LookSmart's Grub screensaver runs in the background or when the computer is idle. But instead of searching for signs of intelligent aliens, Grub crawls the Net to build an index for Web searches.

In a matter of days, the number of people running Grub jumped from fewer than 100 to more than 1,000. As of Wednesday, the system was crawling more than 26 million Web pages, according to the project's website.

LookSmart is confident that the number of Grub volunteers will continue to grow, and is hopeful that in time -- perhaps several months -- the system's "distributed crawl" will be capable of indexing all of the Web's estimated 10 billion pages -- every day.

"It will be the first comprehensive index (of the Net)," said Kord Campbell, the programmer behind the Grub software. "We can conceivably crawl every Web page, every day."

By contrast, today's fastest search engines, such as Google or Inktomi, crawl about 150 million pages a day. Google indexes about a third of the Web, and refreshes its index every 30 days, according to LookSmart.
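Taking those figures at face value, a quick back-of-the-envelope calculation shows the size of the gap Grub is trying to close. The numbers below are the ones quoted in this article, not independent measurements:

```python
# Rough check using only the figures quoted above: a Web of roughly
# 10 billion pages, a crawl rate of about 150 million pages a day,
# and a 30-day refresh cycle.

web_pages = 10_000_000_000        # estimated size of the Web
pages_per_day = 150_000_000       # crawl rate attributed to Google and Inktomi
refresh_days = 30                 # Google's quoted refresh cycle

# How long a centralized crawler would need to touch every page once:
full_sweep_days = web_pages / pages_per_day
print(f"Full sweep at 150M pages/day: ~{full_sweep_days:.0f} days")      # ~67 days

# Pages reachable within one 30-day refresh cycle:
pages_per_cycle = pages_per_day * refresh_days
print(f"Pages per 30-day cycle: ~{pages_per_cycle / 1e9:.1f} billion")   # ~4.5 billion

# Grub's stated goal, by contrast, is every page, every day:
print(f"Rate needed for a full daily crawl: {web_pages / 1e9:.0f} billion pages/day")
```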

Campbell quietly worked on the software for more than three years in Oklahoma before LookSmart acquired his company and three-person staff in January. Grub was -- and continues to be -- largely an open-source project.

"It's a wild ride," said Campbell. "This project I put the last three and a half years into is just starting to take off. It's been kick started."

Each Grub screensaver crawls a portion of the Web and relays details back to LookSmart's computers in San Francisco, which parcel out the workload. To block attempts to spam or spoof the index, the system has a built-in authentication procedure, and the same work is given to several volunteers.
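LookSmart hasn't published the details of that protocol, but the basic idea of handing the same work to several volunteers and accepting a result only when independent copies agree can be sketched roughly as follows. The class and method names here are illustrative, not Grub's actual code:

```python
import random
from collections import Counter, defaultdict

class CrawlCoordinator:
    """Toy coordinator that hands the same URL to several volunteers and
    accepts a result only when enough independent copies agree.
    Illustrative only; not Grub's actual protocol."""

    def __init__(self, redundancy=3, quorum=2):
        self.redundancy = redundancy          # how many volunteers get each URL
        self.quorum = quorum                  # matching results needed to accept
        self.results = defaultdict(list)      # url -> list of (volunteer_id, content_hash)

    def assign(self, url, volunteers):
        """Pick several distinct volunteers to crawl the same URL."""
        return random.sample(volunteers, k=min(self.redundancy, len(volunteers)))

    def report(self, url, volunteer_id, content_hash):
        """Record one volunteer's result; return the accepted hash once a quorum agrees."""
        self.results[url].append((volunteer_id, content_hash))
        counts = Counter(h for _, h in self.results[url])
        winner, votes = counts.most_common(1)[0]
        return winner if votes >= self.quorum else None


coordinator = CrawlCoordinator()
volunteers = ["vol-a", "vol-b", "vol-c", "vol-d"]
print("assigned to:", coordinator.assign("http://example.com/", volunteers))

# A spoofed report from one volunteer is outvoted by two honest ones:
coordinator.report("http://example.com/", "vol-a", "hash-real")
coordinator.report("http://example.com/", "vol-b", "hash-spoofed")
print("accepted:", coordinator.report("http://example.com/", "vol-c", "hash-real"))
```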

However, Danny Sullivan, editor of the industry newsletter Search Engine Watch, said the project may have unseen flaws. Most worrying, he said, would be the ability to hack the system in order to promote certain sites. "I have more faith in companies that control their own crawl and index, than I do in approaches that ask people to submit their own data," he said.

LookSmart claims the software has advantages in speed and efficiency over more centralized search engine "spiders." For a start, the Grub client can crawl websites internally, sending a daily update of pages that have changed.

"It minimizes bandwidth and ensures a timely update of changes," said Andre Stechert, LookSmart's director of technology.

LookSmart hopes to tap the altruistic nature of many Internet users, betting that volunteers will help build a distributed search engine because they stand to benefit from it. In that spirit, LookSmart said it would open up as much of the index as possible to the public.

"We're building a community-based infrastructure, and because it's community based we're giving back," Stechert said.

Stechert said that over time, more sophisticated capabilities will be added to the screensaver, including the ability to index pages and perform "link analysis." One of the reasons Google is popular is that it analyzes the links between pages, a strategy that improves the relevance of the results returned for any given search.
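Google's best-known form of link analysis is PageRank, which treats each link as a vote for the page it points to and iterates until the scores settle. A minimal power-iteration sketch on a toy link graph, not Google's production implementation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly across all pages.
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_graph))   # "c", the most linked-to page, gets the highest score
```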

Google is also popular because it throws enormous resources at the problem, Stechert said. In fact, Stechert said, the success of search engine companies has been closely tied to the amount of computing power they use to index the Web.

According to Stechert, Google supplanted AltaVista by trading the half-dozen big computers of the older approach for clusters of thousands of PCs. By the same logic, hundreds of thousands of volunteer machines -- perhaps millions -- will have a similar effect, he said.

SETI@Home, for example, is the biggest virtual supercomputer on the planet. Its 4 million volunteers average about 1,000 years of computing time a day. The system operates at 52 teraflops, or 52 trillion floating-point operations a second. The next most powerful supercomputer, Japan's Earth Simulator, clocks in at 10 teraflops.
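Taken at face value, those figures imply only a modest contribution from each machine, which is exactly the appeal of a volunteer grid. A rough check, using only the numbers quoted above:

```python
# Back-of-the-envelope arithmetic on the SETI@Home figures cited above.
volunteers = 4_000_000
cpu_years_per_day = 1_000
aggregate_flops = 52e12          # 52 teraflops, as quoted

# 1,000 CPU-years of work delivered per calendar day works out to a
# couple of hours of computing per volunteer machine per day:
cpu_hours_per_day = cpu_years_per_day * 365.25 * 24
print(f"~{cpu_hours_per_day / volunteers:.1f} CPU-hours per volunteer per day")   # ~2.2

# Averaged over every signed-up machine, the sustained rate per machine is
# small, since most machines are idle or offline much of the time:
print(f"~{aggregate_flops / volunteers / 1e6:.0f} MFLOPS per volunteer on average")  # ~13
```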

If the Grub project attracts enough volunteers, it may be capable of performing the holy grail of Web searches: a real-time "semantic parse" of the Web, Stechert said.

Instead of looking for keywords, a suitably powerful search engine will analyze Web pages for their meaning. Stechert said researchers can perform this kind of analysis in the lab on relatively small numbers of documents, but it is far too computationally expensive to do for billions of Web pages.
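The article's sources don't say what a real-time semantic parse would look like in practice. As a crude illustration of the distinction Stechert is drawing, the toy sketch below contrasts literal keyword lookup with matching through a small hand-built concept table; the table is only a stand-in for genuine semantic analysis, which is vastly more expensive per document:

```python
# Toy contrast between keyword matching and a crude stand-in for
# "semantic" matching. The concept table is hand-built for illustration;
# real semantic analysis must parse and disambiguate every document,
# which is what makes it so costly at Web scale.

documents = {
    "doc1": "cheap flights to paris and london",
    "doc2": "budget airfare deals for european travel",
}

def keyword_match(query, docs):
    """Return documents containing the literal query terms."""
    terms = set(query.lower().split())
    return [d for d, text in docs.items() if terms & set(text.split())]

# Hypothetical concept table mapping different words to shared concepts.
concepts = {
    "cheap": "low-cost", "budget": "low-cost",
    "flights": "air-travel", "airfare": "air-travel",
}

def concept_match(query, docs):
    """Match on shared concepts rather than literal words."""
    query_concepts = {concepts.get(w, w) for w in query.lower().split()}
    hits = []
    for d, text in docs.items():
        doc_concepts = {concepts.get(w, w) for w in text.split()}
        if query_concepts & doc_concepts:
            hits.append(d)
    return hits

print(keyword_match("cheap flights", documents))   # ['doc1'] only
print(concept_match("cheap flights", documents))   # ['doc1', 'doc2']
```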

"Going from tens of thousands of machines to hundreds of thousands of machines is fundamentally going to change the nature of search," Stechert said. "Going to millions of machines allows us to ask, ‘What can we do with all that computing power?’"

But Peter Norvig, director of search quality at Google, said that while the Grub project is topical and interesting, improving Web search isn't a problem of widening an index but of narrowing it.

"It isn't a problem of computing resources but deciding what parts of the Web should be updated more frequently than others," he said.

Norvig said that's why Google built Google News, to make sure news websites were re-indexed several times a day.

"I don't want more computers or bandwidth," he said. "I want more clues about which page to look at rather than another page. The problem is how to rank the right pages. I don't think whether you are a distributed architecture affects that. The problem for us is how do we direct the crawl, not do we have enough resources to get the crawl."

Google is also experimenting with distributed computing. The Google Search Bar, which adds search capabilities to a Web browser's toolbar, donates spare cycles to Stanford's Folding@Home project, which simulates the ultra-complex process of protein folding.

Matthew Berk, a senior analyst at Jupiter Research, agreed that simply increasing the size of an index wouldn't necessarily lead to better searches. Berk said a lot more goes into a good search engine, such as link analysis, and it remains to be seen if LookSmart can add such capabilities.

"It's innovative, but the much larger puzzle is a lot harder to solve," he said. "But they've just come out of beta, so we'll see how it goes. The proof will be in the pudding. I don't even know what the pudding recipe looks like yet."