News came out earlier this week that Wikipedia founder Jimmy Wales had a new
project in mind: to build a community-driven "Google-killer" search engine. I’ve
just finished talking with Jimmy about his plans. Here’s a rundown on his vision
and what may come as his Search Wikia project grows over the course of the next
year or two.
Note that in the Q&A, I’ve had to recreate my questions as best I remember
asking them. I was focused more on getting down Jimmy’s responses.
Q. Since the news emerged, there’s been some confusion about Amazon and
Wikipedia in relation to the Search Wikia project. What’s the situation?
[We] completed a funding round with Amazon [for Wikia], but other than that, they
don’t have anything to do with the search project. [The project] is a Wikia
project [the for-profit company that Wales is chairman of], not a Wikipedia
project [the separate community-driven encyclopedia he co-founded].
It was a combination of them both. I’ve been working on this for a long time.
We didn’t actually intend to announce per se just yet, but me and my big mouth,
the reporter asked me if I ever thought about search.
Q. It’s been said the search engine would launch in the first quarter of
2007. That’s fast. Is that really just when you expect active development work
[to begin]?
During Q1, we’re going to set up a project to get developers involved with
building the site, writing the code and getting the search engine going. We’re
going to rely initially on Nutch and Lucene [related open-source search
software that’s been developed over the past few years].
We’ll start from scratch on how to apply the Wikipedia principles to keep it
as simple as possible and move forward.
It’s just the development starting. We’re not producing a Google-killing
search engine in three months. I only wish I were that good of a programmer.
We’ll have some servers open, some development, maybe a pre-pre-alpha demo
site up. We’d really anticipate it would be a year or two until we’re able to
launch a viable search engine.
Q. How do you see this improving on what’s out there?
There are a lot of things that we’ve learned in the wiki world on how to get
communities involved and engaged to build trusted networks in communities.
A lot of the people who have tried to do this in the past have stumbled not
on technical issues but on community issues … dmoz [The Open Directory] was too
closed … that was their response because of the
pressure of spammers … others have thought in terms of ranking algorithms.
That’s not the right approach. The right approach allows for open dialog and
debate and discussion.
Q. How do you envision the community participating? Will they be selecting
sites? Will this leverage material in Wikipedia? Will they rate sites?
This will be completely independent of Wikipedia.
Exactly how people can be involved is not yet certain. If I had to speculate
about it, I would say it’s several of those things, not just community involved
with rating URLs but also community rating for whole web sites, what to include
or not to include and also the whole algorithm … That’s a human type process
that we can empower people to guide the spider.
Q. Do you see humans reviewing the most popular queries, perhaps picking
the right answers to come up?
Part of it might be a human review of queries. For the narrow subset of the
really popular queries, I think it’s important to apply humans …. if someone
types Ford Motor Company, there is a correct answer for that. There’s no reason
to beat our brains out to train our algorithm to do that.
Q. Search engines have actually gotten much better over time with these
types of navigational requests. You don’t need humans so much to make sure the
right answer shows up.
Those kinds are not too difficult. The harder one is if you type ford: did you
mean President Ford or do you mean the Ford Motor Company? That’s the type of
thing where human disambiguation pages like we have at Wikipedia are helpful.
Q. Search engines already do a lot of this type of stuff. Ask has its Zoom
suggestions, others have clusterings or related searches. Do you imagine people
being forced to make a query refinement choice before they actually get search
[results]?
If you type ford, you should get some disambiguation terms that humans have
collected, then some search results … this is one of the places where I think
human intelligence is most important.
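
To make the idea concrete, here is a minimal sketch of the page Wales describes: human-collected disambiguation terms first, then ordinary search results. Everything in it is an assumption for illustration. The `DISAMBIGUATION` table, the `ResultPage` class and the `build_page` function are hypothetical names, and nothing here reflects how Search Wikia would actually be built.

```python
from dataclasses import dataclass, field

# Hypothetical table of refinements collected by human editors.
# The "ford" entry mirrors the example from the interview.
DISAMBIGUATION = {
    "ford": ["President Ford", "Ford Motor Company"],
}


@dataclass
class ResultPage:
    query: str
    refinements: list = field(default_factory=list)  # human-curated, shown first
    results: list = field(default_factory=list)      # algorithmic, shown second


def build_page(query, algorithmic_results):
    """Put any human-collected refinements ahead of the regular results."""
    key = query.strip().lower()
    return ResultPage(
        query=query,
        refinements=list(DISAMBIGUATION.get(key, [])),
        results=list(algorithmic_results),
    )


page = build_page("Ford", ["ford.com", "en.wikipedia.org/wiki/Ford"])
print(page.refinements)
print(page.results)
```

The searcher is not forced to choose a refinement before seeing results. A query with no curated entry simply gets an empty refinement list and the usual results.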
[NOTE: For more on query refinement, see some of my past posts such as
Wants What We Had — Better Query Refinement. So Do I!,
Language Search, My Old Over-Hyped Search Friend and
Why Search Sucks & You Won’t Fix
It The Way You Think. The first link in particular discusses how Microsoft
used to have disambiguation created by editors very similar to what Wales hopes
to recreate. Sadly, it was killed in the quest to chase Google on the
Q. Are you planning to crawl the entire web, billions and billions of
pages? Or will you go after a subset of important ones?
The number of pages is yet to be determined. Obviously we won’t be doing that
initially [gathering everything], but we’ll invest in the hardware. Not to
belittle the investment required to do a full crawl of the web on a regular
basis, but I think it’s fairly commoditized.
Q. Crawling is one thing. Serving up millions of queries per day is an
entirely different issue. Wikipedia handles a lot of traffic, but not at a Google
scale. How’s it going with that?
The traffic’s not too bad. Servers are getting more and more powerful.
Bandwidth is getting cheaper. It’s all pretty much off the shelf. It’s pretty
Q. Will you be selling ads, and if so, how will that work?
There are no immediate plans to sell ads, so for now we’re not too focused on
that. If we don’t build something useful, selling ads on it is sort of a moot
[point].
Q. Why do this at all? What do you see wrong with search?
For certain types of searches, search engines are very good. But I still see
major failures, where they aren’t delivering useful results. At a deeper,
almost political level, I think it’s important that we as a global society have
some transparency in search. What are the algorithms involved? What are the
reasons why one site comes up over another one? [Wales also raised the issue of
how ads might influence regular listings, perhaps search engines trying to keep
commercial sites out of the free listings to make money. From there, he went
on….] Those types of incentives are problematic in search. The only solution I
know to that is to be transparent.
Q. How are you going to keep the community from being gamed? Wikipedia is
very good at keeping out spam, but it’s not perfect. And despite its size, it’s
dealing with far fewer topics than unique searches that will happen on any
particular day. How do you police all those searches?
You have to recognize the difference between the way community is often used
on the internet, which is shorthand for millions of people clicking on some
stuff as compared to community in the wiki world, which is people who actually
know each other.
It’s one thing to say if you have millions of spammers out there trying to
game and trick an algorithm …. but it’s not the number of queries. It’s the
web sites themselves. A lot of numbers are thrown about for sites on the web,
but the number of legitimate pages that are not coming from affiliate sites and
spammers is a much more finite number. It’s much easier for a community to ban
the bad stuff.
Q. But what if someone gets into a "good" domain? We’ve had cases where
bad content gets shoved into "trusted" sites or even places like university
sites. Do you ban those entire domains? How do they get back in?
At Wikipedia, we’d have a big discussion. [Wales then explained that people
might realize a domain had done something accidentally wrong or without thinking
about spam issues and so might be allowed back in.]
Q. You probably already search a lot, probably mostly with Google. Is it
not finding what you want already most of the time, without a flood of spam or
crud in your way?
Usually I’m looking for pages on Wikipedia, so they do a good job with that.
It depends on the types of searches you are doing. If you’re doing a factual
search, then Wikipedia [in the results] would be good. In other areas, I think
there’s a strong commercial incentive. Why is it bad if I search for tampa
[hotels]?
[NOTE: I then did this search on Google, which we discussed. I noted I saw
plenty of good hotels listed, and that if I clicked through to the local search
results, I got an even better experience of hotels listed.
Wales replied that he’s often after reviews of hotels, not the hotels
themselves. That took me back to the original results, where I pointed out the
top listing was from TripAdvisor, exactly the type of review site he mentioned
liking — and that I often found them listed on these types of queries.
I also noted that Google even offers refinement categories at the top of the
page similar to the disambiguation he wanted, with guides as one of the
categories. Unfortunately for Google, I didn’t find
that the results from that refinement did a good job bringing back trusted hotel
Q. Back to transparency. People keep saying they want more of this. But
can you name some exact examples of what you want to see? Do you want Google to
say that using a term in bold text adds X percent of a score to the ranking
criteria? And if you do that, don’t you think spammers will just abuse the
recipe that’s been published?
If your search relies on some secret factors that you hope people won’t
discover, you haven’t really come up with a good solution to the problem.
Q. Microsoft has spent millions of dollars and years now of effort to try
and be a Google killer and hasn’t made it. You’re coming into this fresh with
fewer resources and no real prior experience. Can you really do it?
I have no idea. I only do whatever sounds like it is fun.
Q. What type of funding do you have behind this?
Wikia’s initial round was $4 million from a variety of angels, then there was
a second round from Amazon, but the amount wasn’t announced.
When I first heard of the plans, I was pretty dubious the project would have
much success. For one thing, the idea of the "open source" search engine to take
on the world and provide more transparency is old news. Consider this from back
when Nutch first came out, out of New Scientist in 2003:
The project "is about providing free technology that should not be controlled
by private, commercial, secretive organisations," says Doug Cutting, veteran
web search engineer, and a Nutch founder.
Three years on, nothing has really changed, even though the reasoning behind
such a project is the same. And this was despite Nutch having some big names
behind it.
In 2004, Nutch got another round of attention in an ACM article looking at how
it works. My comment at that time was:
Interesting read especially for the efforts that are involved to defeat spam.
The argument is that though Nutch
is open, revealing secrets won’t hurt because spammers will batter down any
defenses, no matter how tightly protected. OK, so what will stop spam? Nutch
hopes that an open, public discussion may reveal new methods. Perhaps. But the
real test will only come if Nutch is deployed by a major, highly-trafficked
site. Spammers aren’t going to bother trying the defenses of other places. It’s
not worth the time. That’s also a positive for those considering Nutch. If you
operate a small, vertical site or just want Nutch to be used on your own
content, then spam concerns are much less an issue.
The spam test simply hasn’t happened with Nutch. And every new search engine
project I’ve looked at coming in over the years completely underestimates the
spam problem it faces. When I looked at the Search Wikia site, comments
like this almost seemed laughable:
search active for spammer sites
- trying to simulate user-typos (ie. "yaoho.com" rather than "yahoo.com");
see also: Microsoft’s
- blacklist domains, where spammails are linking to; create actively
honeypods to get spam; use a pattern like
to identify the spam
networks; shell the common user get the possibility to register such a mail-adress?
Seek out the spam sites? Hey, don’t worry — if you’re popular, they’ll find
you fast enough. And as you blacklist one, two more throwaway domains will show
up in their place.
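
For readers wondering what "simulating user typos" might involve, here is a minimal sketch that covers only one kind of typo, swapping two adjacent letters. It is an illustration under stated assumptions, not anything taken from the Search Wikia proposals. The function name `transposition_typos` is hypothetical, and the sketch ignores dropped letters, doubled letters and neighboring-key mistakes.

```python
def transposition_typos(domain):
    """Return variants of `domain` with two adjacent letters of the name swapped.

    Only the part before the first dot is altered; the suffix is kept as-is.
    """
    name, dot, suffix = domain.partition(".")
    variants = []
    for i in range(len(name) - 1):
        if name[i] == name[i + 1]:
            continue  # swapping identical letters changes nothing
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        candidate = swapped + dot + suffix
        if candidate not in variants:
            variants.append(candidate)
    return variants


for candidate in transposition_typos("yahoo.com"):
    print(candidate)
```

Each candidate would still have to be checked by a person or a crawler to see whether it actually hosts spam. And as noted above, every blacklisted domain is cheap for a spammer to replace.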
I also tend to think Wales is completely underestimating the challenge.
Crawling a big chunk of the web, keeping those pages fresh, ranking them quickly
to provide answers and doing so for millions each day isn’t an off-the-shelf
commodity.
Still, I find myself oddly hopeful. I don’t think a Google killer will
emerge, but perhaps some new ways for a community to be involved with search will
come out of it. I wouldn’t have thought Wikipedia would work. Certainly it’s
flawed, but it’s also an incredible resource. Maybe something useful will come
from the Search Wikia project.
At the very least, I’ve long wanted humans to be back in the role of
reviewing queries and actually looking to see if they make sense, rather than so
much reliance on algorithms. Maybe the mere concept of the Search Wikia project
will encourage the major search engines to do more in this area.
Postscript: Originally I had Wales listed as cofounder of Wikipedia, but he got in touch saying he was the founder. I’d noted that Wikipedia itself lists him as founding (well, creating) it with Larry Sanger. Is Wikipedia incorrect on this, I asked? “Yes, it is wrong,” he emailed. Sanger posts his own views on the origins here.