LOK LookSmart Limited

LookSmart acquires fantastic new technology!

  MJR


    In January 2003, we acquired substantially all of the assets of Grub, Inc., a developer of distributed computing software which allows community participants to assist in the development and updating of a web search index. We believe that by incorporating a distributed computing solution into our systems and processes for updating our search index, we may be able to achieve substantial gains in the freshness of the index and cost savings over the long term.

    Frequently Asked Questions from Grub

    Q: What is Grub?
    A: Grub is an Open Source software company that has written a distributed web crawler. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed.

    Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day. Having websites crawl their own content, and volunteers donate their bandwidth and spare clock cycles, dramatically decreases bandwidth consumption across the Internet, allows for pre-processing of the resulting data, and ultimately improves the search results sent to end users.
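A minimal sketch of how such a distributed-crawler client loop might work, in Python. The callbacks, message fields, and fingerprinting choice are all hypothetical - the FAQ does not describe Grub's actual wire protocol:

```python
# Hypothetical client loop for a distributed crawler. The assignment and
# reporting callbacks stand in for whatever protocol the real server uses.
import hashlib

def crawl_batch(fetch_assignments, fetch_page, report_results):
    """Fetch a batch of assigned URLs and report compact fingerprints."""
    results = []
    for url in fetch_assignments():            # URLs handed out by the server
        body = fetch_page(url)                 # donated bandwidth
        results.append({
            "url": url,
            "sha256": hashlib.sha256(body).hexdigest(),  # donated CPU cycles
            "size": len(body),
        })
    report_results(results)        # only compact summaries travel upstream
    return results

# Usage with stub callbacks (no network required):
pages = {"http://example.com/a": b"hello", "http://example.com/b": b"world"}
out = crawl_batch(lambda: sorted(pages), pages.get, lambda r: None)
```

The point of reporting fingerprints rather than full pages is exactly the bandwidth saving the FAQ describes: the heavy fetching happens at the edge, and only small summaries cross the network again.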

    Q: Will Grub be a for-profit corporation?
    A: Yes, it will. The only way for this project to succeed is by making money doing it. Grub needs to feed its coders, and it needs to pay for its servers and bandwidth. Without revenues, none of what Grub proposes to do is possible.

    Q: Why would people want to share their resources to help Grub make money?
    A: If you don't run a website and don't care to contribute to a greater cause, there might not be a good enough reason for you to run the client. We didn't expect that everyone was going to run this thing, after all! However, if you run a website or host multiple websites, you would want to run the client because it will index your own content before it crawls other sites. (This functionality is a work in progress in the client.) Having your content auto-updated in the search engines is a powerful motive to run the client. For someone like an ISP or ASP, running the client will actually help improve the quality of the services they provide, and thus make them more competitive in their space.

    We are considering some other options concerning reasons to run the client. A few ideas presented have consisted of micro payments, contests, and a lottery. Although there are merits in some of these ideas, we expect that the real reason that people will run the client will be because they are directly benefiting from the continuous indexing provided by the client and, perhaps, because they would like a better search engine.

    Q: Isn't it wrong to make money with Open Source software?
    A: Open Source doesn't equal free! Grub coders wrote most of the software used in the project, and have made that code available to the Open Source community. If anything, Grub has contributed to the community, both by making its code open and by opening up its database to other similar projects out on the Internet.

    Network Solutions made millions of dollars off of named, a freely licensed piece of DNS server software, by having their name servers listed in the named.cache file that shipped with the package. If it were wrong to profit from someone else's free software, they certainly would have been guilty of it several trillion times over.

    Q: Is it true there are over 1 billion web pages on the Internet?
    A: Yes - that's true several times over. It is estimated that there are well over 10 billion web pages worth of content on the Internet right now with another 1 million new pages or more created or revised every single day. The rate at which pages are being created is also increasing, causing the problem to intensify over time.

    Q: Don't most search engines index the entire Internet already?
    A: No. Most search engines don't even come close to indexing the entire Internet. We know this from researching and investigating how search engines actually work. Most search engines try to keep the number of pages in their index to a minimum, partly because fewer pages means fewer pages to revisit: a smaller index can be refreshed more often, keeping it more up-to-date. In addition, the vast majority of pages on the Internet are dynamic, with their content locked away in databases, unavailable to existing crawlers.

    Q: How does your solution allow for indexing those databases?
    A: By placing the crawler closer to the data (i.e. on the web server itself) our client will be able to analyze and index the data local to the system on which it is running.
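One way to picture this - purely as an illustration, since the FAQ doesn't publish the client's internals - is a client that walks the web server's document root directly instead of fetching its own pages over HTTP. The docroot and base URL below are made-up examples:

```python
# Illustrative local indexer: fingerprint files in a docroot without any
# HTTP fetches. Dynamic database-backed content would need source-specific
# handling; this sketch only covers files on disk.
import hashlib, os, tempfile

def index_docroot(docroot, base_url):
    """Map local files to their public URLs and content fingerprints."""
    records = []
    for dirpath, _dirs, files in os.walk(docroot):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, docroot).replace(os.sep, "/")
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            records.append({"url": f"{base_url}/{rel}", "sha256": digest})
    return records

# Usage on a throwaway docroot:
root = tempfile.mkdtemp()
with open(os.path.join(root, "index.html"), "wb") as f:
    f.write(b"<html>hi</html>")
recs = index_docroot(root, "http://example.com")
```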

    Q: What does being up-to-date have to do with searching?
    A: The pages on the Internet change over time. Just look at a newspaper's main page - it can change 5 or 6 times a day. When a search engine crawls a page on the Internet, it takes a snapshot of that page and puts it in its index. That index is then used later when someone searches for a particular item or idea. If the index is out-of-date - i.e. the pages returned by the search haven't been visited in a while - then the user gets stale data back.

    Q: So, most search engines are more or less up to date then?
    A: No. Limiting the total number of pages indexed does help somewhat in reducing the time between revisits to a page, but the problem still exists - on every single search engine today. Almost 50% of the database a search engine uses is either out-of-date or incomplete at any given time.
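The freshness arithmetic behind this is straightforward. A back-of-envelope sketch with made-up throughput figures - only the 10-billion-page estimate comes from the FAQ above:

```python
# Revisit-interval arithmetic with illustrative, invented crawl rates.
def revisit_interval_days(index_size, pages_per_day):
    """Days between visits to any one page under round-robin crawling."""
    return index_size / pages_per_day

# A centralized farm crawling 50 million pages/day over 10 billion pages:
centralized = revisit_interval_days(10_000_000_000, 50_000_000)
# Add 100,000 volunteer clients at 1,000 pages/day each (+100M pages/day):
distributed = revisit_interval_days(10_000_000_000,
                                    50_000_000 + 100_000 * 1_000)
```

Under these invented numbers the centralized revisit interval is 200 days and the distributed one drops to roughly 67 - the scaling argument the rest of the FAQ makes: every client added shortens the interval further.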

    Q: How is grub.org going to handle indexing the entire Internet in real time when others have tried, and failed?
    A: grub.org will succeed because it uses a fundamentally different way of crawling the Internet. grub.org will utilize a client crawler that is downloaded and run on volunteers' computers, each of which will then index a portion of the Internet at grub.org's direction. Eventually, tens of thousands of clients will be utilized to accomplish our goal.

    Q: Isn't that like Napster, or SETI@home?
    A: Yes. The concept is similar. Grub's client enables any computer on the Internet to utilize its resources (bandwidth, processor time, drive space) to crawl and index a portion of the Internet in its spare time. With enough clients, grub.org will be able to visit and index every web page on the Internet - every single day.

    Q: Doesn't a meta-index do this already?
    A: No. What a meta-index does is search across multiple databases at the same time. It's just as susceptible to bad links, incomplete crawling and being out-of-date as a single engine like AltaVista. Worse yet, it's plagued by overlap between the databases of the different search engines it uses.

    Q: Why would I want to run this client? At least with SETI, I'm doing something - like looking for aliens.
    A: We like aliens as much as the next guy, but we also think Grub's more terrestrial mission is pretty appealing. The reasons that you'll want to run it will vary, but we think you'll see the advantages to be gained by running our client - especially if you are a system admin, or author of a web site.

    Q: So if I were a system admin or a website author I'd want to run the client?
    A: Yes! Anyone that provides web hosting/authoring services will have a use for running our client. In addition to crawling a portion of the Internet, the client can index the admin's/author's entire site each and every night, and then submit that summary to grub's servers for incorporation into the database. Running the client will allow them to provide an added value for their clients - having their web pages updated to the biggest index, each and every day.

    Q: If I ran the client, would my computer's resources get used by people doing a search on grub.org's site - like Napster?
    A: No. Your computer will only crawl a small subset of the Internet and report the results back to the master server. grub.org's servers will handle the requests that access the database. In addition, your client will be configurable, allowing you to control how much your computer indexes and when it does so.
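The FAQ only promises knobs for "how much" and "when". A hypothetical configuration object along those lines - every field name here is invented for illustration - might look like:

```python
# Hypothetical client configuration; the field names are invented, since
# the FAQ only promises control over how much is indexed and when.
import datetime
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    max_pages_per_day: int = 5_000   # cap on donated crawling work
    max_bandwidth_kbps: int = 128    # cap on donated bandwidth
    active_start_hour: int = 1       # crawl only between 01:00
    active_end_hour: int = 6         # and 06:00 local time

    def is_active(self, now):
        """True if the client is allowed to crawl at this moment."""
        return self.active_start_hour <= now.hour < self.active_end_hour

cfg = CrawlerConfig()
night = cfg.is_active(datetime.datetime(2003, 1, 15, 3, 0))   # in the window
day = cfg.is_active(datetime.datetime(2003, 1, 15, 14, 0))    # outside it
```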

    Q: Why did you make the project Open Source?
    A: Open Source is a great way to get a large, diverse group of people working on software, and at the same time to make sure that it is secure and bug-free. Security and quality are top priorities for us - we don't want anyone's computer compromised because we missed something in the coding phase. We will make all software written during the project Open Source as long as there are external contributions to those portions of the code. For the time being, however, we have chosen to fork and close the source of the *server* portion of the software, due to security issues related to the quality of URLs submitted to the system. We may choose to reopen that source depending on the level of contribution to the project by outside developers. Please keep in mind that the server code was written ENTIRELY by us, and uses no GPL'd code in its current state.

    Q: Most Open Source software has an O'Reilly animal assigned to it. What is your animal going to be?
    A: A grub worm of course.

