Have you ever thought about building a fully featured search engine that works like Google or Bing? Google emerged as one of the biggest companies on the Internet within a very short span of time, and many internet entrepreneurs have been amazed by its success as a company. On the technology side, how does Google work so fast and so powerfully? How does Google manage fault tolerance? Where does Google store the data for billions of web pages? Can you create a search engine like Google? If so, how?
Well, to build a search engine like Google, you need to understand several aspects. First of all, a search engine like Google cannot be built overnight. It takes months or even years to crawl and store the data, rank the results, and cover almost the entire web. But you should usually be able to start producing search results within a couple of weeks.
Where do you store the data? Where does Google store the data? Google built a NoSQL database called BigTable to store its search data. BigTable runs on top of the Google File System (GFS), a reliable distributed file system designed to span thousands of nodes attached to the network.
What Technology Should I Use?
You cannot run Google on MySQL. Period. Not even on Oracle, if you are looking for a global scale service. You need something similar to BigTable, running on a distributed file system like GFS. But GFS and BigTable are Google's internal technologies; they are not open source and you cannot install them on your own servers.
Hadoop: Hadoop is an open source framework whose file system, HDFS (Hadoop Distributed File System), works very similarly to GFS, and it is one of the most widely used distributed file systems available now. Hadoop is developed and maintained by Apache. It is a strong choice for running highly scalable, multi-machine applications like search engines and analytics. Hadoop helps you connect thousands of nodes together to work as one expandable file system.
HBase: HBase is a NoSQL (Not Only SQL) database that runs on top of Hadoop and can store petabytes of data. It is written in Java and regarded as a reliable database. Like Hadoop, HBase is maintained by Apache.
Hypertable: Hypertable is another NoSQL database that runs on Hadoop. It is written in C++, and the Hypertable company claims that its performance is much faster than HBase's. Hypertable support is also very good, and it offers more flexibility in queries compared with HBase.
So to run a Google clone, you should use either Hadoop + HBase or Hadoop + Hypertable.
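To make the storage idea a little more concrete, here is a minimal sketch of how a crawled page might be laid out in a BigTable-style table. It assumes a row key built from the reversed host name, so that pages from the same domain sort next to each other. The names row_key, store_page and the column names are hypothetical, and a plain Python dict stands in for HBase or Hypertable; a real setup would go through the database's own client library.

```python
from urllib.parse import urlsplit

def row_key(url: str) -> str:
    """Build a row key with the host name reversed, e.g. com.example.www/path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"
    return reversed_host + path

# Assumption: a dict stands in for the wide-column table.
# Each row maps "family:qualifier" column names to values.
table: dict[str, dict[str, str]] = {}

def store_page(url: str, html: str, title: str) -> None:
    """Store one crawled page under its reversed-host row key."""
    table[row_key(url)] = {"content:html": html, "meta:title": title}

store_page("http://www.example.com/index.html", "<html>...</html>", "Example")
store_page("http://blog.example.com/", "<html>...</html>", "Blog")
store_page("http://www.other.org/a", "<html>...</html>", "Other")

# Sorted row keys keep all pages of one domain in a contiguous range,
# which is what lets a distributed table scan a whole site cheaply.
for key in sorted(table):
    print(key)
```

With this layout, both example.com pages end up next to each other in key order, ahead of the other.org page.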
What Hardware Should I Use?
Of course, I understand that you don’t want to start with your own datacenter. Google has its own ever-expanding datacenters around the world. The ideal way to start is to tie up with a datacenter or hosting company that can provide a series of nodes (computers) in a single network. The key reason you need the nodes in a single network is that, in a scalable distributed system, the nodes exchange data with each other constantly, so keeping them in the same physical network can significantly improve the performance of your search engine as you add more nodes in the future.
How Can I Code a Google Clone Application?
Here comes the trickiest and most interesting part of your journey to build a Google clone search engine. Even if you choose the right technology and the right infrastructure, your spider won’t be effective enough if the code is not powerful and not designed for scalability. I am not able to cover all the components of the software logic and the algorithms needed to build a spider here. However, the diagram below, found on Inout Spider, will give you a really good idea of the major components required to build one. Inout Spider is a commercial application (widely regarded as a powerful search engine data spider and a standard Google clone script) that works on Hadoop and Hypertable. So if you cannot code it yourself, I recommend you consider Inout Spider.
Building a search engine like Google is never an easy task, or else we would have seen many more Google clones online. But with the right technology, hardware and software (your own, or a commercial application like Inout Spider), your dream is achievable.
By Google clone, I do not mean an exact copy of Google. The term Google is used here as a synonym for ‘search engine’. This article is intended to help you create a standard search engine like Google, Bing, Yahoo, Baidu etc.
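As a rough illustration of the kind of pieces a spider needs, here is a tiny single-machine sketch with a frontier queue, a fetcher, a link parser, an inverted index and a search function. It is not how Inout Spider or Google works. Every name in it (FAKE_WEB, LinkParser, crawl, search) is hypothetical, the "web" is a hard-coded dict instead of real HTTP, and ranking is left out entirely.

```python
import re
from collections import defaultdict, deque
from html.parser import HTMLParser

# Assumption: a dict stands in for the web; a real fetcher would make HTTP requests.
FAKE_WEB = {
    "http://a.test/": '<a href="http://b.test/">b</a> hadoop cluster',
    "http://b.test/": '<a href="http://a.test/">a</a> <a href="http://c.test/">c</a> search engine',
    "http://c.test/": "hadoop search",
}

class LinkParser(HTMLParser):
    """Collect outgoing links and visible text from one page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

def crawl(seed, fetch=FAKE_WEB.get, max_pages=100):
    """Breadth-first crawl from seed; return an inverted index word -> set of URLs."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid loops
    index = defaultdict(set)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # page not found in the stand-in web
            continue
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index[word].add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query (no ranking)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = crawl("http://a.test/")
print(sorted(search(index, "hadoop search")))
```

In a real engine each of these pieces would be a distributed service of its own, with the frontier and the index kept in the database layer instead of in memory.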