Even today, most universities hardly offer Hadoop or Big Data as a subject. Yet in this data-driven world, Big Data is the new oil that moves things forward. Let's look at the philosophy behind Big Data.
Most companies start out by storing their data in a traditional RDBMS, with MySQL, Oracle and MS SQL Server as popular choices. Before 2008, only a few companies felt that their data was growing so rapidly that their traditional RDBMS could no longer cope with it. Google was one of them. Early data scientists and Big Data developers started thinking about this vital problem of coping with such huge volumes of data. In 2003 and 2004, Google published two papers, one on the Google File System and one on MapReduce, describing how Google stored and processed its data very effectively. Inspired by these papers, Yahoo started building on the ideas, and Apache Hadoop took shape as an open source project implementing those concepts.
The word “Hadoop” has no real meaning. Doug Cutting (co-founder of Apache Hadoop) named the project after his child's toy elephant, which also inspired the Apache Hadoop logo. After many years of development, Hadoop 1.0 was finally released at the end of 2011.
What is Big Data?
In the simplest terms, data that is too big to be handled by a traditional RDBMS is called Big Data. But to qualify as Big Data, the data needs to satisfy the five characteristics listed below, also known as the Five V's of Big Data.
Velocity: the speed at which vast amounts of data are generated, collected and analyzed.
Volume: the total amount of data that needs to be processed at once is very large.
Variety: the different types of data we can now use. It may be files, videos, images or any other data from which we can extract valuable information.
Veracity: the correctness or relevance of the data.
Value: you may have huge amounts of data, but what matters is how much value you can generate from it.
Why is the industry moving to Hadoop?
Business runs on statistics. Business decisions cannot be taken on gut feeling; every decision needs to be based on data. Suppose a company has five products and five variations of each product. From sales data and customer reviews, the company decides which products to improve, which to discontinue and which to leave as they are. For small companies this analysis can be done on the RDBMS itself, but once the data becomes Big Data, Hadoop becomes the natural and cost-effective choice. The Hadoop ecosystem is cheap overall because it runs on commodity hardware. In recent times, organizations try to gather as much data as possible in order to take good business decisions, which leads them to use Big Data.
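The product example above can be sketched in the MapReduce style that Hadoop popularized. This is only an illustration in plain Python, not the Hadoop API: the sample records, the function names and the thresholds in the decision rule are all assumptions made up for this sketch. On a real cluster the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict

# Hypothetical sales records: (product, variation, units_sold, review_score)
SALES = [
    ("P1", "A", 120, 4.5),
    ("P1", "B", 15, 2.0),
    ("P2", "A", 300, 4.8),
    ("P1", "A", 80, 4.0),
    ("P2", "B", 5, 1.5),
    ("P1", "B", 10, 2.5),
]


def map_phase(record):
    """Emit one (key, value) pair per record, as a mapper would."""
    product, variation, units, score = record
    return (product, variation), (units, score)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the units and average the review score for one key."""
    total_units = sum(units for units, _ in values)
    avg_score = sum(score for _, score in values) / len(values)
    return key, total_units, avg_score


def summarize(records):
    """Run map, shuffle and reduce over all records, sorted by key."""
    groups = shuffle(map_phase(r) for r in records)
    return sorted(reduce_phase(k, v) for k, v in groups.items())


def decide(total_units, avg_score, min_units=50, min_score=3.0):
    """Toy decision rule; both thresholds are arbitrary assumptions."""
    if total_units < min_units and avg_score < min_score:
        return "discontinue"
    if total_units < min_units or avg_score < min_score:
        return "improve"
    return "keep"


if __name__ == "__main__":
    for (product, variation), units, score in summarize(SALES):
        print(product, variation, units, round(score, 2), decide(units, score))
```

With the sample data, variation A of both products sells well and is reviewed well, so the toy rule keeps it, while variation B falls below both thresholds and is marked for discontinuation.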
Big Data is more useful for OLAP (online analytical processing) and is still not mature enough for OLTP (online transaction processing). RDBMS systems are really good at handling transactions. To analyze huge volumes of data, however, it is recommended to move the data to a Big Data platform and analyze it there, because such a platform analyzes data at that scale much faster. If you query 100 records in an RDBMS and in Hadoop, the RDBMS will give you far better performance, but for trillions of records an RDBMS may take several days while a decent Hadoop cluster may take only a few hours. In the future, once Hadoop starts supporting transactions, RDBMS systems will belong in the history books.