What is Data?
In our digital world, data is defined as facts, figures, or information stored in or used by a computer. In a digital sense, everything we have on our computers, on the internet, and on storage devices is data.
Some basic examples are a text file, a CSV file, an Excel file, a picture file, a video file, a server or job log file, and so on.
Data comes in various forms. The data journey starts with raw data that is collected, processed and transformed into meaningful information that makes sense to its users.
Let me share an example of data collection, processing and transformation. While performing an installation, a number of events occur on a server. The software installer keeps track of all events relevant to the software installation. This information is collected and processed to prepare a log file, which is stored at a location and shared at the end of the installation. The event IDs are converted into meaningful sentences so that they are easy for the user to understand. The log file stores processed data, i.e. information on your installation, regardless of whether the installation failed or succeeded. I am sure all my Admin and DBA folks will grasp this example pretty well.
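To make the idea concrete, here is a minimal Python sketch of how raw event IDs could be turned into readable log lines. The event IDs, messages and the render_log helper are all made up for illustration; a real installer defines its own events and log format.

```python
from datetime import datetime

# Hypothetical event IDs and messages (assumptions, not from any real installer).
EVENT_MESSAGES = {
    1001: "Installation started",
    1002: "Files copied to the target directory",
    1003: "Service registered",
    1999: "Installation failed",
}


def render_log(events):
    """Turn raw (timestamp, event_id) pairs into readable log lines."""
    lines = []
    for timestamp, event_id in events:
        # Fall back to a generic message when the ID is not in the lookup table.
        message = EVENT_MESSAGES.get(event_id, f"Unknown event {event_id}")
        lines.append(f"{timestamp.isoformat(sep=' ')} [{event_id}] {message}")
    return lines


# Raw data: what the installer collected while it was running.
raw_events = [
    (datetime(2021, 1, 5, 10, 0, 0), 1001),
    (datetime(2021, 1, 5, 10, 0, 42), 1002),
    (datetime(2021, 1, 5, 10, 1, 3), 1999),
]

# Processed data: the log file content shared with the user.
log_lines = render_log(raw_events)
print("\n".join(log_lines))
```

Note that the log is written the same way whether the installation succeeds or fails; the failure is just one more event in the list.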
Another example of data and data processing is the data generated from our access cards. Yes, the access cards that we used to swipe before March 2020 to enter our office premises. Every access card punch on the reader machines at the doors is collected, stored and processed to generate meaningful information that is presented in reports to our management. Let's assume we want to find out whether Ashish worked an average of 9 hours a day over a quarter for the firm; we can definitely get this information from our collected data.
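As a rough sketch of the access card example, the Python snippet below computes the average hours per recorded day from punch-in and punch-out times. The punch records and the average_daily_hours helper are hypothetical, and the sketch assumes one in/out pair per day; a real access system would have many more punches and edge cases.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical punch records: (employee, time in, time out), one pair per day.
punches = [
    ("Ashish", datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 18, 30)),
    ("Ashish", datetime(2020, 1, 7, 9, 30), datetime(2020, 1, 7, 18, 0)),
    ("Ashish", datetime(2020, 1, 8, 10, 0), datetime(2020, 1, 8, 19, 0)),
]


def average_daily_hours(records):
    """Return the average hours per recorded day for each employee."""
    total_hours = defaultdict(float)
    day_count = defaultdict(int)
    for employee, time_in, time_out in records:
        total_hours[employee] += (time_out - time_in).total_seconds() / 3600
        day_count[employee] += 1
    return {name: total_hours[name] / day_count[name] for name in total_hours}


averages = average_daily_hours(punches)
print(averages)
```

The same function could be fed a full quarter of punches to answer the question about Ashish.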
What is BigData?
I hope we are now clear on what data is. We can now start understanding BigData.
The word ‘BigData’ itself describes data that is big, huge in size. BigData is still data, but it is a collection of data that is huge in volume and keeps growing exponentially with time. The data gets so large and complex over time that it becomes impossible to store and process efficiently with traditional tools.
Big Data is not something completely new; it has been around for decades. We have been collecting and analyzing data for a long time. We used RDBMSs (databases such as MSSQL, Oracle, DB2, MySQL etc.) to store data and then used Business Intelligence tools to analyze it and extract information for our use. Raw data is not of much use if we cannot extract meaningful information out of it.
Over the years, the data that was collected and analyzed grew to a huge size. This made it impossible to store and analyze efficiently using our traditional tools. All of our DBAs would know how cumbersome and messy it becomes to manage a database once it grows beyond 10TB. Oracle can handle a bit more, but eventually, beyond 100TB, data management definitely becomes painful.
BigData – A Problem?
Yes! BigData is a problem for the whole world today. BigData is not a technology or software but a problem, because storing, processing and analyzing this huge data is not possible using traditional tools.
Where does ‘BigData’ come from?
Before data became BigData, we used to store it in files and databases and process it with data analytics tools. The data was not very large, nor was it growing exponentially. Over time, computers made their entry into our homes, then cell phones, and later laptops. Today, we probably all have more than one cell phone SIM card and at least one laptop in the family. We are in the digital age. Data started to grow exponentially with the new-age social media apps, from the nostalgic Orkut, then Facebook and Twitter, to present-day Instagram Reels and video sharing platforms, and the data storage range has moved from a few Terabytes to Zettabytes (a Zettabyte is a trillion Gigabytes).
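The jump from Terabytes to Zettabytes is easier to appreciate with a little arithmetic. The sketch below uses decimal (SI) units, where each step is a factor of 1,000; the UNITS table and the convert helper are illustrative names, not part of any library.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def convert(value, from_unit, to_unit):
    """Convert a size between two of the units listed in UNITS."""
    return value * UNITS[from_unit] / UNITS[to_unit]


zb_in_gb = convert(1, "ZB", "GB")
zb_in_tb = convert(1, "ZB", "TB")
print(f"1 ZB = {zb_in_gb:,.0f} GB")
print(f"1 ZB = {zb_in_tb:,.0f} TB")
```

So one Zettabyte is a trillion Gigabytes, or a billion Terabytes.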
BigData and our lives
Every single one of us generates ‘BigData’. BigData is everywhere in our lives.
We get up in the morning to an alarm on our phone, most probably with a fitness band on our wrist. This wake-up and sleep data is stored in an app and contributes to BigData. The app also provides us with analytics on our health and suggestions to improve it.
Our morning newspaper has weather predictions and at least 4-5 prediction articles, all of which are sourced from BigData.
We use WhatsApp, Facebook, Twitter, Instagram, Snapchat, LinkedIn etc. on our mobile phones and share a lot of pictures, videos and links. A lot of data is accessed, viewed and processed. Our social media content has also changed from Orkut scrapbook entries (text) to pictures and videos in our posts and status messages.
We all work with data in one form or another, be it a doctor who checks patient records on WhatsApp and gives video consultations, a regular office where we store files and records on computers, or any other profession where people create a website to promote their business.
We do a lot of searches on Google for anything that interests us; we book trips and movies, order food and shop for anything and everything online. We book appointments to visit passport offices or the District Magistrate for various purposes.
We watch videos on YouTube, binge-watch on Netflix, catch show premieres on Amazon Prime etc. Even for commuting we use Maps. We use banking apps, credit/debit cards and wallets like Paytm, PhonePe, GPay etc. every now and then. All this data is stored for analytics and contributes to BigData.
We view, generate, process, store and analyze data every single hour of our lives. One cannot imagine our lives today without data, and more specifically without BigData.
Examples of BigData
I have already quoted a lot of examples of BigData in our lives. The following ones are from different fields and industries.
BSE – Bombay Stock Exchange – Stock markets generate TBs of data in a single day.
Medical research – Research institutes generate huge amounts of data from different experiments and studies.
Space research – ISRO & NASA generate, store and process a lot of data in their space research projects.
Transportation – Uber runs analytics on the data it generates to provide the best routes and fares to its customers.
Types of Data
Now that we understand what data and BigData are, it is the right time to understand the types of data we have. Data is categorized mainly into three types: structured, unstructured and semi-structured.
When we define data in a fixed format, i.e. structure our data, it is called structured data. Structured data is stored, accessed and processed in a fixed format. It is the easiest to work with. An example of structured data is data stored in tables, in rows and columns. Each column has a fixed set of parameters such as size, data type, length, precision etc.
Excel spreadsheets and CSV files are also examples of structured data.
Structured data is quantitative, in the form of numbers and values.
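Because every row follows the same columns and types, structured data is straightforward to load and aggregate. Here is a minimal Python sketch using the standard csv module; the sample rows, the SCHEMA mapping and the load_rows helper are made up for illustration.

```python
import csv
import io

# Hypothetical CSV content; every row follows the same fixed columns.
CSV_TEXT = """id,name,salary
1,Asha,52000.50
2,Ravi,48000.00
"""

# The fixed structure: each column has a known data type.
SCHEMA = {"id": int, "name": str, "salary": float}


def load_rows(text):
    """Parse CSV text and cast every column to the type given in SCHEMA."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in SCHEMA.items()} for row in reader]


rows = load_rows(CSV_TEXT)
total_salary = sum(row["salary"] for row in rows)
print(rows)
print(total_salary)
```

A database table works on the same principle, with the schema enforced by the database itself.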
Data without any specific format is known as unstructured data. This data is not available in any particular structure and is messy. Unstructured data poses challenges not only because of its size but also in processing it and extracting value out of it.
Examples of unstructured data are text, video files, audio files, mobile activity, social media posts, satellite imagery, surveillance camera imagery etc.
Unstructured data is qualitative data in the form of text, audio, video files etc.
It is commonly estimated that 80-90% of the data generated today is unstructured.
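Extracting value from unstructured data usually means imposing some structure on it first. As a very small sketch, the Python snippet below counts the most frequent words across a few free-text posts. The posts, the stopword list and the top_words helper are hypothetical, and real text analytics is far more involved than a word count.

```python
import re
from collections import Counter

# Hypothetical social media posts: free text with no fixed format.
posts = [
    "Loved the new phone, battery life is great!",
    "Battery drains fast after the update :(",
    "Great camera, great screen, average battery.",
]

# A tiny, illustrative stopword list.
STOPWORDS = frozenset({"the", "is", "a", "after", "new"})


def top_words(texts, n=3):
    """Return the n most common non-stopword words across all texts."""
    words = []
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                words.append(word)
    return Counter(words).most_common(n)


print(top_words(posts, 2))
```

Even this crude count hints at what people are talking about, which is the kind of value we want out of messy data.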
Semi-structured data lies between structured and unstructured data. It can contain both forms of data. Most of the time it is unstructured data with metadata attached to it.
Examples of semi-structured data are data stored in HTML files, XML files, JSON files etc.
"id" : 45697
"name" : "Pawan"
"DOB" : "16-12-1986"
"location" : "delhi"
"id" : 45643
"name" : "Praveen"
"DOB" : "16-12-1993"
"location" : "jaipur"
5 V’s of Big Data
Big Data was originally defined by the ‘3Vs’ (Volume, Velocity and Variety), but now there are ‘5Vs’ of Big Data (Volume, Velocity, Variety, Veracity and Value), which are also termed the characteristics of Big Data.
Volume is the size of the data, and it is BigData's prime characteristic. BigData is huge, enormous in volume. The size of data plays a crucial role in determining the value that can be drawn from it. BigData typically refers to data with sizes in Petabytes and above.
Velocity is the speed at which data is generated. BigData is generated and grows at an exponential rate. There is a massive and continuous flow of data. Data flows in from sources such as social media, mobile phones, video sharing platforms, sensors etc.
Variety refers to the heterogeneous sources and nature of data. Data is generated from heterogeneous sources such as server logs, video sharing platforms, sensors, access cards, etc. The nature of data also differs. It may be structured, semi-structured or unstructured.
Veracity refers to the inconsistency and uncertainty of data. Generated data can have missing details and contain invalid or junk values. Data can be messy, and its quality and accuracy are difficult to maintain.
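A common first step in dealing with veracity is to validate records before analyzing them. The Python sketch below drops readings that are incomplete or out of range. The sensor readings, the is_valid helper and the temperature limits are all assumptions chosen for illustration.

```python
# Hypothetical sensor readings; some are incomplete or clearly wrong.
readings = [
    {"sensor": "s1", "temperature": 21.5},
    {"sensor": "s2", "temperature": None},   # missing value
    {"sensor": "", "temperature": 19.0},     # missing sensor name
    {"sensor": "s3", "temperature": 999.0},  # junk value, out of range
    {"sensor": "s4", "temperature": 22.1},
]


def is_valid(reading, low=-50.0, high=60.0):
    """Accept a reading only if it names a sensor and has a plausible temperature."""
    temperature = reading.get("temperature")
    has_sensor = bool(reading.get("sensor"))
    return has_sensor and temperature is not None and low <= temperature <= high


clean = [reading for reading in readings if is_valid(reading)]
rejected = len(readings) - len(clean)
print(f"kept {len(clean)} readings, rejected {rejected}")
```

In this toy data set more than half of the records are rejected, which shows why quality checks matter before any analysis.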
Data that is bulky and complex is not useful to anyone until we process it and extract some value out of it, i.e. create information from raw data. Thus, Value is the most important characteristic of all the 5Vs. Value is what everyone is looking for after all the effort of managing huge and complex BigData.