V’s of Big Data

The V’s of big data are the words starting with V that define the characteristics of big data. When I started my career, we knew and talked about big data in terms of five V’s, of which three were the core characteristics while the other two were additional. However, with the addition of new sources such as IoT, sensors and machine data, there are more V’s that are now being considered.

What is big data?

If the data is huge, comes in many different formats, and is getting generated very quickly, we often term it big data. So a school that has 10k students, all with the same properties of data, is not big data. That data can be stored in a traditional database. In the case of big data, it often becomes difficult, or rather impossible, to store and process the data in traditional databases. Traditional databases are the ones where you know the structure of the data, you can define the indexes, and you can store and process the data without frequent changes to the schema of what is stored.

Understanding the V’s of Big Data

As of today, we have the following V’s in Big Data –

  1. Veracity
  2. Variety
  3. Velocity
  4. Volume
  5. Validity
  6. Variability
  7. Volatility
  8. Visualization
  9. Value

These nine V’s have been grouped into the CPIVW categories.

Collecting Data

Veracity and variety are the characteristics that are aligned with collecting data.
Veracity – This refers to the noise and abnormality in the data. A few months back, I was working on a project where the client wanted us to bring in data from its different sources and store it in one place. When we started doing that, we found out –

  1. There were some transactions that were repeated. If you looked at the two rows they were different, but when we looked at them logically they were the same. This was because four columns defined a transaction – one for the customer, one for the product, one for the date, and so on – and all of these were the same for the two different rows. Imagine I bought a pen worth 10 bucks on 10 September, and in the database there are two entries for that pen, one entry having the brand name of the pen and the other not having it. On paper these are two different rows, but logically it is a single transaction that has been entered twice because of a manual error.
  2. A column in one table stored the country where the transaction took place – the country of the retailer where the item was bought. Some rows had USA, some had US, some had U.S., others had United States of America, and some had uni stat of America! If you look at these, they all point to the US. I know that, you know that, but if I query my database to find out the transactions per country, all of these will be counted as different countries! I don’t want that happening, and nobody does, right?
  3. And then there was a column that stored numbers! This column should have had only numbers, but some entries had NA, some had NULL, some had na! Imagine I am trying to find the sum of this column: even though it is meant to be an integer column, I can never compute the sum of its values, because some of them are integers while others are strings.

What I have talked about here are some of the basic variations we find in data! There are many more! And this is what veracity is – having variations, noise and inconsistencies in the data! As data engineers, our job is not just to bring in data from different sources, but to make sure that the data is cleaned and relevant to the question we are trying to answer. A rough sketch of cleaning up the three issues above follows.
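Here is a minimal, hypothetical sketch in pandas of fixing those three kinds of noise. The column names, the sample rows and the country mapping are all made up for illustration – a real pipeline would drive the mapping from a reference table.

```python
import pandas as pd

# Toy data with the three issues described above: a logically duplicated
# transaction, messy country spellings, and a "numeric" column polluted
# with NA / NULL strings.
df = pd.DataFrame({
    "customer": ["A101", "A101", "B202"],
    "product":  ["Pen", "Pen", "Notebook"],
    "date":     ["2023-09-10", "2023-09-10", "2023-09-11"],
    "country":  ["USA", "U.S.", "uni stat of America"],
    "amount":   ["10", "10", "NULL"],
})

# 1. Drop rows that are logically the same transaction
#    (same customer, product and date), even if other columns differ.
df = df.drop_duplicates(subset=["customer", "product", "date"])

# 2. Map the different spellings of the same country to a single value.
country_map = {
    "USA": "US", "U.S.": "US",
    "United States of America": "US", "uni stat of America": "US",
}
df["country"] = df["country"].replace(country_map)

# 3. Coerce the amount column to numeric; NA/NULL strings become NaN,
#    so sums and averages work again.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

print(df)
print("Total amount:", df["amount"].sum())
```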

Variety – This is exactly what it sounds like: the fact that we have different forms of data! It could be structured, semi-structured, or unstructured! We can have data in tables, data from emails, or audio files. Sometimes these different kinds of data come from different sources, sometimes from the same source. And when, to answer the same question, we have to connect to different sources holding data in different shapes, that is a characteristic of big data. For example, in one of the projects I worked on, we were pulling the data for orders placed for a whiskey brand! The thing we were trying to find was the same – get me the orders placed – but we searched for it in different places. Some orders happened directly at a store, which gave us structured data. Then there were orders placed via the website, which did not come in the same structured form. And then the client also wanted to analyse what people are not buying but are highly interested in – that data came in from Facebook clicks! We were trying to answer the same question from very different sources, as in the sketch below.
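A hedged sketch of what pulling the same “orders” question from three different shapes of data could look like. The file formats, field names and click strings are assumptions for illustration, not the actual project data:

```python
import io
import json
import pandas as pd

# Structured: the point-of-sale export arrives as a clean tabular CSV.
store_csv = io.StringIO("order_id,product,qty\n101,Whiskey X,2\n102,Whiskey X,1\n")
store_orders = pd.read_csv(store_csv)

# Less structured: the website checkout emits nested JSON events
# that have to be flattened before they look like a table.
web_event = json.loads(
    '{"order_id": 201, "items": [{"product": "Whiskey X", "qty": 3}]}'
)
web_orders = pd.json_normalize(web_event, record_path="items", meta=["order_id"])

# Unstructured: raw clickstream text capturing interest, not purchases.
fb_clicks = ["user_98 clicked Whiskey X ad", "user_12 clicked Whiskey X ad"]

# One question ("how is Whiskey X selling, and who wants it?"),
# three very different shapes of data.
print(store_orders)
print(web_orders)
print(len(fb_clicks), "interested clicks")
```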

Processing Data

When we are processing the data – that is, connecting to the source and bringing the data into our system – two things matter: what is the volume of the data, and how frequently am I going to bring the data in. This makes up the next two V’s.

Velocity – In every data engineering project, the data has to be brought in frequently. We use scheduling tools like triggers in Azure Data Factory, jobs in Databricks, or, earlier, AutoSys! This basically means: run the process at this time! It could be daily, hourly, weekly, or monthly! But irrespective of the schedule, there is new data getting generated, and we want to fetch it into our system. In the case of big data this velocity is high! So, if we go back to our school example, admission of a new student is going to happen in either March or April. After that, the transactions are mostly for fees, which are monthly! However, for a company like Nike or Adidas, products are being sold daily, and a huge amount of transactional data is generated every day – rather every hour, or even every second! The sketch below shows the kind of incremental pull such a schedule triggers.
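Regardless of the scheduler, each run usually boils down to “give me only what is new since the last run”. A minimal sketch, assuming a simple watermark on a timestamp column – the function name and the toy rows are made up:

```python
from datetime import datetime, timedelta

def fetch_new_transactions(source_rows, last_watermark):
    """Return only rows created after the previous run's watermark,
    plus the new watermark to store for the next scheduled run."""
    new_rows = [r for r in source_rows if r["created_at"] > last_watermark]
    new_watermark = max((r["created_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Toy rows standing in for a transactional source table.
now = datetime(2024, 9, 10, 12, 0)
source = [
    {"order_id": 1, "created_at": now - timedelta(hours=3)},
    {"order_id": 2, "created_at": now - timedelta(minutes=20)},
]

# An hourly trigger (ADF, Databricks, AutoSys, cron...) would call this once per run.
rows, watermark = fetch_new_transactions(source, last_watermark=now - timedelta(hours=1))
print(rows)       # only order 2 is new since the last hourly run
print(watermark)  # persisted so the next run knows where to start
```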

Volume – Oracle could be faster than Hadoop, I don’t know, and I am not sure! But one thing I know for sure is that it is damn costly! When the world moved to digital, a huge amount of data started getting generated! Data from sensors, from IoT, from our mobiles, from social media, and with online banking, every bank transaction – everything is getting recorded. And everything is being analysed! This is what defines big data: when the volume is too huge for it to be stored in traditional systems! The back-of-the-envelope sketch below shows how quickly that volume adds up.
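A back-of-the-envelope sketch of why “transactions every second” outgrows a single traditional database. The rates are purely illustrative assumptions, not figures from any real retailer:

```python
# Assumed, illustrative rates - tweak them for your own scenario.
events_per_second = 5_000          # orders + clicks + sensor pings combined
bytes_per_event   = 1_000          # roughly 1 KB per recorded event
seconds_per_day   = 24 * 60 * 60

daily_gb  = events_per_second * bytes_per_event * seconds_per_day / 1e9
yearly_tb = daily_gb * 365 / 1_000
print(f"~{daily_gb:.0f} GB per day, ~{yearly_tb:.0f} TB per year")

# At hundreds of terabytes a year, the data has to be spread across many
# cheap machines (HDFS, cloud object storage) instead of one costly server -
# which is exactly the trade-off the volume V is about.
```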

That’s it for today! I will be covering the remaining V’s in the next post. Stay tuned!

Video Version of the Post – https://youtu.be/cZQLATjYUnY

Amore,

Avantika

References - https://pdfs.semanticscholar.org/d204/960c9ee540630b444afcdfe5c0509baa9e4c.pdf

