The central point of Business Intelligence is data. The whole point of why we need Business Intelligence is that there is lot of data and we need to analyze it to take efficient decisions. But then data in real life, does not comes from one single source. Neither it is the easy way i.e. you can’t very easily analyse it. It is of varied type with various representations and needs to be analyzed as a whole, in order to make proper decisions.
Saying there are different types of data is not enough. Let us understand the different types of data that we deal with in Business Intelligence.
-
- Unstructured Data – Unstructured data refers to the data that is not present in the traditional row-column format or does not fit into any pre-defined model. It is usually characterized by being text heavy, including some numbers and dates. The lack of structure creates irregularities and ambiguities, that make it difficult to be searched and analysed using the computer programs. Wikibon describes unstructured data as – A file with little or no metadata, and little or no classification data.Characteristics of Unstructured data are –
- It does not confirms to any data model.
- It cannot be stored in the form of rows and columns in a table.
- It does not follows any rules or semantics.
- It is not easily usable by a program.
- It is not present in a particular format or sequence.
- It has no identifiable structure.
Using unstructured data, also brings along with it lots of issues. Let us now look at some of the challenges faced when storing unstructured data-
- Storage Space
- Data is increasing rapidly and all of the data is not created equally (this means that some data generated is critical to the enterprise, however, other might be ignored). However, in many IT departments, older, untouched files and business-critical content all resides in tier-1 storage and is subjected to the same backup imperatives. Now, since this data is increasing, IT managers keep fidgeting for more storage and better backup storage. If the data is structured or semi-structured, handling becomes a little easier as you can figure out using simple techniques what is required and eliminate the unnecessary. But, with unstructured data, you are bound to store it entirely causing two issues-
- There is an increase in the actual cost of storage.
- It is quite difficult to back up all the data in the expected time frame.
Now, since data is huge, few companies are not able to consistently meet backup windows and are slow to recover from incidents that bring down systems. They are more likely to miss recovery point objectives (RPO) and recovery time objectives (RTO) due to lot of unstructured data. These files often remain in the primary storage, even after their active life has come to an end. Some companies to avoid such scenario often spend on their storage and backup. However, with the consistent increase in the data, one cannot prevent back up failures for a long time. Failure to store this data, leads to business loss.
- Scalability
- Scalability is the characteristic of a data model or a system to cope and perform under an increased or expanding workload. A system that scales well will be able to maintain or increase its level of performance when tested by larger operational needs. Now, this is an issue with unstructured data. It is not scalable. As is clear from the storage space issues, when more data is generated, it calls for increased cost and complexity for managing and maintaining multiple software. And even after spending money on it, the end result is that without proper data management policies, one cannot deal with data protection challenges of unstructured data.
Retrieve InformationData is retrieved using index in the world of computers. For unstructured data, the text needs to be first broken down into keywords and then then indexing is done. Searching large pages of data is not like searching internet. It is like you need to search for the entire source to get the required piece of information and then break down the source to store it in traditional index linking keywords. Imagine, searching images for information, when they don’t even have meta data associated with them. Lay Man Terms- You search all the data available and try to find out if it has the required information. Once you find it, you use it as index and store it for further search operation.
SecurityWith data available in bulk and at different places, it becomes quite difficult to ensure that data is secure. One cannot very easily give the various grants to data and sharing becomes an issue.
- Unstructured Data – Unstructured data refers to the data that is not present in the traditional row-column format or does not fit into any pre-defined model. It is usually characterized by being text heavy, including some numbers and dates. The lack of structure creates irregularities and ambiguities, that make it difficult to be searched and analysed using the computer programs. Wikibon describes unstructured data as – A file with little or no metadata, and little or no classification data.Characteristics of Unstructured data are –
- Semi-Structured Data- Semi-structured data is very similar to the unstructured data, with a difference that it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is also called as self-describing data. The different kinds of semi-structured data include, XML, JSON (JavaScript Object Notation), integration of data from heterogeneous sources.Characteristics of semi-structured data are –
- Does not confirms to any data model but contains tags and elements (meta data)/
- Cannot be stored in form of rows and columns as in database.
- The tags and elements describe the data stored.
- Similar entities are grouped into same class, but same class entities may not have similar attributes.
- Insufficient meta data
Although, semi-structured data is better than unstructured data, it also has some issues associated with it. Let us have a look at some of its issues –
- Storage Cost
- Data is increasing at a very high rate and the vendors of the traditional storage are very expensive to store such a humongous, ever increasing data. So, when the data is limited you could easily store the data in shared storage but with increasing data, you need to pay more and since this data is ever increasing, the mere cost of storing it increases.
- Irregular Structure
- As we have talked, the large collection of data in semi-structured consist of heterogeneous elements. Some elements may be incomplete, while some may contain extra information. Different types may be used for storing same information, for example- dollar can be used for storing price in some cases while rupee can be used in other. Also, same piece of information may be structured in different ways, example -address may be stored as a tuple or a string. Modeling and querying such irregular data is quite a big issue.
- Implicit Structure
- Although the data present has some precise structure, it is often quite implicit. For instance, the data present in few applications is in form of some text which contains both text and grammar. It is then parsed to find out some specific information and the relationship between them. Interpretation of the data and the relationships is way beyond the capabilities of the traditional relational model. The reason why the structure is taken to be implicit is –
- Some kind of computation needs to be done to find out the structure.
- The correspondence between the parse tree and the logical representation is not always immediate.
- Partial Structure
- Some parts of the data may lack structure e.g. bit maps while other may have only some sketchy structure (like the unstructured data). On top of it, the information retrieval tools may provide only limited form of structure e.g. by computing occurrences of particular words or group of words and by classifying documents based on their content. It might also be possible that the application leaves a large amount of data outside the database, making it unstructured and difficult to analyze.
- Evolving Schema
- In contrast to the relational database, with huge amount of data, the schema is rapidly changing. For example- the genome data is expected to change with more experiments being done. As a result, expressive formats such as ASN.1 or ACeDB are preferred. With respect to semi-structured data, we make sure that the schema is flexible and changes as rapidly as data, hence posing serious issues to the database technology.
- Distinction Between Data And Schema
- In a standard database, their is a clear distinction between the data and the schema. The difference between the data and schema too a very large extent have got blurred in semi-structured data. Schema updates are frequent, Schema laws can be violated, the schema may be very large, the same queries/updates may address both the data and schema. Furthermore, logically also it can make little sense. For example- the sex of an individual can be stored as data ( 0 for male and 1 for female) in one source and type (the object is of class Male or Female) in other.
- Structured Data – Structured data is the best form of data available as it is present in fixed field residing in a record or file. Simply put, it is the kind of data that is present in the relational database and can be retrieved using SQL queries. Characteristics of structured data are –
- It conforms to a data model.
- Data is stored in the form of rows and columns.
- Data resides in fixed fields within a record or a file.
- Definition, format and meaning of data is known before hand.
- Similar entities are grouped and they have similar attributes
Why Do We Love Structured Data?
- It is so easy to store data
Since data is present in particular format, you can easily store it in rows and columns.
- Scalability
It can be upgraded to process more transactions by adding new processors, devices and storage, and which can be upgraded easily and transparently without shutting it down.
- Security
The database administrator can very easily grant or evoke access to the user, hence making the data present safer. So, for example, when a user logs in with his/her user name or password, only the tables that are available to him/her will be visible.
- Update And Delete
Operations like these can be very easily done by firing simple queries. Since data is present in form of rows and columns, one can perform various operations using the traditional RDBMS concepts.
Reference – http://www.dtic.mil/dtic/tr/fulltext/u2/a428473.pdf;
http://www.pearsonitcertification.com/articles/article.aspx?p=2427073&seqNum=2
Happy Learning 🙂


Leave a comment