A Complete Guide to Data Integration - Part I
Data integration involves combining data residing in different sources and providing users with a unified view of it. This process becomes significant in a variety of situations, both commercial (for example, when two similar companies need to merge their databases) and scientific (for example, combining research results from different bioinformatics repositories). Data integration is needed with increasing frequency as both the volume of data (that is, big data) and the need to share existing data grow.
Data integration is the process of retrieving data from multiple source systems and combining it in such a way that it can yield consistent, comprehensive, current and correct information for business reporting and analysis. The source systems may be various types of devices and the data may be in a variety of formats.
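As a minimal illustration of pulling data from source systems in different formats and combining it into one consistent shape, here is a sketch in Python. The source systems (a CRM exporting CSV, a billing system exporting JSON) and all field names are hypothetical:

```python
import csv
import io
import json

def load_crm_csv(text):
    """Parse a CRM export (CSV) into the unified record shape."""
    return [
        {"customer_id": row["id"], "name": row["name"], "source": "crm"}
        for row in csv.DictReader(io.StringIO(text))
    ]

def load_billing_json(text):
    """Parse a billing export (JSON) into the same unified shape."""
    return [
        {"customer_id": str(rec["custId"]), "name": rec["fullName"], "source": "billing"}
        for rec in json.loads(text)
    ]

crm = "id,name\n1,Alice\n2,Bob\n"
billing = '[{"custId": 2, "fullName": "Bob"}, {"custId": 3, "fullName": "Carol"}]'

# The unified view: one consistent schema regardless of the source format.
unified = load_crm_csv(crm) + load_billing_json(billing)
```

In a real project, this step would also involve deduplication and business-rule validation, but the core idea is the same: every source is mapped into one agreed schema.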
When Is Data Integration Needed?
Data integration is needed when a business decides to implement a new application and migrate its data from legacy systems into the new application. It becomes even more critical in company mergers, where the two companies need to consolidate their applications.
The most well-known implementation of data integration is building an enterprise data warehouse. A data warehouse enables a business to perform analyses based on the consolidated data it holds.
Data Integration Areas
Data integration is a term covering several distinct sub-areas such as:
- Data warehousing
- Data migration
- Enterprise application/information integration
- Master data management
Challenges of Data Integration
At first glance, the biggest challenge is the technical implementation of integrating data from disparate, often incompatible sources. However, a much bigger challenge lies in running the data integration initiative end to end. It has to include the following phases:
The data integration initiative within a company must be a business initiative, not an IT one. There should be a champion who understands the data assets of the enterprise and can lead the discussion about the long-term data integration initiative in order to make it consistent, successful and beneficial.
Analysis of the requirements
- Why is the data integration being done?
- What are the objectives and deliverables?
- From what systems will the data be sourced?
- Is all the data available to fulfill the requirements?
- What are the business rules?
- What is the support model and SLA?
Analysis of the source systems
- What are the options for extracting the data from the systems (update notification, incremental extracts, full extracts)?
- What is the required/available frequency of the extracts?
- What is the quality of the data?
- Are the required data fields populated properly and consistently?
- Is the documentation available?
- What are the data volumes being processed?
- Who is the system owner?
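Of the extraction options mentioned above, incremental extracts are the most common compromise between freshness and load on the source system. A minimal sketch, using a "last modified" timestamp as the watermark (the source rows and field names are assumptions):

```python
from datetime import datetime, timezone

# Hypothetical source rows, each tagged with a last-modified timestamp.
SOURCE_ROWS = [
    {"id": 1, "value": "a", "modified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "value": "b", "modified": datetime(2024, 2, 1, tzinfo=timezone.utc)},
    {"id": 3, "value": "c", "modified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]

def incremental_extract(rows, watermark):
    """Return only rows changed since the last extract, plus the new watermark."""
    changed = [r for r in rows if r["modified"] > watermark]
    new_watermark = max((r["modified"] for r in changed), default=watermark)
    return changed, new_watermark

# Only rows modified after the stored watermark are pulled in this run.
batch, wm = incremental_extract(SOURCE_ROWS, datetime(2024, 1, 15, tzinfo=timezone.utc))
```

The watermark would be persisted between runs; a full extract is simply the same call with the watermark set to the beginning of time.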
Any other non-functional requirements
- Data processing window
- System response time
- Estimated number of (concurrent) users
- Data security policy
- Backup policy
- Support model for the new system
- SLA requirements
- Owner of the system
- Maintenance and upgrade expenses
The results of the above steps need to be documented in the form of an SRS (Software Requirements Specification) document, confirmed and signed off by all parties that will participate in the data integration project.
Based on the BRS (Business Requirements Specification) and SRS, a feasibility study should be performed to select the tools for implementing the data integration system. Small companies and enterprises that are just starting with data warehousing face a decision about the set of tools they will need to implement the solution. Larger enterprises, or those that have already run other data integration projects, are in an easier position: they have experience and can extend the existing system, exploiting existing knowledge to implement the new system more efficiently. There are cases, however, when adopting a new, better-suited platform or technology makes a system more effective than staying with existing company standards: for example, a tool that scales better for future growth, a solution that lowers the implementation/support cost, lower license costs, or a migration to a modern platform.
Along with the implementation, proper testing is a must to ensure that the unified data is correct, complete and up to date.
Both technical IT and business people need to participate in the testing to ensure that the results are as expected and required. Therefore, the testing should incorporate at least a Performance Stress Test (PST), Technical Acceptance Testing (TAT) and User Acceptance Testing (UAT).
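A common building block of such technical acceptance tests is a reconciliation check between the source and the integrated target. A minimal sketch (the record shapes are hypothetical):

```python
def reconcile(source_rows, warehouse_rows, key="id"):
    """Basic acceptance checks: completeness and consistency of the loaded data."""
    src_keys = {r[key] for r in source_rows}
    dw_keys = {r[key] for r in warehouse_rows}
    return {
        "missing_in_warehouse": src_keys - dw_keys,      # rows dropped during load?
        "unexpected_in_warehouse": dw_keys - src_keys,   # duplicates or stale rows?
        "row_count_matches": len(source_rows) == len(warehouse_rows),
    }

source = [{"id": 1}, {"id": 2}, {"id": 3}]
warehouse = [{"id": 1}, {"id": 2}]

# The report flags that row 3 never made it into the warehouse.
report = reconcile(source, warehouse)
```

Checks like this cover completeness; business users in UAT would additionally verify that the values themselves match the business rules.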
Data Integration Techniques
There are many sophisticated ways a unified view of data can be created today. ETL is no longer the only way to achieve this goal, which adds a new level of complexity to the field of Data Integration.
Data integration can be performed at several organizational levels; let's discuss them briefly.
- Manual Data Integration
Technically speaking, this is not really data integration. In this approach, a web-based user interface or an application is created that shows users all the relevant information by accessing each source system directly. There is no real unification of the data.
- Middleware Data Integration
A middleware data integration solution is essentially a layer between two disparate systems that allows them to communicate. Middleware integration can act as the glue that holds multiple legacy applications together, making seamless connectivity possible without requiring the two applications to communicate directly.
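The translation role of such a layer can be sketched in a few lines. Both "systems" here are stand-in functions, and their record formats are assumptions; the point is that neither side knows about the other:

```python
def legacy_hr_system():
    """Emits records in the HR system's own (assumed) format."""
    return {"EMP_NO": "0042", "EMP_NAME": "ALICE"}

def payroll_expects(record):
    """The payroll system accepts only this (assumed) shape."""
    assert set(record) == {"employee_id", "name"}
    return f"paid {record['name']} ({record['employee_id']})"

def middleware(hr_record):
    """Translate HR output into payroll input: the glue between the two systems."""
    return {
        "employee_id": int(hr_record["EMP_NO"]),
        "name": hr_record["EMP_NAME"].title(),
    }

result = payroll_expects(middleware(legacy_hr_system()))
```

Real middleware (message brokers, ESBs) adds routing, queuing and error handling, but the core job remains this format translation.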
- Data Virtualization Integration Approach
Data Virtualization leaves the data in the source systems while creating a new set of unified views on top of them. This gives users access to a unified view of the disparate source systems' data across the whole enterprise.
Many organizations prefer this approach today because of its benefits and the technologies that now exist to support it. The main benefit of virtual integration is a near real-time view of the data in the source systems, and it eliminates the need for a separate data store for the consolidated, unified data.
However, that doesn't mean it is the best way to do data integration, although it certainly has short-term benefits. The drawbacks of this approach include limited availability of historical data and data versioning, and the extra load placed on the source systems, which may adversely affect their performance.
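Both the main benefit (near real-time freshness) and the main drawback (no history) follow from the same fact: nothing is copied. A minimal sketch, with two dicts standing in for live source databases:

```python
# Data stays in the "source systems"; the virtual view queries them on demand.
ORDERS_DB = {101: {"customer": "Alice", "total": 50}}
SHIPPING_DB = {101: {"status": "shipped"}}

def virtual_order_view(order_id):
    """Assemble a unified record at query time from both live sources."""
    order = ORDERS_DB.get(order_id, {})
    shipping = SHIPPING_DB.get(order_id, {})
    return {**order, **shipping}  # near real-time: reflects the sources right now

view = virtual_order_view(101)

# Because nothing is copied, a change in a source is visible immediately...
SHIPPING_DB[101]["status"] = "delivered"
updated = virtual_order_view(101)
# ...but there is no history: the earlier "shipped" state is gone from the sources.
```

Note also that every call hits the source systems, which is exactly the extra load the text warns about.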
- Data Warehouse Approach Of Data Integration
This is the most commonly known approach to Data Integration; you may already know it if you have read Ralph Kimball and/or Bill Inmon.
This approach requires the creation of a new data warehouse (or data marts) that stores a unified version of the data extracted from all the source systems involved and manages it independently of the original source systems.
The benefits of this approach include the ability to easily manage the history of data (data versioning), to combine data from very disparate sources (mainframes, databases, flat files, etc.) and to store it all in a central repository.
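A minimal ETL sketch of this approach: extract from two disparate sources, transform to a common schema, and load into a central store (a dict standing in for the warehouse; the source shapes and field names are assumptions):

```python
mainframe_extract = ["1|ACME|2023", "2|GLOBEX|2024"]           # pipe-delimited flat file
database_extract = [{"id": 3, "name": "Initech", "year": 2024}]  # relational source

def transform_flat(line):
    """Map one flat-file line onto the warehouse's common schema."""
    cid, name, year = line.split("|")
    return {"id": int(cid), "name": name.title(), "year": int(year)}

warehouse = {}  # central repository, keyed by id

def load(records, batch_date):
    for rec in records:
        # Storing the batch date alongside each record is what enables
        # history management / data versioning in this approach.
        warehouse[rec["id"]] = {**rec, "loaded_on": batch_date}

load([transform_flat(line) for line in mainframe_extract], "2024-06-01")
load(database_extract, "2024-06-01")
```

Unlike the virtualization approach, the warehouse keeps its own copy, so queries no longer touch the source systems and past loads can be retained.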