
What is Data Warehousing?

The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.

An operational database changes frequently, every day, because of the transactions that take place. Suppose a business executive wants to analyze earlier feedback on a product, a supplier, or a consumer. The executive will have no historical data to analyze, because the earlier values have been overwritten by later transactions.

A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools support interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance the interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has become an important platform for data analysis and online analytical processing.

Understanding a Data Warehouse

  • A data warehouse is a database, which is kept separate from the organization's operational database.

  • A data warehouse is not updated frequently.

  • It possesses consolidated historical data, which helps the organization to analyze its business.

  • A data warehouse helps executives to organize, understand, and use their data to make strategic decisions.

  • Data warehouse systems help integrate a diverse set of application systems.

  • A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational Databases

A data warehouse is kept separate from operational databases for the following reasons −

  • An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.

  • Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

  • An operational database query allows both read and modify operations, while an OLAP query needs only read-only access to the stored data.

  • An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Data Warehouse Features

The key features of a data warehouse are discussed below −

  • Subject Oriented − A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather it focuses on modelling and analysis of data for decision making.

  • Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

  • Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.

  • Non-volatile − Non-volatile means that previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Note − A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.
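The time-variant and non-volatile features can be illustrated with a minimal sketch. It assumes a hypothetical operational store that overwrites a product's price in place, and a warehouse modelled as an append-only list of dated snapshots; the names `operational`, `warehouse`, and `load_snapshot` are illustrative only.

```python
from datetime import date

# Hypothetical operational store: one row per product, overwritten in place.
operational = {"P1": {"price": 10.0}}

# Hypothetical warehouse: append-only snapshots, each tagged with a load date.
warehouse = []

def load_snapshot(as_of):
    """Copy the current operational state into the warehouse; old rows are never touched."""
    for product_id, row in operational.items():
        warehouse.append({"product_id": product_id, "price": row["price"], "as_of": as_of})

load_snapshot(date(2024, 1, 1))
operational["P1"]["price"] = 12.0   # a transaction overwrites the old value
load_snapshot(date(2024, 2, 1))

# The operational store only knows the current price; the warehouse keeps both.
history = [(r["as_of"].isoformat(), r["price"]) for r in warehouse if r["product_id"] == "P1"]
print(history)
```

After the second load, the operational store holds only the current price of 12.0, while `history` still contains the January value alongside the February one.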

Data Warehouse Applications

As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields −

  • Financial services
  • Banking services
  • Consumer goods
  • Retail sectors
  • Controlled manufacturing

Types of Data Warehouse

Information processing, analytical processing, and data mining are the three types of data warehouse applications that are discussed below −

  • Information Processing − A data warehouse allows the data stored in it to be processed. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

  • Analytical Processing − A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.

  • Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
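The basic OLAP operations named above can be sketched on a tiny in-memory cube. This is only an illustration under stated assumptions: the fact rows, the dimension names in `DIMS`, and the helpers `aggregate` and `select` are hypothetical, and a real OLAP tool would run these operations against the warehouse rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales)
facts = [
    (2023, "Q1", "East", "Laptop", 100),
    (2023, "Q1", "West", "Laptop", 80),
    (2023, "Q2", "East", "Phone", 60),
    (2023, "Q2", "West", "Phone", 40),
    (2024, "Q1", "East", "Laptop", 120),
    (2024, "Q1", "West", "Phone", 50),
]
DIMS = ("year", "quarter", "region", "product")

def aggregate(rows, by):
    """Sum the sales measure, grouped by the named dimensions."""
    idx = [DIMS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def select(rows, **allowed):
    """Keep rows whose value for each named dimension is in the allowed set."""
    return [r for r in rows
            if all(r[DIMS.index(d)] in values for d, values in allowed.items())]

# Drill up (roll up): coarser grain, year only.
roll_up = aggregate(facts, by=("year",))
# Drill down: finer grain, year and quarter.
drill_down = aggregate(facts, by=("year", "quarter"))
# Slice: fix one dimension (year = 2023) and look at the rest.
slice_2023 = aggregate(select(facts, year={2023}), by=("region", "product"))
# Dice: restrict two or more dimensions at once.
dice = aggregate(select(facts, region={"East"}, product={"Laptop"}), by=("year",))
# Pivot: rotate the result so regions are rows and years are columns.
pivot = defaultdict(dict)
for (region, year), total in aggregate(facts, by=("region", "year")).items():
    pivot[region][year] = total
```

With these sample rows, `roll_up` is `{(2023,): 280, (2024,): 170}` and `pivot["East"]` is `{2023: 160, 2024: 120}`.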

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
------ | --------------------- | ---------------------------
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on Information out. | It focuses on Data in.
5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on Entity Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides summarized and multidimensional view of data. | It provides detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | It is highly flexible. | It provides high performance.
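To make the Star Schema entry in the comparison more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions made for illustration; they do not come from any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the subjects (hypothetical names and columns).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- The central fact table holds the measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (1, 2023, 'Q4'), (2, 2024, 'Q1');
INSERT INTO fact_sales VALUES (1, 1, 3, 3000.0), (1, 2, 2, 2000.0), (2, 2, 5, 1500.0);
""")

# A typical read-only analytical query: summarized revenue per year and category.
rows = conn.execute("""
SELECT d.year, p.category, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY d.year, p.category
ORDER BY d.year, p.category
""").fetchall()
print(rows)
```

The query reads many fact rows and returns a few summarized ones, which matches the "summarized and consolidated" side of the comparison.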
 


Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources and supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Using Data Warehouse Information

There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives to use the warehouse quickly and effectively. They can gather data, analyze it, and make decisions based on the information present in the warehouse. The information gathered in a warehouse can be used in any of the following domains −

  • Tuning Production Strategies − Product strategies can be fine-tuned by repositioning products and managing product portfolios, based on quarterly or yearly sales comparisons.

  • Customer Analysis − Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.

  • Operations Analysis − Data warehousing also helps in customer relationship management, and making environmental corrections. The information also allows us to analyze business operations.

Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have two approaches −

  • Query-driven Approach
  • Update-driven Approach

Query-Driven Approach

This is the traditional approach to integrating heterogeneous databases. It builds wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of Query-Driven Approach

  • When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for each of the heterogeneous sites involved.

  • These translated queries are then mapped and sent to each site's local query processor.

  • The results from heterogeneous sites are integrated into a global answer set.
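The three steps above can be sketched as a small mediator with one wrapper per source. Everything here is hypothetical: the two sources, their field names and units, and the functions `wrapper_a`, `wrapper_b`, and `mediator` are assumptions chosen to show the pattern, not a real integration API.

```python
# Two hypothetical heterogeneous sources with different field names and units.
SOURCE_A = [{"cust": "Ann", "amount_usd": 120.0}, {"cust": "Bob", "amount_usd": 80.0}]
SOURCE_B = [{"customer_name": "Ann", "amount_cents": 5000}]

def wrapper_a(customer):
    """Translate the global query into source A's vocabulary and run it locally."""
    return [{"customer": r["cust"], "amount": r["amount_usd"]}
            for r in SOURCE_A if r["cust"] == customer]

def wrapper_b(customer):
    """Same for source B, converting cents to the global unit."""
    return [{"customer": r["customer_name"], "amount": r["amount_cents"] / 100}
            for r in SOURCE_B if r["customer_name"] == customer]

def mediator(customer):
    """Fan the query out to every wrapper at query time and merge the results."""
    answer = []
    for wrapper in (wrapper_a, wrapper_b):
        answer.extend(wrapper(customer))
    return answer

result = mediator("Ann")
total = sum(r["amount"] for r in result)
```

Note that every call to `mediator` contacts every source again, which is where the cost of frequent or aggregating queries comes from.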

Disadvantages

  • The query-driven approach needs complex integration and filtering processes.

  • It is inefficient, because every query has to be translated and run against the sources at query time.

  • It is very expensive for frequent queries.

  • This approach is also very expensive for queries that require aggregations.

Update-Driven Approach

This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages −

  • This approach provides high performance.

  • The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance.

  • Query processing does not require an interface to process data at local sources.
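As a rough sketch of the update-driven idea, the example below integrates two hypothetical sources once, ahead of any query, and then answers queries from the integrated store alone. The source layouts and the names `build_warehouse` and `query_total` are assumptions for illustration.

```python
# Two hypothetical heterogeneous sources with different field names and units.
SOURCE_A = [{"cust": "Ann", "amount_usd": 120.0}, {"cust": "Bob", "amount_usd": 80.0}]
SOURCE_B = [{"customer_name": "Ann", "amount_cents": 5000}]

def build_warehouse():
    """Integrate and summarize all sources in advance into one store."""
    totals = {}
    for r in SOURCE_A:
        totals[r["cust"]] = totals.get(r["cust"], 0.0) + r["amount_usd"]
    for r in SOURCE_B:
        name = r["customer_name"]
        totals[name] = totals.get(name, 0.0) + r["amount_cents"] / 100
    return totals

# Built ahead of time (for example on a schedule), not per query.
warehouse = build_warehouse()

def query_total(customer):
    """Answer from the warehouse only; no source is contacted at query time."""
    return warehouse.get(customer, 0.0)

print(query_total("Ann"), query_total("Bob"))
```

The trade-off is that answers are only as fresh as the last build, which is why warehouses need a refresh step.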

Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities −

  • Data Extraction − Involves gathering data from multiple heterogeneous sources.

  • Data Cleaning − Involves finding and correcting the errors in data.

  • Data Transformation − Involves converting the data from legacy format to warehouse format.

  • Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

  • Refreshing − Involves propagating updates from the data sources to the warehouse.

Note − Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
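The five functions listed above can be chained into a small pipeline. The sketch below is deliberately simplified and rests on assumptions: the two sources (`CRM_ROWS`, `LEGACY_LINES`), the `dim_customer` table, and all function names are hypothetical, and the loading step here only inserts rows and relies on the primary key index rather than doing full sorting, summarizing, and partitioning.

```python
import sqlite3

# Hypothetical sources: a list of dicts and a pipe-delimited legacy export.
CRM_ROWS = [{"id": "1", "name": " ann ", "country": "us"},
            {"id": "2", "name": "Bob", "country": "US"}]
LEGACY_LINES = ["3|carol|DE", "bad-row", "2|Bob|US"]

def extract():
    """Data extraction: gather raw records from every source into one dict shape."""
    records = [dict(r) for r in CRM_ROWS]
    for line in LEGACY_LINES:
        records.append(dict(zip(("id", "name", "country"), line.split("|"))))
    return records

def clean(records):
    """Data cleaning: drop records with missing fields or a non-numeric id."""
    return [r for r in records if len(r) == 3 and r["id"].isdigit()]

def transform(records):
    """Data transformation: convert to the warehouse format."""
    return [(int(r["id"]), r["name"].strip().title(), r["country"].upper())
            for r in records]

def load(conn, rows):
    """Data loading: write rows; the primary key gives an index and removes duplicates."""
    conn.execute("CREATE TABLE IF NOT EXISTS dim_customer "
                 "(customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
    conn.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()

def refresh(conn):
    """Refreshing: rerun the pipeline to bring source updates into the warehouse."""
    load(conn, transform(clean(extract())))

conn = sqlite3.connect(":memory:")
refresh(conn)
refresh(conn)  # running it again does not duplicate rows
rows = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
print(rows)
```

The malformed legacy line is dropped during cleaning, and the customer that appears in both sources ends up as a single warehouse row.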
