AIT-614-DL3
Dr. Liao
Cloud Data Warehouse and Cloud Data Lake
Team 1
George Mason University
10/24/2022
Fall 2022
Rami Alghussein
Nikitha Katta
Varun Medasani
Yumna Zaidi
Table of Contents
1) Cloud Data Warehouse
a) Characteristics of Cloud Data Warehouse
2) Cloud Data Lake
a) Characteristics of Cloud Data Lake
3) Differences between Cloud Data Warehouse and Cloud Data Lake
4) Data Catalog
5) Real World Cases
a) Cloud Data Warehouse
b) Cloud Data Lake
6) Bibliography
1) Cloud Data Warehouse
Data warehouses store vast amounts of data that are queried in an ad hoc way to support decision making. Because a data warehouse needs a large amount of data, and that data must reside on the same server or the same network, hosting one on premises is a significant and costly problem for companies. Cloud computing can solve this problem. It provides three major service models: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). The cloud offers other services as well, one of which is the cloud data warehouse. A cloud data warehouse keeps all of its data in the cloud, which makes it easy for companies to store and retrieve data. Some cloud data warehouses are built on top of Hadoop. Amazon Redshift is one example of a cloud data warehouse.
a) Characteristics of Cloud Data Warehouse
The main characteristics of a cloud data warehouse:
1. It is easy to use
2. It is cost effective
3. It offers fast query performance
4. It does not require database administration
5. It supports data integration
6. It works with SQL tools
2) Cloud Data Lake
A data lake can be pictured as a lake of data used to store raw data. The volume of data to be stored grows day by day, and the storage architecture must be changed whenever necessary, which is a problem for companies. Data lakes are used to solve this problem. A data lake can be implemented in several ways depending on its architecture: on-premises data lake, cloud data lake, hybrid data lake, and multi-cloud data lake. A cloud data lake is data storage in the cloud. It is easily accessible and is available immediately after it is created. It provides services such as data ingestion, processing, and storage. A data lake built on Amazon S3 is one example of a cloud data lake.
a) Characteristics of Cloud Data Lake
The main characteristics of a cloud data lake:
1. It can store enormous amounts of data and it is flexible
2. It is cost effective
3. It has centralized storage
4. It is user friendly
5. It provides scalability of data
6. It protects its data
3) Differences between Cloud Data Warehouse and Cloud Data Lake
The first major difference between a data lake and a data warehouse is the data structure and how the data are stored. Data warehouses store processed, structured data, such as financial transactions, so that a business can begin using the data immediately. Data lakes, on the other hand, store raw or unstructured data, such as large amounts of media, which are difficult to organize. The way data are processed also differs. “In a data warehouse, data is organized, defined, and metadata is applied before the data is written and stored. This process is called ‘schema on write’.” Conversely, a data lake consumes all raw data and saves it exactly as it was extracted from the data source, an approach called ‘schema on read’: structure is applied only when the data are read.
Because data warehouses are expensive, data engineers keep only the data that can be used to solve business problems, while the inexpensive data lake stores any and all data for more comprehensive access. The people who interact with the data vary as well. Typically, data analysts and business professionals use data warehouses to solve business problems, whereas data scientists and data engineers use and maintain data lakes because of their complexity. Since data warehouses are more structured, changes are more difficult and costly to make. In contrast, a data lake imposes no fixed structure, so it is more accessible and flexible, and changes can be made easily and quickly.
The general industry consensus is that data warehouses and data lakes are not mutually exclusive. The two solutions can be complementary, depending on the situation: data warehouses address the need for structured data to be immediately available for querying to solve business problems, while data lakes suit big data scenarios where raw data are needed for machine learning.
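The contrast between ‘schema on write’ and ‘schema on read’ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema, the field names, and the in-memory lists standing in for a warehouse table and a lake are all hypothetical and do not correspond to any particular product.

```python
import json

# Hypothetical warehouse schema: column name -> required Python type.
WAREHOUSE_SCHEMA = {"txn_id": int, "amount": float, "currency": str}


def write_to_warehouse(table, record):
    """Schema on write: validate the record before it is stored."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record columns do not match the warehouse schema")
    for column, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"column {column!r} has the wrong type")
    table.append(record)


def write_to_lake(lake, raw_payload):
    """The lake stores the raw payload exactly as received, with no checks."""
    lake.append(raw_payload)


def read_from_lake(lake, fields):
    """Schema on read: parse and project the fields only at query time."""
    rows = []
    for raw_payload in lake:
        try:
            document = json.loads(raw_payload)
        except json.JSONDecodeError:
            continue  # payloads that cannot be parsed are skipped, not rejected
        if isinstance(document, dict) and all(f in document for f in fields):
            rows.append({f: document[f] for f in fields})
    return rows
```

In this sketch the warehouse refuses a malformed record at write time, while the lake accepts anything and leaves the cost of interpreting the data to whoever reads it later.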
4) Data Catalog
A data catalog is an inventory of all available data, built from metadata and search tools. It can be built by tagging data manually; however, AI and machine learning can also be used to automate the discovery of datasets. A data catalog improves the efficiency of data use and analysis by making the search process quick and easy. Data catalogs are especially beneficial for data lakes, where the data are so large and unstructured. Before a data scientist can begin any kind of analysis, it helps to know what data are available. Tagging datasets with metadata makes large queries more efficient.
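A minimal sketch of the idea, assuming a purely in-memory catalog: each dataset is registered with a location and a set of tags, and a search returns the datasets that carry every requested tag. The class names, dataset names, and paths below are hypothetical and are not taken from any catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset; the data itself stays in the lake."""
    name: str
    location: str
    tags: set = field(default_factory=set)


class DataCatalog:
    """Hypothetical in-memory catalog keyed by dataset name."""

    def __init__(self):
        self._entries = {}

    def register(self, name, location, tags):
        # Tags are normalized to lower case so searches are case-insensitive.
        self._entries[name] = CatalogEntry(name, location, {t.lower() for t in tags})

    def search(self, *tags):
        """Return the names of datasets that carry all of the given tags."""
        wanted = {t.lower() for t in tags}
        return sorted(e.name for e in self._entries.values() if wanted <= e.tags)


catalog = DataCatalog()
catalog.register("clickstream_raw", "lake/clicks/", ["web", "raw", "Behavior"])
catalog.register("orders", "lake/orders/", ["sales", "raw"])
matches = catalog.search("raw", "web")
```

The search touches only the metadata, which is why a catalog lets an analyst find candidate datasets without scanning the lake itself.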
5) Real World Cases
a) Cloud Data Warehouse
A real-world implementation of a cloud data warehouse is Amazon Redshift. Amazon Redshift can use SQL to query exabytes of structured, semi-structured, and unstructured data across sources such as data lakes, operational data stores, and data warehouses, and it integrates with big data analytics and machine learning solutions. It is a fully managed, petabyte-scale data warehouse that processes queries in parallel. Using existing business intelligence (BI) tools and standard SQL, all of the data can be processed quickly, easily, and economically. Amazon Redshift is among the most widely used cloud data warehouses. One example of its use is a company that relies heavily on it as the data warehouse for its analytical operations and has embraced the opportunities and simplicity it offers. The company uses Amazon Redshift mostly for BI purposes, to store and analyze user behavioral data. Over the past few months, the data volume has grown by hundreds of gigabytes per day, and throughout the workday personnel from several departments routinely run queries against the Amazon Redshift cluster from their BI platform. The company runs its four most crucial analytics workloads on a single Amazon Redshift cluster because certain data is used by all of them:
1. Multiple BI queries are executed for analytical purposes throughout the day.
2. Routine Extract-Transform-Load (ETL) procedures are executed at the start of each hour for a few minutes.
3. A few ETL jobs are scheduled to run offline each day to generate daily reports.
4. A few ETL jobs are executed weekly to generate a weekly data dump for different departments.
Implementing the cloud data warehouse in Redshift enables all of the above processes to run in parallel.
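The ETL jobs and BI queries described above can be illustrated with a small sketch. It does not use Amazon Redshift: as an assumption made for the sake of a self-contained example, Python's built-in sqlite3 module stands in for the warehouse, and the event rows, table name, and column names are hypothetical.

```python
import sqlite3

# Hypothetical raw behavioral events: (user id, action, day).
RAW_EVENTS = [
    ("u1", " Click ", "2022-10-01"),
    ("u2", "view", "2022-10-01"),
    ("u1", "CLICK", "2022-10-02"),
    ("", "view", "2022-10-02"),  # no user id; dropped during transform
]


def extract():
    """Extract: read the raw events from the (hypothetical) source."""
    return list(RAW_EVENTS)


def transform(rows):
    """Transform: normalize action names and drop rows without a user id."""
    return [(user, action.strip().lower(), day) for user, action, day in rows if user]


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT, day TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


def daily_report(conn):
    """BI-style query: number of events per day and action."""
    cursor = conn.execute(
        "SELECT day, action, COUNT(*) FROM events "
        "GROUP BY day, action ORDER BY day, action"
    )
    return cursor.fetchall()


def run_pipeline():
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
    load(conn, transform(extract()))
    return daily_report(conn)
```

In a real deployment the hourly, daily, and weekly jobs would be started by a scheduler and would run against the shared cluster; the sketch only shows the shape of a single ETL run followed by a reporting query.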
b) Cloud Data Lake
A data lake on Amazon S3 is a well-established implementation of a cloud data lake. It uses a secure, durable, and scalable architecture to store and catalog datasets of any size in their original format. Users can connect AWS Glue and Amazon Athena to transform and analyze the available datasets. Because AWS Glue crawls the data sources, identifies the data types, and then suggests schemas and transformations, users do not need to spend time developing data flows by hand. Additionally, user-defined tags are stored in Amazon DynamoDB to give each dataset the business context it needs. The solution helps companies create simple governance rules that require specific tags when datasets are submitted to the data lake. By browsing the available datasets or searching on dataset attributes and tags, users can quickly find and access the information relevant to their business objectives.
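The tag rule described above can be sketched as follows. This is a hedged illustration only: it does not call any AWS service, the plain dictionary stands in for the tag store, and the required tag names, dataset names, and function names are hypothetical.

```python
# Hypothetical governance rule: every submitted dataset must carry these tags.
REQUIRED_TAGS = {"owner", "department"}


class TagPolicyError(ValueError):
    """Raised when a dataset is submitted without the required tags."""


def submit_dataset(registry, name, tags):
    """Register a dataset only if it carries every required tag."""
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        raise TagPolicyError(
            f"dataset {name!r} is missing required tags: {sorted(missing)}"
        )
    registry[name] = dict(tags)


def find_datasets(registry, **criteria):
    """Return the names of datasets whose tags match all given key/value pairs."""
    return sorted(
        name
        for name, tags in registry.items()
        if all(tags.get(key) == value for key, value in criteria.items())
    )


registry = {}
submit_dataset(registry, "q3_sales", {"owner": "finance-team", "department": "sales"})
submit_dataset(registry, "web_logs", {"owner": "platform", "department": "engineering"})
sales_datasets = find_datasets(registry, department="sales")
```

Rejecting untagged submissions at the point of entry is what keeps later searches reliable, since every dataset in the registry is guaranteed to answer the same basic questions.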
6) Bibliography
1. Modeling a secure cloud data warehouse with SoaML. (2015, December 1). IEEE Conference Publication | IEEE Xplore. https://ieeexplore.ieee.org/abstract/document/7492745
2. Prospects for Using Cloud Data Warehouses in Information Systems. (2018, September 1). IEEE Conference Publication | IEEE Xplore. https://ieeexplore.ieee.org/document/8526745
3. A Data Warehouse Approach for Business Intelligence. (2019, June 1). IEEE Conference Publication | IEEE Xplore. https://ieeexplore.ieee.org/document/8795395
4. Cloud DATA LAKE: The new trend of data storage. (2021, June 11). IEEE Conference Publication | IEEE Xplore. https://ieeexplore.ieee.org/document/9461293
5. Data Lake vs Data Warehouse: 6 key differences. Qlik. (n.d.). Retrieved October 24, 2022, from https://www.qlik.com/us/data-lake/data-lake-vs-data-warehouse
6. Data Lake VS Data Warehouse: Key differences. Talend. (n.d.). Retrieved October 24, 2022, from https://www.talend.com/resources/data-lake-vs-data-warehouse/#:~:text=Data%20lakes%20and%20data%20warehouses,processed%20for%20a%20specific%20purpose.
7. Data Lake vs. Data Warehouse - working together in the cloud. Panoply. (n.d.). Retrieved October 24, 2022, from https://panoply.io/data-warehouse-guide/data-warehouse-vs-data-lake/
8. Hazel, T. (2022, September 6). Data Lake vs Data Warehouse: Which is right for you? ChaosSearch. Retrieved October 24, 2022, from https://www.chaossearch.io/blog/data-lake-vs-data-warehouse
9. Wells, D. (2022, September 19). What is a data catalog? - importance, Benefits & Features. Alation. Retrieved October 24, 2022, from https://www.alation.com/blog/what-is-a-data-catalog/
10. From centralized architecture to decentralized architecture: How data sharing fine-tunes Amazon Redshift workloads. (2022, August 16). Amazon Web Services. https://aws.amazon.com/blogs/big-data/from-centralized-architecture-to-decentralized-architecture-how-data-sharing-fine-tunes-amazon-redshift-workloads/
11. Architecture Overview - Data Lake on AWS. (n.d.). Retrieved October 24, 2022, from https://docs.aws.amazon.com/solutions/latest/data-lake-solution/architecture.html