Lecture 4: (de)Normalization & Data Warehousing#

Gittu George, January 12 2023

Announcements#

  • Office Hours

    • Today, 12:30 - 1:30 PM Zac (DL 005)

    • Today, 5 - 6 PM Ngoc

    • Tommorow 2 - 3 PM Daniel

  • Don’t forget upcoming deadlines

    • Late submissions won’t be accepted by any chance

    • If you need concession, please reach out to the course admin

  • Late by more than 5 min - I won’t accept your attendance.

  • Lids open doesn’t mean you do something else there.

Todays Agenda#

  • WHAT is a Data Warehouse?

  • WHY Data Warehouse?

  • HOW do we build a data warehouse?

  • Differences

  • Wrap up

Learning objectives#

  • Understanding of data warehouse.

  • Building a data warehouse.

  • Connecting our learning from previous classes.

Prerequisites#

  • Before we talk about data warehousing we need to clearly understand about views;

    • What are views ?

    • Why do we want to use views?

    • Different types of views (and when to use what ?)

    • How to create views ?

Refresher…

Please check here for more details on views.

  • If you see in the heading I mentioned about de-normalization, it is the opposite of normalization. But what is normalization? Please check this refresher to refresh your understanding of normalization or your BAIT 507 notes.

***NOTE:***This will be our last class talking about relational databases.

WHAT is a Data Warehouse?#

  • A Data Warehouse is a Data Management System

  • Data Analysts and BI Experts query a front end

  • Data is ingested in the back-end from multiple sources

  • A Data Warehouse simplifies data access and analysis.

  • Its indeed data stored in a database but in a denormalized fashion

It’s generally a process,

../_images/warehouse.png
  • Data moves from data sources to a data warehouse and specialized data marts.

  • Users access information from the data warehouse/data mart.

  • Users access data through applications.

When building a data warehouse, we’re taking data from multiple sources and joining it in a comprehensive way to support multiple purposes, from reports and app-based analytics to data mining and machine learning applications.

How do you think a data warehouse may look like?

General structure of Data Warehouse#

Reciept story#

../_images/reciept.png

Image source: wikiMedia

Fact Table: A table that contains the measurements of a business process of interest (in red - above diagram). Dimension Table(s): Tables that provide context for the fact table (in purple - above diagram).

Let’s look at a real world example,

../_images/warehouseeg.png

Source: ibm,oracle

A Data Warehouse results from a process that takes data and transforms it into a “fact table” that a range of users can access. These facts are represented in a manner that is suited to analysis & use by a broad set of users. In addition, it powers up BI Applications like PowerBI or Tableau.

WHY#

  • Data Can Be Complex

  • Multiple Sources

  • Multiple Uses

  • Multiple Users

  • Varied Skill Levels

  • Changing Data & Data Needs

Goals of Data Warehousing & BI#

The Data Warehouse system must:

  • Make information easily accessible

  • Present information consistently

  • Adapt to change

  • Present information in a timely way

  • Be secure

  • Be authoritative & trustworthy

  • Be accepted by end-users

So How Do We Meet Expectations?#

We have this dimensional model influenced by the business requirements & data realities.

The Data Warehouse Design must be collaborative for the expectations to be met. Some parties include;

  • IT Managers

  • Executives

  • BI Analysts

  • Department Members

  • App Designers

../_images/reality.png

How Do We Build a Data Warehouse?#

You are already in the process of building an enterprise data warehouse. Here are the steps that you already did;

  • You identified the data sources

  • You performed data cleaning and normalization to load the data properly into the database

  • You are already doing some analysis on the data

Even though you are already doing some joins to get to the analysis, we haven’t properly identified the facts and dimensions for denormalizing the data to one flat table. So let’s go through our Twitter data to see how this process looks like.

in our Twitter data, eg….#

  • What is the question?

  • Data we got in hand?

  • What Facts Do We Need?

  • What are Facts? What are the Dimensions of those facts?

A Data Warehouse is a type of database, but it conforms to a different data model. It’s focused on a Fact Table with multiple dimensions.

A Data Warehouse is part of a workflow that gets us from raw data (directly from a business process like sales, purchases, quality assessment metrics, real estate holdings, or some other measure) to a dataset easily queried and modeled as part of an analytics team.

Thinking of our Twitter data, we had a central question that asked if daily cashtag volume was positively correlated to stock price changes. So, we took raw tweet data, stock symbol, and price data from multiple sources (Twitter, NYSE, Yahoo! Finance), came up with a data model and undertook data cleaning.

If we’re interested in understanding relationships between the tweets and changes in price, we might want to look at some different analyses. For example:

  • Sentiment analysis on tweets

  • Presence of key terms (“sell” or “buy”)

  • Changes in price

  • Analysis by individual ticker

  • Aggregation at different scales (week, month)

If we do that, we would want to work on a single table to undertake our analysis, so we’re assured that there is consistency in our analysis.

Here is the query we might use:

SELECT sy.nasdaqsymbol,
	   sy.securityname,
       cl.tweet,
	   cl.userid,
	   CASE WHEN rp.replyid IS NULL THEN 'FALSE' 
	                                ELSE 'TRUE' END AS reply,
	   CASE WHEN rt.retweetid IS NULL THEN 'FALSE' 
	                                ELSE 'TRUE' END AS retweet,
	   ROUND(sv.open::numeric,2) AS open,
	   ROUND(sv.close::numeric, 2) AS close,
	   ROUND(sv.close::numeric, 2) - ROUND(sv.open::numeric,2) AS change,
	   date_trunc('day', cl.createdate)::date AS day
FROM          tweets.cashtags AS ct
INNER JOIN tweets.cleantweets AS cl ON cl.tweetid = ct.tweetid
INNER JOIN     tweets.symbols AS sy ON sy.symbolid = ct.symbolid
INNER JOIN tweets.stockvalues AS sv ON sv.symbolid = sy.symbolid AND sv.date = date_trunc('day', cl.createdate)::date
LEFT JOIN      tweets.replies AS rp ON rp.tweetid = cl.tweetid
LEFT JOIN     tweets.retweets AS rt ON rt.tweetid = cl.tweetid;

We’re putting back the columns together and making some decisions about how we treat the data; for example, we’re just adding a boolean (TRUE/FALSE) as to whether something is tweeted or retweeted (using the CASE WHEN statements).

To create the wide table, given that this is essentially a derived table, we’re going to use a MATERIALIZED VIEW. This means that the output is committed to the disk, but it results from a query, so it’s not something we’re going to update directly.

So our call would look something like this:

import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import json
import urllib.parse

%load_ext sql
%config SqlMagic.displaylimit = 20

with open('credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
%sql postgresql://{username}:{password}@{host}:{port}/postgres
'Connected: postgres@postgres'
%%time
%%sql
CREATE MATERIALIZED VIEW if not exists tweets.dtwide AS
SELECT sy.nasdaqsymbol,
	   sy.securityname,
       cl.tweet,
	   cl.userid,
	   CASE WHEN rp.replyid IS NULL THEN 'FALSE' 
	                                ELSE 'TRUE' END AS reply,
	   CASE WHEN rt.retweetid IS NULL THEN 'FALSE' 
	                                ELSE 'TRUE' END AS retweet,
	   ROUND(sv.open::numeric,2) AS open,
	   ROUND(sv.close::numeric, 2) AS close,
	   ROUND(sv.close::numeric, 2) - ROUND(sv.open::numeric,2) AS change,
	   date_trunc('day', cl.createdate)::date AS day
FROM          tweets.cashtags AS ct
INNER JOIN tweets.cleantweets AS cl ON  cl.tweetid = ct.tweetid
INNER JOIN     tweets.symbols AS sy ON sy.symbolid = ct.symbolid
INNER JOIN tweets.stockvalues AS sv ON sv.symbolid = sy.symbolid AND sv.date = date_trunc('day', cl.createdate)::date
LEFT JOIN      tweets.replies AS rp ON  rp.tweetid = cl.tweetid
LEFT JOIN     tweets.retweets AS rt ON  rt.tweetid = cl.tweetid;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
Done.
CPU times: user 7.76 ms, sys: 775 µs, total: 8.53 ms
Wall time: 163 ms
[]

Look at the “Materialized View” tab in pgAdmin, you’ll see the view there.
I am putting a snapshot of this table on how it looks from pgadmin.

../_images/mat1.png ../_images/mat2.png

Here are the rows if you check from here using %%sql

%%sql
select * from tweets.dtwide limit 5;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
5 rows affected.
nasdaqsymbol securityname security_ts tweet tweet_ts userid reply retweet open close change day
A Agilent Technologies, Inc. Common Stock 'agilent':1 'common':4 'inc':3 'stock':5 'technologies':2 Agilent Technologies Inc $A Shares Bought by Toronto Dominion Bank https://t.co/YUjEefSgrB #stocks '/yujeefsgrb':13 'a':4 'agilent':1 'bank':10 'bought':6 'by':7 'dominion':9 'inc':3 'shares':5 'stocks':14 't.co':12 't.co/yujeefsgrb':11 'technologies':2 'toronto':8 2348716116 FALSE FALSE 53.97 53.92 -0.05 2017-05-18
A Agilent Technologies, Inc. Common Stock 'agilent':1 'common':4 'inc':3 'stock':5 'technologies':2 Parametric Portfolio Associates LLC Reduces Stake in Agilent Technologies Inc $A https://t.co/pT9mV2xkRB '/pt9mv2xkrb':14 'a':11 'agilent':8 'associates':3 'in':7 'inc':10 'llc':4 'parametric':1 'portfolio':2 'reduces':5 'stake':6 't.co':13 't.co/pt9mv2xkrb':12 'technologies':9 849417718642159616 FALSE FALSE 53.97 53.92 -0.05 2017-05-18
A Agilent Technologies, Inc. Common Stock 'agilent':1 'common':4 'inc':3 'stock':5 'technologies':2 @Reforma @lopezobrador_ J$U $A $N Z $E $P $E $D $A ES CUATE DEL P $R $Í $. LA LANA MANDA ?????? 'a':5,12 'cuate':14 'd':11 'del':15 'e':8,10 'es':13 'j':3 'la':19 'lana':20 'lopezobrador':2 'manda':21 'n':6 'p':9,16 'r':17 'reforma':1 'u':4 'z':7 'í':18 842521275411173376 TRUE FALSE 53.97 53.92 -0.05 2017-05-18
A Agilent Technologies, Inc. Common Stock 'agilent':1 'common':4 'inc':3 'stock':5 'technologies':2 @Reforma @lopezobrador_ J$U $A $N Z $E $P $E $D $A ES CUATE DEL P $R $Í $. LA LANA MANDA ?????? 'a':5,12 'cuate':14 'd':11 'del':15 'e':8,10 'es':13 'j':3 'la':19 'lana':20 'lopezobrador':2 'manda':21 'n':6 'p':9,16 'r':17 'reforma':1 'u':4 'z':7 'í':18 842521275411173376 TRUE FALSE 53.97 53.92 -0.05 2017-05-18
A Agilent Technologies, Inc. Common Stock 'agilent':1 'common':4 'inc':3 'stock':5 'technologies':2 Agilent Technologies Inc $A Shares Bought by Toronto Dominion Bank https://t.co/tuuWP0ZJf0 '/tuuwp0zjf0':13 'a':4 'agilent':1 'bank':10 'bought':6 'by':7 'dominion':9 'inc':3 'shares':5 't.co':12 't.co/tuuwp0zjf0':11 'technologies':2 'toronto':8 4816109896 FALSE FALSE 53.97 53.92 -0.05 2017-05-18

It’s a table with three million rows, so we may want to add some indices to the table to speed up our query speed - This is based on the types of questions (queries) that users using this data warehouse will mostly be asking.

For example the only question that I am interested in asking is “The relationship between the number of tweets that are not retweets in a day and the proportional change in stock value”

First, let me drop any indexes that I made previously. (This cell is here just so that the index gets deleted when you run this notebook a second time)

%%sql 
drop index if exists tweets.symbolhash_idx;
drop index if exists tweets.nasdaqsymbolhash_idx;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
Done.
Done.
[]

We can directly query our own mini-data warehouse using SQL queries. For example,

The following would give us the relationship between the number of tweets (that are not retweets) in a day and the proportional change in stock value. All of this work then allows us to simplify our workflow and avoid all joins that take lots of time:

%%time
%%sql
explain analyze SELECT nasdaqsymbol, ROUND(change / open, 2), COUNT(*) AS tweets
               FROM tweets.dtwide 
        WHERE retweet = 'FALSE' AND nasdaqsymbol = 'BABA'
        GROUP BY day, nasdaqsymbol, change, open;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
17 rows affected.
CPU times: user 8.98 ms, sys: 4.76 ms, total: 13.7 ms
Wall time: 14.3 s
QUERY PLAN
Finalize GroupAggregate (cost=274740.72..278260.52 rows=25870 width=58) (actual time=14128.303..14135.874 rows=81 loops=1)
  Group Key: day, nasdaqsymbol, change, open
  -> Gather Merge (cost=274740.72..277593.70 rows=22302 width=26) (actual time=14128.179..14135.715 rows=154 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        -> Partial GroupAggregate (cost=273740.69..274019.47 rows=11151 width=26) (actual time=14122.235..14128.498 rows=51 loops=3)
              Group Key: day, nasdaqsymbol, change, open
              -> Sort (cost=273740.69..273768.57 rows=11151 width=18) (actual time=14122.207..14122.912 rows=8276 loops=3)
                    Sort Key: day, change, open
                    Sort Method: quicksort Memory: 825kB
                    Worker 0: Sort Method: quicksort Memory: 1064kB
                    Worker 1: Sort Method: quicksort Memory: 819kB
                    -> Parallel Seq Scan on dtwide (cost=0.00..272991.08 rows=11151 width=18) (actual time=154.517..14115.237 rows=8276 loops=3)
                          Filter: ((retweet = 'FALSE'::text) AND ((nasdaqsymbol)::text = 'BABA'::text))
                          Rows Removed by Filter: 1171408
Planning Time: 0.257 ms
Execution Time: 14136.084 ms

Wohooo, that was fast, my mini warehouse is helping me, but can we go faster? I see that I am using a condition on retweet and nasdaqsymbol, and I want to execute this query multiple times as it is very important to my business. So I need to think about indexing these 2 columns. Once I have that decision on what kind of indexing makes more sense for those columns.

The nasdaqsymbol and retweet columns could either use a B-TREE index or a HASH index. A HASH index is likely more appropriate if the ratio of unique symbols to total rows is very high (there are few unique symbols and many rows). However, remember that a hash index is only useful for equality. Are we likely to want to analyze stock symbols based on their letter configuration? (e.g., all stocks that start with ‘A’), or is it more likely that we’re just looking at matches?

In our situation, we are interested in exact matches for the stocks we are interested in. In this case, we are interested in nasdaqsymbol to be BABA (I might also be interested in seeing stocks for other companies) and retweets to be FALSE.

%sql CREATE INDEX if not exists nasdaqsymbolhash_idx ON tweets.dtwide USING hash(nasdaqsymbol);
#CREATE INDEX symbolhash_idx ON tweets.dtwide USING hash(retweet);
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
Done.
[]

After indexing, let’s look at the time it took to execute our query.

%%time
%%sql
explain analyze SELECT nasdaqsymbol, ROUND(change / open, 2), COUNT(*) AS tweets
               FROM tweets.dtwide 
        WHERE retweet = 'FALSE' AND nasdaqsymbol = 'BABA'
        GROUP BY day, nasdaqsymbol, change, open;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
12 rows affected.
CPU times: user 4.8 ms, sys: 847 µs, total: 5.65 ms
Wall time: 86.4 ms
QUERY PLAN
HashAggregate (cost=79517.48..79905.53 rows=25870 width=58) (actual time=21.353..21.520 rows=81 loops=1)
  Group Key: day, nasdaqsymbol, change, open
  Batches: 1 Memory Usage: 801kB
  -> Bitmap Heap Scan on dtwide (cost=821.30..79182.95 rows=26763 width=18) (actual time=1.362..11.750 rows=24828 loops=1)
        Recheck Cond: ((nasdaqsymbol)::text = 'BABA'::text)
        Filter: (retweet = 'FALSE'::text)
        Rows Removed by Filter: 358
        Heap Blocks: exact=2376
        -> Bitmap Index Scan on nasdaqsymbolhash_idx (cost=0.00..814.61 rows=27015 width=0) (actual time=1.050..1.051 rows=25186 loops=1)
              Index Cond: ((nasdaqsymbol)::text = 'BABA'::text)
Planning Time: 0.157 ms
Execution Time: 21.992 ms

Excellent! We improved the speed of the query execution.

I plan to use Plotly for visualization, so I am installing it. You might have learned visualization in python using some other packages - please feel free to install those and use those for your assignments (like altair)

! conda install -y plotly
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done
Solving environment: / - \ | / - \ | / - \ | / - \ | done

## Package Plan ##

  environment location: /Users/ggeorg02/opt/miniconda3/envs/bait580env

  added / updated specs:
    - plotly


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-3.0.7              |       hfd90126_2         2.2 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.2 MB

The following packages will be UPDATED:

  openssl                                  3.0.7-hfd90126_1 --> 3.0.7-hfd90126_2 



Downloading and Extracting Packages

Preparing transaction: - done
Verifying transaction: | done
Executing transaction: - done
%%sql query_output <<

SELECT nasdaqsymbol, ROUND(change / open, 2), COUNT(*) AS tweets
               FROM tweets.dtwide 
        WHERE retweet = 'FALSE' AND nasdaqsymbol = 'BABA'
        GROUP BY day, nasdaqsymbol, change, open;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
81 rows affected.
Returning data to local variable query_output
import plotly.express as px
import pandas as pd
fig = px.scatter(query_output, 
              x='tweets',
              y='round',
              title='Stock value change by tweet count.', 
              hover_name='nasdaqsymbol')
fig.show()

Let me experiment the thread across all the stock tickers;

%%sql query_output <<

SELECT nasdaqsymbol, ROUND(change / open, 2), COUNT(*) AS tweets
               FROM tweets.dtwide 
        WHERE retweet = 'FALSE'
        GROUP BY day, nasdaqsymbol, change, open;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
174546 rows affected.
Returning data to local variable query_output
import plotly.express as px
import pandas as pd
fig = px.scatter(query_output, 
              x='tweets',
              y='round',
              title='Stock value change by tweet count.', 
              hover_name='nasdaqsymbol')
fig.show()

Differences#

DW vs RMDBS#

Database

Data Warehouse

Concerned with data security and durability

Concerned with broad access for analysis

Highly normalized to ensure referential integrity & reduce update/insertion anomalies.

Denormalized to ensure ease of access

Designed to support data ingestion.

they are only updated by the system

Data can go in and out

User data goes out, not in

ETL vs ELT#

ETL

ELT

Extract, Transform and Load

Extract, Load and Transform

Data is loaded after performing transformation

Data first loaded and then transformed

Mostly used for structured data

Mostly used for unstructured data

SQL based databases

NoSQL based systems

For building Data Warehouses

For building data lakes

OLAP vs OLTP#

OLAP

OLTP

Online analytical processing

Online transactional processing

Focus on Insert, Update, Delete for databases

Focuses on extract data for analyzing

uses the data warehouse

uses the database

BI vs DW#

Data warehouse powers up BI. (Business intelligence)

Can you#

  • Distinguish two different types of views and when to use what?

  • Define what is data warehousing and why we want it?

  • How are data warehouses built?

  • Different terms used in the industry that are related to data warehousing?

Class activity#

Note

Check worksheet 4 in canvas for detailed instructions on activity and submission instructions.

  • Identify a problem and build a mini data warehouse to answer that question.