Lecture 2: Setting up RDS and connection to cloud#

Gittu George, January 5 2023

Todays Agenda#

Connecting to RDS (Postgres)

Learning objectives#

Setup your first cloud database (RDS instance)
You will know how to connect to a remote database from a Jupyter notebook.
Different ways in interacting with the database from jupyter notebook.
Showing different ways of loading the data.
You will use queries to bring data into a Jupyter notebook and provide a simple data analysis.

Setting up RDS#

Demo time

I recommend you do this with me or during the class activity time. Here are the instructions on setting up your first RDS instance.

Consider your database what you built as an independent entity. Then, follow below instructions to connect to your database in the cloud.

Connecting To Cloud (RDS)#

If you haven’t yet set it up, please follow instructions in Setting up RDS

OKAY.. so now you have your database ready! In our case, we set up our Postgres database in RDS. Since you are not using an already existing database, you created it, so it’s empty now. So we need to load data into this database. After loading data to the database, we are all set with querying from the database.

Whether it’s for loading or querying the database, we need to connect to the database.

Well, we have a variety of options to run our SQL statements in PostgreSQL (You already aware of couple of options from your BAIT 507) :

pgAdmin is the official web-based GUI for interacting with PostgreSQL databases
psql is PostgreSQL’s interactive command-line interface
%sql and %%sql magic commands in Jupyter notebooks, which are provided by the ipython-sql package
psycopg2 is the official Python adapter for PostgreSQL databases
Using .read_sql_query() method in Pandas

First thing first#

What information do you want to know if you’re going to connect to the above? We need to know the information of the database that you wanted to connect to

Hostname
Port number
User name (need to make sure that this user have access to the database)
Password (Ofcourse !! you need to know the password for this user)

Now let’s think about other sources you want to connect to the database. Here are a few that I thought of

From a programming language (java, R, python, etc..)
From your jupyter notebook
From tableau
From excel
From R Studio
OR from any other application that you are using, e.g., your banking website

Besides knowing the hostname, port name, username, and password, we also need to install an Application programming interface (API) that helps the applications on the client-side communicate with the server-side (in our case database). Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are 2 of such APIs that are provided by the database vendor (Postgres, Oracle, MySQL, etc..) to access their database.

JDBC is language-specific, and its the standard interface between any java application and database. JDBC converts the java programmer’s requests to the database to something that the database understands (you can think of it as a translator). Meanwhile, ODBC is not language-specific, so it can convert requests from any application/software/programming language to something the database understands. Here is a diagram that will help you understand how ODBC works.

In the rest of the section, let’s focus on connecting to the database from python and the jupyter notebook. We will be using specific packages to make this possible, but as we have discussed, it uses ODBC for the connection. Many packages make this possible, but we will discuss a package named psycopg2.

pgAdmin#

Checkout

Please check here for more details on setting up pgadmin and connecting to your database.

I like to use a Shift + Enter keyboard shortcut to run my queries. You can configure this too by going Preferences -> Query Tool -> Keyboard shortcuts -> Change “Execute query” to Shift + Enter.

psql#

Checkout

Please check here for more details on setting up psql and connecting to your database.

Note

You can run SQL commands in pgadmin and/or jupyter notebook. In our situation, we need to install psql mainly to load the dumps.

`ipython-sql` (`%sql` and `%%sql`)#

ipython-sql is a package that enables us to run SQL statements right from a Jupyter notebook. You should install this package in your conda environment ( you probabily have it if you used the yml file to setup your environment). In order to use it, we should load it first:

%load_ext sql

Now we need the host address of where the database is stored, along with a username and a password.

Note

There are a couple of ways to deal with packages. In the following section, I am showing you one way of dealing with passwords, and I recommend following approach.

(OPTIONAL) There is also a way to use dotenv package, if you want, you can check out those here

It is always a bad idea to store login information directly in a notebook or code file because of security reasons. Say if you want to share this notebook with your colleague, or if somebody came by and looked at your notebook, they can see your database connection details, including your password. It’s probably okay now as it’s a toy database, but when you start working in the industry, you will be dealing with a database ( or, in other words, business), and it can cause lots of problems if the password is compromised. So, let’s practice not doing the connection call by including the credentials directly into your notebook.

In order to avoid that, we store that kind of information in a separate file, like credentials.json here, and read the username and password into our IPython session:

Note

Make sure that your credentials.json file looks like this:

{
  "host": "<the_end_point>",
  "port": 5432,
  "user": "postgres",
  "password": "<your_password>"
}

This way, you don’t have to worry about exposing your password to anyone.

Caution

Nowhere in assignments/projects should you include the hostname or password. If anything needs to be checked, then TA’s will contact you.

import json
import urllib.parse

with open('credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

Now we can establish the connection to the postgres database using the following code:

%sql postgresql://{username}:{password}@{host}:{port}/postgres

'Connected: postgres@postgres'

Note that we have used the %sql line magic to interpret the line in front of it as a magic command.

We can also use %%sql cell magic to apply the magic to an entire notebook cell.

List the relations (i.e. tables) in the database:

%%sql 
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND 
    schemaname != 'information_schema';

 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
9 rows affected.

schemaname	tablename	tableowner	tablespace	hasindexes	hasrules	hastriggers	rowsecurity
fakedata	testindex	postgres	None	True	False	False	False
tweets	retweets	postgres	None	False	False	True	False
tweets	stockvalues	postgres	None	False	False	True	False
tweets	tweets	postgres	None	True	False	False	False
tweets	symbols	postgres	None	True	False	True	False
tweets	cashtags	postgres	None	False	False	True	False
tweets	cleantweets	postgres	None	True	False	True	False
tweets	users	postgres	None	True	False	True	False
tweets	replies	postgres	None	False	False	True	False

Let’s run some SQL statements now. Let’s retrieve the name and population columns from the country table:

%sql SELECT * FROM tweets.symbols limit 10;

 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
10 rows affected.

symbolid	nasdaqsymbol	securityname
1	A	Agilent Technologies, Inc. Common Stock
2	AA	Alcoa Corporation Common Stock
3	AAA	Listed Funds Trust AAF First Priority CLO Bond ETF
4	AAAU	Goldman Sachs Physical Gold ETF Shares
5	AAC	Ares Acquisition Corporation Class A Ordinary Shares
6	AAC=	Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7	AAC+	Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8	AAIC	Arlington Asset Investment Corp Class A (new)
9	AAIC-B	Arlington Asset Investment Corp 7.00%
10	AAIC-C	Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock

Limiting returned and displayed rows#

As you can see, all rows are returned and displayed by default. This behaviour can be problematic if our table is very large for two reasons:

Retrieving large tables can be slow, and maybe not necessary
Displaying a lot of rows clutters our Jupyter notebook

We can modifying ipython-sql configuration to limit the number of returned and displayed rows. For example, here we change the display limit:

%config SqlMagic.displaylimit = 10

%sql SELECT * FROM tweets.symbols;

 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.

symbolid	nasdaqsymbol	securityname
1	A	Agilent Technologies, Inc. Common Stock
2	AA	Alcoa Corporation Common Stock
3	AAA	Listed Funds Trust AAF First Priority CLO Bond ETF
4	AAAU	Goldman Sachs Physical Gold ETF Shares
5	AAC	Ares Acquisition Corporation Class A Ordinary Shares
6	AAC=	Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7	AAC+	Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8	AAIC	Arlington Asset Investment Corp Class A (new)
9	AAIC-B	Arlington Asset Investment Corp 7.00%
10	AAIC-C	Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock

6121 rows, truncated to displaylimit of 10

Looks good. Let’s apply the magic to an entire cell so that we can break the lines:

%%sql 
SELECT * 
FROM tweets.symbols;

 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.

symbolid	nasdaqsymbol	securityname
1	A	Agilent Technologies, Inc. Common Stock
2	AA	Alcoa Corporation Common Stock
3	AAA	Listed Funds Trust AAF First Priority CLO Bond ETF
4	AAAU	Goldman Sachs Physical Gold ETF Shares
5	AAC	Ares Acquisition Corporation Class A Ordinary Shares
6	AAC=	Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7	AAC+	Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8	AAIC	Arlington Asset Investment Corp Class A (new)
9	AAIC-B	Arlington Asset Investment Corp 7.00%
10	AAIC-C	Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock

6121 rows, truncated to displaylimit of 10

Assigning returned rows to Python variables#

Single line queries:

query_output = %sql SELECT * FROM tweets.symbols;

 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.

Multi-line queries:

%%sql query_output <<

SELECT * 
FROM tweets.symbols;

 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.
Returning data to local variable query_output

query_output

symbolid	nasdaqsymbol	securityname
1	A	Agilent Technologies, Inc. Common Stock
2	AA	Alcoa Corporation Common Stock
3	AAA	Listed Funds Trust AAF First Priority CLO Bond ETF
4	AAAU	Goldman Sachs Physical Gold ETF Shares
5	AAC	Ares Acquisition Corporation Class A Ordinary Shares
6	AAC=	Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7	AAC+	Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8	AAIC	Arlington Asset Investment Corp Class A (new)
9	AAIC-B	Arlington Asset Investment Corp 7.00%
10	AAIC-C	Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock

6121 rows, truncated to displaylimit of 10

Now you can use the query_output dataframe to do any other analysis or visualization that you learned in your other courses.

Using psycopg2 package (OPTIONAL)#

Refer here if you want to know on psycopg2 package and how to use it.

Warning

If you are not using your RDS instance, make sure you shut it down from your AWS console. You can always restart it when working on your assignments/projects. However, your credits are limited, and if those get exhausted, we won’t provide you with more.

Note

Closing your lab doesn’t shut down the RDS instance; you need to shut it down from your AWS console explicitly.

Can you ?#

Setup your database in cloud ?
Recollect various options that are available while setting up ?
connect to your databases from various options (jupyter, tableau, pgadmin etc)that we discussed ?
use good practices in storing passwords and using them from your notebook ?
Load data to your database using dumps ?

Class activity#

Note

Check worksheet 2 in canvas for detailed instructions on activity and submission instructions.

Setup your RDS instance
Load data to your RDS instance using dump
Ensure you installed the necessary packages for this course (bait580env.yml). Refer to Conda installation instructions for more details.

BAIT 580A - Topics in Information Technology Management

Lecture 2: Setting up RDS and connection to cloud

Contents

Lecture 2: Setting up RDS and connection to cloud#

Todays Agenda#

Learning objectives#

Setting up RDS#

Connecting To Cloud (RDS)#

First thing first#

pgAdmin#

psql#

`ipython-sql` (`%sql` and `%%sql`)#

Limiting returned and displayed rows#

Assigning returned rows to Python variables#

Using psycopg2 package (OPTIONAL)#

Can you ?#

Class activity#

BAIT 580A - Topics in Information Technology Management

Lecture 2: Setting up RDS and connection to cloud

Contents

Lecture 2: Setting up RDS and connection to cloud#

Todays Agenda#

Learning objectives#

Setting up RDS#

Connecting To Cloud (RDS)#

First thing first#

pgAdmin#

psql#

ipython-sql (%sql and %%sql)#

Limiting returned and displayed rows#

Assigning returned rows to Python variables#

Using psycopg2 package (OPTIONAL)#

Can you ?#

Class activity#

`ipython-sql` (`%sql` and `%%sql`)#