Lecture 2: Setting up RDS and connection to cloud#

Gittu George, January 5 2023

Todays Agenda#

  • Connecting to RDS (Postgres)

Learning objectives#

  • Setup your first cloud database (RDS instance)

  • You will know how to connect to a remote database from a Jupyter notebook.

  • Different ways in interacting with the database from jupyter notebook.

  • Showing different ways of loading the data.

  • You will use queries to bring data into a Jupyter notebook and provide a simple data analysis.

Setting up RDS#

Demo time

../_images/explore.png

I recommend you do this with me or during the class activity time. Here are the instructions on setting up your first RDS instance.

Consider your database what you built as an independent entity. Then, follow below instructions to connect to your database in the cloud.

Connecting To Cloud (RDS)#

If you haven’t yet set it up, please follow instructions in Setting up RDS

OKAY.. so now you have your database ready! In our case, we set up our Postgres database in RDS. Since you are not using an already existing database, you created it, so it’s empty now. So we need to load data into this database. After loading data to the database, we are all set with querying from the database.

Whether it’s for loading or querying the database, we need to connect to the database.

Well, we have a variety of options to run our SQL statements in PostgreSQL (You already aware of couple of options from your BAIT 507) :

  • pgAdmin is the official web-based GUI for interacting with PostgreSQL databases

  • psql is PostgreSQL’s interactive command-line interface

  • %sql and %%sql magic commands in Jupyter notebooks, which are provided by the ipython-sql package

  • psycopg2 is the official Python adapter for PostgreSQL databases

  • Using .read_sql_query() method in Pandas

First thing first#

What information do you want to know if you’re going to connect to the above? We need to know the information of the database that you wanted to connect to

  • Hostname

  • Port number

  • User name (need to make sure that this user have access to the database)

  • Password (Ofcourse !! you need to know the password for this user)

Now let’s think about other sources you want to connect to the database. Here are a few that I thought of

  • From a programming language (java, R, python, etc..)

  • From your jupyter notebook

  • From tableau

  • From excel

  • From R Studio

  • OR from any other application that you are using, e.g., your banking website

Besides knowing the hostname, port name, username, and password, we also need to install an Application programming interface (API) that helps the applications on the client-side communicate with the server-side (in our case database). Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are 2 of such APIs that are provided by the database vendor (Postgres, Oracle, MySQL, etc..) to access their database.

JDBC is language-specific, and its the standard interface between any java application and database. JDBC converts the java programmer’s requests to the database to something that the database understands (you can think of it as a translator). Meanwhile, ODBC is not language-specific, so it can convert requests from any application/software/programming language to something the database understands. Here is a diagram that will help you understand how ODBC works.

../_images/connection.png

In the rest of the section, let’s focus on connecting to the database from python and the jupyter notebook. We will be using specific packages to make this possible, but as we have discussed, it uses ODBC for the connection. Many packages make this possible, but we will discuss a package named psycopg2.

pgAdmin#

Checkout

Please check here for more details on setting up pgadmin and connecting to your database.

I like to use a Shift + Enter keyboard shortcut to run my queries. You can configure this too by going Preferences -> Query Tool -> Keyboard shortcuts -> Change “Execute query” to Shift + Enter.

psql#

Checkout

Please check here for more details on setting up psql and connecting to your database.

Note

You can run SQL commands in pgadmin and/or jupyter notebook. In our situation, we need to install psql mainly to load the dumps.

ipython-sql (%sql and %%sql)#

ipython-sql is a package that enables us to run SQL statements right from a Jupyter notebook. You should install this package in your conda environment ( you probabily have it if you used the yml file to setup your environment). In order to use it, we should load it first:

%load_ext sql

Now we need the host address of where the database is stored, along with a username and a password.

Note

There are a couple of ways to deal with packages. In the following section, I am showing you one way of dealing with passwords, and I recommend following approach.

(OPTIONAL) There is also a way to use dotenv package, if you want, you can check out those here

It is always a bad idea to store login information directly in a notebook or code file because of security reasons. Say if you want to share this notebook with your colleague, or if somebody came by and looked at your notebook, they can see your database connection details, including your password. It’s probably okay now as it’s a toy database, but when you start working in the industry, you will be dealing with a database ( or, in other words, business), and it can cause lots of problems if the password is compromised. So, let’s practice not doing the connection call by including the credentials directly into your notebook.

In order to avoid that, we store that kind of information in a separate file, like credentials.json here, and read the username and password into our IPython session:

Note

Make sure that your credentials.json file looks like this:

{
  "host": "<the_end_point>",
  "port": 5432,
  "user": "postgres",
  "password": "<your_password>"
}

This way, you don’t have to worry about exposing your password to anyone.

Caution

Nowhere in assignments/projects should you include the hostname or password. If anything needs to be checked, then TA’s will contact you.

import json
import urllib.parse

with open('credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

Now we can establish the connection to the postgres database using the following code:

%sql postgresql://{username}:{password}@{host}:{port}/postgres
'Connected: postgres@postgres'

Note that we have used the %sql line magic to interpret the line in front of it as a magic command.

We can also use %%sql cell magic to apply the magic to an entire notebook cell.

List the relations (i.e. tables) in the database:

%%sql 
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND 
    schemaname != 'information_schema';
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
9 rows affected.
schemaname tablename tableowner tablespace hasindexes hasrules hastriggers rowsecurity
fakedata testindex postgres None True False False False
tweets retweets postgres None False False True False
tweets stockvalues postgres None False False True False
tweets tweets postgres None True False False False
tweets symbols postgres None True False True False
tweets cashtags postgres None False False True False
tweets cleantweets postgres None True False True False
tweets users postgres None True False True False
tweets replies postgres None False False True False

Let’s run some SQL statements now. Let’s retrieve the name and population columns from the country table:

%sql SELECT * FROM tweets.symbols limit 10;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
10 rows affected.
symbolid nasdaqsymbol securityname
1 A Agilent Technologies, Inc. Common Stock
2 AA Alcoa Corporation Common Stock
3 AAA Listed Funds Trust AAF First Priority CLO Bond ETF
4 AAAU Goldman Sachs Physical Gold ETF Shares
5 AAC Ares Acquisition Corporation Class A Ordinary Shares
6 AAC= Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7 AAC+ Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8 AAIC Arlington Asset Investment Corp Class A (new)
9 AAIC-B Arlington Asset Investment Corp 7.00%
10 AAIC-C Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock

Limiting returned and displayed rows#

As you can see, all rows are returned and displayed by default. This behaviour can be problematic if our table is very large for two reasons:

  1. Retrieving large tables can be slow, and maybe not necessary

  2. Displaying a lot of rows clutters our Jupyter notebook

We can modifying ipython-sql configuration to limit the number of returned and displayed rows. For example, here we change the display limit:

%config SqlMagic.displaylimit = 10
%sql SELECT * FROM tweets.symbols;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.
symbolid nasdaqsymbol securityname
1 A Agilent Technologies, Inc. Common Stock
2 AA Alcoa Corporation Common Stock
3 AAA Listed Funds Trust AAF First Priority CLO Bond ETF
4 AAAU Goldman Sachs Physical Gold ETF Shares
5 AAC Ares Acquisition Corporation Class A Ordinary Shares
6 AAC= Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7 AAC+ Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8 AAIC Arlington Asset Investment Corp Class A (new)
9 AAIC-B Arlington Asset Investment Corp 7.00%
10 AAIC-C Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock
6121 rows, truncated to displaylimit of 10

Looks good. Let’s apply the magic to an entire cell so that we can break the lines:

%%sql 
SELECT * 
FROM tweets.symbols;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.
symbolid nasdaqsymbol securityname
1 A Agilent Technologies, Inc. Common Stock
2 AA Alcoa Corporation Common Stock
3 AAA Listed Funds Trust AAF First Priority CLO Bond ETF
4 AAAU Goldman Sachs Physical Gold ETF Shares
5 AAC Ares Acquisition Corporation Class A Ordinary Shares
6 AAC= Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7 AAC+ Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8 AAIC Arlington Asset Investment Corp Class A (new)
9 AAIC-B Arlington Asset Investment Corp 7.00%
10 AAIC-C Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock
6121 rows, truncated to displaylimit of 10

Assigning returned rows to Python variables#

Single line queries:

query_output = %sql SELECT * FROM tweets.symbols;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.

Multi-line queries:

%%sql query_output <<

SELECT * 
FROM tweets.symbols;
 * postgresql://postgres:***@mbandtweet.cumg1xvg68gc.us-west-2.rds.amazonaws.com:5432/postgres
6121 rows affected.
Returning data to local variable query_output
query_output
symbolid nasdaqsymbol securityname
1 A Agilent Technologies, Inc. Common Stock
2 AA Alcoa Corporation Common Stock
3 AAA Listed Funds Trust AAF First Priority CLO Bond ETF
4 AAAU Goldman Sachs Physical Gold ETF Shares
5 AAC Ares Acquisition Corporation Class A Ordinary Shares
6 AAC= Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant
7 AAC+ Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50
8 AAIC Arlington Asset Investment Corp Class A (new)
9 AAIC-B Arlington Asset Investment Corp 7.00%
10 AAIC-C Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock
6121 rows, truncated to displaylimit of 10

Now you can use the query_output dataframe to do any other analysis or visualization that you learned in your other courses.

Using psycopg2 package (OPTIONAL)#

Refer here if you want to know on psycopg2 package and how to use it.

Warning

If you are not using your RDS instance, make sure you shut it down from your AWS console. You can always restart it when working on your assignments/projects. However, your credits are limited, and if those get exhausted, we won’t provide you with more.

Note

Closing your lab doesn’t shut down the RDS instance; you need to shut it down from your AWS console explicitly.

Can you ?#

  • Setup your database in cloud ?

  • Recollect various options that are available while setting up ?

  • connect to your databases from various options (jupyter, tableau, pgadmin etc)that we discussed ?

  • use good practices in storing passwords and using them from your notebook ?

  • Load data to your database using dumps ?

Class activity#

Note

Check worksheet 2 in canvas for detailed instructions on activity and submission instructions.

  • Setup your RDS instance

  • Load data to your RDS instance using dump

  • Ensure you installed the necessary packages for this course (bait580env.yml). Refer to Conda installation instructions for more details.