Lecture 2: Setting up RDS and connection to cloud#
Gittu George, January 11 2024
Todays Agenda#
Connecting to RDS (Postgres)
Learning objectives#
Setup your first cloud database (RDS instance)
You will know how to connect to a remote database from a Jupyter notebook.
Different ways in interacting with the database from jupyter notebook.
Showing different ways of loading the data.
You will use queries to bring data into a Jupyter notebook and provide a simple data analysis.
Setting up RDS#
Demo time
I recommend you do this with me or during the class activity time. Or you can watch the youtube video that I made on setting up your first RDS instance.
Here is the youtube video https://www.youtube.com/watch?v=Wm5zLfpLD8I
Consider the database that you built as an independent entity. Then, follow the below instructions to connect to your database in the cloud.
Connecting To Cloud (RDS)#
If you haven’t yet set it up, please follow instructions in Setting up RDS
OKAY.. so now you have your database ready! In our case, we set up our Postgres database in RDS. Since you are not using an already existing database, you created it, so it’s empty now. So we need to load data into this database. After loading data to the database, we are all set with querying from the database.
Whether it’s for loading or querying the database, we need to connect to the database.
Well, we have a variety of options to run our SQL statements in PostgreSQL (You already aware of couple of options from your BAIT 507) :
We will be using the following 2 approaches in this course
pgAdmin is the official web-based GUI for interacting with PostgreSQL databases
%sql
and%%sql
magic commands in Jupyter notebooks, which are provided by theipython-sql
package
Some other ways you can connect (we won’t be discussing these)
psql
is PostgreSQL’s interactive command-line interfacepsycopg2
is the official Python adapter for PostgreSQL databasesUsing
.read_sql_query()
method in Pandas
First thing first#
What information do you want to know if you’re going to connect to the above? We need to know the information of the database that you wanted to connect to
Hostname
Port number
User name (need to make sure that this user have access to the database)
Password (Ofcourse !! you need to know the password for this user)
Now let’s think about other sources you want to connect to the database. Here are a few that I thought of
From a programming language (java, R, python, etc..)
From your jupyter notebook
From tableau
From excel
From R Studio
OR from any other application that you are using, e.g., your banking website
Besides knowing the hostname, port name, username, and password, we also need to install an Application programming interface (API) that helps the applications on the client-side communicate with the server-side (in our case database). Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are 2 of such APIs that are provided by the database vendor (Postgres, Oracle, MySQL, etc..) to access their database.
JDBC is language-specific, and its the standard interface between any java application and database. JDBC converts the java programmer’s requests to the database to something that the database understands (you can think of it as a translator). Meanwhile, ODBC is not language-specific, so it can convert requests from any application/software/programming language to something the database understands. Here is a diagram that will help you understand how ODBC works.
In the rest of the section, let’s focus on connecting to the database from python and the jupyter notebook. We will be using specific packages to make this possible, but as we have discussed, it uses ODBC for the connection.
pgAdmin#
Important
You use pgadmin to load dumps for worksheet 2 and assignment 1.
Checkout
Please check here for more details on setting up pgadmin and connecting to your database.
I like to use a
Shift + Enter
keyboard shortcut to run my queries. You can configure this too by going Preferences -> Query Tool -> Keyboard shortcuts -> Change “Execute query” toShift + Enter
.
psql#
Note
You can run SQL commands in pgadmin and/or jupyter notebook. We don’t need this for this course.
ipython-sql
(%sql
and %%sql
)#
ipython-sql
is a package that enables us to run SQL statements right from a Jupyter notebook. You should install this package in your conda environment ( you probabily have it if you used the yml file to setup your environment). In order to use it, we should load it first:
%load_ext sql
Now we need the host address of where the database is stored, along with a username and a password.
There are a couple of ways to deal with packages. In the following section, I am showing you one way of dealing with passwords, and I recommend the following approach.
It is always a bad idea to store login information directly in a notebook or code file because of security reasons. Say if you want to share this notebook with your colleague, or if somebody came by and looked at your notebook, they can see your database connection details, including your password. It’s probably okay now as it’s a toy database, but when you start working in the industry, you will be dealing with a database ( or, in other words, business), and it can cause lots of problems if the password is compromised. So, let’s practice not doing the connection call by including the credentials directly into your notebook.
In order to avoid that, we store that kind of information in a separate file, like credentials.json
here, and read the username and password into our IPython session:
Note
Make sure that your credentials.json
file looks like this:
{
"host": "<the_end_point>",
"port": 5432,
"user": "postgres",
"password": "<your_password>"
}
This way, you don’t have to worry about exposing your password to anyone.
Caution
Nowhere in assignments/projects should you include the hostname or password. If anything needs to be checked, then TA’s will contact you.
import json
import urllib.parse
with open('credentials.json') as f:
login = json.load(f)
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
Now we can establish the connection to the postgres
database using the following code:
%sql postgresql://{username}:{password}@{host}:{port}/postgres
'Connected: postgres@postgres'
Note that we have used the %sql
line magic to interpret the line in front of it as a magic command.
We can also use %%sql
cell magic to apply the magic to an entire notebook cell.
List the relations (i.e. tables) in the database:
%%sql
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
schemaname != 'information_schema';
* postgresql://postgres:***@database-2.ch0m6kus6nrp.us-east-1.rds.amazonaws.com:5432/postgres
9 rows affected.
schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity |
---|---|---|---|---|---|---|---|
tweets | stockvalues | postgres | None | False | False | True | False |
fakedata | testindex | postgres | None | True | False | False | False |
tweets | tweets | postgres | None | True | False | False | False |
tweets | symbols | postgres | None | True | False | True | False |
tweets | cashtags | postgres | None | False | False | True | False |
tweets | cleantweets | postgres | None | True | False | True | False |
tweets | users | postgres | None | True | False | True | False |
tweets | replies | postgres | None | False | False | True | False |
tweets | retweets | postgres | None | False | False | True | False |
Let’s run some SQL statements now. Let’s retrieve the name
and population
columns from the country
table:
%sql SELECT * FROM tweets.symbols limit 10;
* postgresql://postgres:***@database-2.ch0m6kus6nrp.us-east-1.rds.amazonaws.com:5432/postgres
10 rows affected.
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
Limiting returned and displayed rows#
As you can see, all rows are returned and displayed by default. This behaviour can be problematic if our table is very large for two reasons:
Retrieving large tables can be slow, and maybe not necessary
Displaying a lot of rows clutters our Jupyter notebook
We can modifying ipython-sql
configuration to limit the number of returned and displayed rows. For example, here we change the display limit:
%config SqlMagic.displaylimit = 10
%sql SELECT * FROM tweets.symbols;
* postgresql://postgres:***@database-2.ch0m6kus6nrp.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
Looks good. Let’s apply the magic to an entire cell so that we can break the lines:
%%sql
SELECT *
FROM tweets.symbols;
* postgresql://postgres:***@database-2.ch0m6kus6nrp.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
Assigning returned rows to Python variables#
Single line queries:
query_output = %sql SELECT * FROM tweets.symbols;
* postgresql://postgres:***@database-2.ch0m6kus6nrp.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
Multi-line queries:
%%sql query_output <<
SELECT *
FROM tweets.symbols;
* postgresql://postgres:***@database-2.ch0m6kus6nrp.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
Returning data to local variable query_output
query_output
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
Now you can use the query_output
dataframe to do any other analysis or visualization that you learned in your other courses.
Using psycopg2 package (OPTIONAL)#
Refer here if you want to know on psycopg2 package and how to use it.
Warning
If you are not using your RDS instance, make sure you shut it down from your AWS console. You can always restart it when working on your assignments/projects. However, your credits are limited, and if those get exhausted, we won’t provide you with more.
Note
Closing your lab doesn’t shut down the RDS instance; you need to shut it down from your AWS console explicitly.
Can you ?#
Setup your database in cloud ?
Recollect various options that are available while setting up ?
connect to your databases from various options (jupyter, tableau, pgadmin etc)that we discussed ?
use good practices in storing passwords and using them from your notebook ?
Load data to your database using dumps ?
Class activity#
Note
Check worksheet 2 in canvas for detailed instructions on activity and submission instructions.
Setup your RDS instance
Load data to your RDS instance using dump
Ensure you installed the necessary packages for this course (bait580env.yml). Refer to Conda installation instructions for more details.