Lecture 2: Setting up RDS and connection to cloud#
Gittu George, January 9 2025
Announcements#
Remember to watch this youtube video. https://www.youtube.com/watch?v=3OyQ3EeVPNs
We finally managed to get the ROOM booking, and here are the TA OH details.
Tuesday 12:00–1:00 PM – HA 358 - i.e., right after class, so that you can continue asking your questions with the TA.
Tuesday 4:00–5:00 PM – HA 454 - i.e., right after class, so that you can continue asking your questions with the TA.
Thursday 5:00–6:00 PM – HA 454
Friday 3-4 PM - VIA ZOOM. Here is the link https://ubc.zoom.us/j/65657043766?pwd=0btOmgt8qHO5X1IayvghiX8lmM0fd1.1
We will be having an OPTIONAL tutorial section on Friday 4-5pm, on conda and it’s importance. More details in piazza. https://piazza.com/class/m5l9q2ervdk6pj/post/7
Worksheets for the following week will be released on Friday before 5:00 PM, so you will have time to review if you want.
Start setting up and working on assignments early so that you have enough time to plan and ask questions. If you wait until the deadline to ask questions, you might not get a response in time.
If your attendance falls below the required threshold for the MBAN program, your name will be reported to the Program Manager. Please ensure you get any concessions approved beforehand.
Todays Agenda#
Connecting to RDS (Postgres)
Learning objectives#
Setup your first cloud database (RDS instance)
You will know how to connect to a remote database from a Jupyter notebook.
Different ways in interacting with the database from jupyter notebook.
Showing different ways of loading the data.
You will use queries to bring data into a Jupyter notebook and provide a simple data analysis.
Setting up RDS#
Demo time
I recommend you do this with me or during the class activity time. Or you can watch the youtube video that I made on setting up your first RDS instance.
Here is the youtube video https://www.youtube.com/watch?v=3OyQ3EeVPNs
Consider the database that you built as an independent entity. Then, follow the below instructions to connect to your database in the cloud.
Connecting To Cloud (RDS)#
If you haven’t yet set it up, please follow instructions in Setting up RDS
OKAY.. so now you have your database ready! In our case, we set up our Postgres database in RDS. Since you are not using an already existing database, you created it, so it’s empty now. So we need to load data into this database. After loading data to the database, we are all set with querying from the database.
Whether it’s for loading or querying the database, we need to connect to the database.
Well, we have a variety of options to run our SQL statements in PostgreSQL (You already aware of couple of options from your BAIT 507) :
We will be using the following 2 approaches in this course
pgAdmin is the official web-based GUI for interacting with PostgreSQL databases
%sql
and%%sql
magic commands in Jupyter notebooks, which are provided by theipython-sql
package
Some other ways you can connect (we won’t be discussing these)
psql
is PostgreSQL’s interactive command-line interfacepsycopg2
is the official Python adapter for PostgreSQL databasesUsing
.read_sql_query()
method in Pandas
First thing first#
What information do you want to know if you’re going to connect to the above? We need to know the information of the database that you wanted to connect to
Hostname
Port number
User name (need to make sure that this user have access to the database)
Password (Ofcourse !! you need to know the password for this user)
Now let’s think about other sources you want to connect to the database. Here are a few that I thought of
From a programming language (java, R, python, etc..)
From your jupyter notebook
From tableau
From excel
From R Studio
OR from any other application that you are using, e.g., your banking website
Besides knowing the hostname, port name, username, and password, we also need to install an Application programming interface (API) that helps the applications on the client-side communicate with the server-side (in our case database). Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are 2 of such APIs that are provided by the database vendor (Postgres, Oracle, MySQL, etc..) to access their database.
JDBC is language-specific, and its the standard interface between any java application and database. JDBC converts the java programmer’s requests to the database to something that the database understands (you can think of it as a translator). Meanwhile, ODBC is not language-specific, so it can convert requests from any application/software/programming language to something the database understands. Here is a diagram that will help you understand how ODBC works.
In the rest of the section, let’s focus on connecting to the database from python and the jupyter notebook. We will be using specific packages to make this possible, but as we have discussed, it uses ODBC for the connection.
pgAdmin#
Important
You use pgadmin to load dumps for worksheet 2 and assignment 1.
Checkout
Please check here for more details on setting up pgadmin and connecting to your database.
I like to use a
Shift + Enter
keyboard shortcut to run my queries. You can configure this too by going Preferences -> Query Tool -> Keyboard shortcuts -> Change “Execute query” toShift + Enter
.
psql#
Note
You can run SQL commands in pgadmin and/or jupyter notebook. We don’t need this for this course.
ipython-sql
(%sql
and %%sql
)#
ipython-sql
is a package that enables us to run SQL statements right from a Jupyter notebook. You should install this package in your conda environment ( you probabily have it if you used the yml file to setup your environment). In order to use it, we should load it first:
%load_ext sql
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 get_ipython().run_line_magic('load_ext', 'sql')
File ~/opt/miniconda3/envs/jb2023/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2369, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
2367 kwargs['local_ns'] = self.get_local_scope(stack_depth)
2368 with self.builtin_trap:
-> 2369 result = fn(*args, **kwargs)
2371 # The code below prevents the output from being displayed
2372 # when using magics with decodator @output_can_be_silenced
2373 # when the last Python token in the expression is a ';'.
2374 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):
File ~/opt/miniconda3/envs/jb2023/lib/python3.11/site-packages/IPython/core/magics/extension.py:33, in ExtensionMagics.load_ext(self, module_str)
31 if not module_str:
32 raise UsageError('Missing module name.')
---> 33 res = self.shell.extension_manager.load_extension(module_str)
35 if res == 'already loaded':
36 print("The %s extension is already loaded. To reload it, use:" % module_str)
File ~/opt/miniconda3/envs/jb2023/lib/python3.11/site-packages/IPython/core/extensions.py:76, in ExtensionManager.load_extension(self, module_str)
69 """Load an IPython extension by its module name.
70
71 Returns the string "already loaded" if the extension is already loaded,
72 "no load function" if the module doesn't have a load_ipython_extension
73 function, or None if it succeeded.
74 """
75 try:
---> 76 return self._load_extension(module_str)
77 except ModuleNotFoundError:
78 if module_str in BUILTINS_EXTS:
File ~/opt/miniconda3/envs/jb2023/lib/python3.11/site-packages/IPython/core/extensions.py:91, in ExtensionManager._load_extension(self, module_str)
89 with self.shell.builtin_trap:
90 if module_str not in sys.modules:
---> 91 mod = import_module(module_str)
92 mod = sys.modules[module_str]
93 if self._call_load_ipython_extension(mod):
File ~/opt/miniconda3/envs/jb2023/lib/python3.11/importlib/__init__.py:126, in import_module(name, package)
124 break
125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)
File <frozen importlib._bootstrap>:1206, in _gcd_import(name, package, level)
File <frozen importlib._bootstrap>:1178, in _find_and_load(name, import_)
File <frozen importlib._bootstrap>:1142, in _find_and_load_unlocked(name, import_)
ModuleNotFoundError: No module named 'sql'
Now we need the host address of where the database is stored, along with a username and a password.
There are a couple of ways to deal with packages. In the following section, I am showing you one way of dealing with passwords, and I recommend the following approach.
It is always a bad idea to store login information directly in a notebook or code file because of security reasons. Say if you want to share this notebook with your colleague, or if somebody came by and looked at your notebook, they can see your database connection details, including your password. It’s probably okay now as it’s a toy database, but when you start working in the industry, you will be dealing with a database ( or, in other words, business), and it can cause lots of problems if the password is compromised. So, let’s practice not doing the connection call by including the credentials directly into your notebook.
In order to avoid that, we store that kind of information in a separate file, like credentials.json
here, and read the username and password into our IPython session:
Note
Make sure that your credentials.json
file looks like this:
{
"host": "<the_end_point>",
"port": 5432,
"user": "postgres",
"password": "<your_password>"
}
This way, you don’t have to worry about exposing your password to anyone.
Caution
Nowhere in assignments/projects should you include the hostname or password. If anything needs to be checked, then TA’s will contact you.
import json
import urllib.parse
with open('credentials.json') as f:
login = json.load(f)
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
Now we can establish the connection to the postgres
database using the following code:
%sql postgresql://{username}:{password}@{host}:{port}/postgres
Note that we have used the %sql
line magic to interpret the line in front of it as a magic command.
We can also use %%sql
cell magic to apply the magic to an entire notebook cell.
List the relations (i.e. tables) in the database:
%%sql
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND
schemaname != 'information_schema'
* postgresql://postgres:***@database-1.ccf3srhijxm7.us-east-1.rds.amazonaws.com:5432/postgres
9 rows affected.
schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity |
---|---|---|---|---|---|---|---|
tweets | replies | postgres | None | False | False | True | False |
tweets | cleantweets | postgres | None | True | False | True | False |
tweets | symbols | postgres | None | True | False | True | False |
tweets | retweets | postgres | None | False | False | True | False |
tweets | stockvalues | postgres | None | False | False | True | False |
tweets | tweets | postgres | None | True | False | False | False |
tweets | cashtags | postgres | None | False | False | True | False |
tweets | users | postgres | None | True | False | True | False |
fakedata | testindex | postgres | None | True | False | False | False |
Let’s run some SQL statements now. Let’s retrieve the name
and population
columns from the country
table:
%sql SELECT * FROM tweets.symbols limit 10;
* postgresql://postgres:***@database-1.ccf3srhijxm7.us-east-1.rds.amazonaws.com:5432/postgres
10 rows affected.
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
Limiting returned and displayed rows#
As you can see, all rows are returned and displayed by default. This behaviour can be problematic if our table is very large for two reasons:
Retrieving large tables can be slow, and maybe not necessary
Displaying a lot of rows clutters our Jupyter notebook
We can modifying ipython-sql
configuration to limit the number of returned and displayed rows. For example, here we change the display limit:
%config SqlMagic.displaylimit = 10
%sql SELECT * FROM tweets.symbols;
* postgresql://postgres:***@database-1.ccf3srhijxm7.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
Looks good. Let’s apply the magic to an entire cell so that we can break the lines:
%%sql
SELECT *
FROM tweets.symbols;
* postgresql://postgres:***@database-1.ccf3srhijxm7.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
Assigning returned rows to Python variables#
Single line queries:
query_output = %sql SELECT * FROM tweets.symbols;
* postgresql://postgres:***@database-1.ccf3srhijxm7.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
Multi-line queries:
%%sql query_output <<
SELECT *
FROM tweets.symbols;
* postgresql://postgres:***@database-1.ccf3srhijxm7.us-east-1.rds.amazonaws.com:5432/postgres
6121 rows affected.
Returning data to local variable query_output
query_output
symbolid | nasdaqsymbol | securityname |
---|---|---|
1 | A | Agilent Technologies, Inc. Common Stock |
2 | AA | Alcoa Corporation Common Stock |
3 | AAA | Listed Funds Trust AAF First Priority CLO Bond ETF |
4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
5 | AAC | Ares Acquisition Corporation Class A Ordinary Shares |
6 | AAC= | Ares Acquisition Corporation Units, each consisting of one Class A ordinary share, and one-fifth of one redeemable warrant |
7 | AAC+ | Ares Acquisition Corporation Redeemable Warrants, each whole warrant exercisable for one Class A ordinary share at an exercise price of $11.50 |
8 | AAIC | Arlington Asset Investment Corp Class A (new) |
9 | AAIC-B | Arlington Asset Investment Corp 7.00% |
10 | AAIC-C | Arlington Asset Investment Corp 8.250% Seies C Fixed-to-Floating Rate Cumulative Redeemable Preferred Stock |
import pandas as pd
pd.DataFrame(query_output)
symbolid | nasdaqsymbol | securityname | |
---|---|---|---|
0 | 1 | A | Agilent Technologies, Inc. Common Stock |
1 | 2 | AA | Alcoa Corporation Common Stock |
2 | 3 | AAA | Listed Funds Trust AAF First Priority CLO Bond... |
3 | 4 | AAAU | Goldman Sachs Physical Gold ETF Shares |
4 | 5 | AAC | Ares Acquisition Corporation Class A Ordinary ... |
... | ... | ... | ... |
6116 | 6117 | ZUO | Zuora, Inc. Class A Common Stock |
6117 | 6118 | ZVIA | Zevia PBC Class A Common Stock |
6118 | 6119 | ZVV | NYSE ARCA test stock |
6119 | 6120 | ZXIET | IEX Test Company Test Symbol Three for IEX |
6120 | 6121 | ZYME | Zymeworks Inc. Common Shares |
6121 rows × 3 columns
Important
Must use pd.DataFrame(query_output)
, so that you get it as a pandas dataframe which you can use for further analysis, and visualization.
Now you can use the query_output
dataframe to do any other analysis or visualization that you learned in your other courses.
Using psycopg2 package (OPTIONAL)#
Refer here if you want to know on psycopg2 package and how to use it.
Warning
If you are not using your RDS instance, make sure you shut it down from your AWS console. You can always restart it when working on your assignments/projects. However, your credits are limited, and if those get exhausted, we won’t provide you with more.
Note
Closing your lab doesn’t shut down the RDS instance; you need to shut it down from your AWS console explicitly.
Can you ?#
Setup your database in cloud ?
Recollect various options that are available while setting up ?
connect to your databases from various options (jupyter, tableau, pgadmin etc)that we discussed ?
use good practices in storing passwords and using them from your notebook ?
Load data to your database using dumps ?
Class activity#
Note
Check worksheet 2 in canvas for detailed instructions on activity and submission instructions.
Setup your RDS instance
Load data to your RDS instance using dump
Ensure you installed the necessary packages for this course (
bait580_2024.yml
). Refer to Conda installation instructions for more details.