Module 8: Loading databases, 1:1 and 1:many
Sqlite3 in Python
Databases?
Python is commonly used as a connector service to databases. This means that Python provides tools to connect and interact with a database stored...somewhere. This may be a local database file, a server somewhere in your company, etc.
There are several database tools within python. Generally, they all have a pretty standard process of connect, execute, commit, etc. Competitors come out regularly as well, so jeust plan to learn whatever system your company will be using.
We will be exploring sqlite3 here, which is built into Python.
Essential objects
The conn
This represents your connection to the database. Sort of like doing an HTTP request, you must establish a connection to the data source and then interact from there. This object will give you special methods that are about connecting to the source file or server.
The cursor
The cursor object provides you access (via methods) to issue commands to the database. This object gives you special methods for interacting with the data and other content within the database.
Preparation
When we import the package without an alias, this means that we need to use dot notation and have sqlite3. before any function from that package that we want to use.
import sqlite3
In this case, we want to call sqlite3.connect()
to open up a connection to our database file. Again, the terminology here is really geared toward non-local files.
Loading data
We pass the name of our file into the function, and save it as a variable called conn. This variable name is a pretty standard one that you'll see across a variety of SQL packages in Python.
This variable represents Python's knowledge of the database file and what can be done with it. You'll use this variable as an object to interact with the database's content.
But we aren't ready to execute commands just yet. We need to create a cursor connection to the database so we can pass it some text to parse as SQL commands. This cursor object will have a special execute() function that will let us pass it a SQL command (that is stored as text in a string).
Querying data
So we use c
to execute SQL commands on our database, but then we use the results
object to inspect and work with the results of those commands.
We only have one table, so we're going to do the simplest SQL command there is: select everything from the table and show it all to us. Let's take a look at the query and results, then we'll talk through the syntax.
This will be a two step process, breaking it down into both: 1) execute the query to create a results variable, and 2) use another command to view the contents of those results.
There are three functions that you can run on your results variable:
results.fetchone()
will return a single tuple that represents the first row of your results.results.fetchmany(a_number)
will return a list of tuples that represents the first number of results that you requested.results.fetchall()
will return a list of tuples that represents all the rows in your returned table.
For memory and space reasons, you may only want to look at the 'top' of the results to check what you were given back. You may use this process as a check that what you got back was correct before going on to request or write out all of the data.
So to load all the rows of data into a 2D list you can do a list comprehension to recast each tuple as a list. However, you may find it valuable to leave the items as a tuple so you can't accidentally change the data.
If you have a ton of columns, you may want to list them out:
You can use this to create headers for an output file if needed.
Writing data out from our database
We've got headers now, and now also 2D data. We can use our same pattern for writing out CSVs.
Creating a database from data
Let's play with the data that you'll be using for your homework. These are found in our results folder, and are a bunch of xml and json files. First off, we need to loop through the data files and access the data.
Let's create a database table that has a three columns: the doi, the title, and the resource type.
You should fight with getting the data access and the database creation as two separate problems. Solve the data access first, and then you can stitch the database work around that.
First, you want to start with some 2D data structures. Let's read in the csv we made earlier.
Let's look through this and add a transformation, making the box/folder number into a single value.
Now we can accumulate all these rows up into a new list.
Now we can create our database and load the data. We can provide any file name that we like, and the file will be created if it doesn't already exist.
We can create our tables.
Now we can load the data. Due to security needs (you can read more about this in the documentation), we should not manually create the insert statement. Instead, we rely on the function to pass the data in directly. Our responsibility is to provide the correct placeholders and then the data.
The order that you put the data into the () tuple must match the order of the tables that you created in your create table command.
We'll need to play with a few things in here:
making the tuple out of our data values
placing that formatted tuple into our larger select statement.
Thankfully there's a nice function for handling this for us within the module.
This presumes that we have a list of rows to add, that each item in the row is a separate element in those lists, and that the order in the rows is consistent. It will also nicely sanitize values, etc. for us.
Unfortunately, the syntax is somewhat cryptic.
The c.executemany() (note the c which means this is a method that you are calling on your cursor) command takes two arguments:
the first being a string with your insert pattern and ? characters where you would want it to place your data values. You'll need to have as many ? items in there as you have columns. So instead of trying to do a .join() or something, you can use (?,?,?,?) if you have four columns.
The second argument will be the list of rows that you have.
So putting this together, we get this command to execute:
Now we need to commit the changes so we can see them.
We also could have added the data in from within the for loop, row by row, using the c.execute(...)
command, using the same syntax. Instead of providing the 2D data structure, you provide it just the single 1D row.
Last updated