Module 8: Loading databases, 1:1 and 1:many

Sqlite3 in Python

Databases?

Python is commonly used as a connector service to databases. This means that Python provides tools to connect and interact with a database stored...somewhere. This may be a local database file, a server somewhere in your company, etc.

There are several database tools within python. Generally, they all have a pretty standard process of connect, execute, commit, etc. Competitors come out regularly as well, so jeust plan to learn whatever system your company will be using.

We will be exploring sqlite3 here, which is built into Python.

Essential objects

The conn

This represents your connection to the database. Sort of like doing an HTTP request, you must establish a connection to the data source and then interact from there. This object will give you special methods that are about connecting to the source file or server.

The cursor

The cursor object provides you access (via methods) to issue commands to the database. This object gives you special methods for interacting with the data and other content within the database.

Preparation

When we import the package without an alias, this means that we need to use dot notation and have sqlite3. before any function from that package that we want to use.

import sqlite3

In this case, we want to call sqlite3.connect() to open up a connection to our database file. Again, the terminology here is really geared toward non-local files.

Loading data

We pass the name of our file into the function, and save it as a variable called conn. This variable name is a pretty standard one that you'll see across a variety of SQL packages in Python.

This variable represents Python's knowledge of the database file and what can be done with it. You'll use this variable as an object to interact with the database's content.

But we aren't ready to execute commands just yet. We need to create a cursor connection to the database so we can pass it some text to parse as SQL commands. This cursor object will have a special execute() function that will let us pass it a SQL command (that is stored as text in a string).

import sqlite3

# the two lines below are boilerplate to load things
conn = sqlite3.connect('pettigrew.db')
c = conn.cursor()

# a quick statement to see all the names of the tables, 
# in this there should only be one.

results = c.execute('SELECT name from sqlite_master where type= "table"')

print(list(results))
[('letters',)]

Querying data

So we use c to execute SQL commands on our database, but then we use the results object to inspect and work with the results of those commands.

We only have one table, so we're going to do the simplest SQL command there is: select everything from the table and show it all to us. Let's take a look at the query and results, then we'll talk through the syntax.

This will be a two step process, breaking it down into both: 1) execute the query to create a results variable, and 2) use another command to view the contents of those results.

There are three functions that you can run on your results variable:

  • results.fetchone() will return a single tuple that represents the first row of your results.

  • results.fetchmany(a_number) will return a list of tuples that represents the first number of results that you requested.

  • results.fetchall() will return a list of tuples that represents all the rows in your returned table.

For memory and space reasons, you may only want to look at the 'top' of the results to check what you were given back. You may use this process as a check that what you got back was correct before going on to request or write out all of the data.

results = c.execute("SELECT * FROM letters;") # executes your query and saves your results

print(results.fetchone()) # shows you the contents of your results
(1, 1, '[Provenance documents and biographical sources]', 'n.d.')
# print(results.fetchmany(10))
# print(results.fetchall())

So to load all the rows of data into a 2D list you can do a list comprehension to recast each tuple as a list. However, you may find it valuable to leave the items as a tuple so you can't accidentally change the data.

data = [list(row) for row in results.fetchall()]

If you have a ton of columns, you may want to list them out:

print(results.description)
(('BoxNumber', None, None, None, None, None, None), ('FolderNumber', None, None, None, None, None, None), ('Contents', None, None, None, None, None, None), ('Date', None, None, None, None, None, None))

You can use this to create headers for an output file if needed.

headers = [r[0] for r in results.description]
headers
['BoxNumber', 'FolderNumber', 'Contents', 'Date

Writing data out from our database

We've got headers now, and now also 2D data. We can use our same pattern for writing out CSVs.

import csv

results = c.execute('SELECT * from letters;')

headers = [r[0] for r in results.description]
data = results.fetchall()

with open('allletters.csv', 'w') as fout:
    csvout = csv.writer(fout)
    csvout.writerow(headers)
    csvout.writerows(data)

Creating a database from data

Let's play with the data that you'll be using for your homework. These are found in our results folder, and are a bunch of xml and json files. First off, we need to loop through the data files and access the data.

Let's create a database table that has a three columns: the doi, the title, and the resource type.

You should fight with getting the data access and the database creation as two separate problems. Solve the data access first, and then you can stitch the database work around that.

First, you want to start with some 2D data structures. Let's read in the csv we made earlier.

import csv

with open('allletters.csv', 'r') as fin:
    csvin = csv.reader(fin)
    headers = next(csvin)
    data = [r for r in csvin]
print(headers)
data[:5]
['BoxNumber', 'FolderNumber', 'Contents', 'Date']





[['1', '1', '[Provenance documents and biographical sources]', 'n.d.'],
 ['1', '2', '[Provenance documents and biographical sources]', 'n.d.'],
 ['1', '3', '[Provenance documents and biographical sources]', 'n.d.'],
 ['1',
  '4',
  'Aberdeen, George Hamilton-Gordon, 4th Earl of, 1784-1860. 1 ALS, 1 LS, 1 AL to T. J. Pettigrew',
  '1846 Aug'],
 ['1',
  '5',
  'Abingdon, Montagu Bertie, 6th Earl of, 1808-1884. 2 ALS to T. J. Pettigrew',
  '1859, 1860']]

Let's look through this and add a transformation, making the box/folder number into a single value.

for row in data:
    newrow = [row[0] + "-" + row[1]] + row[2:]
    print(newrow)
['1-1', '[Provenance documents and biographical sources]', 'n.d.']
['1-2', '[Provenance documents and biographical sources]', 'n.d.']
['1-3', '[Provenance documents and biographical sources]', 'n.d.']
['1-4', 'Aberdeen, George Hamilton-Gordon, 4th Earl of, 1784-1860. 1 ALS, 1 LS, 1 AL to T. J. Pettigrew', '1846 Aug']
......snip.....

Now we can accumulate all these rows up into a new list.

retooleddata = []

for row in data:
    newrow = [row[0] + "-" + row[1]] + row[2:]
    retooleddata.append(newrow)

Now we can create our database and load the data. We can provide any file name that we like, and the file will be created if it doesn't already exist.

import sqlite3

conn = sqlite3.connect('retooled_letters.db')

c = conn.cursor()

We can create our tables.

c.execute('CREATE TABLE letters (id text, contents text, date text);')

Now we can load the data. Due to security needs (you can read more about this in the documentation), we should not manually create the insert statement. Instead, we rely on the function to pass the data in directly. Our responsibility is to provide the correct placeholders and then the data.

The order that you put the data into the () tuple must match the order of the tables that you created in your create table command.

We'll need to play with a few things in here:

  • making the tuple out of our data values

  • placing that formatted tuple into our larger select statement.

  • Thankfully there's a nice function for handling this for us within the module.

This presumes that we have a list of rows to add, that each item in the row is a separate element in those lists, and that the order in the rows is consistent. It will also nicely sanitize values, etc. for us.

Unfortunately, the syntax is somewhat cryptic.

The c.executemany() (note the c which means this is a method that you are calling on your cursor) command takes two arguments:

  1. the first being a string with your insert pattern and ? characters where you would want it to place your data values. You'll need to have as many ? items in there as you have columns. So instead of trying to do a .join() or something, you can use (?,?,?,?) if you have four columns.

  2. The second argument will be the list of rows that you have.

    So putting this together, we get this command to execute:

c.executemany('INSERT INTO letters VALUES (?, ?, ?)', retooleddata)

Now we need to commit the changes so we can see them.

conn.commit()
results = c.execute("SELECT * FROM letters")
new_data = results.fetchall()
print(len(retooleddata)) # these should match
print(len(new_data))
601
601

We also could have added the data in from within the for loop, row by row, using the c.execute(...) command, using the same syntax. Instead of providing the 2D data structure, you provide it just the single 1D row.

Last updated