Week 4
Outline
Working with databases:
deleting files
sqlite3 reading in databases: creating connections, creating cursors, executing queries, receiving values back.
sqlite3 writing out databases: creating tables, inserting rows of values, committing changes.
writing database results out as csvs
Working with API endpoints:
requests (not standard library)
building up API url queries
downloading files to disk
General reminders
You can use pathlib to delete files using the .unlink() method on a Path object. Note that it will raise a FileNotFoundError if the file doesn't exist yet; on Python 3.8+ you can pass missing_ok=True to have it execute without error in that case.
import pathlib
out = pathlib.Path('deleteme.txt')
out.write_text('hello, this will go away')
print(out.exists())
out.unlink()
print(out.exists())
True
False
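If you do want the delete to be silent when the file is already gone, here's a minimal sketch, assuming Python 3.8 or newer (where unlink() accepts the missing_ok argument):
import pathlib

out = pathlib.Path('deleteme.txt')
out.unlink(missing_ok=True)   # no error, even though the file no longer exists
print(out.exists())           # False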
Use shutil to unzip things.
# just proving this isn't already here
target = pathlib.Path('records')
print(target.exists())
False
import shutil
shutil.unpack_archive('records.zip') # this will work again even if the folder already exists
print(target.exists())
True
Use shutil to delete a folder.
You can use pathlib Path object methods to delete a file (p.unlink()) or delete an empty folder (p.rmdir()), but not a folder with contents. You'll need to use shutil.rmtree(p) to delete it. Depending on your Python version, you may need to cast your Path object to a string for shutil.rmtree(); newer versions accept Path objects directly.
shutil.rmtree(str(target)) # will only work if the folder exists
print(target.exists())
False
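If you want that folder delete to be safe to re-run, one option is to check for the folder first; a quick sketch:
import pathlib
import shutil

target = pathlib.Path('records')
if target.exists():
    shutil.rmtree(str(target))   # remove the folder and everything inside it
print(target.exists())           # False either way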
Use the glob method to search for file paths, and then you can loop over those to read them.
import shutil
target = pathlib.Path('records')
shutil.unpack_archive('records.zip')
print(target.exists())
True
jsonfiles = list(target.glob('*.json')) # use this for your homework, y'all
print(len(jsonfiles))
207
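Once you have that list of paths, reading each file might look something like this sketch (assuming each file holds valid json):
import json

records = []

for jsonfile in jsonfiles:
    # read_text() gives us the raw string; json.loads() parses it into Python objects
    records.append(json.loads(jsonfile.read_text()))

print(len(records))   # should match the 207 files found above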
Databases
The standard library has sqlite3 built in. There are many SQL database systems out there; this is just one of them. It's quick and pretty painless.
Let's read in an existing db file and explore the tables. There's a variety of boilerplate here to work with.
The conn object represents your connection to the database. The c (cursor) object represents your ability to issue commands to the database.
So like a connection vs an interpreter.
When we import the package without an alias, this means that we need to use dot notation and have sqlite3. before any function from that package that we want to use.
In this case, we want to call sqlite3.connect() to open up a connection to our database file. Again, the terminology here is really geared toward non-local files.
We pass the name of our file into the function, and save it as a variable called conn. This variable name is a pretty standard one that you'll see across a variety of SQL packages in Python.
This variable represents Python's knowledge of the database file and what can be done with it. You'll use this variable as an object to interact with the database's content.
But we aren't ready to execute commands just yet. We need to create a cursor connection to the database so we can pass it some text to parse as SQL commands. This cursor object will have a special execute() function that will let us pass it a SQL command (that is stored as text in a string).
import sqlite3
# the two lines below are boilerplate to load things
conn = sqlite3.connect('pettigrew.db')
c = conn.cursor()
# a quick statement to see all the names of the tables,
# in this there should only be one.
results = c.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(list(results))
[('letters',)]
So we use c to execute SQL commands on our database, but then we use the results object to inspect and work with the results of those commands.
We only have one table, so we're going to do the simplest SQL command there is: select everything from the table and show it all to us. Let's take a look at the query and results, then we'll talk through the syntax.
This will be a two step process, breaking it down into both: 1) execute the query to create a results variable, and 2) use another command to view the contents of those results.
There are three functions that you can run on your results variable:
results.fetchone() will return a single tuple that represents the first row of your results.
results.fetchmany(a_number) will return a list of tuples that represents the first number of results that you requested.
results.fetchall() will return a list of tuples that represents all the rows in your returned table.
For memory and space reasons, you may only want to look at the 'top' of the results to check what you were given back. You may use this process as a check that what you got back was correct before going on to request or write out all of the data.
results = c.execute("SELECT * FROM letters;") # executes your query and saves your results
print(results.fetchone()) # shows you the contents of your results
(1, 1, '[Provenance documents and biographical sources]', 'n.d.')
# print(results.fetchmany(10))
# print(results.fetchall())
So to load all the rows of data into a 2D list you can do a list comprehension to recast each tuple as a list. However, you may find it valuable to leave the items as a tuple so you can't accidentally change the data.
data = [list(row) for row in results.fetchall()]
If you have a ton of columns, you may want to list them out; the description attribute on your results gives you the column names:
print(results.description)
(('BoxNumber', None, None, None, None, None, None), ('FolderNumber', None, None, None, None, None, None), ('Contents', None, None, None, None, None, None), ('Date', None, None, None, None, None, None))
You can use this to create headers for an output file if needed.
headers = [r[0] for r in results.description]
headers
['BoxNumber', 'FolderNumber', 'Contents', 'Date']
Writing out database results to a csv
We've got headers now, and also 2D data. We can use our same pattern for writing out CSVs.
import csv
results = c.execute('SELECT * from letters;')
headers = [r[0] for r in results.description]
data = results.fetchall()
with open('allletters.csv', 'w', newline='') as fout:
csvout = csv.writer(fout)
csvout.writerow(headers)
csvout.writerows(data)
Creating a database from data
Let's play with the data that you'll be using for your homework. These are found in the records folder we unzipped earlier, and are a bunch of xml and json files. First off, we need to loop through the data files and access the data.
Let's create a database table that has three columns: the doi, the title, and the resource type.
You should treat the data access and the database creation as two separate problems. Solve the data access first, and then you can stitch the database work around that (a rough sketch follows).
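For the homework files, the data-access half might look roughly like this; the key names below are hypothetical placeholders, so check the actual structure of your json files before relying on them:
import json
import pathlib

target = pathlib.Path('records')
rows = []

for jsonfile in target.glob('*.json'):
    record = json.loads(jsonfile.read_text())
    # 'doi', 'title', and 'resourceType' are placeholder key names;
    # inspect one of your files to find where these values really live
    doi = record.get('doi')
    title = record.get('title')
    resource_type = record.get('resourceType')
    rows.append([doi, title, resource_type])

print(len(rows))   # one row per json file
print(rows[:3])    # spot-check a few rows before touching the database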
For our in-class example, let's start with some 2D data structures. We'll read in the csv we made earlier.
import csv
with open('allletters.csv', 'r') as fin:
csvin = csv.reader(fin)
headers = next(csvin)
data = [r for r in csvin]
print(headers)
data[:5]
['BoxNumber', 'FolderNumber', 'Contents', 'Date']
[['1', '1', '[Provenance documents and biographical sources]', 'n.d.'],
['1', '2', '[Provenance documents and biographical sources]', 'n.d.'],
['1', '3', '[Provenance documents and biographical sources]', 'n.d.'],
['1',
'4',
'Aberdeen, George Hamilton-Gordon, 4th Earl of, 1784-1860. 1 ALS, 1 LS, 1 AL to T. J. Pettigrew',
'1846 Aug'],
['1',
'5',
'Abingdon, Montagu Bertie, 6th Earl of, 1808-1884. 2 ALS to T. J. Pettigrew',
'1859, 1860']]
Let's loop through this and add a transformation, combining the box and folder numbers into a single value.
for row in data:
newrow = [row[0] + "-" + row[1]] + row[2:]
print(newrow)
['1-1', '[Provenance documents and biographical sources]', 'n.d.']
['1-2', '[Provenance documents and biographical sources]', 'n.d.']
['1-3', '[Provenance documents and biographical sources]', 'n.d.']
['1-4', 'Aberdeen, George Hamilton-Gordon, 4th Earl of, 1784-1860. 1 ALS, 1 LS, 1 AL to T. J. Pettigrew', '1846 Aug']
......snip.....
Now we can accumulate all these rows up into a new list.
retooleddata = []
for row in data:
newrow = [row[0] + "-" + row[1]] + row[2:]
retooleddata.append(newrow)
Now we can create our database and load the data. We can provide any file name that we like, and the file will be created if it doesn't already exist.
import sqlite3
conn = sqlite3.connect('retooled_letters.db')
c = conn.cursor()
We can create our tables.
c.execute('CREATE TABLE letters (id text, contents text, date text);')
Now we can load the data. Due to security concerns (SQL injection; you can read more about this in the documentation), we should not manually build the insert statement by pasting values into the string. Instead, we rely on the function to pass the data in directly. Our responsibility is to provide the correct placeholders and then the data.
The order that you put the data into the () tuple must match the order of the columns that you created in your create table command.
We'll need to play with a few things in here:
making the tuple out of our data values
placing that formatted tuple into our larger insert statement.
Thankfully there's a nice function for handling this for us within the module.
This presumes that we have a list of rows to add, that each item in the row is a separate element in those lists, and that the order in the rows is consistent. It will also nicely sanitize values, etc. for us.
Unfortunately, the syntax is somewhat cryptic.
The c.executemany() (note the c which means this is a method that you are calling on your cursor) command takes two arguments:
the first being a string with your insert pattern and ? characters where you would want it to place your data values. You'll need to have as many ? items in there as you have columns. So instead of trying to do a .join() or something, you can use (?,?,?,?) if you have four columns.
The second argument will be the list of rows that you have.
So putting this together, we get this command to execute:
c.executemany('INSERT INTO letters VALUES (?, ?, ?)', retooleddata)
Now we need to commit the changes so we can see them.
conn.commit()
results = c.execute("SELECT * FROM letters")
new_data = results.fetchall()
print(len(retooleddata)) # these should match
print(len(new_data))
601
601
We also could have added the data in from within the for loop, row by row, using the c.execute(...) command with the same syntax. Instead of providing the 2D data structure, you provide it just the single 1D row.
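That row-by-row version might look roughly like this (same placeholder syntax, executed once per row):
for row in retooleddata:
    c.execute('INSERT INTO letters VALUES (?, ?, ?)', row)

conn.commit()  # still need to commit once the loop is done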
Working with web based URL endpoints
Your homework will be to mess with the data being created here, so we are going to work through the creation of it during class.
There are a few tools that we will use here.
First off, we are going to use requests even though it is not standard library. However, it is so commonly used for this that it is effectively treated as standard library.
requests has a variety of tools for working with web connections, but importantly it also has the capacity to download and save content.
import requests
url = "https://en.wikipedia.org/wiki/UIUC_School_of_Information_Sciences"
result = requests.get(url)
data = result.text
result.close()
data[:1000]
'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>UIUC School of Information Sciences - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"UIUC_School_of_Information_Sciences","wgTitle":"UIUC School of Information Sciences","wgCurRevisionId":921410882,"wgRevisionId":921410882,"wgArticleId":5761393,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages using infobox university with the image name parameter","Coordinates on Wikidata","Information schools","University of Illinois at Urbana–Champaign"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August",'
In just a few lines of code we have received the text from this website. We can also write it to disk now.
import pathlib
pathlib.Path('ischoolwiki.txt').write_text(data)
92875
pathlib.Path('ischoolwiki.txt').exists()
True
This also means that we can put this text through to the json parser if we are getting json results back.
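For example, hitting the datacite endpoint that we use below returns json, so we can hand the response text straight to json.loads(); a minimal sketch:
import json
import requests

result = requests.get("https://api.datacite.org/works?query=tuatara")
data = json.loads(result.text)   # parse the response text into Python dictionaries and lists
result.close()
print(type(data))                # should be a dict if the endpoint returned json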
When dealing with a web endpoint api, your goal will be to use python to create the URLs you want to hit, and save the content to disk.
As much as possible, I suggest that you save the content to disk. This means that you can take your time and repeatedly open the files for reading as you are developing your processing strategy, while only hitting the website once. This reduces the number of calls you make to the service.
In the example above we have only hit a single page, so no delay was needed. However, if we had multiple, we would need to put in a time delay. The amount of time you would delay depends on the service. Usually 2-3 seconds is ok, but some will require 30 seconds or more.
So let's take a look at the datacite API.
Here's an example search: https://search.datacite.org/works?query=tuatara
This gives you 4 pages of results, and when we click on the next page, the URL reveals to us the page number structure.
So here's page 2's url: https://search.datacite.org/works?query=tuatara&page=2
So we can try changing that 2 to 1: https://search.datacite.org/works?query=tuatara&page=1 and see if that indeed gives us the first page of results.
So how do we generate these 4 URLs? Well, let's think: how can we generate the numbers 1-4?
range(1, 5) will do that.
You can also get the json results back if you change it to api.datacite.org.
for i in range(1, 5):
print("https://api.datacite.org/works?query=snake&page=" + str(i))
https://api.datacite.org/works?query=snake&page=1
https://api.datacite.org/works?query=snake&page=2
https://api.datacite.org/works?query=snake&page=3
https://api.datacite.org/works?query=snake&page=4
import time
import pathlib
import requests
folder = pathlib.Path('dsresults')
folder.mkdir(exist_ok=True)  # create the output folder if it isn't already there
for i in range(1, 5):
filename = folder / pathlib.Path("result_page_" + str(i) + ".json")
result = requests.get("https://api.datacite.org/works?query=snake&page[number]=" + str(i))
text = result.text
print(len(text))
filename.write_text(result.text)
result.close()
time.sleep(2)
print(i)
27824810
1
18163739
2
28328422
3
54
4
We can also do this a different way. At the bottom of the page you can see that they list the total number of pages. If you parse the results along the way, you can use a counter to check whether you have reached that last page yet.
import json
import pathlib
import requests
import time

folder = pathlib.Path("dsresults2")

if not folder.exists():
    folder.mkdir()

count = 1

while True:
    filename = folder / pathlib.Path("tuatara-result-" + str(count).zfill(3) + ".json")
    url = "https://api.datacite.org/works?query=tuatara&page[number]=" + str(count)
    result = requests.get(url)
    data = json.loads(result.text)
    result.close()
    current = data["meta"]["page"]        # the page number the API says we are on
    end = data["meta"]["total-pages"]     # the total number of pages available
    filename.write_text(json.dumps(data, indent = 4), encoding='utf-8')
    if count == end:
        break
    count += 1
    time.sleep(2)