Python Data Loading Tour
This is a short lesson highlighting several different ways to load data into Python.
What this is for
This lesson will mostly focus on processing different types of text files, particularly those for data. The general approaches for reading in data are covered, and then boilerplate code for specific file formats.
What this lesson presumes
That you are already working with Python code in some capacity
You understand how to import and work with modules (this will focus on standard library modules but may reference some very common external modules)
Some technical jargon will be used, most of it explained.
Plain text vs binary
All files contain content of some sort, or they are of size 0 and contain nothing. But they could if they wanted to.
Every file stores its content as bytes, but plain text files are special. Their bytes represent characters, and those character codes generally follow a pretty common standard. You may still open a plain text file and see some garbled characters due to encoding differences, so even this is not perfect. These character codes are generally (see: encoding drama) something that most modern computer systems can open without a special program to interpret the data. This is sort of where the name comes from. Plain, because there's no extra formatting applied to the text, and text because, well, that's what it contains.
Various workarounds have been created to transmit plain text with some formatting. HTML is one, or you may have heard of Markdown. These use specific text characters to annotate the text for certain formatting conditions. When viewed as plain text, the markup and original characters are visible, and when viewed with a specific parser, the text formatting is rendered. Markdown strikes a nice balance, adding this formatting while leaving the text still (mostly) human readable.
Binary files still contain bytes, but these aren't all intended to be characters or text. Specific programs are required to interpret those bytes, allowing for non-text data to be represented. SQLite database files are an example of this. They contain the text data of the database, but the structural information and metadata are stored as well. These will usually have a file extension of db or sqlite. With db in particular, the extension doesn't make it clear how the file is supposed to be interpreted, which can be a larger problem with binary files: the file names don't always make it clear, and you can't just open the file to figure it out.
General file reading techniques
There are a few ways that files can be read in with Python modules. Each module usually supports a few of these, and the tasks and data sources often dictate your approach. There usually isn't a single "right" way to do these things, so you should try and use techniques that you are comfortable with. Some modules and reading techniques will have a single primary pattern you are expected to use nearly all the time. There isn't a magic way to determine this. You have to look at the documentation and examples.
Here's the general idea behind how all these things work:
You make a string of the file path
Use whatever the boilerplate is to make the thing go, and at some point you'll pass the file name string (or variable containing it) into the relevant function
The boilerplate code will return back the processed data, loaded in some form of a Python data structure, and you need to either: a) save it to a variable name or b) figure out which variable the boilerplate code made with the data.
Style 1: The module takes a filename and does all the things for you
The general steps:
Get the file path loaded in some way, as a string, variable, etc.
Pass it into the relevant function meant to process that data.
The function does all the opening, loading, closing etc. It returns the loaded data. You just need to use a variable assignment to save it.
The pandas library is a good example of this, where you give it a file name and use the function that matches the file type. The function returns a loaded data object, ready to work with.
The sqlite3 module also does this. You use con = sqlite3.connect('example.db') and the con object is the loaded data ready to go. All file manipulation is handled by the function itself.
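A minimal sketch of both patterns (assuming pandas is installed and that example.csv and example.db actually exist):

```python
import sqlite3

import pandas as pd

# pandas: hand the function a file path and get back a loaded DataFrame
df = pd.read_csv('example.csv')

# sqlite3: hand the function a file path and get back a connection object
con = sqlite3.connect('example.db')
con.close()
```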
Style 2: The module takes an open file object and does stuff to it
This is a hybrid of sorts with Style 1. You are responsible for creating the fileIO object, but you pass that into the function and it takes care of loading all the data, finally returning the data for you to save in a variable. You will generally use a with block in these situations.
The general steps:
Get the file opened and have a fileIO object ready.
Pass that fileIO object to the relevant function (usually along with other parameters).
Save the returned value in a variable.
You may still need to close the fileIO object, just depending on the function you are using and how you opened the file.
The json module works a lot like this. The json.load(...) function takes an argument that's just a fileIO object of the JSON file to be read. It parses the file and returns the relevant data structure.
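A minimal sketch of this pattern, assuming a file named example.json exists:

```python
import json

# you create and manage the fileIO object; json.load() does the parsing
with open('example.json', 'rt', encoding='utf-8') as infile:
    data = json.load(infile)

# data is now a dictionary or a list, depending on the file's structure
print(type(data))
```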
Style 3: The module takes an open file object and transforms it into a new kind of file object
The csv module is like this. This is very much like Style 2, except the fileIO object is taken by the module via a function, and a new IO object is returned. This new object will contain the specific methods for reading in that file.
This line, within the larger boilerplate, is key: csvin = csv.reader(infile). Once you create the fileIO object, you pass it to the function and it returns back a special csv reader object. This gives you an object that you work with in a way that's very similar to a regular fileIO object. You will use methods etc. to extract and save the data from it, but it is specific to CSVs.
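A minimal sketch of that pattern, with example.csv as a placeholder file name:

```python
import csv

# open the file yourself, then wrap the fileIO object in a csv reader
with open('example.csv', 'rt', encoding='utf-8', newline='') as infile:
    csvin = csv.reader(infile)
    rows = [row for row in csvin]   # each row is a list of strings
```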
What are the core ways to read things in?
The official documentation is a pretty easy and interesting read: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
Style 1: load it in default library Python
You'll have a three step process here. The key is the open() function. This takes the file path and the reading mode. The returned result will be a fileIO object, full of fun methods for reading things.
Example: infile = open('filepath.txt', 'rt'). You can choose whichever variable name you'd like, but keep it reasonable. This object represents Python's knowledge of the file, not the file contents. You'll only use this object to load the data from it, so make the variable name clear and different from what you would call the actual data content.
Now that you have the infile object, you can use a variety of methods for reading the content. .read() is a pretty common and safe one. It reads the entire file and returns the content back as a string (when opened in rt mode).
There are many ways of reading through the content.
Finally, you want to close the file using infile.close(). While you can get away with not closing a file opened for reading, it isn't great practice. However, forgetting to close the file during writing can cause the data to not appear in the file. There is no variable assignment with the close line.
In sum:
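Something along these lines, with filepath.txt as a placeholder path:

```python
# 1. open the file and get a fileIO object
infile = open('filepath.txt', 'rt')

# 2. read the entire contents into a string
text = infile.read()

# 3. close the file (no variable assignment here)
infile.close()
```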
Style 2: the with block
The with keyword has a larger meta purpose, but it need not be fully understood to use in this context. It is commonly used to read data from a file, and the with block will automatically close the file for you. You'll notice that there are several similar elements to the previous style, but no close statement. This is the more common style because the file reading process is often "one and done".
The with ... as ... structure is similar to infile = open(...) but within a block. Everything you want to do to infile must then happen within the with block itself. Once the with block closes, infile is closed and inaccessible.
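A sketch of the pattern, again with a placeholder file name:

```python
# the file is opened, read, and automatically closed when the block ends
with open('filepath.txt', 'rt') as infile:
    text = infile.read()

# infile is closed at this point, but text is still available
print(len(text))
```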
Style 3: pathlib
Once you have the file in question made into a Path object, you can use the .read_text() method to read the text. For example:
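A minimal sketch, with filepath.txt as a placeholder:

```python
from pathlib import Path

# turn the file path into a Path object, then read the whole file as a string
text = Path('filepath.txt').read_text(encoding='utf-8')
```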
While setting up pathlib stuff can add extra text at the beginning, it has many really valuable tools for file processing once your paths are loaded.
But [other package] lets me do it in one line
Okay?
Nothing wrong with that, but remember that hyper-optimized utilities won't suit all situations. Not all data are clean and ready for things like pandas. So sure, use those! But remember that knowing the basics is also an important part of the game.
A few skills to remember
Investigate what the module wants from you
There isn't an exact science behind figuring out how a module wants to load data. Every module and function you work with to load data will have some expectations for what you give it. You need to review the documentation to see what those are. There are a few common patterns, outlined in the prior section. When starting up with a new module, look for those expectations in the examples and documentation.
Remember that file paths are really just strings
Even if you are using something fancy and new, like pathlib, you are basically constructing fancy strings and getting strings back. Remembering that file paths are strings should remind you that you can use cool string methods to construct them. You should use a module like pathlib for handling the folder delimiters, but you can use string methods to construct the content of the file names needed, detect things about the files, etc.
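For example, a small sketch with hypothetical record IDs and a hypothetical data folder:

```python
from pathlib import Path

base = Path('data')

# string methods build the file names; pathlib handles the folder delimiters
for record_id in ['A101', 'B202']:
    filename = record_id.lower() + '.txt'
    full_path = base / filename
    print(full_path, full_path.suffix)
```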
Sometimes the file names are all that you'll see, so make them informative (and unique)
While you could give all your file names just numbers, like 00001.txt etc., if a file is producing an error later on, you may not be able to easily connect that file name back to the source data. For example, if you are doing web scraping, you may need to go back to the original to check on what happened. You can always open the local file, find the id, and then go, but putting that ID information within the file name saves you a step. Oftentimes, you can also extract this ID out of the file name and use that for later transformations or file naming, so this can even save you the step of an extra file read.
Remember that the file names also need to be unique, or else the existing file will just be overwritten. You can combine some "should be unique?" piece of data about the item along with natural numbers to protect against these cases.
Making folders can also be helpful to keep data separated, but think about putting the folder name info in the file name as well. Just in case things get disconnected or you save stuff to the wrong folder.
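One hypothetical way to fold an ID and a counter into a file name:

```python
# combine a (hopefully unique) ID from the data with a counter to guard against collisions
item_ids = ['record-42', 'record-43']      # hypothetical IDs pulled from the source data
for i, item_id in enumerate(item_ids):
    filename = f'results_{item_id}_{i:05d}.txt'
    print(filename)                        # e.g. results_record-42_00000.txt
```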
Your backup and indexing processes may hate you
Making many thousands of files in bulk or storing millions of files in a single folder can make your system indexing mad for a bit. Every computer is a bit different, but the following things can make your system run better:
Leave it on overnight so the indexing can work.
Ensure that your code is closing the files and connections.
There are usually commands you can use to shut down your system indexing (remember to turn it back on at some point).
Quit GitHub Desktop (if using it) during the harvest, because it will keep trying to auto-update its view of the long file list.
Quit your cloud backups, if running. They are attempting to index things as well. Be sure to restart them when done.
Reading in text files
Code for loading in data:
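A sketch of the pattern, with mytext.txt standing in for your file:

```python
# open the file, read everything into a single string, and let the block close it
with open('mytext.txt', 'rt', encoding='utf-8') as infile:
    text = infile.read()
```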
Once you have the text variable, you are done with infile. You will use the text variable to process the contents. Here are a few examples of common text processing tasks and how they can build up on each other:
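These assume the text variable from the block above; the word counting is just an illustration:

```python
from collections import Counter

# split the text into lines, and into whitespace-separated words
lines = text.split('\n')
words = text.split()

# normalize case so 'The' and 'the' count as the same word
lower_words = [word.lower() for word in words]

# count how often each word appears
word_counts = Counter(lower_words)
print(word_counts.most_common(10))
```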
Reading CSV files
There's a lot of boilerplate here.
What you start with: a CSV file, or other tabular file with text delimiters.
What you end up with: there are many options, but one of the standard ones is a list of lists. Each list contains a row of data, where each cell becomes a string. The header row (if you have one, you can omit this if you don't) will also be a list of strings.
What it's great at: this parses CSV files really well so long as they were created correctly. Issues like commas or other delimiters appearing within the content of the data are handled nicely, and I've never had any problems with it. Data files with lots of text and mixed data types will be loaded nicely.
What it isn't great at: there's nothing that this is terrible at, but there are things you will need to handle. One challenge is that the cell content will be separated from the header labels. There are other ways to read it in that will result in a dictionary or other structure. However, you can usually use the headers list to determine the position of where values are.
Example code:
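A sketch of that boilerplate, with mydata.csv as a placeholder file name:

```python
import csv

with open('mydata.csv', 'rt', encoding='utf-8', newline='') as infile:
    csvin = csv.reader(infile)
    headers = next(csvin)           # the first row: column labels as strings
    data = [row for row in csvin]   # every remaining row: a list of strings
```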
You need to provide the file path, but then headers and data will exist after this runs, each of which is a list.
Example accessing the individual cells:
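Continuing with the headers and data lists from above:

```python
print(headers)      # the list of column labels
print(data[0])      # the first row of data (a list of strings)
print(data[0][0])   # the first cell of the first row
```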
Example for how to handle the positioning issue:
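One common approach is to look up a column's position with .index() on the headers list; the 'title' column name here is hypothetical:

```python
# find the position of a column by its header label
title_position = headers.index('title')   # 'title' is a hypothetical column name

for row in data:
    print(row[title_position])
```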
Reading JSON files
JSON files are commonly used within APIs. This structure is a "tree" structure, meaning that nodes can have several branches. Think about authors on a book. There can be many authors, but with CSVs we are often trying to get that field of many values stored within a single cell. In JSON (and XML) you can have "authors" as a node and then the list of authors as a single value, but the individual author information separated within that list.
For example:
"human, a.; human, b" might be what you would see for a CSV.
"authors": ["human, a.", "human, b"] might be how you would see it within an CSV
This structure allows you to hold data with more detail and granularity (for example, you don't have to try and split a string on a delimiter). You are able to know the different values with certainty (or at least the certainty that it wasn't you who messed it up).
Reading more about the specification for the JSON standard is a must if you end up working with JSON a lot. Very generally, when a JSON file is parsed and loaded, it will end up as a combination of dictionaries and lists. These items will look a lot like dictionaries and lists in Python, and generally act like them 90% of the time, but there can be subtleties.
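A minimal sketch of the loading step, with example.json as a placeholder file name:

```python
import json

with open('example.json', 'rt', encoding='utf-8') as infile:
    records = json.load(infile)
```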
This tool takes an open fileIO object and parses it, returning back the JSON data but stored within a Python data structure. You will get either a dictionary back or a list (of dictionaries) back, depending on how the JSON file is structured.
You will then operate on that structure like a native Python data structure.