Module 4: Working with files and file paths in Python

The first step to dealing with the content of a file is to open it.

Related readings: https://python-textbok.readthedocs.io/en/1.0/Python_Basics.html#files

Python has many ways of opening files and constructing file paths. Which ones you use comes down to personal choice and the task at hand.

Here are some important facts that are shared pretty universally:

  • file names are strings, or are constructed from strings

    • these often live within variables

    • you can construct a new filename by constructing a string of what it should be

  • there are two main ways of opening and reading a text file:

    • some form of the built-in open() function

    • using Path object methods

  • opening a file with a certain mode (e.g. read or write) does not read the content of the file

  • open() returns a new object that has special methods related to the mode that it was opened in

  • The open() function is one of the more flexible constructs, but usually requires more hand holding

  • there are many "modes" to open a file, dictated by the letter in the second argument. Generally, these will be read/write/append, but there are others with more specific behavior.

  • The key power to using the open() function to read/write files is having access to the

The with notation

with open('myfile.txt', 'r') as infile:
    text = infile.read()

There's some interesting nuance to the with structure, but to summarize: this syntax will open the file for you and close it for you once the code within the block completes.

Using with is really nice because it saves the explicit close() step. It also closes the file even if an error occurs inside the block, though that benefit usually matters only in more advanced situations.
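As a minimal sketch of the points above, the following writes, appends to, and then reads a small file. The scratch folder from tempfile and the file contents are assumptions made for illustration; only open(), the mode letters, and the with block come from the text.

```python
import os
import tempfile

# Hypothetical scratch folder so the sketch does not touch real files.
workdir = tempfile.mkdtemp()
filename = os.path.join(workdir, 'myfile.txt')

# 'w' creates the file (or empties an existing one) for writing.
with open(filename, 'w') as outfile:
    outfile.write('first line\n')

# 'a' adds to the end of the file instead of overwriting it.
with open(filename, 'a') as outfile:
    outfile.write('second line\n')

# 'r' opens for reading; nothing is read until a read method is called.
with open(filename, 'r') as infile:
    text = infile.read()

print(text)
print(infile.closed)  # True: the with block closed the file for us
```

Note that the file name here is still just a string built from other strings, which is why it can be handed straight to open().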

Pathlib

The pathlib module takes working with files in a different direction. This is an explicitly object-oriented way of handling files. Instead of manually constructing the strings (you still need to construct the actual file name and extension), you use objects, methods, and operators to do the rest of the work.

For example, where you might have said 'data/0001.csv' in the past, you would have the containing folder be one object and the file be another object. And then you can say something like target / outfile to form the full path.

It can be more work to get set up, but can make a lot of work later in the program so much easier.
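Here is a small sketch of the target / outfile idea. The names target and outfile follow the text; the second file name, '0002.csv', is a made-up example.

```python
import pathlib

target = pathlib.Path('data')        # the containing folder
outfile = pathlib.Path('0001.csv')   # the file name

# The / operator joins path pieces into a new Path object.
full_path = target / outfile
print(full_path)

# A plain string also works on the right-hand side of /.
also_full = target / '0002.csv'
print(also_full.name)  # 0002.csv
```

Nothing is created on disk here; the objects only describe where a file would live.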

Path objects also have methods for very common file activities. For example, you can directly check if a file exists, delete files, and ask for information about the path and file name. Not only does this simplify things in later portions of your code, but it also makes your code automatically adjust across platforms. Windows and Unix systems use different path separators (\ versus /). Other modules attempted to provide ways to work through this, but they weren't as generally useful as this module.
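The sketch below tries out a few of those methods in a temporary folder. The folder, the file name notes.txt, and the use of touch() to create an empty file are assumptions for illustration and are not part of the original module.

```python
import pathlib
import tempfile

# Hypothetical scratch folder so nothing real is deleted.
workdir = pathlib.Path(tempfile.mkdtemp())
p = workdir / 'notes.txt'

existed_before = p.exists()   # False: nothing has been created yet
p.touch()                     # create an empty file at that path
existed_after = p.exists()    # True

# Ask the path about itself.
print(p.name, p.stem, p.suffix)  # notes.txt notes .txt
print(p.parent == workdir)       # True

p.unlink()                    # delete the file
print(existed_before, existed_after, p.exists())
```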

A semi secret bonus of pathlib is that the official documentation page is extremely good! Lots of great examples out there.

Importing pathlib

There are several ways of importing modules in Python.

  • import pathlib will give you access to all the functions within the module, but you have to use them as such: pathlib.functionname().

  • from pathlib import Function is a way to import a single function directly into the namespace. This allows you to use it as such: Function(). Note that you cannot directly observe where the function came from.

  • import pathlib as pl will import the module with the name pl. This means you will need to call functions as such: pl.function() and that pathlib.function() will not work.

  • from pathlib import * is a way to import all the functions from a module into the namespace. This can lead to odd behavior, such as imported names silently replacing names you already defined, and should be used sparingly.

Most of the time you should use import pathlib.
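To make the differences concrete, this sketch uses three of the import styles side by side. It assumes Path, the class pathlib is best known for, as the imported name; the star import is left out on purpose.

```python
import pathlib            # use as pathlib.Path(...)
import pathlib as pl      # use as pl.Path(...)
from pathlib import Path  # use as Path(...), origin no longer visible

a = pathlib.Path('data')
b = pl.Path('data')
c = Path('data')

# All three names reach the same class, so the results are equal.
print(a == b == c)  # True
```

In a real script you would pick one style and stick with it; they are combined here only for comparison.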

Making Path objects

You will use pathlib.Path(file or folder) to make a path object. The syntax does not change if it is a folder or a file.
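A brief sketch of that point, using the folder and file names from earlier in this module as examples. Neither path needs to exist for this to run.

```python
import pathlib

folder = pathlib.Path('data')              # a folder
file_path = pathlib.Path('data/0001.csv')  # a file

# Both are made the same way and are the same kind of object.
same_type = type(folder) is type(file_path)
print(same_type)  # True

# The file's parent is the folder it sits in.
print(file_path.parent == folder)  # True
```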

Looping over stuff

You can use the glob method on a path to search within it. Then loop over those results.

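import pathlib

p = pathlib.Path('.')  # assumption: search the current folder

print(p.glob('*'))  # prints a generator object, not the matches
results = list(p.glob('*'))  # list() collects the matches
for subp in results:
    print(subp, subp.is_dir())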
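The pattern passed to glob can also narrow the search. As a sketch, the following builds a temporary folder with a few made-up files and keeps only the ones ending in .csv. The file names and the use of touch() and sorted() are assumptions added for illustration.

```python
import pathlib
import tempfile

# Hypothetical folder with a few empty files in it.
p = pathlib.Path(tempfile.mkdtemp())
for name in ['data1.csv', 'data2.csv', 'stuff.txt']:
    (p / name).touch()

# '*.csv' matches only names ending in .csv; sorted() gives a stable order.
csv_paths = sorted(p.glob('*.csv'))
for subp in csv_paths:
    print(subp.name)
```

glob makes no promise about the order of its results, which is why the sketch sorts them before looping.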

Adding conditionals in

There are several relevant conditionals you can use to work with these pathlib objects. These come in the form of boolean methods for pathlib objects. These kinds of methods will return True or False.

These methods work on existing Path objects. Recall that Path objects can be made for files that may or may not exist. For example, you might make one for a file that you want to propose making, but only want to create it if it doesn't already exist. Same thing for a folder.

As you read these descriptions, remember that "the path" is the location plus file name you created the Path object with. "Points to" means the location that path resolves to. The official docs use the phrasing "path points to" to describe whatever sits at the location the path resolves to.

Presuming that p is an existing Path object, we could do:

p.exists() returns True if the path points to an item that exists (otherwise False)

p.is_dir() returns True if the path points to a directory (otherwise False)

These two are some of the most important methods we could use.

Let's work through an example that's pretty common if you are working with an API. Many times these processes need to run for multiple days if not weeks, so starting and stopping the script is pretty normal. Given that you don't want to run through the same files or content you've already harvested, you can use your existing files as the log of what has been completed.

This can be as simple as leaving your code like normal, but adding in a few check points that only allow the request and data pull to actually happen if the file does not exist.
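A minimal sketch of that checkpoint idea follows. The function fetch_record, the record ids, and the record_N.txt naming scheme are all hypothetical stand-ins for a real API request and its output; only the exists() check comes from the text.

```python
import pathlib
import tempfile

def fetch_record(record_id):
    """Hypothetical stand-in for a real API request."""
    return f"content for record {record_id}"

# Hypothetical output folder where one record was saved on an earlier run.
outfolder = pathlib.Path(tempfile.mkdtemp())
(outfolder / 'record_2.txt').write_text('already harvested')

fetched = []
for record_id in [1, 2, 3]:
    outpath = outfolder / f'record_{record_id}.txt'
    if outpath.exists():
        continue  # already have this one, skip the request
    outpath.write_text(fetch_record(record_id))
    fetched.append(record_id)

print(fetched)  # [1, 3]
```

Because the saved files act as the log, stopping and restarting the script picks up where it left off without any extra bookkeeping.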

These tasks often need folders created as well, and we can check both.

import pathlib

p = pathlib.Path('data')

print(p.exists())
print(p.is_dir())
False
False

At this point we haven't created this folder yet, so exists() and is_dir() both give us False. We can add a little check in here to create the folder if it doesn't already exist. Here's our updated code:

import pathlib

p = pathlib.Path('data')

if not p.exists():
    p.mkdir()

print(p.exists())
print(p.is_dir())
True
True

Experiment: take the boolean statement out of there and rerun the code. When you try to make a directory that already exists, you get this error: FileExistsError: [Errno 17] File exists: 'data'. Alternatively, you can use mkdir with a special flag to not raise this error if the folder exists.

import pathlib

p = pathlib.Path('data')

p.mkdir(exist_ok=True) # flag added

print(p.exists())
print(p.is_dir())
True
True

There are good reasons for doing this both ways. You may need to do more than just make the folder if it doesn't exist, in which case having the conditional section will give you that space. However, the flag to ignore the error is good enough if you only need to make it if it doesn't already exist.
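The two approaches can be sketched side by side. The temporary base folder, the logs folder, and the README.txt written on first creation are assumptions made up to show the kind of extra work the conditional leaves room for.

```python
import pathlib
import tempfile

base = pathlib.Path(tempfile.mkdtemp())  # hypothetical scratch location

# Conditional version: room for extra setup when the folder is new.
p = base / 'data'
if not p.exists():
    p.mkdir()
    (p / 'README.txt').write_text('created by the harvest script')
    created_now = True
else:
    created_now = False

# Flag version: just make sure the folder is there.
q = base / 'logs'
q.mkdir(exist_ok=True)
q.mkdir(exist_ok=True)  # calling it again raises no error

print(created_now, q.is_dir())  # True True
```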

List accumulator patterns

Meanwhile, checking whether something is a directory may seem like an odder task, but it is useful when a folder holds mixed content. You may want to work through a whole folder and collect only some of its items. The contents may be mixed because users put different things there, or because a programming error left the file extension off some files. File extensions are optional, so you can't always tell from the path alone whether an item is a file or a folder. Here we separate the paths that are files from those that are folders, but you could use a similar pattern to separate other file types.

Let's look inside this data folder and check the items found inside.

import pathlib

datafolder = pathlib.Path('data')
files = datafolder.glob('*')

for f in files:
    print(f)
data/stuff.txt
data/data3.csv
data/data2.csv
data/data1.csv
data/docs
data/codebook
data/data4

The output here shows that docs, codebook, and data4 all lack extensions. Yes, you could open the folder and visually inspect which items are folders. However, this may not always be possible, or easy at scale.

We can use a combination of loops, conditionals, and list accumulator patterns to collect these up. We want to collect two lists: one with all the paths that are folders and the other containing everything else (being the files).

import pathlib

datafolder = pathlib.Path('data')
files = datafolder.glob('*')

all_files = []
all_folders = []

for f in files:
    print(f.is_dir())

We've got a good baseline form here. The empty lists have been added that will hold the relevant paths and I'm printing out the results of the boolean method. Next I'll add in code for the conditional statements I need to check if they are a folder or a file.

import pathlib

datafolder = pathlib.Path('data')
files = datafolder.glob('*')

all_files = []
all_folders = []

for f in files:
    if f.is_dir():
        print("a folder:", f)
    else:
        print("a file:", f)
a file: data/stuff.txt
a file: data/data3.csv
a file: data/data2.csv
a file: data/data1.csv
a folder: data/docs
a file: data/codebook
a file: data/data4

Now that I'm correctly filtering and labeling these items, I can add the code into the conditional areas that will place each path object into the appropriate lists.

import pathlib

datafolder = pathlib.Path('data')
files = datafolder.glob('*')

all_files = []
all_folders = []

for f in files:
    if f.is_dir():
        # print("a folder:", f)
        all_folders.append(f)
    else:
        # print("a file:", f)
        all_files.append(f)

print("the files:", all_files)
print("the folders:", all_folders)
the files: [PosixPath('data/stuff.txt'), 
                PosixPath('data/data3.csv'), 
                PosixPath('data/data2.csv'), 
                PosixPath('data/data1.csv'), 
                PosixPath('data/codebook'), 
                PosixPath('data/data4')]
the folders: [PosixPath('data/docs')]

The list accumulator pattern allows us to collect items we want in a list. Usually there's some amount of checking etc. performed so the items collected are a subset of the larger original, but the accumulator may also be used to collect new versions of the content that have been changed in some substantial way.
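As a sketch of both uses, the loop below keeps only the .csv paths (a subset) and, in a second list, collects a changed version of each name. The temporary folder, the file names, and the choice of upper-casing the stem are assumptions for illustration.

```python
import pathlib
import tempfile

# Hypothetical folder with a mix of empty files.
datafolder = pathlib.Path(tempfile.mkdtemp())
for name in ['stuff.txt', 'data1.csv', 'data2.csv', 'codebook']:
    (datafolder / name).touch()

csv_files = []   # accumulator for a subset of the paths
csv_stems = []   # accumulator for changed versions of the names

for f in sorted(datafolder.glob('*')):
    if f.suffix == '.csv':
        csv_files.append(f)
        csv_stems.append(f.stem.upper())

print(csv_stems)  # ['DATA1', 'DATA2']
```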

Making multiple files

Now that we can loop over file paths and have explored some methods, we can also explore how to programmatically create things from within these loops.

All these examples presume that p is a Path object.

File creation tools

This won't work for all things you may need to do with files, but there are two nice file creation methods that can come in really handy.

Suppose you have p, a Path object pointing to a text file you would like to create, and text, a string of content to appear within the file. Use p.write_text(text) to write that file out. All the opening and closing is done for you.

import pathlib

text = "this is a sentence"
p = pathlib.Path('data/document.txt')  # the data folder must already exist
p.write_text(text)  # opens, writes, and closes the file
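To check the result, a sketch like the following writes the text and reads it straight back. The temporary folder is an assumption, and read_text(), the reading counterpart of write_text(), is an addition not covered in the original module.

```python
import pathlib
import tempfile

folder = pathlib.Path(tempfile.mkdtemp())  # hypothetical scratch folder
p = folder / 'document.txt'

text = "this is a sentence"
count = p.write_text(text)   # returns the number of characters written
readback = p.read_text()     # opens, reads, and closes in one step

print(count, readback == text)  # 18 True
```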

Folder creation tools

As seen before, using p.mkdir() will create a directory at the specified path.

Creating the names

Recall that names are just strings, so you can connect them inside your loops and feed them into the Path objects as you create them.
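One way this could look is sketched below: the loop builds each file name from string pieces, joins it to a folder, and writes a small file. The folder location, the dataN.csv naming scheme, and the header line written into each file are assumptions for illustration.

```python
import pathlib
import tempfile

# Hypothetical data folder inside a scratch location.
target = pathlib.Path(tempfile.mkdtemp()) / 'data'
target.mkdir(exist_ok=True)

created = []
for i in range(1, 4):
    name = 'data' + str(i) + '.csv'   # the name is just a string
    p = target / name                 # feed it into a Path object
    p.write_text('id,value\n')        # create the file with a header line
    created.append(p.name)

print(created)  # ['data1.csv', 'data2.csv', 'data3.csv']
```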

Putting it together
