Module 7: Working with many data files

The two tasks

These two tasks are usually solved independently, and each one needs to work on its own before you put them together.

  • Be able to loop over the files that you need to work with

  • Be able to do the thing you need on one file

Which one comes first? That depends on your starting position. However, the order below is one of the safest ways to approach this.

Getting one thing right

Focus first on getting the individual process correct.

Develop the processing strategy on a single entity or file. This tends to be an ideal chance to create a custom function that processes a single unit of whatever you are dealing with.
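As a minimal sketch of what "one function for one unit" can look like, the example below wraps a made-up processing step in a function and tries it on a single throwaway file. The function name `summarize_file`, the word and line counts, and the sample file are all hypothetical stand-ins for whatever your real processing is.

```python
from pathlib import Path
import tempfile


def summarize_file(path: Path) -> dict:
    """Return a small summary of one text file (hypothetical processing step)."""
    text = path.read_text(encoding="utf-8")
    return {
        "name": path.name,
        "lines": len(text.splitlines()),
        "words": len(text.split()),
    }


# Try the function on ONE file before thinking about loops.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"  # throwaway file for the demo
    sample.write_text("first line\nsecond line here\n", encoding="utf-8")
    summary = summarize_file(sample)
    print(summary)
```

Once the function behaves on one file, the loop only has to decide which paths to hand to it.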

Then move on to the files, sort of

The first step here is not to just unleash things on all of your data.

Take a moment to focus on just getting the paths correct for your output files. You can do this by not creating the files in your code yet, but just printing the file paths. Seriously. This will save you a ton of grief. Why?

You don't want to:

  • make thousands of files in the wrong directory

  • accidentally make thousands of files when you meant to make a few

  • accidentally make millions of files when you meant to make one

  • do all of the above at the same time

Yes, it happens. Print the file paths before creating them.
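Here is one way a "print first" dry run might look. The input file names, the `data/processed` output folder, and the `_clean` suffix are assumptions made up for illustration. Nothing is written to disk: the code only builds the output paths and prints them.

```python
from pathlib import Path

# Hypothetical inputs and output folder; none of these need to exist yet.
input_paths = [Path("data/raw/site_a.csv"), Path("data/raw/site_b.csv")]
output_dir = Path("data/processed")

# Build every output path up front, without creating anything.
planned = [output_dir / f"{p.stem}_clean{p.suffix}" for p in input_paths]

for src, dst in zip(input_paths, planned):
    print(f"{src} -> {dst}")

# A count is a cheap check against "thousands when you meant a few".
print(f"{len(planned)} files would be created")
```

If the printed paths and the count look right, you can swap the `print` for the real write step.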

Then make the files

This is where having a function for the processing step helps. You can call it inside your loop over the files, then write the content out.
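The sketch below puts the pieces together under some assumptions: a placeholder function `shout` stands in for your real processing, and the `raw` and `processed` folders live in a temporary directory so the example cleans up after itself. The shape is the part that matters: loop over inputs, call the function, write the result.

```python
from pathlib import Path
import tempfile


def shout(text: str) -> str:
    """Placeholder processing step (hypothetical): upper-case the text."""
    return text.upper()


with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "raw"
    out_dir = Path(tmp) / "processed"

    # Set up two small input files for the demo.
    in_dir.mkdir()
    for name in ("a.txt", "b.txt"):
        (in_dir / name).write_text(f"contents of {name}\n", encoding="utf-8")

    # Create the output folder once, then loop.
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.txt")):
        dst = out_dir / f"{src.stem}_processed{src.suffix}"
        dst.write_text(shout(src.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(dst.name)

    first_output = (out_dir / "a_processed.txt").read_text(encoding="utf-8")

print(written)
```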

Fun facts to remember

  • file paths are composed from strings, which are just text!

    • so use all your fun string tools to your advantage to create some cool stuff

  • you can make pathlib objects for files that don't exist

  • pathlib has many read/write tools

  • pathlib has a ton of boolean methods to check things about the path

  • pathlib also has a ton of ways to extract information from file paths

  • pathlib works really well with folders and has many tools for grabbing information about the files within that directory

  • also recursive file searching within a parent directory!
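A few of these facts can be sketched in one short example. The path `results/2024/run_01.csv` and the file names are made up, and the directory listing runs against a temporary folder so the output does not depend on your machine.

```python
from pathlib import Path
import tempfile

# A Path can describe a file that does not exist (hypothetical path).
planned = Path("results") / "2024" / "run_01.csv"

# Extract information from the path itself.
parts = (planned.name, planned.stem, planned.suffix, planned.parent.as_posix())

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "top.txt").write_text("hello", encoding="utf-8")  # write tool
    (root / "sub" / "nested.txt").write_text("world", encoding="utf-8")

    # Boolean checks about a path.
    checks = (
        root.is_dir(),
        (root / "top.txt").is_file(),
        (root / "missing.txt").exists(),
    )

    # Direct children of a folder vs. a recursive search below it.
    top_level = sorted(p.name for p in root.iterdir())
    all_txt = sorted(p.name for p in root.rglob("*.txt"))

    content = (root / "top.txt").read_text(encoding="utf-8")  # read tool

print(parts, checks, top_level, all_txt, content)
```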
