Module 7: Working with many data files
The two tasks
These two tasks are usually solved independently, and each needs to be correct before you put them together.
Be able to loop over the files that you need to work with
Be able to do the thing you need on one file
What's the order here? Depends on your starting position. However, the following is one of the safest ways to approach this.
Getting one thing right
Focus first on getting the individual process correct.
Develop the processing strategy on a single entity or file. This tends to be an ideal chance to create a custom function that processes a single unit of whatever you are dealing with.
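As a minimal sketch of what such a single-file function might look like, assume each unit is a plain text file and the "thing you need" is a one-line summary. The name summarize_file and the tab-separated output are hypothetical choices for illustration, not part of the module.

```python
from pathlib import Path


def summarize_file(path: Path) -> str:
    """Return one tab-separated line: file name, line count, word count.

    Hypothetical example task; swap in whatever your real processing is.
    """
    text = path.read_text(encoding="utf-8")
    n_lines = len(text.splitlines())
    n_words = len(text.split())
    return f"{path.name}\t{n_lines}\t{n_words}\n"
```

Try it on one file you know well, and check the result by hand before involving any loop.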
Then move on to the files, sort of
The first step here is not to just unleash things on all of your data.
Take a moment to focus on just getting the paths correct for your output files. You can do this by not creating the files in your code yet, but just printing the file paths. Seriously. This will save you a ton of grief. Why?
You don't want to:
make thousands of files in the wrong directory
accidentally make thousands of files when you meant to make a few
accidentally make millions of files when you meant to make one
do all of the above at the same time
Yes, it happens. Print the file paths before creating them.
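One way to do this dry run is sketched below. It builds every output path and prints it, and never creates a file or folder. The folder names data and results, the *.txt pattern, and the _summary.tsv naming are all assumptions made for the example.

```python
from pathlib import Path


def planned_outputs(input_dir: Path, output_dir: Path, pattern: str = "*.txt") -> list[Path]:
    """Build one output path per matching input file without creating anything."""
    return [
        output_dir / f"{src.stem}_summary.tsv"
        for src in sorted(input_dir.glob(pattern))
    ]


if __name__ == "__main__":
    # Dry run: only print where the files WOULD go.
    for out_path in planned_outputs(Path("data"), Path("results")):
        print(out_path)
```

Read the printed list before going further: is the count what you expected, and is every path in the directory you intended?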
Then make the files
This is where having a function for the single-file process helps. Call it inside your loop over the files, then write the content out.
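Putting the pieces together might look like the sketch below. The per-file step process_one is a stand-in (it only upper-cases the text), and the *.txt pattern and _processed.txt naming are assumptions for illustration.

```python
from pathlib import Path


def process_one(path: Path) -> str:
    """Hypothetical per-file step: return the file's text in upper case."""
    return path.read_text(encoding="utf-8").upper()


def process_all(input_dir: Path, output_dir: Path, pattern: str = "*.txt") -> list[Path]:
    """Run process_one on every matching file and write each result to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    written = []
    for src in sorted(input_dir.glob(pattern)):
        out_path = output_dir / f"{src.stem}_processed.txt"
        out_path.write_text(process_one(src), encoding="utf-8")
        written.append(out_path)
    return written
```

Returning the list of written paths makes it easy to confirm afterwards that the number of files matches what the dry run printed.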
Fun facts to remember
file paths are composed from strings, which are just text!
so use all your fun string tools to your advantage to create some cool stuff
you can make pathlib objects for files that don't exist
pathlib has many read/write tools
pathlib has a ton of boolean methods to check things about the path
pathlib also has a ton of ways to extract information from file paths
pathlib works really well with folders and has many tools for grabbing information about the files within that directory
also recursive file searching within a parent directory!
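A quick tour of these points, as a sketch: the path results/run_01/sample_A.summary.tsv is made up for the example, and the folder work happens inside a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# A path object can describe a file that does not exist yet.
planned = Path("results") / "run_01" / "sample_A.summary.tsv"

# Extracting information from the path itself
name = planned.name          # 'sample_A.summary.tsv'
stem = planned.stem          # 'sample_A.summary'
suffix = planned.suffix      # '.tsv'
suffixes = planned.suffixes  # ['.summary', '.tsv']
parent = planned.parent      # the results/run_01 folder, as a path object

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    nested = root / "a" / "b"
    nested.mkdir(parents=True)

    # Read/write tools
    (root / "top.txt").write_text("hello\n", encoding="utf-8")
    (nested / "deep.txt").write_text("world\n", encoding="utf-8")
    content = (root / "top.txt").read_text(encoding="utf-8")

    # Boolean checks
    nested_is_dir = nested.is_dir()
    missing_exists = (root / "nope.txt").exists()

    # Files directly in a folder vs. a recursive search below it
    top_level = sorted(f.name for f in root.glob("*.txt"))
    everything = sorted(f.name for f in root.rglob("*.txt"))
```

Here glob("*.txt") only sees top.txt, while rglob("*.txt") also finds deep.txt two folders down.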