Module 1: data in the wild

All data is created somewhere.

Sometimes you're the one making the data, the one advising on how it should be made, or the one stuck with whatever happened to it before you got there.

Spreadsheets are where many datasets are born and where many end up.

The problem is that each cell needs to hold exactly one thing: a single answer to a single question about a single object.

We will be taking a look at several spreadsheets as a starting point.

  • The Library Carpentry spreadsheet has workshop data: https://librarycarpentry.org/lc-spreadsheets/setup.html

Review:

  • tabular data structures

  • rows and columns

  • "cells" and what can be in a cell

  • why pandas is valuable here, given how data loads into arrays

  • why not-pandas is also valuable

  • not all data models work for this kind of structure

  • the decisions required for such a structure

  • how humans interact with data
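As a quick refresher on rows, columns, and cells, here is a minimal sketch using only Python's standard `csv` module (one "not-pandas" option; `pandas.read_csv` would be the pandas route). The data, the column names, and the helper `to_int_or_none` are invented for illustration and do not come from the Library Carpentry spreadsheet.

```python
import csv
import io

# Hypothetical workshop-style data, made up for this example.
RAW = """date,location,attendees
2019-03-01,Main Library,12
2019-03-08,Science Library,
"""

def load_rows(text):
    """Read CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(RAW)

# A "cell" is the value found at one row and one column.
first_cell = rows[0]["location"]

# The csv module leaves every cell as a string. Deciding what an empty
# cell means (missing? zero?) is one of the decisions left to us.
def to_int_or_none(cell):
    """Assumption for this sketch: an empty cell means 'not recorded'."""
    return int(cell) if cell.strip() else None

attendees = [to_int_or_none(row["attendees"]) for row in rows]
print(first_cell, attendees)
```

Note that nothing in the file itself says whether the empty cell is a zero or a missing value; that choice is made by whoever writes the loading code.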

Evaluate:

  • How readable is the spreadsheet (for a human)?

  • How usable is the spreadsheet for data entry? For example, can you imagine working through a large stack of papers and entering the data quickly and easily?

Break into small groups and discuss the following:

  • Explore the data a little and discuss it with your group. If you were to combine everything into a single table of data, what would the columns be? (Don't do it yet; just discuss how to capture all the data you see in a single set of columns.) Try to follow the suggestions here (https://librarycarpentry.org/lc-spreadsheets/01-format-data/index.html) about how to organize the data.

Make a new copy of the spreadsheet. Cut and paste the data into this new format, checking and updating your column names as you go.
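The same reshaping can be sketched in code. The example below assumes a common spreadsheet pattern: several small tables on one sheet, with a piece of information (here, the year) living in a title above each block instead of in a column. The table contents, the `COLUMNS` list, and the `combine` function are all hypothetical and only illustrate the idea of one shared set of columns.

```python
# Two hypothetical sub-tables as they might sit side by side on one sheet,
# keyed by the year that appeared in the title above each block.
tables = {
    "2018": [{"workshop": "Intro to Data", "attendees": 10}],
    "2019": [
        {"workshop": "Intro to Data", "attendees": 14},
        {"workshop": "Spreadsheets", "attendees": 9},
    ],
}

# The single set of columns agreed on for the combined table.
COLUMNS = ["year", "workshop", "attendees"]

def combine(tables):
    """Flatten the sub-tables into one list of rows sharing COLUMNS."""
    combined = []
    for year, rows in tables.items():
        for row in rows:
            # Move the year out of the block title and into its own column.
            record = {"year": int(year), **row}
            # Keep a fixed column order; missing values become None.
            combined.append({col: record.get(col) for col in COLUMNS})
    return combined

combined = combine(tables)
for row in combined:
    print(row)
```

The key move is that context a human reads from the layout (which block a row sits in) becomes an explicit value in every row.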

  • How parsable is the data by code? Imagine you could write any rule you could come up with for telling a program to load the data.

    • Is there a succinct set of rules that applies to the entire sheet or sheets? Or do you need custom rules for pretty much anything?

    • Can the data values be connected without worry? Or do you need to rely on context clues to join the data up?

  • Create a flowchart (hand drawn, PowerPoint, Lucidchart, whatever) to outline what such a program might do. Reference the Wikipedia article on flowcharts if you need help with formatting this. You may use any formatting you like so long as the actions and decisions are clear.
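To make the idea of "rules for loading" concrete, here is one possible sketch of a rule-based loader, where each `if` corresponds to a decision diamond in a flowchart. It assumes a messy sheet exported as CSV with a title line, a blank row, a header, data rows, and a trailing note. The sample text, `EXPECTED_HEADER`, and `parse_sheet` are invented for illustration; a real sheet will likely need different rules.

```python
import csv
import io

# Hypothetical messy sheet exported as CSV, made up for this example.
RAW = """Workshop attendance,,
,,
date,location,attendees
2019-03-01,Main Library,12
2019-03-08,Science Library,7
Note: counts are approximate,,
"""

# Assumption: we already know what the real header row looks like.
EXPECTED_HEADER = ["date", "location", "attendees"]

def parse_sheet(text):
    """Apply three simple rules to every row and sort rows into two piles."""
    header = None
    records = []   # rows that parsed cleanly as data
    skipped = []   # rows that need a human (titles, notes, partial rows)
    for cells in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in cells]
        # Decision 1: is the row blank? If so, ignore it.
        if not any(cells):
            continue
        # Decision 2: have we found the header yet?
        if header is None:
            if cells == EXPECTED_HEADER:
                header = cells
            else:
                skipped.append(cells)  # title or other preamble
            continue
        # Decision 3: is every column filled in? If so, treat it as data.
        if len(cells) == len(header) and all(cells):
            records.append(dict(zip(header, cells)))
        else:
            skipped.append(cells)
    return records, skipped

records, skipped = parse_sheet(RAW)
print(len(records), "data rows;", len(skipped), "rows set aside")
```

If the `skipped` pile grows large, or each sheet needs its own `EXPECTED_HEADER` and its own exceptions, that is a sign the data needs custom rules for nearly everything.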
