Content clustering: going from 1D to 2D structures
Taking a flat list of data and creating dimensions out of it is a common task. This may be because you were given a flat list of things and need to clump/cluster/chunk those things into groups that carry more meaning.
Believe it or not, these things happen way more than you may imagine. Think about this question: I ask the class to name their favorite three fruits and I type them into one big list, each answer on one line.
We've lost the granularity of the individual answers here, but if we are sure that each group of three lines is one answer... then we can reconstruct it.
Warning: you must be very, very certain that using a static number like this is correct for your data. Sometimes it can seem like the pattern holds true, but if there's even the slightest chance that it doesn't, this isn't a good idea.
Now, there are many ways you may go about accomplishing this. We are going to use one involving: list accumulators, counters, and resetting values. This gives us a good chance to practice these things and solve this problem.
Let's first read this list in as some data:
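The original listing isn't shown here, so this is only a minimal sketch with made-up answers. The fruit names and the raw_text string are assumptions for illustration; only the variable name data comes from the text.

```python
# Hypothetical class answers: three fruits per student, one fruit per line.
raw_text = """apple
banana
cherry
mango
kiwi
grape
pear
plum
peach"""

# Split the text into a flat (1D) list, one entry per line.
data = raw_text.splitlines()
print(data)
```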
Now we have a list called data that we can work with.
Meeting the "collect and clear" temporary variable pattern
Think about packing eggs into cartons. Each carton can fit 12 eggs. Once you fill a carton up, it goes into a big box and you get a new empty carton. So we have this series of steps:
add eggs into carton until it's full
put carton into shipping box
get empty egg carton out
We're going to do a similar thing! Add items into our temporary list until it is "full" (three items, given our situation), then add it to our big box, and clear out our container. The tricky piece is deciding where to place the logic that checks whether this full condition has been met, and there are many patterns for doing so.
Changing things up, this means we have these general actions to take:
add item to temp
check if temp is full
add temp to collection area if full
clear out temp if we are starting fresh
The order you take these actions in will strongly shape how your results look and where the other actions have to go. Honestly, you can mix these up into any particular order, so long as you are careful about things. But that's the key: the weirder the order, the weirder your code will be.
We are going to follow the idea of: always be able to put something into temp, and check after every placement to see if it's full. A full temp needs to be added to the collection area and then cleaned out.
Create the two storage variables
all_chunks will hold all the groups of three, while temp_chunk will hold the three items as we collect them.
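In code, this could be as simple as two empty lists. This is a sketch that assumes plain Python lists are used for both containers.

```python
# Big container: will end up holding every finished group.
all_chunks = []

# Small container: holds the items of the group currently being filled.
temp_chunk = []
```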
Loop over the main list of data
We need to loop through the 1D list to access all the data points.
Always add an item to temp
Inside our for loop, let's add the item to our temp variable.
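A minimal sketch of this step, using a short made-up list in place of the real data:

```python
# Sample data (assumed for illustration).
data = ["apple", "banana", "cherry", "mango", "kiwi", "grape"]

temp_chunk = []

for item in data:
    # Every item goes into temp, no questions asked (yet).
    temp_chunk.append(item)

print(temp_chunk)
```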
Just leaving it like this and printing off temp_chunk after it would result in just a full copy of data in temp_chunk. We are missing the logic to check if our temp is full or not.
Check if our temp variable is full
After we add the item we check the length of temp_chunk to see how far we've gotten. Note the 3 in there, but this could be changed to a separate variable to hold this value.
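Here is a sketch of the check on its own, with the 3 pulled out into a variable. The name chunk_size and the sample data are assumptions for illustration.

```python
# Sample data (assumed for illustration).
data = ["apple", "banana", "cherry", "mango", "kiwi", "grape"]

chunk_size = 3  # how many items make temp "full"
temp_chunk = []

for item in data:
    temp_chunk.append(item)
    if len(temp_chunk) == chunk_size:
        # Nothing is collected or cleared yet, so temp keeps growing
        # and this line only runs once, right after the third item.
        print("temp is full:", temp_chunk)
```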
Collect up the temp and clear it out
Now we get to the heart of this pattern: we'll collect up our temp variable and then clear it out for the next round.
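Putting the steps together, the loop might look like the sketch below. The sample data and the chunk_size name are assumptions; all_chunks and temp_chunk are the names used in the text.

```python
# Sample data (assumed): nine fruits, three per answer.
data = ["apple", "banana", "cherry", "mango", "kiwi", "grape", "pear", "plum", "peach"]

chunk_size = 3
all_chunks = []
temp_chunk = []

for item in data:
    temp_chunk.append(item)
    if len(temp_chunk) == chunk_size:
        all_chunks.append(temp_chunk)  # collect first...
        temp_chunk = []                # ...then start over with a fresh, empty list

print(all_chunks)
# [['apple', 'banana', 'cherry'], ['mango', 'kiwi', 'grape'], ['pear', 'plum', 'peach']]
```

Reassigning temp_chunk to a new empty list is a deliberate choice in this sketch. Calling temp_chunk.clear() instead would also empty the group that was just appended, because both names would still point at the same list.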
We are calling append on our big container list and adding the current version of our temp variable in there. Once that's done, we can clear the temp out. This is where order can matter, as we don't want to clear anything out before we collect it up.
Likewise, keep your variable names clear. Getting mixed up between your source data list, collection list, and temporary list will create either errors or wild behavior within your code.
The final value
Printing out our collection list, we can see the items grouped into lists of three, which accomplishes our task.
A big limitation here is that we aren't doing anything about any extras that may be lying around. This is up to you. Whatever didn't fill a complete group will still be sitting in temp once the loop ends, so maybe you want to add those into your collection list once done, or handle them in another way.
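One possible way to handle the extras is to check temp once the loop has finished. This is a sketch under the assumption that a short final group is acceptable; the sample data has seven items so that one is left over.

```python
# Sample data (assumed): seven items, so one does not fit into a group of three.
data = ["apple", "banana", "cherry", "mango", "kiwi", "grape", "pear"]

chunk_size = 3
all_chunks = []
temp_chunk = []

for item in data:
    temp_chunk.append(item)
    if len(temp_chunk) == chunk_size:
        all_chunks.append(temp_chunk)
        temp_chunk = []

# After the loop, temp holds whatever did not fill a complete group.
# Only collect it if there is actually something in there.
if temp_chunk:
    all_chunks.append(temp_chunk)

print(all_chunks)
# [['apple', 'banana', 'cherry'], ['mango', 'kiwi', 'grape'], ['pear']]
```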
Sentinel variables
This pattern provides a lot of flexibility. You have full access to the data items and could inspect them as well in the process.
Taking this pattern a bit further, instead of a static value of three, maybe we have something more variable. For example, there could be a value within the data that indicates when something is done or not.
Sure, you could write something using the split method on the string of data and then break things apart like that. However, that presumes the incoming data will always be a string you can manipulate. You may have data coming in as rows or something else, where you need to extract something from each item to check it.
The thing that you are checking for is called the "sentinel" item. Sometimes it's a variable; sometimes it's just a piece of content.
Let's change the data up a little to add something like this in:
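The original data isn't shown, so the list below is a made-up stand-in. It assumes answers of varying length, with the string "done" after most of them and no "done" after the last one.

```python
# Hypothetical answers of varying length; "done" marks the end of an answer.
# Note that the last answer has no "done" after it.
data = [
    "apple", "banana", "done",
    "cherry", "done",
    "mango", "kiwi", "grape", "done",
    "pear", "plum",
]
print(data)
```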
Printing off data, we get a similar-looking list, just with a few extra "done" strings in there.
Adapting the sentinel
Going back to our original for loop, we can change our boolean conditional here to see if we are working with the word "done".
And printing off the results we get:
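As a sketch, here is the earlier loop with only the condition swapped, run against a made-up list (the sample data is an assumption):

```python
# Sample data (assumed): "done" ends an answer; the last answer has no "done".
data = ["apple", "banana", "done", "cherry", "done",
        "mango", "kiwi", "grape", "done", "pear", "plum"]

all_chunks = []
temp_chunk = []

for item in data:
    temp_chunk.append(item)
    if item == "done":  # previously: a length check against 3
        all_chunks.append(temp_chunk)
        temp_chunk = []

print(all_chunks)
# [['apple', 'banana', 'done'], ['cherry', 'done'], ['mango', 'kiwi', 'grape', 'done']]
```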
Success! Sort of. Changing this mechanism now allows the length of the temp variable to be different and unknown each time. However, each of our groups ends with "done". Maybe we want this! Nothing wrong with it, but generally collecting up that value isn't desired.
There are a few ways to solve this. We could use all_chunks.append(temp_chunk[:-1]) to have it skip the last item. We could also run temp_chunk.pop(-1) before appending to have it remove the last item. Both would be fine, but neither actually handles the real problem: we are always adding the item no matter what.
Then the question is, okay where do we add it in? Well, we need to think about there being two conditions: either done or not done. Done we want to collect and clear out, not done we want to add an item in.
Moving the item addition to temp into an else clause allows us to keep the majority of the code the same form, but protects the item collection from the done condition.
Running this and looking at our result:
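A sketch of the else version, again with made-up sample data:

```python
# Sample data (assumed): "done" ends an answer; the last answer has no "done".
data = ["apple", "banana", "done", "cherry", "done",
        "mango", "kiwi", "grape", "done", "pear", "plum"]

all_chunks = []
temp_chunk = []

for item in data:
    if item == "done":
        # Done: collect the group and start a fresh one.
        all_chunks.append(temp_chunk)
        temp_chunk = []
    else:
        # Not done: the item belongs to the current group.
        temp_chunk.append(item)

print(all_chunks)
# [['apple', 'banana'], ['cherry'], ['mango', 'kiwi', 'grape']]
```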
End checking
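So you'll notice that we are missing the last group at the end: the data doesn't finish with "done", so whatever was collected after the final "done" never makes it into all_chunks.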
We could handle it via:
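A sketch of the manual approach, assuming the same made-up sample data as before:

```python
# Sample data (assumed): "done" ends an answer; the last answer has no "done".
data = ["apple", "banana", "done", "cherry", "done",
        "mango", "kiwi", "grape", "done", "pear", "plum"]

all_chunks = []
temp_chunk = []

for item in data:
    if item == "done":
        all_chunks.append(temp_chunk)
        temp_chunk = []
    else:
        temp_chunk.append(item)

# Manually collect whatever is still sitting in temp after the loop.
# If data had ended with "done", temp would be empty here and an
# empty list would get appended.
all_chunks.append(temp_chunk)

print(all_chunks)
# [['apple', 'banana'], ['cherry'], ['mango', 'kiwi', 'grape'], ['pear', 'plum']]
```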
Just adding it manually to the end. But this isn't a great solution if you aren't sure there will be content or not.
A better way to do this is to check if you are looking at the last item.
Enumerate
One problem here is that we need access to the index position within our loop in addition to the content. True, we could add a counter here, but Python has a better way. We can easily thread the enumerate function into our loop structure without having to mess around too much with our content.
Calling enumerate with an iterable item will give back an iterator of tuple pairs, one pair per pass through the loop. The 0th item of each pair is the index position (or counter) and the 1st item is the content. Using multiple assignment syntax, we can unpack each pair, retain our loop variable name of item, and just add the index value into our code.
First, we add the function call in our for loop block opener:
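A minimal sketch of the loop opener, printing each pair so you can see what enumerate hands back (the short sample list is an assumption):

```python
# Sample data (assumed), kept short for readability.
data = ["apple", "banana", "done", "cherry"]

# enumerate yields (index, content) pairs; we unpack them right in the opener.
for index, item in enumerate(data):
    print(index, item)
# 0 apple
# 1 banana
# 2 done
# 3 cherry
```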
Now we can use index to check if we are at the last item. The next question is how? Index positions start at 0, so the last one is always one less than the length of the list. We can use something like index + 1 == len(data) to check for that, so that's taken care of. However, we need to address that we are in a bit of a different situation.
This isn't like the "done" situation, where we want to ignore that last piece of content; if it were, we wouldn't need to handle it at all. Instead, we are in a new condition: the wrap-up condition. Not only do we need to collect and clear up, but we also need to add that last item in before we do.
Splitting this into an elif clause lets us treat this condition separately from the binary of being done or not being done.
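Here is a sketch of how the three branches could fit together. The sample data is made up; the branch order follows the description above, with the "done" check first.

```python
# Sample data (assumed): "done" ends an answer; the last answer has no "done".
data = ["apple", "banana", "done", "cherry", "done",
        "mango", "kiwi", "grape", "done", "pear", "plum"]

all_chunks = []
temp_chunk = []

for index, item in enumerate(data):
    if item == "done":
        # Done: collect the group and start a fresh one.
        all_chunks.append(temp_chunk)
        temp_chunk = []
    elif index + 1 == len(data):
        # Wrap-up: last item and it is real content, so add it, then collect.
        temp_chunk.append(item)
        all_chunks.append(temp_chunk)
    else:
        # Not done: the item belongs to the current group.
        temp_chunk.append(item)

print(all_chunks)
# [['apple', 'banana'], ['cherry'], ['mango', 'kiwi', 'grape'], ['pear', 'plum']]
```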
You may see that I didn't clear out the temp variable. That's your choice. In this example I don't need it after, so I can leave it as is. However, you may want to clear it out in more complex code in case you use it later. This would reset it back to the initial value, and if you end up reusing the variable name or something you won't run into problems.
Now our final output includes every group, with the items after the last "done" collected as a final group of their own.