Content clustering: going from 1D to 2D structures

Taking a flat list of data and giving it some structure is a common task. You may have been handed a flat list of things and need to clump, cluster, or chunk them into groups that carry more meaning.

Believe it or not, this happens far more often than you may imagine. Picture this situation: I ask the class to name their three favorite fruits and I type the answers into one big list, one fruit per line.

pear
apple
banana
durian
banana
blueberry
apple
banana
strawberry
blueberry
dragonfruit
pear
apple
peach
banana

We've lost the granularity of the individual answers here, but if we were sure that each group of three lines is one answer, then we could reconstruct it.

Warning: you must be very, very certain that using a static number like this is correct for your data. Sometimes it can seem like the pattern holds true, but if there's even the slightest chance that it doesn't, this isn't a good idea.
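One cheap sanity check, sketched here on a shortened list, is to confirm that the number of lines divides evenly by the group size. The names answers and group_size are made up for this example. A passing check doesn't prove the grouping is right, only that it isn't obviously wrong.

```python
# Shortened stand-in for the class answers (two complete answers).
answers = ["pear", "apple", "banana", "durian", "banana", "blueberry"]
group_size = 3  # assumed number of lines per answer

# Anything other than 0 means at least one answer is short or long.
leftover = len(answers) % group_size
print(leftover)  # 0
```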

Now, there are many ways you may go about accomplishing this. We are going to use one involving: list accumulators, counters, and resetting values. This gives us a good chance to practice these things and solve this problem.

Let's first read this list in as some data:

text = """pear
apple
banana
durian
banana
blueberry
apple
banana
strawberry
blueberry
dragonfruit
pear
apple
peach
banana"""

data = text.split("\n")

print(data)
['pear', 'apple', 'banana', 'durian', 'banana', 'blueberry',
 'apple', 'banana', 'strawberry', 'blueberry', 'dragonfruit', 
 'pear', 'apple', 'peach', 'banana']

Now we have a list called data that we can work with.

Meeting the "collect and clear" temporary variable pattern

Think about packing eggs into cartons. Each carton can fit 12 eggs. Once you fill a carton up, it goes into a big box and you get a new empty carton. So we have this series of steps:

  • add eggs into carton until it's full

  • put carton into shipping box

  • get empty egg carton out

We're going to do a similar thing! Add items into our temporary list until it is "full" (three items, given our situation), then add it to our big box, and clear out our container. The tricky piece is deciding where to place the logic that checks whether this full condition has been met, and there are many patterns for doing that.

Changing things up, this means we have these general actions to take:

  • add item to temp

  • check if temp is full

  • add temp to collection area if full

  • clear out temp if we are starting fresh

The order you take these actions within your code will strongly determine how your results will look and the order of the other actions. Honestly, you can mix these up into any particular order, so long as you are careful about things. But that's the key: the weirder the order the weirder your code will be.

We are going to follow the idea of: always be able to put something into temp, and check after every placement to see if it's full. A full temp needs to be added to the collection area and then cleaned out.

Create the two storage variables

all_chunks = []
temp_chunk = []

all_chunks will hold all the groups of three, while temp_chunk will hold the three items as we collect them.

Loop over the main list of data

for item in data:
    ...  # placeholder: the loop body gets filled in over the next steps

We need to loop through the 1D list to access all the data points.

Always add an item to temp

Inside our for loop, let's add the item to our temp variable.

for item in data:
    temp_chunk.append(item)

Just leaving it like this and printing off temp_chunk afterward would result in a full copy of data in temp_chunk. We are missing the logic to check if our temp is full or not.

Check if our temp variable is full

for item in data:
    temp_chunk.append(item)
    if len(temp_chunk) == 3:
        ...  # placeholder: collect and clear goes here

After we add the item we check the length of temp_chunk to see how far we've gotten. Note the 3 in there, but this could be changed to a separate variable to hold this value.

Collect up the temp and clear it out

Now we get to the heart of this pattern: we'll collect up our temp variable and then clear it out for the next round.

all_chunks = []  # start from empty lists again
temp_chunk = []

for item in data:
    temp_chunk.append(item)
    if len(temp_chunk) == 3:
        all_chunks.append(temp_chunk)  # collect it
        temp_chunk = []  # clear it out

We are calling append on our big container list and adding the current version of our temp variable in there. Once that's done, we can clear the temp out. This is where order can matter, as we don't want to clear anything out before we collect it up.
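How you clear the temp is also worth a quick check. The sketch below uses a shortened list and a group size of 2, both only for illustration. It compares rebinding the name with temp_chunk = [] against calling temp_chunk.clear(). Because append stores a reference to the same list object, clear() also empties the chunk that was just collected.

```python
items = ["pear", "apple", "banana", "durian"]

# Version 1: rebind temp_chunk to a brand-new list after collecting.
rebound = []
temp_chunk = []
for item in items:
    temp_chunk.append(item)
    if len(temp_chunk) == 2:
        rebound.append(temp_chunk)
        temp_chunk = []  # new list; the collected one is untouched

# Version 2: empty the same list object in place.
cleared = []
temp_chunk = []
for item in items:
    temp_chunk.append(item)
    if len(temp_chunk) == 2:
        cleared.append(temp_chunk)
        temp_chunk.clear()  # also empties the list we just collected

print(rebound)  # [['pear', 'apple'], ['banana', 'durian']]
print(cleared)  # [[], []]
```

If you do prefer clear(), appending a copy such as temp_chunk.copy() avoids the shared reference.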

Likewise, keep your variable names clear. Getting mixed up between your source data list, collection list, and temporary list will create either errors or wild behavior within your code.

The final value

Printing out our collection list we can see this, which accomplishes our task:

[['pear', 'apple', 'banana'], 
 ['durian', 'banana', 'blueberry'], 
 ['apple', 'banana', 'strawberry'], 
 ['blueberry', 'dragonfruit', 'pear'], 
 ['apple', 'peach', 'banana']]

A big limitation here is that we aren't doing anything about any extras that may be lying around when the data doesn't divide evenly. This is up to you. After the loop, temp_chunk will hold whatever was left over, so maybe you want to add those into your collection list once done, or handle them in another way.
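As one possible way to handle the extras, the sketch below wraps the loop in a function and appends whatever is left in the temp once the loop ends. The function name chunk_by_size and its chunk_size parameter are made up for this example. Whether a short final group should be kept, dropped, or flagged depends on your data.

```python
def chunk_by_size(items, chunk_size):
    """Group items into lists of chunk_size; keep a short final group."""
    all_chunks = []
    temp_chunk = []
    for item in items:
        temp_chunk.append(item)
        if len(temp_chunk) == chunk_size:
            all_chunks.append(temp_chunk)  # collect it
            temp_chunk = []                # clear it out
    if temp_chunk:  # anything left over after the loop?
        all_chunks.append(temp_chunk)
    return all_chunks


fruits = ["pear", "apple", "banana", "durian"]
print(chunk_by_size(fruits, 3))
# [['pear', 'apple', 'banana'], ['durian']]
```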

Sentinel variables

This pattern provides a lot of flexibility. You have full access to the data items and could inspect them as well in the process.

Taking this pattern a bit further, instead of a static value of three, maybe we have something more variable. For example, there could be a value within the data that indicates when a group is done.

Sure, you could write something using the split method on the string of data and then break things apart like that. However, that's presuming that the incoming data will always be a string you can manipulate. You may have data coming in as rows or in some other form, where you need to extract something from each item in order to check it.
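For comparison, here is roughly what that string-based shortcut could look like on a shortened version of the text. It is only a sketch: it assumes the whole input is one string, that "done" never appears inside a real item, and that no item contains spaces.

```python
text = """pear
apple
done
banana
done
peach"""

# Break the string wherever the marker appears, then split each
# piece on whitespace to get the individual items.
groups = [piece.split() for piece in text.split("done")]
print(groups)  # [['pear', 'apple'], ['banana'], ['peach']]
```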

The thing that you are checking for is called the "sentinel" item. Sometimes it's a variable; sometimes it's just a piece of content.

Let's change the data up a little to add something like this in:

text = """pear
apple
banana
durian
done
banana
blueberry
apple
banana
done
strawberry
done
blueberry
dragonfruit
pear
apple
done
peach
banana"""

data = text.split("\n")

Printing off data we get a similar looking list, just with a few extra done strings in there.

Adapting the sentinel

Going back to our original for loop, we can change our boolean conditional here to see if we are working with the word "done".

all_chunks = []  # start fresh for the new data
temp_chunk = []

for item in data:
    temp_chunk.append(item)
    if item == "done":  # updated check here
        all_chunks.append(temp_chunk)
        temp_chunk = []

And printing off the results we get:

[['pear', 'apple', 'banana', 'durian', 'done'], 
 ['banana', 'blueberry', 'apple', 'banana', 'done'], 
 ['strawberry', 'done'], 
 ['blueberry', 'dragonfruit', 'pear', 'apple', 'done']]

Success! Sort of. Changing this mechanism now allows the length of the temp variable to be different and unknown each time. However, each of our groups ends with "done". Maybe we want this! Nothing wrong with it, but generally collecting up that value isn't desired.

There are a few ways to solve this. We could use all_chunks.append(temp_chunk[:-1]) to have it skip the last item. We could also run temp_chunk.pop(-1) before appending to have it remove the last item. Both would be fine, but neither is actually handling the problem: that we are always adding the item no matter what.
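As a quick illustration of those two workarounds, here they are on a single chunk that has already picked up the marker. The values are typed in by hand for the example.

```python
temp_chunk = ["strawberry", "banana", "done"]

# Slicing builds a new list without the last item; temp_chunk is unchanged.
without_marker = temp_chunk[:-1]
print(without_marker)  # ['strawberry', 'banana']

# pop(-1) removes the last item from temp_chunk itself and returns it.
removed = temp_chunk.pop(-1)
print(removed)     # done
print(temp_chunk)  # ['strawberry', 'banana']
```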

Then the question is, okay where do we add it in? Well, we need to think about there being two conditions: either done or not done. Done we want to collect and clear out, not done we want to add an item in.

Moving the item addition to temp into an else clause allows us to keep the majority of the code the same form, but protects the item collection from the done condition.

all_chunks = []  # start fresh again
temp_chunk = []

for item in data:
    if item == "done":
        all_chunks.append(temp_chunk)
        temp_chunk = []
    else:
        temp_chunk.append(item)

Running this and looking at our result:

[['pear', 'apple', 'banana', 'durian'], 
 ['banana', 'blueberry', 'apple', 'banana'], 
 ['strawberry'], 
 ['blueberry', 'dragonfruit', 'pear', 'apple']]

End checking

You'll notice that we are missing the final group: peach and banana never made it into the collection, because no "done" follows them.

We could handle it via:

all_chunks = []  # start fresh again
temp_chunk = []

for item in data:
    if item == "done":
        all_chunks.append(temp_chunk)
        temp_chunk = []
    else:
        temp_chunk.append(item)

all_chunks.append(temp_chunk)  # collect whatever is left after the loop

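This just adds whatever is left in the temp onto the end manually. But it isn't a great solution if you aren't sure whether there will be leftover content: if the data ends with "done", the temp is empty and you'd append an empty list.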
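One way to make that manual append safer, offered only as a sketch, is to append when the temp actually holds something. The shortened list below ends with "done" on purpose, to show the case where an unconditional append would add an empty list.

```python
data = ["pear", "apple", "done", "banana", "done"]

all_chunks = []
temp_chunk = []
for item in data:
    if item == "done":
        all_chunks.append(temp_chunk)
        temp_chunk = []
    else:
        temp_chunk.append(item)

# Only collect the leftovers if there are any.
if temp_chunk:
    all_chunks.append(temp_chunk)

print(all_chunks)  # [['pear', 'apple'], ['banana']]
```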

A better way to do this is to check if you are looking at the last item.

Enumerate

One problem here is that we need access to the index position within our loop in addition to the content. True, we could add a counter here, but Python has a better way. We can easily thread the enumerate function into our loop structure without messing around too much with our content.

Calling enumerate with an iterable returns an iterator of tuple pairs. The item at position 0 of each pair is the index (or counter) and the item at position 1 is the content. Using multiple assignment syntax, we can unpack each pair, keep our loop variable name of item, and add the index value into our code.
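A quick way to see what enumerate produces is to print the pairs for a short list. Wrapping the call in list() here is only so that all the pairs can be printed at once; the variable names are just for this example.

```python
fruits = ["pear", "apple", "banana"]

# enumerate yields (index, item) pairs, starting the index at 0.
pairs = list(enumerate(fruits))
print(pairs)  # [(0, 'pear'), (1, 'apple'), (2, 'banana')]

# Multiple assignment unpacks one pair into two names.
index, item = pairs[0]
print(index, item)  # 0 pear
```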

First, we add the function call in our for loop block opener:

for index, item in enumerate(data):
    ...  # placeholder: the loop body comes next

Now we can use index to check if we are at the last item. The next question is how? We can use something like index + 1 == len(data): index positions start at 0, so the last index is always one less than the length. That's taken care of. However, we need to address that we are in a bit of a different situation.

This isn't like the "done" situation, where we want to ignore the current piece of content. The last item is real data, so we are in a new condition: the wrap-up condition. Here we need to add that last item in first, and then collect the chunk.

all_chunks = []  # start fresh again
temp_chunk = []

for index, item in enumerate(data):
    if item == "done":
        all_chunks.append(temp_chunk)
        temp_chunk = []
    elif (index + 1) == len(data):  # added new condition
        temp_chunk.append(item)  # add item
        all_chunks.append(temp_chunk)  # collect chunk
        # temp_chunk = []  # optional
    else:
        temp_chunk.append(item)

Splitting this into an elif clause lets us treat this condition separately from the binary of being done or not being done.

You may see that I didn't clear out the temp variable. That's your choice. In this example I don't need it afterward, so I can leave it as is. However, you may want to clear it out in more complex code in case you use it later. This would reset it back to the initial value, so if you end up reusing the variable name you won't run into problems.

Now our final output is:

[['pear', 'apple', 'banana', 'durian'], 
 ['banana', 'blueberry', 'apple', 'banana'], 
 ['strawberry'], 
 ['blueberry', 'dragonfruit', 'pear', 'apple'], 
 ['peach', 'banana']]
