Module 11: Advanced function options and building simulations

What do we mean by simulations?

Many possible meanings. (these are sufficiently accurate, don't @ me for not being perfectly correct)

  • randomly creating fake data to test an application with

    • example: populating a database with randomly selected values that are based on real data, so you can run queries and develop documentation before deployment.

    • Sometimes also done to populate a data file so you can work on developing analytics code etc while the real data are still being collected, or are too sensitive to share or get help on.

  • simulating moves/plays in a game

  • simulating possible actions of consumers etc in large systems

  • interpolating data points based on observations to extend models and measurements

    • example: cool space stuff

  • checking your probability math, because probability math is the worst

  • exploring how altering single parameters can affect a system

    • physics, architecture, "what if we had 20% less water and 50% more boats" engineering etc.

Lots of reasons!

How is this done?

Mostly just a lot of random numbers. We can carefully define and control how those numbers are generated to change the direction of the model.

Randomness is complex, take it up with the stats department

We will be using the random module, which generates pseudorandom numbers. They aren't truly random, but they are random enough for general interesting stuff. They are not random enough for certain applications, such as security or cryptography. Read more about it here if you are deeply curious: https://en.wikipedia.org/wiki/Pseudorandomness

Randomness in Python

We will be using the random module to generate random numbers. It's in the standard library and pretty easy to use.

First up, for reproducibility we want to set a random seed. This allows us to all generate the same random numbers. (see? pseudorandom) Basically, this allows us to all have the same "starting point" for randomness. When you don't specify one, Python seeds from the operating system's randomness source if there is one, and falls back to the system time otherwise.

import random

random.seed(68)

(You'll see a lot of people use 42 as a seed because it's a Hitchhiker's Guide to the Galaxy reference: the answer to "the ultimate question of life, the universe, and everything". There's nothing particularly special about 42 from a programming perspective.)

You don't always have to worry about this, but if you are doing this For Science or other such things, it is important for reproducibility.
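A quick way to convince yourself that the seed really is a "starting point": set it, generate some numbers, set it again, and generate some more. This is just a small sketch; the variable names and the choice of dice-style numbers from 1 to 6 are made up for illustration.

```python
import random

# Set the seed, then generate five numbers
random.seed(68)
first_run = [random.randint(1, 6) for _ in range(5)]

# Reset to the same starting point and generate five more
random.seed(68)
second_run = [random.randint(1, 6) for _ in range(5)]

print(first_run)
print(second_run)
print(first_run == second_run)  # True: same seed, same sequence
```

Change the seed on the second run and the two lists will almost certainly stop matching.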

Random module

I particularly like the set of examples put in the Python docs for random. https://docs.python.org/3/library/random.html#examples

Getting random samples

There are actually multiple ways of doing this, so it can be a fun thing to explore the basics with.

Why might we want random samples?

Loads of reasons.

  • You may need a 20% random sample of a specific group of data for training data in machine learning.

  • You may want just a small subset of a larger dataset to explore for qualitative analysis or to test interrater reliability.

  • You may be developing a complex data processing pipeline and want to test it on limited but representative cases.

The general approach to this is: figure out how to randomly order things and then grab however many you need. To reach a random sample of size n:

  • Approach 1: randomly select items from the content until you have a collection of size n

    • con: allows for repetition

  • Approach 2: randomly generate valid index positions until you have a collection of size n

    • same as 1, allows for repetition

  • Approach 3: randomly reorder (using random.shuffle()) the items and then slice out the first n items

    • con: mutates the original list if you don't make a copy

  • Approach 4: gather the index positions, shuffle them, and select the first n positions. Then gather those.

    • generally most common one?
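Here is one way the four approaches could look in code. Treat it as a rough sketch: the items list, n, and the sample_ names are all invented for the example, and there are other reasonable ways to write each one.

```python
import random

random.seed(68)
items = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]  # made-up data
n = 3

# Approach 1: pick items one at a time (the same item can come up twice)
sample_1 = [random.choice(items) for _ in range(n)]

# Approach 2: pick index positions one at a time (same caveat as approach 1)
sample_2 = [items[random.randrange(len(items))] for _ in range(n)]

# Approach 3: shuffle a copy so the original list is left alone, then slice
shuffled = items.copy()
random.shuffle(shuffled)
sample_3 = shuffled[:n]

# Approach 4: shuffle the index positions, keep the first n, then gather
positions = list(range(len(items)))
random.shuffle(positions)
sample_4 = [items[i] for i in positions[:n]]

print(sample_1, sample_2, sample_3, sample_4)
```

Approaches 3 and 4 can never repeat an item, because each item (or position) appears exactly once in the thing being shuffled.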

There are also special functions for this.
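In the random module, the two most likely candidates are random.sample() (no repeats) and random.choices() (repeats allowed). A minimal sketch, with a made-up list of numbers standing in for real data:

```python
import random

random.seed(68)
items = list(range(1, 21))  # made-up data: the numbers 1 through 20

# sample() picks k different positions, so nothing is selected twice
without_repeats = random.sample(items, k=5)

# choices() picks with replacement, so the same item can show up again
with_repeats = random.choices(items, k=5)

print(without_repeats)
print(with_repeats)
```

Neither function changes the original list.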

Probability in randomness

There are a few ways we can achieve probability in randomness.

Say we have an event that's got a 5% chance of occurring in any given trial. Basically, if I have 100 trials, I should see it happen around 5 times. While we may start with a 5% chance, we can see it as a 5/100 chance, which reduces down to 1/20 (fraction math, yo). Most probability math stuff wants it as a fraction.

Probabilities are also spoken of in decimal values. After all, those fractions end up as decimal values. So we can say that 1/20 is 5% but also 0.05.

Python doesn't have a percent type, so we need to speak of these things in decimal values.
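If you want to check the conversions, the standard library's fractions module will do the fraction math for you. The variable names here are just for illustration.

```python
from fractions import Fraction

percent = 5

as_fraction = Fraction(percent, 100)  # automatically reduced to lowest terms
as_decimal = percent / 100            # the form we will actually use

print(as_fraction)  # 1/20
print(as_decimal)   # 0.05
```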

Once we have our decimal value, we need to think of how we can mock up something to only print out 5% of the time. This is a very small sort of simulation. The way that we approach this is the following:

  • Set a variable as the decimal value of the probability, e.g. event = .05

  • Randomly generate a floating point value between 0 and 1 (random.random() gives you 0.0 up to, but not including, 1.0). Each floating point number has an equal chance of occurring.

  • Check if the number we generated is less than or equal to our decimal probability.

  • Numbers less than or equal to it are marked as a success, all other numbers are a failure.
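Those steps could be sketched like this. The number of trials and the names trials, successes, roll, and rate are assumptions picked for the example; event = .05 comes from the steps above.

```python
import random

random.seed(68)

event = .05        # the probability, as a decimal
trials = 100_000   # made-up number of trials; more trials gives a steadier rate

successes = 0
for _ in range(trials):
    roll = random.random()  # float from 0.0 up to (not including) 1.0
    if roll <= event:       # at or under the threshold counts as a success
        successes += 1

rate = successes / trials
print(successes, rate)
```

The rate should land close to .05, but not exactly on it. That wobble is the randomness doing its job.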

Why do we do it this way? Because computers want numbers. Each number has a fair chance (there are some limits to the depth of precision that they can go into), so by filtering for a subset of 5% of them, we can mimic something occurring 5% of the time.

Which I know sounds a bit bonkers, because you might think: but you're only looking at the low numbers? Why not like .95 and above? Ya. I know. I hate it, too. But remember that each number has a fair chance, so the probability of landing in the bottom 5% (0 to .05) is exactly the same as the probability of landing in the top 5% (.95 to 1).

We are choosing thresholds where the cumulative probability of the numbers we've selected equals the probability that we are after. And it's just easiest to provide an upper boundary without having to deal with a lower one?
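If you don't trust that, you can count both ends at once and compare. This is a small sketch with an arbitrary number of trials; low_hits and high_hits are made-up names.

```python
import random

random.seed(68)
trials = 100_000  # arbitrary; big enough for the rates to settle down

low_hits = 0   # rolls in the bottom 5%
high_hits = 0  # rolls in the top 5%
for _ in range(trials):
    roll = random.random()
    if roll <= .05:
        low_hits += 1
    if roll >= .95:
        high_hits += 1

print(low_hits / trials, high_hits / trials)
```

Both rates should come out near .05. The low end is just less typing: one comparison against the probability itself, no subtracting from 1.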
