Bio/AI Software Engineer

Intro to statistics for Bio AI Software Engineers

Jan Piotrzkowski

What is statistics?

When we talk about statistics, we can mean either the field - the practice and study of collecting and analysing data - or a summary of some data.

A more important question is: what can statistics do?

It can help in answering business questions like:

  1. Are people more likely to make a purchase if we change X on our website?
  2. How many people will buy our product?
  3. How many sizes of shoes should we make so that they fit 90% of our customer base?
  4. Which ads will bring more meaningful traffic and make people buy our product?

Another question worth answering is:

What can't statistics do?

It can't answer questions that rest on unverified assumptions about why something happens. It can show that two things occur together, but not, on its own, that one is the reason for the other.

For example:

Why is Game of Thrones so popular?

We can ask ourselves a question:

Are series with more violence viewed by more people?

But even if it's true, we can't be sure whether violent scenes are the reason Game of Thrones is so popular, or whether other factors drive its popularity and it just happens to be violent.

Types of Statistics

  • Descriptive

It describes and summarizes data.

Example: 10% of people do X, 90% of people do Y

  • Inferential

Uses a sample of data to make inferences about a larger population.

Example: What percentage of people do Y?

Types of data

There are two types of data that we should be aware of.

  • Numeric (Quantitative)

    -> Continuous - values we can measure, like the speed of a car or time spent on something

    -> Discrete - values we count, like the number of Star Wars Lego figures or the number of houses in the neighborhood

  • Categorical (Qualitative)

    -> Nominal - unordered categories, like country of residence or marital status

    -> Ordinal - ordered categories, like how far along a specific ticket is, or how much you agree with a statement when choosing from a list of possible answers

Categorical data may be represented as numbers.

Why do data types matter? Because they determine how we summarize and present the data. For example, numeric data suits scatter plots and histograms, while categorical data suits counts and bar charts.

Spread

Spread describes how close together or far apart the data points are.

Mean

The average of a set of numbers, calculated by summing all the values and dividing by the total number of values

Variance

Average squared distance of each data point from the mean

Standard deviation

Square root of the variance. It quantifies the amount of variation or dispersion of a set of data values relative to its mean, in the same units as the data

Mean absolute deviation

Measure that quantifies the average distance between each data point in a dataset and the mean of that dataset

Standard deviation vs Mean absolute deviation

Standard deviation squares distances, penalizing longer distances more than shorter ones. Mean absolute deviation penalizes each distance equally.

One is not better than the other; standard deviation is just more commonly used.
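
As a rough illustration of the definitions above, here is a minimal sketch using Python's standard `statistics` module. The `data` list is made up for the example, and the population formulas (dividing by the number of values) are an assumption; sample statistics would divide by n - 1 instead.

```python
import statistics

# Hypothetical readings, invented for this example
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(data)            # 5.0
# Population variance: average squared distance from the mean
variance = statistics.pvariance(data)    # 4
# Standard deviation: square root of the variance
std_dev = statistics.pstdev(data)        # 2.0
# Mean absolute deviation: average absolute distance from the mean
mad = sum(abs(x - mean) for x in data) / len(data)  # 1.5

print(mean, variance, std_dev, mad)
```

Here the standard deviation (2.0) is larger than the mean absolute deviation (1.5) because squaring gives the far-away points (2 and 9) more weight.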

Quantiles

Quantiles split the data into equal-sized parts. When the data is split into 100 parts they are called percentiles, and splitting it into two parts gives the median as the cut point.

Quartiles

Split the data into four equal parts

The box in a boxplot is drawn using quartiles.

Interquartile range (IQR)

Distance between the 25th and 75th percentiles (Q3 - Q1), which is also the height of the box in a boxplot
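
A minimal sketch of computing quartiles and the IQR with `statistics.quantiles` from the standard library. The `data` values are hypothetical, and the choice of `method="inclusive"` is an assumption: different tools use slightly different interpolation rules, so quartile values can differ a little between libraries.

```python
import statistics

# Hypothetical measurements, invented for this example
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 4.0 7.0 10.0 6.0
```

The middle cut point `q2` is the median of the data.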

Outliers

Data points that are substantially different from the others.

A data point is an outlier if:

  • data < Q1 - 1.5 * IQR
  • data > Q3 + 1.5 * IQR
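
The 1.5 * IQR rule can be sketched as a small function. This is only an illustration: `find_outliers` and the `readings` list are hypothetical names and values, the lower fence is Q1 - 1.5 * IQR and the upper fence is Q3 + 1.5 * IQR, and the quartile method is an assumption.

```python
import statistics

def find_outliers(values):
    """Return the values outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical readings with one suspiciously large value
readings = [1, 3, 4, 6, 7, 8, 10, 12, 40]
outliers = find_outliers(readings)
print(outliers)  # [40]
```

For these readings Q1 = 4 and Q3 = 10, so the fences are -5 and 19, and only 40 falls outside them.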

Probability

Probability = number of desired outcomes ÷ total possible outcomes

P = (ways an event can happen) / (all possible outcomes), assuming every outcome is equally likely
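
As a small worked example, assuming a fair six-sided die where every face is equally likely, the formula can be applied by counting outcomes. `Fraction` keeps the result exact instead of a rounded decimal.

```python
from fractions import Fraction

# All possible outcomes of one roll of a fair die
outcomes = [1, 2, 3, 4, 5, 6]
# Desired outcomes: rolling an even number
desired = [o for o in outcomes if o % 2 == 0]

p_even = Fraction(len(desired), len(outcomes))
print(p_even)  # 1/2
```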

Sampling without replacement

A method where, after an element is selected from a population, it is removed and not returned, meaning it cannot be chosen again

Sampling with replacement

A method where, after each selection of an item from a population, that item is returned to the population before the next selection
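
Both sampling methods are available in Python's standard `random` module. A minimal sketch, where the `population` list and the seed are arbitrary choices for the example:

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
population = ["A", "B", "C", "D", "E"]

# Without replacement: each element can be picked at most once
without_replacement = rng.sample(population, k=3)

# With replacement: the same element may be picked more than once
with_replacement = rng.choices(population, k=3)

print(without_replacement, with_replacement)
```

`rng.sample` never returns duplicates from a list of distinct items, while `rng.choices` may.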

Independent events

Two events are independent if the probability of the second one isn't affected by the outcome of the first event

Example: Sampling with replacement

Dependent events

Two events are dependent if the probability of the second event is affected by the outcome of the first event

Example: Sampling without replacement
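
To see how replacement affects the second draw, here is a small sketch assuming a standard 52-card deck with 4 aces (the card example is an illustration, not from the text above). It compares the chance of drawing an ace second, given that the first card was an ace.

```python
from fractions import Fraction

aces, cards = 4, 52

# First draw: chance of an ace
p_first = Fraction(aces, cards)

# Card put back: the deck is unchanged, so the draws are independent
p_second_with_replacement = Fraction(aces, cards)

# Card kept out: one ace and one card fewer, so the draws are dependent
p_second_without_replacement = Fraction(aces - 1, cards - 1)

print(p_first, p_second_with_replacement, p_second_without_replacement)
# 1/13 1/13 1/17
```

With replacement the second probability equals the first; without replacement it changes because the first outcome altered the deck.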

Continuous distributions

A continuous distribution is used when a variable can take any value within a range, including fractions or decimals - like height, weight, or temperature.

Instead of just specific outcomes (like rolling a die), you’re looking at a range of possible values.

You don’t ask

what’s the chance someone is exactly 170 cm tall?

You do ask

what’s the chance someone is between 170 cm and 175 cm?
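
To make the "range, not exact value" idea concrete, here is a minimal sketch using the simplest continuous distribution, the uniform one, where every value in an interval is equally likely. The waiting-time scenario and the helper `uniform_prob` are hypothetical, chosen only because the arithmetic is easy to check by hand.

```python
def uniform_prob(a, b, low, high):
    """P(a <= X <= b) for X uniformly distributed on [low, high]."""
    a, b = max(a, low), min(b, high)       # clip the range to the support
    return max(b - a, 0) / (high - low)    # width of range / total width

# Hypothetical: a wait that is equally likely to be anywhere from 0 to 12 minutes
p_range = uniform_prob(4, 7, low=0, high=12)   # 3 / 12
p_exact = uniform_prob(5, 5, low=0, high=12)   # a single point has zero width

print(p_range, p_exact)  # 0.25 0.0
```

The probability of any single exact value is 0, which is why the useful question is always about a range.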

Binomial distribution

The Binomial Distribution models situations where:

  1. You repeat an experiment a fixed number of times

  2. Each time has only two possible outcomes (like success/failure)

  3. The chance of success is always the same

  4. Each trial is independent

If I try this thing n times, how likely is it that I’ll get k successes?
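
That question has a closed-form answer: the number of ways to place k successes among n trials, times the probability of each such arrangement. A minimal sketch, where `binomial_pmf` is a hypothetical helper name and the coin-flip numbers are only an example:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: exactly 3 heads in 10 flips of a fair coin
p_three_heads = binomial_pmf(3, 10, 0.5)
print(p_three_heads)  # 0.1171875
```

Summing `binomial_pmf(k, n, p)` over every k from 0 to n gives 1, since some number of successes must occur.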

Normal distribution

The Normal Distribution is the classic "bell curve" - it’s symmetric, with most values clustering around the average (mean), and fewer the farther out you go.

Used when data is naturally spread around a central value, like height, test scores, or enzyme activity levels.

You don’t ask

"will this value be exactly 100?"

You ask

"what’s the chance it falls between 95 and 105?"

Think of it like:

"Most people are average. Some are outliers."

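The "between 95 and 105" question can be answered with `statistics.NormalDist` from the standard library. This sketch assumes a mean of 100 and a standard deviation of 5; both numbers are made up for the example.

```python
from statistics import NormalDist

# Assumed parameters, invented for illustration
scores = NormalDist(mu=100, sigma=5)

# P(95 <= X <= 105) is the difference of the cumulative probabilities
p_between = scores.cdf(105) - scores.cdf(95)
print(round(p_between, 4))  # 0.6827
```

Because 95 and 105 are exactly one standard deviation either side of the mean, this reproduces the familiar rule that about 68% of values fall within one standard deviation.
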
The central limit theorem

The Central Limit Theorem says:

If you repeatedly take large enough samples from almost any distribution (one with finite variance) and average each sample, the distribution of those averages will look approximately normal.

Even if the original data is weird or skewed, the means of many samples will start forming a bell curve.
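
A small simulation can make this visible. The sketch below draws from an exponential distribution, which is strongly skewed, and looks at the means of many samples. The sample size, number of samples, and seed are arbitrary choices for the example.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
sample_size = 50         # values per sample
num_samples = 2000       # how many sample means we collect

# Exponential distribution with rate 1: mean 1, standard deviation 1, skewed
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

center = statistics.fmean(sample_means)   # theory: close to 1.0
spread = statistics.stdev(sample_means)   # theory: close to 1 / sqrt(50)

# Share of sample means within one standard deviation of the center;
# for a bell curve this is roughly 68%
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / num_samples

print(center, spread, within_one_sd)
```

Plotting a histogram of `sample_means` would show the bell shape, even though a histogram of the raw exponential values would not.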

The Poisson distribution

The Poisson Distribution models how many times an event happens in a fixed amount of time or space, when those events happen independently and at a constant average rate.

It answers:

"If this thing happens on average 3 times per hour, what’s the chance it happens exactly 5 times this hour?"

Examples:

  • Mutations in a DNA sequence
  • Calls to a lab helpline
  • Cell division events per time window
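
The "3 times per hour on average, exactly 5 times this hour" question can be computed directly from the Poisson formula P(k) = λ^k · e^(-λ) / k!. A minimal sketch, where `poisson_pmf` is a hypothetical helper name:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam**k * exp(-lam) / factorial(k)

# Average of 3 events per hour, chance of exactly 5 this hour
p_five = poisson_pmf(5, 3)
print(round(p_five, 4))  # 0.1008
```

So with an average of 3 per hour, seeing exactly 5 in a given hour happens about 10% of the time.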

Correlation

Correlation tells you how two variables move together.

If one goes up and the other also goes up → positive correlation

If one goes up and the other goes down → negative correlation

If they don’t move in any consistent way → no correlation

Example:

The more coffee you drink, the more productive you feel? That might be a positive correlation.

It’s measured by a correlation coefficient that ranges from -1 to 1:

  • 1 = perfect positive correlation
  • -1 = perfect negative correlation
  • 0 = no linear correlation at all
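
The most common such measure is the Pearson correlation coefficient. Below is a minimal sketch of it written out by hand; the helper name `pearson` and the coffee and productivity numbers are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How the two variables vary together
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

cups_of_coffee = [1, 2, 3, 4, 5]
rises_with_coffee = [2, 4, 6, 8, 10]   # goes up in lockstep
falls_with_coffee = [10, 8, 6, 4, 2]   # goes down in lockstep

print(pearson(cups_of_coffee, rises_with_coffee))  # approximately 1
print(pearson(cups_of_coffee, falls_with_coffee))  # approximately -1
```

Real data rarely lands on exactly 1 or -1; values in between indicate a weaker linear relationship.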

Correlation caveats

Just because two things are correlated doesn’t mean one causes the other.

Example:

Ice cream sales and shark attacks both go up in summer. But one doesn’t cause the other - it’s the season that influences both.

Watch out for:

  • Spurious correlations (weird coincidences)
  • Confounding variables (a hidden third factor)
  • Overfitting in small samples
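
A confounding variable is easy to simulate. In the sketch below, temperature drives both ice cream sales and shark attacks, and neither affects the other. All numbers, coefficients, and names are made up for illustration, and subtracting the known temperature effect is a simplification that only works because the simulation defines that effect.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(1)  # fixed seed so the run is repeatable

# The hidden third factor
temperature = [rng.uniform(10, 35) for _ in range(500)]
# Both quantities depend on temperature plus their own independent noise
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson(ice_cream, shark_attacks)

# Remove the (known, simulated) temperature effect from both series
ice_cream_rest = [y - 2.0 * t for y, t in zip(ice_cream, temperature)]
shark_rest = [y - 0.3 * t for y, t in zip(shark_attacks, temperature)]
adjusted_r = pearson(ice_cream_rest, shark_rest)

print(raw_r, adjusted_r)
```

The raw correlation comes out strongly positive, while the correlation after accounting for temperature is close to zero, which is what a confounder looks like in data.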

Correlation ≠ Causation

(But it’s a good place to start asking questions)