Intro to statistics for Bio AI Software Engineers



What is statistics?
When we talk about statistics, we can mean statistics as a field - the practice and study of collecting and analysing data - or a summary of some data.
A more important question is: what can statistics do?
It can help in answering business questions like:
- Are people more likely to make a purchase if we change X on our website?
- How many people will buy our product?
- How many sizes of shoes should we make so 90% of our customer base would fit them?
- Which ads will bring more meaningful traffic and make people buy our product?
Another question worth answering is:
What can't statistics do?
It can't answer questions built on unverified assumptions - ones where we aren't sure the assumed factor is actually the reason something happens.
For example:
Why is Game of Thrones so popular?
We can ask ourselves a question:
Are series with more violence viewed by more people?
But even if that's true, we can't be sure that more violent scenes are the reason Game of Thrones is so popular - other factors may be driving its popularity, and it just happens to be violent.
Types of Statistics
- Descriptive
It describes and summarizes data.
Example: 10% of people do X, 90% of people do Y
- Inferential
Uses a sample of data to make inferences about a larger population.
Example: What percentage of people do Y?
Types of data
There are two types of data that we should be aware of.
- Numeric (Quantitative)
-> Continuous - values we measure, like the speed of a car or time spent on something
-> Discrete - counted data, like the number of Star Wars LEGO figures or the number of houses in the neighborhood
- Categorical (Qualitative)
-> Nominal - unordered information, like country of residence or marital status
-> Ordinal - has an inherent order, like the degree to which you completed a specific ticket, or how much you agree with a statement when given a list of possible answers.
Categorical data may be represented as numbers.
Why do data types matter? Because they determine how we present the data - for example, whether it should go on a scatter plot or a histogram.
Spread
Spread describes how close together (or far apart) the data points are.
Mean
The average of a set of numbers, calculated by summing all the values and dividing by the total number of values
Variance
The average squared distance of each data point from the mean
Standard deviation
Measure that quantifies the amount of variation or dispersion of a set of data values relative to its mean
Mean absolute deviation
Measure that quantifies the average distance between each data point in a dataset and the mean of that dataset
Standard deviation vs Mean absolute deviation
Standard deviation squares distances penalizing longer distances more than shorter ones. Mean absolute deviation penalizes each distance equally.
One is not better than the other; it's just that standard deviation is more commonly used.
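The measures above can be computed with Python's standard library. This is a minimal sketch; the dataset is invented for illustration.

```python
# Sketch: computing the spread measures above with Python's stdlib.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example dataset

mean = statistics.mean(data)        # 5.0
variance = statistics.pvariance(data)  # average squared distance from the mean: 4.0
std_dev = statistics.pstdev(data)   # square root of the variance: 2.0

# Mean absolute deviation: average of the unsquared distances from the mean
mad = sum(abs(x - mean) for x in data) / len(data)  # 1.5

print(mean, variance, std_dev, mad)
```

Note that the standard deviation (2.0) is larger than the MAD (1.5) here, since squaring weights the farther points (like 9) more heavily.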
Quantiles
Quantiles split data into equal parts; percentiles are quantiles that split the data into 100 parts. The middle quantile (the 50th percentile) is the median.
Quartiles
Split the data into four equal parts
Boxes in boxplot use quartiles.
Interquartile range (IQR)
The distance between the 25th and 75th percentiles (Q1 and Q3), which is also the height of the box in a boxplot.
Outliers
Data points that are substantially different from the others.
A data point is an outlier if:
- data < Q1 - 1.5 * IQR
- data > Q3 + 1.5 * IQR
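Quartiles, the IQR, and the outlier rule can be sketched with the stdlib. The dataset is made up, and exact quartile values depend on the interpolation method the library uses.

```python
# Sketch: quartiles, IQR, and the 1.5 * IQR outlier rule with Python's stdlib.
import statistics

data = [1, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # 30 is an obvious candidate outlier

# statistics.quantiles with n=4 returns the three quartile cut points
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(outliers)  # [30]
```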
Probability
Probability = number of desired outcomes ÷ total possible outcomes
P = (ways an event can happen) / (all possible outcomes)
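For a concrete instance of this counting definition, a fair six-sided die works well (the even-number event is my choice of example):

```python
# Sketch: counting definition of probability for a fair six-sided die.
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}
desired = {x for x in outcomes if x % 2 == 0}  # event: rolling an even number

p = Fraction(len(desired), len(outcomes))
print(p)  # 1/2
```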
Sampling without replacement
A method where, after an element is selected from a population, it is removed and not returned, meaning it cannot be chosen again
Sampling with replacement
A method where, after each selection of an item from a population, that item is returned to the population before the next selection
Independent events
Two events are independent if the probability of the second event isn't affected by the outcome of the first event.
Example: Sampling with replacement
Dependent events
Two events are dependent if the probability of the second event is affected by the outcome of the first event.
Example: Sampling without replacement
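Python's `random` module has stdlib helpers for both sampling modes. A small sketch with an invented population:

```python
# Sketch: sampling without vs with replacement using random's stdlib helpers.
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = ["red", "green", "blue", "yellow"]

# Without replacement: each pick removes the item, so later picks
# depend on earlier ones (no repeats possible).
without = random.sample(population, k=3)

# With replacement: each pick is independent of the others,
# so the same item can appear more than once.
with_repl = random.choices(population, k=3)

print(without, with_repl)
```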
Continuous distributions
A continuous distribution is used when the values a variable can take are infinite and can include fractions or decimals - like height, weight, or temperature.
Instead of just specific outcomes (like rolling a die), you’re looking at a range of possible values.
You don’t ask
what’s the chance it’s exactly 170 cm tall?
You do ask
what’s the chance someone is between 170 cm and 175 cm?
Binomial distribution
The Binomial Distribution models situations where:
- You repeat an experiment a fixed number of times
- Each trial has only two possible outcomes (like success/failure)
- The chance of success is always the same
- Each trial is independent
If I try this thing n times, how likely is it that I’ll get k successes?
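The "k successes in n trials" probability can be written directly from the binomial formula using `math.comb`. The numbers (10 coin flips, exactly 5 heads) are chosen for illustration.

```python
# Sketch: binomial pmf P(k successes in n trials), built from math.comb.
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: chance of exactly 5 heads in 10 fair coin flips
prob = binomial_pmf(k=5, n=10, p=0.5)
print(round(prob, 4))  # ~0.2461
```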
Normal distribution
The Normal Distribution is the classic "bell curve" - it’s symmetric, with most values clustering around the average (mean), and fewer the farther out you go.
Used when data is naturally spread around a central value, like height, test scores, or enzyme activity levels.
You don’t ask
"will this value be exactly 100?"
You ask
"what’s the chance it falls between 95 and 105?"
Think of it like:
"Most people are average. Some are outliers."
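The "between 95 and 105" question can be answered with the stdlib's `statistics.NormalDist`, assuming (my choice for illustration) a mean of 100 and a standard deviation of 5:

```python
# Sketch: P(95 <= value <= 105) for an assumed Normal(mean=100, sd=5).
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=5)

# Probability mass between the two cutoffs = difference of the CDF values
prob = dist.cdf(105) - dist.cdf(95)
print(round(prob, 4))  # ~0.6827 - the classic "68% within one standard deviation"
```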
The central limit theorem
The Central Limit Theorem says:
If you take many samples from any distribution and compute each sample's mean, the distribution of those means will look approximately normal.
Even if the original data is weird or skewed, the means of many samples will start forming a bell curve.
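This is easy to see in simulation. The sketch below draws many samples from an exponential distribution (heavily skewed) and averages each one; the sample sizes and counts are arbitrary choices.

```python
# Sketch: the Central Limit Theorem in action - means of samples drawn
# from a skewed distribution (Exponential(1)) cluster in a bell shape.
import random
import statistics

random.seed(0)

# 5000 samples of size 50 from Exponential(rate=1), one mean per sample
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(5000)
]

# The mean of Exponential(1) is 1; the sample means cluster tightly
# around it and are far less skewed than the raw draws.
print(round(statistics.mean(sample_means), 2))
```

Plotting `sample_means` as a histogram would show the bell curve forming, even though a histogram of the raw exponential draws is strongly right-skewed.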
The Poisson distribution
The Poisson Distribution models how many times an event happens in a fixed amount of time or space, when those events are rare and independent.
It answers:
"If this thing happens on average 3 times per hour, what’s the chance it happens exactly 5 times this hour?"
Examples:
- Mutations in a DNA sequence
- Calls to a lab helpline
- Cell division events per time window
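The "3 per hour on average, exactly 5 this hour?" question above maps directly onto the Poisson formula P(k) = λ^k · e^(−λ) / k!, which is short enough to write by hand:

```python
# Sketch: Poisson pmf P(k events) = lam**k * exp(-lam) / k!
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate is lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Average 3 events per hour; chance of exactly 5 this hour
prob = poisson_pmf(k=5, lam=3.0)
print(round(prob, 4))  # ~0.1008
```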
Correlation
Correlation tells you how two variables move together.
If one goes up and the other also goes up → positive correlation
If one goes up and the other goes down → negative correlation
If they don’t move in any consistent way → no correlation
Example:
The more coffee you drink, the more productive you feel? That might be a positive correlation.
It’s measured between -1 and 1:
- 1 = perfect positive correlation
- -1 = perfect negative correlation
- 0 = no correlation at all
Correlation caveats
Just because two things are correlated doesn’t mean one causes the other.
Example:
Ice cream sales and shark attacks both go up in summer. But one doesn’t cause the other - it’s the season that influences both.
Watch out for:
- Spurious correlations (weird coincidences)
- Confounding variables (a hidden third factor)
- Overfitting in small samples
Correlation ≠ Causation
(But it’s a good place to start asking questions)