Bio/AI Software Engineer Glossary
3
3D Tensor -
A three-dimensional array of numbers, often used in deep learning to represent structured data such as spatial grids, sequences, or stacked embeddings.
Example:
AlphaFold uses 3D tensors to represent spatial relationships between amino acids.
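As a minimal sketch of the idea, the snippet below builds a small 3D tensor as nested Python lists, indexed as [residue i][residue j][feature]. The two pairwise features (separation along the chain and a same-residue flag) and the helper names pair_tensor and shape are illustrative assumptions, not AlphaFold's actual inputs.

```python
def pair_tensor(sequence):
    """Build an L x L x 2 tensor of toy pairwise features (hypothetical)."""
    length = len(sequence)
    tensor = []
    for i in range(length):
        row = []
        for j in range(length):
            separation = abs(i - j)                        # distance along the chain
            same = 1 if sequence[i] == sequence[j] else 0  # identical residue type?
            row.append([separation, same])
        tensor.append(row)
    return tensor


def shape(tensor):
    """Return (dim0, dim1, dim2) for a rectangular nested list."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


t = pair_tensor("MVLA")
print(shape(t))  # three axes: residue i, residue j, feature
print(t[0][3])   # features for the residue pair (M, A)
```

In practice a numerical library would store this as a single array, but the indexing pattern is the same.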
A
Amino Acid -
The basic building block of proteins. Each protein is a sequence of amino acids folded into a specific shape.
Example:
The sequence of amino acids determines the structure of a protein.
AlphaFold -
A deep learning model developed by DeepMind that predicts the 3D structures of proteins from their amino acid sequences.
Example:
AlphaFold was used in the project to generate a predicted 3D structure for a protein whose structure was unknown.
B
Bio AI Software Engineer -
An engineer who builds intelligent software, tools, and infrastructure that apply machine learning to biological data, accelerating breakthroughs in protein design, drug discovery, and molecular simulation.
Bioinformatician -
A scientist who uses computational tools to analyze and interpret biological data, often focusing on genomic sequences, protein structures, and biological networks.
Binding Affinity -
A measure of the strength of the interaction between a protein and a ligand.
Example:
We evaluated binding affinity scores to predict which molecules bind most tightly to the target.
C
Computational Biologist -
A scientist who develops models from biological data to better understand biological systems, analyzing large data sets with computational and mathematical methods.
Codon -
A set of three DNA or RNA nucleotides that codes for a specific amino acid during translation.
Example:
The codon AUG encodes methionine and serves as the start signal.
Cosine Similarity -
A metric equal to the cosine of the angle between two vectors. It measures how similar their directions are regardless of their magnitude and is commonly used to compare embeddings in high-dimensional space.
Example:
We used cosine similarity to find proteins with similar embedding representations.
CLS Vector -
A special embedding derived from the [CLS] token in BERT-style transformer models, often used as a summary representation of the entire input sequence.
Example:
We extracted the CLS vector to classify protein function based on its sequence embedding.
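The Cosine Similarity entry above can be sketched in a few lines of Python using the formula cos(theta) = (a . b) / (|a| |b|). The three short vectors are made-up stand-ins for real protein embeddings, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, chosen only to illustrate the metric.
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a, twice the magnitude
emb_c = [3.0, 0.0, -1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: magnitude is ignored
print(cosine_similarity(emb_a, emb_c))  # 0: the vectors are orthogonal
```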
D
DNA -
Deoxyribonucleic acid, the molecule that encodes the genetic instructions cells use to make proteins and carry out other cellular functions.
Example:
The DNA sequence determines the order of amino acids in a protein.
Dataset -
A structured collection of data used for training or evaluating models.
Example:
The project used a dataset of proteins with known ligands and binding affinities.
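To make the DNA and Codon entries above concrete, here is a small sketch of how a coding-strand DNA sequence maps to amino acids. The codon table is deliberately partial (the standard genetic code has 64 codons), and the DNA string and function names are invented for illustration.

```python
# Partial codon table from the standard genetic code; a full table has 64 entries.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start signal
    "GUG": "V",  # valine
    "CUG": "L",  # leucine
    "UCU": "S",  # serine
    "CCU": "P",  # proline
    "GCC": "A",  # alanine
    "UAA": "*",  # stop codon
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first position until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # Raises KeyError for codons missing from this partial table.
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGTGCTGTCTCCTGCCTAA"  # hypothetical example sequence
print(translate(transcribe(dna)))  # MVLSPA
```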
E
Embedding -
A numerical vector representation of data, such as a protein or a molecule, that captures semantic or structural features in a lower-dimensional space.
Example:
Each protein sequence was converted into a fixed-length embedding for similarity analysis.
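The sketch below shows one very simple way to turn a sequence of any length into a fixed-length vector: the fraction of each of the 20 standard amino acids. This hand-crafted composition vector is an assumption chosen for illustration. Embeddings used in practice are usually learned by a model and capture far more than composition.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition_embedding(sequence):
    """Map a sequence of any length to a 20-dimensional frequency vector."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


short = composition_embedding("MVLSPADKTNVKAA")
tripled = composition_embedding("MVLSPADKTNVKAA" * 3)
print(len(short), len(tripled))  # both vectors have the same length
```

Because both vectors have the same length, they can be compared directly, for example with cosine similarity.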
F
FASTA Format -
A plain text format for storing nucleotide or amino acid sequences. Each record begins with a header line starting with '>' and is followed by one or more lines of sequence.
Example:
The protein sequence was uploaded in FASTA format for AlphaFold input.
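A minimal FASTA reader might look like the following sketch. The function name parse_fasta and the example record are hypothetical, and the parser skips details that production tools handle, such as duplicate headers or comment lines.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]      # drop the leading '>'
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[header].append(line)  # sequences may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


example = """>example_protein hypothetical record
MVLSPAD
KTNVKAA
"""
records = parse_fasta(example)
print(records)
```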
H
Homolog -
A gene or protein that shares a common evolutionary origin with another, often retaining similar structure or function.
Example:
We searched for homologs to identify similar proteins with known structures.
I
Inference -
The process of using a trained machine learning model to make predictions on new, unseen data.
Example:
The model performed inference on new protein-ligand pairs to estimate binding scores.
L
Ligand -
A molecule, typically a small one, that binds to a protein, often at a specific binding site, altering the protein's function or taking part in a biochemical interaction.
Example:
We simulated a ligand binding to the protein's active site to evaluate its effect.
M
Machine Learning Engineer -
An engineer who researches, designs, and builds self-running AI systems that automate predictive modeling. ML engineers create algorithms that learn from data and make predictions.
Example:
Building a model to predict protein-ligand binding affinities using historical data.
Molecular Docking -
A computational technique that predicts how a ligand fits into a protein's binding site and estimates the strength of the interaction.
Example:
We performed docking simulations using the predicted protein structure.
Machine Learning -
An approach where algorithms learn from data to make predictions or classify unseen samples.
Example:
A machine learning model was trained to predict protein-ligand binding strength.
Model Training -
The process of feeding data into a machine learning model so it can learn patterns and adjust internal weights.
Example:
The binding affinity model was trained on a dataset of known protein-ligand pairs.
Model Inference -
The process of using a trained machine learning model to generate predictions or outputs based on new input data.
Example:
During inference, the model predicted binding scores for unseen protein-ligand pairs.
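The Model Training and Model Inference entries above can be illustrated with a toy example: a one-feature linear model fitted by gradient descent, then applied to an input it has not seen. The numbers are synthetic (generated from y = 2x + 1) and the linear model is only a stand-in. A real binding affinity model would be far more complex.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Training: fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w  # adjust the weights against the gradient
        b -= learning_rate * grad_b
    return w, b


def predict(weights, x):
    """Inference: apply the learned weights to a new input."""
    w, b = weights
    return w * x + b


# Synthetic training data following y = 2x + 1 (not real measurements).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

weights = train(train_x, train_y)
print(predict(weights, 4.0))  # x = 4.0 was not in the training data
```

Training changes the weights, while inference only reads them.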
N
Neural Network -
A type of machine learning model composed of layers of connected units (neurons) capable of learning complex patterns.
Example:
A neural network was used to predict structural properties of the protein.
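As a rough illustration of "layers of connected units", the sketch below runs a forward pass through a tiny network with one hidden layer. The weights are fixed by hand purely for demonstration. In a real network they would be learned during training, and the layer sizes and names here are assumptions.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each unit applies activation(w . inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-picked weights for a 2-input, 2-hidden-unit, 1-output network (hypothetical).
HIDDEN_W = [[1.0, -1.0], [-1.0, 1.0]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [[1.0, 1.0]]
OUTPUT_B = [0.0]


def forward(features):
    hidden = dense(features, HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUTPUT_W, OUTPUT_B, sigmoid)[0]


score = forward([2.0, 1.0])
print(score)  # a value between 0 and 1
```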
P
Protein -
A large molecule made up of amino acids, responsible for most biological functions in cells, including catalysis, signaling, and structural support.
Example:
The protein kinase plays a role in signaling pathways.
Protein Sequence -
A linear string of amino acids that defines the primary structure of a protein, often written using single-letter amino acid codes.
Example:
The protein sequence 'MVLSPADKTNVKAA' was used as input for structure prediction.
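The single-letter codes can be expanded to the conventional three-letter abbreviations with a lookup table, as in this sketch. The function name to_three_letter is an assumption, and only the 20 standard amino acids are covered.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Expand a one-letter protein sequence; reject non-standard letters."""
    unknown = sorted(set(sequence) - set(THREE_LETTER))
    if unknown:
        raise ValueError(f"non-standard residue codes: {unknown}")
    return "-".join(THREE_LETTER[aa] for aa in sequence)


print(to_three_letter("MVLSPADKTNVKAA"))
# Met-Val-Leu-Ser-Pro-Ala-Asp-Lys-Thr-Asn-Val-Lys-Ala-Ala
```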
S
SMILES -
Simplified Molecular Input Line Entry System. A line notation that encodes a molecule’s structure as a text string, used for computational input and modeling.
Example:
The ligand molecules were represented in SMILES format for model input.
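A few well-known molecules written as SMILES strings are shown below, along with a deliberately naive heavy-atom counter. The regular expression is an assumption that only handles the unbracketed organic-subset atoms. Real work should use a cheminformatics toolkit, since this sketch ignores bracket atoms, charges, and isotopes.

```python
import re

# SMILES strings for some simple molecules.
MOLECULES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",  # lowercase letters denote aromatic atoms
    "chloroethane": "CCCl",
}

# Two-letter symbols must be tried before single letters so "Cl" is not read as "C".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms in a bracket-free SMILES string (naive)."""
    if "[" in smiles:
        raise ValueError("bracket atoms are not supported by this sketch")
    return len(ATOM_PATTERN.findall(smiles))


for name, smiles in MOLECULES.items():
    print(name, smiles, count_heavy_atoms(smiles))
```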
T
Transformer Model -
A neural network architecture built on self-attention that is especially effective for sequence-based data, such as protein sequences.
Example:
AlphaFold and ESM models use transformer architectures for structure prediction.
Tokenization -
The process of splitting raw input data (such as a protein sequence) into smaller parts (tokens) that a model can process.
Example:
The protein sequence was tokenized into individual amino acids before feeding into the transformer.
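Per-residue tokenization can be sketched as follows. The vocabulary, the special tokens, and the ID assignments are hypothetical and do not match those of ESM or any other specific model.

```python
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[UNK]"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: special tokens first, then one token per amino acid.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence):
    """Split a sequence into one token per residue, prefixed with [CLS]."""
    residues = [aa if aa in VOCAB else "[UNK]" for aa in sequence.upper()]
    return ["[CLS]"] + residues


def encode(sequence):
    """Convert tokens to the integer IDs a model would consume."""
    return [VOCAB[token] for token in tokenize(sequence)]


print(tokenize("MVX"))  # X is not a standard residue, so it becomes [UNK]
print(encode("MVX"))
```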