Bio/AI Software Engineer

Glossary

3

  • 3D Tensor -

    A three-dimensional array of numbers, often used in deep learning to represent structured data such as spatial grids, sequences, or stacked embeddings.

    Example: AlphaFold uses 3D tensors to represent spatial relationships between amino acids.
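
As a minimal sketch of the idea (not AlphaFold's actual data layout), the snippet below builds a 3D tensor of shape (N, N, 3) that stores the displacement between every pair of points. The coordinates and the helper name `pairwise_displacements` are hypothetical and chosen only for illustration.

```python
def pairwise_displacements(coords):
    """Return a nested list of shape (N, N, 3).

    Entry [i][j] holds coords[j] - coords[i] as an [x, y, z] list.
    """
    return [
        [[cj[k] - ci[k] for k in range(3)] for cj in coords]
        for ci in coords
    ]


# Hypothetical 3D positions for four residues (illustrative values only).
coords = [
    (0.0, 0.0, 0.0),
    (1.5, 0.0, 0.0),
    (1.5, 2.0, 0.0),
    (0.0, 2.0, 1.0),
]
tensor = pairwise_displacements(coords)

# The three axes are: residue i, residue j, and the x/y/z component.
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)         # (4, 4, 3)
print(tensor[0][2])  # [1.5, 2.0, 0.0]
```

In practice a numerical library would store this as a single array, but the three-axis indexing is the same.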

A

  • Amino Acid -

    The basic building block of proteins. Each protein is a sequence of amino acids folded into a specific shape.

    Example: The sequence of amino acids determines the structure of a protein.

  • AlphaFold -

    A deep learning model developed by DeepMind that predicts 3D structures of proteins from amino acid sequences.

    Example: AlphaFold was used to generate a predicted 3D structure for an unknown protein in the project.

B

  • Bio AI Software Engineer -

    An engineer who builds intelligent software, tools, and infrastructure that apply machine learning to biological data, accelerating breakthroughs in protein design, drug discovery, and molecular simulation.

  • Bioinformatician -

    A scientist who uses computational tools to analyze and interpret biological data, often focusing on genomic sequences, protein structures, and biological networks.

  • Binding Affinity -

    A measure of the strength of the interaction between a protein and a ligand.

    Example: We evaluated binding affinity scores to predict which molecules bind most tightly to the target.

C

  • Computational Biologist -

    A scientist who uses computational and mathematical methods, together with large data sets, to build models of biological systems and better understand how they work.

  • Codon -

    A sequence of three DNA or RNA nucleotides that codes for a specific amino acid, or signals a stop, during translation.

    Example: The codon AUG encodes methionine and serves as the start signal.
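
A minimal sketch of how codons map to amino acids during translation. The table below is a small excerpt of the standard genetic code (the full table has 64 codons), and the function name `translate` and the input sequence are hypothetical.

```python
# Excerpt of the standard genetic code for RNA codons.
# "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start signal
    "GUG": "V",  # valine
    "CUG": "L",  # leucine
    "UUU": "F",  # phenylalanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",
    "UAG": "*",
    "UGA": "*",
}


def translate(rna):
    """Translate an RNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        residue = CODON_TABLE[codon]  # KeyError for codons outside this excerpt
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate("AUGGUGCUGUAA"))  # MVL
```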

  • Cosine Similarity -

    A metric that measures how similar two vectors are regardless of their magnitude, commonly used to compare embeddings in high-dimensional space.

    Example: We used cosine similarity to find proteins with similar embedding representations.
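
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below implements that definition in plain Python; the function name and the example vectors are illustrative, not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: similarity is close to 1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite direction: similarity is close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because the magnitudes cancel out, scaling either embedding by a positive constant does not change the score.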

  • CLS Vector -

    A special embedding derived from the [CLS] token in transformer models, often used as a summary representation of the entire input sequence.

    Example: We extracted the CLS vector to classify protein function based on its sequence embedding.
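
A minimal sketch of what "extracting the CLS vector" means, assuming the model returns one hidden-state vector per token and that the special token is spelled `[CLS]`. The token list, the numbers, and the helper `cls_vector` are hypothetical; real models produce much longer vectors.

```python
# Tokens for a short sequence, with special tokens at both ends (assumed layout).
tokens = ["[CLS]", "M", "V", "L", "[SEP]"]

# One made-up 3-dimensional hidden-state vector per token.
hidden_states = [
    [0.12, -0.40, 0.88],   # [CLS]
    [0.05, 0.31, -0.22],   # M
    [-0.47, 0.09, 0.15],   # V
    [0.33, -0.18, 0.02],   # L
    [0.00, 0.27, -0.61],   # [SEP]
]


def cls_vector(tokens, hidden_states):
    """Return the hidden-state vector at the position of the [CLS] token."""
    index = tokens.index("[CLS]")
    return hidden_states[index]


summary = cls_vector(tokens, hidden_states)
print(summary)  # [0.12, -0.4, 0.88]
```

The resulting vector has the same length regardless of sequence length, which is why it can feed directly into a classifier.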

D

  • DNA -

    A molecule that encodes the genetic instructions cells use to make proteins and carry out other functions.

    Example: The DNA sequence determines the order of amino acids in a protein.

  • Dataset -

    A structured collection of data used for training or evaluating models.

    Example: The project used a dataset of proteins with known ligands and binding affinities.

E

  • Embedding -

    A numerical vector representation of data, such as a protein or a molecule, that captures semantic or structural features in a lower-dimensional space.

    Example: Each protein sequence was converted into a fixed-length embedding for similarity analysis.
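
The source does not say how the fixed-length embedding was produced. One common approach, shown here purely as an assumption, is mean pooling: averaging the per-residue vectors so that proteins of any length map to a vector of the same size. The helper `mean_pool` and the numbers are hypothetical.

```python
def mean_pool(residue_vectors):
    """Average a list of equal-length vectors into a single vector."""
    if not residue_vectors:
        raise ValueError("need at least one residue vector")
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[k] for vec in residue_vectors) / n for k in range(dim)]


# Two made-up proteins of different lengths, with 2-dimensional residue vectors.
short_protein = [[1.0, 2.0], [3.0, 4.0]]
long_protein = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

print(mean_pool(short_protein))  # [2.0, 3.0]
print(mean_pool(long_protein))   # [2.0, 3.0]
```

Both outputs have length 2 even though the inputs have different numbers of residues, which is what makes the embeddings directly comparable.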

F

  • FASTA Format -

    A plain-text format for storing nucleotide or amino acid sequences, in which each record starts with a header line beginning with '>' followed by one or more lines of sequence.

    Example: The protein sequence was uploaded in FASTA format for AlphaFold input.
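
A minimal sketch of reading FASTA text, assuming each record has a '>' header line followed by one or more sequence lines. The function `parse_fasta` and the record name are hypothetical; production code would normally rely on an established parser.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[header].append(line)
    # Join wrapped sequence lines into one string per record.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>example_protein hypothetical record
MVLSPADK
TNVKAA
"""

records = parse_fasta(example)
print(records["example_protein hypothetical record"])  # MVLSPADKTNVKAA
```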

H

  • Homolog -

    A gene or protein that shares a common evolutionary origin with another, often retaining similar structure or function.

    Example: We searched for homologs to identify similar proteins with known structures.

I

  • Inference -

    The process of using a trained machine learning model to make predictions on new, unseen data.

    Example: The model performed inference on new protein-ligand pairs to estimate binding scores.

L

  • Ligand -

    A small molecule that binds to a protein, often at a specific binding site, to alter its function or as part of a biochemical interaction.

    Example: We simulated a ligand binding to the protein's active site to evaluate its effect.

M

  • Machine Learning Engineer -

    An engineer who researches, designs, and builds machine learning systems: algorithms that learn from data and make predictions, along with the pipelines that automate training and running those predictive models.

    Example: Building a model to predict protein-ligand binding affinities using historical data.

  • Molecular Docking -

    A computational technique that predicts how a ligand fits into a protein's binding site and estimates the strength of the interaction.

    Example: We performed docking simulations using the predicted protein structure.

  • Machine Learning -

    An approach where algorithms learn from data to make predictions or classify unseen samples.

    Example: A machine learning model was trained to predict protein-ligand binding strength.

  • Model Training -

    The process of feeding data into a machine learning model so it can learn patterns and adjust internal weights.

    Example: The binding affinity model was trained on a dataset of known protein-ligand pairs.
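
To make "adjust internal weights" concrete, here is a minimal sketch of training by gradient descent on a one-feature linear model with mean squared error. It is a toy stand-in, not the project's binding affinity model; the data, learning rate, and function name `train` are all hypothetical.

```python
def train(xs, ys, lr=0.05, epochs=1000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each weight a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Each pass over the data nudges the weights toward values that reduce the prediction error, which is the same loop larger models run at much greater scale.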

  • Model Inference -

    The process of using a trained machine learning model to generate predictions or outputs based on new input data.

    Example: During inference, the model predicted binding scores for unseen protein-ligand pairs.

N

  • Neural Network -

    A type of machine learning model composed of layers of connected units (neurons) capable of learning complex patterns.

    Example: A neural network was used to predict structural properties of the protein.
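
A minimal sketch of a forward pass through a two-layer network: each layer computes weighted sums of its inputs, and a ReLU nonlinearity sits between the layers. The weights are hand-picked, hypothetical values rather than learned ones.

```python
def relu(values):
    """Set negative values to zero and keep positive values unchanged."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum plus a bias for each neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + bias
        for row, bias in zip(weights, biases)
    ]


# Hypothetical parameters: 2 inputs -> 2 hidden neurons -> 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
B1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]
B2 = [0.5]


def forward(x):
    """Run the input through the hidden layer, ReLU, then the output layer."""
    hidden = relu(dense(x, W1, B1))
    return dense(hidden, W2, B2)


print(forward([3.0, 1.0]))  # [5.5]
print(forward([1.0, 3.0]))  # [1.5]
```

In the second call the first hidden neuron receives a negative sum and is zeroed by ReLU, which is the kind of nonlinearity that lets stacked layers represent more than a single linear map.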

P

  • Protein -

    A large molecule made up of amino acids, responsible for most biological functions in cells, including catalysis, signaling, and structural support.

    Example: The protein kinase plays a role in signaling pathways.

  • Protein Sequence -

    A linear string of amino acids that defines the primary structure of a protein, often written using single-letter amino acid codes.

    Example: The protein sequence 'MVLSPADKTNVKAA' was used as input for structure prediction.

S

  • SMILES -

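    Simplified Molecular Input Line Entry System: a line notation that encodes a molecule’s structure as a text string, used for computational input and modeling. For example, ethanol is written as CCO.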

    Example: The ligand molecules were represented in SMILES format for model input.

T

  • Transformer Model -

    A neural network architecture especially effective for sequence-based data, such as protein sequences.

    Example: ESM protein language models use transformer architectures, and AlphaFold builds on attention-based, transformer-style modules for structure prediction.

  • Tokenization -

    The process of splitting raw input data (such as a protein sequence) into smaller parts (tokens) that a model can process.

    Example: The protein sequence was tokenized into individual amino acids before feeding into the transformer.
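
A minimal sketch of residue-level tokenization, assuming one token per amino acid plus a few special tokens. The vocabulary order, the special-token names, and the function `tokenize` are hypothetical; each real model ships its own vocabulary.

```python
# The 20 standard amino acids in single-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed special tokens; actual names and ids differ between models.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[SEP]", "[UNK]"]

# Map every token to an integer id.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence):
    """Split a protein sequence into tokens and look up their integer ids."""
    known = set(AMINO_ACIDS)
    residues = [aa if aa in known else "[UNK]" for aa in sequence.upper()]
    tokens = ["[CLS]"] + residues + ["[SEP]"]
    ids = [VOCAB[tok] for tok in tokens]
    return tokens, ids


tokens, ids = tokenize("MVL")
print(tokens)  # ['[CLS]', 'M', 'V', 'L', '[SEP]']
print(ids)     # [1, 14, 21, 13, 2]
```

Letters outside the 20 standard residues fall back to `[UNK]`, so unexpected input still produces a valid id sequence.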