FREMONT, CA: Synthetic biology synergises with deep learning: its high-throughput workflows generate the large datasets needed to train models, while trained models inform design in turn, for example by pairing DNA synthesis with models that generate novel parts or suggest the optimal experiments to conduct. Recent research at the interface of engineering biology and deep learning has highlighted this potential through successes including the design of novel biological parts and biomolecular implementations of artificial neural networks.
A properly trained deep learning network takes an input and uses it to accurately predict an output. Input data are generally represented as matrices or vectors of numbers. These mathematical representations are essential for converting biological problems into ones amenable to model training. Because the representation codifies which information the model receives and restricts the set of applicable learning algorithms, identifying the optimal data representation for a specific problem is crucial to developing high-performing and generalisable models.
Practitioners should choose data representations carefully, ensuring that the independent variables pertinent to the problem are captured while limiting the irrelevant or confounding variables that a model must learn to ignore. In addition, careful selection lets the practitioner structure the representation so as to reduce the problem space and increase data efficiency. It is therefore necessary to understand the common types of synthetic-biology-relevant data and how they can be numerically represented.
Sequence Data
With sequencing capabilities rapidly expanding, sequence space now holds a vast quantity of data, including DNA, RNA, and amino acid sequences.
These data are represented as matrices by embeddings: functions that map sequence elements to vectors. The most basic embedding is a one-hot encoding, in which exactly one element of each embedding vector is "hot", taking a value of one, while the rest are zero.
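As an illustration, here is a minimal sketch of one-hot encoding a short DNA sequence, assuming NumPy; the alphabet and helper function are purely illustrative.

```python
# A minimal sketch of one-hot encoding a DNA sequence with NumPy.
import numpy as np

ALPHABET = "ACGT"  # the four DNA bases

def one_hot_encode(sequence: str) -> np.ndarray:
    """Map a DNA sequence to an (L, 4) matrix: one row per base, with a
    single 1 in the column of that base and 0 everywhere else."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoding = np.zeros((len(sequence), len(ALPHABET)), dtype=np.float32)
    for row, base in enumerate(sequence):
        encoding[row, index[base]] = 1.0
    return encoding

print(one_hot_encode("ACGT"))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```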
One-hot encoding is straightforward, but it limits the representational power of the model by disregarding the possibility that certain amino acids might behave similarly at a given position in the sequence. Amino acid embeddings learned from large unlabelled protein datasets outperform one-hot encodings in certain protein engineering tasks.
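By contrast, the sketch below shows a learned embedding table, assuming PyTorch; the alphabet, embedding dimension, and random initialisation are illustrative stand-ins for the dense vectors a model would actually learn from large unlabelled datasets.

```python
# A minimal sketch of a learned amino acid embedding, assuming PyTorch.
# Real protein models learn these vectors during training on large
# unlabelled datasets; here the table is randomly initialised.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
EMBED_DIM = 8  # illustrative size; real models use far larger dimensions

embedding = nn.Embedding(num_embeddings=len(AMINO_ACIDS),
                         embedding_dim=EMBED_DIM)

# Encode a short peptide as integer indices, then look up its vectors.
index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
peptide = "MKV"
indices = torch.tensor([index[aa] for aa in peptide])
vectors = embedding(indices)  # one dense vector per residue
print(vectors.shape)  # torch.Size([3, 8])
```

Unlike a one-hot vector, each learned vector is dense, so residues with similar behaviour can end up with similar representations.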
Molecular Structure Data
Molecular structure, at both small-molecule and macromolecular scales, can be described in various ways: geometrically, in a string-based representation, or through a learned embedding. A molecule can also be represented by its structural formula, which is encoded as a graph upon which graph-based learning methods are applied directly.
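As a sketch, assuming the RDKit library is available, a string-based SMILES representation can be converted into node features and an adjacency matrix suitable for graph-based learning:

```python
# A minimal sketch of encoding a structural formula as a graph with RDKit.
# Atoms become nodes (featurised here by atomic number) and bonds become
# edges in a symmetric adjacency matrix.
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol, written as a SMILES string

# Node features: one entry per heavy atom (hydrogens are implicit).
node_features = np.array([atom.GetAtomicNum() for atom in mol.GetAtoms()])

# Edges: the molecule's bond structure as an adjacency matrix.
adjacency = Chem.GetAdjacencyMatrix(mol)

print(node_features)  # [6 6 8] -> carbon, carbon, oxygen
print(adjacency)
# [[0 1 0]
#  [1 0 1]
#  [0 1 0]]
```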
Molecules are generally treated as objects in three-dimensional space, giving their constituent atoms explicit coordinates alongside their node features; machine learning workflows use both the coordinates and the node features. Furthermore, these concepts abstract to a higher-level view of molecular structure by defining nodes as nucleotides in DNA and RNA structures, or as amino acids in proteins.
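This residue-level view can be sketched with plain NumPy; the alpha-carbon coordinates and distance cutoff below are hypothetical placeholders for values a real workflow would read from a structure file.

```python
# A minimal sketch of a protein graph with amino acids as nodes, built
# from (hypothetical) 3D alpha-carbon coordinates: residues within a
# distance cutoff are connected by an edge.
import numpy as np

# Hypothetical coordinates for four residues' alpha carbons (angstroms).
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [3.8, 3.8, 0.0],
])

CUTOFF = 5.0  # illustrative contact threshold in angstroms

# Pairwise Euclidean distances between residues, shape (4, 4).
diff = coords[:, None, :] - coords[None, :, :]
distances = np.linalg.norm(diff, axis=-1)

# Adjacency: an edge wherever two distinct residues fall within the cutoff.
adjacency = (distances < CUTOFF) & ~np.eye(len(coords), dtype=bool)
print(adjacency.astype(int))
```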
Image Data
Synthetic biology experiments also generate image data, such as microscopy files. An image's pixels are represented as the rows and columns of a matrix, and the dimensions expand to include channel values if the image contains multiple colour channels. For example, a 400 × 600 pixel colour image is represented as a 400 × 600 × 3 array, holding data for the red, green, and blue colour channels.
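A minimal sketch, assuming NumPy, of how such a colour image is laid out in memory; a real workflow would load a microscopy file rather than create a blank array.

```python
# A minimal sketch of a 400 x 600 RGB image as a NumPy array.
import numpy as np

height, width, channels = 400, 600, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

# Pixel values are indexed by row, column, and colour channel.
image[100, 250, 0] = 255  # set one pixel's red channel to full intensity

print(image.shape)  # (400, 600, 3)
```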