<strong> Cell-penetrating peptides (CPPs) </strong> are short sequences of amino acids that have the remarkable ability to cross cellular membranes, facilitating the intracellular delivery of various therapeutic agents, including drugs, nucleic acids, and proteins. These peptides exploit mechanisms such as direct penetration or endocytosis to traverse cell membranes, making them powerful tools in drug delivery systems.
In real-world medical applications, CPPs are being leveraged to enhance the efficacy of treatments for a range of conditions. For instance, they are used in targeted cancer therapies to deliver chemotherapeutic agents directly to tumor cells, minimizing damage to healthy tissues. Additionally, CPPs are employed in gene therapy to transport genetic material into cells, offering potential treatments for genetic disorders like cystic fibrosis and muscular dystrophy. Their versatility and efficiency in overcoming cellular barriers position CPPs as a promising frontier in the development of advanced therapeutic strategies.
Our ultimate goal is to develop a precise machine learning (ML) model that makes it possible to <strong> design CPPs with superior activity </strong>. The main steps toward building such a model are:
<strong> 2. Data unification. </strong> The data presented in Datasets are heterogeneous and should be unified in terms of variables, measurement units, etc.
<strong> 3. System parametrization. </strong> You need to choose a set of parameters that describes both the CPPs and the experimental setup. Most existing models use symbolic representations, which lack the physico-chemical properties crucial for CPP activity prediction.
<strong> 4. Model selection. </strong> The best-performing models should be chosen for screening, depending on the task complexity (sequence classification or sequence generation).
<strong> 5. Feature selection. </strong> After model selection, choose the features that give the best combination of prediction performance, robustness, and interpretability.
<strong> 6. Evaluation. </strong> Every model should be evaluated beyond its performance on train/test datasets, for example by structural analysis of CPP candidates or by modelling their interaction with cellular membranes.
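As an illustration of the unification and parametrization steps, here is a minimal sketch in Python. The field names (`sequence`, `conc`, `unit`), the unit spellings, and the example concentration are assumptions, not the actual layout of the Datasets; the descriptors are deliberately simple, and the net charge is only a crude estimate that ignores histidine and the termini.

```python
# Factors converting a concentration to micromolar (uM).
# The unit spellings are assumptions about how the raw tables are labelled.
TO_MICROMOLAR = {"nM": 1e-3, "uM": 1.0, "µM": 1.0, "mM": 1e3, "M": 1e6}


def unify_concentration(value, unit):
    """Return the concentration in uM; raises KeyError on an unknown unit."""
    return float(value) * TO_MICROMOLAR[unit.strip()]


def clean_sequence(raw):
    """Remove whitespace and upper-case a one-letter amino acid sequence."""
    return "".join(raw.split()).upper()


def describe(seq):
    """Compute a few simple physico-chemical descriptors of a peptide.

    Net charge is a crude estimate near neutral pH: K and R count as +1,
    D and E as -1; histidine and the termini are ignored.
    """
    positive = sum(seq.count(a) for a in "KR")
    negative = sum(seq.count(a) for a in "DE")
    return {
        "length": len(seq),
        "net_charge": positive - negative,
        "arginine_fraction": seq.count("R") / len(seq),
    }


def unify_record(record):
    """Map one raw record (hypothetical field names) to a unified row."""
    seq = clean_sequence(record["sequence"])
    row = {
        "sequence": seq,
        "concentration_uM": unify_concentration(record["conc"], record["unit"]),
    }
    row.update(describe(seq))
    return row


# Penetratin sequence with a made-up concentration, for illustration only.
row = unify_record({"sequence": "rqikiwfq nrrmkwkk", "conc": 500, "unit": "nM"})
print(row)
```

Keeping the unit table and the descriptor function separate makes it easy to add further units or properties as new data sources are merged.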
The main challenge here is to develop an <strong> unbiased model </strong> that is not limited to existing CPP structures and cell penetration mechanisms. Another challenge is to develop CPPs <strong> for a particular drug delivery system and setup </strong>, which involves multi-property optimization (amphiphilicity, molecular weight, toxicity, etc.). Finally, models should be <strong> interpretable </strong>: the user should understand why a particular CPP is active and how it could be improved further.
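One common way to put a number on amphiphilicity is the per-residue hydrophobic moment of a sequence placed on an ideal alpha-helix (100 degrees between consecutive residues). The sketch below is only one possible descriptor, not a prescribed one. It assumes a helical conformation and uses the Kyte-Doolittle hydropathy scale in place of Eisenberg's consensus scale, so absolute values are not comparable with published Eisenberg moments.

```python
import math

# Kyte-Doolittle hydropathy values (an assumption: any hydrophobicity
# scale could be substituted here).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(seq, angle_deg=100.0):
    """Per-residue hydrophobic moment for a periodic structure.

    Each residue contributes a vector of length H_i pointing at angle
    i * angle_deg around the helix axis; the moment is the length of the
    vector sum divided by the number of residues.
    """
    delta = math.radians(angle_deg)
    sin_sum = sum(KYTE_DOOLITTLE[a] * math.sin(i * delta) for i, a in enumerate(seq))
    cos_sum = sum(KYTE_DOOLITTLE[a] * math.cos(i * delta) for i, a in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)


# A uniform sequence has no preferred face; an alternating K/L/A pattern does.
uniform = hydrophobic_moment("L" * 18)
patterned = hydrophobic_moment("KLALKLALKALKAALKLA")
print(f"uniform: {uniform:.3f}, patterned: {patterned:.3f}")
```

A value like this can serve as one objective among several (together with, for instance, molecular weight or a toxicity estimate) when candidates are ranked for a specific delivery setup.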
a) <em> Sequence classification </em> is the easiest task: you need to develop a model that distinguishes CPPs from non-CPPs based on the sequence alone. The main problem is that such models do not by themselves identify the best CPP sequences; an additional sequence generation algorithm is needed to discover novel CPPs.
b) <em> Uptake quantitative prediction </em> is more challenging because little data is available in the field. Here you need to predict not just whether a sequence is a CPP, but its cellular uptake depending on conditions and cell type. Although such a model lets you compare CPP activities and find the best ones, an additional sequence generation algorithm is still needed.
c) <em> Sequence generation </em> is the most challenging task since it requires deep learning models and a lot of data. Generation can be simple or conditional. In the first case, the model should basically learn the patterns in CPP sequences and be able to generate new sequences based on these intrinsic rules (cell type, uptake value, and experimental setup are not considered). In the second case, the model should generate potential CPP sequences based on information about desired uptake, cell type, and experimental setup.
d) <em> Hybrid algorithm </em> is the optimal choice, since simple classification/regression models can be "inverted" using evolutionary algorithms. Moreover, results obtained by simpler models can be reused by more complex ones to compensate for insufficient data.
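The idea of "inverting" a predictive model with an evolutionary algorithm can be sketched as follows. This is a minimal genetic algorithm with elitism, one-point crossover, and point mutation, written under stated assumptions: `surrogate_score` is a purely hypothetical stand-in for a trained classifier or regressor (it is not a real CPP predictor), and the population size, mutation rate, and sequence length are arbitrary illustrative values.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def surrogate_score(seq):
    """Hypothetical stand-in for a trained model's predicted activity.

    Replace this with the prediction of your own classifier/regressor.
    """
    cationic = sum(seq.count(a) for a in "RK") / len(seq)
    aromatic = seq.count("W") / len(seq)
    return 0.8 * cationic + 0.2 * min(aromatic * 5, 1.0)


def mutate(seq, rng, rate=0.1):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else a for a in seq)


def crossover(a, b, rng):
    """One-point crossover of two equal-length sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(score, length=12, pop_size=30, generations=40, seed=0):
    """Search sequence space for high-scoring peptides.

    Returns the best sequence found and the best score seen at the start
    of each generation.
    """
    rng = random.Random(seed)
    pop = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        elite = pop[: pop_size // 5]  # the top 20% survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        pop = elite + children
    return max(pop, key=score), history


best, history = evolve(surrogate_score)
print(best, round(surrogate_score(best), 3))
```

Because the elite is carried over unchanged, the best score never decreases between generations. Note that an optimizer like this will exploit any weakness of the scoring model, so candidates it proposes still need the independent evaluation described in the steps above.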
<h4><ins> 6. Choose the best-performing model. </ins></h4> Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness.
<h4><ins> 7. Validate your model. </ins></h4> Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies.
<h4><ins> 8. Formalize the results. </ins></h4> Create a repo, systematize the analysis results, submit the code, and ensure it is reproducible, usable, and readable.
DataCon 3.0 includes not only practical sessions but also authoritative lectures and other activities, so check for any schedule updates [HERE](https://scamt.ifmo.ru/datacon/).
For CPPs of 7 to 24 amino acids, you can use the [PMIpred neural network model](https://pmipred.fkt.physik.tu-dortmund.de/curvature-sensing-peptide/), trained on Molecular Dynamics (MD) data, to predict their interaction with the cellular membrane. Please run the modelling on a neutral membrane for better differentiation between CPPs and non-CPPs.
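Before submitting candidates to PMIpred, it can help to separate the sequences that fall within the stated 7 to 24 residue range from those that do not. The helper below is a small sketch of such a pre-filter; the function names are hypothetical, and it checks only the length, not any other input requirement the server may have.

```python
# Length range stated above for PMIpred input.
PMIPRED_MIN_LEN, PMIPRED_MAX_LEN = 7, 24


def eligible_for_pmipred(seq):
    """True if the sequence length lies within the supported range."""
    return PMIPRED_MIN_LEN <= len(seq) <= PMIPRED_MAX_LEN


def split_by_eligibility(sequences):
    """Return (eligible, skipped) lists, preserving input order."""
    eligible = [s for s in sequences if eligible_for_pmipred(s)]
    skipped = [s for s in sequences if not eligible_for_pmipred(s)]
    return eligible, skipped


candidates = ["RRRRRR", "YGRKKRRQRRR", "RQIKIWFQNRRMKWKK", "A" * 25]
eligible, skipped = split_by_eligibility(candidates)
print("eligible:", eligible)
print("skipped:", skipped)
```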
For so-called stapled peptides, which consist of both natural and modified amino acids, you can predict membrane permeability using the [STAPEP package](https://github.com/dahuilangda/stapep_package), which offers a full pipeline from data preprocessing to ML model development and application to novel samples.