The main challenge here is to develop an **unbiased model** that is not limited to existing CPP structures and cell penetration mechanisms. Another challenge is to develop CPPs **for a particular drug delivery system and setup**, which requires multi-property optimization (amphiphilicity, molecular weight, toxicity, etc.). Finally, models should be **interpretable**: the user should understand why a particular CPP shows activity and how it could be improved further.
### What do you need to do? :computer:
**1. Choose the tasks.** Sequence classification, quantitative uptake prediction, or sequence generation.
a) ***Sequence classification*** is the easiest task: you develop a model that distinguishes CPP from non-CPP sequences based on the sequence alone. The main drawback is that such models do not by themselves identify the best CPP sequences, so an additional sequence generation algorithm is needed to discover novel CPPs.
b) ***Quantitative uptake prediction*** is more challenging because little data is available in the field. Here you need to predict not only whether a sequence is a CPP, but also its cellular uptake depending on conditions and cell type. Although such a model lets you compare CPP activities and pick the best ones, an additional sequence generation algorithm is still needed.
c) ***Sequence generation*** is the most challenging task, since it requires deep learning models and a lot of data. Generation can be simple (unconditional) or conditional. In the first case, the model learns the patterns in CPP sequences and generates new sequences based on these intrinsic rules (cell type, uptake value, and experimental setup are not considered). In the second case, the model generates potential CPP sequences given the desired uptake, cell type, and experimental setup.
d) ***Hybrid algorithm*** is the optimal choice, since simple classification/regression models can be "inverted" using evolutionary algorithms. Moreover, results obtained with simpler models can be reused by more complex ones to compensate for insufficient data.
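
The "inversion" idea in (d) can be sketched as a small evolutionary loop: a trained model scores candidate sequences, and mutation plus selection searches for sequences with higher scores. This is a minimal sketch, not a recommended design. The scoring function `toy_score` is a hypothetical stand-in (fraction of Arg/Lys residues) used only to make the example runnable; in practice it would be replaced by the prediction of your own classifier or regressor. All names and parameter values are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def toy_score(seq):
    """Hypothetical stand-in for a trained model: fraction of cationic residues."""
    return sum(res in "RK" for res in seq) / len(seq)


def mutate(seq, rng):
    """Replace one randomly chosen position with a random residue."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def evolve(score_fn, length=12, pop_size=30, generations=40, seed=0):
    """Keep the best half of the population and refill it with mutated copies."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score_fn, reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=score_fn)


best = evolve(toy_score)
print(best, round(toy_score(best), 2))
```

Because the best half always survives, the top score never decreases between generations. A real setup would likely add crossover, diversity control, and constraints on toxicity or molecular weight.
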
**2. Create a database.** Process the datasets, look for more data, merge, clean, and unify it, then create a database with a DBMS.
- study the organization of data in the datasets
- search for additional data (high-throughput screening studies, review papers, databases, datasets, etc.)
- search for additional parameters describing CPPs or experimental setup, which can be computed, predicted, or parsed
- unify the data (measurement units, amino acid notations, etc.)
- remove duplicate samples
- choose the type of DBMS
- design a database schema for the DBMS
- load the data into the DBMS
- set up access, data retrieval, etc.
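
The unification, deduplication, and loading steps above can be sketched as follows. This sketch assumes SQLite as the DBMS (via Python's built-in `sqlite3` module); the table name, column names, and unit table are hypothetical, and the records are made-up placeholders rather than real measurements.

```python
import sqlite3

# Illustrative raw records (made up): mixed case, stray whitespace, mixed units.
raw_records = [
    {"sequence": " rrrrrrrr ", "concentration": 5.0, "unit": "uM"},
    {"sequence": "RRRRRRRR", "concentration": 5000.0, "unit": "nM"},  # duplicate once unified
    {"sequence": "GLFKALLK", "concentration": 10.0, "unit": "uM"},
]

TO_MICROMOLAR = {"uM": 1.0, "nM": 1e-3, "mM": 1e3}


def unify(record):
    """Normalize the sequence notation and convert the concentration to micromolar."""
    seq = record["sequence"].strip().upper()
    conc_um = record["concentration"] * TO_MICROMOLAR[record["unit"]]
    return seq, round(conc_um, 6)


conn = sqlite3.connect(":memory:")  # a file path would make the database persistent
conn.execute(
    """
    CREATE TABLE peptide (
        sequence TEXT NOT NULL,
        concentration_um REAL NOT NULL,
        UNIQUE (sequence, concentration_um)
    )
    """
)
# INSERT OR IGNORE skips rows that violate the UNIQUE constraint (duplicates).
conn.executemany(
    "INSERT OR IGNORE INTO peptide (sequence, concentration_um) VALUES (?, ?)",
    [unify(r) for r in raw_records],
)
conn.commit()

rows = conn.execute(
    "SELECT sequence, concentration_um FROM peptide ORDER BY sequence"
).fetchall()
print(rows)
```

Letting the database enforce uniqueness keeps deduplication consistent when new datasets are merged later. A real schema would also need columns for cell type, uptake, and experimental setup.
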
**3. Analyze the data.** Perform sequence alignment, look for conserved patterns, study correlations.
- perform local or global sequence alignment on CPP and non-CPP sequences (either all or particular groups/clusters)
- make amino acid frequency maps to search for conserved patterns and dependencies
- compare these findings with literature information
- make correlation plots for categorical and numeric parameters
- try to determine which parameters and sequence patterns lead to the best-performing CPPs
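
As one possible starting point for the frequency maps mentioned above, the sketch below compares amino acid composition between two groups of sequences as a log2 enrichment. The function names, the pseudocount value, and the toy sequences are assumptions made for illustration; the sequences are made up and are not taken from any dataset.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def composition(sequences):
    """Relative frequency of each standard amino acid across a group of sequences."""
    counts = Counter(res for seq in sequences for res in seq)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def log2_enrichment(group_a, group_b, pseudo=1e-3):
    """log2 ratio of residue frequencies in group A versus group B.

    The pseudocount avoids division by zero for residues absent from a group.
    """
    freq_a, freq_b = composition(group_a), composition(group_b)
    return {aa: math.log2((freq_a[aa] + pseudo) / (freq_b[aa] + pseudo))
            for aa in AMINO_ACIDS}


# Made-up toy sequences, for illustration only.
cpp_like = ["RRKRRQRR", "KKRKLRRK"]
non_cpp_like = ["DEGSDEAS", "GSEDTGSA"]

enrichment = log2_enrichment(cpp_like, non_cpp_like)
top = sorted(enrichment.items(), key=lambda kv: kv[1], reverse=True)[:3]
for aa, value in top:
    print(aa, round(value, 2))
```

Positive values mark residues over-represented in the first group. The same idea extends to position-specific frequencies once sequences are aligned.
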
**4. Choose the models.** Find the best-performing classification/regression/generation models to develop and compare.
- skip models that have been shown to underperform most modern ML/DL models
- use the models with documented performance
- you can use models pre-trained on more abundant data (transfer learning)
- prioritize interpretable models over black-box ones
**5. Build and optimize the models.** Check model performance with default parameters, optimize hyperparameters and architecture.
- choose the logic of train/test split (random, stratified, rule-based etc.)
- build basic models in their simplest form
- debug the code until it runs correctly
- make a list of architectures you want to test (for neural networks)
- choose a method for hyperparameter tuning (Optuna, grid search, etc.)
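
The train/test split logic from this step can be illustrated with a dependency-free stratified split, which keeps the CPP/non-CPP ratio roughly equal in both parts. This is only a sketch: the function name, the default test fraction, and the labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps a similar share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class, even for very small classes.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 10 + [0] * 20  # made-up class labels: 10 CPPs, 20 non-CPPs
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

If scikit-learn is available, `train_test_split` with its `stratify` argument serves the same purpose. A rule-based split (for example, by sequence cluster) is a stricter test of generalization, because similar sequences cannot end up on both sides.
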
**6. Choose the best-performing model.** Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness.
- analyze model performance (use appropriate classification/regression metrics or loss functions)
- check model extrapolative power (the ability to handle samples that differ substantially from the training samples)
- analyze model feature importance and its consistency with literature observations and basic logic
- analyze model speed
- choose the best model according to these parameters
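
For the performance analysis above, one possible pair of classification metrics is accuracy and the Matthews correlation coefficient (MCC). The sketch below computes both from a confusion matrix in plain Python. The label convention (1 = CPP, 0 = non-CPP) and the example predictions are assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 = CPP and 0 = non-CPP."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Share of correctly classified samples."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(y_true, y_pred):
    """MCC stays informative for imbalanced classes; returns 0.0 when undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred), round(matthews_corrcoef(y_true, y_pred), 3))
```

Accuracy alone can look good when non-CPPs dominate the data, which is why a balance-aware metric such as MCC is worth reporting next to it.
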
**7. Validate your model.** Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies.
- choose the methods for additional model validation (computational models, benchmarked ML/DL models, hybrid approaches)
- check how well these methods correlate with labeled data (for instance, how well they differentiate between CPPs and non-CPPs)
- analyze how well these methods explain the obtained results
- generate novel CPPs for validation
**8. Formalize the results.** Create a repo, systematize analysis results, submit the code, ensure the code is reproducible, usable, and readable.
- create a GitHub repository structure
- sort and publish all the results obtained during data analysis
- structure, debug, and prettify the code
- ensure all dependencies needed for the code to run correctly are listed in `requirements.txt`
### Schedule :calendar:
DataCon 3.0 includes not only practical sessions but also expert lectures and other activities, so check for any schedule updates [HERE](https://scamt.ifmo.ru/datacon/).