Update README.md

main
ACID Design Lab 7 months ago committed by GitHub
parent 6844a4c06c
commit b70344ae69
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@ -48,15 +48,15 @@
<h4> <ins> 1. Choose the tasks. </ins> </h4> Sequence classification, uptake quantitative prediction, or sequence generation.
a) <em> Sequence classification* is the easiest task: you develop a model that distinguishes CPP from non-CPP sequences based on the sequence alone. The main limitation is that such models do not inherently identify the best CPP sequences, so an additional sequence generation algorithm is needed to discover novel CPPs.
a) <em> Sequence classification </em> is the easiest task: you develop a model that distinguishes CPP from non-CPP sequences based on the sequence alone. The main limitation is that such models do not inherently identify the best CPP sequences, so an additional sequence generation algorithm is needed to discover novel CPPs.
b) *Uptake quantitative prediction* is more challenging because little data is available in the field. Here you need to predict not just whether a sequence is a CPP, but its cellular uptake depending on conditions and cell type. Although such a model lets you compare CPP activities and find the best ones, an additional sequence generation algorithm is still needed.
b) <em> Uptake quantitative prediction </em> is more challenging because little data is available in the field. Here you need to predict not just whether a sequence is a CPP, but its cellular uptake depending on conditions and cell type. Although such a model lets you compare CPP activities and find the best ones, an additional sequence generation algorithm is still needed.
c) *Sequence generation* is the most challenging task since it requires deep learning models and a lot of data. Generation can be simple or conditional. In the simple case, the model learns the patterns in CPP sequences and generates new sequences based on these intrinsic rules (cell type, uptake value, and experimental setup are not considered). In the conditional case, the model generates potential CPP sequences given the desired uptake, cell type, and experimental setup.
c) <em> Sequence generation </em> is the most challenging task since it requires deep learning models and a lot of data. Generation can be simple or conditional. In the simple case, the model learns the patterns in CPP sequences and generates new sequences based on these intrinsic rules (cell type, uptake value, and experimental setup are not considered). In the conditional case, the model generates potential CPP sequences given the desired uptake, cell type, and experimental setup.
d) *Hybrid algorithm* is the best choice, since simple classification/regression models can be "inverted" using evolutionary algorithms. Moreover, results obtained by simpler models can be reused by more complex ones to compensate for insufficient data.
d) <em> Hybrid algorithm </em> is the best choice, since simple classification/regression models can be "inverted" using evolutionary algorithms. Moreover, results obtained by simpler models can be reused by more complex ones to compensate for insufficient data.
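
The idea of using an evolutionary algorithm to search for sequences that a scoring model rates highly can be sketched as follows. This is a minimal illustration, not the repository's method: `toy_score` is a hypothetical stand-in for a trained classifier or regressor (here simply the fraction of R/K residues), and `mutate`, `evolve`, and all parameter values are assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: fraction of R/K residues."""
    return sum(seq.count(aa) for aa in "RK") / len(seq)


def mutate(seq: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a random amino acid."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def evolve(score, length=12, pop_size=30, generations=40, seed=0):
    """Search for a high-scoring sequence by selection and point mutation."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Keep the top-scoring half unchanged (elitism) ...
        population.sort(key=score, reverse=True)
        survivors = population[: pop_size // 2]
        # ... and refill the population with mutated copies of survivors.
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=score)


best = evolve(toy_score)
print(best, round(toy_score(best), 2))
```

In practice `toy_score` would be replaced by the prediction of whichever classification or regression model was trained; because the best half of the population is always kept, the best score cannot decrease from one generation to the next.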
<ins>**2. Create a database.**</ins> Process the datasets, look for more data, merge, clean, and unify it, and create a database with a DBMS.
<h4> <ins> 2. Create a database. </ins> </h4> Process the datasets, look for more data, merge, clean, and unify it, and create a database with a DBMS.
- study the organization of data in the datasets
- search for additional data (high throughput screening studies, review papers, databases, datasets etc.)
@ -68,7 +68,7 @@
- move the data to DBMS
- set up access, data retrieval etc.
<ins>**3. Analyze the data.**</ins> Perform sequence alignment, look for conserved patterns, study correlations.
<h4> <ins> 3. Analyze the data. </ins> </h4> Perform sequence alignment, look for conserved patterns, study correlations.
- perform local or global sequence alignment on CPP and non-CPP sequences (either all or particular groups/clusters)
- make amino acid frequency maps to search for conserved patterns and dependencies
@ -76,14 +76,14 @@
- make correlation plots for categorical and numeric parameters
- try to answer which parameters and sequence patterns lead to the best-performing CPPs
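
As a starting point for the amino acid frequency maps mentioned above, residue frequencies can be computed per group and compared. The sketch below is only an illustration: the function name `aa_frequencies` and the example sequences are made up and are not taken from the datasets in this repository.

```python
from collections import Counter


def aa_frequencies(sequences):
    """Return the relative frequency of each residue across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in sorted(counts.items())}


cpp_like = ["RRRRRRRR", "KKRK"]   # made-up example sequences
non_cpp_like = ["AAGG", "DEAG"]   # made-up example sequences

cpp_freq = aa_frequencies(cpp_like)
non_freq = aa_frequencies(non_cpp_like)

# Difference in frequency between the two groups, most enriched first.
enrichment = {
    aa: cpp_freq.get(aa, 0.0) - non_freq.get(aa, 0.0)
    for aa in set(cpp_freq) | set(non_freq)
}
for aa, delta in sorted(enrichment.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{aa}: {delta:+.2f}")
```

The same per-group dictionaries can be fed into a plotting library to draw the actual frequency maps.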
<ins>**4. Choose the models.**</ins> Find the best-performing classification/regression/generation models to develop and compare.
<h4> <ins> 4. Choose the models. </ins> </h4> Find the best-performing classification/regression/generation models to develop and compare.
- do not screen models that have been shown to underperform most modern ML/DL models
- use models with documented performance
- you can use models pre-trained on more abundant data (transfer learning)
- prioritize interpretable models over black-box ones
<ins>**5. Build and optimize the models.**</ins> Check model performance on default parameters, optimize hyperparameters and architecture.
<h4> <ins> 5. Build and optimize the models. </ins> </h4> Check model performance on default parameters, optimize hyperparameters and architecture.
- choose the logic of train/test split (random, stratified, rule-based etc.)
- build basic models in their simplest form
@ -91,7 +91,7 @@
- make a list of architectures you want to test (for neural networks)
- choose a method for hyperparameter tuning (Optuna, Grid search etc.)
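
To make the difference between a random and a stratified train/test split concrete, here is a minimal standard-library sketch of a stratified split. The function `stratified_split`, its parameters, and the placeholder data are assumptions for illustration; a library implementation would normally be used in a real pipeline.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.25, seed=0):
    """Split so that each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        # At least one test sample per label, so very small classes are
        # still represented in the test set.
        n_test = max(1, round(len(group) * test_fraction))
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test


sequences = [f"SEQ{i}" for i in range(12)]  # placeholder identifiers, not real peptides
labels = [1] * 8 + [0] * 4                  # 8 positives, 4 negatives
train, test = stratified_split(sequences, labels)
print(len(train), len(test))  # 9 3
```

With a purely random split, a small class can end up entirely in the training set; splitting within each label avoids that at the cost of slightly more bookkeeping.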
<ins>**6. Choose the best-performing model.**</ins> Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness.
<h4> <ins> 6. Choose the best-performing model. </ins> </h4> Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness.
- analyze model performance (use appropriate classification/regression metrics or loss functions)
- check the model's extrapolative power (its ability to work on samples that differ substantially from the training samples)
@ -99,14 +99,14 @@
- analyze model speed
- choose the best model according to these parameters
<ins>**7. Validate your model.**</ins> Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies.
<h4> <ins> 7. Validate your model. </ins> </h4> Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies.
- choose the methods for additional model validation (computational models, benchmarked ML/DL models, hybrid approaches)
- check how well these methods correlate with labeled data (for instance, how well they differentiate between CPPs and non-CPPs)
- analyze how well these methods explain the obtained results
- generate novel CPPs for validation
<ins>**8. Formalize the results.**</ins> Create a repo, systematize analysis results, submit the code, ensure the code is reproducible, usable, and readable.
<h4> <ins> 8. Formalize the results. </ins> </h4> Create a repo, systematize analysis results, submit the code, ensure the code is reproducible, usable, and readable.
- create a GitHub repository structure
- sort and publish all the results obtained during data analysis
@ -115,28 +115,28 @@
---
## Schedule :calendar:
<h2> Schedule :calendar: </h2>
DataCon 3.0 includes not only practical sessions but also authoritative lectures and other activities, so check for schedule updates [HERE](https://scamt.ifmo.ru/datacon/).
---
## Contents :open_book:
<h2> Contents :open_book: </h2>
This repository contains the following data:
1. **Articles about CPPs** to read (see the relevant folder)
2. **Available datasets** for model development (see **Data Description** section)
3. **Useful tools** for property and structure prediction (see **Useful tools** section and relevant folder)
1. <strong> Articles about CPPs </strong> to read (see the relevant folder)
2. <strong> Available datasets </strong> for model development (see <strong> Data Description </strong> section)
3. <strong> Useful tools </strong> for property and structure prediction (see <strong> Useful tools </strong> section and relevant folder)
---
## Data description :floppy_disk:
<h2> Data description :floppy_disk: </h2>
### 1. Mixed CPPs
<h3> 1. Mixed CPPs </h3>
Contains CPPs with natural or modified amino acids.
1.1. POSEIDON
<h4> 1.1. POSEIDON </h4>
Contains heterogeneous experimental data regarding CPP (natural and non-natural amino acids) activity measurements (.csv format), which are:
- peptide name,
@ -151,58 +151,58 @@
- uptake type,
- sequence.
### 2. Natural CPPs
<h3> 2. Natural CPPs </h3>
Contains only sequences with natural amino acids.
2.1. CPPBase
<h4> 2.1. CPPBase </h4>
Contains sequences of CPPs with experimentally proved activity in .fasta format.
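
Files in .fasta format can be read with a few lines of standard-library code. The reader below is a minimal sketch under the assumption of plain FASTA (header lines starting with `>` followed by one or more sequence lines); the function name `read_fasta` and the in-memory example records are illustrative and do not come from the CPPBase file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Illustrative in-memory input; a real file would be opened with open(path).
example = io.StringIO(">pep1\nRRRR\nKK\n>pep2\nAGAG\n")
records = list(read_fasta(example))
print(records)
```

Sequences that span several lines are joined into one string, so downstream code can treat every record as a single sequence.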
2.2. Experimental and Experimental2
<h4> 2.2. Experimental and Experimental2 </h4>
Contain more sequences of CPPs with experimentally proved activity in .txt format.
2.3. Experimental_high_uptake
<h4> 2.3. Experimental_high_uptake </h4>
Contains CPP sequences with high (but not stated) uptake in .txt format.
2.4. Balanced_dataset
<h4> 2.4. Balanced_dataset </h4>
Represents a balanced dataset of CPPs and non-CPPs; often used for model benchmarking.
### 3. Non-CPPs
<h3> 3. Non-CPPs </h3>
Contains negative CPP samples in .txt format.
3.1. Generated
<h4> 3.1. Generated </h4>
Contains randomly generated sequences treated as negative.
3.2. Experimental
<h4> 3.2. Experimental </h4>
Contains non-CPP sequences shown experimentally to lack activity.
### 4. Non-Natural CPPs
<h3> 4. Non-Natural CPPs </h3>
Contains CPPs consisting of non-natural amino acids.
4.1. CPPBase_modified
<h4> 4.1. CPPBase_modified </h4>
Contains a list of modified CPPs with experimentally proved activity in .fasta format.
4.2. CPPBase_modified_symbols
<h4> 4.2. CPPBase_modified_symbols </h4>
Contains a list of abbreviations for modified amino acids in .txt format (ABBREVIATION: NAME; ...: ...).
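
Given the `ABBREVIATION: NAME; ...: ...` layout described above, the abbreviation list can be turned into a lookup table roughly as follows. This is a sketch assuming entries are separated by semicolons and that the first colon separates the abbreviation from the name; `parse_abbreviations` is a hypothetical helper, and the example entries are common abbreviations chosen for illustration rather than copied from the file.

```python
def parse_abbreviations(text):
    """Parse 'ABBR: NAME; ABBR: NAME' text into a dict of abbreviation -> name."""
    mapping = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first colon only, so names containing ':' stay intact.
        abbr, sep, name = entry.partition(":")
        if not sep:
            continue  # skip malformed entries without a colon
        mapping[abbr.strip()] = name.strip()
    return mapping


# Illustrative entries, not taken from CPPBase_modified_symbols.
example = "Orn: ornithine; Nle: norleucine; Aib: 2-aminoisobutyric acid"
symbols = parse_abbreviations(example)
print(symbols["Aib"])
```

The resulting dictionary can then be used to expand or validate the modified residues found in the CPPBase_modified sequences.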
## Useful tools :bookmark_tabs:
<h2> Useful tools :bookmark_tabs: </h2>
### Structure prediction
<h3> Structure prediction </h3>
In the relevant folder you can find a Jupyter notebook with AlphaFold 2.
@ -214,7 +214,7 @@
<img src="https://github.com/acid-design-lab/DataCon24/assets/82499756/640ee468-cac2-4e7d-8042-8baf68bbe865" alt="drawing" width="500"/>
### Modelling of interaction with membrane
<h3> Modelling of interaction with membrane </h3>
For CPPs of 7 to 24 amino acids you can use the [PMIpred neural network model](https://pmipred.fkt.physik.tu-dortmund.de/curvature-sensing-peptide/), trained on Molecular Dynamics (MD) data, to predict their interaction with the cellular membrane. Please model on a neutral membrane for better differentiation between CPPs and non-CPPs.
@ -228,7 +228,7 @@
<img src="https://github.com/acid-design-lab/DataCon24/assets/82499756/22cd60b9-0d0f-4021-a61e-8b0865c8b583" alt="drawing" width="500"/>
### Membrane permeability prediction
<h3> Membrane permeability prediction </h3>
For so-called stapled peptides consisting of both natural and modified amino acids you can predict membrane permeability using the [STAPEP package](https://github.com/dahuilangda/stapep_package), which offers a full pipeline from data preprocessing to ML model development and application to novel samples.
