From 8216cceea0258ccfe293fd90555957c97e68728e Mon Sep 17 00:00:00 2001
From: ACID Design Lab <82499756+acid-design-lab@users.noreply.github.com>
Date: Sat, 8 Jun 2024 02:02:59 +0300
Subject: [PATCH] Update README.md

---
 README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index eaba364..b62a0b4 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@

What do you need to do? :computer:

-

1. Choose the tasks.

Sequence classification, uptake quantitative prediction, or sequence generation. +

1. Choose the tasks. Sequence classification, uptake quantitative prediction, or sequence generation.

a) Sequence classification is the easiest task: you develop a model that distinguishes CPP from non-CPP sequences based on the sequence alone. The main problem is that such models do not inherently identify the best CPP sequences, so an additional sequence-generation algorithm is needed to discover novel CPPs.
@@ -54,7 +54,7 @@
d) A hybrid algorithm is the best choice, since simple classification/regression models can be "inverted" using evolutionary algorithms. Moreover, results obtained by simpler models can be reused by more complex ones to compensate for insufficient data.
-
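The idea of "inverting" a simple model with an evolutionary algorithm, mentioned in d), can be sketched roughly as follows. This is only an illustration under loud assumptions: `toy_score` is a hypothetical stand-in for a trained classifier or regressor (here it merely rewards arginine and lysine), and `mutate`, `crossover`, and `evolve` are names invented for this sketch, not part of the repository.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(seq):
    """Hypothetical stand-in for a trained model's output.

    Returns the fraction of cationic residues (R, K). In practice this
    would be replaced by the predicted CPP probability or uptake value.
    """
    return sum(seq.count(aa) for aa in "RK") / len(seq)


def mutate(seq, rng, rate=0.1):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def crossover(a, b, rng):
    """Single-point crossover of two equal-length sequences."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(score, length=12, pop_size=30, generations=40, seed=0):
    """Search for a high-scoring sequence using selection, crossover, mutation."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 5]  # top 20% survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = elite + children
    return max(population, key=score)


if __name__ == "__main__":
    best = evolve(toy_score)
    print(best, round(toy_score(best), 2))
```

Because the elite is carried over unchanged, the best score never decreases between generations. A real scoring model would also need constraints (length range, synthesizability, similarity to training data) that this sketch leaves out.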

2. Create a database.

Process datasets, look for more data, merge it, clean, and unify, create a database with DBMS. +

2. Create a database. Process datasets, look for more data, merge it, clean, and unify, create a database with DBMS.

- study the organization of data in the datasets
- search for additional data (high throughput screening studies, review papers, databases, datasets etc.)
@@ -66,7 +66,7 @@
- move the data to DBMS
- set up access, data retrieval etc.
-

3. Analyze the data.

Perform sequence alignment, look for conserved patterns, study correlations. +

3. Analyze the data. Perform sequence alignment, look for conserved patterns, study correlations.

- perform local or global sequence alignment on CPP and non-CPP sequences (either all or particular groups/clusters)
- make amino acid frequency maps to search for conserved patterns and dependencies
@@ -74,14 +74,14 @@
- make correlation plots for categorical and numeric parameters
- try to answer the question of which parameters and sequence patterns lead to the best-performing CPPs
-
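As a rough starting point for the amino acid frequency maps mentioned above, the sketch below computes overall and per-position residue frequencies. It assumes unaligned sequences anchored at the N-terminus, which is a simplification; the function names are hypothetical, and the three peptides (TAT 47-57, penetratin, octaarginine) are well-known CPPs used only as illustrative input, not taken from the project datasets.

```python
from collections import Counter


def residue_frequencies(sequences):
    """Overall amino acid composition across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def positional_frequencies(sequences, n_positions):
    """Per-position residue frequencies for the first `n_positions` residues.

    Sequences shorter than a given position are skipped for that position.
    No alignment is performed (assumption: N-terminal anchoring).
    """
    table = []
    for i in range(n_positions):
        column = Counter(seq[i] for seq in sequences if len(seq) > i)
        total = sum(column.values())
        table.append({aa: n / total for aa, n in column.items()} if total else {})
    return table


if __name__ == "__main__":
    # Illustrative input only
    peptides = ["YGRKKRRQRRR", "RQIKIWFQNRRMKWKK", "RRRRRRRR"]
    print(residue_frequencies(peptides))
    print(positional_frequencies(peptides, 3))
```

Comparing these tables between CPP and non-CPP groups is one simple way to spot enriched residues before moving on to proper alignment-based analysis.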

4. Choose the models.

Find the best-performing classification/regression/generation models to develop and compare. +

4. Choose the models. Find the best-performing classification/regression/generation models to develop and compare.

- do not screen models that have been shown to underperform most modern ML/DL models
- use models with documented performance
- you can use models pre-trained on more abundant data (transfer learning)
- prioritize interpretable models over black-box ones
-

5. Build and optimize the models.

Check model performance on default parameters, optimize hyperparameters and architecture. +

5. Build and optimize the models. Check model performance on default parameters, optimize hyperparameters and architecture.

- choose the logic of the train/test split (random, stratified, rule-based etc.)
- build basic models in their simplest form
@@ -89,7 +89,7 @@
- make a list of architectures you want to test (for neural networks)
- choose a method for hyperparameter tuning (Optuna, grid search etc.)
-
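To make the "stratified" option for the train/test split concrete, here is a minimal standard-library sketch that keeps the class ratio roughly equal in both parts. The function name and defaults are assumptions for illustration; in practice scikit-learn's `train_test_split(..., stratify=labels)` does the same job.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split samples so each class keeps about the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one test sample per class; a class with a single
        # sample therefore ends up entirely in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


if __name__ == "__main__":
    # Placeholder sample names; 1 = CPP, 0 = non-CPP
    samples = [f"seq{i}" for i in range(40)]
    labels = [1] * 10 + [0] * 30
    train, test = stratified_split(samples, labels)
    print(len(train), len(test))
```

A rule-based split (for example, holding out whole sequence clusters) is usually a stricter test of extrapolation than a random or stratified one, at the cost of a smaller effective training set.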

6. Choose the best-performing model.

Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness. +

6. Choose the best-performing model. Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness.

- analyze model performance (use appropriate classification/regression metrics or loss functions)
- check model extrapolative power (the ability to work on samples that differ a lot from the training samples)
@@ -97,14 +97,14 @@
- analyze model speed
- choose the best model according to these parameters
-
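As one example of a classification metric for the performance analysis above, the sketch below computes accuracy and the Matthews correlation coefficient (MCC) from binary labels. MCC is often preferred when classes are imbalanced; whether that applies here depends on the final dataset. The function names are assumptions for this sketch, and labels are assumed to be encoded as 1 (CPP) and 0 (non-CPP).

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1]
    print(accuracy(y_true, y_pred), mcc(y_true, y_pred))
```

For the regression task, the analogous step would be an error metric such as RMSE or a rank correlation on held-out uptake values.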

7. Validate your model.

Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies. +

7. Validate your model. Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies.

- choose the methods for additional model validation (computational models, benchmarked ML/DL models, hybrid approaches)
- check how well these methods correlate with labeled data (for instance, how well they differentiate between CPPs and non-CPPs)
- analyze how well these methods explain the obtained results
- generate novel CPPs for validation
-

8. Formalize the results.

Create a repo, systematize analysis results, submit the code, ensure the code is reproducible, usable, and readable. +

8. Formalize the results. Create a repo, systematize analysis results, submit the code, ensure the code is reproducible, usable, and readable.

- create a GitHub repository structure
- sort and publish all the results obtained during data analysis