From 0bac55f43791bd8099ab394fce18acf4bab95a8d Mon Sep 17 00:00:00 2001
From: ACID Design Lab <82499756+acid-design-lab@users.noreply.github.com>
Date: Sat, 8 Jun 2024 02:12:39 +0300
Subject: [PATCH] Update README.md
---
README.md | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/README.md b/README.md
index db0ae97..ff12f9d 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,8 @@
d) A hybrid algorithm is the best choice, since simple classification/regression models can be "inverted" using evolutionary algorithms. Moreover, results obtained by simpler models can be reused by more complex ones to compensate for insufficient data.
+
+
2. Create a database. Process the datasets, look for more data, merge, clean, and unify it, and create a database with a DBMS.
- study the organization of data in the datasets
@@ -66,6 +68,8 @@
- move the data to DBMS
- set up access, data retrieval, etc.
+
+
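
A minimal sketch of the database step above, using Python's built-in `sqlite3` as the DBMS; the table layout, column names, and demo records are illustrative assumptions, not the project's actual schema:

```python
import sqlite3

# Hypothetical minimal schema for a unified CPP database
# (table and column names are illustrative only).
conn = sqlite3.connect(":memory:")  # use a file path for a persistent DB
conn.execute("""
    CREATE TABLE peptides (
        id       INTEGER PRIMARY KEY,
        sequence TEXT NOT NULL UNIQUE,
        is_cpp   INTEGER NOT NULL,   -- 1 = CPP, 0 = non-CPP
        source   TEXT                -- dataset the record came from
    )
""")
conn.executemany(
    "INSERT INTO peptides (sequence, is_cpp, source) VALUES (?, ?, ?)",
    [("RRRRRRRR", 1, "demo"), ("ACDEFGHIK", 0, "demo")],
)
conn.commit()

# Data retrieval: count the CPP records.
cpp_count = conn.execute(
    "SELECT COUNT(*) FROM peptides WHERE is_cpp = 1"
).fetchone()[0]
print(cpp_count)  # -> 1
```

The `UNIQUE` constraint on `sequence` is one simple way to deduplicate records when merging several source datasets.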
3. Analyze the data. Perform sequence alignment, look for conservative patterns, study correlations.
- perform local or global sequence alignment on CPP and non-CPP sequences (either all or particular groups/clusters)
@@ -74,6 +78,8 @@
- make correlation plots for categorical and numeric parameters
- try to answer the question of which parameters and sequence patterns lead to the best-performing CPPs
+
+
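
As a sketch of the global alignment mentioned above, here is a toy Needleman-Wunsch scorer in pure Python; the match/mismatch/gap values and the example sequences are illustrative assumptions, not the project's actual scoring scheme:

```python
# Toy Needleman-Wunsch global alignment score (pure Python).
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[len(a)][len(b)]

# Two arginine-rich toy peptides differing in one residue:
# 4 matches (+4) and 1 mismatch (-1) give a score of 3.
score = global_alignment_score("RRKRR", "RRARR")
print(score)  # -> 3
```

For real data, a substitution matrix such as BLOSUM62 would replace the flat match/mismatch scores.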
4. Choose the models. Find the best-performing classification/regression/generation models to develop and compare.
- do not screen models that have been shown to underperform most modern ML/DL models
@@ -81,6 +87,8 @@
- you can use models pre-trained on more abundant data (transfer learning)
- prioritize interpretable models over black-box ones
+
+
5. Build and optimize the models. Check model performance with default parameters, then optimize hyperparameters and architecture.
- choose the logic of train/test split (random, stratified, rule-based etc.)
@@ -89,6 +97,8 @@
- make a list of architectures you want to test (for neural networks)
- choose a method for hyperparameter tuning (Optuna, grid search, etc.)
+
+
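
The stratified split option above can be sketched with the standard library alone; the 80/20 ratio and the toy labels are illustrative assumptions:

```python
import random
from collections import defaultdict

# Minimal stratified train/test split: each label keeps roughly the
# same proportion in both subsets.
def stratified_split(items, labels, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_frac))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

seqs = [f"seq{i}" for i in range(10)]
labels = [1] * 5 + [0] * 5        # 5 CPPs, 5 non-CPPs
train, test = stratified_split(seqs, labels)
print(len(train), len(test))  # -> 8 2
```

In practice a library splitter (e.g. scikit-learn's `train_test_split` with `stratify=`) would be used; this sketch only shows the logic.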
6. Choose the best-performing model. Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness.
- analyze model performance (use appropriate classification/regression metrics or loss functions)
@@ -97,6 +107,8 @@
- analyze model speed
- choose the best model according to these parameters
+
+
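
For the metric-based comparison above, the standard classification metrics can be computed by hand; the toy label/prediction vectors below are illustrative:

```python
# Hand-rolled precision, recall, and F1 for binary classification.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]   # ground truth: 3 CPPs, 3 non-CPPs
y_pred = [1, 1, 0, 1, 0, 0]   # one false negative, one false positive
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))  # -> 0.667 0.667 0.667
```

For imbalanced CPP/non-CPP data, F1 (or MCC) is usually more informative than raw accuracy.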
7. Validate your model. Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies.
- choose the methods for additional model validation (computational models, benchmarked ML/DL models, hybrid approaches)
@@ -104,6 +116,8 @@
- analyze how well these methods explain the obtained results
- generate novel CPPs for validation
+
+
8. Formalize the results. Create a repo, systematize analysis results, submit the code, ensure the code is reproducible, usable, and readable.
- create a GitHub repository structure
@@ -149,7 +163,7 @@
- uptake type,
- sequence.
-
+
2. Natural CPPs
@@ -173,7 +187,7 @@
Represents a balanced dataset of CPPs and non-CPPs; often used for model benchmarking.
-
+
3. Non-CPPs
@@ -188,7 +202,7 @@
Contains non-CPP sequences shown not to demonstrate activity experimentally.
-
+
4. Non-Natural CPPs
@@ -219,7 +233,7 @@
-
+
Modelling of interaction with membrane
@@ -235,7 +249,7 @@
-
+
Membrane permeability prediction