Disease candidate genes prediction using positive labeled and unlabeled instances
BMC Medical Genomics volume 18, Article number: 73 (2025)
Abstract
Identifying disease genes and understanding their function is a critical step in developing drugs for genetic diseases. Beyond laboratory approaches, computational approaches such as machine learning are increasingly used for disease gene identification. In machine learning settings, researchers have only two types of data (known disease genes and unknown genes) with which to predict disease candidate genes; notably, there is no source of negative examples. The proposed method has two steps. The first step extracts reliable negative genes from the set of unlabeled genes using one-class learning and a filter based on distance from known disease genes; this step is performed separately for each disease. The second step learns a binary model using the causative genes of each disease as the positive training set and the reliable negative genes extracted for that disease. In the prediction and ranking step, each unlabeled gene is assigned a normalized score using two filters and the learned model; disease genes are then predicted and ranked. Evaluation of the proposed method on six diseases and the Cancer disease class shows better results than previous studies.
Introduction
Genes are the carriers of inherited and genetic disorders, which can pass to future generations; they can also remain hidden and be revealed later. Hence, treating or preventing genetic disease has long been a challenge for physicians and health researchers. Predicting disease genes and understanding their mechanisms is therefore the first critical step in pharmacology and medicine for treatment and prevention. Today, new studies have significantly advanced the search for the molecular basis of disease in order to prevent, diagnose, and treat genetic diseases.
The utilization of machine learning methods to solve various problems has shown promising performance compared to traditional and experimental methods. In particular, machine learning techniques in medicine have attracted significant attention. Experimental and laboratory-based methods for solving medical problems are often cost-intensive and time-consuming, which has led to a growing interest in computational methods, including machine learning. Furthermore, while some genes are classified as non-disease genes, they may be identified as disease-related in different contexts. This complexity has made it difficult to definitively classify non-disease genes, as knowledge in this area remains limited. However, recent studies have shown that some human genes play a role in diseases and can be valuable for predicting disease-related genes using machine learning methods.
In predicting and ranking disease genes with machine learning, the known disease genes are treated as the positive data set and the unknown genes as the unlabeled set. The goal is to predict and rank (by score) the genes causing a disease among the unknown genes, using that disease's known genes. Given the nature of the data, one of the most suitable approaches to this problem is Positive Unlabeled Learning (PU-Learning) [1]. PU-Learning is a semi-supervised method for binary classes with positively labeled and unlabeled samples; the absence of negatively labeled samples distinguishes it from other types of learning. The available data are of two types: (i) a data set of positively labeled samples; (ii) an unlabeled data set whose members may be disease-causing (positive) or non-disease (negative). Studies that address this problem with the PU-Learning approach fall into two general categories: 1) identifying negative samples; 2) not identifying negative samples.
In the identifying-negative-samples approach, negative (non-disease) genes are first selected from the unlabeled genes. Next, binary models are learned separately for each disease using a data set containing the genes causing that disease (labeled positive) and the non-disease genes (labeled negative). Selecting reliable negative genes is the main challenge of this strategy: the more reliable they are, the more accurate the learning in the next step. In the not-identifying-negative-samples approach, a one-class model is learned using only positive samples. This method is useful when the number of positive samples is ample.
Moreover, its efficiency is very low when positive samples are insufficient [2]. Alternatively, all unlabeled genes can be treated as negative samples; the problem then becomes an unbalanced binary classification, and binary models are learned. Since the unlabeled gene set contains both potential negative and positive samples, this method has a high error rate, and its use has recently declined [3].
The extraction of reliable negative genes in the proposed method proceeds as follows: in the one-class learning step, negative genes are extracted separately for each disease. Then the negative genes farthest from the known disease genes are selected; designing the extraction in this way increases the trust in the extracted negative genes. In the binary-model learning step, disease genes are selected separately for each disease based on the proposed method's scoring system, which uses the Score-Relevance indicator. The score of each disease gene is normalized by the scoring system, and whether a disease gene is selected as positive training data is decided from its score. A binary disease model is then learned with the Support Vector Machine (SVM) algorithm. In the prediction and ranking step for unlabeled genes, two further filters are applied after the learned binary model has labeled each sample. These filters are based on: 1) each gene's distance from the support vectors; 2) the gene's closeness to disease genes. A normalized score is assigned to each gene from the distance of every unlabeled gene to the disease binary model's support vectors; another score is assigned from the Score-Relevance value of every unlabeled gene. A single score per gene is then obtained by combining these scores, and the decision is made whether the gene is a candidate for the disease; if it is, its rank is also determined. Compared to the best previously proposed method, the evaluation outcomes are as follows:
The recall measure for the Adrenal, Colon, Lung, Prostate, and Heart Failure diseases and the Cancer disease class increased by 0.53%, 5.32%, 1.29%, 3.33%, 4.04%, and 3.11%, respectively. The precision measure increased by 2.64%, 2.14%, 1.75%, 3.14%, 3.13%, and 2.38%, respectively. The AUC measure for Neurological disease increased by 8.82% compared to other studies.
Basic concepts
Gene expression profile (GEP)
Gene expression data provide valuable information about cellular state, biological networks, and gene function. The genetic code is stored in DNA strands and is interpreted through gene expression. Determining how genes are expressed in healthy and diseased cells is one goal of gene expression analysis. Scientists use DNA microarrays (biochips) to measure expression levels; a set of gene expression samples is the result of such an experiment. Every row of the gene expression matrix is the expression profile of the corresponding gene. This study uses time series of gene expression profiles, which state the gene's expression level at defined time points.
Similarity-based communication principle
The similarity-based communication principle is used in most methods for predicting disease candidate genes. It states that the greater the physical and functional similarity of genes, the greater the probability that they play roles in the same diseases. Closeness to the disease genes can therefore be used as a rank.
Score relevance
The Score-Relevance (S-R) score of each gene can be viewed as a measure of that gene's role in the formation of a specific disease. These scores are based on the co-occurrence of two terms in a Medline document. The score is calculated with a formula (based on the Boolean model) for finding co-occurring documents and their degree of conformity. The formula uses the concepts of Term Frequency-Inverse Document Frequency (TF-IDF), the vector space model, the coordination factor, and field-length normalization [4].
The number of documents in which the two terms appear together, and the numbers of documents in which each term appears independently, are compared with the expected counts under the hypergeometric distribution. The more the joint occurrences exceed the expected amount, the less likely this is to be random, and the higher the score [5]. These scores are not meaningful in absolute terms; they are only meaningful as an ordering within each disease's related-gene list. Moreover, their absolute values may vary from one version to another.
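The hypergeometric over-representation test described above can be sketched as follows. This is an illustrative calculation only, not the exact S-R formula of [4, 5]; the function name and arguments are hypothetical:

```python
from math import comb

def cooccurrence_pvalue(n_both, n_gene, n_disease, n_total):
    """P(X >= n_both) under a hypergeometric null: of n_total documents,
    n_disease mention the disease term and n_gene mention the gene term;
    X counts how many documents mention both. A small tail probability
    means the observed co-occurrence exceeds the expected amount."""
    total = comb(n_total, n_gene)
    upper = min(n_gene, n_disease)
    return sum(comb(n_disease, k) * comb(n_total - n_disease, n_gene - k)
               for k in range(n_both, upper + 1)) / total
```

A smaller tail probability would then translate into a larger relevance score for the gene-disease pair.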
Research history
The previous studies regarding disease candidate gene prediction are introduced in two groups.
Identifying negative samples approach
Yousef and Moghadam [6] used proteins' amino acid sequences to predict and rank disease genes. They constructed four different feature vectors from the amino acid sequences and used cosine distance to extract reliable negative genes. A model was then learned separately for each feature vector; the results of all categories were integrated to give the final result.
VasighiZaker and Jalili [7] presented the C-PUGP method. The positive samples are first clustered; next, a one-class model is learned with the OCSVM algorithm for every cluster. The unlabeled samples are labeled using the learned models, and an unlabeled gene labeled negative by all one-class models is considered a reliable negative sample. Finally, an SVM binary model is learned from the obtained negative samples and the initial positive samples. Many early studies considered all unlabeled genes as negative samples and learned a binary model; since the unlabeled gene set contains both negative and positive samples, this approach has a high error rate. Smalter et al. [8] predicted disease candidate genes using a protein-protein interaction dataset and an SVM binary model. Radivojac et al. [9] used three different datasets (protein sequences, protein function information, and the PPI network) and learned an SVM binary model for each; disease candidate genes were identified from the results of these three binary models.
Not identifying negative samples approach
Learning is carried out with positive samples only in this approach; its efficiency is very low when positive samples are insufficient [2]. Yousef and Moghadam [10] identified disease genes using an SVDD one-class model trained only on the sequences of disease genes. The feature vector is generated by converting protein sequences to numerical vectors via their physicochemical properties; Principal Component Analysis (PCA) then reduces the feature dimensions to find the critical features. The disease genes (positive samples) are learned with the SVDD one-class model, and unlabeled samples are predicted with the learned model. In the method of VasighiZaker and Jalili [11], all disease genes are first taken as the positive set and normalized with the Min-Max method; the number of features is then reduced with PCA, and an OCSVM one-class model is learned. After the optimal parameters are found, the unlabeled genes are labeled. Nikdel and Jalili [12] clustered disease genes based on a matrix built by measuring semantic similarity among disease types using the gene ontology. A Hidden Markov Model (HMM) is then learned for each cluster, and a threshold is calculated separately for each cluster. Each unlabeled gene is given to all the learned HMMs of a disease, and its label is determined from the probability returned by each HMM and the per-cluster thresholds: if at least one HMM considers the unlabeled gene a disease candidate, the gene receives a positive label. After normalizing gene expression data, Vasighizaker et al. [13] used a one-class support vector machine with a linear kernel to predict disease genes in Acute Myeloid Leukemia (AML) cancer.
The proposed method
A scoring-based method using an SVM binary model is introduced to solve the prediction and ranking problem for disease candidate genes; the method scores the factors that are effective in predicting and ranking them. The main aim is to predict and rank disease candidate genes from an unlabeled gene set: the higher a gene's priority, the more likely it belongs to the disease candidate group. Unlabeled genes are human genome genes that do not belong to the disease genes. Notably, gene expression is measured in various laboratories, so a gene may have more than one expression profile; all calculations are therefore carried out separately for each profile of a gene.
The proposed S-PUL method has four steps: 1) data normalization; 2) reliable negative gene extraction; 3) disease binary model learning; 4) disease candidate gene prediction and ranking (see Fig. 1). In the first step, the gene expression data are normalized. In the second step, reliable negative genes are extracted from the unlabeled samples separately for every disease. In the third step, a binary model is learned separately for every disease using its positive samples (disease genes). In the fourth step, the reliable negative genes are removed from the unlabeled genes (U), and the remaining unlabeled gene set (RUi) is given to the disease binary model for label prediction.
The term "S-PUL" stands for Scored-Positive Unlabeled Learning. It is a combination of two used methods: Positive Unlabeled Learning (PUL) and a Scoring system. The scoring aspect refers to the integration of a scoring system within the Support Vector Machine (SVM) algorithm. This hybrid approach leverages the strengths of both techniques to enhance the learning process.
Data normalization step
Each gene's time expression range differs, and the differences are large. All data are therefore normalized separately for the two datasets (disease and unlabeled genes). The normalization is carried out using Eq. 1, where Xmax and Xmin denote the highest and lowest values of every gene's time expression, respectively.
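A minimal sketch of the Eq. 1 min-max normalization, applied per gene profile. The function name is hypothetical, and a constant profile is mapped to zero by assumption, since Eq. 1 is undefined when Xmax equals Xmin:

```python
def minmax_normalize(profile):
    """Scale a gene's time-series expression profile to [0, 1],
    following Eq. 1: (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(profile), max(profile)
    if x_max == x_min:                 # constant profile: Eq. 1 undefined
        return [0.0 for _ in profile]  # assumption: map to zero
    return [(x - x_min) / (x_max - x_min) for x in profile]
```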
Reliable negative genes selection step
Learning a disease binary model requires, in addition to the disease gene set (as positive samples), a reliable negative gene set (as negative samples). Evidently, the accuracy with which the disease binary model predicts unlabeled genes (as disease genes) increases with the degree of trust in the negative genes identified among the unlabeled genes. Figure 2 illustrates the reliable negative gene extraction process for each disease class.
In the first action (the Action 1 algorithm), the Robust Gaussian, KNN, Parzen window, and SVDD one-class classification algorithms are used to learn a model of the positive samples separately for each disease class, and the genes of the other disease classes (after eliminating common genes) are used as test data. After learning a disease model, the other diseases' genes are expected to play the role of negative data; hence, the evaluation indicator for selecting the best learning algorithm is the percentage of correctly recognized negative samples. The learned one-class algorithm with the highest percentage of correct negative samples is selected as the best one-class model of the i-th disease. In Action 2, the unlabeled genes are given to the best one-class model as input and are labeled; the outcome of this step is a set of negative genes. Finally, reliable negative genes are selected from this set in the third step (the Filter 1 algorithm). The shortest Euclidean distance of every negative gene from the corresponding disease genes is calculated: if a disease gene expression profile (Ne) from the NDi set is \(Ne=\{{d}_{1}{,d}_{2},{d}_{3},\dots ,{d}_{m}\}\) and a negative gene expression profile is \({Ng}_{i}=\{{n}_{1},{n}_{2},{n}_{3},\dots ,{n}_{m}\}\), the Euclidean distance is calculated using Eq. 2, and the minimum distance of every negative gene from the corresponding disease genes using Eq. 3. The genes farthest from the disease genes are selected as the reliable negative genes of disease i (RNDi).

Action1 Algorithm (Learning one-class model of i-th Diseases)
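The distance-based selection of Filter 1 can be sketched as follows, assuming expression profiles are plain numeric vectors of equal length; the function and variable names are hypothetical:

```python
from math import dist  # Euclidean distance between two profiles (Eq. 2)

def select_reliable_negatives(negatives, disease_genes, n_reliable):
    """Filter 1 sketch: for each candidate negative profile, compute its
    minimum Euclidean distance to the disease gene profiles (Eq. 3), then
    keep the n_reliable profiles farthest from the disease gene set."""
    scored = [(min(dist(ng, ne) for ne in disease_genes), ng)
              for ng in negatives]
    scored.sort(key=lambda t: t[0], reverse=True)  # farthest first
    return [ng for _, ng in scored[:n_reliable]]
```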
Learning step of the disease binary model
The prediction and ranking problem of disease candidate genes is solved through binary model learning; Fig. 3 shows the learning process of the disease binary model. Selecting the positive training data from each disease's gene set is another challenge of this study.
It is worth noting that genes play roles of differing degrees in the onset of disease. The reliability of the learning results improves when the genes used as training data have higher corresponding S-R (Score-Relevance) values. In this study, the disease genes for the positive training set are selected using S-R; the S-R value of every disease gene (separately for each disease) is available in [4].
Positive genes selection (Filter 2)
Positive genes of each disease are selected in four steps. The process is described step by step below and presented formally as the "Filter 2 algorithm".
In the first step, the genes of the disease class are categorized by their S-R values (separately for each disease). A gene with a higher S-R value falls into a higher category and thus obtains a higher score. Categories with equal intervals of ten units are created, so the first category covers the range [0, 10) and has the lowest value; each gene belongs to exactly one category of length 10. Determining this interval length is one challenge of this study: the distribution of disease gene counts over S-R values is not uniform, and the interval must be chosen so that it does not lead to over-elimination of genes. A length of 10 proved reasonable across all diseases; it was obtained by trial and error in this study and could be determined more precisely in future work.
In the second step, every category receives a portion of 100 points according to its score; in other words, the highest category obtains the highest percentage. The category score of the i-th category, denoted \({NGr}_{i}\) and normalized to a base of 100, is calculated by Eq. 4.
The category scores of all genes belonging to the disease are stored in the Gr set. In Eq. 4, Max(|Gr|) is the highest category score of a gene belonging to the disease class, and S-Ri is the S-R value of the i-th gene of the Gr set.
In the third step, the final score of the i-th gene, \(F\_Score_i\), is calculated by Eq. 5.
The mean of the score range, \(\overline{IL}\), is calculated with Eq. 6 (separately for every disease), and the genes whose final score exceeds this mean are selected as positive training data. In Eq. 6, the final scores of all genes form the {F_Score} set.

Filter2 Algorithm (Positive genes selection)
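The scoring steps above can be sketched as follows. Eqs. 4-6 are not reproduced in this text, so the sketch makes stand-in assumptions: the category score is taken as the bin index, the final score as the category score rescaled to a base of 100, and the selection threshold as the mean final score. The function name is hypothetical:

```python
def select_positive_genes(sr_values, bin_width=10):
    """Filter 2 sketch: bin S-R values into width-10 categories
    ([0,10), [10,20), ...), rescale category scores to a base of 100
    (Eq. 4 stand-in), take that as the final score (Eq. 5 stand-in),
    and keep genes scoring above the mean (Eq. 6 stand-in).
    Returns the indices of the selected genes."""
    categories = [sr // bin_width for sr in sr_values]
    max_cat = max(categories)
    f_scores = [100.0 * c / max_cat if max_cat else 0.0 for c in categories]
    mean_score = sum(f_scores) / len(f_scores)
    return [i for i, s in enumerate(f_scores) if s > mean_score]
```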
Noise reduction (Filter 3)
Filter 3 is an optimization step in the proposed method designed to eliminate low-significance genes and reduce noise in the data. Specifically, this filter removes genes that received a negative label from the SVM binary model during the learning phase, and have S-R values in the lowest scoring range ([0, 10)).
The primary goal of this filter is to focus the learning process on genes that are more likely associated with the disease, while excluding genes that have the least impact on disease formation. By doing so, the learning process is refined, and it is expected that the prediction accuracy for disease candidate genes will improve.
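A minimal sketch of Filter 3 under the two stated conditions (a negative label from the binary model and an S-R value in the lowest bin [0, 10)); the function and variable names are hypothetical:

```python
def filter3(genes, svm_labels, sr_values):
    """Filter 3 sketch: drop genes that both received a negative label
    from the SVM binary model and have an S-R value in the lowest
    scoring range [0, 10); keep all other genes."""
    return [g for g, lbl, sr in zip(genes, svm_labels, sr_values)
            if not (lbl < 0 and 0 <= sr < 10)]
```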
Binary model learning
In Action 3 of Fig. 3, binary learning is performed with binary learning algorithms using the selected positive training genes (PDi) from the i-th disease's genes and the reliable negative genes (RNDi) from the unlabeled genes. The algorithm that obtains the highest recall value across all diseases is selected.
Disease candidate genes prediction and ranking step
After learning and selecting the best binary learning algorithm (SVM) with the best learning parameters, the remaining unlabeled gene set (i.e., the unlabeled genes minus the negative genes extracted in the reliable negative gene selection step) is given to the disease binary model as test data. A scoring algorithm is also used in the disease candidate prediction and ranking step, as illustrated in Fig. 4. Two factors are critical in the scoring algorithm: 1) the distance of every unlabeled gene from the disease genes; 2) the distance of every unlabeled gene from the support vectors of the i-th disease model. Genes are given a score based on each factor; the final score of a gene is obtained by multiplying the two scores. Eventually, prediction and ranking are carried out according to the final score.
Action 4: Identifying the valuable genes
The unlabeled genes given to disease i (i.e., the unlabeled gene set with the extracted reliable negative genes of the i-th disease removed; RUi denotes this set) are labeled with the i-th disease's learned model and stored in the DS1 set. Suppose the expression profile of a disease gene (Ne) from the NDi set is \(Ne=\{{d}_{1}{,d}_{2},\dots ,{d}_{m}\}\), and the expression profile of an unlabeled gene (Ru) from the RUi set is \(Ru=\{{u}_{1},{u}_{2},\dots ,{u}_{m}\}\). The closest disease-i gene (Ne) to each studied expression profile Ru of the RUi dataset is identified using Eqs. 2 and 3 (in terms of Euclidean distance) and stored in the DS2 set. To preserve the valuable genes, negatively labeled genes whose corresponding S-R values in the DS2 dataset fall in the first category (the least valuable category) are eliminated from DS1 (separately for each profile). The remaining genes are stored in the VRUi dataset; these are the negatively labeled genes with high S-R values and the positively labeled genes, which are the valuable genes.
Action 5: Prediction and ranking of disease candidate genes
The \(F\_Score\) value of the nearest disease gene is attributed to each studied gene profile Ru of the VRUi dataset; the nearest disease gene to each profile was identified in the "Reliable negative genes selection step" section and kept in the DS2 dataset. The label given to each studied gene Ru of the VRUi set is also kept. Since a gene may have many profiles, the score of a gene is the algebraic sum of the scores of its profiles. The output of this step is the DS3 dataset, which contains all the valuable genes passed from VRUi to this step, along with the second score of each gene (\({DP\_Score}_{i}\)). It is worth noting that the confidence that a sample belongs to the i-th disease class increases with the distance of the tested sample from the support vectors of the i-th disease model, and decreases as that distance shrinks. Consequently, in calculating the third score of each gene (\({DS\_Score}_{i}\)), the score increases as the studied gene (in VRUi) moves away from the support vectors of disease i. The third score is calculated in three steps.
In the first step, the \({Gr}_{sv}\) parameter of the i-th gene is set to \(\lfloor{DS}_{i}\rfloor+1\) for positively labeled genes and \(\lfloor{DS}_{i}\rfloor\) for negatively labeled ones, where \({Gr}_{sv}\) is the score of the category containing the i-th gene and \({DS}_{i}\) is the distance of the i-th gene from the support vectors.
In the second step, Eq. 7 is used to calculate \({NGr}_{svi}\), the category score of the i-th gene (\({Gr}_{svi}\)) normalized to a base of 100. The category scores of all genes belonging to the disease are in the \(\{{Gr}_{sv}\}\) set.
In the third step, the final score corresponding to the i-th gene, \({DS\_Score}_{i}\), is calculated by Eq. 8.
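The three steps above can be sketched as follows. Eq. 8 is not reproduced in the text, so the rescaled category score is used directly as DS_Score, which is an assumption; the function and variable names are hypothetical:

```python
from math import floor

def ds_scores(distances, labels):
    """Sketch of the support-vector score: Gr_sv is floor(d) + 1 for
    positively labeled genes and floor(d) for negative ones; the category
    scores are then rescaled to a base of 100 (Eq. 7). Assumption: the
    rescaled value is used directly as DS_Score in place of Eq. 8."""
    gr_sv = [floor(d) + 1 if lbl > 0 else floor(d)
             for d, lbl in zip(distances, labels)]
    max_gr = max(gr_sv)
    return [100.0 * g / max_gr if max_gr else 0.0 for g in gr_sv]
```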
The second and third scores are used together to predict and rank disease candidate genes. Since each gene may have several profiles in the unlabeled gene dataset, each gene obtains one score per profile, and the final score of the gene is the algebraic sum over its profiles.
The final score of the studied gene (\({Final\_Score}_{i}\)) is calculated with Eq. 9 as the algebraic sum of the gene's profile scores (\({DP\_Score}_{i}\) and \({DS\_Score}_{i}\) for each profile of that gene); the prediction of disease candidate genes is carried out based on each gene's score. In Eq. 9, m denotes the number of gene profiles.
Finally, genes whose \(Final\_Score\) values are negative are eliminated; the other genes are predicted as disease candidate genes and ranked by their final scores.
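The final scoring and ranking can be sketched as follows. Eq. 9 is not reproduced here, and the text describes both multiplying the two scores and summing algebraically over profiles; this sketch therefore assumes each profile contributes the product of its DP_Score and DS_Score, with the gene's final score being the sum over its profiles. The names are hypothetical:

```python
def rank_candidates(profile_scores):
    """Final-ranking sketch: profile_scores maps each gene to a list of
    (DP_Score, DS_Score) pairs, one per profile. Assumption: a profile
    contributes the product of its two scores, and the gene's final score
    is the algebraic sum over profiles (Eq. 9 stand-in). Genes with a
    negative final score are dropped; the rest are ranked descending."""
    finals = {gene: sum(dp * ds for dp, ds in scores)
              for gene, scores in profile_scores.items()}
    return sorted(((g, s) for g, s in finals.items() if s >= 0),
                  key=lambda t: t[1], reverse=True)
```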
Results
The efficiency of the S-PUL method is evaluated in six versions, denoted S-PUL_Vn, in this section. Each S-PUL version and the filters it uses are reported in Table 1. It is worth noting that S-PUL_V5 is the full proposed S-PUL method, which uses all filters.
The results of each S-PUL version are compared with previous studies. Finally, the method's efficiency is evaluated separately on disease genes newly identified in 2016 and 2020. MATLAB (release 2019) is used for binary classification learning and calculation, and the dd-tools library for learning one-class models. All evaluations are carried out on a computer with an Intel Core i5 processor and 32 GB of main memory running Windows 10 Pro.
Dataset
The genes used in the learning and testing phases are extracted from the dataset of Yang et al. (2014) [14] (second row of Table 2). The Cancer disease class dataset has 210 genes, which are common among the three diseases colon, prostate, and lung (the number of disease genes is provided in Table 8); the unlabeled gene dataset has 12,001 genes. GeneCards [4] (third and fourth rows of Table 2) supplies the disease gene datasets from 2015 to 2020. The characteristics of the disease genes are presented in Table 2 separately for each disease and period. Notably, a disease gene may cause several diseases.
Evaluation measures
The accuracy, precision, recall, F1, and AUC measures (area under the ROC curve, plotting TPR against FPR) are used to evaluate the S-PUL method; they are defined in Table 3. In these equations, TP is the number of positive samples categorized correctly; TN is the number of negative samples categorized correctly; FP is the number of negative samples incorrectly categorized as positive (false positives); FN is the number of positive samples incorrectly categorized as negative (false negatives); TPR is the true positive rate; and FPR is the false positive rate. This study considers disease genes as positive samples and extracts negative samples from unlabeled genes. All evaluations use k-fold cross-validation with k = 10.
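The Table 3 measures follow directly from the confusion-matrix counts; a minimal sketch (the function name is hypothetical):

```python
def evaluation_measures(tp, tn, fp, fn):
    """Precision, recall, F1, and accuracy from the confusion-matrix
    counts defined in Table 3."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy
```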
The evaluation of extracted reliable negative genes
The extraction of reliable negative genes is carried out in two steps: extraction of negative genes with a one-class learning algorithm, and selection of reliable negative genes with a distance measure. The quality of the extraction is evaluated at each step.
Selecting the one-class learning algorithm
To select the one-class learning algorithm, negative genes are initially extracted separately for each disease with each of the SVDD, Robust Gaussian, KNN, and Parzen Window one-class classification algorithms (the first and second steps of the reliable negative gene selection step). Each algorithm is briefly introduced below together with its parameter, with references for detailed explanations.
Support Vector Data Description (SVDD) is a machine learning algorithm used for anomaly detection and classification. It constructs a hypersphere in the feature space that encompasses the training data. The parameter used in this algorithm is the width parameter of the RBF kernel [15].
Robust Gaussian is an algorithm that models data distribution as a Gaussian distribution and employs robust statistics to handle outliers. The parameter for this algorithm is the error tolerance on the mean and covariance matrix [16].
K-Nearest Neighbors (KNN) is an instance-based algorithm that classifies data based on the distances to the k-nearest neighbors. The parameter for this algorithm is the number of neighbors [17].
Parzen Window is a non-parametric method for estimating the probability density function of a dataset using kernel functions. The parameter for this algorithm is the width parameter [18].
Then, each one-class learning algorithm's efficiency is examined through two evaluation methods.
The first evaluation method
The percentage of correct negative samples (%TN) is the evaluation measure in the first method. The selected parameters of each one-class learning algorithm are presented in Table 4; the error on the target class (the fracrej parameter) is set to 0.1 for all one-class learning algorithms. The efficiency results of each one-class learning algorithm are reported in Table 5.
The one-class learning algorithm with the highest efficiency is SVDD, which labeled the highest percentage of negative samples across all disease types.
The second evaluation method
In this method, the S-PUL_V5 learning method is trained for each disease using its positive disease genes together with the reliable negative gene set extracted by each one-class learning algorithm (reliable negative genes selection step), and its efficiency is measured. The results of this evaluation are illustrated in Fig. 5 (a to f) for each disease separately. It is worth noting that the number of selected reliable negative genes for each disease equals the number of its disease genes, which prevents the class-imbalance problem between positive and negative data.
According to Fig. 5, the S-PUL_V5 method achieves its highest efficiency when the reliable negative genes are extracted by SVDD (compared to the other three one-class learning algorithms).
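The balancing rule described above (keep exactly as many reliable negatives as there are disease genes) can be combined with a distance filter as sketched below. The centroid-distance indicator here is a stand-in assumption; the paper's actual distance indicator from known disease genes is defined in its methods:

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.normal(0.0, 1.0, size=(40, 5))        # disease genes of one disease
rejected = rng.normal(2.0, 1.5, size=(300, 5))  # unlabeled genes rejected by the one-class model

# distance filter: prefer rejected genes farthest from the known disease genes
centroid = pos.mean(axis=0)
dist = np.linalg.norm(rejected - centroid, axis=1)
order = np.argsort(dist)[::-1]

# keep exactly as many reliable negatives as positives (balanced classes)
reliable_neg = rejected[order[: len(pos)]]
print(reliable_neg.shape)
```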
Measuring the trust degree in the extracted negative genes
According to Eq. 21, the trust degree in the negative genes extracted by the SVDD algorithm is computed for each disease separately (see Table 6); it shows that the reliability of the negative genes extracted by SVDD is high.
Evaluation of the binary classification algorithms' performance and selection of the disease genes
Table 7 reports the parameter settings for each learning algorithm and disease separately. Moreover, Table 8 presents the disease gene information used in learning the binary models for each disease.
The efficiency evaluation results of the five binary classification algorithms are illustrated in Fig. 6 for both the filtered and unfiltered status of disease genes.
Figure 6 indicates the evaluation results. The recall measure increases with S-PUL_V1 compared to S-PUL_V0, and the precision and, consequently, the F1 measures also increase without harming recall. Hence, the disease gene filtering method is adopted in the S-PUL method. Furthermore, the SVM binary model learning algorithm is more efficient than the other algorithms (see Fig. 6), so SVM is adopted as the binary classification method in S-PUL. Table 9 lists the parameter values used in the SVM learning algorithm; when the kernel is a quadratic function, the \(\gamma\) parameter is set to one divided by the number of features (1/number of features).
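This SVM configuration can be reproduced in scikit-learn, where a quadratic kernel is `kernel='poly'` with `degree=2`, and `gamma='auto'` is exactly 1/number-of-features as stated above. The synthetic data and the evaluation on the training set are purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(2)
X_pos = rng.normal(0.0, 1.0, size=(60, 8))  # filtered positive (disease) genes
X_neg = rng.normal(3.0, 1.0, size=(60, 8))  # reliable negative genes
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 60 + [0] * 60)

# quadratic kernel; gamma='auto' means gamma = 1 / n_features
clf = SVC(kernel="poly", degree=2, gamma="auto").fit(X, y)

pred = clf.predict(X)
print(precision_score(y, pred), recall_score(y, pred), f1_score(y, pred))
```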
The evaluation of disease candidate genes prediction and ranking
This section examines the efficiency of disease candidate gene prediction and ranking, evaluating the implementation of filter 3 and the use of the second and third scores separately.
The evaluation of the efficiency of selecting valuable genes (filter 3)
This section assesses the elimination of genes that both receive a negative label from the SVM binary learning algorithm and have an S-R in the first category (the [0,10) range). The statistics of the eliminated genes, by their S-R range, are presented in Table 10 for each disease.
Figure 7 illustrates the evaluation results of the S-PUL_V2 version (with filter 3 implemented) compared to the S-PUL_V1 version (without filter 3), to assess the value of implementing filter 3.
According to Fig. 7, implementing filter 3 increases the recall measure of S-PUL_V2 for all diseases, whereas without filter 3 the recall measure decreases for all diseases. The highest and lowest recall increases in S-PUL_V2 are for Lung disease (7.40%) and Colon disease (0.67%), respectively. Therefore, filter 3 is used in the S-PUL method.
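Filter 3 as described, dropping genes that are both SVM-negative and in the lowest S-R bin [0, 10), reduces to a simple boolean mask; the labels, scores, and variable names below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
svm_label = rng.choice([0, 1], size=n)  # 0 = negative label from the binary SVM
s_r = rng.uniform(0, 100, size=n)       # S-R relevance score per gene

# filter 3: drop genes that are BOTH SVM-negative and in the [0, 10) S-R bin
keep = ~((svm_label == 0) & (s_r < 10))
kept = np.flatnonzero(keep)
print(n - keep.sum(), "genes eliminated")
```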
The efficiency evaluation of utilizing the second score of the VRU set genes
Every gene has several gene expression profiles. Thus, in S-PUL_V1, a gene is considered a disease candidate gene if at least one of its gene expression profiles obtains a positive label; consequently, the number of spurious positive samples is very high. A method for reducing the number of spurious positive samples is therefore presented in the "Action5- The prediction and ranking of disease candidate genes" section.
Figure 8 illustrates the efficiency of the S-PUL_V3 version in disease candidate gene ranking using the second score ("Action5- The prediction and ranking of disease candidate genes" section). According to these figures, the precision measure improves in the V3 version while recall is maintained. Moreover, Table 11 reports statistical information on the second score implementation in the "Action5- The prediction and ranking of disease candidate genes" section for each disease, along with the number of unlabeled genes introduced as disease genes in this step.
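The motivation for the second score can be illustrated with a toy aggregation over per-profile labels; the majority rule below is only a stand-in for the paper's actual second score, used to show how "any positive profile" over-predicts compared to a stricter aggregation:

```python
from collections import defaultdict

# (gene, profile_label) pairs: each gene has several expression profiles
profile_labels = [
    ("g1", 1), ("g1", 0), ("g1", 0),
    ("g2", 1), ("g2", 1), ("g2", 1),
    ("g3", 0), ("g3", 0),
]

votes = defaultdict(list)
for gene, lab in profile_labels:
    votes[gene].append(lab)

# V1 rule: positive if ANY profile is positive (many spurious positives)
v1 = {g: int(any(ls)) for g, ls in votes.items()}
# second-score-style rule (illustrative): require a majority of positive profiles
v3 = {g: int(sum(ls) / len(ls) > 0.5) for g, ls in votes.items()}
print(v1, v3)
```

Here g1, with a single positive profile out of three, is flagged under the V1 rule but dropped under the stricter aggregation.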
The efficiency evaluation of using the third score of VRU set genes
Another measure (the third score) is used in the "Action5- The prediction and ranking of disease candidate genes" section to reduce the number of spurious positive samples; this measure is calculated from the distance of the unlabeled gene to the support vector.
Figure 8 demonstrates the results of the S-PUL_V4 efficiency evaluation in disease candidate gene ranking using the third score. According to the figures, all evaluation measures increase in the V4 version. The highest and lowest increases in the recall measure are for Adrenal disease (2.52%) and Lung disease (0.19%), respectively; the highest and lowest increases in the precision measure are for Adrenal disease (17%) and Colon disease (4.02%). Based on these results, using the third score in the V4 version (compared to the V2 version) dramatically increases precision while maintaining recall. Additionally, Table 12 reports the statistical information of the third score implementation in the "Action5- The prediction and ranking of disease candidate genes" section for each disease, along with the number of unlabeled genes introduced as disease genes in this step.
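A third-score-style ranking by distance to the SVM decision boundary can be sketched with scikit-learn's `decision_function`, which returns the signed (unnormalized) distance to the separating surface; the min-max normalization here is an illustrative choice, not the paper's exact formula:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([1] * 50 + [0] * 50)
clf = SVC(kernel="poly", degree=2, gamma="auto").fit(X, y)

unlabeled = rng.normal(1.5, 1.5, size=(30, 4))
margin = clf.decision_function(unlabeled)  # signed distance to the boundary

# third-score-style ranking (illustrative): farther on the positive side = higher score
score3 = (margin - margin.min()) / (margin.max() - margin.min())
ranking = np.argsort(score3)[::-1]
print(ranking[:5])
```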
The evaluation of the S-PUL method efficiency
Figure 8 illustrates the efficiency evaluation of the S-PUL method (introduced in Table 1 as the S-PUL_V5 version) in disease candidate gene ranking, using filters 1 and 2 in the learning step together with filter 3 and both the second and third scores, compared to the V2, V3, and V4 versions. According to the figures, all evaluation measures are enhanced in the V5 version compared to the V2, V3, and V4 versions. The precision measure in Adrenal, Colon, Lung, Prostate, Heart Failure, and Neurological diseases is enhanced by 12.48%, 13.78%, 17.68%, 22.31%, and 5.38%, respectively; in addition, the recall measure for these diseases is increased by 2.52%, 1.47%, 7.6%, 1.84%, 6.11%, and 6.94%, respectively.
Comparing the efficiency of the S-PUL proposed method with other methods
The efficiency of the S-PUL proposed method is compared with previous methods in this section.
Table 13 reports the efficiency of the proposed method compared to the [12] study. The recall measure increases for Colon and Prostate diseases in all versions of the S-PUL method, and for Lung disease in the V5 version. S-PUL is compared only with Reference [12] in Table 13 because these particular diseases were studied exclusively in that reference; in Tables 14, 15, 16 and 17, the diseases are common among various studies, allowing comparisons across multiple references. The precision and recall values are enhanced in the V5 version of the S-PUL method for each disease. The highest and lowest increases in the recall measure are for Colon disease (5.32%) and Lung disease (1.29%), respectively; the highest and lowest increases in the precision measure are for Prostate disease (3.14%) and Lung disease (1.75%).
According to Table 14, the precision and recall values for the Cardiovascular disease class (including Heart Failure disease) in the V5 version of the S-PUL method are increased by 4.04% and 3.13%, respectively, compared to the [12] study. The recall measure in the V3 and V4 versions of the S-PUL method is increased by 0.59% and 2.32%, respectively, compared to the [12] study. Among the previous methods, the highest recall value is reported by the ProDiGe [19] study; this measure is exceeded by 0.25% and 1.97% in the V3 and V5 versions of the S-PUL method, respectively. The F1 measure in all S-PUL versions is higher than in previous studies, except for the [12] and EPU [14] studies.
According to Table 15, the recall value for the Endocrine disease class (including Adrenal disease) increased by 0.53% in the V3, V4, and V5 versions of the S-PUL method compared to the [12] study (which had the best previous efficiency); notably, the recall for the Endocrine disease class reaches 100% in this study. In addition to recall, the precision measure is increased by 2.46% in the V5 version of the S-PUL method. The F1 measure is increased in the V3, V4, and V5 versions compared to the previous studies, except for the [12] study.
The efficiency results of the proposed method for predicting candidate genes of the Cancer disease class (including Colon, Prostate, and Lung diseases) are reported in Table 16 for the V3, V4, and V5 versions of the S-PUL method, compared to previous methods. Based on the evaluation results, all three versions are more efficient than the [12] study (which had the best previous efficiency). The best results belong to the V5 version; compared to the [12] study, its precision, recall, and F1 measures are improved by 2.38%, 3.11%, and 4.75%, respectively.
According to Table 17, the AUC value in the V5 version of the S-PUL method for the Neurological disease class (including Neurological disease) is increased by 8.82% compared to the SFM method [6] (the best previous method). The recall measure in all versions of the S-PUL method exceeds that of previous methods. Among previous methods, the best precision, recall, and F1 values belong to EPU [14], at 78.2%, 80.4%, and 78.6%; these measures reach 84.21%, 100%, and 91.42% in the V5 version of the S-PUL method.
Comparing the efficiency of the S-PUL proposed method with biologists' efficiency
It is worth noting that, between 2015 and 2020, biological researchers identified additional unlabeled genes as disease genes (for the six diseases introduced in Table 2) through laboratory methods; these genes were then introduced into the [4] dataset. To determine efficiency, the disease genes predicted by the S-PUL method and by the [12] study are compared with the disease gene sets introduced by biological researchers for the 2015–2016 and 2017–2020 periods in Tables 18 and 19, respectively. Notably, the 2015–2016 and 2017–2020 sets are reported in the third and fourth rows of Table 2, respectively.
According to Table 18, compared to the [12] study, the V5 version of the S-PUL method predicts the same genes for Adrenal disease, ten more disease genes for Colon disease, and 11 more disease genes for Prostate disease. On the other hand, the V5 version predicted two and one fewer disease genes than the biologists [4] in Prostate and Colon diseases, respectively.
According to Table 19, compared to the biologists, the V4 and V5 versions of the S-PUL method predicted all genes in Adrenal and Neurological diseases. Moreover, the V5 version predicted only one fewer disease gene than the biologists in Colon, Prostate, Lung, and Heart Failure diseases. Hence, according to the learned models, the efficiency of the V5 version of the S-PUL method in predicting disease genes is very good based on the 2014 dataset (introduced in the second row of Table 2).
Conclusion
In this study, reliable negative genes are extracted in two steps to reduce the noise present in negative genes extracted from unlabeled genes: (i) one-class learning and (ii) filtering based on a distance measure. The proposed method first filters the positive training genes in the disease binary model learning step; then an SVM binary model is learned for each disease separately, using the selected positive samples and the extracted reliable negative samples. In the prediction step, the learned binary model predicts (labels) the unlabeled samples and ranks them. Moreover, two filters are used: (i) the nearness of each gene to the disease genes, and (ii) the distance of each gene from the support vector.
Using influential factors to predict and rank disease candidate genes, and employing them properly in the S-PUL method, leads to strong performance compared with previous methods. This claim is supported by a 99.51% average correspondence of the predicted disease genes with the disease genes introduced from 2015 to 2016, and 98.54% for those introduced from 2017 to 2020. Moreover, on average 96.74% of the considered negative genes are absent from the disease genes introduced during these periods, which further supports this claim.
The following propositions are presented for future studies, based on the performed implementation and the advantages and disadvantages of the presented method:
- A) In the S-PUL method, the delimitation of distances and the scoring of genes are carried out discretely, in integer-sized bins. A small (even fractional) shift of a gene located at a bin border can change its category and score in a way that eliminates or retains the gene; this aspect of the method should be improved.
- B) Two steps are used in this study to find reliable negative genes. It is proposed to use other information sources (such as the PPI network) to increase trust in the extracted negative genes.
- C) Two filtering methods based on statistical measures are used in this study to reduce errors in identifying and ranking disease candidate genes. Other genetic factors effective in the formation of a disease could also be considered and incorporated into the final score.
- D) A deep learning approach to PU learning is proposed to improve the results of identifying and predicting disease candidate genes.
Data availability
No datasets were generated or analysed during the current study.
Notes
1. It is one of the most famous free databases worldwide and includes bibliographic research information for the entire medical and biology fields.
2. Scored-Positive Unlabeled Learning.
3. Score Relevance (explained in the "Score relevance" section).
Abbreviations
- AML: Acute Myeloid Leukemia
- AUC: Area Under the Curve
- DNA: Deoxyribonucleic Acid
- DP_Score: Disease Prediction Score
- DS1: Dataset 1
- DS2: Dataset 2
- DS3: Dataset 3
- DS_Score: Disease Score
- F1: F1 Score (Harmonic Mean of Precision and Recall)
- F_Score: Final Score
- FN: False Negative
- FP: False Positive
- FPR: False Positive Rate
- Fracrej: Fraction Rejected
- GEP: Gene Expression Profile
- HMM: Hidden Markov Model
- IL: Interval Length
- KNN: K-Nearest Neighbors
- NGr: Normalized Gene Relevance
- OCSVM: One-Class Support Vector Machine
- PCA: Principal Component Analysis
- PPI: Protein–Protein Interaction
- Precision: Proportion of correctly predicted positive cases
- PU-Learning: Positive-Unlabeled Learning
- RBF: Radial Basis Function
- Recall: Percentage of correctly predicted disease genes
- Recall (TPR): True Positive Rate
- RUi: Remaining Unlabeled Gene Set
- S-PUL: Scored-Positive Unlabeled Learning
- S-R: Score Relevance
- SVM: Support Vector Machine
- SVDD: Support Vector Data Description
- TF-IDF: Term Frequency-Inverse Document Frequency
- TNG: Trust in Negative Genes
- TN: True Negative
- TP: True Positive
- TPR: True Positive Rate
- VRUi: Valuable Remaining Unlabeled Gene Set
References
1. Fusilier DH, et al. Detecting positive and negative deceptive opinions using PU-learning. Inform Process Manage. 2015;51(4):433–43.
2. Shao YH, et al. Laplacian unit-hyperplane learning from positive and unlabeled examples. Inform Sci. 2015;314:152–68.
3. Zhang Z, et al. Biased p-norm support vector machine for PU learning. Neurocomputing. 2014;136:256–61.
4. GeneCards, the human gene database, Weizmann Institute of Science. https://www.genecards.org. Accessed 24 Apr 2020.
5. Scoring theory. https://www.elastic.co. Accessed 11 May 2020.
6. Yousef A, Charkari NM. SFM: a novel sequence-based fusion method for disease genes identification and prioritization. J Theor Biol. 2015;383:12–9.
7. Vasighizaker A, Jalili S. C-PUGP: a cluster-based positive unlabeled learning method for disease gene prediction and prioritization. Comput Biol Chem. 2018;76:23–31.
8. Smalter A, Lei SF, Chen XW. Human disease-gene classification with integrative sequence-based and topological features of protein-protein interaction networks. 2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007).
9. Radivojac P, et al. An integrated approach to inferring gene–disease associations in humans. Proteins: Structure, Function, and Bioinformatics. 2008;72(3):1030–7.
10. Yousef A, Charkari NM. A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification. J Biomed Inform. 2015;56:300–6.
11. Vasighi Zaker A, Saeed J. Candidate disease gene prediction using one-class classification. Soft Computing J. 2016;4(1):74–83.
12. Nikdelfaz O, Jalili S. Disease genes prediction by HMM based PU-learning using gene expression profiles. J Biomed Inform. 2018;81:102–11.
13. Vasighizaker A, Sharma A, Dehzangi A. A novel one-class classification approach to accurately predict disease-gene association in acute myeloid leukemia cancer. PLoS ONE. 2019;14(12):e0226115.
14. Yang P, et al. Ensemble positive unlabeled learning for disease gene identification. PLoS ONE. 2014;9(5):e97079.
15. Tax DMJ, Duin RPW. Support vector data description. Mach Learn. 2004;54(1):45–66.
16. Huber PJ. Robust Statistics. New York: John Wiley & Sons; 1981.
17. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–7.
18. Parzen E. On estimation of a probability density function and mode. Ann Math Stat. 1962;33(3):1065–76.
19. Mordelet F, Vert J-P. ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples. BMC Bioinformatics. 2011;12(1):1–15.
20. Yang P, et al. Positive-unlabeled learning for disease gene identification. Bioinformatics. 2012;28(20):2640–7.
21. Xu J, Li Y. Discovering disease-genes by topological features in human protein–protein interaction network. Bioinformatics. 2006;22(22):2800–5.
Acknowledgements
Funding
The authors have no funding relevant to the content of this article to declare.
Author information
Authors and Affiliations
Contributions
S.M. wrote the main manuscript text. S.J. supervised the study and edited and improved the manuscript. The proposed method was presented and evaluated by S.M., and reviewed and enhanced by S.J.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable. The gene expression data used in this study were obtained from the publicly available GeneCards database.
Consent for publication
Not applicable. This study does not involve any individual data requiring consent for publication.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Molaei, S., Jalili, S. Disease candidate genes prediction using positive labeled and unlabeled instances. BMC Med Genomics 18, 73 (2025). https://doi.org/10.1186/s12920-025-02109-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12920-025-02109-4