In this thesis, we aim to extract a disease screening marker, based on disease-related genetic mutations, from whole genome or whole exome sequencing data. Although many studies have sought disease-related genetic characteristics in genomic data, whose size ranges from tens to hundreds of gigabytes, the biomarkers actually used in clinical medicine account for only a small part of the total information. This is because existing methods consider only partial genetic information, such as a subset of genes. Additionally, the mutual relationships among mutations have rarely been studied. Therefore, in this thesis, we propose a selective searching algorithm that examines the relationship between genetic characteristics and a disease across whole genome or exome data by considering combinations of genetic mutations.
First, we propose a searching algorithm for a combination of disease-related mutations based on whole exome sequencing data. Here, we consider point mutations such as single nucleotide variants (SNVs) and insertions/deletions (InDels). In the extraction algorithm, we filter candidate mutations by applying a learning concept. All samples are divided into training and test samples, and marker extraction samples and validation samples are randomly selected from the training samples. From the marker extraction samples, we extract disease-related mutations that occur frequently in disease samples and rarely in normal samples. We then apply the extracted mutations to the validation samples and keep only those whose accuracy is maintained there. The random selection of marker extraction and validation samples is repeated until the number of selected mutations converges. Next, we propose an objective function-based searching algorithm, which is applied to the extracted candidate mutations to obtain a combination of disease-related mutations. Finally, we apply the proposed searching algorithms to whole exome sequencing data of acute myeloid leukemia (AML) and analyze the validity of the proposed marker and the extracted genes. To check the validity of the proposed marker, we use the proposed threshold-based classification, a support vector machine (SVM), and a convolutional neural network (CNN).
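The candidate filtering step described above can be illustrated with a minimal sketch. It assumes a binary sample-by-mutation matrix `X` and labels `y` (1 for disease, 0 for normal). The frequency cut-offs `min_disease` and `max_normal`, the stratified split, the use of the same frequency criterion on the validation samples, and the `patience`-based convergence rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np

def passes(X, y, min_disease=0.6, max_normal=0.2):
    """Mask of mutations frequent in disease samples and rare in normal samples.
    The two cut-offs are illustrative assumptions, not values from the thesis."""
    disease_freq = X[y == 1].mean(axis=0)
    normal_freq = X[y == 0].mean(axis=0)
    return (disease_freq >= min_disease) & (normal_freq <= max_normal)

def random_split(y, rng, frac):
    """Stratified random split into marker extraction and validation indices.
    Assumes at least two samples per class so neither part is empty."""
    extraction, validation = [], []
    for label in (0, 1):
        idx = rng.permutation(np.flatnonzero(y == label))
        cut = max(1, int(frac * len(idx)))
        extraction.extend(idx[:cut])
        validation.extend(idx[cut:])
    return np.array(extraction), np.array(validation)

def filter_candidates(X, y, rng, frac=0.5, patience=5, max_rounds=200):
    """Repeat random splits; keep mutations that pass on both parts every round.
    Stops once the number of survivors is unchanged for `patience` rounds."""
    surviving = np.ones(X.shape[1], dtype=bool)
    stable = 0
    for _ in range(max_rounds):
        ext, val = random_split(y, rng, frac)
        keep = passes(X[ext], y[ext]) & passes(X[val], y[val])
        updated = surviving & keep
        stable = stable + 1 if updated.sum() == surviving.sum() else 0
        surviving = updated
        if stable >= patience:
            break
    return np.flatnonzero(surviving)
```

Intersecting the surviving set across rounds is one plausible reading of "repeated until the number of selected mutations converges"; other aggregation rules, such as keeping mutations selected in a given fraction of rounds, would fit the description equally well.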
Second, we propose a searching algorithm for a combination of disease-related mutations based on whole genome sequencing data, which includes exonic, intronic, and intergenic regions. The extraction process for candidate mutations is the same as in the whole exome data-based method. We newly define the objective function of the searching algorithm for whole genome sequencing data. For whole genome sequencing data, the number of candidate mutations is much larger than for whole exome sequencing data. Thus, the objective function is redefined to take into account the classification accuracy, the difference, and the variance for the disease and normal groups in the training samples. In addition, we extract the disease screening marker from major genes and their intergenic regions. To confirm the performance of the disease screening marker based on whole genome sequencing data, we observe classification results for test samples using the proposed threshold-based classification, SVM, and CNN methods. Finally, we compare the whole exome data-based marker with the whole genome data-based marker.
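To make the idea of an objective function-based search concrete, the following is a rough sketch only. The abstract does not give the form of the objective, so everything here is an assumption: the per-sample score (fraction of the combination's mutations present), the midpoint threshold, the weights `alpha` and `beta`, the way accuracy, group difference, and group variance are combined, and the greedy forward search itself.

```python
import numpy as np

def objective(X, y, combo, alpha=0.1, beta=0.1):
    """Hypothetical objective: accuracy + alpha * difference - beta * variance."""
    # Per-sample score: fraction of the combination's mutations that are present.
    score = X[:, combo].mean(axis=1)
    disease, normal = score[y == 1], score[y == 0]
    threshold = (disease.mean() + normal.mean()) / 2   # assumed midpoint threshold
    accuracy = ((score > threshold) == (y == 1)).mean()
    difference = disease.mean() - normal.mean()        # separation of group means
    variance = disease.var() + normal.var()            # spread within each group
    return accuracy + alpha * difference - beta * variance

def greedy_search(X, y, candidates, max_size=5):
    """Greedy forward selection: add the mutation that most improves the objective,
    and stop when no addition improves it or `max_size` is reached."""
    combo, best = [], -np.inf
    while len(combo) < max_size:
        remaining = [c for c in candidates if c not in combo]
        if not remaining:
            break
        scores = [objective(X, y, combo + [c]) for c in remaining]
        i = int(np.argmax(scores))
        if scores[i] <= best:
            break
        combo.append(remaining[i])
        best = scores[i]
    return combo, best
```

A greedy search is used here only because it stays tractable when the candidate set is large, as is the case for whole genome data; it is not claimed to be the search strategy used in the thesis.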