In this thesis, we analyze several types of genetic information to identify the information most relevant to melanoma metastasis.
Understanding the causes and mechanisms of metastasis is an important issue in cancer research, because metastasis makes treatment difficult and is the leading cause of cancer-related death.
In particular, melanoma, a type of skin cancer, is common in Caucasians and relatively rare in Asians and African Americans.
Melanoma needs to be understood comprehensively because the risk of death is high when it is diagnosed as a metastatic tumor or when it metastasizes.
With the development of technologies for acquiring genetic information, such as DNA sequencing and microarrays, a large amount of genetic information can be obtained quickly and at low cost.
As a result, cancer genetics, the study of cancer based on the analysis of genetic information, is an active area of research.
Genetic information that is specific to an individual determines that individual's characteristics and is also known to be highly correlated with cancer.
Furthermore, it is an important criterion for the selection of effective treatment.
However, it is very difficult to find the small amount of cancer-related information within a large amount of genetic information.
Therefore, in this thesis, we find important genetic information related to melanoma metastasis by analyzing various types of genetic information using machine learning, neural networks, optimization, and search algorithms.
Specifically, we analyze different types of genetic information and develop feature selection methods that take the characteristics of each type into account.
In this process, the relationships among the genetic features are considered, rather than analyzing each feature independently, in order to derive the minimum information related to melanoma metastasis.
In addition, because feature selection picks out important information without modifying the data, the characteristics of the cancer can be identified directly from the selected features, which can then be used in the diagnosis or treatment of actual cancer.
First, from copy number variation (CNV), a type of structural variation of the genome, we derive a set of CNVs that can distinguish primary tumors from metastatic tumors.
A CNV is a variation in the number of repetitions of a particular section of the genome sequence relative to the reference genome, and it is classified as either a deletion or a duplication.
To derive the CNV set for distinguishing primary from metastatic tumors, a search algorithm based on forward selection is used.
Deletions and duplications are handled separately, and CNVs commonly found in primary tumors and CNVs commonly found in metastatic tumors are selected separately.
In addition, optimization is performed to minimize the number of selected CNVs while maintaining identification performance.
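As an illustration of the forward-selection search described above, the following is a minimal sketch rather than the method used in this thesis. It assumes a binary sample-by-CNV matrix (1 = CNV present) and a hypothetical scoring rule, namely the accuracy obtained by calling a sample metastatic when it carries at least one selected CNV. The names `score` and `forward_select` and the toy data are illustrative assumptions; in practice the search would be run separately for deletions and duplications.

```python
from typing import List, Sequence


def score(X: Sequence[Sequence[int]], y: Sequence[int], selected: List[int]) -> float:
    """Accuracy of a toy rule: call a sample metastatic (1) if it carries any selected CNV."""
    hits = 0
    for row, label in zip(X, y):
        pred = 1 if any(row[j] for j in selected) else 0
        hits += int(pred == label)
    return hits / len(y)


def forward_select(X: Sequence[Sequence[int]], y: Sequence[int]) -> List[int]:
    """Greedy forward selection that stops as soon as no CNV strictly improves the score."""
    selected: List[int] = []
    best = score(X, y, selected)
    remaining = list(range(len(X[0])))
    while remaining:
        # Score every candidate extension; ties go to the lowest column index.
        top_score, top_j = max(
            ((score(X, y, selected + [j]), j) for j in remaining),
            key=lambda t: (t[0], -t[1]),
        )
        if top_score <= best:
            break  # no strict gain: stopping here keeps the selected set small
        selected.append(top_j)
        remaining.remove(top_j)
        best = top_score
    return selected


# Hypothetical toy data: rows are samples, columns are CNVs (1 = present).
X = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
y = [1, 1, 1, 0, 0]  # 1 = metastatic, 0 = primary (assumed encoding)
chosen = forward_select(X, y)  # selects columns 0 and 1, then stops
```

Requiring a strict improvement before adding a CNV is one simple way to keep the selected set small while maintaining identification performance; the optimization actually used may differ.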
Second, short somatic variants such as single nucleotide variations (SNVs) and insertions and deletions (Indels) are analyzed, and variants related to melanoma metastasis are derived.
SNVs and Indels are shorter than CNVs and are found more frequently.
An SNV is a change of a single nucleotide relative to the reference genome, and an Indel is the insertion or deletion of one or more consecutive nucleotides relative to the reference genome.
By applying correlation-based feature selection, somatic variants that are highly correlated with primary tumors or with metastatic tumors are selected.
At the same time, the selected variants have low correlation with one another.
To exclude somatic variants related to both tumor types, two correlation filters are applied simultaneously using multiobjective optimization.
This makes it possible to select variants related to either the activation or the deactivation of melanoma metastasis while removing variants related to both of these contradictory characteristics.
In addition, to reduce the computational cost caused by the size of the data, candidate variants are pre-selected based on the average correlation value of each variant before the final important variants are selected.
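The idea behind correlation-based feature selection can be sketched with the standard CFS merit, which rewards subsets whose members correlate strongly with the class label and weakly with one another. The code below is an illustrative sketch under stated assumptions, not the procedure used in this thesis: it uses Pearson correlation on a toy binary matrix, the pre-selection step ranks variants by absolute correlation with the label as a stand-in for the average correlation value mentioned above, and the multiobjective combination of the two filters is not reproduced. The names `pearson`, `cfs_merit`, and `preselect` and all data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def cfs_merit(columns: List[Sequence[float]], y: Sequence[float], subset: List[int]) -> float:
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(columns[j], y)) for j in subset) / k
    # Mean absolute feature-feature correlation over all pairs in the subset.
    pairs = [(i, j) for pos, i in enumerate(subset) for j in subset[pos + 1:]]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def preselect(columns: List[Sequence[float]], y: Sequence[float], top_n: int) -> List[int]:
    """Assumed pre-filter: keep the top_n variants by absolute correlation with the label."""
    ranked = sorted(range(len(columns)), key=lambda j: abs(pearson(columns[j], y)), reverse=True)
    return ranked[:top_n]


# Hypothetical toy data: each column is one variant across six samples.
y = [1, 1, 1, 0, 0, 0]           # 1 = metastatic, 0 = primary (assumed encoding)
columns = [
    [1, 1, 0, 0, 0, 0],          # variant 0: informative
    [0, 1, 1, 0, 0, 0],          # variant 1: informative, weakly correlated with variant 0
    [1, 1, 0, 0, 0, 0],          # variant 2: exact copy of variant 0 (redundant)
    [1, 0, 1, 0, 1, 0],          # variant 3: weakly related to the label
]
complementary = cfs_merit(columns, y, [0, 1])
redundant = cfs_merit(columns, y, [0, 2])
candidates = preselect(columns, y, 3)
```

On this toy data the complementary pair scores higher than the redundant pair, and adding an exact copy of a variant does not raise the merit above that of the single variant.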
Third, gene expression profiles are analyzed to derive gene signatures for melanoma metastasis.
Unlike CNVs, SNVs, and Indels, which are generally obtained using DNA sequencing, expression profiles can also be obtained using microarrays.
Microarray-based expression profiling allows a large number of genes and samples to be analyzed simultaneously.
It has also been studied in various fields because it can provide simple and reliable analysis results compared with DNA sequencing.
To select gene signatures related to melanoma metastasis from gene expression profiles, an embedded feature selection method is proposed.
In the proposed method, feature selection based on linear regression selects gene signatures without distorting the raw data, and classification is performed with a neural network.
Multiple linear regression models can be integrated through boosting to build a powerful feature selection model.
We also apply the proposed boosted feature selection repeatedly to minimize the number of selected features.
However, training and integrating multiple regression models can lead to high computational cost.
To reduce this cost, we reuse only the models that were useful in the previous round of feature selection instead of reusing all of the regression models.
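One way to picture boosting over linear regression models for feature selection is componentwise L2 boosting, used here as a stand-in because the exact formulation is not given above. In each round a single-feature least-squares model is fitted to the current residual, the best-fitting feature is recorded, and a shrunken version of its fit is subtracted. The sketch below is illustrative only: the function names, the `candidates` argument (which mimics reusing only the previously useful models), the shrinkage rate, and the toy data are all assumptions, and the neural-network classifier is omitted.

```python
from typing import Iterable, List, Optional, Sequence, Set, Tuple


def fit_simple(x: Sequence[float], r: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for regressing residual r on one feature x."""
    n = len(x)
    mx, mr = sum(x) / n, sum(r) / n
    sxx = sum((v - mx) ** 2 for v in x)
    if sxx == 0:
        return 0.0, mr
    slope = sum((v - mx) * (w - mr) for v, w in zip(x, r)) / sxx
    return slope, mr - slope * mx


def boosted_selection(
    columns: List[Sequence[float]],
    y: Sequence[float],
    rounds: int = 20,
    rate: float = 0.5,
    candidates: Optional[Iterable[int]] = None,
) -> Set[int]:
    """Return the indices of features picked by componentwise L2 boosting."""
    pool = sorted(candidates) if candidates is not None else list(range(len(columns)))
    residual: List[float] = list(y)
    chosen: Set[int] = set()
    for _ in range(rounds):
        best_j, best_sse, best_fit = None, float("inf"), None
        for j in pool:
            slope, icpt = fit_simple(columns[j], residual)
            sse = sum((w - (slope * v + icpt)) ** 2 for v, w in zip(columns[j], residual))
            if sse < best_sse:
                best_j, best_sse, best_fit = j, sse, (slope, icpt)
        if best_fit is None or abs(best_fit[0]) < 1e-12:
            break  # empty pool, or no feature explains the remaining residual
        slope, icpt = best_fit
        chosen.add(best_j)
        # Subtract a shrunken copy of the best single-feature fit.
        residual = [w - rate * (slope * v + icpt) for v, w in zip(columns[best_j], residual)]
    return chosen


# Hypothetical toy data: three mutually orthogonal features; y depends on features 0 and 2 only.
columns = [
    [1, 1, -1, -1],
    [1, -1, 1, -1],
    [1, -1, -1, 1],
]
y = [3 * a + c for a, c in zip(columns[0], columns[2])]
first_pass = boosted_selection(columns, y)
second_pass = boosted_selection(columns, y, candidates=first_pass)  # search only previously useful features
```

Restricting the second pass to `first_pass` shrinks the inner search loop, which is the intuition behind reusing only the models that proved useful in the previous round.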