Research Area:  Data Mining
Data mining is an analytic process for discovering systematic relationships between variables and for finding patterns in data. Using those findings, data mining can create predictive models (e.g., target variable forecasting, label classification) or identify different groups within data (e.g., clustering). The principal objective of this dissertation is to develop data mining algorithms that outperform conventional data mining techniques on problems in the social and healthcare sciences.
Toward this objective, this dissertation develops two data mining techniques, each of which addresses the limitations of a conventional data mining technique when applied in these contexts. The first part (Part I) of this dissertation addresses the problem of identifying important factors that promote or hinder population growth. When addressing this problem, previous studies included variables (input factors) without considering the statistical dependence among the included input factors; therefore, most previous studies exhibit multicollinearity between the input variables.
We propose a novel methodology that, even in the presence of multicollinearity among input factors, is able to (1) identify significant factors affecting population growth and (2) rank these factors according to their level of influence on population growth. To measure the level of influence of each input factor on population growth, the proposed method combines decision tree clustering and the Cohen's d index. We applied the proposed method to a real county-level United States dataset and determined the level of influence of an extensive list of input factors on population growth.
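To make the effect-size step concrete, the following is a minimal sketch of how Cohen's d could be computed for one input factor across two groups of observations and then used to rank factors by absolute effect size. It is not the dissertation's procedure: the abstract does not say how the groups are formed or how the clustering feeds into the index, so the two-group setup, the function names `cohens_d` and `rank_factors`, the factor names, and the numbers are all assumptions made for illustration. The formula is the standard pooled-standard-deviation form of Cohen's d.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def rank_factors(factor_values):
    """Hypothetical helper: sort factors by |d|, largest effect first.

    factor_values maps a factor name to a pair (values_in_group_a, values_in_group_b).
    """
    scores = {name: cohens_d(a, b) for name, (a, b) in factor_values.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up numbers for illustration only; they are not from the county-level dataset.
toy = {
    "factor_x": ([2, 4, 6], [1, 3, 5]),
    "factor_y": ([1, 2, 3], [4, 5, 6]),
}
ranking = rank_factors(toy)
```

In this toy input, `factor_y` ranks first because its group means differ by three pooled standard deviations, while `factor_x` differs by only half of one. The sign of d indicates the direction of the difference, which is why the ranking uses the absolute value.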
Name of the Researcher:  Kisuk Sung
Name of the Supervisor(s):  Erick Moreno-Centeno
Year of Completion:  2016
University:  Texas A&M University