In SIAM International Conference on Data Mining, pages: 295-304, (Editors: Park, H. , S. Parthasarathy, H. Liu), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, SDM, May 2009 (inproceedings)
Many real-world applications with graph data require
the efficient solution of a given regression task as well as the
identification of the subgraphs which are relevant for the task. In these cases graphs
are commonly represented as binary vectors indicating the presence of subgraphs, giving rise to an intractably large input dimensionality.
An efficient solution to this problem was recently proposed by a Lasso-type
method where the objective function optimization over an intractable
number of variables is reformulated as a dual mathematical programming problem
over a small number of variables but a large number of constraints. The
dual problem is then solved by column generation where the subgraphs corresponding
to the most violated constraints are found by weighted subgraph mining.
This paper proposes an extension of this method to a fully Bayesian approach, which
defines a prior distribution on the parameters and integrates them out of the model, thus providing a posterior distribution on the target variable as
opposed to a single estimate. The advantage of this approach is that
the extra information given by the target posterior distribution can be used for improving
the model in several ways. In this paper, we use the target posterior variance as a measure of uncertainty in the
prediction and show that, by rejecting low-confidence predictions, we can improve state-of-the-art
performance on several molecular graph datasets.
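The rejection mechanism can be sketched for ordinary Bayesian linear regression (a deliberate simplification of the paper's subgraph-indicator model; the function names and the variance threshold here are illustrative, not from the paper):

```python
import numpy as np

def bayes_linreg_posterior(X, y, alpha=1.0, beta=1.0):
    """Posterior over weights for Bayesian linear regression with a
    Gaussian prior N(0, alpha^{-1} I) and noise precision beta."""
    d = X.shape[1]
    S = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)  # posterior covariance
    m = beta * S @ X.T @ y                                 # posterior mean
    return m, S

def predict_with_rejection(X_new, m, S, beta=1.0, var_threshold=2.0):
    """Predictive mean and variance; predictions whose posterior variance
    exceeds the threshold are marked as rejected (abstained)."""
    mean = X_new @ m
    # predictive variance = noise variance + quadratic form x^T S x per row
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', X_new, S, X_new)
    accepted = var <= var_threshold
    return mean, var, accepted
```

Rejecting the high-variance tail trades coverage for accuracy, which is exactly the mechanism evaluated on the molecular datasets.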
In KDD2008, pages: 578-586, (Editors: Li, Y. , B. Liu, S. Sarawagi), ACM Press, New York, NY, USA, 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2008 (inproceedings)
Attributed graphs are increasingly common in many application
domains such as chemistry, biology and text processing.
A central issue in graph mining is how to collect informative subgraph
patterns for a given learning task.
We propose an iterative mining method based on
partial least squares regression (PLS).
To apply PLS to graph data, we first develop a sparse version of PLS
and then combine it with a weighted pattern mining algorithm.
The mining algorithm is iteratively called with different weight
vectors, creating one latent component per mining call.
Our method, graph PLS, is efficient and easy to implement, because the
weight vector is updated with elementary matrix calculations.
In experiments, our graph PLS algorithm showed
competitive prediction accuracies on many chemical datasets, and its
efficiency was significantly superior to graph boosting (gboost) and the
naive method based on frequent graph mining.
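One sparse-PLS iteration in this spirit can be sketched on a dense matrix (where graph PLS obtains the top weight-vector entries by pattern mining, this toy version simply hard-thresholds the covariance direction X^T y; all names are illustrative):

```python
import numpy as np

def sparse_pls1(X, y, n_components=2, k_nonzero=5):
    """One latent component per iteration: the weight vector w is the
    sparsified covariance direction X^T y; X and y are then deflated."""
    Xd, yd = X.copy(), y.astype(float).copy()
    W, T = [], []
    for _ in range(n_components):
        w = Xd.T @ yd                           # covariance with the response
        # keep only the k largest-magnitude entries (sparsity step;
        # graph PLS finds these top entries by weighted pattern mining)
        w[np.argsort(np.abs(w))[:-k_nonzero]] = 0.0
        w /= np.linalg.norm(w) + 1e-12
        t = Xd @ w                              # latent score
        p = Xd.T @ t / (t @ t)                  # loading
        Xd -= np.outer(t, p)                    # deflate X
        yd -= t * (t @ yd) / (t @ t)            # deflate y
        W.append(w); T.append(t)
    return np.array(W).T, np.array(T).T
```

The per-iteration work is indeed elementary matrix arithmetic; only the sparsification step differs between this sketch and the mining-based method.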
Machine Learning, 75(1):69-89, November 2008 (article)
Graph mining methods enumerate frequently appearing subgraph patterns, which can be used as features for subsequent classification or regression. However, frequent patterns are not necessarily informative for the given learning problem. We propose a mathematical programming boosting method (gBoost) that progressively collects informative patterns. Compared to AdaBoost, gBoost can build the prediction rule with fewer iterations. To apply the boosting method to graph data, a branch-and-bound pattern search algorithm is developed based on the DFS code tree. The constructed search space is reused in later iterations to minimize the computation time. Our method can learn more efficiently than the simpler method based on frequent substructure mining, because the output labels are used as an extra information source for pruning the search space. Furthermore, by engineering the mathematical program, a wide range of machine learning problems can be solved without modifying the pattern search algorithm.
In ICDM 2008, pages: 1007-1012, (Editors: Giannotti, F. , D. Gunopulos, F. Turini, C. Zaniolo, N. Ramakrishnan, X. Wu), IEEE Computer Society, Los Alamitos, CA, USA, IEEE International Conference on Data Mining, December 2008 (inproceedings)
Graph mining methods enumerate frequent subgraphs
efficiently, but they are not necessarily good features for
machine learning due to high correlation among features.
Thus it makes sense to perform principal component analysis
to reduce the dimensionality and create decorrelated
features. We present a novel iterative mining algorithm
that captures informative patterns corresponding to major
entries of top principal components. It repeatedly calls
weighted substructure mining where example weights are
updated in each iteration. The Lanczos algorithm, a standard
algorithm for eigendecomposition, is employed to update
the weights. In experiments, our patterns are shown to
approximate the principal components obtained by frequent pattern mining.
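The weight update rests on the Lanczos iteration; a compact standalone sketch of basic Lanczos with full reorthogonalisation (not the paper's mining-coupled version; names are illustrative):

```python
import numpy as np

def lanczos_top_eigvec(A, k=10, seed=0):
    """Basic Lanczos iteration on a symmetric matrix A: build a k-step
    tridiagonal matrix T and Krylov basis Q, then recover the leading
    eigenpair of A from the leading eigenpair of T."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q = np.zeros((n, k)); alpha = np.zeros(k); beta = np.zeros(k)
    q_prev = np.zeros(n)
    for j in range(k):
        Q[:, j] = q
        v = A @ q
        alpha[j] = q @ v
        v = v - alpha[j] * q
        if j > 0:
            v = v - beta[j - 1] * q_prev
        # full reorthogonalisation keeps the basis numerically orthogonal
        v -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ v)
        if j < k - 1:
            beta[j] = np.linalg.norm(v)
            q_prev, q = q, v / (beta[j] + 1e-12)
    T = np.diag(alpha) + np.diag(beta[:k - 1], 1) + np.diag(beta[:k - 1], -1)
    evals, evecs = np.linalg.eigh(T)
    return evals[-1], Q @ evecs[:, -1]
```

Each matrix-vector product A @ q corresponds, in the mining setting, to one weighted substructure mining call with the current example weights.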
NIPS Workshop on Machine Learning in Computational Biology, December 2007 (talk)
Secondary metabolic pathways in plants are important for finding druggable candidate enzymes. However, many enzymes have functions that remain undiscovered, especially in organism-specific metabolic pathways. We propose reaction graph kernels for automatically assigning EC numbers to unknown enzymatic reactions in a metabolic network. Experiments are carried out on the KEGG/REACTION database, and our method successfully predicted the first three digits of the EC number with 83% accuracy. We also exhaustively predicted missing enzymatic functions in the plant secondary metabolism pathways and evaluated the biochemical validity of our results.
Bioinformatics, 23(18):2455-2462, September 2007 (article)
Human immunodeficiency virus type 1 (HIV-1) evolves in the human body,
and its exposure to a drug often causes mutations that enhance
the resistance against the drug.
To design an effective pharmacotherapy for an individual patient,
it is important to accurately predict the drug resistance
based on genotype data.
Notably, the resistance is not just
the simple sum of the effects of all mutations.
Structural biological studies suggest that
the association of mutations is crucial:
Even if mutations A or B alone do not affect the resistance,
a significant change might happen
when the two mutations occur together.
Linear regression methods cannot take the associations into account,
while decision tree methods can reveal only limited associations.
Kernel methods and neural networks implicitly use all possible
associations for prediction, but cannot select salient associations.
Our method, itemset boosting, performs linear regression
in the complete space of power sets of mutations.
It implements a forward feature selection procedure where,
in each iteration, one mutation combination is
found by an efficient branch-and-bound search.
This method uses all possible combinations,
and salient associations are explicitly shown.
In experiments, our method worked particularly well for predicting the
resistance of nucleotide reverse transcriptase inhibitors
(NRTIs). Furthermore, it successfully recovered many mutation
associations known in the biological literature.
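For intuition, the inner search over mutation combinations can be emulated by brute force on small data (itemset boosting prunes this enumeration with branch-and-bound; the scoring by absolute correlation with the residual and all names here are illustrative):

```python
from itertools import combinations
import numpy as np

def best_mutation_combo(M, residual, max_size=2):
    """Exhaustively score every mutation combination up to max_size by the
    absolute correlation between its conjunction indicator (all mutations
    in the combination present) and the current regression residual."""
    n, d = M.shape
    best, best_score = None, -1.0
    for size in range(1, max_size + 1):
        for combo in combinations(range(d), size):
            feat = M[:, list(combo)].all(axis=1).astype(float)
            if feat.std() == 0:          # constant feature, skip
                continue
            score = abs(np.corrcoef(feat, residual)[0, 1])
            if score > best_score:
                best, best_score = combo, score
    return best, best_score
```

A combination like (A, B) can score highly even when neither single mutation does, which is precisely the association effect described above.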
Kashima, H., Yamazaki, K., Saigo, H., Inokuchi, A.
Regression with Intervals
International Workshop on Data-Mining and Statistical Science (DMSS2007), October 2007, JSAI Incentive Award. Talk was given by Hisashi Kashima. (talk)
In MLG 2006, pages: 85-96, (Editors: Gärtner, T. , G. C. Garriga, T. Meinl), International Workshop on Mining and Learning with Graphs, September 2006, Best Paper Award (inproceedings)
Small molecules in chemistry can be represented as graphs.
In a quantitative structure-activity relationship (QSAR) analysis, the
central task is to find a regression function that predicts
the activity of a molecule with high accuracy.
With QSAR as the primary target, we propose a new linear
programming approach to the graph-based regression problem.
Our method extends the graph classification algorithm by Kudo et al.
(NIPS 2004), which is a combination of boosting and graph mining.
Instead of sequential multiplicative updates, we employ linear
programming boosting (LP) for regression. The LP approach allows us to
include inequality constraints on the parameter vector, which turns out to
be particularly useful in QSAR tasks where activity values are
Furthermore, the efficiency is improved significantly by employing
NIPS Workshop on New Problems and Methods in Computational Biology, December 2006 (talk)
We propose a new boosting method that systematically combines graph mining and mathematical programming-based machine learning. Informative and interpretable subgraph features are greedily found by a series of graph mining calls. Due to our mathematical programming formulation, subgraph features and pre-calculated real-valued features are seamlessly integrated. We tested our algorithm on a quantitative structure-activity relationship (QSAR) problem, which is essentially a regression problem over a set of chemical compounds. In benchmark experiments, the prediction accuracy of our method compared favorably with the best results reported on each dataset.
Danziger, S., Swamidass, S., Zeng, J., Dearth, L., Lu, Q., Cheng, J., Cheng, J., Hoang, V., Saigo, H., Luo, R., Baldi, P., Brachmann, R., Lathrop, R.
IEEE Transactions on Computational Biology and Bioinformatics, 3(2):114-125, April 2006 (article)
Many biomedical problems relate to mutant functional properties across a sequence space of interest, e.g., flu, cancer, and HIV. Detailed knowledge of mutant properties and function improves medical treatment and prevention. A functional census of p53 cancer rescue mutants would aid the search for cancer treatments from p53 mutant rescue. We devised a general methodology for conducting a functional census of a mutation sequence space by choosing informative mutants early. The methodology was tested in a double-blind predictive test on the functional rescue property of 71 novel putative p53 cancer rescue mutants iteratively predicted in sets of three (24 iterations). The first double-blind 15-point moving accuracy was 47 percent and the last was 86 percent; r = 0.01 before an epiphanic 16th iteration and r = 0.92 afterward. Useful mutants were chosen early (overall r = 0.80). Code and data are freely available (http://www.igb.uci.edu/research/research.html, corresponding authors: R.H.L. for computation and R.K.B. for biology).
BMC Bioinformatics, 7(246):1-12, May 2006 (article)
Detecting remote homologies by direct comparison of protein sequences remains a challenging task. We had previously developed a similarity score between sequences, called a local alignment kernel, that exhibits good performance for this task in combination with a support vector machine. The local alignment kernel depends on an amino acid substitution matrix. Since commonly used BLOSUM or PAM matrices for scoring amino acid matches have been optimized to be used in combination with the Smith-Waterman algorithm, the matrices optimal for the local alignment kernel can be different.
Contrary to the local alignment score computed by the Smith-Waterman algorithm, the local alignment kernel is differentiable with respect to the amino acid substitution matrix, and its derivative can be computed efficiently by dynamic programming. We optimized the substitution matrix by classical gradient descent, using an objective function that measures how well the local alignment kernel discriminates homologs from non-homologs in the COG database. The local alignment kernel exhibits better performance when it uses the matrices and gap parameters optimized by this procedure than when it uses the matrices optimized for the Smith-Waterman algorithm. Furthermore, the matrices and gap parameters optimized for the local alignment kernel can also be used successfully by the Smith-Waterman algorithm.
This optimization procedure leads to useful substitution matrices, both for the local alignment kernel and the Smith-Waterman algorithm. The best performance for homology detection is obtained by the local alignment kernel.
Ralaivola, L., Swamidass, J., Saigo, H., Baldi, P.
Neural Networks, 18(8):1093-1110, 2005 (article)
Increased availability of large repositories of chemical compounds is creating new
challenges and opportunities for the application of machine learning methods to
problems in computational chemistry and chemical informatics. Because chemical
compounds are often represented by the graph of their covalent bonds, machine
learning methods in this domain must be capable of processing graphical structures
with variable size. Here we first briefly review the literature on graph kernels and
then introduce three new kernels (Tanimoto, MinMax, Hybrid) based on the idea
of molecular fingerprints and counting labeled paths of depth up to d using depth-first
search from each possible vertex. The kernels are applied to three classification
problems to predict mutagenicity, toxicity, and anti-cancer activity on three publicly
available data sets. The kernels achieve performances at least comparable, and most
often superior, to those previously reported in the literature, reaching accuracies of
91.5% on the Mutag dataset, 65-67% on the PTC (Predictive Toxicology Challenge)
dataset, and 72% on the NCI (National Cancer Institute) dataset. Properties and
tradeoffs of these kernels, as well as other proposed kernels that leverage 1D or 3D
representations of molecules, are briefly discussed.
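On precomputed fingerprint vectors, the Tanimoto and MinMax kernels reduce to a few lines (a sketch on raw vectors; the paper's kernels operate on fingerprints built from depth-bounded path counts):

```python
import numpy as np

def tanimoto_kernel(x, y):
    """Tanimoto similarity between two binary fingerprint vectors:
    |intersection| / |union| of the set bits."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    union = np.logical_or(x, y).sum()
    return np.logical_and(x, y).sum() / union if union else 1.0

def minmax_kernel(x, y):
    """MinMax similarity between two count fingerprints: sum of elementwise
    minima over sum of elementwise maxima; reduces to Tanimoto on 0/1 vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    denom = np.maximum(x, y).sum()
    return np.minimum(x, y).sum() / denom if denom else 1.0
```

MinMax generalises Tanimoto to path counts rather than mere path presence, which is why the two coincide on binary fingerprints.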
Protein Science, 14, pages: 2804-2813, 2005 (article)
As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful to predict the subcellular location.
Keywords: subcellular location; signal sequence; amino acid composition; distance frequency; support vector machine; predictive accuracy
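The split-and-count core of this representation can be sketched as follows (compositions only; the paper additionally uses twin amino acid compositions, distance frequencies, and a finer four-region split of the N-terminal part; names are illustrative):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def local_composition(seq, n_parts=3):
    """Split a protein sequence into equal parts (N-terminal, middle,
    C-terminal for n_parts=3) and concatenate the per-part amino acid
    compositions into one feature vector of length 20 * n_parts."""
    bounds = np.linspace(0, len(seq), n_parts + 1).astype(int)
    feats = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        part = seq[a:b]
        counts = np.array([part.count(aa) for aa in AMINO_ACIDS], float)
        feats.append(counts / max(len(part), 1))  # relative frequencies
    return np.concatenate(feats)
```

The resulting fixed-length vectors are what the support vector machine consumes in the cross-validation experiments.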
Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods involving support vector machines (SVMs) are currently the most effective for the problem of superfamily recognition in the SCOP database. The performance of SVMs depends critically on the kernel function used to quantify the similarity between sequences. We propose new kernels for strings adapted to biological sequences, which we call local alignment kernels. These kernels measure the similarity between two sequences by summing up scores obtained from local alignments with gaps. When tested in combination with SVMs on their ability to recognize SCOP superfamilies on a benchmark dataset, the new kernels outperform state-of-the-art methods for remote homology detection.
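A minimal sketch of the summed-alignment dynamic programme behind the local alignment kernel (assuming the standard affine-gap recursion; the scalar scoring function and gap parameters here are illustrative placeholders, not a BLOSUM matrix):

```python
import numpy as np

def la_kernel(x, y, score, beta=0.5, gap_open=-3.0, gap_ext=-1.0):
    """Local alignment kernel: sums exp(beta * score) over *all* local
    alignments with affine gaps, instead of keeping only the single best
    alignment as the Smith-Waterman algorithm does."""
    n, m = len(x), len(y)
    go, ge = np.exp(beta * gap_open), np.exp(beta * gap_ext)
    M = np.zeros((n + 1, m + 1))        # alignments ending in a match
    X = np.zeros((n + 1, m + 1))        # ...ending in a gap in y
    Y = np.zeros((n + 1, m + 1))        # ...ending in a gap in x
    X2 = np.zeros((n + 1, m + 1))       # terminated alignments (x side)
    Y2 = np.zeros((n + 1, m + 1))       # terminated alignments (y side)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i, j] = np.exp(beta * score(x[i - 1], y[j - 1])) * (
                1 + X[i - 1, j - 1] + Y[i - 1, j - 1] + M[i - 1, j - 1])
            X[i, j] = go * M[i - 1, j] + ge * X[i - 1, j]
            Y[i, j] = go * (M[i, j - 1] + X[i, j - 1]) + ge * Y[i, j - 1]
            X2[i, j] = M[i - 1, j] + X2[i - 1, j]
            Y2[i, j] = M[i, j - 1] + X2[i, j - 1] + Y2[i, j - 1]
    return 1 + X2[n, m] + Y2[n, m] + M[n, m]
```

As beta grows, (1 / beta) * log of this quantity approaches the Smith-Waterman score, which is the sense in which the kernel smooths the classical local alignment score.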