All Fingers Are Not the Same: Handling Variable-Length Sequence Comparison in a Discriminative Setting





1. Background

Existing string kernels for comparing genomic sequences generally rely on the absolute positions of features within the individual sequences. This limits their use when the sequences being compared differ in length. For example, profiling long-range chromatin interactions with Hi-C experiments yields variable-length restriction fragments. Here, the exact position at which a signal occurs in a sequence may matter less than it does when analyzing promoter sequences, which have the transcription start site as a reference, for gene expression prediction. Existing string kernels have been shown to be useful in the latter scenario.

2. Method

In this work, we propose a novel approach for sequence comparison that allows greater positional freedom than existing approaches and identifies a possibly dispersed set of features when comparing variable-length sequences. In our approach, termed CoMIK for 'conformal multiple instance kernels', we represent each genomic sequence (the whole) by its segments (the parts). More specifically, we use two complementary sets of segments: non-shifted and shifted. To compare any two sequences, we then compare all segments of one sequence (non-shifted and shifted) with all segments of the other using a multi-instance kernel. We further apply conformal transformations to the multi-instance kernels, an idea developed previously by Blaschko and Hofmann. These transformations enable CoMIK to obtain a per-sequence segment weighting that denotes how much each segment of a sequence contributes to the classification of that sequence. This weighting can also be obtained for any test sequence that CoMIK has not yet seen.
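The segment-wise comparison described above can be illustrated with a minimal sketch. It is not the CoMIK implementation: the spectrum (k-mer count) instance kernel, the half-segment shift, the segment length, the averaging normalization, and the `weight` function standing in for a conformal factor are all assumptions made for illustration. In CoMIK the segment weighting is learned, whereas here it is supplied by hand.

```python
from collections import Counter

import numpy as np


def segments(seq, seg_len):
    """Split a sequence into non-shifted tiles plus tiles shifted by half
    a segment (the half-segment shift is an assumption of this sketch)."""
    non_shifted = [seq[i:i + seg_len] for i in range(0, len(seq), seg_len)]
    shift = seg_len // 2
    shifted = [seq[i:i + seg_len] for i in range(shift, len(seq), seg_len)]
    return non_shifted + shifted


def kmer_counts(segment, k=2):
    """Count every k-mer occurring in a segment."""
    return Counter(segment[i:i + k] for i in range(len(segment) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Instance kernel (assumed): inner product of k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return float(sum(ca[m] * cb[m] for m in ca))


def multi_instance_kernel(seq_x, seq_y, seg_len=4, k=2, weight=None):
    """Compare all segments of seq_x with all segments of seq_y.

    `weight` is a hypothetical positive function c(segment); each pairwise
    value is rescaled as c(a) * k(a, b) * c(b), which is the general form
    of a conformal transformation of a kernel.
    """
    sx, sy = segments(seq_x, seg_len), segments(seq_y, seg_len)
    wx = np.ones(len(sx)) if weight is None else np.array([weight(s) for s in sx])
    wy = np.ones(len(sy)) if weight is None else np.array([weight(s) for s in sy])
    gram = np.array([[spectrum_kernel(a, b, k) for b in sy] for a in sx])
    # Average over all segment pairs so that sequence length does not dominate.
    return float(wx @ gram @ wy) / (len(sx) * len(sy))


if __name__ == "__main__":
    short, long_ = "ACGTACGTAC", "TTACGTACGGACGTAA"
    print(segments(short, 4))
    print(multi_instance_kernel(short, long_))
    # Hand-picked weighting that emphasizes segments containing "CG".
    print(multi_instance_kernel(short, long_, weight=lambda s: 2.0 if "CG" in s else 1.0))
```

Because every segment of one sequence is compared with every segment of the other, the two inputs need not have the same length, and a matching signal contributes to the kernel value wherever it occurs.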

CoMIK thus identifies not only the features that are useful for classification but also their locations in the variable-length sequences. Additionally, by leveraging intuitive visualizations that we developed recently, CoMIK's predictions can be interpreted at the level of individual features and segments.

For further details, kindly refer to our paper (see link below).

3. Results

We demonstrate CoMIK's efficacy with results on two biological problems. The first covers the typical scenario of classifying equal-length core promoter sequences in yeast that show either high or low promoter activity (Lubliner et al., Genome Research 25(7)). The second involves classifying variable-length restriction fragments reported from a 5C experiment in three human cell lines (Sanyal et al., Nature 489(7414)) as interacting or non-interacting with respect to a locus of interest.

For the yeast data, CoMIK not only achieved high quantitative performance but also accurately identified the important predictive features and located them in segments per sequence, in agreement with what Lubliner et al. noted earlier. Moreover, CoMIK identified the important segments without any prior knowledge of the segment positions within the sequences.

For the 5C data as well, CoMIK attains quantitative performance comparable to that of an earlier computational approach (Nikumbh and Pfeifer, BMC Bioinformatics 18(1)) on all three cell lines. In addition, CoMIK allows the features and their locations within the variable-length restriction fragments to be interpreted, which was not possible with the earlier approach.

See below for a link to our paper for more details.

4. Conclusions

CoMIK bridges the gap left by existing string kernels by comparing variable-length (genomic) sequences in a discriminative setting, while still handling the typical scenario of comparing equal-length sequences. As a result, users no longer need to force all candidate sequences in a study to the same length.

More specifically, with regard to long-range chromatin interactions studied by chromosome conformation capture (3C)-based experiments, we envisage that CoMIK's ability to locate signal within a variable-length sequence could be useful in studying the so-called structural interactions between the intervening chromatin of the interacting loci.

5. Future ideas/collaborators needed to advance research?

Our immediate next goal is to apply CoMIK to recent, high-resolution, genome-wide long-range interaction data.

We look forward to collaborating with researchers on novel use cases where they think CoMIK could be applicable or useful. We also welcome suggestions and requests for new features in CoMIK.





I am a PhD researcher at the Max Planck Institute for Informatics (MPI-INF), Saarbrücken, Germany working on machine learning/statistical learning and computational biology. Specifically, the focus...

Round: Peer Prize Round





