[Leetcode]187. Repeated DNA Sequences

Question

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

1
2
3
4
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Solution

We can solve this problem in 2 ways.

  1. Use sliding window to get every 10-letter-long subString of the input string. Store them in a hash set. Every time insert new substring to hash set, check if it’s exist in hash set. If exist, put it in a result hash set. We choose hash set to store the result, because the result need to be unique.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {

    HashSet<String> subStrings = new HashSet<String>();
    HashSet<String> repeatedSubstrings = new HashSet<String>();

    for(int i = 0 ; i <= s.length()-10; i++){
    String subS = s.substring(i,i+10);
    if( !subStrings.add(subS) ){
    repeatedSubstrings.add(subS);
    }
    }

    return new ArrayList(repeatedSubstrings);
    }
    }
  2. The solution before is pretty straight forward but need a lot memory to store the sub-strings. We can optimize the algorithm by using bit manipulate. There are only 4 letters, we can use 2 bit to present a letter. 00, 01, 10, 11 stand for A, C, T, G. An integer is 32 bits. so we can use an integer rather than a string to present a substring.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
    List<String> result = new LinkedList<String>();

    if(s == null || s.length() < 10 ){
    return result;
    }

    HashMap<Character,Integer> map = new HashMap<Character, Integer>();
    map.put('A',0); // 00
    map.put('C',1); // 01
    map.put('T',2); // 10
    map.put('G',3); // 11

    Set<Integer> subStrings = new HashSet<Integer>();
    Set<Integer> repeated = new HashSet<Integer>();
    int temp = 0;

    for(int i = 0 ; i < s.length(); i++ ){
    temp = (temp << 2) + map.get( s.charAt(i) );
    if( i >= 9 ){
    temp = temp & ( (1 << 20) - 1 ); // temp & 111111 , filt the last 20 digits.
    if( subStrings.contains(temp) && !repeated.contains(temp) ){
    result.add( s.substring(i-9,i+1) );
    repeated.add(temp);
    }else{
    subStrings.add(temp);
    }
    }
    }

    return result;
    }
    }
写得好!朕重重有赏!