Question
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
1 | Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", |
Solution
We can solve this problem in 2 ways.
Use sliding window to get every 10-letter-long subString of the input string. Store them in a hash set. Every time insert new substring to hash set, check if it’s exist in hash set. If exist, put it in a result hash set. We choose hash set to store the result, because the result need to be unique.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
HashSet<String> subStrings = new HashSet<String>();
HashSet<String> repeatedSubstrings = new HashSet<String>();
for(int i = 0 ; i <= s.length()-10; i++){
String subS = s.substring(i,i+10);
if( !subStrings.add(subS) ){
repeatedSubstrings.add(subS);
}
}
return new ArrayList(repeatedSubstrings);
}
}The solution before is pretty straight forward but need a lot memory to store the sub-strings. We can optimize the algorithm by using bit manipulate. There are only 4 letters, we can use 2 bit to present a letter. 00, 01, 10, 11 stand for A, C, T, G. An integer is 32 bits. so we can use an integer rather than a string to present a substring.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
List<String> result = new LinkedList<String>();
if(s == null || s.length() < 10 ){
return result;
}
HashMap<Character,Integer> map = new HashMap<Character, Integer>();
map.put('A',0); // 00
map.put('C',1); // 01
map.put('T',2); // 10
map.put('G',3); // 11
Set<Integer> subStrings = new HashSet<Integer>();
Set<Integer> repeated = new HashSet<Integer>();
int temp = 0;
for(int i = 0 ; i < s.length(); i++ ){
temp = (temp << 2) + map.get( s.charAt(i) );
if( i >= 9 ){
temp = temp & ( (1 << 20) - 1 ); // temp & 111111 , filt the last 20 digits.
if( subStrings.contains(temp) && !repeated.contains(temp) ){
result.add( s.substring(i-9,i+1) );
repeated.add(temp);
}else{
subStrings.add(temp);
}
}
}
return result;
}
}