But what is SAM?

Here is the video

But what is SAM

Welcome to AlgoRhythm, I’m Mike.

Introduction and Structure

There is a string. For every substring, we define that “End Position”, or “EndPos”, is the set of positions where a substring occurs. An “Equal EndPos Class” is the set of substrings with the same EndPos. We can call it “EC”. It can be proved that if we sort every substrings in an EC by length, each substring erase the first character is the former string.

SAM, or Suffix Automaton describes the relations of ECs. SAM contains all the information of a string, which means that you can get the original string uniquely from a SAM.

One node of an SAM representes an EC. There are two kinds of edges, Links and Transitions.

Nodes with Links constructed a tree, Link-Tree. Every Node’s EndPos is its father’s EndPos’s subset. The root’s EC only contains empty string, which occurs everywhere. The definition of Link means that when the EC A’s string occurs, A’s father’s string occurs, too. So all of A’s father’s strings are the suffix of all A’s strings.

Nodes with Transitions constructed a DAG. Each Transition has a character as the value. No more than one Transition with the same value could begin from the same node. If there is a Transition from A to B, with the value c, every string from A add a character c in the end can be found in B.

Construction

SAM can be constructed in linear complexity, both time and space.

We choose the method of addtion.

Everytime we add a character at the end of the string, the new string will have a set of new substrings. Clearly, these substrings are suffixes, and can be arranged as a triangle.

We divide the triangle, and except the tip, there are lots of trapezoidals. Let strings with the same EndPos in a same trapezoidal, representing the EndPos’s node.

Note that current whole string’s node is X. When adding a character C, a new node should be added. We define the new node is N. From X, jump along Links, if there is no Transition with the value C, create a new Transition from current node to N, if there is already a Transition with value C, note the node as Y, the node that Y’s Transition C pointing at is T.

Because every node have series of strings, we define the length of a node is the length of the longest string in the node. So the length of N is the length of X plus 1. Define that Y’s length is yy, T’s length is tt. If t=y+1t = y + 1, we can let N’s Link point at T.

If not, split T as T1 and T2. So the Transitions used to point at T should point at T1 or T2. Let strings longer than y+1y + 1 belongs to T2, so the former Transitions’ beginings whose length is longer than yy should point at T2. The rest of those strings belongs to T1, and the rest of those former Transitions should point at T1. So the length of T1 is y+1y + 1, and the length of the shortest string in T2 is y+2y + 2. Former Transitions that begings from T, should be copied, both T1 and T2 should have these Transitions. T1’s Link is former T’s Link.

We know that on the Link Tree, From X to Y’s son, their Transitions of C all point at N. So N’s shortest string’s length is (y+1)+1=y+2(y + 1) + 1 = y + 2, and T2’s shortest string’s length is y+2y + 2, too. Then, as T1’s length is y+1y + 1, so N and T2’s Link is T1.

Explain

Back to the triangle constructed with new suffixes while adding a new character.

We divide the triangle, and the top of the triangle can be devided as existing SAM nodes.

However, sometimes only a part of the existing node can match with the devided triangle, which means that the Endpos of the strings of the node are no longer the same. The matched part has a new occuring position, and the unmatched part’s Endpos doesn’t change. So as the defination of a Node: “Every string in one Node has the same Endpos”, the Node should split.

Practice

For example, here is the whole construction of the SAM of the string “BANANA”.

翻译: Zeyi Lee 校正: Michael

但是, 什么是后缀自动机?

欢迎来到 AlgoRhythm, 我是 Mike。

简介与结构

有一个字符串。对于每个子串,我们定义 “End Position”, 即 “EndPos” 表示子串出现的位置集合。“Equal EndPos Class” 是具有相同 EndPos 的子串集合。我们可以称之为“EC”。可以证明,如果我们按长度对每个EC中的子串进行排序,每个子串删除第一个字符后就是前一个子串。

后缀自动机(SAM)描述了 EC 之间的关系。SAM 包含了字符串的所有信息,这意味着你可以根据 SAM 唯一地确定原始字符串。

SAM 的一个节点表示一个 EC。有两种类型的边,链接和转移。

节点和链接构成了一棵树,称为 “Link-Tree”。每个节点的 EndPos 是其父节点的 EndPos 的子集。根节点的 EC 只包含空字符串,它在任何位置都出现。链接的定义意味着当 EC A 中的字符串出现时,A 的父节点的字符串也会出现。因此,所有A的父节点的字符串都是所有A的字符串的后缀。

节点和转移构成了一个有向无环图(DAG)。每个转移边都有一个字符作为权值。以同一节点为起点,最多存在一个同字符的转移边。如果从 A 到 B 有一个值为 c 的转移,则 A 中的每个字符串在末尾添加字符 c 后得到的新字符串都可以在 B 中找到。

构造

SAM 可以在线性时空复杂度内构造。

我们选用增量构造法。

每次在字符串末尾添加一个字符,新字符串将具有一系列的新子串。显然,这些子串都是新字符串的后缀,并且可以排列成一个三角形。

我们将三角形分割,除了顶部,剩下的部分都是梯形。将具有相同 EndPos 的字符串划分到同一个梯形中,这个梯形代表的就是这个 EndPos 对应的节点。

记当前整个字符串所在的节点为 X。当在其末尾添加字符 C 时,应该新建一个节点。我们将新节点记为 N。从 X 出发,沿着 Link 向上跳,如果当前节点没有字符 c 的转移,从当前节点新建一条指向 N 的转移边;如果遇到了已经存在一个字符 c 的转移的节点,将当前节点记为 Y,Y 的字符 c 的转移指向的节点记为 T。

因为每个节点都有一个或多个字符串,我们将节点的长度定义为节点中最长字符串的长度。因此,N 的长度是 X 的长度加 1。定义 Y 的长度为 y,T 的长度为 t。如果 t = y + 1,我们就可以断定 N 的 Link 指向 T。

如果 t 不等于 y + 1,将 T 分裂为 T1 和 T2。因此,原来指向 T 的转移现在应该指向 T1 或 T2。令长度超过 y + 1 的字符串属于 T2,所以前面提到的节点中,长度超过 y 的节点的字符 c 的转移边应该指向T2。令 T 中其余的字符串属于 T1,因此前面提到的节点中,其余部分的节点的字符 c 的转移应该指向 T1。因此,T1 的长度为 y + 1,T2 中最短字符串的长度为 t + 2。原来以 T 为起点的所有转移应该被复制,T1 和 T2 都应该存在这些转移。T1 的 Link 指向之前 T 的 Link 指向的节点。

我们知道,在 Link-Tree 上从 X 到 Y 的子节点,它们的字符 c 的转移都指向 N。因此,N 的最短字符串的长度为 (y + 1) + 1 = y + 2,T2 的最短字符串的长度也为 y + 2。又因为T1的长度为 y + 1,所以 N 和 T2 的 Link 指向 T1。

解释

重新审视在添加新字符时产生的三角形。

我们将三角形分割,三角形的上半部分可以被划分为现有的SAM节点。

然而,有时候只有现有节点的一部分与三角形分裂出来的部分匹配,这意味着节点中字符串的 EndPos 不再相同。匹配的部分的字符串有一个新的出现位置,未匹配的部分的字符串的 EndPos 保持不变。因此,根据节点的定义:“一个节点中的每个字符串具有相同的 EndPos”,这个节点应该分裂。

实例

例如,接下来是字符串 “BANANA” 的后缀自动机的完整构造过程。