Frequent Subgraph Mining: An introduction

Frequent Subgraph Mining (FSM for short) is one of the central areas of graph mining. It is closely related to Frequent Pattern Mining, a subfield of Data Mining. Just like Frequent Pattern Mining, FSM aims to find frequently occurring patterns in the underlying data set and thereby to generate new knowledge. FSM can be divided into two different classes, depending on the type of data you want to analyze:

  • Transactional data: The dataset contains a set of smaller graphs. The aim of FSM here is to find frequent subgraphs in this set of graphs.
  • Large graph data: The dataset consists of a single large connected graph. The aim of FSM here is to find which subgraphs (or patterns) occur frequently in this graph.

In the following, however, we will concentrate only on large graph data, as a consideration of both scenarios would go beyond the scope of this post. The basic structure of FSM algorithms is usually based on the same three steps, which are an integral part of FSM:

  1. Candidate generation: The algorithm has to generate candidates (subgraphs of the input graph), which serve as the basis for the second step.
  2. Isomorphism: In this step the algorithm has to check whether a candidate is a valid subgraph of the given input graph or not. This problem is computationally very hard, as we will see later.
  3. Significance computation: The last step consists of computing the number of occurrences of a candidate in the input graph, to evaluate whether the candidate is frequent or not.

Because these steps come with several problems of their own, the following sections will discuss them in more detail. But before we go any deeper into the individual steps of FSM, we must first define what is meant by a graph:

Definition 1 (Directed labeled graph)
Let \( G=(V, E, l) \) be a directed labeled graph with a set of nodes \( V \), a set of edges \( E\subseteq V \times V \) and a function \( l: V \cup E \to L \). An edge \( e = (u,v) \) consists of two nodes \( u,v \in V \), with \( u \) as the source and \( v \) as the target of the respective edge. The labeling function \( l \) assigns to each node and edge an attribute from the set \( L \).

This means, in fact, that the edges and nodes of a graph can carry labels, which serve as attributes for the respective node or edge.
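As a small illustration, such a directed labeled graph can be modeled with two dictionaries that together play the role of the labeling function \( l \). This is just a minimal sketch for this post; the class and method names are made up, not taken from any library:

```python
from dataclasses import dataclass, field

@dataclass
class LabeledGraph:
    # v -> l(v): label of each node
    node_labels: dict = field(default_factory=dict)
    # (u, v) -> l(u, v): label of each directed edge from source u to target v
    edge_labels: dict = field(default_factory=dict)

    def add_node(self, v, label):
        self.node_labels[v] = label

    def add_edge(self, u, v, label):
        self.edge_labels[(u, v)] = label

g = LabeledGraph()
g.add_node(1, "A")
g.add_node(2, "B")
g.add_edge(1, 2, "x")  # directed edge from node 1 to node 2, labeled "x"
```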

1. Candidate generation

Candidate generation is a fundamental part of any FSM algorithm. It is not explicitly specified which patterns or which subgraphs are to be searched for. Instead, the algorithm should generate subgraphs independently and then check whether these subgraphs occur frequently in the given data set. For this purpose, we first define what is meant by a subgraph:

Definition 2 (Subgraph)
Given a directed labeled graph \(G = (V, E, l)\). A graph \(G_s = (V_s, E_s, l_s)\) is called a subgraph of \(G\) if and only if (i)\(V_s \subseteq V \land \forall v \in V_s: l_s(v) = l(v)\); (ii)\(E_s \subseteq E \land \forall (u,v) \in E_s: l_s(u,v) = l(u,v)\).

Figure 1: Two different approaches to generate candidates.

According to this definition, a subgraph consists of a subset of the nodes and edges of another graph.
There are two different approaches to generate subgraphs. In figure 1 you can see the two main approaches to generate candidates: the join-based approach and the pattern-growth-based one. Both approaches use the so-called anti-monotonicity property (also called the a-priori property). With it, the search space can be limited considerably, since not all possibilities have to be searched anymore. This means that we only have to build a small subset of the candidates that could be built out of the nodes and edges of the input graph. The following definition contains the exact description of the anti-monotonicity property that we use to achieve this.

Definition 3 (Anti-monotonicity)
A subgraph \(G_s\) can only be frequent if all of its subgraphs are also frequent.

According to this property we only have to build our candidates out of subgraphs for which we already know that they occur frequently in our input graph. Both approaches use this circumstance to shrink the search space for new candidates (as already mentioned). The next two paragraphs will explain the two approaches in more detail.

1.1. Join-based approach

This strategy is very much based on the Apriori algorithm, which is often used for association analysis in the field of databases. Therefore, this approach is often referred to in the literature as the Apriori-based approach. Here, new candidates are generated from the candidates for which it has already been verified that they occur (frequently) in a set of graphs or in a single graph. The approach works bottom-up. In detail, the candidate generation works as follows:

  1. First, all possible subgraphs with size \(k = 1\) (\(k =\) number of edges) are generated.
  2. For the newly created subgraphs, it is checked whether they occur in a set of graphs or in a single graph. If this is the case, they are included in the result set \(A\).
  3. New \(k+1\) subgraphs are generated by a join of two similar, but not perfectly identical, \(k\) subgraphs contained in the set \(A\).
  4. The algorithm terminates when no new subgraphs are added to \(A\) in an iteration.
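The four steps above can be sketched as a simple loop. Note that this is a deliberately simplified toy model, not the original algorithm: subgraphs are represented as frozensets of directed labeled edges, two \(k\)-edge candidates count as "similar" if they share \(k-1\) edges, their join is the union, connectivity checks are omitted, and `is_frequent` is a placeholder for the frequency check of step 2:

```python
def apriori_fsm(edges, is_frequent):
    result = []
    # step 1: all subgraphs of size k = 1 (one edge each)
    candidates = {frozenset([e]) for e in edges}
    while candidates:
        # step 2: keep only the candidates that occur frequently enough
        frequent = [c for c in candidates if is_frequent(c)]
        result.extend(frequent)
        # step 3: join similar (but not identical) k-subgraphs into (k+1)-subgraphs
        candidates = set()
        for a in frequent:
            for b in frequent:
                if a != b and len(a & b) == len(a) - 1:
                    candidates.add(a | b)
        # step 4: the loop ends once no new subgraphs are generated
    return result
```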

The most important step here is the join, i.e. the step in which a new \(k\)-subgraph is created from two similar \(k-1\)-subgraphs. A new \(k\)-subgraph can only be created from two \(k-1\)-subgraphs if those two subgraphs are similar. In figure 2 you can see an example of the whole procedure.

Figure 2: The join of two similar subgraphs.

One problem with this approach to candidate generation is the fact that we have to compute the similarity of two subgraphs. This problem can be reduced to that of isomorphism (which will be explained later on), which is really hard to compute (NP-complete). The join operation itself can also be extremely costly in terms of computation, because we have to determine which part of one graph corresponds to the equivalent part of the other graph. This is the reason why the pattern-growth-based approach was invented.

1.2. Pattern growth based approach

In this approach, new candidates are again obtained from the subgraphs already contained in the result set \(A\). The difference to the join-based approach is, however, that no join is required in this procedure. So instead of joining two similar frequent subgraphs, we just add another edge if we want to connect two already existing nodes in a pattern. If we want to add another node, we add a new edge plus a new node.
By not requiring a join, this approach eliminates the overhead in the computation and the algorithm is thus faster. However, one problem that arises with this approach is that the same subgraph can be generated several times. According to Khan and Ranu [1], one can counteract this problem by using the so-called rightmost extension.
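A single pattern-growth extension step can be sketched as follows. The helper below is purely illustrative (it is not gSpan's actual API) and ignores the rightmost-extension restriction; it simply enumerates the two cases described above:

```python
def extend_pattern(pattern_edges, pattern_nodes, candidate_edges):
    """Return all one-edge extensions of a pattern; edges are (u, v, label) triples."""
    extensions = []
    for (u, v, label) in candidate_edges:
        if (u, v, label) in pattern_edges:
            continue  # edge already part of the pattern
        if u in pattern_nodes and v in pattern_nodes:
            # case 1: connect two already existing nodes with a new edge
            extensions.append(pattern_edges | {(u, v, label)})
        elif u in pattern_nodes or v in pattern_nodes:
            # case 2: add a new node together with its connecting edge
            extensions.append(pattern_edges | {(u, v, label)})
    return extensions
```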

2. Isomorphism

After the problem of candidate generation was explained in more detail in the previous section, we will now discuss another challenge, which occurs in the second step (checking whether a candidate appears in the input graph). As the name of the current section suggests, this is the problem of isomorphism or subgraph isomorphism. The challenge here is to determine whether two graphs are equal or not.

Definition 4 (Isomorphism)
Two graphs \(G_1 = (V_1, E_1)\) and \(G_2 = (V_2, E_2)\) are isomorphic to each other if they have the same topology. This means that there is an assignment of \(V_1\) to \(V_2\) such that every edge in \(E_1\) can be assigned to an edge in \(E_2\). In the case of labeled graphs, the assignment additionally has to map edges and nodes to identically labeled ones.

Thus, both graphs must have the same nodes and edges with the exact same labels. However, in the context of FSM we need to check whether a graph is part of another graph. This is then called subgraph isomorphism. From definition 4 the definition of subgraph isomorphism can now also be derived easily (cf. [2]):

Definition 5 (Subgraph isomorphism)
For two directed labeled graphs \(G = (V, E, l)\) and \(G_s = (V_s, E_s, l_s)\), a subgraph isomorphism is an injective function \(f: V(G_s) \to V(G)\) such that:
1. \(\forall v \in V(G_s): l_s(v) = l(f(v))\)
2. \(\forall (u,v) \in E(G_s): (f(u),f(v)) \in E(G) \land l_s(u, v) = l(f(u), f(v))\)
where \(l\) and \(l_s\) are the labeling functions of \(G\) and \(G_s\), respectively. \(f\) is then called an embedding of \(G_s\) in \(G\).

This means that the node set \(V_s\), edge set \(E_s\) and the labels \(l_s\) of \(G_s\) must be a subset of the nodes, edges and labels of \(G\). \(G_s\) is therefore a subgraph of \(G\), since there is a subgraph isomorphism between the two graphs (\(G_s \subseteq G\)). \(G\) can then also be called supergraph of \(G_s\).
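Definition 5 can be checked naively by trying every injective mapping \(f\) and testing both conditions. The brute-force sketch below (not a production algorithm) makes the exponential cost of the problem visible; graphs are given as plain label dictionaries:

```python
from itertools import permutations

def subgraph_isomorphic(gs_nodes, gs_edges, g_nodes, g_edges):
    """gs_nodes/g_nodes: node -> label; gs_edges/g_edges: (u, v) -> label."""
    pattern = list(gs_nodes)
    # try every injective assignment f of pattern nodes to input-graph nodes
    for image in permutations(g_nodes, len(pattern)):
        f = dict(zip(pattern, image))
        labels_ok = all(gs_nodes[v] == g_nodes[f[v]] for v in pattern)  # condition 1
        edges_ok = all((f[u], f[v]) in g_edges                          # condition 2
                       and g_edges[(f[u], f[v])] == lab
                       for (u, v), lab in gs_edges.items())
        if labels_ok and edges_ok:
            return True  # f is an embedding of G_s in G
    return False
```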
Of course, it can be computationally difficult to check for each candidate whether there is an isomorphism/subgraph isomorphism to another graph, because subgraph isomorphism belongs to the NP-complete problems [3]. One way to check whether two graphs are isomorphic to each other is canonical labeling. In order to compute such a canonical label for each graph, you need a so-called adjacency matrix:

Definition 6 (Adjacency matrix)
The adjacency matrix of a graph \(G\) with \(n\) nodes is defined as an \(n \times n\) matrix, where the entries on the diagonal correspond to the node labels and all remaining entries correspond to the edge labels of the graph \(G\). If there is no edge between two nodes, the respective field in the adjacency matrix is filled with \(0\).

Figure 3: A graph with its adjacency matrix.

In figure 3 we can see an example of an adjacency matrix of a graph. As already mentioned in definition 6, the individual entries of the matrix correspond to the labels of the respective edges and nodes.
We can also see in this figure that each graph can be assigned a canonical code. To generate it, we simply traverse the adjacency matrix from the top left to the bottom right; the canonical code is then typically taken as the lexicographically smallest (or largest) such code over all node orderings, so that isomorphic graphs receive the same code.
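A sketch of this idea (the function names are made up for this post): read the matrix cells row by row for a given node ordering, then take the minimum over all orderings as the canonical code, so isomorphic graphs end up with the same string. Brute-forcing all \(n!\) orderings is of course only feasible for tiny graphs:

```python
from itertools import permutations

def matrix_code(order, node_labels, edge_labels):
    """Concatenate the adjacency-matrix cells from top left to bottom right."""
    cells = []
    for i in order:
        for j in order:
            if i == j:
                cells.append(str(node_labels[i]))              # diagonal: node label
            else:
                cells.append(str(edge_labels.get((i, j), 0)))  # off-diagonal: edge label or 0
    return "".join(cells)

def canonical_code(node_labels, edge_labels):
    # canonical code: the lexicographically smallest code over all node orderings
    return min(matrix_code(p, node_labels, edge_labels)
               for p in permutations(node_labels))
```

Two isomorphic graphs with different node identifiers then produce the same canonical code.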

3. Significance computation

The different instances of a candidate in an input graph can overlap and thus violate the anti-monotonicity property, which, however, is an important part of most FSM algorithms, since it can be used to narrow down the search space. Because of this, different definitions of the significance of a candidate have emerged in recent years. In the following, some of these definitions are discussed in detail.

3.1. Maximum independent set (MIS)

Kuramochi et al. [4] developed an algorithm to calculate the significance of a candidate based on the maximum independent set. To be able to compute this quantity, an overlap graph is constructed. First, the definitions of edge-disjointness and of the simple overlap between two graphs or two instances follow, so that it can then be explained how the overlap graph is constructed.

Definition 7 (Edge-disjointness)
Two graphs \(G_1 = (V_1, E_1)\) and \(G_2 = (V_2, E_2)\) are edge disjoint if \(E_1 \cap E_2 = \emptyset\).

Definition 8 (Simple overlap)
Given a candidate \(G_s = (V(G_s), E(G_s))\). There exists a simple overlap of two instances \(\varphi_1\) and \(\varphi_2\) of a candidate \(G_s\) if \(\varphi_1(E(G_s)) \cap \varphi_2(E(G_s)) \neq \emptyset\).

The overlap graph is then constructed out of the individual instances of a candidate which are located in the input graph:

Definition 9 (Overlap graph)
Given the set of all instances of a candidate \(G_s\) in a graph \(G\). The overlap graph of \(G_s\) in \(G\) is then the graph whose nodes are non-identical instances and whose edges connect the nodes whose instances are not edge-disjoint, i.e., correspond to a simple overlap.

So the overlap graph consists of one node for every instance. Two nodes are connected if the corresponding instances are not edge-disjoint and thus share one or more edges in the input graph. This is then called a simple overlap. The frequency of a candidate is then given by the size of a maximum independent set (a set of nodes that have no connections to each other) in our overlap graph. In figure 4 we can see an example. In this case the subgraph has a frequency of 2, because a maximum independent set in our overlap graph has size 2.

Figure 4: The instances \(\varphi_1, \varphi_2, \varphi_3\) of the candidate \(G_s\) in a graph \(G\) and the resulting overlap graph.
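The MIS computation can be sketched in a few lines, assuming the instances are already given as sets of input-graph edges (finding them is the subgraph-isomorphism step). The maximum independent set is found by brute force here, which reflects the hardness of the problem and is only feasible for small overlap graphs:

```python
from itertools import combinations

def mis_support(instances):
    """instances: list of edge sets, one per instance of the candidate."""
    n = len(instances)
    # nodes of the overlap graph are instance indices; an edge means a simple
    # overlap, i.e. the two instances share at least one input-graph edge
    overlaps = {(i, j) for i, j in combinations(range(n), 2)
                if instances[i] & instances[j]}
    # search for the largest set of pairwise non-overlapping instances
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if not any(pair in overlaps for pair in combinations(subset, 2)):
                return size
    return 0
```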

3.2. Harmful Overlap (HO)

Another definition of the significance of a candidate is that of Fiedler and Borgelt [5]. Here, just as in MIS, an overlap graph is constructed and the significance is then the size of a maximum independent set of the overlap graph.
However, Fiedler and Borgelt have established a different way of defining the overlap of two instances, called harmful overlap. In contrast to the simple overlap, a harmful overlap already arises when two instances share one or more nodes. A harmful overlap can thus be defined as follows:

Definition 10 (Harmful overlap)
Given a candidate \(G_s = (V(G_s), E(G_s))\). There exists a harmful overlap of two instances \(\varphi_1\) and \(\varphi_2\) of a candidate \(G_s\) if \(\varphi_1(V(G_s)) \cap \varphi_2(V(G_s)) \neq \emptyset\).

If we applied this metric to the example in figure 4, we would end up with an overlap graph that has two additional edges: between \(\varphi_1\) and \(\varphi_3\) and between \(\varphi_2\) and \(\varphi_3\). Thus, according to definition 10, the frequency of the candidate would be 1.

3.3. Minimum image based support (MNI)

The computations for the significance functions MIS and HO introduced above are each NP-complete, since the construction of the overlap graph requires computing the subgraph isomorphisms of all instances of a pattern. To address this challenge, Bringmann et al. [6] devised a simpler significance function based on the number of unique nodes in a graph \(G = (V, E)\) that can be assigned to a node of a candidate \(G_s = (V_s, E_s)\).

Figure 5: A graph with four different instances of a candidate (Source: [6]).

Figure 5 shows an example of MNI significance. The subgraph occurs a total of 4 times in the input graph. However, its significance is only 3, because each node of the subgraph can only be assigned to three different nodes in the input graph.
Formally, we can define the MNI frequency of a candidate in an input graph as follows:

Definition 11 (Minimum image based support)
Given a graph \(G = (V, E)\) and a candidate \(G_s = (V_s, E_s)\), the minimum image based support is then defined as: $$\sigma_{\wedge}(G_s, G) = \min_{v \in V_s} \left| \{ \varphi_i(v) \mid \varphi_i \text{ is an occurrence of } G_s \text{ in } G \} \right|$$ i.e. the minimum, over all candidate nodes \(v\), of the number of distinct input-graph nodes that \(v\) is mapped to.
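Given all embeddings of a candidate (each a mapping from candidate nodes to input-graph nodes), the MNI support is easy to compute, which is exactly its appeal. A minimal sketch:

```python
def mni_support(pattern_nodes, embeddings):
    """embeddings: list of dicts mapping each candidate node to an input-graph node."""
    # for every candidate node, count its distinct images over all embeddings,
    # then take the minimum of these counts
    return min(len({f[v] for f in embeddings}) for v in pattern_nodes)
```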

4. Conclusion

In this blog post we have discussed the individual steps of FSM. For every step I've presented the corresponding problems and shown different metrics or algorithms to solve them. These metrics and algorithms very often serve as the basis for concrete FSM techniques. In a further post, I will present one of these specific FSM algorithms, which can also be executed in parallel on a computer cluster (instead of a single computer).
For the interested reader I've listed the literature used here, to dive deeper into the field of FSM:

  1. Khan, Arijit and Ranu, Sayan: Big-Graphs: Querying, Mining, and Beyond. In: Handbook of Big Data Technologies, pages 531-582. Springer International Publishing, 2017
  2. Aggarwal, Charu C. and Wang, Haixun (editors): Managing and Mining Graph Data. Springer US, 2012
  3. Brandstädt, A., V. B. Le, and J. P. Spinrad: Graph Classes: A Survey. p. 306, Society for Industrial and Applied Mathematics, 1999
  4. Kuramochi, Michihiro and George Karypis: Finding Frequent Patterns in a Large Sparse Graph. In: Data Mining and Knowledge Discovery, 11(3):243-271, 2005
  5. Fiedler, Mathias and Christian Borgelt: Support Computation for Mining Frequent Subgraphs in a Single Graph. In: Workshop on Mining and Learning with Graphs, 2007.
  6. Bringmann, Björn and Siegfried Nijssen: What is frequent in a single graph? In: Advances in Knowledge Discovery and Data Mining, pages 858-863, 2008.

Citation

If you want to cite this article, please do it in the following way:

Ole Fenske, "Frequent Subgraph Mining: An introduction". https://ole-fenske.de/?p=815, 2018.

Bibtex format:

@misc{fenske2018fsm,
  author = {Fenske, Ole},
  title = {{Frequent Subgraph Mining: An introduction}},
  year = {2018},
  howpublished = {\url{https://ole-fenske.de/?p=815}},
}
