{"id":470,"date":"2007-07-15T18:52:17","date_gmt":"2007-07-15T18:52:17","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2007\/07\/15\/cliques-subgraphs-and-a-bit-of-biology\/"},"modified":"2007-07-15T18:52:17","modified_gmt":"2007-07-15T18:52:17","slug":"cliques-subgraphs-and-a-bit-of-biology","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2007\/07\/15\/cliques-subgraphs-and-a-bit-of-biology\/","title":{"rendered":"Cliques, Subgraphs, and a bit of biology"},"content":{"rendered":"<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" alt=\"pclique.jpg\" src=\"https:\/\/i0.wp.com\/scientopia.org\/img-archive\/goodmath\/img_202.jpg?resize=145%2C118\" width=\"145\" height=\"118\" class=\"inset right\" \/><br \/>\nToday, I&#8217;m going to talk a bit about two closely related problems in graph theory: the maximal clique detection problem, and the maximal common subgraph problem. The two problems are interesting both on their own as easy-to-understand but hard-to-compute problems; and also because they have so many applications. In particular, the two are used extensively<br \/>\nin bioinformatics and pharmacology for the analysis of complex molecular structure.<\/p>\n<p><!--more--><br \/>\nAs I mentioned back in the definitions post, a clique is a complete graph: a graph where all vertices are adjacent, where any pair of vertices has an edge connecting them.<br \/>\nThe clique problem is an optimization problem which asks: given a graph **G**, what&#8217;s the<br \/>\nlargest subgraph of **G** which is a clique? If we convert that to a decision problem, we get:<br \/>\n&#8220;Is there a clique of N vertices contained in **G**?&#8221;<br \/>\nThe maximal common subgraph problem is another optimization problem which asks: Given two graphs **G** and **H**, what is the largest subgraph of G which is isomorphic to a subgraph of H? 
In its representation as a decision problem, it becomes: is there a subgraph in **G** with N nodes which is isomorphic to a subgraph of **H**?<br \/>\nFollowing the usual pattern of these things, for both problems the decision problem is NP-complete, and the corresponding optimization problem is NP-hard. For both problems, in the worst case, you need to exhaustively consider all possible subgraphs in order to find the solution. In both cases, there are useful heuristics for pruning the search space, but as with other NP-complete\/NP-hard problems, either the heuristics sometimes fail to produce the optimal solution, or they sometimes wind up taking exponential time.<br \/>\nBut still, heuristics are the key. In most applications of the clique problem (or the maximal common subgraph problem), you want the optimum, but a solution which falls just short of the optimum is nearly as good. Again, this is the common pattern for solutions of NP-complete problems. Perfect solutions are hard to come by, but good ones<br \/>\nare often within reach.<br \/>\n(A great example of this comes from a talk I saw a few years ago by a software engineering researcher named Daniel Jackson. Daniel had built a really cool tool called Alloy, which did some neat analysis. In the course of his talk, he was describing how the analysis actually worked. The way he described it was: &#8220;The bad news is, it&#8217;s reducible to 3-SAT, which is NP complete. The good news is, it&#8217;s reducible to 3-SAT, which means we can compute the answer really fast.&#8221;)<br \/>\nAnyway &#8211; the basic heuristics for these two problems are very similar. I&#8217;ll describe the method for the clique problem; finding the largest common clique is then a good starting point for finding the largest common subgraph.<br \/>\nFor the maximum clique, there&#8217;s a clever heuristic which will, on average, find a large maximal clique. The idea is: find the set of all cliques of size 2. 
Then, given that set, try to find cliques within it that can be merged. When you find two cliques that can be merged into a clique, remove those two, and replace them with their merge. Keep doing that until you can no longer merge any members of the set.<br \/>\nThis can be implemented quite efficiently. (In fact, it&#8217;s just a special case of the good<br \/>\nold classic union-find algorithm, which I&#8217;ll probably write about one of these days.) It&#8217;s<br \/>\n*not* a true solution to the problem, because it *doesn&#8217;t* always produce the maximal<br \/>\nclique. It produces *a* maximal clique over its members, but not necessarily *the* maximal<br \/>\nclique over G.<br \/>\nLet&#8217;s pause for just a moment, so that I can explain that last line, which sounds quite confusing.<br \/>\nTake a graph G=(V,E), and a set of vertices C={v<sub>1<\/sub>,&#8230;,v<sub>n<\/sub>}&sub;V,<br \/>\nsuch that the induced subgraph over C is a clique. C is *a* maximal clique over the vertices in C if there is no vertex in G that can be added to C to produce a larger clique.<br \/>\nC is *the* maximal clique in G if and only if there is *no larger clique* contained in G.<br \/>\nThe key difference is that in the first (a maximal clique over a set of vertices), we&#8217;re<br \/>\nasking for the largest clique containing *that specific set* of vertices; in the second, we&#8217;re asking for the largest clique formed from *any set* of vertices in G.<br \/>\nFor example, in the diagram below, the green clique is the maximal clique over the set of vertices {F,D,L}. There is no larger possible clique in the graph containing those three<br \/>\nvertices than the green 4-clique. 
But the maximal clique in G is actually the 6-clique containing the vertices {A, G, H, J, K, M}.<br \/>\n<img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" alt=\"maximal-cliques.jpg\" src=\"https:\/\/i0.wp.com\/scientopia.org\/img-archive\/goodmath\/img_203.jpg?resize=284%2C273\" width=\"284\" height=\"273\" class=\"inset\" \/><br \/>\nSo, how does this become useful for biology?<br \/>\n* In evolutionary development, one of the ways of recognizing many processes is<br \/>\nby looking at gene expression in terms of gene expression networks, which<br \/>\nare graphs. Regulatory systems are often <a href=\"http:\/\/www.pubmedcentral.nih.gov\/articlerender.fcgi?artid=133448\">revealed as cliques in the expression<br \/>\nnetwork.<\/a><br \/>\n* When studying protein structures, one of the major techniques is to<br \/>\nlook at similar proteins, and find common cliques in the graph-based<br \/>\nstructural networks of the proteins.<br \/>\n* In pharmacology, the relationship between different chemicals is studied<br \/>\nby finding the largest common subgraphs in the molecular graphs of the different<br \/>\nsubstances.<br \/>\n* Again in pharmacology, there is a graph-based method for <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/sites\/entrez?cmd=Retrieve&amp;db=PubMed&amp;list_uids=8097240&amp;dopt=Abstract\">searching for a compound that will interact<br \/>\nwith a specific biochemical target.<\/a> One of the ways that this is done is to generate<br \/>\na large library of graphs describing known molecules. Then, given a particular<br \/>\ntarget, you generate a *complementary* graph describing its structure; then you<br \/>\nsearch for graphs in the library that have large cliques in common with the<br \/>\ncomplementary graph of the target. 
The resulting list of molecules from the library<br \/>\nforms a set of likely candidates for a molecule that will produce the desired action<br \/>\nby bonding with the target.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today, I&#8217;m going to talk a bit about two closely related problems in graph theory: the maximal clique detection problem, and the maximal common subgraph problem. The two problems are interesting both on their own as easy-to-understand but hard-to-compute problems; and also because they have so many applications. In particular, the two are used extensively [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[25],"tags":[],"class_list":["post-470","post","type-post","status-publish","format-standard","hentry","category-graph-theory"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-7A","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/470","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"
embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/comments?post=470"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/470\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=470"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=470"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=470"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
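The post's clique-merging heuristic (start from all 2-cliques, then repeatedly merge pairs of cliques whose union is itself a clique) can be sketched in Python. This is a minimal illustrative sketch, not the post's actual implementation; the function name `greedy_maximal_cliques` and the edge-list representation are my own assumptions. Note that, as the post says, the largest clique this finds is *a* maximal clique, not necessarily *the* maximum clique of the graph.

```python
from itertools import combinations

def greedy_maximal_cliques(edges):
    """Greedy clique-merging heuristic (hypothetical sketch).

    Starts from the set of all 2-cliques (the edges of the graph),
    then repeatedly replaces any two cliques whose union is itself
    a clique with that union, until no further merges are possible.
    Returns the final list of cliques as frozensets of vertices.
    """
    # Symmetric adjacency relation, so is_clique can test either direction.
    adjacent = set()
    for u, v in edges:
        adjacent.add((u, v))
        adjacent.add((v, u))

    def is_clique(vertex_set):
        # A set of vertices is a clique iff every pair is adjacent.
        return all(pair in adjacent for pair in combinations(vertex_set, 2))

    cliques = [frozenset(e) for e in edges]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(cliques, 2):
            union = a | b
            if is_clique(union):
                # Replace the two mergeable cliques with their union.
                # Each merge shrinks the list by one, so this terminates.
                cliques.remove(a)
                cliques.remove(b)
                cliques.append(union)
                merged = True
                break
    return cliques

# Example: a 4-clique {A,B,C,D} plus a pendant edge (D,E).
edges = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
result = greedy_maximal_cliques(edges)
```

The `break`-and-restart scan is quadratic per merge, so this sketch is exponentially far from tuned; a real implementation would organize the merge candidates along the lines of the union-find processing the post alludes to.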