# Studying proteins and polymers using knot theory


Much of the data in science and engineering is expressed as point sets: literally tables of numbers. Yet in the real world, continuous structures appear in many shapes and forms. In particular, the structure of a protein or polymer can be described by curves in 3-dimensional space. Where one finds curves, one should expect entanglement to play a role. Indeed, by the second law of thermodynamics we should not expect polymers to arrange themselves in an ordered way, since ordered configurations are vastly outnumbered by disordered ones. Instead, long polymer chains thread through one another, and their entanglement determines their properties. Entanglement is the reason a melt of long polymers behaves like honey rather than water.

So "entanglement" is a genuine physical variable, resulting in measurable physical properties. But how does one characterize it, keeping local structures in mind?


## Knot data analysis

The notion of knot data analysis was formally introduced in {{< cite "shenKnotDataAnalysis2024" >}}, but concepts from knot theory were applied to such data earlier. In 2020,
the [Gauss link integral](https://en.wikipedia.org/wiki/Linking_number) was used in {{< cite "panagiotouTopologicalStudyProtein2020" >}} to study protein entanglement,
specifically to understand protein folding kinetics. The authors showed that the topology and geometry of a protein carry information about its folding rate.
Moreover, in {{< cite "baldwinLocalTopologicalFree2022" >}}, methods from topology were used to study the SARS-CoV-2 Spike protein.
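
To make the Gauss link integral concrete, here is a minimal numerical sketch. It approximates the double line integral for two polygonal curves with a midpoint rule. The function name `gauss_linking_integral` and the discretization are my own choices for illustration, not the implementation used in the cited papers.

```python
import numpy as np

def gauss_linking_integral(curve_a, curve_b, closed=True):
    """Midpoint-rule approximation of the Gauss link integral.

    curve_a, curve_b: arrays of shape (n, 3) and (m, 3) with the vertices
    of two polygonal curves. If closed is True, the last vertex of each
    curve is joined back to the first.
    """
    a = np.asarray(curve_a, dtype=float)
    b = np.asarray(curve_b, dtype=float)
    if closed:
        a = np.vstack([a, a[:1]])
        b = np.vstack([b, b[:1]])

    # Edge vectors and edge midpoints of both polygons.
    da, db = a[1:] - a[:-1], b[1:] - b[:-1]
    mid_a, mid_b = (a[1:] + a[:-1]) / 2, (b[1:] + b[:-1]) / 2

    # Vector from every edge midpoint of b to every edge midpoint of a.
    r = mid_a[:, None, :] - mid_b[None, :, :]
    dist = np.linalg.norm(r, axis=2)

    # Integrand (da x db) . r / |r|^3, summed over all pairs of edges.
    cross = np.cross(da[:, None, :], db[None, :, :])
    integrand = np.einsum("ijk,ijk->ij", cross, r) / dist**3
    return integrand.sum() / (4 * np.pi)
```

For two closed curves the result approximates the (integer) linking number, with a sign that depends on the orientations. For open curves, pass `closed=False`; the value is then a real number that changes continuously as the curves move, which is what makes it usable on protein backbones.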

In a previous post I discussed the notion of an [invariant]({{< ref "articles/topology/knot-theory/invariants" >}}) in topology: a quantity that remains unchanged under continuous deformation. Invariants are useful for answering a very specific question: are these two closed loops the same knot or not? An invariant assigns to every knot a kind of fingerprint, which can be used to tell different knots apart. However, the tools of knot theory are typically defined only for closed loops, and many structures, such as polymers, do not form closed loops. Moreover, ambient isotopy can significantly alter local structure while a globally defined knot invariant stays the same. Both problems are addressed in a recent paper {{< cite "songMultiscaleJonesPolynomial2025" >}}, which I discuss in this article.

First, consider the problem of open ends. It turns out that several knot invariants can be generalized to polynomials associated with open curves in 3-dimensional space. This is part of a recent series of papers starting with {{< cite "barkatakiJonesPolynomialCollections2022" >}}. The second problem is subtler. A single global "fingerprint" summarizes the entire chain at once, so it says nothing about where the tangling happens or how much of it is local. A biologist, on the other hand, usually wants to know which parts of the structure are actually entangled: which loop threads through which, and how tightly? This *local* information is what correlates with global properties such as flexibility, reactivity, and function. Capturing local entanglement in a descriptor is precisely what the authors of {{< cite "songMultiscaleJonesPolynomial2025" >}} set out to do. Crucially, they borrow notions from [persistent homology](https://en.wikipedia.org/wiki/Persistent_homology), one of the main workhorses of computational topology. In this article I give an informal description of their approach.

## Extending the Jones polynomial
The Jones polynomial is one of the most famous knot invariants; I discussed its precise definition in the article [invariants]({{< ref "articles/topology/knot-theory/invariants" >}}). To turn this invariant into a polynomial for open curves, {{< cite "barkatakiJonesPolynomialCollections2022" >}} introduced a rather clever technique based on averaging. For open curves, this "new" Jones polynomial has real coefficients and is a continuous function of the curve coordinates. Calling it *the Jones polynomial* is sensible, because it is shown that once the endpoints are joined and the curves close up, one recovers the ordinary Jones polynomial originally defined by Vaughan Jones.
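
For readers who want a formula: as I understand the construction, the average is taken over all directions from which the curve can be projected onto a plane. Each generic projection is a diagram with a well-defined bracket polynomial, and these are averaged over the unit sphere. In the notation below, which is my own shorthand and may differ from the paper's, $l_{\vec{\xi}}$ is the projection of the curve $l$ along the direction $\vec{\xi}$, $\operatorname{wr}$ is the writhe of that diagram, and $\langle \cdot \rangle$ is its Kauffman bracket:

$$
f(l) = \frac{1}{4\pi} \int_{\vec{\xi} \in S^2} \left(-A^{3}\right)^{-\operatorname{wr}\left(l_{\vec{\xi}}\right)} \left\langle l_{\vec{\xi}} \right\rangle \, dS
$$

For a closed curve the integrand is the same for almost every direction, so the average is just the ordinary polynomial. For an open curve, different directions can give different diagrams, each weighted by how much of the sphere produces it, which is why the coefficients become real numbers.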

To localize this Jones polynomial of curves in 3-space, two notions are introduced in {{< cite "songMultiscaleJonesPolynomial2025" >}}:
- The multi-scale Jones polynomial;
- and the persistent Jones polynomial.

### The multi-scale Jones polynomial
For $L$ a collection of disjoint open or closed curves, assume that a segmentation $P_n = \{ l_1, l_2, \dots, l_n\}$ is given, where each $l_i$ is a segment of $L$. In the case of proteins, one could for example take one segment per amino acid.
Now, instead of computing one descriptor for all of $L$, compute it for each segment together with its neighbors within a distance $R$. By varying $R$ it is possible to consider very close neighbors first, then a slightly larger neighborhood, then larger still. For each of these neighborhoods, one can compute the Jones polynomial and evaluate it at a chosen number. In this way each segment gets not one number but a whole profile describing how its local crossing pattern develops as you widen the lens. Doing this for every segment yields a matrix in which each row represents a segment and each column a distance scale. This matrix is a fingerprint of the structure of $L$, capturing both the local structure (narrow scales) and the global shape (wide scales) in a single descriptor.
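
The bookkeeping behind this matrix can be sketched as follows. This is a minimal illustration under several assumptions of mine: each segment is represented by a single point, "within a distance $R$" means Euclidean distance between those points, and the descriptor is pluggable. The default `neighborhood_size` is only a stand-in so the code runs; it is not a Jones polynomial, and a real pipeline would replace it with an evaluation of the open-curve Jones polynomial of the selected segments.

```python
import numpy as np

def neighborhood_size(segment_ids, points):
    """Stand-in descriptor (NOT a Jones polynomial): the number of
    segments in the neighborhood. Replace with a real descriptor."""
    return float(len(segment_ids))

def multiscale_matrix(points, radii, descriptor=neighborhood_size):
    """Return a matrix with one row per segment and one column per scale.

    points: array of shape (n, 3), one representative point per segment.
    radii:  increasing sequence of distance scales R.
    """
    pts = np.asarray(points, dtype=float)
    # Pairwise distances between the representative points.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

    matrix = np.empty((len(pts), len(radii)))
    for i in range(len(pts)):
        for j, radius in enumerate(radii):
            # Segment i together with all neighbors within this radius.
            nearby = np.flatnonzero(dists[i] <= radius)
            matrix[i, j] = descriptor(nearby, pts)
    return matrix
```

Calling `multiscale_matrix(points, radii=[4.0, 8.0, 12.0])` would give a three-column profile per segment; the radii here are arbitrary example values, not the ones used in the paper.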

### The persistent Jones polynomial
Picture the segments as points, each able to "reach out" to its neighbors within a distance $r$. Now start increasing $r$ from zero. At first every segment is alone, but as $r$ grows, nearby segments start linking into small groups, those groups merge into bigger ones, and for big enough $r$ everything is joined together. During this process a *barcode* is formed: one horizontal bar for each group that forms, drawn from the scale at which it appears to the scale at which it gets absorbed into something larger. Here, a short bar is a group that only briefly exists on its own, while a long bar is a group that stays distinct across a wide range of scales. On its own, this barcode is just a summary of *which segments are close to which, and how close*.

This barcode does not yet contain any characterization of entanglement. This is where the Jones polynomial comes in. Every group in the barcode corresponds to a set of curve segments. Hence, for any group, it is possible to compute an open-curve Jones polynomial. By evaluating this polynomial at a chosen number, one attaches a value to every bar in the barcode, which can be visualized as a color. Thus, each bar carries two types of information: the range of scales over which a group of segments exists, and a characterization of its crossing pattern (from the Jones polynomial).
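
The two steps above can be sketched in a few lines. The code follows my informal description rather than the paper's exact filtration: groups are merged by single linkage on the distances between representative points, and each bar records when a group appears, when it is absorbed, and which segments it contains. The `color_bars` helper and its default descriptor (the group size) are hypothetical placeholders for evaluating the open-curve Jones polynomial of each group.

```python
import itertools
import math

def group_barcode(points):
    """Return bars (birth, death, members) for single-linkage merging.

    Every point starts as its own group at scale 0. When two groups
    merge at scale r, both get a bar ending at r and the merged group
    is born at r. The final group never dies (death = infinity).
    """
    n = len(points)
    parent = list(range(n))
    members = {i: frozenset([i]) for i in range(n)}  # root -> group
    birth = {i: 0.0 for i in range(n)}               # root -> birth scale
    bars = []

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    for r, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already in the same group
        for root in (ri, rj):
            bars.append((birth[root], r, members[root]))
        parent[rj] = ri
        members[ri] = members[ri] | members.pop(rj)
        birth[ri] = r
        del birth[rj]

    root = find(0)
    bars.append((birth[root], math.inf, members[root]))
    return bars

def color_bars(bars, descriptor=len):
    """Attach a value to each bar; `descriptor` is a placeholder for
    evaluating the Jones polynomial of the group's segments."""
    return [(b, d, descriptor(group)) for b, d, group in bars]
```

Note that when several pairs sit at exactly the same distance, this sketch produces zero-length bars; one may want to filter those out before plotting.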

## B-factor prediction
The first method, the multi-scale Jones polynomial, is put to the test in {{< cite "songMultiscaleJonesPolynomial2025" >}} by predicting B-factors in proteins. A B-factor (or Debye–Waller factor) is an experimentally measured number that quantifies how much an atom "jiggles" around its average position. The authors represent each amino acid by a single point (its central carbon atom, $C_{\alpha}$) and connect these points in sequence to form a chain $L$. Let $C = \{c_0, c_1, \dots, c_n\}$ denote the set of $C_{\alpha}$ atoms in the order of the protein sequence. The resulting chain is treated as an open curve. It is segmented by cutting at the midpoint between each pair of adjacent $C_{\alpha}$ atoms, which gives the segmentation $P_n = \{l_0, l_1,\dots, l_n\}$, with one segment per atom.
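
The midpoint cut is easy to write down. In the sketch below, the function name `segment_at_midpoints` is hypothetical, and the treatment of the two ends is my assumption: the first and last segments simply run from the terminal atom to the nearest midpoint.

```python
import numpy as np

def segment_at_midpoints(ca_coords):
    """Cut a C-alpha chain at the midpoints between adjacent atoms.

    ca_coords: array of shape (n + 1, 3) with the atoms c_0, ..., c_n.
    Returns a list of n + 1 polylines; segment i contains atom c_i and
    extends to the midpoints towards its neighbors (where they exist).
    """
    c = np.asarray(ca_coords, dtype=float)
    mids = (c[:-1] + c[1:]) / 2  # mids[i] lies between c_i and c_{i+1}

    segments = []
    for i in range(len(c)):
        pts = []
        if i > 0:
            pts.append(mids[i - 1])  # midpoint towards the previous atom
        pts.append(c[i])
        if i < len(c) - 1:
            pts.append(mids[i])      # midpoint towards the next atom
        segments.append(np.array(pts))
    return segments
```

Each interior segment is then a short two-edge polyline with its $C_{\alpha}$ atom in the middle, so every point of the chain belongs to exactly one segment, apart from the shared cut points.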

For this segmented curve, the multi-scale Jones polynomial matrix is computed. Feeding it into nothing more than a simple linear regression, the authors predict B-factors and improve on a large set of established (and much more complex) methods.
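
To show how little machinery the regression step needs, here is a minimal least-squares sketch with NumPy. It assumes one feature row per residue (for instance a row of the multi-scale matrix) and one B-factor per residue. The function names are mine, and the paper's exact features, fitting protocol, and evaluation are not reproduced here.

```python
import numpy as np

def fit_linear(features, targets):
    """Ordinary least squares with an intercept.

    features: array of shape (n_residues, n_scales).
    targets:  array of shape (n_residues,), e.g. measured B-factors.
    Returns the coefficient vector [intercept, w_1, ..., w_k].
    """
    features = np.asarray(features, dtype=float)
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, np.asarray(targets, float), rcond=None)
    return coef

def predict_linear(coef, features):
    """Apply fitted coefficients to new feature rows."""
    features = np.asarray(features, dtype=float)
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef
```

In practice one would fit on some residues (or proteins) and compare `predict_linear` against measured B-factors on the rest.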

## α-helices and β-sheets

Proteins contain two fundamental building blocks: the α-helix and the β-sheet. A natural question is whether we can characterize them in terms of the barcodes of the persistent Jones polynomial. The authors show that the Jones polynomial values span a much wider range for the helix than for the sheet, as can be seen in Figures 1 and 2 below. This suggests that the helix's groups vary far more in their crossing patterns than the sheet's. The barcode also captures geometry: the shortest bars sit at the roughly 3.8-ångström spacing between consecutive central carbons ($C_{\alpha}$ atoms), and the sheet's longer bars confirm that its residues are farther apart than the helix's.


{{< figure
  src="helix.png"
  alt="α-helix together with its barcode"
  caption="Figure 1: α-helix together with its barcode"
>}}


{{< figure
  src="sheet.png"
  alt="β-sheet together with its barcode"
  caption="Figure 2: β-sheet together with its barcode"
>}}


## References

{{< references >}}