Examples¶
Example 1: Basics¶
The source file can be downloaded here:
example1_basics.py
This example illustrates the basic operations of the StrainPycon package. First we draw random strain barcodes and frequencies, and then we generate two measurement vectors: one without noise and one with small additive Gaussian noise. The reconstruction, i.e., “MAP estimate”, is computed for both vectors. We also compute the misfit values, i.e., negative log-likelihoods, when the number of strains in the reconstruction varies.
We start by importing the necessary modules and setting the random seed for reproducibility:
import strainpycon
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
We choose the number of SNP sites and the number of strains for our synthetic measurements:
m = 24 # number of SNP sites
n = 4 # number of strains
The binary strain barcode matrix has shape (n, m)
. We draw it from uniform
distribution. The frequency vector has shape (n,)
and it has positive
entries that sum up to one. StrainPycon provides a utility function for drawing
uniformly random frequency vectors. For illustration purposes, we sort the
vector in descending order, because the reconstruction method always returns the
vector in descending order. The noiseless measurement is obtained by simply
taking a weighted sum of the binary barcodes, which can be written as a
vector-matrix multiplication. For the noisy measurement, we add independent
Gaussian random variables with standard deviation 0.01:
strains = np.random.randint(0, 2, (n, m), dtype=int) # shape (n,m)
freq = strainpycon.psimplex.rand_simplex(n, sort=-1) # shape (n,)
meas1 = np.dot(freq, strains) # shape (m,)
meas2 = meas1 + 0.01 * np.random.normal(size=meas1.shape)
Now we are ready to test the strain identification functions in StrainPycon package. Most of the useful functions are methods of StrainRecon class, so first we create an object of that class:
S = strainpycon.StrainRecon()
The MAP estimates are computed as follows. Here, the correct number of strains is assumed to be known:
(mat1, vec1) = S.compute(meas1, n) # without noise
(mat2, vec2) = S.compute(meas2, n) # with noise
Finally, we examine the misfits, i.e., negative log-likelihoods, when n
in
the reconstruction varies between 1 and 7:
nrange = range(1,8)
misfits1 = S.misfits(meas1, nrange) # without noise
misfits2 = S.misfits(meas2, nrange) # with noise
The matrices, frequency vectors, and misfits can be visualized as follows:
# Show the original barcode matrix and the reconsructions.
fig, ax = plt.subplots(nrows=3)
plt.sca(ax[0])
plt.imshow(strains)
plt.title('Ground truth')
plt.sca(ax[1])
plt.imshow(mat1)
plt.title('Reconstruction from noiseless data')
plt.sca(ax[2])
plt.imshow(mat2)
plt.title('Reconstruction from noisy data')
for idx in range(3):
plt.sca(ax[idx])
plt.xlabel('SNP site')
plt.ylabel('Strain')
plt.tight_layout()
plt.show()
# Display frequencies.
print('Strain frequencies:')
print('Ground truth:\n{}'.format(freq))
print('Reconstruction from noiseless data:\n{}'.format(vec1))
print('Reconstruction from noisy data:\n{}'.format(vec2))
# Plot misfits.
fig = plt.figure()
plt.semilogy(nrange, misfits1, label='Noiseless data')
plt.semilogy(nrange, misfits2, label='Noisy data')
plt.legend()
plt.title('Misfit')
plt.xlabel('Number of strains')
plt.ylabel('Reconstruction misfit')
plt.show()
Example 2: Uncertainty quantification¶
The source file can be downloaded here:
example2_uncertainty.py
In this example, we consider experimental data and visualize the posterior statistics for the barcode matrix. Essentially, we reproduce the matrix part of the image shown as Figure 3 in https://doi.org/10.1088/1361-6420/aad7cd . See the corresponding section in the text for details about the measurement data.
We import the modules as above and provide the data. The number of SNP sites is 16 and there are three strains:
import strainpycon
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
strains = np.array([[0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
[1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1]])
n = np.size(strains, 0)
meas = np.array([0.305732484076433,
0.144654088050314,
1.000000000000000,
0.872727272727273,
0.161073825503356,
0.178571428571429,
1.000000000000000,
1.000000000000000,
0.131034482758621,
0.838509316770186,
0.283132530120482,
0.200000000000000,
0.679802955665025,
0,
0,
1.000000000000000])
Next, we compute the reconstruction with uncertainty information. We assume the knowledge of the number of strains, and the standard deviation of noise is set to 0.1 as in the article:
S = strainpycon.StrainRecon()
(mat, vec, meanmat, meanvec, devmat, devvec) = S.compute(meas, n, gamma=0.1, uncertainty=True)
Lastly, we create the figure:
fig, ax = plt.subplots(nrows=4)
plt.sca(ax[0])
plt.imshow(strains, vmin=0, vmax=1)
plt.title('Ground truth')
plt.sca(ax[1])
plt.imshow(mat, vmin=0, vmax=1)
plt.title('Reconstruction (MAP estimate)')
plt.sca(ax[2])
plt.imshow(meanmat, vmin=0, vmax=1)
plt.title('Posterior mean')
plt.sca(ax[3])
plt.imshow(devmat, vmin=0, vmax=1)
plt.title('Posterior standard deviation')
for idx in range(4):
plt.sca(ax[idx])
plt.xlabel('SNP site')
plt.ylabel('Strain')
plt.tight_layout()
cbar_ax = fig.add_axes([0.85, 0.1, 0.05, 0.8])
plt.colorbar(cax=cbar_ax)
plt.show()
Example 3: Multiple categories¶
The source file can be downloaded here:
example3_multicat.py
Here we look at the case where the SNP sites are not binary but can have more than two (here, four) possible values:
import strainpycon
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
m = 24
n = 3
cats = 4
For convenience, the package provides a function that we can use to generate synthetic data. The last argument (here, 0.1) is the standard deviation of the Gaussian noise:
S = strainpycon.StrainRecon()
(meas, strains, freq) = S.random_data(m, n, cats, 0.1)
The shape of meas
is now (cats, m)
instead of (m,)
. However, we can
use the compute
method as before:
(mat, vec) = S.compute(meas, n)
Visualization:
fig, ax = plt.subplots(nrows=2)
plt.sca(ax[0])
plt.imshow(strains)
plt.title('Ground truth')
plt.xlabel('SNP site')
plt.ylabel('Strain')
plt.sca(ax[1])
plt.imshow(mat)
plt.title('Reconstruction')
plt.xlabel('SNP site')
plt.ylabel('Strain')
plt.tight_layout()
plt.show()