admix.Dataset

We introduce the central data structures used in this package.

  • dset.geno: genotype (n_snp, n_indiv, n_ploidy)

  • dset.lanc: local ancestry (n_snp, n_indiv, n_ploidy)

  • dset.snp: information about SNPs (n_snp, n_snp_feature)

  • dset.indiv: information about individuals (n_indiv, n_indiv_feature)

The core of the Python API is the admix.Dataset class, which supports convenient operations for manipulating large on-disk data sets.
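To make the array shapes listed above concrete, here is a minimal sketch that uses plain NumPy arrays in place of the real `dset.geno` and `dset.lanc`. The toy dimensions, the random values, the 0/1 allele coding, and the 0 .. n_anc - 1 ancestry coding are assumptions made for illustration; they are not read from the package.

```python
import numpy as np

# Toy dimensions (illustrative only, not the example data set)
n_snp, n_indiv, n_ploidy, n_anc = 6, 4, 2, 2
rng = np.random.default_rng(0)

# geno[s, i, p]: allele (assumed coded 0 or 1) at SNP s on haplotype p of individual i
geno = rng.integers(0, 2, size=(n_snp, n_indiv, n_ploidy))
# lanc[s, i, p]: ancestry label (assumed coded 0 .. n_anc - 1) at the same position
lanc = rng.integers(0, n_anc, size=(n_snp, n_indiv, n_ploidy))

# The two arrays are aligned element by element
assert geno.shape == lanc.shape == (n_snp, n_indiv, n_ploidy)

# Summing over the ploidy axis gives a per-individual dosage (0, 1, or 2)
dosage = geno.sum(axis=2)
print(dosage.shape)  # (6, 4)
```

The point of the sketch is only that genotype and local ancestry share one index space, so any SNP or individual selection can be applied to both arrays at once.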

[1]:
import admix

# load example data
dset = admix.io.read_dataset("example_data/CEU-YRI")
[2]:
# overview of data set
dset
[2]:
admix.Dataset object with n_snp x n_indiv = 15357 x 10000, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[3]:
# SNP attributes, CHROM (chromosomes), POS (positions), REF (reference allele), ALT (alternative allele), etc.
# we have also precomputed FREQ1, FREQ2 as ancestry-specific allele frequencies
dset.snp
[3]:
                 CHROM       POS REF ALT QUAL FILTER INFO
snp
22:16406147:A:G     22  16406147   A   G    .      .    .
22:16551808:T:C     22  16551808   T   C    .      .    .
22:16573830:T:C     22  16573830   T   C    .      .    .
22:16575525:T:C     22  16575525   T   C    .      .    .
22:16576248:G:T     22  16576248   G   T    .      .    .
...                ...       ... ... ...  ...    ...  ...
22:50739662:G:A     22  50739662   G   A    .      .    .
22:50743331:A:G     22  50743331   A   G    .      .    .
22:50772964:T:C     22  50772964   T   C    .      .    .
22:50774447:A:C     22  50774447   A   C    .      .    .
22:50780578:G:A     22  50780578   G   A    .      .    .

15357 rows × 7 columns

[4]:
# individual attributes
dset.indiv
[4]:
indiv
Sample_1
Sample_2
Sample_3
Sample_4
Sample_5
...
Sample_9996
Sample_9997
Sample_9998
Sample_9999
Sample_10000

10000 rows × 0 columns

[5]:
# phased genotype (n_snp, n_indiv, 2)
dset.geno
[5]:
            Array                Chunk
Bytes       1.14 GiB             78.12 MiB
Shape       (15357, 10000, 2)    (1024, 10000, 2)
Dask graph  15 chunks in 31 graph layers
Data type   float32 numpy.ndarray
[6]:
# local ancestry (n_snp, n_indiv, 2)
dset.lanc
[6]:
            Array                Chunk
Bytes       292.91 MiB           19.53 MiB
Shape       (15357, 10000, 2)    (1024, 10000, 2)
Dask graph  15 chunks in 31 graph layers
Data type   int8 numpy.ndarray
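The sizes and chunk counts reported for dset.geno and dset.lanc follow directly from the array shape, the chunk shape, and the element size. The short calculation below reproduces them with the standard library only; the helper name `array_bytes` is made up for this example.

```python
import math

def array_bytes(shape, itemsize):
    """Size in bytes of a dense array with the given shape and element size."""
    return math.prod(shape) * itemsize

shape, chunk = (15357, 10000, 2), (1024, 10000, 2)

geno_gib = array_bytes(shape, 4) / 2**30        # float32: 4 bytes per element
geno_chunk_mib = array_bytes(chunk, 4) / 2**20
lanc_mib = array_bytes(shape, 1) / 2**20        # int8: 1 byte per element
lanc_chunk_mib = array_bytes(chunk, 1) / 2**20

# Only the SNP axis is split, so the chunk count is a ceiling division on that axis
n_chunks = math.ceil(shape[0] / chunk[0])

print(f"geno: {geno_gib:.2f} GiB total, {geno_chunk_mib:.2f} MiB per chunk")
print(f"lanc: {lanc_mib:.2f} MiB total, {lanc_chunk_mib:.2f} MiB per chunk")
print(f"chunks: {n_chunks}")
```

Because each chunk holds all individuals for a block of at most 1024 SNPs, operations that loop over SNP blocks only need one chunk in memory at a time.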
[7]:
# subset the first 50 SNPs
dset[0:50, :]
[7]:
admix.Dataset object with n_snp x n_indiv = 50 x 10000, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[8]:
# subset the first 50 individuals
dset[:, 0:50]
[8]:
admix.Dataset object with n_snp x n_indiv = 15357 x 50, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
[9]:
# subset the first 50 SNPs and first 50 individuals
dset[0:50, 0:50]
[9]:
admix.Dataset object with n_snp x n_indiv = 50 x 50, n_anc=2
        snp: 'CHROM', 'POS', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'
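The `dset[snp_selection, indiv_selection]` indexing shown above can be pictured as applying the same two selections to every aligned array. The class below is a hypothetical stand-in written for this illustration (the name `ToyDataset` and its internals are assumptions); it is not how admix.Dataset is implemented and it handles slices only.

```python
import numpy as np

class ToyDataset:
    """Hypothetical stand-in for admix.Dataset; not the library's implementation."""

    def __init__(self, geno, lanc):
        assert geno.shape == lanc.shape
        self.geno, self.lanc = geno, lanc

    @property
    def n_snp(self):
        return self.geno.shape[0]

    @property
    def n_indiv(self):
        return self.geno.shape[1]

    def __getitem__(self, key):
        # Accept toy[snp_sel] or toy[snp_sel, indiv_sel] (slices only)
        snp_sel, indiv_sel = key if isinstance(key, tuple) else (key, slice(None))
        # Apply the same selection to every aligned array
        return ToyDataset(
            self.geno[snp_sel][:, indiv_sel],
            self.lanc[snp_sel][:, indiv_sel],
        )

toy = ToyDataset(np.zeros((100, 80, 2)), np.zeros((100, 80, 2), dtype=np.int8))
sub = toy[0:50, 0:50]
print(sub.n_snp, sub.n_indiv)  # 50 50
```

Returning a new object of the same type is what lets subsets be chained or passed to the same methods as the full data set.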
[10]:
# calculate allele counts per ancestry background
dset.allele_per_anc()
[10]:
            Array                Chunk
Bytes       2.29 GiB             156.25 MiB
Shape       (15357, 10000, 2)    (1024, 10000, 2)
Dask graph  15 chunks in 63 graph layers
Data type   float64 numpy.ndarray
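As a rough picture of what a per-ancestry allele array contains, the NumPy sketch below counts, for each SNP and individual, the alleles carried on haplotypes of each ancestry. It assumes that the last axis of the result indexes ancestry (length n_anc), that alleles are coded 0/1, and that ancestries are coded 0 .. n_anc - 1; the function name `allele_per_anc_sketch` is made up, and this is not the code behind `dset.allele_per_anc()`.

```python
import numpy as np

def allele_per_anc_sketch(geno, lanc, n_anc):
    """Count alleles carried on each ancestry background.

    geno, lanc: arrays of shape (n_snp, n_indiv, n_ploidy)
    returns:    array of shape (n_snp, n_indiv, n_anc)
    """
    out = np.zeros(geno.shape[:2] + (n_anc,), dtype=float)
    for a in range(n_anc):
        # Keep only haplotypes whose local ancestry is a, then sum over ploidy
        out[:, :, a] = (geno * (lanc == a)).sum(axis=2)
    return out

# One SNP, two individuals, two haplotypes each
geno = np.array([[[1, 1], [0, 1]]])
lanc = np.array([[[0, 1], [1, 1]]])
apa = allele_per_anc_sketch(geno, lanc, n_anc=2)

# Individual 0 carries one allele on each ancestry: [1, 1]
# Individual 1 carries one allele, on ancestry 1:   [0, 1]
print(apa[0])
```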
[11]:
# calculate allele frequencies per ancestry background
dset.af_per_anc()
admix.data.af_per_anc: 100%|███████████████████████████████████████████████████████████████████████████████| 15/15 [00:11<00:00,  1.35it/s]
[11]:
array([[0.28433613, 0.01600149],
       [0.44607595, 0.46056075],
       [0.25962513, 0.01133815],
       ...,
       [0.1228952 , 0.01448099],
       [0.34757477, 0.55327383],
       [0.17893943, 0.36445915]])
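The result has one row per SNP and one column per ancestry. One plausible way to obtain such a table is to divide the number of alleles carried on haplotypes of a given ancestry by the number of haplotypes with that ancestry. The sketch below does this on a tiny in-memory example; the function name `af_per_anc_sketch`, the 0/1 allele coding, and the NaN convention for unobserved ancestries are assumptions, not the behavior of `dset.af_per_anc()`.

```python
import numpy as np

def af_per_anc_sketch(geno, lanc, n_anc):
    """Ancestry-specific allele frequencies, shape (n_snp, n_anc)."""
    freq = np.full((geno.shape[0], n_anc), np.nan)
    for a in range(n_anc):
        mask = lanc == a
        n_hap = mask.sum(axis=(1, 2))            # haplotypes with ancestry a, per SNP
        n_alt = (geno * mask).sum(axis=(1, 2))   # alleles carried on those haplotypes
        observed = n_hap > 0                     # leave NaN where ancestry a is absent
        freq[observed, a] = n_alt[observed] / n_hap[observed]
    return freq

# One SNP, two individuals, two haplotypes each
geno = np.array([[[1, 1], [0, 1]]])
lanc = np.array([[[0, 1], [1, 1]]])
freq = af_per_anc_sketch(geno, lanc, n_anc=2)

# Ancestry 0: 1 allele on 1 haplotype; ancestry 1: 2 alleles on 3 haplotypes
print(freq)
```

On the real data the same sums would be accumulated chunk by chunk along the SNP axis, which is consistent with the 15-step progress bar shown in the output above.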