Dataset preparation

Overview

To start genetic analysis of admixed populations with admix-kit, first compile the data set into the following files:

  • Phased genotype in PLINK2 format.

  • Local ancestry inference results .lanc.

  • Additional individuals’ information .indiv_info

  • Additional SNPs’ information .snp_info

For genome-wide analysis, we recommend dividing the data set by chromosome. For example, the file structure will look like:

.
├── genotype
│   ├── chr1.pgen
│   ├── chr1.psam
│   ├── chr1.pvar
│   ├── chr1.lanc
│   ├── chr1.indiv_info
│   ├── chr1.snp_info
│   ├── chr2.pgen
│   ├── ...
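Before moving on, it can help to confirm that every chromosome has the expected files. The following is a minimal sketch, not part of admix-kit: the helper name `missing_files` is hypothetical, and it assumes the `chr<N>.<suffix>` naming shown in the layout above, treating the PLINK2 files and `.lanc` as required and the two info files as optional.

```python
from pathlib import Path

# Hypothetical helper for illustration; not part of admix-kit.
# Assumes the chr<N>.<suffix> naming used in the example layout.
REQUIRED_SUFFIXES = (".pgen", ".psam", ".pvar", ".lanc")


def missing_files(genotype_dir, chroms=range(1, 23)):
    """Return the required per-chromosome files that are absent from genotype_dir."""
    root = Path(genotype_dir)
    return [
        f"chr{chrom}{suffix}"
        for chrom in chroms
        for suffix in REQUIRED_SUFFIXES
        if not (root / f"chr{chrom}{suffix}").exists()
    ]

# Example usage (adjust the path to your own data):
#   for name in missing_files("genotype"):
#       print("missing:", name)
```

An empty list means all required files for the requested chromosomes were found.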

We detail the steps to prepare the data set as follows:

Step 1: format genotype

Step 1.1 (optional): select well-imputed SNPs

Often we start with imputed genotypes from an imputation server. We can select high-quality SNPs by filtering on MAF > 0.005 and R2 > 0.8 (the 5th and 7th columns, respectively, of the imputation server's .info.gz file).

IN_DIR=/path/to/vcf
OUT_DIR=/path/to/imputed
chrom=1  # example value; repeat for each chromosome you process

# convert to PLINK2 format
plink2 --vcf ${IN_DIR}/chr${chrom}.vcf.gz \
    --extract-if-info "R2>0.8" \
    --rm-dup exclude-all \
    --snps-only \
    --maf 0.005 \
    --max-alleles 2 \
    --make-pgen \
    --memory 16000 \
    --out ${OUT_DIR}/chr${chrom}

# NOTE: --extract-if-info "R2>0.8" retains well-imputed SNPs.
# Alternatively, if you have the R2 information in a .info.gz file,
# you can replace '--extract-if-info' with
# $ zcat ${IN_DIR}/chr${chrom}.info.gz | awk 'NR>1 {if($5>0.005 && $7>0.8) print $1}' > ${OUT_DIR}/chr${chrom}.snplist
# and use plink2 --extract ${OUT_DIR}/chr${chrom}.snplist

# if your vcf file is already processed, use the following
# $ plink2 --vcf ${vcf} --make-pgen --out ${out_plink}

PLINK2 is also versatile for converting other formats into .pgen format. See more at https://www.cog-genomics.org/plink/2.0/input#pgen.
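For readers who prefer Python to awk, the .info.gz filter mentioned in the comments above can be sketched as follows. This is an illustration only: the function name `select_well_imputed` is hypothetical, and it assumes the same layout as the awk one-liner (whitespace-delimited, one header line, SNP ID in column 1, MAF in column 5, R2 in column 7). Unlike awk, it skips rows whose MAF or R2 entry is not numeric.

```python
import gzip


def select_well_imputed(info_path, maf_min=0.005, r2_min=0.8):
    """Return SNP IDs (column 1) whose MAF (column 5) and R2 (column 7) pass the thresholds.

    Hypothetical helper; assumes the column layout used by the awk command above.
    """
    snps = []
    with gzip.open(info_path, "rt") as f:
        next(f, None)  # skip the header line
        for line in f:
            fields = line.split()
            if len(fields) < 7:
                continue  # malformed or truncated row
            try:
                maf, r2 = float(fields[4]), float(fields[6])
            except ValueError:
                continue  # non-numeric entry, e.g. a missing-value placeholder
            if maf > maf_min and r2 > r2_min:
                snps.append(fields[0])
    return snps
```

Writing the returned IDs one per line produces a file that can be passed to plink2 --extract.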

Step 1.2 (optional): select HM3 SNPs

Most genetic analyses (e.g., local ancestry inference) can be made more efficient by subsetting the data to HapMap3 (HM3) SNPs.

admix subset-hapmap3 --pfile ${imputed_pfile} --out-pfile ${hm3_pfile} --build hg38

Note

Make sure your source data is phased, because phasing is essential for many analyses with admix-kit. Use plink2 --pfile <pfile> --pgen-info for a basic check. If the output contains the line “Explicitly phased hardcalls present”, phase information is present.

Step 1.3 (optional): merge all chromosomes into one file

# write one per-chromosome file prefix per line
for i in {1..22}; do echo "chr${i}"; done > chr_list.txt
plink2 \
    --pmerge-list chr_list.txt \
    --pmerge-list-dir ${pfile_dir} \
    --make-pgen \
    --out ${merged_pfile}

Step 1.4 (optional): perform joint PCA with 1kg reference panel

By merging your sample with the 1000 Genomes reference panel, you can get an overall picture of the genetic ancestries of the individuals in your sample.

admix pfile-merge-indiv \
    --pfile1 ${REF_DIR}/all_chr \
    --pfile2 ${SAMPLE_HM3_DIR}/all_chr \
    --out ${OUT_DIR}/merged

plink2 --pfile ${OUT_DIR}/merged \
    --pca approx \
    --out ${OUT_DIR}/merged_pca

Step 2: Local ancestry inference

Note

There are many tools for local ancestry inference. If you have not yet performed local ancestry inference, we provide a guideline for using RFmix (see RFmix guideline). Otherwise, you can use the tool of your choice.

We provide a helper function to convert local ancestry results into the .lanc format (see more details below), which is a compact format for storing local ancestry. To convert RFmix local ancestry results into the .lanc format, use the following command. This command can be applied to both imputed and HM3 data.

# <pgen_prefix>:    e.g., dset.chr1
# <rfmix_msp_path>: e.g., rfmix/dset.chr1.msp.tsv
# <lanc_path>:      e.g., dset.chr1.lanc
admix lanc-convert \
    --pfile <pgen_prefix> \
    --rfmix <rfmix_msp_path> \
    --out <lanc_path>

You have now formatted all the files required for downstream analysis. In addition, we recommend calculating some basic statistics for the data set with:

# <pgen_prefix>: e.g., dset.chr1
# <snp_info>:    e.g., dset.chr1.snp_info
admix append-snp-info \
    --pfile <pgen_prefix> \
    --out <snp_info>

Other files

The PLINK2 genotype files and the .lanc file are almost all you need to start the analysis. The other files can be useful for better structuring your analysis: .snp_info contains SNP-level information, such as allele frequencies, and .indiv_info contains individual-level information, such as top PCs.

Example simulated dataset

Download an example dataset from here. See step-by-step instructions for simulation.

.lanc file format

.lanc is a text file containing a matrix of local ancestry of shape <n_snp> x <n_indiv> x <2 ploidy>.

The first line contains two numbers: <n_snp>, the number of SNPs, and <n_indiv>, the number of individuals. Then <n_indiv> lines follow, one per individual. On each line, the local ancestry change points are recorded as <pos>:<anc1><anc2>, where <pos> is the position of the change point and <anc1><anc2> are the two ancestries, ordered according to the phase, of the segment that ends just before <pos>.

An example .lanc file makes the format clear:

300 3
100:01 300:00
120:10 300:01
300:00

This corresponds to a 300 SNPs x 3 individuals x 2 ploidy matrix. The corresponding dense matrix for the first individual can be reconstructed using the following code:

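import numpy as np

# example for the first individual in the above example file ("100:01 300:00")
break_list = [100, 300]  # right-open end of each segment
anc0_list = [0, 0]       # ancestry of the first haplotype in each segment
anc1_list = [1, 0]       # ancestry of the second haplotype in each segment
start = 0
lanc = np.zeros((300, 2), dtype=np.int8)  # the shape must be passed as a tuple
for stop, anc0, anc1 in zip(break_list, anc0_list, anc1_list):
    lanc[start:stop, 0] = anc0
    lanc[start:stop, 1] = anc1
    start = stop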

Note that these ranges are right-open intervals [start, stop), and the last position on each line is always <n_snp>. We provide helper functions to convert between the sparse .lanc format and the dense matrix format.
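To make the format concrete for a whole file, here is a minimal, independent sketch of both directions of the conversion. It is not admix-kit's own helper: the function names `lanc_to_dense` and `dense_to_lanc` are hypothetical, the dense array uses the <n_snp> x <n_indiv> x 2 shape described above, and ancestry labels are assumed to be single digits as in the example file.

```python
import numpy as np


def lanc_to_dense(text):
    """Parse .lanc text into an int8 array of shape (n_snp, n_indiv, 2)."""
    lines = text.strip().splitlines()
    n_snp, n_indiv = (int(x) for x in lines[0].split())
    if len(lines) - 1 != n_indiv:
        raise ValueError("number of individual lines does not match the header")
    dense = np.zeros((n_snp, n_indiv, 2), dtype=np.int8)
    for i, line in enumerate(lines[1:]):
        start = 0
        for token in line.split():
            stop, anc = token.split(":")
            stop = int(stop)
            # assumption: single-digit ancestry labels, one per haplotype
            dense[start:stop, i, 0] = int(anc[0])
            dense[start:stop, i, 1] = int(anc[1])
            start = stop  # ranges are right-open: [start, stop)
        if start != n_snp:
            raise ValueError(f"individual {i} does not end at n_snp")
    return dense


def dense_to_lanc(dense):
    """Encode an (n_snp, n_indiv, 2) array back into .lanc text."""
    n_snp, n_indiv, _ = dense.shape
    out = [f"{n_snp} {n_indiv}"]
    for i in range(n_indiv):
        hap = dense[:, i, :]
        # positions where the ancestry pair differs from the previous SNP
        changes = np.flatnonzero(np.any(hap[1:] != hap[:-1], axis=1)) + 1
        stops = list(changes) + [n_snp]
        out.append(" ".join(f"{s}:{hap[s - 1, 0]}{hap[s - 1, 1]}" for s in stops))
    return "\n".join(out)
```

Applied to the example file above, `lanc_to_dense` yields a 300 x 3 x 2 array, and passing that array to `dense_to_lanc` reproduces the original four lines.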