Generating the inputs to SemiBin

Starting with a metagenome, you need to generate a contigs file (contig.fa) and a sorted BAM file (output.bam).

  1. Assemble it into a contigs FASTA file. In this case, we are using NGLess to combine FastQ preprocessing and assembly (using MEGAHIT, but any other assembler will work).
ngless "1.2"

input = paired('reads_1.fq.gz', 'reads_2.fq.gz')
input = preprocess(input) using |r|:
    r = substrim(r, min_quality=25)
    if len(r) < 45:
        discard

contigs = assemble(input)
write(contigs, ofile='contig.fa')
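
Before mapping, it can help to confirm that the assembly actually produced contigs. The following is a minimal sketch and not part of SemiBin or NGLess: the function name fasta_stats is hypothetical, and it assumes an uncompressed FASTA file such as the contig.fa written above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SemiBin or NGLess): print the number of
# sequences and the total number of bases in an uncompressed FASTA file,
# separated by a tab.
fasta_stats() {
    awk '
        # header line: count one sequence
        /^>/ { n++; next }
        # sequence line: strip whitespace, then add its length
        { gsub(/[ \t\r]/, ""); bp += length($0) }
        END { printf "%d\t%.0f\n", n, bp }
    ' "$1"
}
```

Calling fasta_stats contig.fa prints the contig count followed by the total assembly length in bases. A count of zero suggests the assembly step failed or wrote to a different file.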
  2. Map the reads to the contigs FASTA file.

Mapping using NGLess

ngless "1.2"
import "samtools" version "1.0"

input = fastq('sample.fq.gz')
mapped = map(input, fafile='contig.fa')

write(samtools_sort(mapped),
    ofile='output.bam')
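
To confirm that output.bam is coordinate-sorted, one option is to inspect the @HD header line, whose SO tag records the sort order in the SAM format. This is a minimal sketch, assuming samtools is available on the PATH; the function name is_coordinate_sorted is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a SAM/BAM header on stdin and succeed only if the
# @HD line declares coordinate sort order (the SO:coordinate tag).
# This checks what the header claims; it does not re-verify the record order.
is_coordinate_sorted() {
    grep -q '^@HD.*SO:coordinate'
}
```

It could be used as samtools view -H output.bam | is_coordinate_sorted && echo sorted, where samtools view -H prints only the header.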

Mapping using bowtie2

bowtie2-build -f contig.fa contig.fa

bowtie2 -q --fr -x contig.fa -1 reads_1.fq.gz -2 reads_2.fq.gz -S contig.sam -p 64

samtools view -h -b -S contig.sam -o contig.bam -@ 64

samtools view -b -F 4 contig.bam -o contig.mapped.bam -@ 64

samtools sort -m 1000000000 contig.mapped.bam -o contig.mapped.sorted.bam -@ 64

samtools index contig.mapped.sorted.bam
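
The commands above hard-code 64 threads, and samtools sort -m sets the memory limit per thread, so the sort step may request roughly 64 GB in total. The sketch below wraps the same commands in a function that takes the thread count as a parameter. The names map_reads, run, and DRY_RUN, as well as the default of 4 threads, are assumptions introduced only for illustration; the tool invocations and flags are the ones shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the bowtie2/samtools commands shown above.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: map_reads CONTIGS READS_1 READS_2 [THREADS]
map_reads() {
    contigs=$1; r1=$2; r2=$3
    threads=${4:-4}          # assumed default, adjust to your machine
    prefix=${contigs%.fa}    # e.g. contig.fa -> contig

    run bowtie2-build -f "$contigs" "$contigs" || return 1
    run bowtie2 -q --fr -x "$contigs" -1 "$r1" -2 "$r2" -S "$prefix.sam" -p "$threads" || return 1
    run samtools view -h -b -S "$prefix.sam" -o "$prefix.bam" -@ "$threads" || return 1
    # -F 4 drops unmapped reads
    run samtools view -b -F 4 "$prefix.bam" -o "$prefix.mapped.bam" -@ "$threads" || return 1
    # -m is per thread, so total sort memory scales with the thread count
    run samtools sort -m 1000000000 "$prefix.mapped.bam" -o "$prefix.mapped.sorted.bam" -@ "$threads" || return 1
    run samtools index "$prefix.mapped.sorted.bam"
}
```

Setting DRY_RUN=1 and then calling map_reads contig.fa reads_1.fq.gz reads_2.fq.gz 16 prints the six commands without running them, which is a cheap way to review the file names and thread settings before starting a long job.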

SemiBin uses mmseqs2 by default, but you can also use CAT to produce contig taxonomic classifications and generate the cannot-link pairs.

Note: Unless you understand exactly what is going on, you probably do not want to use CAT instead of the default. Feel free to check in with us if you have doubts.

CAT contigs \
        -c contig.fa \
        -d CAT_prepare_20200304/2020-03-04_CAT_database \
        --path_to_prodigal $path_to_prodigal \
        --path_to_diamond $path_to_diamond \
        -t CAT_prepare_20200304/2020-03-04_taxonomy \
        -o CAT_output/CAT \
        --force \
        -f 0.5 \
        --top 11 \
        --I_know_what_Im_doing \
        --index_chunks 1

CAT add_names \
    CAT_output/CAT.contig2classification.txt \
    -o CAT_output/CAT.out \
    -t CAT_prepare_20200304/2020-03-04_taxonomy \
    --force \
    --only_official

Generate cannot-link constraints using the script generate_cannot_link.py in the scripts/ directory:

python script/generate_cannot_link.py -i CAT.out -c contig.fa -s sample-name -o output --CAT