Generating the inputs to SemiBin (from a metagenome)

Starting with a metagenome, you need to generate a contigs file (contigs.fa) and a sorted BAM file (output.sorted.bam) from mapping the metagenomic reads to the assembled contigs.

Step 1: Assemble it into a contigs FASTA file. In this case, we are using NGLess to combine FastQ preprocessing & assembly (using MEGAHIT, but any other system will work.

Step 2: Map reads to the FASTA file generated in Step 1.

Here is an NGLess file that performs all these operations in a single script:

ngless "1.5"
import "samtools" version "1.0"

input = paired('reads_1.fq.gz', 'reads_2.fq.gz')
input = preprocess(input) using |r|:
    r = substrim(r, min_quality=25)
    if len(r) < 45:
        discard

contigs = assemble(input)
write(contigs, ofile='contig.fa')

mapped = map(input, fafile=contigs)

write(samtools_sort(mapped),
    ofile='output.sorted.bam')

Mapping using bowtie2

You can also use bowtie2 directly, for example (using 4 threads, you can adjust -p 4 as needed when calling bowtie2):

bowtie2-build -f contig.fa contig.fa

bowtie2 -q --fr -x contig.fa -1 reads_1.fq.gz -2 reads_2.fq.gz -S contig.sam -p 4

samtools view -h -b -S contig.sam -o contig.bam
samtools view -b -F 4 contig.bam -o contig.mapped.bam
samtools sort contig.mapped.bam -o contig.mapped.sorted.bam

samtools index contig.mapped.sorted.bam

Note: Unless you understand exactly what is going on, you probably do not want to do this. Feel free to check in with us if you have doubts.

SemiBin1 uses mmseqs2 by default, but you can also use CAT to produce contig taxonomic classifications and generate the cannot-link pairs.

CAT contigs \
        -c contig.fa \
        -d CAT_prepare_20200304/2020-03-04_CAT_database \
        --path_to_prodigal $path_to_prodigal \
        --path_to_diamond $path_to_diamond \
        -t CAT_prepare_20200304/2020-03-04_taxonomy \
        -o CAT_output/CAT \
        --force \
        -f 0.5 \
        --top 11 \
        --I_know_what_Im_doing \
        --index_chunks 1

CAT add_names \
    CAT_output/CAT.contig2classification.txt \
    -o CAT_output/CAT.out \
    -t CAT_prepare_20200304/2020-03-04_taxonomy \
    --force \
    --only_official

Generate cannot-link constrains using the script generate_cannot_link.py in the scripts/ directory

python script/generate_cannot_link.py -i CAT.out -c contig.fa -s sample-name -o output --CAT