Generating the inputs to SemiBin
Starting with a metagenome, you need to generate a contigs file (contigs.fa) and a sorted BAM file (output.bam).
- Assemble the metagenome into a contigs FASTA file. Here, we use NGLess to combine FastQ preprocessing and assembly (with MEGAHIT, although any other assembler will work).
ngless "1.2"
input = paired('reads_1.fq.gz', 'reads_2.fq.gz')
input = preprocess(input) using |r|:
    r = substrim(r, min_quality=25)
    if len(r) < 45:
        discard
contigs = assemble(input)
write(contigs, ofile='contig.fa')
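Before mapping, it can be worth confirming that the assembly actually produced contigs. The following is a minimal sketch and not part of SemiBin or NGLess: `count_contigs` is a hypothetical helper that counts FASTA header lines, and `contig.fa` is the file written by the script above.

```shell
# Hypothetical helper (not part of SemiBin): count the sequences in a FASTA
# file by counting header lines, i.e. lines that start with '>'.
count_contigs() {
    # grep -c exits non-zero when the count is 0, so '|| true' keeps the
    # helper usable in scripts that run with 'set -e'.
    grep -c '^>' "$1" || true
}

# Example with the file name used on this page:
# n=$(count_contigs contig.fa)
# [ "$n" -gt 0 ] || echo "contig.fa contains no contigs" >&2
```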
- Map reads to the FASTA file
Mapping using NGLess
ngless "1.2"
import "samtools" version "1.0"
input = fastq('sample.fq.gz')
mapped = map(input, fafile='contig.fa')
write(samtools_sort(mapped),
    ofile='output.bam')
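Since the BAM file has to be sorted, a quick header check can catch a mistake early. This is a sketch under the assumption that the sorting step records `SO:coordinate` in the `@HD` header line, as `samtools sort` does by default; `is_coordinate_sorted` is a hypothetical helper, not part of SemiBin, NGLess, or samtools.

```shell
# Hypothetical helper: read a SAM/BAM header from standard input and succeed
# only if its @HD line declares coordinate sorting (SO:coordinate).
is_coordinate_sorted() {
    grep -q '^@HD.*SO:coordinate'
}

# Example with the file name used on this page (requires samtools):
# samtools view -H output.bam | is_coordinate_sorted \
#     || echo "output.bam is not coordinate-sorted" >&2
```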
Mapping using bowtie2
bowtie2-build -f contig.fa contig.fa
bowtie2 -q --fr -x contig.fa -1 reads_1.fq.gz -2 reads_2.fq.gz -S contig.sam -p 64
samtools view -h -b -S contig.sam -o contig.bam -@ 64
samtools view -b -F 4 contig.bam -o contig.mapped.bam -@ 64
samtools sort -m 1000000000 contig.mapped.bam -o contig.mapped.sorted.bam -@ 64
samtools index contig.mapped.sorted.bam
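The intermediate SAM and unsorted BAM files written by the commands above can be large. As an alternative sketch, the same steps can be chained with pipes so that only the final sorted BAM is written. The function name `map_reads` and the `THREADS` variable are assumptions made for illustration; the bowtie2 and samtools options are those used above (without the explicit `-m` memory setting), and `-` stands for standard input.

```shell
# Hypothetical helper (not part of SemiBin). The body runs in a subshell, so
# the 'set' options below do not leak into the calling shell.
# Note: 'pipefail' needs bash or another shell that supports it.
map_reads() (
    set -euo pipefail
    contigs=$1; reads1=$2; reads2=$3; out_bam=$4
    threads=${THREADS:-4}   # assumption: thread count taken from $THREADS

    bowtie2-build -f "$contigs" "$contigs"
    # bowtie2 writes SAM to standard output when -S is not given;
    # -F 4 drops unmapped reads before sorting.
    bowtie2 -q --fr -x "$contigs" -1 "$reads1" -2 "$reads2" -p "$threads" \
        | samtools view -b -F 4 -@ "$threads" - \
        | samtools sort -@ "$threads" -o "$out_bam" -
    samtools index "$out_bam"
)

# Example call, using the file names from this page:
# THREADS=64 map_reads contig.fa reads_1.fq.gz reads_2.fq.gz contig.mapped.sorted.bam
```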
Generate cannot-link constraints using CAT (advanced)
Note: Unless you understand exactly what is going on, you probably do not want to do this. Feel free to check in with us if you have doubts.
By default, SemiBin uses MMseqs2 for the taxonomic classification of contigs, but you can also use CAT to classify the contigs and generate the cannot-link pairs.
CAT contigs \
-c contig.fa \
-d CAT_prepare_20200304/2020-03-04_CAT_database \
--path_to_prodigal $path_to_prodigal \
--path_to_diamond $path_to_diamond \
-t CAT_prepare_20200304/2020-03-04_taxonomy \
-o CAT_output/CAT \
--force \
-f 0.5 \
--top 11 \
--I_know_what_Im_doing \
--index_chunks 1
CAT add_names \
CAT_output/CAT.contig2classification.txt \
-o CAT_output/CAT.out \
-t CAT_prepare_20200304/2020-03-04_taxonomy \
--force \
--only_official
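The `CAT contigs` command above reads the Prodigal and DIAMOND locations from the shell variables `$path_to_prodigal` and `$path_to_diamond`, which must be set beforehand. One way to do that is sketched below, assuming both programs are on your `PATH` under the names `prodigal` and `diamond` (adjust this to your installation). `require_tool` is a hypothetical helper.

```shell
# Hypothetical helper: print the full path of a program found on PATH,
# or report an error and return non-zero if it is missing.
require_tool() {
    command -v "$1" || {
        echo "error: '$1' not found on PATH" >&2
        return 1
    }
}

# Example (assumes the executables are named 'prodigal' and 'diamond'):
# path_to_prodigal=$(require_tool prodigal) || exit 1
# path_to_diamond=$(require_tool diamond) || exit 1
```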
Generate cannot-link constraints using the script generate_cannot_link.py in the scripts/ directory
python script/generate_cannot_link.py -i CAT.out -c contig.fa -s sample-name -o output --CAT