SemiBin

SemiBin is a command-line tool for metagenomic binning with a semi-supervised siamese neural network that uses additional information from reference genomes and from the contigs themselves. It reconstructs bins in one of three modes: single-sample, co-assembly, or multi-sample binning.

Single sample binning

Single sample binning means that each sample is assembled and binned independently.

This mode allows for parallel binning of samples and avoids cross-sample chimeras, but it does not use co-abundance information across samples.

Co-assembly binning

Co-assembly binning means samples are co-assembled first (as if the pool of samples were a single sample) and binned later.

This mode can generate better contigs (especially from species that are at a low abundance in any individual sample) and can use co-abundance information. However, co-assembly can lead to inter-sample chimeric contigs, and binning based on co-assembly does not retain sample-specific variation. It is appropriate when the samples are very similar.

Multi-sample binning

With multi-sample binning, multiple samples are assembled and binned individually, but information from multiple samples is used together. This mode can use co-abundance information and retain sample-specific variation at the same time. However, it has increased computational costs.

This mode is implemented by concatenating the contigs assembled from the individual samples together and then mapping reads from each sample to this concatenated database.
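The concatenation step described above can be sketched in Python. This is a minimal illustration, not SemiBin's own code: the function name `concatenate_samples`, the `<sample><separator><contig>` naming scheme, and the default separator `:` are assumptions made for the example. The separator plays the role of the -s/--separator option listed for multi_easy_bin.

```python
def concatenate_samples(sample_fastas, out_path, separator=":"):
    """Write one FASTA whose contig names are '<sample><separator><contig>'.

    sample_fastas maps a sample name to the path of that sample's assembly.
    The naming scheme and default separator are assumptions for this sketch.
    """
    with open(out_path, "w") as out:
        for sample, fasta in sample_fastas.items():
            if separator in sample:
                # Otherwise the sample name could not be split off again later.
                raise ValueError(f"sample name {sample!r} contains the separator")
            with open(fasta) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Keep only the contig ID (text before the first space).
                        fields = line[1:].split()
                        contig = fields[0] if fields else ""
                        out.write(f">{sample}{separator}{contig}\n")
                    else:
                        out.write(line + "\n")
```

Reads from every sample would then be mapped to the resulting file, giving one BAM file per sample against the same set of contigs.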

Commands

single_easy_bin

Reconstruct bins with single-sample or co-assembly binning using a single command.
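As an illustration, the following Python sketch assembles a single_easy_bin command line from the options named in this document (--input-fasta, --input-bam, --output, --processes). The executable name `SemiBin`, the helper name, and all file names are assumptions for the example. The sketch only builds and prints the command and does not run it.

```python
import shlex


def single_easy_bin_command(contigs, bams, output_dir, processes=None):
    """Return the argument list for a single_easy_bin run (hypothetical helper)."""
    argv = [
        "SemiBin", "single_easy_bin",   # executable name is an assumption
        "--input-fasta", contigs,       # assembled contigs
        "--input-bam", *bams,           # reads mapped to those contigs
        "--output", output_dir,
    ]
    if processes is not None:
        argv += ["--processes", str(processes)]
    return argv


# Placeholder file names; replace with real paths.
cmd = single_easy_bin_command(
    "contigs.fa", ["sample.sorted.bam"], "semibin_out", processes=4
)
print(shlex.join(cmd))
```

To execute the command, the list could be passed to `subprocess.run(cmd, check=True)`, provided SemiBin is installed and on the PATH.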

multi_easy_bin

Reconstruct bins with multi-sample binning using a single command.

The following options (including synonyms) are the same as for single_easy_bin: --input-fasta, --output, --reference-db, --processes, --minfasta-kbs, --recluster, --epoches, --batch-size, --max-node, --max-edges, --random-seed, --ratio, --min-len, and --ml-threshold.
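A multi_easy_bin call differs mainly in its inputs: a FASTA combining the contigs of all samples, one BAM file per sample, and the separator used in the contig names. The Python sketch below builds such a command from the options named in this document (--input-fasta, --input-bam, --separator, --output). The executable name `SemiBin`, the helper name, the file names, and the separator value `:` are assumptions for the example.

```python
import shlex


def multi_easy_bin_command(combined_fasta, bams, output_dir, separator=":"):
    """Return the argument list for a multi_easy_bin run (hypothetical helper)."""
    return [
        "SemiBin", "multi_easy_bin",     # executable name is an assumption
        "--input-fasta", combined_fasta, # contigs of all samples in one file
        "--input-bam", *bams,            # one BAM per sample
        "--separator", separator,        # must match the contig naming used
        "--output", output_dir,
    ]


# Placeholder file names; replace with real paths.
multi_cmd = multi_easy_bin_command(
    "combined.fa", ["S1.sorted.bam", "S2.sorted.bam"], "semibin_multi_out"
)
print(shlex.join(multi_cmd))
```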

predict_taxonomy

Annotate contigs using mmseqs against GTDB and generate the cannot-link file used to train the semi-supervised deep learning model.

The following options are the same as for single_easy_bin: -i/--input-fasta, -o/--output, --cannot-name, -r/--reference-db, --ratio, --min-len and --ml-threshold.

generate_data_single

Generate training data (data.csv and data_split.csv) for single-sample and co-assembly binning.

The following options are the same as for single_easy_bin: -i/--input-fasta, -b/--input-bam, -o/--output, -p/--processes/-t/--threads, --ratio, --min-len, --ml-threshold.

generate_data_multi

Generate training data (data.csv and data_split.csv) for multi-sample binning.

The following options are the same as for single_easy_bin: -i/--input-fasta, -o/--output, -p/--processes/-t/--threads, --ratio, --min-len, --ml-threshold.

The following options are the same as for multi_easy_bin: -b/--input-bam, -s/--separator.

train

Train the model.

The following options are the same as for single_easy_bin: -i/--input-fasta, -o/--output, --epoches, --batch-size, -p/--processes/-t/--threads, --random-seed, --ratio, --min-len.

bin

Cluster contigs into bins.

The following options are the same as for single_easy_bin: -i/--input-fasta, -o/--output, --minfasta-kbs, --recluster, --max-node, --max-edges, -p/--processes/-t/--threads, --random-seed, --environment, --ratio, --min-len.

download_GTDB

Download reference genomes (GTDB).