Usage

Single-sample binning

Input: S1.fa and S1.bam

Easy single binning mode

SemiBin single_easy_bin -i S1.fa -b S1.bam -o output 

Or with one of our built-in models (human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_gut/pig_gut/built_environment/wastewater/global)

SemiBin single_easy_bin -i S1.fa -b S1.bam -o output --environment human_gut
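
If the environment name comes from a variable in a wrapper script, it can help to check it against the list above before starting a run. The following is a minimal sketch: the helper name is_builtin_env and the variable SEMIBIN_ENVS are hypothetical (they are not part of SemiBin), and the list is simply copied from the line above. The example only prints the command rather than running it.

```shell
# Built-in model names, copied from the list in this section.
SEMIBIN_ENVS="human_gut dog_gut ocean soil cat_gut human_oral mouse_gut pig_gut built_environment wastewater global"

# Hypothetical helper: succeeds if $1 is one of the names in SEMIBIN_ENVS.
is_builtin_env() {
    for e in $SEMIBIN_ENVS; do
        [ "$e" = "$1" ] && return 0
    done
    return 1
}

env_name=human_gut
if is_builtin_env "$env_name"; then
    # Dry run: print the command instead of executing it.
    echo "SemiBin single_easy_bin -i S1.fa -b S1.bam -o output --environment $env_name"
else
    echo "unknown environment: $env_name" >&2
fi
```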

Advanced workflows

For single-sample and co-assembly binning, the basic workflow is:

  1. generate data.csv and data_split.csv (used in training) for every sample,
  2. train the model for every sample, and
  3. bin the contigs with the model trained from the same sample.

You can also run the steps individually, which lets you distribute them across a compute cluster and speed up binning.

In particular, single_easy_bin runs the following steps: generate_sequence_features_single, generate_cannot_links, train and bin; while multi_easy_bin runs the following steps: generate_sequence_features_multi, generate_cannot_links, train and bin.

(1) Generate data.csv/data_split.csv

SemiBin generate_sequence_features_single -i S1.fa -b S1.bam -o S1_output

(2) Generate cannot-link

SemiBin generate_cannot_links -i S1.fa -o S1_output

(3) Train

SemiBin train -i S1.fa --data S1_output/data.csv --data-split S1_output/data_split.csv -c S1_output/cannot/cannot.txt -o S1_output --mode single

(4) Bin

SemiBin bin -i S1.fa --model S1_output/model.h5 --data S1_output/data.csv -o output

Or with one of our built-in models (human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_gut/pig_gut/built_environment/wastewater/global)

SemiBin bin -i S1.fa --data S1_output/data.csv -o output --environment human_gut
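
The four steps above can be chained in a small script. The sketch below only prints the commands for one sample so they can be reviewed first; the function name semibin_single_steps is hypothetical, and it assumes the inputs are named <sample>.fa and <sample>.bam and that step (1) writes data.csv and data_split.csv into <sample>_output, as described above.

```shell
# Hypothetical helper: print the four advanced-workflow commands for one sample.
# Usage: semibin_single_steps SAMPLE   (expects SAMPLE.fa and SAMPLE.bam)
semibin_single_steps() {
    sample=$1
    dir="${sample}_output"
    # (1) features, (2) cannot-link, (3) train, (4) bin
    echo "SemiBin generate_sequence_features_single -i $sample.fa -b $sample.bam -o $dir"
    echo "SemiBin generate_cannot_links -i $sample.fa -o $dir"
    echo "SemiBin train -i $sample.fa --data $dir/data.csv --data-split $dir/data_split.csv -c $dir/cannot/cannot.txt -o $dir --mode single"
    echo "SemiBin bin -i $sample.fa --model $dir/model.h5 --data $dir/data.csv -o output"
}

# Dry run: show the commands for sample S1.
semibin_single_steps S1

# To actually run them in order, stopping at the first failure:
# semibin_single_steps S1 | sh -e
```

Steps (1) and (2) do not depend on each other, so on a cluster they could be submitted as separate jobs, with steps (3) and (4) started once both have finished.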

SemiBin(pretrain)

Alternatively, you can pre-train a model from part of your dataset. This is a compromise: it is faster than training a model for each sample, while achieving better results than a model pre-trained on another dataset (see the manuscript for more information).

For example, if you have S1.fa, S1/data.csv, S1/data_split.csv, S1/cannot/cannot.txt; S2.fa, S2/data.csv, S2/data_split.csv, S2/cannot/cannot.txt; and S3.fa, S3/data.csv, S3/data_split.csv, S3/cannot/cannot.txt, you can train a model from the 3 samples:

SemiBin train -i S1.fa S2.fa S3.fa --data S1/data.csv S2/data.csv S3/data.csv --data-split S1/data_split.csv S2/data_split.csv S3/data_split.csv -c S1/cannot/cannot.txt S2/cannot/cannot.txt S3/cannot/cannot.txt -o output --mode several
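
With more than a few samples, the argument lists for this command become long and easy to mistype. The sketch below assembles the command from a list of sample names. The function name semibin_train_several is hypothetical, and it assumes the per-sample layout listed above (<sample>.fa, <sample>/data.csv, <sample>/data_split.csv, <sample>/cannot/cannot.txt). It prints the command rather than running it.

```shell
# Hypothetical helper: print a `SemiBin train --mode several` command
# for the given sample names, keeping all four lists in the same order.
semibin_train_several() {
    inputs="" data="" split="" cannot=""
    for s in "$@"; do
        inputs="$inputs $s.fa"
        data="$data $s/data.csv"
        split="$split $s/data_split.csv"
        cannot="$cannot $s/cannot/cannot.txt"
    done
    # Each list already starts with a space, so no space follows the flag.
    echo "SemiBin train -i$inputs --data$data --data-split$split -c$cannot -o output --mode several"
}

# Dry run for three samples.
semibin_train_several S1 S2 S3
```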

Co-assembly binning

Input: contig.fa and S1.bam, S2.bam, S3.bam

Easy single binning mode

SemiBin single_easy_bin -i contig.fa -b S1.bam S2.bam S3.bam -o output

Advanced workflows

(1) Generate data.csv/data_split.csv

SemiBin generate_sequence_features_single -i contig.fa -b S1.bam S2.bam S3.bam -o contig_output

(2) Generate cannot-link

SemiBin generate_cannot_links -i contig.fa -o contig_output

(3) Train

SemiBin train -i contig.fa --data contig_output/data.csv --data-split contig_output/data_split.csv -c contig_output/cannot/cannot.txt -o contig_output --mode single

(4) Bin

SemiBin bin -i contig.fa --model contig_output/model.h5 --data contig_output/data.csv -o output

Multi-sample binning

Input: the original per-sample FASTA files (S1.fa, S2.fa, S3.fa, S4.fa, S5.fa), the combined FASTA file (combined.fa), and the BAM files (S1.bam, S2.bam, S3.bam, S4.bam, S5.bam)

Format of combined.fa: every contig is named <sample_name>:<contig_name>, where : is the default separator (it can be changed with the --separator argument). Note: make sure that the sample names are unique and that the separator splits every name unambiguously into a sample name and a contig name. For example:

>S1:Contig_1
AGATAATAAAGATAATAATA
>S1:Contig_2
CGAATTTATCTCAAGAACAAGAAAA
>S1:Contig_3
AAAAAGAGAAAATTCAGAATTAGCCAATAAAATA
>S2:Contig_1
AATGATATAATACTTAATA
>S2:Contig_2
AAAATATTAAAGAAATAATGAAAGAAA
>S3:Contig_1
ATAAAGACGATAAAATAATAAAAGCCAAATCCGACAAAGAAAGAACGG
>S3:Contig_2
AATATTTTAGAGAAAGACATAAACAATAAGAAAAGTATT
>S3:Contig_3
CAAAT
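
One way to produce a file in this format from the per-sample FASTA files is to prefix every header line with the sample name and the separator. The sketch below does this with awk. The function name combine_fasta is hypothetical, and it assumes the sample name is the file name without its .fa extension and that the separator is a plain character such as : (awk interprets backslash escapes in the value passed with -v).

```shell
# Hypothetical helper: prefix each FASTA header with "<sample><sep>" and
# write all records to standard output.
# Usage: combine_fasta SEP S1.fa S2.fa ... > combined.fa
combine_fasta() {
    sep=$1
    shift
    for fa in "$@"; do
        # Sample name = file name without directory and .fa suffix (assumption).
        sample=$(basename "$fa" .fa)
        # Header lines start with ">": insert the prefix after it.
        # All other lines (sequence) are printed unchanged.
        awk -v p="$sample$sep" '/^>/ { print ">" p substr($0, 2); next } { print }' "$fa"
    done
}

# Example (commented out because it needs the input files):
# combine_fasta : S1.fa S2.fa S3.fa S4.fa S5.fa > combined.fa
```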

Easy multi binning mode

SemiBin multi_easy_bin -i combined.fa -b S1.bam S2.bam S3.bam S4.bam S5.bam -o multi_output

Advanced workflows

(1) Generate data.csv/data_split.csv

SemiBin generate_sequence_features_multi -i combined.fa -b S1.bam S2.bam S3.bam S4.bam S5.bam -o multi_output -s :

(2) Generate cannot-link

SemiBin generate_cannot_links -i S1.fa -o S1_output
SemiBin generate_cannot_links -i S2.fa -o S2_output
SemiBin generate_cannot_links -i S3.fa -o S3_output
SemiBin generate_cannot_links -i S4.fa -o S4_output
SemiBin generate_cannot_links -i S5.fa -o S5_output

(3) Train

SemiBin train -i S1.fa --data multi_output/samples/S1/data.csv --data-split multi_output/samples/S1/data_split.csv -c S1_output/cannot/cannot.txt -o S1_output --mode single
SemiBin train -i S2.fa --data multi_output/samples/S2/data.csv --data-split multi_output/samples/S2/data_split.csv -c S2_output/cannot/cannot.txt -o S2_output --mode single
SemiBin train -i S3.fa --data multi_output/samples/S3/data.csv --data-split multi_output/samples/S3/data_split.csv -c S3_output/cannot/cannot.txt -o S3_output --mode single
SemiBin train -i S4.fa --data multi_output/samples/S4/data.csv --data-split multi_output/samples/S4/data_split.csv -c S4_output/cannot/cannot.txt -o S4_output --mode single
SemiBin train -i S5.fa --data multi_output/samples/S5/data.csv --data-split multi_output/samples/S5/data_split.csv -c S5_output/cannot/cannot.txt -o S5_output --mode single

(4) Bin

SemiBin bin -i S1.fa --model S1_output/model.h5 --data multi_output/samples/S1/data.csv -o S1_output
SemiBin bin -i S2.fa --model S2_output/model.h5 --data multi_output/samples/S2/data.csv -o S2_output
SemiBin bin -i S3.fa --model S3_output/model.h5 --data multi_output/samples/S3/data.csv -o S3_output
SemiBin bin -i S4.fa --model S4_output/model.h5 --data multi_output/samples/S4/data.csv -o S4_output
SemiBin bin -i S5.fa --model S5_output/model.h5 --data multi_output/samples/S5/data.csv -o S5_output
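
Steps (2) to (4) are independent between samples, so each sample can be processed as its own job. The sketch below prints one line per sample, chaining the three commands with &&. The function name semibin_multi_jobs is hypothetical. It assumes that step (1) has already written data.csv and data_split.csv under <multi_output>/samples/<sample>/, and, as a choice made for this sketch, it writes each sample's bins to <sample>_output so that results from different samples stay separate.

```shell
# Hypothetical helper: print one self-contained job line per sample.
# Usage: semibin_multi_jobs MULTI_OUTPUT_DIR SAMPLE...
semibin_multi_jobs() {
    multi_out=$1
    shift
    for s in "$@"; do
        feat="$multi_out/samples/$s"   # features from step (1)
        work="${s}_output"             # per-sample working directory
        # cannot-link, then train, then bin; && stops at the first failure.
        printf '%s && %s && %s\n' \
            "SemiBin generate_cannot_links -i $s.fa -o $work" \
            "SemiBin train -i $s.fa --data $feat/data.csv --data-split $feat/data_split.csv -c $work/cannot/cannot.txt -o $work --mode single" \
            "SemiBin bin -i $s.fa --model $work/model.h5 --data $feat/data.csv -o $work"
    done
}

# Dry run for the five samples used in this section.
semibin_multi_jobs multi_output S1 S2 S3 S4 S5
```

Each printed line can then be submitted to a scheduler as a separate job, or saved to a file and run one after another with sh.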