What's New

Unreleased

Drop support for Python 3.8. The minimum supported Python version is now 3.9.
Drop support for Python 3.9. The minimum supported Python version is now 3.10.
Use zip(..., strict=True) for paired-array iterations across clustering, marker handling, and feature generation so that silent length-mismatch bugs raise immediately.

Version 2.3.0

Released May 26, 2026

This release bundles a large number of bug fixes (several of which fix silent correctness issues in training and clustering), removes the deprecated SemiBin1 command, and includes many documentation improvements.

User-visible changes

Remove SemiBin1. Only SemiBin2 is shipped now; if you have old scripts that still call SemiBin1, they will need to be updated.
Accept hybrid as a synonym for long_read in --sequencing-type (#216)
Remove --environment parameter when learning a new model (#218)
check_install: now also looks for samtools and reports it (in verbose mode) when missing. This is a soft check: missing samtools does not cause check_install to fail, since samtools is only required for CRAM input.
--allow-missing-mmseqs2 is now marked as deprecated in the help text (the option is already a no-op since mmseqs2 is no longer required by check_install).
Data-generation log message now distinguishes between generating training data and generating features for inference with a pretrained model (#219)
Add Python 3.14 to the supported/tested Python versions (SemiBin now runs on Python 3.8–3.14)
Add the missing COPYING.MIT license file (#214)

Bug fixes

single_easy_bin: fix AttributeError: 'Namespace' object has no attribute 'training_type' when running with abundance files (-a) instead of BAM files. set_training_type() was only invoked on the BAM path, so the abundance-based single-sample workflow crashed before doing any work.
Semi-supervised training: fix wrong input in the unlabeled contrastive-learning branch. The second argument to model.forward() was unlabeled_train_input1 instead of unlabeled_train_input2, meaning both branches received identical data and the unlabeled-pair contrastive signal was being silently discarded.
Long-read clustering: clip zero depth values before np.log. Zero depth (no reads mapped) was producing -inf values, silently corrupting the embedding used for DBSCAN clustering.
Multi-sample binning: fix split_data matching sample names that are prefixes of others (e.g., S1 matching S10:contig). Now uses str.startswith() rather than str.contains().
Marker calling: validate the cached markers.hmmout against an input fingerprint (FASTA size/mtime, binned_length, ORF finder) before reusing it. Previously, reruns into an existing output directory could silently reuse stale marker calls from a different input or different parameters. On mismatch a warning is logged and markers are recomputed.
Long-read binning: remove stale output bins from previous runs before writing new results (previously, rerunning into the same output directory with fewer bins left old FASTA files behind).
Empty FASTA inputs now fail cleanly with Input file ... is empty. Please check inputs. instead of crashing with ZeroDivisionError (single-sample paths) or ValueError: attempt to get argmax of an empty sequence (multi-sample paths).
ORF finding: fix prodigal errors being silently ignored.
model_load: raise a clear ValueError for unknown model names instead of an UnboundLocalError.
check_install: fix incorrect logic when checking for prodigal.
Several minor error-message and validation improvements in utils.py and main.py.
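
The sample-name prefix issue in multi-sample binning can be sketched as follows. The helper belongs_to_sample and the ":" separator are assumptions for illustration (the separator is borrowed from the S10:contig example above); the actual split_data code may look different. The point is that the prefix test only distinguishes S1 from S10 when the separator is part of the prefix.

```python
def belongs_to_sample(contig_id: str, sample: str, separator: str = ":") -> bool:
    """Return True if contig_id is named '<sample><separator><contig>'."""
    # Including the separator in the prefix keeps 'S1' from matching 'S10:...'.
    return contig_id.startswith(sample + separator)

contig_ids = ["S1:contig_1", "S10:contig_1", "S1:contig_2"]

# A substring test is too permissive: 'S1' also occurs inside 'S10:contig_1'.
loose = [c for c in contig_ids if "S1" in c]

# The anchored prefix test keeps only the contigs from sample S1.
exact = [c for c in contig_ids if belongs_to_sample(c, "S1")]
```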

Documentation fixes

Fix outdated output descriptions in docs/output.md (default model type, missing bins_info.tsv/contig_bins.tsv, clarify output_bins is the final output)
Fix wrong split_contigs output filename in README strobealign-aemb example (split.fa → split_contigs.fna.gz)
Fix docs/aemb.md description of split_contigs.fna.gz contents (only split halves, not originals) and an undefined variable in the helper script
Remove references to --mode and --training-type flags no longer accepted by SemiBin2 in docs/subcommands.md
Update docs/semibin2.md and docs/index.md to reflect that only SemiBin2 is installed since v2.2
Clarify -b/-a are alternatives (not both required) in docs/subcommands.md for single_easy_bin, generate_sequence_features_single, and generate_sequence_features_multi
Fix docs/generate.md path from scripts/ to script/
Clarify Prodigal is optional in docs/install.md; add samtools to the source-install dependency list
Fix step numbering in docs/usage.md (steps went 1, 3, 4 → now 1, 2, 3)
Make it more prominent in the FAQ and usage docs how to handle hybrid (long+short) data
Fix Python version ranges in README and install docs
Fix truncated help text for --random-seed
Fix wrong DOI for SemiBin2 in CITATION.md
Remove incorrect --depth-metabat2 mention from generate_sequence_features_single docs
Remove duplicate HMMER entry and fix malformed Bedtools URL in README
Many other smaller documentation corrections (typos, grammar, broken links, incorrect option names)

Internal improvements

Self-supervised training reads each input CSV once instead of every epoch.
Fix unclosed file handle in run_prodigal.
Remove unused pyyaml dependency.
Avoid naked except: clauses (better error reporting and easier debugging).
Write validation errors to the log file (not just to stderr).
Various small refactors and additional type annotations.

Version 2.2.1

Released Dec 7, 2025

User-visible changes

Add support for SEMIBIN_DEBUG environment variable to enable debug logging (overrides command line flags)
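
A rough sketch of how an environment variable can take precedence over a command-line log level is shown below. The function pick_log_level is hypothetical, and treating any non-empty value as "enabled" is an assumption; the notes above only state that the variable enables debug logging and overrides the flags.

```python
import logging

def pick_log_level(cli_level: int, environ: dict) -> int:
    """Choose the log level; a set SEMIBIN_DEBUG wins over the CLI choice."""
    # Assumption: any non-empty value counts as "enabled".
    if environ.get("SEMIBIN_DEBUG"):
        return logging.DEBUG
    return cli_level

# A quiet CLI setting (modelled here as logging.ERROR) is overridden by the variable.
level_with_env = pick_log_level(logging.ERROR, {"SEMIBIN_DEBUG": "1"})
level_without_env = pick_log_level(logging.ERROR, {})
```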

Internal improvements and bugfixes

Fix for newer version of igraph (#208)

Version 2.2.0

Released Mar 20, 2025

This is a maintenance release with many small improvements rather than a single big new feature. Upgrading is recommended, but not crucial.

User-visible changes

Remove SemiBin command. Only SemiBin1 and SemiBin2 are available (and SemiBin1 is deprecated). The only reason to use SemiBin1 is if you have old scripts that use it. It will be removed in the next release.
Better logging: Always log to file in DEBUG level and log command-line arguments. Print version number in logs.
Better error messages in several instances
check_install: Prints out information on the GPU

Deprecations

SemiBin: Deprecate --prodigal-output-faa argument
No longer check for mmseqs in check_install (it is not a hard requirement)

Internal improvements and bugfixes

Respect the number of threads requested better (#140)
SemiBin: Better method to save the model which is more compatible with newer versions of PyTorch. Added a subcommand to update old models to the new format (update_model)
SemiBin: Switch to pixi for testing (and recommend it in the README/installation instructions)
Convert to pyproject.toml instead of setup.py
Do not fail if no bins are produced (#170 & #173)

Version 2.1.0

Released Mar 6, 2024

The main new feature is support for using the output of strobealign-aemb.

Use of the SemiBin command (instead of SemiBin2) will continue to work, but print a warning and set a delay to ask users to upgrade.

User-visible changes

Support running SemiBin with strobealign-aemb (--abundance/-a)
Add citation subcommand
Introduce separate SemiBin1 command as use of SemiBin is now deprecated and will trigger a warning

Internal improvements

Code simplification and refactor
deprecation: Deprecate --orf-finder=fraggenescan option
Update abundance normalization

Bugfixes

SemiBin: do not use more processes than can be taken advantage of (#155)

Version 2.0.2

Released Oct 31, 2023

Bugfix release

Fixes an issue with multi_easy_bin --write-pre-reclustering-bins (#128)

Version 2.0.1

Released Oct 21, 2023

This is a bugfix release for version 2.0.0.

Version 2.0.0

Released Oct 20, 2023

User-visible changes

Running SemiBin now writes a log file in the output directory
The concatenate_fasta subcommand now supports compression
Adds bin_short subcommand as alias for bin (by analogy with bin_long)

Version 1.5.1 (SemiBin2 beta)

Released Mar 7, 2023

Bugfixes

Fix use of --no-recluster with multi_easy_bin (#128).

Version 1.5.0 (SemiBin2 beta)

Released Jan 17, 2023

The big change is the addition of a SemiBin2 script, which is still experimental but should provide a slightly nicer interface. See [upgrading to SemiBin2]

User-visible improvements

Added a new option for ORF finding, called fast-naive which is an internal very fast implementation.
Added the possibility of bypassing ORF finding altogether by providing prodigal outputs directly (or any other gene prediction in the right format)
Command line argument checking is more exhaustive instead of exiting at first error
Added --quiet flag to reduce the amount of output printed
Better --help (group required arguments separately)
Add --output-compression option to compress outputs
Add --tag-output option, which allows control of the output filenames (and also makes the output anvi'o compatible; see discussion at #123).
Add contig->bin mapping table (#123)
SemiBin.main.main1 and SemiBin.main.main2 can now be called as a function with command line arguments (main1 corresponds to SemiBin1 and main2 corresponds to SemiBin2)

import SemiBin.main

...

SemiBin.main.main2(['single_easy_bin', '--input-fasta', ...])

Version 1.4.0: long reads binning!

Released December 15, 2022

The big change is the addition of a binning algorithm for assemblies from long-read datasets.

The overall structure of the pipeline is still similar to what was described in the manuscript, but the clustering step does not use infomap; it uses another procedure (an iterative version of DBSCAN).

Use the flag --sequencing-type=long_read to enable an alternative clustering that works better with long reads.

Other user-visible improvements

Better error checking at multiple steps in the pipeline so that processes that will crash are caught as early as possible
Add --allow-missing-mmseqs2 flag to check_install subcommand (eventually, self-supervision will be the default and mmseqs2 will be an optional dependency)

Command line parameter deprecations

The previous arguments should continue to work, but going forward, the newer arguments are probably a better API.

Selecting self-supervised learning is now done with the --self-supervised flag (instead of --training-type=self)
Training from multiple samples is now enabled with the --train-from-many flag (instead of --mode=several)

Bugfixes

The output table sometimes had the wrong path in v1.3. This has been fixed
Prodigal is now run in a more robust manner when using multiple threads (#106)

Version 1.3.1

Released December 9, 2022

Bugfixes

Made --training-type argument optional (defaults to semi to keep backwards compatibility)

Version 1.3.0

Released November 4 2022

User visible improvements

Added self-supervised learning mode (see [Training SemiBin models] for more details)

Bugfixes

Fix output table to contain correct paths
Fix misspelling in argument name --epochs (the old variation, --epoches, is still accepted for backwards compatibility, but should be considered deprecated)

Version 1.2.0

Released October 19 2022

User visible improvements

Pretrained model from chicken caecum (contributed by Florian Plaza Oñate)
Output table with basic information on bins (including N50 & L50)
When reclustering is used (default), output the unreclustered bins into a directory called output_prerecluster_bins
Added --verbose flag and silenced some of the output when it is not used
Use coloredlogs (if package is available)
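
For readers unfamiliar with the N50 and L50 statistics in the bin table, the sketch below computes them using their standard definitions. The function n50_l50 is illustrative only; how SemiBin itself computes these columns is not shown in these notes.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            # N50 is the length of the contig that crosses the halfway point,
            # L50 is how many contigs were needed to get there.
            return length, count

example = n50_l50([80, 70, 50, 40, 30, 20])
```

Here the total is 290 bp, and the two longest contigs (80 + 70 = 150 bp) are the first to cover at least half of it, so N50 is 70 and L50 is 2.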

Version 1.1.1

Released September 27 2022

Bugfixes

Completely remove use of atomicwrites package (#97)

Version 1.1.0

Released September 21 2022

User-visible improvements

Support .cram format input (#104)
Support using depth file from Metabat2 (#103)
More flexible specification of prebuilt models (case insensitive, normalize - and _)
Better output message when no bins are produced
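
The more flexible model names can be pictured with the following sketch. The function normalize_model_name and the exact spelling mouse_gut are assumptions made for illustration; the notes above only say that matching is case insensitive and that - and _ are normalized.

```python
def normalize_model_name(name: str) -> str:
    """Lower-case a model name and treat '-' and '_' as the same character."""
    return name.lower().replace("-", "_")

# Different spellings a user might type for the same (hypothetical) model name.
variants = ["Mouse-Gut", "mouse_gut", "MOUSE_GUT"]
normalized = {normalize_model_name(v) for v in variants}
```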

Bugfixes

Fix bug using atomicwrite on certain network filesystems (#97)

Internal improvements

Remove torch version restriction (and test on Python 3.10)

Version 1.0.3

Released August 3 2022

Bugfixes

Fix coverage parsing when value is not an integer (#103)
Fix multi_easy_bin with taxonomy file given on the command line (see discussion at #102)

Version 1.0.2

Released July 8 2022

Bugfixes

Fix (#93) more thoroughly (#101)

Version 1.0.1

Released May 9 2022

Bugfixes

Fix edge case when calling prodigal with more threads than contigs (#93)

Version 1.0.0

Released April 29 2022

This coincides with the publication of the manuscript.

User-visible improvements

More balanced file split when calling prodigal in parallel should take better advantage of multiple threads
Fix bug when long stretches of Ns are present (#87)
Better error messages (#90 & #91)

Bugfixes

Fix bugs in training from multiple samples
Fix bug in incorporating CAT results

Version 0.7

Released March 2 2022

This release solves issues running on Mac OS X.

User-visible improvements

Improved check_install command: it now prints out paths and correctly handles optionality of FragGeneScan/prodigal
Add concatenate_fasta command to combine fasta files for multi-sample binning
Add option --tmpdir to set temporary directory
Substitute FragGeneScan with Prodigal (FragGeneScan can still be used with --orf-finder parameter). FragGeneScan caused issues, especially on Mac OSX

Internal improvements

Reuse markers.hmmout file to make the training from several samples faster

Version 0.6

Released February 7 2022

User-visible improvements

Provide pretrained models from soil, cat gut, human oral, pig gut, mouse gut, built environment, wastewater and global (training from all samples).
Users can now pass in the output of running mmseqs2 directly and SemiBin will use that instead of calling mmseqs itself (use option --taxonomy-annotation-table).
The subcommand to generate cannot links is now called generate_cannot_links. The old name (predict_taxonomy) is kept as a deprecated alias.
Similarly, sequence features (k-mer and abundance) are generated using the commands generate_sequence_features_single and generate_sequence_features_multi (for single- and multi-sample modes, respectively). The old names (generate_data_single/generate_data_multi) are kept as deprecated aliases.
Add check_install command and run check_install before easy command

Bugfixes

Fix bug with non-standard characters in sample names (#68).

Version 0.5

Released January 7 2022

User-visible improvements

Reclustering is now the default (use --no-recluster to disable it; the option --recluster is deprecated and ignored) as the computational costs are much lower
GTDB lazy downloading is now performed even if a non-standard directory is used
The CACHEDIR.TAG protocol was implemented (this is supported by several tools that perform tasks such as backups).
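
For context, the Cache Directory Tagging convention marks a directory as a cache by placing a CACHEDIR.TAG file in it whose first line is a fixed signature. The sketch below shows the idea; the function tag_cache_directory and the comment text written to the file are illustrative and not taken from the SemiBin code.

```python
import os
import tempfile

# Fixed header required by the Cache Directory Tagging convention.
CACHEDIR_TAG_SIGNATURE = "Signature: 8a477f597d28d172789f06886806bc55"

def tag_cache_directory(path: str) -> str:
    """Create CACHEDIR.TAG in path (if missing) and return its filename."""
    tag_file = os.path.join(path, "CACHEDIR.TAG")
    if not os.path.exists(tag_file):
        with open(tag_file, "w") as out:
            out.write(CACHEDIR_TAG_SIGNATURE + "\n")
            out.write("# This directory holds cached data that can be re-downloaded.\n")
    return tag_file

# Demonstrate on a throwaway directory.
with tempfile.TemporaryDirectory() as cache_dir:
    tag = tag_cache_directory(cache_dir)
    with open(tag) as handle:
        first_line = handle.readline().rstrip("\n")
```

Backup tools that understand the convention check for this signature and can then skip the directory.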

Bugfixes

Fix bug with --min-len (minimal length). Previously, only contigs strictly longer than the given minimal length were used (instead of contigs with length greater than or equal to it).
GTDB downloading was inconsistent in a few instances which have been fixed
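
The --min-len boundary fix amounts to switching from a strict to an inclusive comparison. A minimal sketch follows, with a hypothetical helper keep_contigs and made-up contig lengths:

```python
def keep_contigs(lengths, min_len):
    """Keep contigs whose length is at least min_len (inclusive bound)."""
    return [n for n in lengths if n >= min_len]

lengths = [999, 1000, 1001]

# Strict comparison (the old behaviour) drops the contig of exactly 1000 bp.
old_behaviour = [n for n in lengths if n > 1000]

# Inclusive comparison (the fixed behaviour) keeps it.
new_behaviour = keep_contigs(lengths, 1000)
```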

Internal improvements

Much more efficient code (including lower memory usage) for binning, especially if a pretrained model is used. As an example, using a deeply-sequenced ocean sample, generating the data (generate_data_single step) goes down from 14 to 9 minutes, while binning (bin step, using --recluster) goes down from 10m17s (using 20 GB of RAM at peak) to 4m33s (using 4.5 GB at peak). Thus, total time from BAM file to bins went down from 25 to 14 minutes (using 4 threads) and peak RAM is now 4.5 GB, making it usable on a typical laptop.

Version 0.4.0

Released 27 October 2021

User-visible improvements

Add support for .xz FASTA files as input

Internal improvements

Removed BioPython dependency

Bug fixes

Fix bug when uncompressing FASTA files (#42)
Fix bug when splitting data

Version 0.3

Released 10 August 2021

User-visible improvements

Support training from several samples
Remove output_bin_path if output_bin_path exists
Make several internal parameters configurable: (1) minimum length of contigs to bin (--min-len parameter); (2) minimum length of contigs to break up in order to generate must-link constraints (--ml-threshold parameter); (3) the ratio used to choose the default minimal length: if the proportion of base pairs in contigs between 1000 and 2500 bp is smaller than this value, the minimal length is set to 1000 bp, otherwise to 2500 bp (--ratio parameter).
Add -p argument for predict_taxonomy mode

Internal improvements

Better code overall
Fix np.concatenate warning
Remove redundant matrix when clustering
Better pretrained models
Faster depth calculation using NumPy
Use correct number of threads in kneighbors_graph()

Bugfixes

Respect number of threads (-p argument) when training (issue 34)

Version 0.2

Released 27 May 2021

User-visible improvements

Change name to SemiBin
Add support for training with several samples
Test with Python 3.9
Download mmseqs database with --remove-tmp-file 1
Better output names
Fix bugs when paths have spaces
Fix installation issues by listing all the dependencies
Add download_GTDB command
Add --recluster option
Add --environment option
Add --mode option

Internal improvements

All around more robust code by including more error checking & testing
Better built-in models

Version 0.1.1

Released 21 March 2021

Bugfix release fixing an issue with minfasta-kbs

Version 0.1

Released 21 March 2021

First release: testing version