What's New
Unreleased
- Drop support for Python 3.8. The minimum supported Python version is now 3.9.
- Drop support for Python 3.9. The minimum supported Python version is now 3.10.
- Use zip(..., strict=True) for paired-array iterations across clustering, marker handling, and feature generation so that silent length-mismatch bugs raise immediately.
Version 2.3.0
Released May 26, 2026
This release bundles a large number of bug fixes (several of which fix silent
correctness issues in training and clustering), removes the deprecated
SemiBin1 command, and includes many documentation improvements.
User-visible changes
- Remove SemiBin1. Only SemiBin2 is shipped now; if you have old scripts that still call SemiBin1, they will need to be updated.
- Accept hybrid as a synonym for long_read in --sequencing-type (#216)
- Remove --environment parameter when learning a new model (#218)
- check_install: now also looks for samtools and reports it (in verbose mode) when missing. This is a soft check: missing samtools does not cause check_install to fail, since samtools is only required for CRAM input.
- --allow-missing-mmseqs2 is now marked as deprecated in the help text (the option is already a no-op since mmseqs2 is no longer required by check_install).
- Data-generation log message now distinguishes between generating training data and generating features for inference with a pretrained model (#219)
- Add Python 3.14 to the supported/tested Python versions (SemiBin now runs on Python 3.8–3.14)
- Add the missing COPYING.MIT license file (#214)
Bug fixes
- single_easy_bin: fix AttributeError: 'Namespace' object has no attribute 'training_type' when running with abundance files (-a) instead of BAM files. set_training_type() was only invoked on the BAM path, so the abundance-based single-sample workflow crashed before doing any work.
- Semi-supervised training: fix wrong input in the unlabeled contrastive-learning branch. The second argument to model.forward() was unlabeled_train_input1 instead of unlabeled_train_input2, meaning both branches received identical data and the unlabeled-pair contrastive signal was being silently discarded.
- Long-read clustering: clip zero depth values before np.log. Zero depth (no reads mapped) was producing -inf values, silently corrupting the embedding used for DBSCAN clustering.
- Multi-sample binning: fix split_data matching sample names that are prefixes of others (e.g., S1 matching S10:contig). Now uses str.startswith() rather than str.contains().
- Marker calling: validate the cached markers.hmmout against an input fingerprint (FASTA size/mtime, binned_length, ORF finder) before reusing it. Previously, reruns into an existing output directory could silently reuse stale marker calls from a different input or different parameters. On mismatch a warning is logged and markers are recomputed.
- Long-read binning: remove stale output bins from previous runs before writing new results (previously, rerunning into the same output directory with fewer bins left old FASTA files behind).
- Empty FASTA inputs now fail cleanly with "Input file ... is empty. Please check inputs." instead of crashing with ZeroDivisionError (single-sample paths) or ValueError: attempt to get argmax of an empty sequence (multi-sample paths).
- ORF finding: fix prodigal errors being silently ignored.
- model_load: raise a clear ValueError for unknown model names instead of an UnboundLocalError.
- check_install: fix incorrect logic when checking for prodigal.
- Several minor error-message and validation improvements in utils.py and main.py.
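The sample-name prefix issue in split_data can be reproduced in a few lines. This is only a sketch on plain Python strings: the helper name belongs_to_sample is hypothetical, and it assumes that the ":" separator (taken from the S10:contig example above) is included in the prefix being tested, which the entry itself does not spell out.

```python
def belongs_to_sample(contig_id: str, sample: str, sep: str = ":") -> bool:
    """Return True only if contig_id has the form '<sample><sep><contig>'."""
    # Including the separator in the prefix keeps 'S1' from matching 'S10:...'.
    return contig_id.startswith(sample + sep)


contig_ids = ["S1:contig_1", "S10:contig_1", "S1:contig_2"]

# Substring matching (the buggy behaviour): 'S1' is also found in 'S10:contig_1'.
loose = [c for c in contig_ids if "S1" in c]

# Anchored prefix matching keeps the two samples apart.
anchored = [c for c in contig_ids if belongs_to_sample(c, "S1")]
```

With substring matching all three contigs are assigned to S1; with the anchored prefix only the two S1 contigs are.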
Documentation fixes
- Fix outdated output descriptions in docs/output.md (default model type, missing bins_info.tsv/contig_bins.tsv, clarify output_bins is the final output)
- Fix wrong split_contigs output filename in README strobealign-aemb example (split.fa → split_contigs.fna.gz)
- Fix docs/aemb.md description of split_contigs.fna.gz contents (only split halves, not originals) and an undefined variable in the helper script
- Remove references to --mode and --training-type flags no longer accepted by SemiBin2 in docs/subcommands.md
- Update docs/semibin2.md and docs/index.md to reflect that only SemiBin2 is installed since v2.2
- Clarify -b/-a are alternatives (not both required) in docs/subcommands.md for single_easy_bin, generate_sequence_features_single, and generate_sequence_features_multi
- Fix docs/generate.md path from scripts/ to script/
- Clarify Prodigal is optional in docs/install.md; add samtools to the source-install dependency list
- Fix step numbering in docs/usage.md (steps went 1, 3, 4 → now 1, 2, 3)
- Make it more prominent in the FAQ and usage docs how to handle hybrid (long+short) data
- Fix Python version ranges in README and install docs
- Fix truncated help text for --random-seed
- Fix wrong DOI for SemiBin2 in CITATION.md
- Remove incorrect --depth-metabat2 mention from generate_sequence_features_single docs
- Remove duplicate HMMER entry and fix malformed Bedtools URL in README
- Many other smaller documentation corrections (typos, grammar, broken links, incorrect option names)
Internal improvements
- Self-supervised training reads each input CSV once instead of every epoch.
- Fix unclosed file handle in run_prodigal.
- Remove unused pyyaml dependency.
- Avoid naked except: clauses (better error reporting and easier debugging).
- Write validation errors to the log file (not just to stderr).
- Various small refactors and additional type annotations.
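To show why naked except: clauses hurt error reporting, here is a small generic sketch (not SemiBin code; all function names are hypothetical). A bare except: also catches SystemExit and KeyboardInterrupt and discards the reason for the failure, whereas except Exception keeps the error details and lets exit requests through.

```python
def run_step_naked(step):
    try:
        step()
    except:  # noqa: E722 - catches everything, including SystemExit
        return "failed"
    return "ok"


def run_step_narrow(step):
    try:
        step()
    except Exception as e:  # SystemExit/KeyboardInterrupt are not caught here
        return f"failed: {type(e).__name__}: {e}"
    return "ok"


def exit_step():
    raise SystemExit(2)


def bad_step():
    raise ValueError("bad input")


swallowed = run_step_naked(exit_step)   # the exit request is silently swallowed
reported = run_step_narrow(bad_step)    # the error type and message are kept
```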
Version 2.2.1
Released Dec 7, 2025
User-visible changes
- Add support for SEMIBIN_DEBUG environment variable to enable debug logging (overrides command line flags)
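One way an environment variable can take precedence over command-line flags is sketched below. This is not SemiBin's implementation: the function name, the mapping of flags to log levels, and the rule that any non-empty value enables debugging are all assumptions made for illustration.

```python
import logging
import os


def effective_log_level(verbose: bool, quiet: bool, environ=os.environ) -> int:
    """Pick a log level; the environment variable wins over the flags."""
    # Assumption: any non-empty value of SEMIBIN_DEBUG enables debug logging.
    if environ.get("SEMIBIN_DEBUG"):
        return logging.DEBUG
    # Hypothetical flag handling, shown only to make the override visible.
    if quiet:
        return logging.WARNING
    if verbose:
        return logging.DEBUG
    return logging.INFO


# Even with the quiet flag set, the environment variable forces DEBUG.
level = effective_log_level(verbose=False, quiet=True,
                            environ={"SEMIBIN_DEBUG": "1"})
```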
Internal improvements and bugfixes
- Fix for newer version of igraph (#208)
Version 2.2.0
Released Mar 20, 2025
This is a maintenance release with many small improvements rather than a single big new feature. Upgrading is recommended, but not crucial.
User-visible changes
- Remove SemiBin command. Only SemiBin1 and SemiBin2 are available (and SemiBin1 is deprecated). The only reason to use SemiBin1 is if you have old scripts that use it. It will be removed in the next release.
- Better logging: Always log to file in DEBUG level and log command-line arguments. Print version number in logs.
- Better error messages in several instances
- check_install: Prints out information on the GPU
Deprecations
- SemiBin: Deprecate --prodigal-output-faa argument
- No longer check for mmseqs in check_install (it is not a hard requirement)
Internal improvements and bugfixes
- Respect the number of threads requested better (#140)
- SemiBin: Better method to save the model which is more compatible with newer versions of PyTorch. Added a subcommand to update old models to the new format (update_model)
- SemiBin: Switch to pixi for testing (and recommend it in the README/installation instructions)
- Convert to pyproject.toml instead of setup.py
- Do not fail if no bins are produced (#170 & #173)
Version 2.1.0
Released Mar 6, 2024
Main new feature is adding support for using output of strobealign-aemb.
Use of the SemiBin command (instead of SemiBin2) will continue to work, but
print a warning and set a delay to ask users to upgrade.
User-visible changes
- Support running SemiBin with strobealign-aemb (--abundance/-a)
- Add citation subcommand
- Introduce separate SemiBin1 command as use of SemiBin is now deprecated and will trigger a warning
Internal improvements
- Code simplification and refactor
- deprecation: Deprecate --orf-finder=fraggenescan option
- Update abundance normalization
Bugfixes
- SemiBin: do not use more processes than can be taken advantage of #155
Version 2.0.2
Released Oct 31, 2023
Bugfix release
Fixes issue with multi_easy_bin --write-pre-reclustering-bins #128 on GH
Version 2.0.1
Released Oct 21, 2023
This is a bugfix release for version 2.0.0.
Version 2.0.0
Released Oct 20, 2023
User-visible changes
- Running SemiBin now writes a log file in the output directory
- The concatenate_fasta subcommand now supports compression
- Adds bin_short subcommand as alias for bin (by analogy with bin_long)
Version 1.5.1 (SemiBin2 beta)
Released Mar 7, 2023
Bugfixes
- Fix use of --no-recluster with multi_easy_bin (#128).
Version 1.5.0 (SemiBin2 beta)
Released Jan 17, 2023
Big change is the addition of a SemiBin2 script, which is still experimental, but should be a slightly nicer interface.
See [upgrading to SemiBin2]
User-visible improvements
- Added a new option for ORF finding, called fast-naive, which is a very fast internal implementation.
- Added the possibility of bypassing ORF finding altogether by providing prodigal outputs directly (or any other gene prediction in the right format)
- Command line argument checking is more exhaustive instead of exiting at first error
- Added --quiet flag to reduce the amount of output printed
- Better --help (group required arguments separately)
- Add --output-compression option to compress outputs
- Add --tag-output option which allows for control of the output filenames (and also makes them anvi'o compatible; see discussion at #123)
- Add contig->bin mapping table (#123)
- SemiBin.main.main1 and SemiBin.main.main2 can now be called as a function with command line arguments (main1 corresponds to SemiBin1 and main2 corresponds to SemiBin2)
import SemiBin.main
...
SemiBin.main.main2(['single_easy_bin', '--input-fasta', ...])
Version 1.4.0: long reads binning!
Released December 15, 2022
Big change is the added binning algorithm for assemblies from long-read datasets.
The overall structure of the pipeline is still similar to what was described in the manuscript, but when clustering, it does not use infomap, but another procedure (an iterative version of DBSCAN).
Use the flag --sequencing-type=long_read to enable an alternative clustering that works better with long reads.
Other user-visible improvements
- Better error checking at multiple steps in the pipeline so that processes that will crash are caught as early as possible
- Add --allow-missing-mmseqs2 flag to check_install subcommand (eventually, self-supervision will be the default and mmseqs2 will be an optional dependency)
Command line parameter deprecations
The previous arguments should continue to work, but going forward, the newer arguments are probably a better API.
- Selecting self-supervised learning is now done with the --self-supervised flag (instead of --training-type=self)
- Training from multiple samples is now enabled with the --train-from-many flag (instead of --mode=several)
Bugfixes
- The output table sometimes had the wrong path in v1.3. This has been fixed
- Prodigal is now run in a more robust manner when using multiple threads (#106)
Version 1.3.1
Released December 9, 2022
Bugfixes
- Made --training-type argument optional (defaults to semi to keep backwards compatibility)
Version 1.3.0
Released November 4 2022
User visible improvements
- Added self-supervised learning mode (see [Training SemiBin models] for more details)
Bugfixes
- Fix output table to contain correct paths
- Fix misspelling in argument name --epochs (the old variation, --epoches, is still accepted for backwards compatibility, but should be considered deprecated)
Version 1.2.0
Released October 19 2022
User visible improvements
- Pretrained model from chicken caecum (contributed by Florian Plaza Oñate)
- Output table with basic information on bins (including N50 & L50)
- When reclustering is used (default), output the unreclustered bins into a directory called output_prerecluster_bins
- Added --verbose flag and silenced some of the output when it is not used
- Use coloredlogs (if package is available)
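For readers unfamiliar with the N50 and L50 statistics reported in the bin table, the sketch below computes them from a list of contig lengths using the standard definitions: N50 is the length of the contig at which the cumulative length, going from longest to shortest, first reaches half of the total, and L50 is the number of contigs needed to get there. The function name is hypothetical and the code is independent of how SemiBin computes the table.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half of the total is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count


# Total length is 290; the two longest contigs (100 + 80) already cover half.
n50, l50 = n50_l50([100, 80, 50, 30, 20, 10])  # (80, 2)
```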
Version 1.1.1
Released September 27 2022
Bugfixes
- Completely remove use of atomicwrites package (#97)
Version 1.1.0
Released September 21 2022
User-visible improvements
- Support .cram format input (#104)
- Support using depth file from Metabat2 (#103)
- More flexible specification of prebuilt models (case insensitive, normalize - and _)
- Better output message when no bins are produced
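The more flexible model-name matching can be pictured as a normalization step before lookup. The sketch below is an assumption about the general approach, not the actual SemiBin code: the function name and the small set of model names are hypothetical placeholders.

```python
# Hypothetical set of canonical model names, for illustration only.
KNOWN_MODELS = {"human_gut", "mouse_gut", "global"}


def normalize_model_name(name: str) -> str:
    """Map user input such as 'Human-Gut' to a canonical name like 'human_gut'."""
    # Case-insensitive, and '-' is treated the same as '_'.
    canonical = name.lower().replace("-", "_")
    if canonical not in KNOWN_MODELS:
        raise ValueError(f"Unknown model: {name!r}")
    return canonical


model = normalize_model_name("Human-Gut")  # 'human_gut'
```

Raising a ValueError for names that remain unknown after normalization gives the user a clear message rather than a failure later in the pipeline.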
Bugfixes
- Fix bug using atomicwrites on certain network filesystems (#97)
Internal improvements
- Remove torch version restriction (and test on Python 3.10)
Version 1.0.3
Released August 3 2022
Bugfixes
- Fix coverage parsing when value is not an integer (#103)
- Fix multi_easy_bin with taxonomy file given on the command line (see discussion at #102)
Version 1.0.2
Released July 8 2022
Bugfixes
Version 1.0.1
Released May 9 2022
Bugfixes
- Fix edge case when calling prodigal with more threads than contigs (#93)
Version 1.0.0
Released April 29 2022
This coincides with the publication of the manuscript.
User-visible improvements
- More balanced file split when calling prodigal in parallel should take better advantage of multiple threads
- Fix bug when long stretches of Ns are present (#87)
- Better error messages (#90 & #91)
Bugfixes
- Fix bugs in training from multiple samples
- Fix bug in incorporating CAT results
Version 0.7
Released March 2 2022
This release solves issues running on Mac OS X.
User-visible improvements
- Improved check_install command: it now prints out paths and correctly handles optionality of FragGeneScan/prodigal
- Add concatenate_fasta command to combine fasta files for multi-sample binning
- Add option --tmpdir to set temporary directory
- Substitute FragGeneScan with Prodigal (FragGeneScan can still be used with --orf-finder parameter). FragGeneScan caused issues, especially on Mac OS X
Internal improvements
- Reuse markers.hmmout file to make the training from several samples faster
Version 0.6
Released February 7 2022
User-visible improvements
- Provide pretrained models from soil, cat gut, human oral, pig gut, mouse gut, built environment, wastewater and global (training from all samples).
- Users can now pass in the output of running mmseqs2 directly and SemiBin will use that instead of calling mmseqs itself (use option --taxonomy-annotation-table).
- The subcommand to generate cannot links is now called generate_cannot_links. The old name (predict_taxonomy) is kept as a deprecated alias.
- Similarly, sequence features (k-mer and abundance) are generated using the commands generate_sequence_features_single and generate_sequence_features_multi (for single- and multi-sample modes, respectively). The old names (generate_data_single/generate_data_multi) are kept as deprecated aliases.
- Add check_install command and run check_install before easy command
Bugfixes
- Fix bug with non-standard characters in sample names (#68).
Version 0.5
Released January 7 2022
User-visible improvements
- Reclustering is now the default (use --no-recluster to disable it; the option --recluster is deprecated and ignored) as the computational costs are much lower
- GTDB lazy downloading is now performed even if a non-standard directory is used
- The CACHEDIR.TAG protocol was implemented (this is supported by several tools that perform tasks such as backups).
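For context on the CACHEDIR.TAG entry: the protocol consists of placing a file named CACHEDIR.TAG, starting with a fixed signature line, in the cache directory so that backup tools can skip it. A minimal sketch follows; the function name and directory layout are hypothetical, and only the file name and signature come from the Cache Directory Tagging Specification.

```python
import os
import tempfile

# Fixed signature line required by the Cache Directory Tagging Specification.
CACHEDIR_TAG_SIGNATURE = "Signature: 8a477f597d28d172789f06886806bc55"


def mark_as_cache_dir(directory: str) -> str:
    """Create the directory if needed and write a CACHEDIR.TAG file into it."""
    os.makedirs(directory, exist_ok=True)
    tag_path = os.path.join(directory, "CACHEDIR.TAG")
    with open(tag_path, "w") as f:
        f.write(CACHEDIR_TAG_SIGNATURE + "\n")
        f.write("# This directory holds cached data and can be skipped by backups.\n")
    return tag_path


# Demonstrate on a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    tag = mark_as_cache_dir(os.path.join(tmp, "cache"))
    with open(tag) as f:
        first_line = f.readline().rstrip("\n")
```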
Bugfixes
- Fix bug with --min-len (minimal length). Previously, only contigs greater than the given minimal length were used (instead of greater than or equal to the minimal length).
- GTDB downloading was inconsistent in a few instances which have been fixed
Internal improvements
- Much more efficient code (including lower memory usage) for binning, especially if a pretrained model is used. As an example, using a deeply-sequenced ocean sample, generating the data (generate_data_single step) goes down from 14 to 9 minutes; while binning (bin step, using --recluster) goes down from 10m17s (using 20 GB of RAM, at peak) to 4m33s (using 4.5 GB, at peak). Thus total time from BAM file to bins went down from 25 to 14 minutes (using 4 threads) and peak RAM is now 4.5 GB, making it usable on a typical laptop.
Version 0.4.0
Released 27 October 2021
User-visible improvements
- Add support for .xz FASTA files as input
Internal improvements
- Removed BioPython dependency
Bug fixes
- Fix bug when uncompressing FASTA files (#42)
- Fix bug when splitting data
Version 0.3
Released 10 August 2021
User-visible improvements
- Support training from several samples
- Remove output_bin_path if output_bin_path exists
- Make several internal parameters configurable: (1) minimum length of contigs to bin (--min-len parameter); (2) minimum length of contigs to break up in order to generate must-link constraints (--ml-threshold parameter); (3) a threshold ratio: if the ratio of the number of base pairs in contigs between 1000 and 2500 bp is smaller than this value, the minimal length is set to 1000 bp, otherwise to 2500 bp (--ratio parameter).
- Add -p argument for predict_taxonomy mode
Internal improvements
- Better code overall
- Fix np.concatenate warning
- Remove redundant matrix when clustering
- Better pretrained models
- Faster depth calculation using NumPy
- Use correct number of threads in kneighbors_graph()
Bugfixes
- Respect number of threads (-p argument) when training (issue 34)
Version 0.2
Released 27 May 2021
User-visible improvements
- Change name to SemiBin
- Add support for training with several samples
- Test with Python 3.9
- Download mmseqs database with --remove-tmp-file 1
- Better output names
- Fix bugs when paths have spaces
- Fix installation issues by listing all the dependencies
- Add download_GTDB command
- Add --recluster option
- Add --environment option
- Add --mode option
Internal improvements
- All around more robust code by including more error checking & testing
- Better built-in models
Version 0.1.1
Released 21 March 2021
Bugfix release fixing an issue with minfasta-kbs
Version 0.1
Released 21 March 2021
- First release: testing version