PyTransDecoder

A Python port of TransDecoder, a tool that identifies candidate coding regions within transcript sequences.

Installation

cd PyTransDecoder
pip install -e .

Quick Start

Use pyTransdecoder as the primary interface when you want the standard end-to-end pipeline:

pyTransdecoder -t transcripts.fasta

This runs LongOrfs followed by Predict and writes the usual *.transdecoder.{gff3,bed,pep,cds} outputs.

You can also have the pipeline run homology support searches directly:

pyTransdecoder -t transcripts.fasta \
    --blast-search-pep uniprot_sprot.pep \
    --pfam-search-db Pfam-A.hmm

Phase-Specific Commands

If you want to run the phases separately, use the subcommand CLI:

Phase 1: Extract Long ORFs

pytransdecoder longorfs -t transcripts.fasta

This creates a directory transcripts.fasta.transdecoder_dir/ with:

longest_orfs.pep - Protein sequences
longest_orfs.cds - CDS sequences
longest_orfs.gff3 - ORF annotations
base_freqs.dat - Nucleotide frequencies

Options

-t, --transcripts PATH          Input transcripts FASTA file [required]
-m, --min-protein-length INT    Minimum protein length (default: 100 aa)
-G, --genetic-code TEXT         Genetic code (default: Standard)
-S, --strand-specific           Only analyze top strand
-O, --output-dir PATH           Output directory
--gene-trans-map PATH           Gene-to-transcript mapping file
--complete-orfs-only            Only output complete ORFs
-v, --verbose                   Verbose output
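To make the effect of the minimum protein length concrete, here is a minimal sketch of long-ORF extraction. It is an illustration only, not the package's implementation: the function name `find_complete_orfs` is hypothetical, it scans only the three top-strand frames (roughly what -S describes), reports only complete ORFs (ATG through a stop codon), and hard-codes the standard genetic code.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_complete_orfs(seq, min_protein_length=100):
    """Yield (start, end, protein) for complete top-strand ORFs.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    The protein string excludes the stop. Hypothetical helper for illustration.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            aa = CODON_TABLE.get(codon, "X")  # X for codons containing N etc.
            if aa == "*":
                if start is not None:
                    protein = "".join(
                        CODON_TABLE.get(seq[j:j + 3], "X")
                        for j in range(start, i, 3)
                    )
                    if len(protein) >= min_protein_length:
                        yield start, i + 3, protein
                start = None
            elif codon == "ATG" and start is None:
                start = i


# Toy transcript: one ORF encoding M-A-A-A-A in the third frame.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(list(find_complete_orfs(toy, min_protein_length=5)))
```

Raising `min_protein_length` above 5 makes the toy ORF disappear, which is the same trade-off -m controls on real data: fewer spurious short ORFs at the cost of missing short genuine ones.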

CLI Compatibility

PyTransDecoder's CLI accepts both underscore and dash formats for option names, ensuring compatibility with existing Perl TransDecoder workflows:

# Both of these work identically:
pytransdecoder predict -t transcripts.fasta --retain-pfam-hits pfam.domtblout
pytransdecoder predict -t transcripts.fasta --retain_pfam_hits pfam.domtblout

This means you can use existing scripts and Makefiles without modification.
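One way such dual-format handling can be achieved is to normalize option names before parsing. The sketch below shows the idea; it is an assumption about a possible approach, not a description of how PyTransDecoder does it, and `normalize_option_names` is a hypothetical helper.

```python
def normalize_option_names(argv):
    """Rewrite --long_option names to --long-option.

    Only the option name is touched: positional arguments, short flags,
    and values (including the part after '=') are left unchanged.
    """
    normalized = []
    for arg in argv:
        if arg.startswith("--") and len(arg) > 2:
            name, sep, value = arg.partition("=")
            arg = name.replace("_", "-") + sep + value
        normalized.append(arg)
    return normalized


args = ["predict", "-t", "my_transcripts.fasta",
        "--retain_pfam_hits", "pfam_hits.domtblout"]
print(normalize_option_names(args))
```

Note that file names containing underscores pass through untouched, because they do not start with `--`.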

Legacy compatibility wrappers for TransDecoder.LongOrfs and TransDecoder.Predict now live under util/ instead of the repository root.

Supported Genetic Codes

universal/standard
vertebrate_mitochondrial
yeast_mitochondrial
invertebrate_mitochondrial
ciliate/tetrahymena/dasycladacean
euplotid
bacterial
candida
And 15+ more...

Phase 2: Predict Likely Coding Regions

Basic Usage

pytransdecoder predict -t transcripts.fasta

This creates final output files in the same directory as your transcripts:

transcripts.fasta.transdecoder.gff3 - Gene predictions
transcripts.fasta.transdecoder.bed - BED format
transcripts.fasta.transdecoder.pep - Protein sequences
transcripts.fasta.transdecoder.cds - CDS sequences

With Homology Support

# Run BLASTP against UniProt
blastp -query transcripts.fasta.transdecoder_dir/longest_orfs.pep \
       -db uniprot_sprot.pep -max_target_seqs 1 \
       -outfmt 6 -evalue 1e-5 -num_threads 4 > blastp.outfmt6

# Run Pfam domain search
hmmscan --cpu 4 --domtblout pfam.domtblout Pfam-A.hmm \
        transcripts.fasta.transdecoder_dir/longest_orfs.pep

# Run predict with homology data
pytransdecoder predict -t transcripts.fasta \
    --retain-blastp-hits blastp.outfmt6 \
    --retain-pfam-hits pfam.domtblout
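Before passing a BLAST table to predict, it can be useful to check which ORFs actually have hits. The following sketch reads the default 12-column outfmt 6 layout (query ID, subject ID, ..., e-value in column 11, bit score in column 12) and keeps the best hit per query. The function name `best_hits` and the filtering rule are illustrative assumptions, not part of PyTransDecoder.

```python
import csv
import io


def best_hits(handle, max_evalue=1e-5):
    """Map each query ID to (subject, evalue, bitscore) of its best hit.

    "Best" here means the highest bit score among rows whose e-value is
    at or below max_evalue. Expects default BLAST outfmt 6 columns.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Tiny in-memory example; for a real file use: with open("blastp.outfmt6") as fh
example = io.StringIO(
    "orf1\tsp|A\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "orf1\tsp|B\t95.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t300\n"
    "orf2\tsp|C\t30.0\t40\t20\t1\t1\t40\t1\t40\t0.5\t20\n"
)
print(best_hits(example))
```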

With the full pipeline entrypoint, you can skip the separate Pfam step and let pyTransdecoder prepare and search the HMM database for you:

pyTransdecoder -t transcripts.fasta --pfam-search-db Pfam-A.hmm

Options

-t, --transcripts PATH            Input transcripts FASTA file [required]
-O, --output-dir PATH             Output directory
-T, --top-orfs-train INT          Training ORFs (default: 500)
--retain-long-orfs-mode           'dynamic' or 'strict' (default: dynamic)
--retain-pfam-hits PATH           Pfam domain hits (domtblout format)
--retain-blastp-hits PATH         BLAST hits (outfmt 6 format)
--single-best-only                Only best ORF per transcript
--no-refine-starts                Skip start codon refinement
-G, --genetic-code TEXT           Genetic code (default: Standard)
-v, --verbose                     Verbose output

How It Works

1. Training: Selects the longest unique ORFs (500 by default; see -T) to train the hexamer scoring model.
2. Scoring: Scores all ORFs with a Markov chain model based on hexamer composition.
3. Selection: Chooses the best ORFs based on:
   - Homology support (BLAST/Pfam hits)
   - Coding potential score
   - ORF length
   - GC content-based thresholds
4. Output: Writes the final predictions in multiple formats (GFF3, BED, PEP, CDS).
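The training and scoring steps can be sketched as a fixed-order Markov log-odds score: each base is scored by how much more likely it is after its preceding five bases (together a hexamer) in the training ORFs than under background base frequencies. This is a deliberately simplified illustration under stated assumptions: it ignores reading frame, uses simple additive smoothing, and the names `train_markov` and `log_odds_score` are hypothetical. It should not be read as the model PyTransDecoder actually implements.

```python
import math
from collections import Counter


def train_markov(cds_list, order=5, pseudocount=1.0):
    """Return prob(context, base): a smoothed P(base | previous `order` bases)."""
    counts = Counter()          # (context, base) -> occurrences
    context_totals = Counter()  # context -> occurrences
    for cds in cds_list:
        cds = cds.upper()
        for i in range(order, len(cds)):
            context, base = cds[i - order:i], cds[i]
            counts[context, base] += 1
            context_totals[context] += 1

    def prob(context, base):
        # Additive smoothing over the four nucleotides.
        return (counts[context, base] + pseudocount) / (
            context_totals[context] + 4 * pseudocount
        )

    return prob


def log_odds_score(seq, prob, background, order=5):
    """Sum of log(P(base | context) / P_background(base)) along seq."""
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        base = seq[i]
        if base not in background:
            continue  # skip ambiguous bases such as N
        score += math.log(prob(seq[i - order:i], base) / background[base])
    return score


# Toy data: train on a repetitive "coding" sequence, score against a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
model = train_markov(["ATGGCC" * 20])
print(log_odds_score("ATGGCC" * 5, model, background) > 0)
```

A positive score means the sequence looks more like the training ORFs than like background; sequences made of hexamers never seen in training score zero under this smoothing with a uniform background.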

Development Status

✅ Phase 1 (LongOrfs): Implemented and validated
✅ Phase 2 (Predict): Implemented and tested
⏳ Performance benchmarking on large datasets

Testing Against Perl Version

To check that the Python output matches the Perl version:

# Run Perl version
cd ../TransDecoder
./TransDecoder.LongOrfs -t test.fasta

# Run Python version
cd ../PyTransDecoder
pytransdecoder longorfs -t test.fasta

# Compare outputs
diff ../TransDecoder/test.fasta.transdecoder_dir/longest_orfs.pep \
     test.fasta.transdecoder_dir/longest_orfs.pep
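A plain diff reports differences in record order or line wrapping even when the sequences themselves agree. If that gets in the way, a record-level comparison such as the sketch below may help. The helpers `read_fasta` and `compare_fasta` are hypothetical and not part of either package; records are keyed on the first whitespace-delimited token of each header, so differences in the rest of the header line are ignored.

```python
def read_fasta(path):
    """Return {record_id: sequence}, joining wrapped sequence lines."""
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first token of the header as the record ID.
                current = (line[1:].split() or [""])[0]
                records[current] = []
            elif current is not None:
                records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def compare_fasta(path_a, path_b):
    """Summarize record-level differences between two FASTA files."""
    a, b = read_fasta(path_a), read_fasta(path_b)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "different": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

For example, `compare_fasta("../TransDecoder/test.fasta.transdecoder_dir/longest_orfs.pep", "test.fasta.transdecoder_dir/longest_orfs.pep")` returns three lists that are all empty when both versions produced the same set of sequences.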

Requirements

Python 3.8+
BioPython >= 1.81
tqdm >= 4.65

Note: Ported by Claude.io Sonnet 4.5 under guidance by bhaas. Jan 24, 2026.
