MalStat is a static PE malware triage toolkit for Windows executables. It extracts structural features from EXE, DLL, and SYS files, applies a calibrated machine learning classifier, and produces explainable JSON and HTML reports without executing the sample.
This repository is the clean public runtime repository for the MalStat project.
It intentionally contains:
- the canonical source code under src/ and scripts/;
- the minimal runtime model bundle under models/;
- the runtime template and configuration files required for inference;
- the training and evaluation code paths for users who provide their own local datasets.
It intentionally does not contain private datasets, training/evaluation outputs, or the separate author-side coursework documentation folder that lives outside this repository.
- Static analysis of Windows PE files without running them.
- Feature extraction across file, header, section, import, string, resource, opcode, and optional reputation layers.
- Calibrated LightGBM inference with canonical model artifacts already included.
- JSON and HTML reporting for single-file and batch workflows.
- Reproducible retraining once you provide a local clean training subset.
- Evaluation tooling for threshold selection, error review, and external holdout checks against local artifacts.
.
|- configs/ Runtime and evaluation configuration
|- data/ Local ignored workspace for private datasets and generated dataset artifacts
|- models/ Canonical runtime model bundle only
|- scripts/ CLI entry points for analysis, dataset building, training, and evaluation
|- src/ Core project implementation
|- templates/ HTML report template(s)
|- tests/ Minimal smoke tests for CI
|- CHANGELOG.md Release notes
|- Dockerfile Reproducible container runtime
|- README.md This document
|- VERSION Repository release version
`- requirements.txt Exact tested dependency set
- scripts/analyze_file.py analyzes one file.
- scripts/analyze_directory.py analyzes a directory tree.
- src/inference/ contains the runtime analysis path.
- src/reporting/ renders the explainable output payloads.
- scripts/build_dataset.py extracts and appends rows into aggregate dataset stores.
- scripts/export_clean_training_subset.py exports the clean supervised subset.
- scripts/train_model.py trains the canonical classifier.
- scripts/evaluate_model.py builds detailed evaluation reports.
- src/training/ and src/evaluation/ contain the training and evaluation logic.
- models/calibrated_model.pkl is the canonical shipped classifier.
- models/preprocessor.pkl is the canonical feature preprocessor.
- models/feature_columns.json fixes the expected feature order.
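Because models/feature_columns.json fixes the feature order, any feature mapping has to be arranged into that order before it reaches the preprocessor. The following is a minimal sketch of that idea, not the project's actual loader. It assumes the JSON file holds a flat list of column names; the column names, the `order_features` helper, and the 0.0 fill value for missing features are all hypothetical.

```python
import json


def order_features(features: dict, feature_columns: list[str], fill_value: float = 0.0) -> list[float]:
    """Return feature values in the order given by feature_columns.

    Missing features are filled with fill_value (an assumption for this sketch).
    """
    return [float(features.get(name, fill_value)) for name in feature_columns]


# Hypothetical column names; the real list lives in models/feature_columns.json.
columns = json.loads('["file_size", "num_sections", "num_imports"]')
row = order_features({"num_imports": 12, "file_size": 4096}, columns)
print(row)  # [4096.0, 0.0, 12.0]
```

Keeping the order in one JSON file means the extractor can emit features in any order without silently shifting columns at inference time.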
- reports/ is for generated analysis outputs and is intentionally gitignored.
- data/ is a local working area for private datasets and exported training subsets and is intentionally gitignored.
- non-runtime files under models/ such as experiment logs, evaluation reports, prediction tables, and best snapshots are intentionally gitignored.
- Python 3.13.x
- pip and venv
- A platform supported by the pinned wheels in requirements.txt
- Enough disk space for model artifacts, generated reports, and optional dataset growth
This project analyzes Windows PE files, but the tooling itself can be run from a non-Windows host because it performs static parsing instead of sample execution.
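To illustrate why static parsing is host-independent, here is a small sketch that recognizes a PE file purely from its bytes, using only the Python standard library. It is not MalStat's parser; `looks_like_pe` and `make_stub` are hypothetical helpers written for this example. It relies on two facts of the PE format: the file starts with the DOS magic `MZ`, and the 4-byte little-endian field at offset 0x3C (`e_lfanew`) points to the `PE\0\0` signature.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Check the DOS magic and the PE signature without executing anything."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # An out-of-range offset yields an empty slice, which compares unequal.
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


def make_stub() -> bytes:
    """Build a minimal synthetic header that passes the check (not a runnable program)."""
    header = bytearray(0x40)
    header[:2] = b"MZ"
    struct.pack_into("<I", header, 0x3C, 0x40)
    return bytes(header) + b"PE\x00\x00"


print(looks_like_pe(make_stub()))
print(looks_like_pe(b"\x7fELF" + bytes(0x40)))
```

Nothing here loads or runs the file, so the same check behaves identically on Windows, Linux, or macOS.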
Windows (PowerShell):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements.txt
Linux or macOS:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r requirements.txt
Verify the CLI entry points:
python scripts/analyze_file.py --help
python scripts/analyze_directory.py --help
python scripts/train_model.py --help
python scripts/evaluate_model.py --help
Analyze a single file:
python scripts/analyze_file.py "C:\path\to\sample.exe"
Analyze a directory recursively:
python scripts/analyze_directory.py "C:\path\to\samples" --recursive --no-html
By default, generated outputs are written under reports/, which is ignored by Git on purpose.
VirusTotal lookups are disabled by default. Add --enable-virustotal only when you explicitly want remote hash reputation lookups and have configured VT_API_KEY in the environment or .env.
python scripts/analyze_file.py "C:\path\to\sample.exe" --json-out sample.analysis.json --html-out sample.analysis.html
python scripts/analyze_directory.py "C:\path\to\samples" --recursive --output-root reports\batch_analysis\run01
These dataset and training workflows require a private local dataset workspace under data/; they are not part of the public runtime bundle.
python scripts/build_dataset.py "C:\path\to\samples" --source manual --label 0 --label-confidence high
python scripts/export_clean_training_subset.py
python scripts/train_model.py
python scripts/evaluate_model.py
The repository works without the tools below. Install them only if you need the associated workflow.
FLOSS improves string extraction for obfuscated samples.
Preferred installation method on Windows:
- Download the standalone executable from the official Mandiant FLARE-FLOSS releases page.
- Put floss.exe on your PATH, or place it in a known tools directory and configure your environment accordingly.
Alternative Python package installation:
python -m pip install flare-floss
VirusTotal lookups are opt-in.
- Set VT_API_KEY in your environment or local .env file.
- Add --enable-virustotal to scripts/analyze_file.py, scripts/analyze_directory.py, scripts/main_extractor.py, scripts/build_dataset.py, or scripts/backfill_dataset_features.py when you explicitly want remote hash lookups.
- Leave the flag unset to keep analysis fully local.
Use SHAP only if you want additional post-hoc explanations for trained models.
python -m pip install shap
Use Plotly for richer interactive plots and Kaleido for static image export.
python -m pip install plotly kaleido
PyTorch is optional and only relevant if you add experimental neural baselines.
Use the official installer selector when possible because CPU and CUDA wheels differ by platform.
Typical CPU-only installation example:
python -m pip install torch --index-url https://download.pytorch.org/whl/cpu
- Canonical model version: 20260525T112232Z
- Canonical classifier: lightgbm
- Canonical operating threshold: 0.454123
- Published repository version: 0.1.0
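As a rough illustration of how an operating threshold turns a calibrated probability into a verdict, consider the sketch below. The label strings and the choice of `>=` at the boundary are assumptions made for this example; the shipped code may name or handle these differently.

```python
THRESHOLD = 0.454123  # canonical operating threshold listed above


def verdict(probability: float, threshold: float = THRESHOLD) -> str:
    """Map a calibrated probability to a label (labels are hypothetical)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    # Assumption: scores at or above the threshold are flagged.
    return "malicious" if probability >= threshold else "benign"


print(verdict(0.50))  # malicious
print(verdict(0.45))  # benign
```

A threshold below 0.5 flags more samples than a naive midpoint cut-off would, which trades some extra false positives for fewer missed detections.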
This repository includes several reproducibility controls:
- requirements.txt pins the exact tested Python package versions.
- VERSION is the single source of truth for repository releases.
- CHANGELOG.md records repository-level release notes.
- Dockerfile provides a reproducible runtime container.
- GitHub Actions validates install, CLI help paths, smoke tests, and Docker buildability.
The repository ships with:
- CI workflow: installs pinned dependencies, runs smoke tests, validates CLI entry points, and builds the Docker image.
- Release workflow: validates that the Git tag matches VERSION, creates a source archive, and attaches it to the GitHub release.
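The tag check performed by the Release workflow can be approximated locally before pushing. This is a hedged sketch, not the workflow's actual script: `tag_matches_version` is a hypothetical helper, and it assumes the VERSION file holds a bare X.Y.Z string and that tags follow the vX.Y.Z form described in the release process.

```python
import re

SEMVER = re.compile(r"\d+\.\d+\.\d+")


def tag_matches_version(tag: str, version_text: str) -> bool:
    """Return True when tag equals 'v' plus the contents of VERSION."""
    version = version_text.strip()  # tolerate a trailing newline in the file
    if not SEMVER.fullmatch(version):
        return False
    return tag == f"v{version}"


print(tag_matches_version("v0.2.0", "0.2.0\n"))  # True
print(tag_matches_version("v0.2.0", "0.1.0\n"))  # False
```

Running a check like this before tagging avoids pushing a tag that the release workflow would then reject.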
Use this release process for clean GitHub releases:
- Update the code and documentation.
- Update VERSION using Semantic Versioning, for example 0.2.0.
- Add the release notes to CHANGELOG.md.
- Commit the release changes.
- Tag the release as vX.Y.Z, for example v0.2.0.
- Push the tag to trigger the release workflow.
- Keep requirements.txt pinned.
- Keep generated outputs out of version control.
- Update CHANGELOG.md for every release.
- Re-run the smoke tests after changing model artifacts.
- Treat models/calibrated_model.pkl, models/preprocessor.pkl, and models/feature_columns.json as the published runtime contract.
After a successful single-file run, you should expect:
- a JSON analysis payload;
- an optional HTML report;
- a printed summary in the terminal with the verdict, probability, threshold, model name, and model version.
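For scripting around these outputs, a JSON payload can be condensed back into a one-line summary. The sketch below is illustrative only: the key names (`verdict`, `probability`, `threshold`, `model_name`, `model_version`) and the sample probability are hypothetical, so inspect a real payload for the actual field names before relying on them.

```python
import json


def format_summary(payload: dict) -> str:
    """Build a one-line summary from a payload (field names are assumed)."""
    return (
        f"verdict={payload['verdict']} "
        f"probability={payload['probability']:.4f} "
        f"threshold={payload['threshold']:.6f} "
        f"model={payload['model_name']} "
        f"version={payload['model_version']}"
    )


# Hypothetical payload shaped after the summary fields listed above.
sample = json.loads(
    '{"verdict": "malicious", "probability": 0.91, "threshold": 0.454123,'
    ' "model_name": "lightgbm", "model_version": "20260525T112232Z"}'
)
print(format_summary(sample))
```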
MalStat is a static triage tool. It is designed to prioritize and explain suspicious PE files, not to behave like a full production antivirus engine with dynamic detonation, live telemetry, or enterprise response orchestration.
:+xXX$$$$$$$$$Xx:
++++;;;+xXxxXXXXXXXXX$$&$X+
::::;+xXXX$$$XXXxxx++xX$$$$$$$&&$;
.::+$&&&$$&&&$$$$$$$XX$$X;;;x$$X$$$$&&$x
:;X&&&&&&$$XXXXXXxxXxxxxxXxxxx+:;x$$$$$$&&&X:
:x&&&&&$$$xxxxxx+++++;;;;;;++++++xx:;x&&$&$$&&&$;
:X&&&&&$$XXXxx++xX;++++x:::::::::;+x;++;+X&&$&$$&&&X:
.x&&&&&&$$Xxxxx+;;;X+;;;;x:::::::::;;:;;+++;x$&&&$$$&&$X
:$&&$X+X$XX$xx++;;::+x:;;;+:::::::::X;::::+;+++X&&$&$$&&&$;
.x&&&: +x+xX$X$+;;;;:::x:;;++::::::::X+::::::;;:++x&&$&$$$&&$+
.$&&$ xxxXxxxX;;::::;x:;;++:::::::x;:::::::::;;+;x&&$&$$$&&$$
.$&&$ :++Xx;xx:::::x++;;++::::::++:::::::::::;;;+x$&$&$$$&&Xx
.$&&& .;;+;xX:x+:::;:;x;;+;:::::xx::::::::::::::+;;X&&$&$$$&&$X
.$&&&. .;;;;;+;x;X;::::::::;:::::++::::::::::::::::+x+x&&$&$$$&&$+
X&&&$. ;:;;;::+:::.:::::::.;:;:::::::::::::::::X$+:;;;xX&&&$$$&&$$:
;&&&&X$Xx. ::;;;::::: ;;:::::+;;+++::::::::::::x$+:+$X:;+;X$&$&$$$&&XX
.$&&&+ ;+xx;;;;:::::. ..:::::x:;+;::;:::::.:;Xx:;x$;::::;;+X&$&&$$$&&X+
+&&&X. ; .;;$Xx+::::. ...::::x;++;::::;::.:;;:+$+::::::::+:x&&$&$$$&&XX
.$&&&x. ;. .;:;+:;:::;.. ..::::;;:;;::::::;::::xx:::::::::::+:x$&$&$&$$&$X;
;&&&&;:.; .;::;:x:::;: ...::::+;;;;:::::::::::::::::::::::::;+X&$$&$$$&$X+
+&&&$. ::.:;::;::::+:. ...:::.x:;;::::.::::;::::::::::::::::+;X$&$&$$$&&XX
x&&&X:.:::.;::::::;::.. ...::::x:;;:::::::::;::::::::::::::::;:X$&$&$&$$&XX
.X&&&x;::::.:::;;::+::. . ...:::;x:+;::::::.:::+:::+XXXXX$X$$$;::XX&$&$&$$&XX.
.$&&&x;:::::xxx;::::::....:;;:;xxxXx;::::::::::X::::;;;;:;;;;;:::xX$$&X$$&&XX:
.$&&&x;;+;::::;;;:::::....:x+xxx$XXXXXXXXXXXxxxxxx+;;;;;;;;;;+:::XX$$&X&$$&XX:
.$&&&x::x;::++xx;::;::.......::+x+xxx+++++++xxxXx++X&&$$$$$$$$;::X$$$&$&$$&XX.
.X&&&X:;;;:::::::::;::........:xX;++;:::::::::;;:;;;;;;;;;:::::;;X$&$&X$$$&Xx
.x&&&&XXx+;:::+X+:::+:........::::;;::::::::::;::::::::::::::::+;X$&$&$$$&&Xx
x&&&$xxxxx+X$;:X+:::;::::....::::::::::::.::::.:.:::::.:::::::;+$&$$$$$$&$x;
;&&&&&$$$$X;xX+:x+::;::::::..:::::.::::::::;;:..:+Xx:::::::::+:x&&$&$&$&&$x:
.X&&&&$$XX$X;;$x::::::+::::...:::.::::::::;::::;+:::x$+:::::;;;X&&$&$$$&&Xx
;&&&&&&$XxX$+:+xX;:::::;:........::...:;;......;$XXx;;X$+::;.X&&$&$$X$&$x;
:x&&&&$$$$xX+;X;:::;:::::;:::......::+:....::::::;x;+$$++;;:+X&&$&X$$&&x+
;$&&&&&$$x+$x::::;x:X:::::::::;:::::::::::::::::::;x:::;;:;X&&$&$$X&&X+
+&&&&&&$X$+;:::;X:$;:x...........:::::+::::::::::::;x;;;.X$&$&$$X$&$+:
:+$&&&&$$X+;;:;X:x::$;:..:x;.:....::::X$:::::::::::::;;:X$&&$&X$$&&+;
:x$&&&&&$X+;;X;x+;xx.....x+.$:..::::.:XX;::::::::::;;:x$&&$$X$$&$+;
:x$&&&&&$$+&;Xx+:x;.....x+.X;..::::::;+&x::::::::x+:X$&$$&XX$&&+;
.XX&&&&&$Xx+xx;;x:::...x++;;..:::::::+:xX:::::+;:+X&&$$$XX$&$+.
+x$&&&&$XXx;;X::.....++$;x.:::::::::x:+$;:;;::X$&&X&$$X$&X;
:xx$&&&&$X++;:::::::x+x:X::::::::::x;:;+;;:+$&&$$$XXX&&+.
+xx$&$&&$X+++;::::xx;:X::::::::::;+;X;;;X$&$$&XXX$&X:
+XxX&&$$$xx++;;++;:::::::;++;;:x+;:xX$&$X&XXX$&X;
;XXxX$$$$Xx++x++x;;;+;;;;+x+;;+X$$$X$$XXX&&X:
;+XXX++x$$$XXxxxx+x+x+.:xX$$&$XX&XXx$&&+
;xxxXXXXXxxxxxxX$$$$$$XXXX$XxX$&$;
xXXxxx+xxxx++xX$$XxxxX$&$+
.;xX$$$$$$$$Xx;.


