Skip to content

Fix most of unicodedata, unmask multiple tests, and remove ucd#7947

Open
joshuamegnauth54 wants to merge 1 commit into
RustPython:mainfrom
joshuamegnauth54:fix-isprintable
Open

Fix most of unicodedata, unmask multiple tests, and remove ucd#7947
joshuamegnauth54 wants to merge 1 commit into
RustPython:mainfrom
joshuamegnauth54:fix-isprintable

Conversation

@joshuamegnauth54

@joshuamegnauth54 joshuamegnauth54 commented May 21, 2026

Copy link
Copy Markdown
Contributor

Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the unic-ucd-age crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.

Unfortunately, there isn't a crate that can help us here. icu4x targets modern Unicode versions. Writing a data provider for icu4x for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip icu4x (mostly) to instead write small lookup tables.

As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.

Summary by CodeRabbit

  • New Features

    • Vendored Unicode 3.2.0 plus new “latest” compatibility datasets for broader coverage.
  • Improvements

    • Build-time generation of Unicode tables for faster, consistent lookups.
    • Numeric APIs return normalized numeric values; digit/decimal behavior clarified.
    • Category, bidi, width and decomposition lookups are version-aware and allocation-minimized.
  • Bug Fixes

    • Applied normalization/decomposition corrections for historic compatibility.
  • Documentation

    • Added guidance on vendored Unicode data and update policy.
  • Tests

    • Updated Unicode tests to validate decompositions and versioned behavior.

@coderabbitai

coderabbitai Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: c7ca6479-b924-44b7-9ff7-4a911c4cb6e8

📥 Commits

Reviewing files that changed from the base of the PR and between da427d8 and d982b32.

⛔ Files ignored due to path filters (3)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_str.py is excluded by !Lib/**
  • Lib/test/test_unicodedata.py is excluded by !Lib/**
📒 Files selected for processing (16)
  • Cargo.toml
  • crates/stdlib/Cargo.toml
  • crates/stdlib/build.rs
  • crates/stdlib/src/unicodedata.rs
  • crates/stdlib/unicode/README.md
  • crates/stdlib/unicode/latest/DerivedNumericValues.txt
  • crates/stdlib/unicode/latest/NormalizationCorrections.txt
  • crates/stdlib/unicode/latest/UnicodeData.txt
  • crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • extra_tests/snippets/builtin_str_unicode.py
💤 Files with no reviewable changes (1)
  • Cargo.toml
✅ Files skipped from review due to trivial changes (7)
  • crates/stdlib/unicode/README.md
  • crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
  • crates/stdlib/unicode/latest/NormalizationCorrections.txt
  • crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • crates/stdlib/Cargo.toml
  • extra_tests/snippets/builtin_str_unicode.py
  • crates/stdlib/src/unicodedata.rs
  • crates/stdlib/build.rs

📝 Walkthrough

Walkthrough

Migrates unicode handling from ucd/unic-ucd-age to ICU properties plus generated lookup tables; adds build-time generators and vendored UCD 3.2.0 + latest data, refactors unicodedata for version-aware property/numeric lookups and decompositions, and updates tests.

Changes

Unicode Data Infrastructure Refactor

Layer / File(s) Summary
Dependency migration
Cargo.toml, crates/stdlib/Cargo.toml
Workspace and stdlib dependencies remove ucd and unic-ucd-age, add icu_normalizer and icu_properties (including as build-dependencies).
Build script Unicode generation pipeline
crates/stdlib/build.rs
Adds generators: generate_unicode_3_2, generate_numeric_type, generate_numeric_value, generate_unicode_latest, shared helpers (write_derived, parse_unicode_3_2), decomposition mapping (DecompositionType), and concurrent runner wiring with cargo:rerun-if-changed.
Module includes and lookup helpers
crates/stdlib/src/unicodedata.rs (lines 5–130)
Includes generated tables, defines UnicodeVersion/UNICODE_VERSION, introduces DecompositionType/type_tag(), and implements lookup_property and lookup_numeric_val.
Character extraction, category, and name
crates/stdlib/src/unicodedata.rs (lines 211–264)
extract_char no longer age-filters; category returns &'static str and selects ICU or generated lookup by Unicode version; name updated to use new extraction flow.
Bidirectional and East Asian width
crates/stdlib/src/unicodedata.rs (lines 277–319)
bidirectional and east_asian_width branch by Unicode version: ICU for newer, generated lookup_property for 3.2.0, preserving PUA1 override and fallbacks.
Mirrored, combining, and decomposition
crates/stdlib/src/unicodedata.rs (lines 379–466)
mirrored and combining use versioned resolution; decomposition assembles compatibility/canonical decompositions from generated tables and normalization corrections with 3.2.0 special handling.
Numeric properties
crates/stdlib/src/unicodedata.rs (lines 476–551)
digit, decimal, and numeric validate via ICU NumericType, fetch values via lookup_numeric_val with Unicode-version gating, require integral values for integer APIs, and return floats for numeric.
Unicode data files and documentation
crates/stdlib/unicode/README.md, crates/stdlib/unicode/latest/NormalizationCorrections.txt, crates/stdlib/unicode/ucd32/*
Add README, NormalizationCorrections.txt (latest), and multiple UCD 3.2.0 derived-property files (BidiClass, BinaryProperties, CombiningClass, EastAsianWidth, NumericType, NumericValues).
Test coverage
extra_tests/snippets/builtin_str_unicode.py
Adds assertions for decomposition (including normalization corrections), East Asian width, and mirrored properties for default and ucd_3_2_0.

Sequence Diagram(s)

sequenceDiagram
  participant build_rs as build.rs
  participant UCDfiles as unicode/ucd32 & unicode/latest
  participant Parser as UnicodeFileParser
  participant Emitter as RustCodeEmitter
  participant Runtime as unicodedata.rs
  build_rs->>UCDfiles: read vendored UCD 3.2.0 + latest files
  build_rs->>Parser: parse derived-property & normalization files
  Parser->>Emitter: produce table entries and decomposition updates
  Emitter->>build_rs: write generated Rust sources to OUT_DIR/generated
  Runtime->>Emitter: include generated lookup tables (lookup_property / lookup_numeric_val)
  Runtime->>Runtime: call ICU APIs for newer UnicodeVersion paths
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • ShaharNaveh
  • youknowone

Poem

🐰 I nibble through code, a tiny sleuth,
Tables sprout from vendored Unicode truth,
Old crates hopped out, ICU hopped in,
Decomps and digits ready to spin,
Carrots and code — a soft, warm proof!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.12% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the three main changes: fixing the unicodedata module, unmasking tests, and removing the ucd dependency.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_str.py (TODO: 6)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

[ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 11)
[x] test: cpython/Lib/test/test_unicode_file.py
[x] test: cpython/Lib/test/test_unicode_file_functions.py
[x] test: cpython/Lib/test/test_unicode_identifiers.py (TODO: 1)
[x] test: cpython/Lib/test/test_ucn.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on unicode)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

Comment thread crates/literal/src/char.rs

@ShaharNaveh ShaharNaveh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYSM!

Overall I really like this change.

The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed

Comment thread crates/stdlib/src/unicodedata.rs Outdated
}

/// Supported Unicode version based on icu4x.
pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume icu4x doesn't expose this:/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find it, but Rust itself exposes its Unicode version so I'm using that instead for now. Both Rust and icu4x target the latest Unicode version so it should be fine. I hope. 🤔 🤔

Comment thread Lib/test/test_str.py
self.assertTrue('\U0001F46F'.isprintable())
self.assertFalse('\U000E0020'.isprintable())

@unittest.expectedFailure # TODO: RUSTPYTHON

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3

@joshuamegnauth54

Copy link
Copy Markdown
Contributor Author

Also, the Unicode 3.2 files will never change but I agree about linking the sources somewhere. CPython seems to provide 3.2 as a compatibility for an old IDNA version (which is also provided for compatibility 😹 ). All of this work is essentially to fix a 20+ year old module that is likely not used anywhere. 😅

Unicode 3.2 - here is the source for now. I'll add it to a README after I finish the patch. 😁

@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 2 times, most recently from 5479c61 to e3ad1ed Compare May 28, 2026 01:03
joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request May 30, 2026
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.
joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request May 30, 2026
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.
joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request May 30, 2026
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.
@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 4 times, most recently from de85853 to dc1e675 Compare June 7, 2026 04:28
@joshuamegnauth54 joshuamegnauth54 marked this pull request as ready for review June 7, 2026 04:31
@joshuamegnauth54

Copy link
Copy Markdown
Contributor Author

I undrafted for CodeRabbit because I couldn't trigger it. There will likely be a lot of suggestions. 😅

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/stdlib/src/unicodedata.rs (1)

500-519: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Incorrect NumericType check in numeric method.

Line 509 checks NumericType::Decimal, but the numeric() method should accept characters with any numeric type (Decimal, Digit, or Numeric), not just Decimal. This will incorrectly reject valid numeric characters like Roman numerals (NumericType::Numeric) and circled digits (NumericType::Digit).

🐛 Proposed fix
         fn numeric(
             &self,
             character: PyStrRef,
             default: OptionalArg<PyObjectRef>,
             vm: &VirtualMachine,
         ) -> PyResult {
             self.extract_char(character, vm).and_then(|c| {
                 c.to_char()
                     .and_then(|c| {
-                        if NumericType::for_char(c) != NumericType::Decimal {
+                        let num_type = NumericType::for_char(c);
+                        if num_type == NumericType::None {
                             return None;
                         }

                         let value = lookup_numeric_val(c, self.unic_version)?;
                         Some(vm.ctx.new_float(value).into())
                     })
                     .or_else(|| default.into_option())
                     .ok_or_else(|| vm.new_value_error("not a numeric character"))
             })
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/stdlib/src/unicodedata.rs` around lines 500 - 519, The numeric method
is incorrectly filtering only Decimal numeric types; update the check in numeric
(inside the closure from extract_char) to accept any numeric type by testing
that NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.
🧹 Nitpick comments (1)
crates/stdlib/build.rs (1)

581-582: Use explicit Unicode input files for cargo:rerun-if-changed in crates/stdlib/build.rs.

Cargo does scan directory paths recursively for cargo:rerun-if-changed, but change detection is mtime-based, so directory-level tracking can be brittle. This build script only reads a fixed set of files from those directories, so track the exact inputs instead of unicode/ucd32 and unicode/latest.

Replace:

println!("cargo:rerun-if-changed=unicode/ucd32");
println!("cargo:rerun-if-changed=unicode/latest");

With cargo:rerun-if-changed for:

  • unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
  • unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
  • unicode/ucd32/DerivedBidiClass-3.2.0.txt
  • unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
  • unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • unicode/latest/UnicodeData.txt
  • unicode/latest/NormalizationCorrections.txt

Quick check: run cargo build -p rustpython-stdlib -vv, touch crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt, then run again and confirm Cargo re-runs the rustpython-stdlib build script.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/stdlib/build.rs` around lines 581 - 582, Replace the two
directory-level rerun directives that call
println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extra_tests/snippets/builtin_str_unicode.py`:
- Around line 27-29: The test asserts the wrong versioned decomposition for
U+F951: update the expectation for unicodedata.ucd_3_2_0.decomposition("\uf951")
to the pre-correction mapping "96FB" instead of "964B" so the versioned path
unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the
modern unicodedata.decomposition("\uf951") assertion as-is).

---

Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 500-519: The numeric method is incorrectly filtering only Decimal
numeric types; update the check in numeric (inside the closure from
extract_char) to accept any numeric type by testing that
NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.

---

Nitpick comments:
In `@crates/stdlib/build.rs`:
- Around line 581-582: Replace the two directory-level rerun directives that
call println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cd95b557-2363-4034-b21e-322853c602ab

📥 Commits

Reviewing files that changed from the base of the PR and between 51c97b9 and da04480.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_str.py is excluded by !Lib/**
📒 Files selected for processing (16)
  • Cargo.toml
  • crates/stdlib/Cargo.toml
  • crates/stdlib/build.rs
  • crates/stdlib/src/unicodedata.rs
  • crates/stdlib/unicode/README.md
  • crates/stdlib/unicode/latest/DerivedNumericValues.txt
  • crates/stdlib/unicode/latest/NormalizationCorrections.txt
  • crates/stdlib/unicode/latest/UnicodeData.txt
  • crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • extra_tests/snippets/builtin_str_unicode.py
💤 Files with no reviewable changes (1)
  • Cargo.toml

Comment on lines +27 to +29
# These are from NormalizationCorrections.txt
assert unicodedata.decomposition("\uf951") == "964B"
assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B"

@coderabbitai coderabbitai Bot Jun 7, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ucd_3_2_0 normalization-correction expectation looks incorrect for U+F951.

Line 29 currently expects "964B" for unicodedata.ucd_3_2_0.decomposition("\uf951"), but the Unicode 3.2 compatibility path is intended to return the original pre-correction mapping. This should likely assert "96FB" to validate the versioned behavior instead of the modern corrected value.

Suggested test fix
 # These are from NormalizationCorrections.txt
 assert unicodedata.decomposition("\uf951") == "964B"
-assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B"
+assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "96FB"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# These are from NormalizationCorrections.txt
assert unicodedata.decomposition("\uf951") == "964B"
assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B"
# These are from NormalizationCorrections.txt
assert unicodedata.decomposition("\uf951") == "964B"
assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "96FB"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extra_tests/snippets/builtin_str_unicode.py` around lines 27 - 29, The test
asserts the wrong versioned decomposition for U+F951: update the expectation for
unicodedata.ucd_3_2_0.decomposition("\uf951") to the pre-correction mapping
"96FB" instead of "964B" so the versioned path
unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the
modern unicodedata.decomposition("\uf951") assertion as-is).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. Both CPython and RustPython return "964B".

Here's the line in NormalizationCorrections.txt:

F951;96FB;964B;3.2.0 # Corrigendum 3

The character is \u{F951}. The original, incorrect decomposition is \u{96FB}. The decomposition was correction in version 3.2.0 which means that versions >= 3.2.0 use the correct decomposition.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshuamegnauth54, you're absolutely right — I misread the correction semantics. Since the NormalizationCorrections.txt format is CODEPOINT;WRONG;CORRECT;VERSION, and the correction for U+F951 was applied at version 3.2.0, ucd_3_2_0 (which is Unicode ≥ 3.2.0) correctly returns the fixed decomposition "964B". The original erroneous value "96FB" would only be relevant for a hypothetical pre-3.2.0 path. The test is correct as written. Sorry for the noise!

(ʘᴗʘ✿)


✏️ Learnings added
Learnt from: joshuamegnauth54
Repo: RustPython/RustPython PR: 7947
File: extra_tests/snippets/builtin_str_unicode.py:27-29
Timestamp: 2026-06-07T17:28:10.604Z
Learning: In RustPython's `extra_tests/snippets/builtin_str_unicode.py`, the `NormalizationCorrections.txt` format is `CODEPOINT;ORIGINAL_WRONG_DECOMP;CORRECT_DECOMP;VERSION`. The VERSION field indicates the Unicode version in which the correction was *applied*. For example, `F951;96FB;964B;3.2.0` means U+F951's decomposition was corrected to `964B` starting from Unicode 3.2.0. Therefore `unicodedata.ucd_3_2_0.decomposition("\uf951")` correctly returns `"964B"` (the corrected value), not `"96FB"` (the original wrong value). Both the modern `unicodedata` and the `ucd_3_2_0` versioned path return the same corrected decomposition when the correction version <= 3.2.0.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: joshuamegnauth54
Repo: RustPython/RustPython PR: 7998
File: crates/vm/src/builtins/float.rs:219-223
Timestamp: 2026-05-30T20:17:08.394Z
Learning: In RustPython's `crates/vm/src/builtins/float.rs`, `crate::literal::float::parse_bytes` (backed by lexical-core) tolerates trailing NUL (`\0`) bytes in the input buffer. The `float_from_string` UTF-8/ICU path uses a fixed 1024-byte `mapped_str_buf` and passing the full buffer (including unwritten trailing zeros) does not cause float parsing to fail, as confirmed by the PR author via local testing (PR `#7998`).

Learnt from: bschoenmaeckers
Repo: RustPython/RustPython PR: 8032
File: crates/capi/src/moduleobject.rs:37-40
Timestamp: 2026-06-03T12:02:30.049Z
Learning: In RustPython's C-API (`crates/capi/src/moduleobject.rs`), `PyModule_NewObject` intentionally requires the module name to be a valid UTF-8 string (enforced via `PyStr::to_str()`). This is a deliberate design constraint because RustPython's string model is UTF-8 only. This deviates from CPython's `PyModule_NewObject`, which stores any Unicode object directly without requiring UTF-8 representability. Do not raise review issues about this UTF-8 restriction on module names.

Learnt from: bschoenmaeckers
Repo: RustPython/RustPython PR: 8032
File: crates/capi/src/moduleobject.rs:37-40
Timestamp: 2026-06-03T12:02:34.454Z
Learning: In RustPython's C-API (`crates/capi/src/moduleobject.rs`), `PyModule_NewObject` intentionally requires the module name to be a valid UTF-8 string (enforced via `PyStr::to_str()`). This is a deliberate design constraint because RustPython's string model is UTF-8 only. This deviates from CPython's `PyModule_NewObject`, which stores any Unicode object directly without requiring UTF-8 representability. Do not raise review issues about this UTF-8 restriction on module names.

@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 3 times, most recently from 4db4918 to da427d8 Compare June 9, 2026 03:11
@joshuamegnauth54 joshuamegnauth54 changed the title Fix isprintable() and fix Unicode 3.2 Fix most of unicodedata, unmask multiple tests, and remove ucd Jun 9, 2026
Python bundles an old version of Unicode for compatibility. RustPython
tries to mimic supporting that old version by checking the version of
individual chars. This is a problem for a few reasons. The first is that
the age check adds an additional hit per each char lookup in Unicode
data. The check is outdated because the `unic-ucd-age` crate is several
versions behind the current Unicode version. The check rejects valid
chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode
16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0
database.

Unfortunately, there isn't a crate that can help us here. `icu4x`
targets modern Unicode versions. Writing a data provider for `icu4x` for
Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the
Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write
small lookup tables.

As of this commit, Unicode names is still wrong for 3.2.0.
Luckily, the crate RustPython uses is fast and robust for modern
Unicode.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants