Fix most of `unicodedata`, unmask multiple tests, and remove `ucd` by joshuamegnauth54 · Pull Request #7947 · RustPython/RustPython

joshuamegnauth54 · 2026-05-21T20:23:02Z

Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the unic-ucd-age crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.

Unfortunately, there isn't a crate that can help us here. icu4x targets modern Unicode versions. Writing a data provider for icu4x for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip icu4x (mostly) to instead write small lookup tables.

As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.

Summary by CodeRabbit

New Features
- Vendored Unicode 3.2.0 plus new “latest” compatibility datasets for broader coverage.
Improvements
- Build-time generation of Unicode tables for faster, consistent lookups.
- Numeric APIs return normalized numeric values; digit/decimal behavior clarified.
- Category, bidi, width and decomposition lookups are version-aware and allocation-minimized.
Bug Fixes
- Applied normalization/decomposition corrections for historic compatibility.
Documentation
- Added guidance on vendored Unicode data and update policy.
Tests
- Updated Unicode tests to validate decompositions and versioned behavior.

coderabbitai · 2026-05-21T20:23:09Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: c7ca6479-b924-44b7-9ff7-4a911c4cb6e8

📥 Commits

Reviewing files that changed from the base of the PR and between da427d8 and d982b32.

⛔ Files ignored due to path filters (3)

Cargo.lock is excluded by !**/*.lock
Lib/test/test_str.py is excluded by !Lib/**
Lib/test/test_unicodedata.py is excluded by !Lib/**

📒 Files selected for processing (16)

Cargo.toml
crates/stdlib/Cargo.toml
crates/stdlib/build.rs
crates/stdlib/src/unicodedata.rs
crates/stdlib/unicode/README.md
crates/stdlib/unicode/latest/DerivedNumericValues.txt
crates/stdlib/unicode/latest/NormalizationCorrections.txt
crates/stdlib/unicode/latest/UnicodeData.txt
crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
extra_tests/snippets/builtin_str_unicode.py

💤 Files with no reviewable changes (1)

Cargo.toml

✅ Files skipped from review due to trivial changes (7)

crates/stdlib/unicode/README.md
crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
crates/stdlib/unicode/latest/NormalizationCorrections.txt
crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt

🚧 Files skipped from review as they are similar to previous changes (4)

crates/stdlib/Cargo.toml
extra_tests/snippets/builtin_str_unicode.py
crates/stdlib/src/unicodedata.rs
crates/stdlib/build.rs

📝 Walkthrough

Walkthrough

Migrates unicode handling from ucd/unic-ucd-age to ICU properties plus generated lookup tables; adds build-time generators and vendored UCD 3.2.0 + latest data, refactors unicodedata for version-aware property/numeric lookups and decompositions, and updates tests.

Changes

Unicode Data Infrastructure Refactor

Layer / File(s)	Summary
Dependency migration `Cargo.toml`, `crates/stdlib/Cargo.toml`	Workspace and stdlib dependencies remove `ucd` and `unic-ucd-age`, add `icu_normalizer` and `icu_properties` (including as build-dependencies).
Build script Unicode generation pipeline `crates/stdlib/build.rs`	Adds generators: `generate_unicode_3_2`, `generate_numeric_type`, `generate_numeric_value`, `generate_unicode_latest`, shared helpers (`write_derived`, `parse_unicode_3_2`), decomposition mapping (`DecompositionType`), and concurrent runner wiring with `cargo:rerun-if-changed`.
Module includes and lookup helpers `crates/stdlib/src/unicodedata.rs` (lines 5–130)	Includes generated tables, defines `UnicodeVersion`/`UNICODE_VERSION`, introduces `DecompositionType`/`type_tag()`, and implements `lookup_property` and `lookup_numeric_val`.
Character extraction, category, and name `crates/stdlib/src/unicodedata.rs` (lines 211–264)	`extract_char` no longer age-filters; `category` returns `&'static str` and selects ICU or generated lookup by Unicode version; `name` updated to use new extraction flow.
Bidirectional and East Asian width `crates/stdlib/src/unicodedata.rs` (lines 277–319)	`bidirectional` and `east_asian_width` branch by Unicode version: ICU for newer, generated `lookup_property` for 3.2.0, preserving PUA1 override and fallbacks.
Mirrored, combining, and decomposition `crates/stdlib/src/unicodedata.rs` (lines 379–466)	`mirrored` and `combining` use versioned resolution; `decomposition` assembles compatibility/canonical decompositions from generated tables and normalization corrections with 3.2.0 special handling.
Numeric properties `crates/stdlib/src/unicodedata.rs` (lines 476–551)	`digit`, `decimal`, and `numeric` validate via ICU `NumericType`, fetch values via `lookup_numeric_val` with Unicode-version gating, require integral values for integer APIs, and return floats for `numeric`.
Unicode data files and documentation `crates/stdlib/unicode/README.md`, `crates/stdlib/unicode/latest/NormalizationCorrections.txt`, `crates/stdlib/unicode/ucd32/*`	Add README, NormalizationCorrections.txt (latest), and multiple UCD 3.2.0 derived-property files (BidiClass, BinaryProperties, CombiningClass, EastAsianWidth, NumericType, NumericValues).
Test coverage `extra_tests/snippets/builtin_str_unicode.py`	Adds assertions for decomposition (including normalization corrections), East Asian width, and mirrored properties for default and `ucd_3_2_0`.

Sequence Diagram(s)

sequenceDiagram
  participant build_rs as build.rs
  participant UCDfiles as unicode/ucd32 & unicode/latest
  participant Parser as UnicodeFileParser
  participant Emitter as RustCodeEmitter
  participant Runtime as unicodedata.rs
  build_rs->>UCDfiles: read vendored UCD 3.2.0 + latest files
  build_rs->>Parser: parse derived-property & normalization files
  Parser->>Emitter: produce table entries and decomposition updates
  Emitter->>build_rs: write generated Rust sources to OUT_DIR/generated
  Runtime->>Emitter: include generated lookup tables (lookup_property / lookup_numeric_val)
  Runtime->>Runtime: call ICU APIs for newer UnicodeVersion paths

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

RustPython/RustPython#7555: Also updates crates/stdlib/src/unicodedata.rs to use ICU property/normalizer APIs.
RustPython/RustPython#7626: Switches character classification to ICU properties and NumericType, related to numeric/character predicate changes.

Suggested reviewers

ShaharNaveh
youknowone

Poem

🐰 I nibble through code, a tiny sleuth,
Tables sprout from vendored Unicode truth,
Old crates hopped out, ICU hopped in,
Decomps and digits ready to spin,
Carrots and code — a soft, warm proof!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.12% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the three main changes: fixing the unicodedata module, unmasking tests, and removing the ucd dependency.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-21T20:23:59Z

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_str.py (TODO: 6)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

[ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 11)
[x] test: cpython/Lib/test/test_unicode_file.py
[x] test: cpython/Lib/test/test_unicode_file_functions.py
[x] test: cpython/Lib/test/test_unicode_identifiers.py (TODO: 1)
[x] test: cpython/Lib/test/test_ucn.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on unicode)

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

ShaharNaveh

TYSM!

Overall I really like this change.

The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed

ShaharNaveh · 2026-05-21T20:31:32Z

+}
+
+/// Supported Unicode version based on icu4x.
+pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {


I assume icu4x doesn't expose this:/

I couldn't find it, but Rust itself exposes its Unicode version so I'm using that instead for now. Both Rust and icu4x target the latest Unicode version so it should be fine. I hope. 🤔 🤔

ShaharNaveh · 2026-05-21T20:35:18Z

        self.assertTrue('\U0001F46F'.isprintable())
        self.assertFalse('\U000E0020'.isprintable())

-    @unittest.expectedFailure  # TODO: RUSTPYTHON


joshuamegnauth54 · 2026-05-26T01:02:50Z

Also, the Unicode 3.2 files will never change but I agree about linking the sources somewhere. CPython seems to provide 3.2 as a compatibility for an old IDNA version (which is also provided for compatibility 😹 ). All of this work is essentially to fix a 20+ year old module that is likely not used anywhere. 😅

Unicode 3.2 - here is the source for now. I'll add it to a README after I finish the patch. 😁

I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.

joshuamegnauth54 · 2026-06-07T04:33:25Z

I undrafted for CodeRabbit because I couldn't trigger it. There will likely be a lot of suggestions. 😅

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/stdlib/src/unicodedata.rs (1)

500-519: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Incorrect NumericType check in numeric method.

Line 509 checks NumericType::Decimal, but the numeric() method should accept characters with any numeric type (Decimal, Digit, or Numeric), not just Decimal. This will incorrectly reject valid numeric characters like Roman numerals (NumericType::Numeric) and circled digits (NumericType::Digit).

🐛 Proposed fix

         fn numeric(
             &self,
             character: PyStrRef,
             default: OptionalArg<PyObjectRef>,
             vm: &VirtualMachine,
         ) -> PyResult {
             self.extract_char(character, vm).and_then(|c| {
                 c.to_char()
                     .and_then(|c| {
-                        if NumericType::for_char(c) != NumericType::Decimal {
+                        let num_type = NumericType::for_char(c);
+                        if num_type == NumericType::None {
                             return None;
                         }

                         let value = lookup_numeric_val(c, self.unic_version)?;
                         Some(vm.ctx.new_float(value).into())
                     })
                     .or_else(|| default.into_option())
                     .ok_or_else(|| vm.new_value_error("not a numeric character"))
             })
         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/stdlib/src/unicodedata.rs` around lines 500 - 519, The numeric method
is incorrectly filtering only Decimal numeric types; update the check in numeric
(inside the closure from extract_char) to accept any numeric type by testing
that NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.

🧹 Nitpick comments (1)

crates/stdlib/build.rs (1)
581-582: Use explicit Unicode input files for cargo:rerun-if-changed in crates/stdlib/build.rs.

Cargo does scan directory paths recursively for cargo:rerun-if-changed, but change detection is mtime-based, so directory-level tracking can be brittle. This build script only reads a fixed set of files from those directories, so track the exact inputs instead of unicode/ucd32 and unicode/latest.

Replace:
println!("cargo:rerun-if-changed=unicode/ucd32");
println!("cargo:rerun-if-changed=unicode/latest");
With cargo:rerun-if-changed for:

unicode/ucd32/DerivedGeneralCategory-3.2.0.txt

unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt

unicode/ucd32/DerivedBidiClass-3.2.0.txt

unicode/ucd32/DerivedBinaryProperties-3.2.0.txt

unicode/ucd32/DerivedCombiningClass-3.2.0.txt

unicode/ucd32/DerivedNumericValues-3.2.0.txt

unicode/latest/UnicodeData.txt

unicode/latest/NormalizationCorrections.txt

Quick check: run cargo build -p rustpython-stdlib -vv, touch crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt, then run again and confirm Cargo re-runs the rustpython-stdlib build script.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/stdlib/build.rs` around lines 581 - 582, Replace the two
directory-level rerun directives that call
println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extra_tests/snippets/builtin_str_unicode.py`:
- Around line 27-29: The test asserts the wrong versioned decomposition for
U+F951: update the expectation for unicodedata.ucd_3_2_0.decomposition("\uf951")
to the pre-correction mapping "96FB" instead of "964B" so the versioned path
unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the
modern unicodedata.decomposition("\uf951") assertion as-is).

---

Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 500-519: The numeric method is incorrectly filtering only Decimal
numeric types; update the check in numeric (inside the closure from
extract_char) to accept any numeric type by testing that
NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.

---

Nitpick comments:
In `@crates/stdlib/build.rs`:
- Around line 581-582: Replace the two directory-level rerun directives that
call println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cd95b557-2363-4034-b21e-322853c602ab

📥 Commits

Reviewing files that changed from the base of the PR and between 51c97b9 and da04480.

⛔ Files ignored due to path filters (2)

Cargo.lock is excluded by !**/*.lock
Lib/test/test_str.py is excluded by !Lib/**

📒 Files selected for processing (16)

Cargo.toml
crates/stdlib/Cargo.toml
crates/stdlib/build.rs
crates/stdlib/src/unicodedata.rs
crates/stdlib/unicode/README.md
crates/stdlib/unicode/latest/DerivedNumericValues.txt
crates/stdlib/unicode/latest/NormalizationCorrections.txt
crates/stdlib/unicode/latest/UnicodeData.txt
crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
extra_tests/snippets/builtin_str_unicode.py

💤 Files with no reviewable changes (1)

Cargo.toml

coderabbitai · 2026-06-07T04:41:21Z

+# These are from NormalizationCorrections.txt
+assert unicodedata.decomposition("\uf951") == "964B"
+assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ucd_3_2_0 normalization-correction expectation looks incorrect for U+F951.

Line 29 currently expects "964B" for unicodedata.ucd_3_2_0.decomposition("\uf951"), but the Unicode 3.2 compatibility path is intended to return the original pre-correction mapping. This should likely assert "96FB" to validate the versioned behavior instead of the modern corrected value.

Suggested test fix

# These are from NormalizationCorrections.txt assert unicodedata.decomposition("\uf951") == "964B" -assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B" +assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "96FB"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# These are from NormalizationCorrections.txt

assert unicodedata.decomposition("\uf951") == "964B"

assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B"

# These are from NormalizationCorrections.txt

assert unicodedata.decomposition("\uf951") == "964B"

assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "96FB"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extra_tests/snippets/builtin_str_unicode.py` around lines 27 - 29, The test asserts the wrong versioned decomposition for U+F951: update the expectation for unicodedata.ucd_3_2_0.decomposition("\uf951") to the pre-correction mapping "96FB" instead of "964B" so the versioned path unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the modern unicodedata.decomposition("\uf951") assertion as-is).

This is incorrect. Both CPython and RustPython return "964B".

Here's the line in NormalizationCorrections.txt:

F951;96FB;964B;3.2.0 # Corrigendum 3

The character is \u{F951}. The original, incorrect decomposition is \u{96FB}. The decomposition was correction in version 3.2.0 which means that versions >= 3.2.0 use the correct decomposition.

@joshuamegnauth54, you're absolutely right — I misread the correction semantics. Since the NormalizationCorrections.txt format is CODEPOINT;WRONG;CORRECT;VERSION, and the correction for U+F951 was applied at version 3.2.0, ucd_3_2_0 (which is Unicode ≥ 3.2.0) correctly returns the fixed decomposition "964B". The original erroneous value "96FB" would only be relevant for a hypothetical pre-3.2.0 path. The test is correct as written. Sorry for the noise!

(ʘᴗʘ✿)

✏️ Learnings added

Learnt from: joshuamegnauth54 Repo: RustPython/RustPython PR: 7947 File: extra_tests/snippets/builtin_str_unicode.py:27-29 Timestamp: 2026-06-07T17:28:10.604Z Learning: In RustPython's `extra_tests/snippets/builtin_str_unicode.py`, the `NormalizationCorrections.txt` format is `CODEPOINT;ORIGINAL_WRONG_DECOMP;CORRECT_DECOMP;VERSION`. The VERSION field indicates the Unicode version in which the correction was *applied*. For example, `F951;96FB;964B;3.2.0` means U+F951's decomposition was corrected to `964B` starting from Unicode 3.2.0. Therefore `unicodedata.ucd_3_2_0.decomposition("\uf951")` correctly returns `"964B"` (the corrected value), not `"96FB"` (the original wrong value). Both the modern `unicodedata` and the `ucd_3_2_0` versioned path return the same corrected decomposition when the correction version <= 3.2.0.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: joshuamegnauth54 Repo: RustPython/RustPython PR: 7998 File: crates/vm/src/builtins/float.rs:219-223 Timestamp: 2026-05-30T20:17:08.394Z Learning: In RustPython's `crates/vm/src/builtins/float.rs`, `crate::literal::float::parse_bytes` (backed by lexical-core) tolerates trailing NUL (`\0`) bytes in the input buffer. The `float_from_string` UTF-8/ICU path uses a fixed 1024-byte `mapped_str_buf` and passing the full buffer (including unwritten trailing zeros) does not cause float parsing to fail, as confirmed by the PR author via local testing (PR `#7998`).

Learnt from: bschoenmaeckers Repo: RustPython/RustPython PR: 8032 File: crates/capi/src/moduleobject.rs:37-40 Timestamp: 2026-06-03T12:02:30.049Z Learning: In RustPython's C-API (`crates/capi/src/moduleobject.rs`), `PyModule_NewObject` intentionally requires the module name to be a valid UTF-8 string (enforced via `PyStr::to_str()`). This is a deliberate design constraint because RustPython's string model is UTF-8 only. This deviates from CPython's `PyModule_NewObject`, which stores any Unicode object directly without requiring UTF-8 representability. Do not raise review issues about this UTF-8 restriction on module names.

Learnt from: bschoenmaeckers Repo: RustPython/RustPython PR: 8032 File: crates/capi/src/moduleobject.rs:37-40 Timestamp: 2026-06-03T12:02:34.454Z Learning: In RustPython's C-API (`crates/capi/src/moduleobject.rs`), `PyModule_NewObject` intentionally requires the module name to be a valid UTF-8 string (enforced via `PyStr::to_str()`). This is a deliberate design constraint because RustPython's string model is UTF-8 only. This deviates from CPython's `PyModule_NewObject`, which stores any Unicode object directly without requiring UTF-8 representability. Do not raise review issues about this UTF-8 restriction on module names.

Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the `unic-ucd-age` crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences. The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database. Unfortunately, there isn't a crate that can help us here. `icu4x` targets modern Unicode versions. Writing a data provider for `icu4x` for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write small lookup tables. As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.

ShaharNaveh reviewed May 21, 2026

View reviewed changes

Comment thread crates/literal/src/char.rs

ShaharNaveh approved these changes May 21, 2026

View reviewed changes

joshuamegnauth54 force-pushed the fix-isprintable branch 2 times, most recently from 5479c61 to e3ad1ed Compare May 28, 2026 01:03

joshuamegnauth54 mentioned this pull request May 30, 2026

Use icu4x for UTF-8 float(); remove an alloc #7998

Closed

2 tasks

joshuamegnauth54 force-pushed the fix-isprintable branch 4 times, most recently from de85853 to dc1e675 Compare June 7, 2026 04:28

joshuamegnauth54 marked this pull request as ready for review June 7, 2026 04:31

joshuamegnauth54 force-pushed the fix-isprintable branch from dc1e675 to da04480 Compare June 7, 2026 04:32

coderabbitai Bot reviewed Jun 7, 2026

View reviewed changes

joshuamegnauth54 force-pushed the fix-isprintable branch 3 times, most recently from 4db4918 to da427d8 Compare June 9, 2026 03:11

joshuamegnauth54 changed the title ~~Fix isprintable() and fix Unicode 3.2~~ Fix most of unicodedata, unmask multiple tests, and remove ucd Jun 9, 2026

joshuamegnauth54 force-pushed the fix-isprintable branch from da427d8 to d982b32 Compare June 9, 2026 03:32

Conversation

joshuamegnauth54 commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Library Dependencies

Uh oh!

Uh oh!

ShaharNaveh left a comment

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh May 21, 2026

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh May 21, 2026

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 commented May 26, 2026

Uh oh!

joshuamegnauth54 commented Jun 7, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joshuamegnauth54 commented May 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading

github-actions Bot commented May 21, 2026 •

edited

Loading

coderabbitai Bot Jun 7, 2026 •

edited

Loading