Fix most of unicodedata, unmask multiple tests, and remove ucd#7947
Fix most of unicodedata, unmask multiple tests, and remove ucd#7947joshuamegnauth54 wants to merge 1 commit into
unicodedata, unmask multiple tests, and remove ucd#7947Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (3)
📒 Files selected for processing (16)
💤 Files with no reviewable changes (1)
✅ Files skipped from review due to trivial changes (7)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughMigrates unicode handling from ChangesUnicode Data Infrastructure Refactor
Sequence Diagram(s)sequenceDiagram
participant build_rs as build.rs
participant UCDfiles as unicode/ucd32 & unicode/latest
participant Parser as UnicodeFileParser
participant Emitter as RustCodeEmitter
participant Runtime as unicodedata.rs
build_rs->>UCDfiles: read vendored UCD 3.2.0 + latest files
build_rs->>Parser: parse derived-property & normalization files
Parser->>Emitter: produce table entries and decomposition updates
Emitter->>build_rs: write generated Rust sources to OUT_DIR/generated
Runtime->>Emitter: include generated lookup tables (lookup_property / lookup_numeric_val)
Runtime->>Runtime: call ICU APIs for newer UnicodeVersion paths
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] test: cpython/Lib/test/test_str.py (TODO: 6) dependencies: dependent tests: (no tests depend on str) [ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 11) dependencies: dependent tests: (no tests depend on unicode) Legend:
|
ShaharNaveh
left a comment
There was a problem hiding this comment.
TYSM!
Overall I really like this change.
The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed
| } | ||
|
|
||
| /// Supported Unicode version based on icu4x. | ||
| pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion { |
There was a problem hiding this comment.
I assume icu4x doesn't expose this:/
There was a problem hiding this comment.
I couldn't find it, but Rust itself exposes its Unicode version so I'm using that instead for now. Both Rust and icu4x target the latest Unicode version so it should be fine. I hope. 🤔 🤔
| self.assertTrue('\U0001F46F'.isprintable()) | ||
| self.assertFalse('\U000E0020'.isprintable()) | ||
|
|
||
| @unittest.expectedFailure # TODO: RUSTPYTHON |
|
Also, the Unicode 3.2 files will never change but I agree about linking the sources somewhere. CPython seems to provide 3.2 as a compatibility for an old IDNA version (which is also provided for compatibility 😹 ). All of this work is essentially to fix a 20+ year old module that is likely not used anywhere. 😅 Unicode 3.2 - here is the source for now. I'll add it to a README after I finish the patch. 😁 |
5479c61 to
e3ad1ed
Compare
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
de85853 to
dc1e675
Compare
dc1e675 to
da04480
Compare
|
I undrafted for CodeRabbit because I couldn't trigger it. There will likely be a lot of suggestions. 😅 |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/stdlib/src/unicodedata.rs (1)
500-519:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winIncorrect NumericType check in
numericmethod.Line 509 checks
NumericType::Decimal, but thenumeric()method should accept characters with any numeric type (Decimal, Digit, or Numeric), not just Decimal. This will incorrectly reject valid numeric characters like Roman numerals (NumericType::Numeric) and circled digits (NumericType::Digit).🐛 Proposed fix
fn numeric( &self, character: PyStrRef, default: OptionalArg<PyObjectRef>, vm: &VirtualMachine, ) -> PyResult { self.extract_char(character, vm).and_then(|c| { c.to_char() .and_then(|c| { - if NumericType::for_char(c) != NumericType::Decimal { + let num_type = NumericType::for_char(c); + if num_type == NumericType::None { return None; } let value = lookup_numeric_val(c, self.unic_version)?; Some(vm.ctx.new_float(value).into()) }) .or_else(|| default.into_option()) .ok_or_else(|| vm.new_value_error("not a numeric character")) }) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/stdlib/src/unicodedata.rs` around lines 500 - 519, The numeric method is incorrectly filtering only Decimal numeric types; update the check in numeric (inside the closure from extract_char) to accept any numeric type by testing that NumericType::for_char(c) is not NumericType::None (or simply remove the Decimal-only guard), so characters with Digit or Numeric are allowed; locate the conditional that currently compares to NumericType::Decimal and change it to compare against NumericType::None (or delete it) and keep the subsequent lookup_numeric_val(c, self.unic_version) path.
🧹 Nitpick comments (1)
crates/stdlib/build.rs (1)
581-582: Use explicit Unicode input files forcargo:rerun-if-changedincrates/stdlib/build.rs.Cargo does scan directory paths recursively for
cargo:rerun-if-changed, but change detection is mtime-based, so directory-level tracking can be brittle. This build script only reads a fixed set of files from those directories, so track the exact inputs instead ofunicode/ucd32andunicode/latest.Replace:
println!("cargo:rerun-if-changed=unicode/ucd32"); println!("cargo:rerun-if-changed=unicode/latest");With
cargo:rerun-if-changedfor:
unicode/ucd32/DerivedGeneralCategory-3.2.0.txtunicode/ucd32/DerivedEastAsianWidth-3.2.0.txtunicode/ucd32/DerivedBidiClass-3.2.0.txtunicode/ucd32/DerivedBinaryProperties-3.2.0.txtunicode/ucd32/DerivedCombiningClass-3.2.0.txtunicode/ucd32/DerivedNumericValues-3.2.0.txtunicode/latest/UnicodeData.txtunicode/latest/NormalizationCorrections.txtQuick check: run
cargo build -p rustpython-stdlib -vv,touch crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt, then run again and confirm Cargo re-runs therustpython-stdlibbuild script.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/stdlib/build.rs` around lines 581 - 582, Replace the two directory-level rerun directives that call println!("cargo:rerun-if-changed=unicode/ucd32") and println!("cargo:rerun-if-changed=unicode/latest") with explicit cargo:rerun-if-changed entries for each input file the build script actually reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt, unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt, unicode/ucd32/DerivedBidiClass-3.2.0.txt, unicode/ucd32/DerivedBinaryProperties-3.2.0.txt, unicode/ucd32/DerivedCombiningClass-3.2.0.txt, unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt, and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the build script when any of those exact files change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@extra_tests/snippets/builtin_str_unicode.py`:
- Around line 27-29: The test asserts the wrong versioned decomposition for
U+F951: update the expectation for unicodedata.ucd_3_2_0.decomposition("\uf951")
to the pre-correction mapping "96FB" instead of "964B" so the versioned path
unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the
modern unicodedata.decomposition("\uf951") assertion as-is).
---
Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 500-519: The numeric method is incorrectly filtering only Decimal
numeric types; update the check in numeric (inside the closure from
extract_char) to accept any numeric type by testing that
NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.
---
Nitpick comments:
In `@crates/stdlib/build.rs`:
- Around line 581-582: Replace the two directory-level rerun directives that
call println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: cd95b557-2363-4034-b21e-322853c602ab
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.lockLib/test/test_str.pyis excluded by!Lib/**
📒 Files selected for processing (16)
Cargo.tomlcrates/stdlib/Cargo.tomlcrates/stdlib/build.rscrates/stdlib/src/unicodedata.rscrates/stdlib/unicode/README.mdcrates/stdlib/unicode/latest/DerivedNumericValues.txtcrates/stdlib/unicode/latest/NormalizationCorrections.txtcrates/stdlib/unicode/latest/UnicodeData.txtcrates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txtextra_tests/snippets/builtin_str_unicode.py
💤 Files with no reviewable changes (1)
- Cargo.toml
| # These are from NormalizationCorrections.txt | ||
| assert unicodedata.decomposition("\uf951") == "964B" | ||
| assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B" |
There was a problem hiding this comment.
ucd_3_2_0 normalization-correction expectation looks incorrect for U+F951.
Line 29 currently expects "964B" for unicodedata.ucd_3_2_0.decomposition("\uf951"), but the Unicode 3.2 compatibility path is intended to return the original pre-correction mapping. This should likely assert "96FB" to validate the versioned behavior instead of the modern corrected value.
Suggested test fix
# These are from NormalizationCorrections.txt
assert unicodedata.decomposition("\uf951") == "964B"
-assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B"
+assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "96FB"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # These are from NormalizationCorrections.txt | |
| assert unicodedata.decomposition("\uf951") == "964B" | |
| assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "964B" | |
| # These are from NormalizationCorrections.txt | |
| assert unicodedata.decomposition("\uf951") == "964B" | |
| assert unicodedata.ucd_3_2_0.decomposition("\uf951") == "96FB" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extra_tests/snippets/builtin_str_unicode.py` around lines 27 - 29, The test
asserts the wrong versioned decomposition for U+F951: update the expectation for
unicodedata.ucd_3_2_0.decomposition("\uf951") to the pre-correction mapping
"96FB" instead of "964B" so the versioned path
unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the
modern unicodedata.decomposition("\uf951") assertion as-is).
There was a problem hiding this comment.
This is incorrect. Both CPython and RustPython return "964B".
Here's the line in NormalizationCorrections.txt:
F951;96FB;964B;3.2.0 # Corrigendum 3
The character is \u{F951}. The original, incorrect decomposition is \u{96FB}. The decomposition was correction in version 3.2.0 which means that versions >= 3.2.0 use the correct decomposition.
There was a problem hiding this comment.
@joshuamegnauth54, you're absolutely right — I misread the correction semantics. Since the NormalizationCorrections.txt format is CODEPOINT;WRONG;CORRECT;VERSION, and the correction for U+F951 was applied at version 3.2.0, ucd_3_2_0 (which is Unicode ≥ 3.2.0) correctly returns the fixed decomposition "964B". The original erroneous value "96FB" would only be relevant for a hypothetical pre-3.2.0 path. The test is correct as written. Sorry for the noise!
(ʘᴗʘ✿)
✏️ Learnings added
Learnt from: joshuamegnauth54
Repo: RustPython/RustPython PR: 7947
File: extra_tests/snippets/builtin_str_unicode.py:27-29
Timestamp: 2026-06-07T17:28:10.604Z
Learning: In RustPython's `extra_tests/snippets/builtin_str_unicode.py`, the `NormalizationCorrections.txt` format is `CODEPOINT;ORIGINAL_WRONG_DECOMP;CORRECT_DECOMP;VERSION`. The VERSION field indicates the Unicode version in which the correction was *applied*. For example, `F951;96FB;964B;3.2.0` means U+F951's decomposition was corrected to `964B` starting from Unicode 3.2.0. Therefore `unicodedata.ucd_3_2_0.decomposition("\uf951")` correctly returns `"964B"` (the corrected value), not `"96FB"` (the original wrong value). Both the modern `unicodedata` and the `ucd_3_2_0` versioned path return the same corrected decomposition when the correction version <= 3.2.0.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: joshuamegnauth54
Repo: RustPython/RustPython PR: 7998
File: crates/vm/src/builtins/float.rs:219-223
Timestamp: 2026-05-30T20:17:08.394Z
Learning: In RustPython's `crates/vm/src/builtins/float.rs`, `crate::literal::float::parse_bytes` (backed by lexical-core) tolerates trailing NUL (`\0`) bytes in the input buffer. The `float_from_string` UTF-8/ICU path uses a fixed 1024-byte `mapped_str_buf` and passing the full buffer (including unwritten trailing zeros) does not cause float parsing to fail, as confirmed by the PR author via local testing (PR `#7998`).
Learnt from: bschoenmaeckers
Repo: RustPython/RustPython PR: 8032
File: crates/capi/src/moduleobject.rs:37-40
Timestamp: 2026-06-03T12:02:30.049Z
Learning: In RustPython's C-API (`crates/capi/src/moduleobject.rs`), `PyModule_NewObject` intentionally requires the module name to be a valid UTF-8 string (enforced via `PyStr::to_str()`). This is a deliberate design constraint because RustPython's string model is UTF-8 only. This deviates from CPython's `PyModule_NewObject`, which stores any Unicode object directly without requiring UTF-8 representability. Do not raise review issues about this UTF-8 restriction on module names.
Learnt from: bschoenmaeckers
Repo: RustPython/RustPython PR: 8032
File: crates/capi/src/moduleobject.rs:37-40
Timestamp: 2026-06-03T12:02:34.454Z
Learning: In RustPython's C-API (`crates/capi/src/moduleobject.rs`), `PyModule_NewObject` intentionally requires the module name to be a valid UTF-8 string (enforced via `PyStr::to_str()`). This is a deliberate design constraint because RustPython's string model is UTF-8 only. This deviates from CPython's `PyModule_NewObject`, which stores any Unicode object directly without requiring UTF-8 representability. Do not raise review issues about this UTF-8 restriction on module names.
4db4918 to
da427d8
Compare
unicodedata, unmask multiple tests, and remove ucd
Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the `unic-ucd-age` crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences. The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database. Unfortunately, there isn't a crate that can help us here. `icu4x` targets modern Unicode versions. Writing a data provider for `icu4x` for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write small lookup tables. As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.
da427d8 to
d982b32
Compare
Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the
unic-ucd-agecrate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.
Unfortunately, there isn't a crate that can help us here.
icu4xtargets modern Unicode versions. Writing a data provider foricu4xfor Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skipicu4x(mostly) to instead write small lookup tables.As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Documentation
Tests