utils: replace chardet with charset-normalizer #775

Closed
CyberTailor wants to merge 3 commits into pkgcore:master from CyberTailor:charset-normalizer

CyberTailor commented Mar 22, 2026

chardet 7.0 has been completely rewritten using AI coding agents, and the related pkgcheck tests now fail. chardet v7 is unlikely to be packaged in Gentoo, so it makes no sense to make pkgcheck compatible with it. Pinning to chardet<7 doesn't really make sense either, as that branch won't receive updates. Switching to another library would be fine.

I think we can also remove most of our custom "binary or not" heuristics; charset_normalizer.is_binary should be good enough.
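
For illustration, here is a minimal sketch of what relying on charset_normalizer.is_binary could look like. The helper name looks_binary and the NUL-byte fallback are assumptions, added only so the snippet also runs where charset-normalizer is not installed. Neither is taken from the pkgcheck code in this PR.

```python
try:
    # is_binary is exported by charset-normalizer 3.0 and later.
    from charset_normalizer import is_binary
except ImportError:
    # Library not available; looks_binary() below uses a crude fallback.
    is_binary = None


def looks_binary(payload: bytes) -> bool:
    """Hypothetical helper: report whether a byte string looks like binary data."""
    if is_binary is not None:
        # is_binary accepts bytes, a path, or an open binary file object.
        return is_binary(payload)
    # Assumed fallback for this sketch only: treat any NUL byte as binary.
    return b"\x00" in payload


if __name__ == "__main__":
    sample = b'EAPI=8\nDESCRIPTION="example"\n'
    print(looks_binary(sample))
```

With the library installed, the whole decision is delegated to charset_normalizer, which is the simplification proposed above.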

Another reason to switch from chardet:

$ time python -c "import chardet"

real    0m0,388s
user    0m0,236s
sys     0m0,053s

$ time python -c "import charset_normalizer"

real    0m0,119s
user    0m0,091s
sys     0m0,018s
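
A single `time` run is noisy, so the comparison can be repeated a few times. The sketch below is one way to do that using only the standard library. The function name import_wall_time and the repeat count are assumptions, and the figure it returns includes interpreter startup, just like the `time` output above.

```python
import subprocess
import sys
import time


def import_wall_time(module: str, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds to start Python and import `module`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # A fresh interpreter per run, so the module is never already cached.
        subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            check=True,  # raises CalledProcessError if the import fails
            capture_output=True,
        )
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Any importable module names given on the command line are measured.
    for name in sys.argv[1:]:
        print(f"{name}: {import_wall_time(name):.3f}s")
```

Running the script with `chardet charset_normalizer` as arguments would compare the two, assuming both are installed. `python -X importtime -c "import chardet"` gives a per-module breakdown instead.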

Replace a 'utf_16_be'-decodable byte string with another, which is non-decodable.

Signed-off-by: Anna (cybertailor) Vyalkova <cyber+gentoo@sysrq.in>

Replace 'Big5'-decodable byte string with another, which is non-decodable.

Signed-off-by: Anna (cybertailor) Vyalkova <cyber+gentoo@sysrq.in>

Also replaces our custom heuristics with charset_normalizer.is_binary().

Signed-off-by: Anna (cybertailor) Vyalkova <cyber+gentoo@sysrq.in>

arthurzam left a comment

I see that charset-normalizer has no Python library dependencies, which is nice and what I prefer.

Looks good, thank you.

arthurzam marked this pull request as ready for review March 22, 2026 10:16