first commit

2026-05-06 08:05:37 +00:00 · 2022-05-23 00:16:32 +04:00
commit d660f2a4ca
24786 changed files with 4428337 additions and 0 deletions
@@ -0,0 +1 @@
+pip
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2019 TAHRI Ahmed R.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,269 @@
+Metadata-Version: 2.1
+Name: charset-normalizer
+Version: 2.0.12
+Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
+Home-page: https://github.com/ousret/charset_normalizer
+Author: Ahmed TAHRI @Ousret
+Author-email: ahmed.tahri@cloudnursery.dev
+License: MIT
+Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues
+Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest
+Keywords: encoding,i18n,txt,text,charset,charset-detector,normalization,unicode,chardet
+Platform: UNKNOWN
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.5
+Classifier: Programming Language :: Python :: 3.6
+Classifier: Programming Language :: Python :: 3.7
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Topic :: Text Processing :: Linguistic
+Classifier: Topic :: Utilities
+Classifier: Programming Language :: Python :: Implementation :: PyPy
+Classifier: Typing :: Typed
+Requires-Python: >=3.5.0
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Provides-Extra: unicode_backport
+Requires-Dist: unicodedata2 ; extra == 'unicode_backport'
+
+
+<h1 align="center">Charset Detection, for Everyone 👋 <a href="https://twitter.com/intent/tweet?text=The%20Real%20First%20Universal%20Charset%20%26%20Language%20Detector&url=https://www.github.com/Ousret/charset_normalizer&hashtags=python,encoding,chardet,developers"><img src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social"/></a></h1>
+
+<p align="center">
+  <sup>The Real First Universal Charset Detector</sup><br>
+  <a href="https://pypi.org/project/charset-normalizer">
+    <img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
+  </a>
+  <a href="https://codecov.io/gh/Ousret/charset_normalizer">
+      <img src="https://codecov.io/gh/Ousret/charset_normalizer/branch/master/graph/badge.svg" />
+  </a>
+  <a href="https://pepy.tech/project/charset-normalizer/">
+    <img alt="Download Count Total" src="https://pepy.tech/badge/charset-normalizer/month" />
+  </a>
+</p>
+
+> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
+> I'm trying to resolve the issue by taking a new approach.
+> All IANA character set names for which the Python core library provides codecs are supported.
+
+<p align="center">
+  >>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
+</p>
+
+This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.
+
+| Feature       | [Chardet](https://github.com/chardet/chardet)       | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
+| ------------- | :-------------: | :------------------: | :------------------: |
+| `Fast`         | ❌<br>          | ✅<br>             | ✅ <br> |
+| `Universal**`     | ❌            | ✅                 | ❌ |
+| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
+| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
+| `Free & Open`  | ✅             | ✅                | ✅ |
+| `License` | LGPL-2.1 | MIT | MPL-1.1
+| `Native Python` | ✅ | ✅ | ❌ |
+| `Detect spoken language` | ❌ | ✅ | N/A |
+| `Supported Encoding` | 30 | :tada: [93](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings)  | 40
+
+<p align="center">
+<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
+
+*\*\* : They are clearly using specific code for a specific encoding even if covering most of used one*<br> 
+Did you got there because of the logs? See [https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html](https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html)
+
+## ⭐ Your support
+
+*Fork, test-it, star-it, submit your ideas! We do listen.*
+  
+## ⚡ Performance
+
+This package offer better performance than its counterpart Chardet. Here are some numbers.
+
+| Package       | Accuracy       | Mean per file (ms) | File per sec (est) |
+| ------------- | :-------------: | :------------------: | :------------------: |
+|      [chardet](https://github.com/chardet/chardet)        |     92 %     |     220 ms      |       5 file/sec        |
+| charset-normalizer |    **98 %**     |     **40 ms**      |       25 file/sec    |
+
+| Package       | 99th percentile       | 95th percentile | 50th percentile |
+| ------------- | :-------------: | :------------------: | :------------------: |
+|      [chardet](https://github.com/chardet/chardet)        |     1115 ms     |     300 ms      |       27 ms        |
+| charset-normalizer |    460 ms     |     240 ms      |       18 ms    |
+
+Chardet's performance on larger file (1MB+) are very poor. Expect huge difference on large payload.
+
+> Stats are generated using 400+ files using default parameters. More details on used files, see GHA workflows.
+> And yes, these results might change at any time. The dataset can be updated to include more files.
+> The actual delays heavily depends on your CPU capabilities. The factors should remain the same.
+
+[cchardet](https://github.com/PyYoshi/cChardet) is a non-native (cpp binding) and unmaintained faster alternative with 
+a better accuracy than chardet but lower than this package. If speed is the most important factor, you should try it.
+
+## ✨ Installation
+
+Using PyPi for latest stable
+```sh
+pip install charset-normalizer -U
+```
+
+If you want a more up-to-date `unicodedata` than the one available in your Python setup.
+```sh
+pip install charset-normalizer[unicode_backport] -U
+```
+
+## 🚀 Basic Usage
+
+### CLI
+This package comes with a CLI.
+
+```
+usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
+                  file [file ...]
+
+The Real First Universal Charset Detector. Discover originating encoding used
+on text file. Normalize text to unicode.
+
+positional arguments:
+  files                 File(s) to be analysed
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -v, --verbose         Display complementary information about file if any.
+                        Stdout will contain logs about the detection process.
+  -a, --with-alternative
+                        Output complementary possibilities if any. Top-level
+                        JSON WILL be a list.
+  -n, --normalize       Permit to normalize input file. If not set, program
+                        does not write anything.
+  -m, --minimal         Only output the charset detected to STDOUT. Disabling
+                        JSON output.
+  -r, --replace         Replace file when trying to normalize it instead of
+                        creating a new one.
+  -f, --force           Replace file without asking if you are sure, use this
+                        flag with caution.
+  -t THRESHOLD, --threshold THRESHOLD
+                        Define a custom maximum amount of chaos allowed in
+                        decoded content. 0. <= chaos <= 1.
+  --version             Show version information and exit.
+```
+
+```bash
+normalizer ./data/sample.1.fr.srt
+```
+
+:tada: Since version 1.4.0 the CLI produce easily usable stdout result in JSON format.
+
+```json
+{
+    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
+    "encoding": "cp1252",
+    "encoding_aliases": [
+        "1252",
+        "windows_1252"
+    ],
+    "alternative_encodings": [
+        "cp1254",
+        "cp1256",
+        "cp1258",
+        "iso8859_14",
+        "iso8859_15",
+        "iso8859_16",
+        "iso8859_3",
+        "iso8859_9",
+        "latin_1",
+        "mbcs"
+    ],
+    "language": "French",
+    "alphabets": [
+        "Basic Latin",
+        "Latin-1 Supplement"
+    ],
+    "has_sig_or_bom": false,
+    "chaos": 0.149,
+    "coherence": 97.152,
+    "unicode_path": null,
+    "is_preferred": true
+}
+```
+
+### Python
+*Just print out normalized text*
+```python
+from charset_normalizer import from_path
+
+results = from_path('./my_subtitle.srt')
+
+print(str(results.best()))
+```
+
+*Normalize any text file*
+```python
+from charset_normalizer import normalize
+try:
+    normalize('./my_subtitle.srt') # should write to disk my_subtitle-***.srt
+except IOError as e:
+    print('Sadly, we are unable to perform charset normalization.', str(e))
+```
+
+*Upgrade your code without effort*
+```python
+from charset_normalizer import detect
+```
+
+The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) BC result possible.
+
+See the docs for advanced usage : [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)
+
+## 😇 Why
+
+When I started using Chardet, I noticed that it was not suited to my expectations, and I wanted to propose a
+reliable alternative using a completely different method. Also! I never back down on a good challenge!
+
+I **don't care** about the **originating charset** encoding, because **two different tables** can
+produce **two identical rendered string.**
+What I want is to get readable text, the best I can. 
+
+In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
+
+Don't confuse package **ftfy** with charset-normalizer or chardet. ftfy goal is to repair unicode string whereas charset-normalizer to convert raw file in unknown encoding to unicode.
+
+## 🍰 How
+
+  - Discard all charset encoding table that could not fit the binary content.
+  - Measure chaos, or the mess once opened (by chunks) with a corresponding charset encoding.
+  - Extract matches with the lowest mess detected.
+  - Additionally, we measure coherence / probe for a language.
+
+**Wait a minute**, what is chaos/mess and coherence according to **YOU ?**
+
+*Chaos :* I opened hundred of text files, **written by humans**, with the wrong encoding table. **I observed**, then
+**I established** some ground rules about **what is obvious** when **it seems like** a mess.
+ I know that my interpretation of what is chaotic is very subjective, feel free to contribute in order to
+ improve or rewrite it.
+
+*Coherence :* For each language there is on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought
+that intel is worth something here. So I use those records against decoded text to check if I can detect intelligent design.
+
+## ⚡ Known limitations
+
+  - Language detection is unreliable when text contains two or more languages sharing identical letters. (eg. HTML (english tags) + Turkish content (Sharing Latin characters))
+  - Every charset detector heavily depends on sufficient content. In common cases, do not bother run detection on very tiny content.
+
+## 👤 Contributing
+
+Contributions, issues and feature requests are very much welcome.<br />
+Feel free to check [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.
+
+## 📝 License
+
+Copyright © 2019 [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
+This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.
+
+Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)
+
+
@@ -0,0 +1,33 @@
+../../Scripts/normalizer.exe,sha256=M2lL5SxcA4Ks-HOlcAQPjPyhkKeSc8HzFQfbTWx2MVQ,106413
+charset_normalizer-2.0.12.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
+charset_normalizer-2.0.12.dist-info/LICENSE,sha256=6zGgxaT7Cbik4yBV0lweX5w1iidS_vPNcgIT0cz-4kE,1070
+charset_normalizer-2.0.12.dist-info/METADATA,sha256=eX-U3s7nb6wcvXZFyM1mdBf1yz4I0msVBgNvLEscAbo,11713
+charset_normalizer-2.0.12.dist-info/RECORD,,
+charset_normalizer-2.0.12.dist-info/WHEEL,sha256=G16H4A3IeoQmnOrYV4ueZGKSjhipXx8zc8nu9FGlvMA,92
+charset_normalizer-2.0.12.dist-info/entry_points.txt,sha256=5AJq_EPtGGUwJPgQLnBZfbVr-FYCIwT0xP7dIEZO3NI,77
+charset_normalizer-2.0.12.dist-info/top_level.txt,sha256=7ASyzePr8_xuZWJsnqJjIBtyV8vhEo0wBCv1MPRRi3Q,19
+charset_normalizer/__init__.py,sha256=x2A2OW29MBcqdxsvy6t1wzkUlH3ma0guxL6ZCfS8J94,1790
+charset_normalizer/__pycache__/__init__.cpython-310.pyc,,
+charset_normalizer/__pycache__/api.cpython-310.pyc,,
+charset_normalizer/__pycache__/cd.cpython-310.pyc,,
+charset_normalizer/__pycache__/constant.cpython-310.pyc,,
+charset_normalizer/__pycache__/legacy.cpython-310.pyc,,
+charset_normalizer/__pycache__/md.cpython-310.pyc,,
+charset_normalizer/__pycache__/models.cpython-310.pyc,,
+charset_normalizer/__pycache__/utils.cpython-310.pyc,,
+charset_normalizer/__pycache__/version.cpython-310.pyc,,
+charset_normalizer/api.py,sha256=r__Wz85F5pYOkRwEY5imXY_pCZ2Nil1DkdaAJY7T5o0,20303
+charset_normalizer/assets/__init__.py,sha256=FPnfk8limZRb8ZIUQcTvPEcbuM1eqOdWGw0vbWGycDs,25485
+charset_normalizer/assets/__pycache__/__init__.cpython-310.pyc,,
+charset_normalizer/cd.py,sha256=a9Kzzd9tHl_W08ExbCFMmRJqdo2k7EBQ8Z_3y9DmYsg,11076
+charset_normalizer/cli/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+charset_normalizer/cli/__pycache__/__init__.cpython-310.pyc,,
+charset_normalizer/cli/__pycache__/normalizer.cpython-310.pyc,,
+charset_normalizer/cli/normalizer.py,sha256=LkeFIRc1l28rOgXpEby695x0bcKQv4D8z9FmA3Z2c3A,9364
+charset_normalizer/constant.py,sha256=51u_RS10I1vYVpBao__xHqf--HHNrR6me1A1se5r5Y0,19449
+charset_normalizer/legacy.py,sha256=XKeZOts_HdYQU_Jb3C9ZfOjY2CiUL132k9_nXer8gig,3384
+charset_normalizer/md.py,sha256=WEwnu2MyIiMeEaorRduqcTxGjIBclWIG3i-9_UL6LLs,18191
+charset_normalizer/models.py,sha256=XrGpVxfonhcilIWC1WeiP3-ZORGEe_RG3sgrfPLl9qM,13303
+charset_normalizer/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+charset_normalizer/utils.py,sha256=AWSL0z1B42IwdLfjX4ZMASA9cTUsTp0PweCdW98SI-4,9308
+charset_normalizer/version.py,sha256=uxO2cT0YIavQv4dQlNGmHPIOOwOa-exspxXi3IR7dck,80
@@ -0,0 +1,5 @@
+Wheel-Version: 1.0
+Generator: bdist_wheel (0.37.1)
+Root-Is-Purelib: true
+Tag: py3-none-any
+
@@ -0,0 +1,3 @@
+[console_scripts]
+normalizer = charset_normalizer.cli.normalizer:cli_detect
+
@@ -0,0 +1 @@
+charset_normalizer