Fast Censor

fast_censor

A fast and flexible package for filtering profanity or other strings out of text, roughly 100 times faster than alternatives
The fastest string utility for profanity detection and censoring
Detects matches even with repeated characters and character substitutions
Has zero dependencies and works with Python 3.6 through 3.11

Installation

From source

cd fast_censor  # enter the project directory
python setup.py install

From GitHub

pip install git+https://github.com/mbuchove/fast_censor.git

Usage

import fast_censor
from fast_censor import FastCensor

# load the default (encoded) profanity word list
censor = FastCensor()

# load a word list from another path; this example is a plain-text (unencoded) word list
censor_clean = FastCensor(
    wordlist=fast_censor.WordListHandler.get_default_wordlist_path("clean_wordlist_decoded.txt"),
    wordlist_encoded=False,
)

# censor texts or simply get the indices of matches
matches = censor_clean.check_text("this bat is for riii1ick")
# >>> [(5, 9), (17, 25)]
censored_text = censor_clean.censor("fuuudge you")
# >>> "******* you"
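If you only have match indices and want to apply your own masking, a small helper is enough. The sketch below is not part of fast_censor: mask_spans is a hypothetical function, and it assumes each pair is a (start, end) span with an exclusive end, like a Python slice. The index convention is not stated in this README, so check it against the output of check_text before relying on it.

```python
def mask_spans(text, spans, mask_char="*"):
    """Replace every character inside each half-open (start, end) span.

    Hypothetical helper, not part of fast_censor. Spans are assumed to
    follow Python slice semantics (end is exclusive).
    """
    chars = list(text)
    for start, end in spans:
        # Clamp to the text bounds so out-of-range spans cannot raise.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


masked = mask_spans("this bat is here", [(5, 8)])
print(masked)  # this *** is here
```

Masking character by character keeps the text length unchanged, so any later spans stay valid.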

Character substitutions

FastCensor's profanity matcher allows the flexibility to match words when specified characters are substituted for others, as is customary in 1337 speak. A default is set for commonly used substitutions.

To set your own, pass a mapping like the following into FastCensor. In this example, '@' and '4' are accepted in place of 'a':

substitutions = {'a': '@4'}

All matching is case-insensitive.
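To illustrate the idea of a substitution map, here is a minimal sketch that turns a word into a case-insensitive regular expression in which each character also accepts its substitutes. This is not how fast_censor is implemented (the README does not describe its internals); build_pattern is a hypothetical name used only for this example.

```python
import re


def build_pattern(word, substitutions):
    """Compile a regex matching word, allowing per-character substitutes.

    Hypothetical helper for illustration; not fast_censor's algorithm.
    substitutions maps a character to a string of accepted replacements.
    """
    parts = []
    for ch in word:
        # The character itself plus any substitutes form one character class.
        allowed = ch + substitutions.get(ch, "")
        parts.append("[" + re.escape(allowed) + "]")
    return re.compile("".join(parts), re.IGNORECASE)


pattern = build_pattern("bat", {"a": "@4"})
print(bool(pattern.fullmatch("B@t")))  # True
print(bool(pattern.fullmatch("b4T")))  # True
print(bool(pattern.fullmatch("bet")))  # False
```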

Character repetition

By default, a word still matches even if any of its characters is repeated any number of times. Repeats may use any valid substitute for that character.

For example, "baaa@@aatt" will match "bat".

You can turn this off by passing allow_repititions=False to censor_text or check_text.
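The repetition rule can be pictured as putting a "one or more" quantifier on each character of the word. The sketch below shows that idea with the standard re module. It is an illustration only, not fast_censor's implementation; build_repeat_pattern and its allow_repeats flag are hypothetical names.

```python
import re


def build_repeat_pattern(word, substitutions, allow_repeats=True):
    """Compile a regex for word where each character may repeat.

    Hypothetical helper for illustration; not fast_censor's algorithm.
    With allow_repeats=False each character must appear exactly once.
    """
    quantifier = "+" if allow_repeats else ""
    parts = []
    for ch in word:
        # A repeat may be the character itself or any of its substitutes.
        allowed = re.escape(ch + substitutions.get(ch, ""))
        parts.append("[" + allowed + "]" + quantifier)
    return re.compile("".join(parts), re.IGNORECASE)


subs = {"a": "@4"}
with_repeats = build_repeat_pattern("bat", subs)
without_repeats = build_repeat_pattern("bat", subs, allow_repeats=False)

print(bool(with_repeats.fullmatch("baaa@@aatt")))     # True
print(bool(without_repeats.fullmatch("baaa@@aatt")))  # False
print(bool(without_repeats.fullmatch("b@t")))         # True
```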

Delimiters

Use the delimiters parameter of FastCensor to set the delimiter characters, which determine the boundaries of a word. Profanity matches will not extend across any delimiting character.

For example, if '_' is a delimiter, "ba_t" will not match "bat".
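One way to see why a delimiter blocks a match is to split the text into delimiter-free tokens first and match inside each token only. The following sketch does that while keeping track of positions. It is an assumption about the general technique, not a description of fast_censor's internals, and split_with_spans is a hypothetical name.

```python
def split_with_spans(text, delimiters=" _"):
    """Yield (start, end, token) for each run of non-delimiter characters.

    Hypothetical helper for illustration; not part of fast_censor.
    end is exclusive, so text[start:end] == token.
    """
    start = None
    for i, ch in enumerate(text):
        if ch in delimiters:
            if start is not None:
                # A delimiter closes the token that was in progress.
                yield start, i, text[start:i]
                start = None
        elif start is None:
            start = i
    if start is not None:
        # Emit the final token if the text does not end with a delimiter.
        yield start, len(text), text[start:]


tokens = list(split_with_spans("ba_t bat", " _"))
print(tokens)  # [(0, 2, 'ba'), (3, 4, 't'), (5, 8, 'bat')]
```

With '_' as a delimiter, "ba_t" becomes the separate tokens "ba" and "t", so neither can match "bat" on its own.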

Editing and saving wordlist

censor.add_word('new_word')  # to add a new word
censor.write_words_file("word_lists/new_wordlist_encoded.txt", encode=True)

Encoding

By default, the word lists are base64-encoded so that you can avoid displaying vulgar or offensive words. If you would like to save a word list in plain text, set encode=False in write_words_file.
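For readers unfamiliar with base64, the sketch below shows a round trip with Python's standard base64 module. It assumes each word is encoded on its own, which may not match the exact file layout fast_censor uses; encode_words and decode_words are hypothetical helpers, not library functions.

```python
import base64


def encode_words(words):
    """Base64-encode each word separately (assumed layout, for illustration)."""
    return [base64.b64encode(w.encode("utf-8")).decode("ascii") for w in words]


def decode_words(lines):
    """Reverse encode_words: decode each base64 string back to text."""
    return [base64.b64decode(line).decode("utf-8") for line in lines]


encoded = encode_words(["bat", "fudge"])
print(encoded)                # ['YmF0', 'ZnVkZ2U=']
print(decode_words(encoded))  # ['bat', 'fudge']
```

Base64 only obscures the words from casual reading; it is not encryption, and anyone can decode the list.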

Benchmarks

See notebooks/benchmarks.ipynb for details

See: This Gist
