A better way of removing punctuation from a string in Python
This post is as much a future reminder for me as anything.
I made a Python program called game-image-resizer a few months ago. It takes a list of board games, finds each board game on BoardGameGeek's API, downloads the best image for each game, does some resizing and editing of the image, and then saves it using a useful filename.
That final stage of saving as a useful filename meant taking the board game name, making it lower case, removing punctuation, and replacing spaces with underscores.
I did it like this, roughly following the information in this Stack Overflow discussion:
```python
from string import punctuation

working_string = "Mechs vs. Minions"  # example board game name

# making string lower case
working_string = working_string.lower()

# removing punctuation
remove_punctuation = str.maketrans('', '', punctuation)
working_string = working_string.translate(remove_punctuation)

# replacing spaces and double-spaces with an underscore
working_string = working_string.replace("  ", "_")  # double-spaces first
working_string = working_string.replace(" ", "_")
```
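One thing worth noting about this approach is that `string.punctuation` only contains ASCII punctuation, so characters such as an en dash or an accented letter pass straight through. The sketch below wraps the same steps in a hypothetical helper (`old_style_filename` is my name for illustration, not something from the original program) and assumes the first `replace` targets a double space.

```python
from string import punctuation


def old_style_filename(name: str) -> str:
    """Hypothetical wrapper around the lower/translate/replace steps."""
    working_string = name.lower()
    # string.punctuation is ASCII-only, so non-ASCII punctuation survives this step
    working_string = working_string.translate(str.maketrans("", "", punctuation))
    working_string = working_string.replace("  ", "_")  # assumed double space
    working_string = working_string.replace(" ", "_")
    return working_string


print(old_style_filename("Mechs vs. Minions"))
# mechs_vs_minions
print(old_style_filename("Flash Point: Fire Rescue – Honor & Duty"))
# flash_point_fire_rescue_–_honor_duty
print(old_style_filename("Orléans"))
# orléans
```

The en dash and the "é" both end up in the filename, which may or may not matter depending on where the files are going.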
The better way uses Inflection, a "string transformation library". Inflection does all sorts of things, including `inflection.parameterize()`. Parameterize "replace[s] special characters in a string so that it may be used as part of a 'pretty' URL."
This means I can now do the following, which is much nicer to read and to write.
```python
from inflection import parameterize

# Example board game names with upper case, punctuation, and non-ASCII characters
board_game_names = [
    "Dawn of the Zeds (Third edition)",
    "Flash Point: Fire Rescue – Honor & Duty",
    "Orléans",
    "Mechs vs. Minions",
    "Tzolk'in: The Mayan Calendar",
    "T.I.M.E Stories",
    "Aeon's End",
]

for name in board_game_names:
    parameterized_name = parameterize(name, separator="_")  # Default is `separator='-'`
    print(parameterized_name)  # Or whatever I want to do with it
```

Output:

```
dawn_of_the_zeds_third_edition
flash_point_fire_rescue_honor_duty
orleans
mechs_vs_minions
tzolk_in_the_mayan_calendar
t_i_m_e_stories
aeon_s_end
```
Parameterize mostly just uses a few regular expressions, but it's very useful. It has the effect of:
- Replacing non-ASCII characters with an ASCII approximation, using `inflection.transliterate()`
- Replacing any character with the separator if it isn't one of:
  - an ASCII letter (a-z or A-Z)
  - a digit (0-9)
  - a hyphen (-)
  - an underscore (_)
- Ensuring there is never more than one separator in a row
- Removing separators from the start or end of the string
- Making the string lower case
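To make those steps concrete, here is a rough standard-library sketch of the same effects. It is not Inflection's actual source: `parameterize_sketch` is a hypothetical name, and the regular expressions are my own approximation of the behaviour listed above, assuming a simple separator such as `_` or `-`.

```python
import re
import unicodedata


def parameterize_sketch(text: str, separator: str = "-") -> str:
    """Approximate the listed effects using only the standard library."""
    # Decompose accented characters, then drop anything with no ASCII form
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace each run of characters that isn't a letter, digit, hyphen or underscore
    result = re.sub(r"[^a-zA-Z0-9\-_]+", separator, ascii_text)
    if separator:
        sep = re.escape(separator)
        # Never more than one separator in a row
        result = re.sub(f"(?:{sep}){{2,}}", separator, result)
        # No separators at the start or end of the string
        result = re.sub(f"^(?:{sep})+|(?:{sep})+$", "", result)
    return result.lower()


print(parameterize_sketch("Dawn of the Zeds (Third edition)", separator="_"))
# dawn_of_the_zeds_third_edition
print(parameterize_sketch("Orléans", separator="_"))
# orleans
```

In practice I'd still reach for the library call; the sketch is only meant to show that there's nothing mysterious going on.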