Python’s re module searches, matches, and replaces text using regular expressions (Regular Expression – Python Document).
Key features
- Pattern matching: Find or check for specific string patterns
- Search and replace: Modify strings based on patterns
- String splitting: Split strings on a specific pattern
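Splitting is the one feature in this list that the example code further down does not exercise, so here is a minimal sketch of it. The sample string and the choice of separators are assumptions made up for illustration.

```python
import re

# Hypothetical sample string that mixes several separators.
raw = "apple, banana;cherry  orange"

# Split on a comma or semicolon (with optional spaces around it),
# or on a run of whitespace.
parts = re.split(r'\s*[,;]\s*|\s+', raw)
print(parts)  # ['apple', 'banana', 'cherry', 'orange']
```

Unlike str.split, re.split accepts a pattern, so several different separators can be handled in one call.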
Regular expression key patterns
- .: Any single character (except a newline, by default)
- ^: Start of string
- $: End of string
- \d: A single digit
- \w: A letter, digit, or underscore
- +: Repeat 1 or more times
- *: Repeat 0 or more times
- []: Character set
- |: OR condition
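To make the list above concrete, here is a small sketch that tries each pattern with re.findall. The sample strings are made up for illustration; only the patterns themselves come from the list.

```python
import re

sample = "cat_9 bat"  # hypothetical sample text

print(re.findall(r'.at', sample))      # . any character      -> ['cat', 'bat']
print(re.findall(r'^\w+', sample))     # ^ start of string    -> ['cat_9']
print(re.findall(r'\w+$', sample))     # $ end of string      -> ['bat']
print(re.findall(r'\d', sample))       # \d a digit           -> ['9']
print(re.findall(r'[bc]at', sample))   # [] character set     -> ['cat', 'bat']
print(re.findall(r'cat|bat', sample))  # | OR                 -> ['cat', 'bat']

repeats = "bt bat baat"  # hypothetical text for the repetition operators
print(re.findall(r'ba*t', repeats))    # * zero or more 'a'   -> ['bt', 'bat', 'baat']
print(re.findall(r'ba+t', repeats))    # + one or more 'a'    -> ['bat', 'baat']
```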
Even if you’re familiar with regular expression patterns, creating new ones from scratch can be quite challenging.
That’s why I usually refer to existing patterns, modify them as needed, and use them instead of starting from scratch. It’s more efficient and avoids the hassle of debugging errors.
The example code below uses a card number, since its format is the easiest to recognize at a glance.
Regular expression Code
import re
# Matching
pattern = r'[34569]\d{3}-\d{4}-\d{4}-\d{4}'
text = "My card number is 9923-2341-2354-2385."
match = re.search(pattern, text)
if match:
    print("Match found:", match.group()) # Match found: 9923-2341-2354-2385
else:
    print("No match found")
pattern = r'\d+' # Match runs of digits only
matches = re.findall(pattern, text)
print("Matches found:", matches) # Matches found: ['9923', '2341', '2354', '2385']
# replace (this is masking)
new_text = re.sub(pattern, '****', text)
print("Replaced text:", new_text)
# Replaced text: My card number is ****-****-****-****.
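# group: compare match, search, findall, and fullmatch
print(re.match('(23)', text)) # None - match only looks at the start of the string
print(re.search('(23)', text)) # <re.Match object; span=(20, 22), match='23'>
print(re.findall('(23)', text)) # ['23', '23', '23', '23']
print(re.fullmatch('(23)', text)) # None - fullmatch needs the whole string to match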
# capture
pattern = r'(\d\d\d\d)' # raw string, so \d is not treated as a string escape
match = re.findall(pattern, text)
if match:
    print(f"{match[0]}-{match[1]}-{match[2]}-{match[3]}") # 9923-2341-2354-2385
else:
    print("No match found")
# compile
pattern = re.compile(r'\b\w{2}\b') # Words of exactly two characters
matches = pattern.findall(text)
print("2-letter words:", matches) # ['My', 'is']
# Start pattern
pattern_start = r'^My'
if re.match(pattern_start, text):
    print("The string starts with 'My'") # The string starts with 'My'
else:
    print("The string does not start with 'My'")
# End pattern
pattern_end = r'\.$' # Escape the dot; an unescaped . would match any final character
if re.search(pattern_end, text):
    print("The string ends with '.'") # The string ends with '.'
else:
    print("The string does not end with '.'")
# Multiple lines
pattern = r'^\w+' # First word of each line
text = """first line
 second line
third line"""
matches = re.findall(pattern, text, re.MULTILINE)
print("Words at the start of each line:", matches) # ['first', 'third'] (' second line' starts with a space, so \w+ does not match there)
# Find non-empty lines
pattern = r'^.+$'
text = """first line
 second line
third line
"""
matches = re.findall(pattern, text, re.MULTILINE)
print("Non-empty lines:", matches) # ['first line', ' second line', 'third line']
re.compile is also used for speed when the same pattern is searched repeatedly.
Output
Match found: 9923-2341-2354-2385
Matches found: ['9923', '2341', '2354', '2385']
Replaced text: My card number is ****-****-****-****.
None
<re.Match object; span=(20, 22), match='23'>
['23', '23', '23', '23']
None
9923-2341-2354-2385
2-letter words: ['My', 'is']
The string starts with 'My'
The string ends with '.'
Words at the start of each line: ['first', 'third']
Non-empty lines: ['first line', ' second line', 'third line']
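The remark about compiling for speed can be sketched like this: compile the pattern once, then reuse the pattern object inside a loop. The log lines and the name card_re are made up for illustration. Note that the re module already keeps an internal cache of recently used patterns, so the gain is usually modest and depends on the workload.

```python
import re

# Hypothetical log lines; only some contain a card-like number.
lines = [
    "order 1: 4111-1111-1111-1111",
    "order 2: no card on file",
    "order 3: 5500-0000-0000-0004",
]

# Compile once, outside the loop, then reuse the pattern object.
card_re = re.compile(r'[34569]\d{3}-\d{4}-\d{4}-\d{4}')

found = []
for line in lines:
    m = card_re.search(line)
    if m:
        found.append(m.group())

print(found)  # ['4111-1111-1111-1111', '5500-0000-0000-0004']
```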
Nowadays, personal information is protected by strict laws, so companies pay close attention to it. However, many companies don’t know exactly where personal data resides within their services, leading to the development of tools that search for such information internally.
From what I understand, these tools use patterns and machine learning to identify personal data, as the information can’t be easily searched or understood otherwise. However, this process can be time-consuming. Wouldn’t it be better to block such data at the input stage with a lightweight program?
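As a rough idea of what such a lightweight input-stage filter might look like, here is a minimal sketch that masks card-like numbers before the text is stored. The function name mask_cards, the assumed 4-4-4-4 digit format, and the choice to keep the last four digits are all assumptions for illustration. A real filter would need to cover more formats and more kinds of personal data.

```python
import re

# Assumed format: 16 digits in four groups separated by hyphens or spaces.
CARD_RE = re.compile(r'\b(\d{4})[- ](\d{4})[- ](\d{4})[- ](\d{4})\b')

def mask_cards(user_input: str) -> str:
    """Mask all but the last four digits of anything that looks like a card number."""
    # \4 re-inserts the fourth captured group (the last four digits).
    return CARD_RE.sub(r'****-****-****-\4', user_input)

print(mask_cards("Pay with 9923-2341-2354-2385 please"))
# Pay with ****-****-****-2385 please
```

Running the input through a function like this before it reaches a database or log file means the full number is never stored in the first place.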
If you’re a CISO (Chief Information Security Officer), this is a critical issue to consider. That said, implementing strong policies often reduces productivity. Personally, I believe productivity and security should coexist in a well-separated, secure environment—though this approach can be expensive.
Anyway, since I’m not a CISO, I’ll leave it at that.