When working with natural language or code, we often face the need to compare two pieces of text. Whether it’s to check changes in a document, filter out similar news articles, or detect near-duplicate commands, Python’s built-in difflib module is a practical solution.
In this post, we’ll explore how to use Python’s difflib module to measure similarity, compare strings and files, and find close matches. These tools are especially useful in 2025, when automation and content filtering are central to data processing.
What is Python difflib?
The Python difflib module is part of Python’s standard library. It provides tools to:
- Compare sequences (strings, lists, files)
- Highlight differences
- Calculate similarity ratios
- Generate human-readable or HTML diffs
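Its core class is SequenceMatcher, which scores similarity with an algorithm similar to Ratcliff/Obershelp “gestalt pattern matching”: it finds the longest contiguous matching block, then repeats the search on the pieces to the left and right of it. This is a different measure from Levenshtein edit distance.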
1. Comparing Strings with ndiff
import difflib
text1 = "Hello, world!"
text2 = "Hello, Python world!"
diff = difflib.ndiff(text1, text2)
print(''.join(diff))
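Output (whitespace condensed):
H e l l o , + P+ y+ t+ h+ o+ n+ w o r l d !
Because ndiff was given two strings, it compares them character by character. Entries prefixed with + are additions, entries prefixed with - are deletions, and everything else is unchanged.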
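ndiff is designed for sequences of lines, so in practice it is usually given lists of strings rather than two raw strings. A minimal sketch, assuming two hypothetical two-line documents (the names old and new are illustrative only):

```python
import difflib

# Hypothetical documents; keepends=True keeps the trailing newlines
# that ndiff expects on each line.
old = "Hello, world!\nGoodbye.\n".splitlines(keepends=True)
new = "Hello, Python world!\nGoodbye.\n".splitlines(keepends=True)

# Each item is one line prefixed with "- ", "+ ", "? " or two spaces.
line_diff = list(difflib.ndiff(old, new))
print("".join(line_diff), end="")
```

Lines starting with "? " are hints that point at the changed characters within a line; they are not part of either input.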
2. Similarity Ratio with SequenceMatcher
similarity = difflib.SequenceMatcher(None, text1, text2).ratio()
print(f"Similarity: {similarity:.2f}") # e.g. 0.79
This returns a float between 0 and 1; the closer to 1, the more similar the two sequences.
Rule of thumb (rough guides, not hard limits):
- > 0.8: almost identical
- 0.5 to 0.8: moderately similar
- < 0.5: weak similarity
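To see where the number comes from, the ratio can be recomputed by hand. The sketch below assumes the same two strings as above and uses the documented definition 2*M/T, where M is the number of matched elements and T is the combined length of both sequences:

```python
import difflib

text1 = "Hello, world!"
text2 = "Hello, Python world!"

matcher = difflib.SequenceMatcher(None, text1, text2)

# Each block is (start in text1, start in text2, size);
# the last block is always a zero-length sentinel.
blocks = matcher.get_matching_blocks()
matched = sum(block.size for block in blocks)

# ratio() is 2*M/T: twice the matched characters over the total length.
manual_ratio = 2 * matched / (len(text1) + len(text2))

print(matched)                 # 13
print(round(manual_ratio, 2))  # 0.79
```

Here every character of text1 is matched, yet the score is below 1 because the inserted "Python " counts toward the total length.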
3. Finding Close Matches in Words
words = ["apple", "orange", "banana", "grape", "pineapple"]
target = "appel"
close_matches = difflib.get_close_matches(target, words)
print(f"Close matches for '{target}': {close_matches}")
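Output:
Close matches for 'appel': ['apple', 'grape']
By default, get_close_matches returns at most three candidates (n=3) whose ratio is at least 0.6 (cutoff=0.6), best first. Here 'pineapple' scores about 0.57 and is dropped, while 'grape' lands exactly on the 0.6 threshold. This is helpful for typo correction, search suggestions, or data cleansing.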
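The n and cutoff parameters control how many candidates come back and how strict the match must be. A small sketch reusing the word list above; the two cutoff values are arbitrary choices for illustration:

```python
import difflib

words = ["apple", "orange", "banana", "grape", "pineapple"]
target = "appel"

# Stricter cutoff: only strong matches survive.
strict = difflib.get_close_matches(target, words, n=3, cutoff=0.7)

# Looser cutoff: weaker candidates are admitted as well.
loose = difflib.get_close_matches(target, words, n=3, cutoff=0.5)

print(strict)  # ['apple']
print(loose)
```

Raising the cutoff reduces false suggestions at the cost of missing heavier typos, so the right value depends on the data.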
4. Comparing Sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A slow white dog runs the cat.",
    "The quick red fox jumps over the sleepy cat."
]
target_sentence = "The quick brown fox jumps over a lazy dog."
close_matches = difflib.get_close_matches(target_sentence, sentences)
print(close_matches)
Output:
['The quick brown fox jumps over the lazy dog.',
'The quick red fox jumps over the sleepy cat.']
Note that this is character-level similarity, not semantic similarity: it works well when sentences differ by only a few words, but a paraphrase that uses different wording will score low.
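get_close_matches hides the scores it ranks by. To inspect them, each candidate can be scored directly with SequenceMatcher. A minimal sketch using the same sentences (the scores dictionary is an illustrative name):

```python
import difflib

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A slow white dog runs the cat.",
    "The quick red fox jumps over the sleepy cat.",
]
target_sentence = "The quick brown fox jumps over a lazy dog."

# Score every candidate so the full ranking is visible, not just the survivors.
scores = {
    s: difflib.SequenceMatcher(None, s, target_sentence).ratio()
    for s in sentences
}

for sentence, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.2f}  {sentence}")
```

The candidate is passed as the first sequence, which mirrors how get_close_matches compares; swapping the arguments can change the ratio slightly.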
5. Comparing Files with HtmlDiff
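import difflib

# Read both files as lists of lines
with open('file1.txt', 'r', encoding='utf-8') as file1, \
        open('file2.txt', 'r', encoding='utf-8') as file2:
    lines1 = file1.readlines()
    lines2 = file2.readlines()

# Build a complete HTML page containing a side-by-side comparison table
html_diff = difflib.HtmlDiff().make_file(lines1, lines2,
                                         fromdesc='File 1', todesc='File 2')
with open('diff.html', 'w', encoding='utf-8') as output:
    output.write(html_diff)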
Use Case:
Generate visual HTML diffs of config files, logs, or reports.
Example Files
file1.txt
Hello, world!
file2.txt
Hello, Python world!
Open diff.html in your browser to view a formatted comparison.
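When a browser is not convenient, for example in a terminal or a CI log, the same line lists can be rendered as a plain-text diff with difflib.unified_diff. A minimal sketch, with the file contents inlined as stand-ins for file1.txt and file2.txt:

```python
import difflib

# Inlined stand-ins for the contents of file1.txt and file2.txt
lines1 = ["Hello, world!\n"]
lines2 = ["Hello, Python world!\n"]

diff_lines = list(
    difflib.unified_diff(lines1, lines2, fromfile="file1.txt", tofile="file2.txt")
)
print("".join(diff_lines), end="")
```

The result uses the same "---", "+++" and "@@" layout as the diff -u command, so it can be read by tools that understand that format.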
Real-World Applications of Similarity Verification
News Deduplication
When collecting news data, filtering out near-duplicate articles is essential. difflib.get_close_matches() can help remove redundancies before analysis.
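One possible shape for such a filter is sketched below. The deduplicate function, the 0.85 threshold, and the sample headlines are all hypothetical; it compares each new item against every item already kept, which is quadratic and therefore only suited to modest batch sizes:

```python
import difflib

def deduplicate(headlines, threshold=0.85):
    """Keep the first of each run of near-identical strings (O(n^2) comparisons)."""
    kept = []
    for headline in headlines:
        is_duplicate = any(
            difflib.SequenceMatcher(None, headline.lower(), seen.lower()).ratio()
            >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(headline)
    return kept

# Made-up sample data for illustration
headlines = [
    "Central bank raises interest rates by 0.25 points",
    "Central bank raises interest rates by 0.25 points.",
    "Local team wins championship after penalty shootout",
]
unique = deduplicate(headlines)
print(unique)
```

The threshold would need tuning against real articles; too low and distinct stories merge, too high and trivial rewrites slip through.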
Natural Language Input Correction
In chatbots or user-facing tools, comparing input against a list of known commands lets you suggest a correction when the user mistypes one:
commands = ["search", "exit", "reload"]
user_input = "serach"
suggestion = difflib.get_close_matches(user_input, commands, n=1)
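Since get_close_matches returns a list that may be empty, it helps to wrap the lookup in a small function. A sketch under that assumption; suggest_command is a hypothetical helper name, and the normalization step (strip and lowercase) is an added design choice:

```python
import difflib

commands = ["search", "exit", "reload"]

def suggest_command(user_input, known=commands, cutoff=0.6):
    """Return the closest known command, or None if nothing is close enough."""
    normalized = user_input.strip().lower()
    matches = difflib.get_close_matches(normalized, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_command("serach"))  # search
print(suggest_command("xyz"))     # None
```

Returning None rather than a weak guess lets the caller decide whether to show a "did you mean" prompt or a generic error.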
Comparing Database Queries
Analyzing log traces for SQL queries or CLI commands becomes easier when similar lines are grouped.
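A simple way to do that grouping is a greedy pass in which each line joins the first group whose first member is similar enough. The sketch below is one possible approach, not a tuned clustering method; group_similar, the 0.8 threshold, and the sample queries are hypothetical:

```python
import difflib

def group_similar(lines, threshold=0.8):
    """Greedy grouping: a line joins the first group whose representative is close."""
    groups = []  # groups[i][0] acts as the representative of group i
    for line in lines:
        for group in groups:
            if difflib.SequenceMatcher(None, group[0], line).ratio() >= threshold:
                group.append(line)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([line])
    return groups

# Made-up sample data for illustration
queries = [
    "SELECT * FROM users WHERE id = 1",
    "SELECT * FROM users WHERE id = 42",
    "INSERT INTO audit_log (event) VALUES ('login')",
    "SELECT * FROM users WHERE id = 7",
]
groups = group_similar(queries)
for group in groups:
    print(len(group), group[0])
```

The result depends on input order because only the first member of each group is compared, which keeps the code short at the cost of some accuracy.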
A Note on Vector-Based Alternatives
While Python difflib is great for simple pattern matching, in complex NLP tasks, methods like:
- Cosine Similarity (TF-IDF or BERT embeddings)
- Manhattan/Euclidean Distance in vector space
- Jaccard Similarity
…often yield better results. However, difflib remains ideal when:
- You need a lightweight, no-dependency solution
- You’re comparing plain text or configuration files
- You want human-readable diffs
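To make the contrast concrete, here is a small sketch comparing difflib with a word-set Jaccard similarity written in plain Python. The jaccard helper and the two sample phrases are illustrative assumptions, not part of difflib:

```python
import difflib

def jaccard(a, b):
    """Jaccard similarity of the two texts' lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

s1 = "restart the web server"
s2 = "the web server restart"

word_score = jaccard(s1, s2)
char_score = difflib.SequenceMatcher(None, s1, s2).ratio()

print(word_score)  # 1.0
print(round(char_score, 2))
```

Jaccard ignores word order, so the reordered phrase scores a perfect match, while SequenceMatcher penalizes the moved word. Which behavior is preferable depends on whether order carries meaning in the data.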
Conclusion
In 2025, as developers deal with more text data than ever, tools like difflib are well worth knowing. It’s built-in, dependency-free, and surprisingly powerful for comparing strings, generating diffs, and verifying similarity.
Whether you’re building an editor, log analyzer, or a simple content checker, difflib offers the flexibility and simplicity you need to get the job done efficiently.