difflib (see the Python documentation) is a module in the Python standard library. It compares strings or other sequences, reports the differences between them, and measures how similar they are.

Its main uses are comparing text, comparing files, and measuring similarity. This post focuses on measuring similarity.

I think functions like ndiff might be hard to grasp if you’re learning by following lectures in order.

Since it’s a built-in function, you can look it up when you need it. Personally, I learned it by working through the documentation step by step, which is why I’m familiar with it and use it.

Similarity Check Code

import difflib

text1 = "Hello, world!"
text2 = "Hello, Python world!"

diff = difflib.ndiff(text1, text2)
print(''.join(diff))
# H  e  l  l  o  ,   + P+ y+ t+ h+ o+ n+    w  o  r  l  d  !

# Calculating similarity ratios with SequenceMatcher
similarity = difflib.SequenceMatcher(None, text1, text2).ratio()
print(f"Similarity: {similarity:.2f}")  # Similarity: 0.79

# Find close matches for a misspelled word
words = ["apple", "orange", "banana", "grape", "pineapple"]
target = "appel"

# By default, returns up to 3 candidates whose ratio is at least 0.6
close_matches = difflib.get_close_matches(target, words)
print(f"Close matches for '{target}': {close_matches}")
# Close matches for 'appel': ['apple', 'grape']

# Candidate sentences and the target sentence to compare against them
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A slow white dog runs the cat.",
    "The quick red fox jumps over the sleepy cat."
]
target_sentence = "The quick brown fox jumps over a lazy dog."

close_matches = difflib.get_close_matches(target_sentence, sentences)
print(f"Close matches for '{target_sentence}': {close_matches}")
# Close matches for 'The quick brown fox jumps over a lazy dog.':
# ['The quick brown fox jumps over the lazy dog.',
# 'The quick red fox jumps over the sleepy cat.']

# Read both files into lists of lines
with open('file1.txt', 'r', encoding='utf-8') as file1, open('file2.txt', 'r', encoding='utf-8') as file2:
    file1_lines = file1.readlines()
    file2_lines = file2.readlines()

# Build a side-by-side HTML comparison from the two line lists
diff = difflib.HtmlDiff().make_file(file1_lines, file2_lines,
                                    fromdesc='File 1', todesc='File 2')

# Save the HTML report (make_file declares utf-8 by default)
with open('diff.html', 'w', encoding='utf-8') as result_file:
    result_file.write(diff)

Output

  H  e  l  l  o  ,   + P+ y+ t+ h+ o+ n+    w  o  r  l  d  !
Similarity: 0.79
Close matches for 'appel': ['apple', 'grape']
Close matches for 'The quick brown fox jumps over a lazy dog.': ['The quick brown fox jumps over the lazy dog.', 'The quick red fox jumps over the sleepy cat.']
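
The first line of the output is easier to read once you know that every item ndiff yields starts with a two-character code: two spaces for an element present in both inputs, "+ " for one found only in the second input, and "- " for one found only in the first. As a small sketch (the variable names are my own, for illustration), the items can be grouped by that code:

```python
import difflib

text1 = "Hello, world!"
text2 = "Hello, Python world!"

# Passing plain strings makes ndiff compare them character by character.
items = list(difflib.ndiff(text1, text2))

# Group the characters by the two-character prefix ndiff attaches to each item.
unchanged = ''.join(item[2:] for item in items if item.startswith('  '))
added = ''.join(item[2:] for item in items if item.startswith('+ '))
removed = ''.join(item[2:] for item in items if item.startswith('- '))

print(f"Unchanged: {unchanged!r}")
print(f"Added: {added!r}")
print(f"Removed: {removed!r}")
```

In this example the only added characters are "Python " and nothing is removed.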

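The similarity algorithm in difflib is sometimes said to resemble Levenshtein distance, but it is a different measure. According to the Python documentation, SequenceMatcher uses an approach similar to Ratcliff and Obershelp’s “gestalt pattern matching”: it finds the longest contiguous matching block, then repeats the search to the left and right of that block. ratio() then returns 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. I don’t think it’s appropriate to explain the algorithm in more detail here, and I’d rather refer you to a more academic post for that sort of thing.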
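
To make the score less of a black box, here is a minimal sketch that recomputes it by hand. It relies on the documented definition of ratio() as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. The variable names are my own.

```python
import difflib

text1 = "Hello, world!"
text2 = "Hello, Python world!"

matcher = difflib.SequenceMatcher(None, text1, text2)

# Each matching block is a run of characters shared by both strings;
# the final block is a zero-length sentinel, so it adds nothing to the sum.
blocks = matcher.get_matching_blocks()
matched = sum(block.size for block in blocks)

# T is the total number of characters in both strings.
total = len(text1) + len(text2)

manual_ratio = 2.0 * matched / total
print(f"M = {matched}, T = {total}")
print(f"Manual: {manual_ratio:.2f}, ratio(): {matcher.ratio():.2f}")
```

All 13 characters of the first string are matched and the two strings contain 33 characters in total, so the score is 26/33, which rounds to 0.79.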

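For now, I’ll just note that the example produced a similarity of 0.79. The documentation’s rule of thumb is that a ratio above 0.6 means the sequences are close matches, and in my experience even a score of around 0.5 usually means the two texts cover pretty much the same ground.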
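
The threshold is adjustable. As a rough sketch, get_close_matches accepts a cutoff argument (default 0.6) and an n argument (default 3), so you can see how the word example changes when the bar is lowered or raised. The values 0.5 and 0.7 are just illustrative choices.

```python
import difflib

words = ["apple", "orange", "banana", "grape", "pineapple"]
target = "appel"

# Default cutoff is 0.6: only candidates scoring at least 0.6 are returned.
default_matches = difflib.get_close_matches(target, words)

# A lower cutoff admits weaker matches; a higher one keeps only the strongest.
loose_matches = difflib.get_close_matches(target, words, n=3, cutoff=0.5)
strict_matches = difflib.get_close_matches(target, words, n=3, cutoff=0.7)

print(f"cutoff=0.6: {default_matches}")
print(f"cutoff=0.5: {loose_matches}")
print(f"cutoff=0.7: {strict_matches}")
```

Results are ordered from the highest score to the lowest, so the best candidate always comes first.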

This reminds me of trying to classify news articles by looking for words that signal positive or negative sentiment. This function seems to capture some of those subtleties.

file1.txt

Hello, world!

file2.txt

Hello, Python world!
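
If you want a plain-text view of the same file comparison rather than an HTML page, difflib.unified_diff produces patch-style output. The sketch below keeps the two file contents in in-memory lists instead of reading file1.txt and file2.txt from disk, which is my own simplification.

```python
import difflib

# Stand-ins for the lines that readlines() would return from each file.
file1_lines = ["Hello, world!\n"]
file2_lines = ["Hello, Python world!\n"]

# unified_diff yields header lines, a hunk marker, then -/+ lines.
diff_lines = list(difflib.unified_diff(
    file1_lines, file2_lines,
    fromfile="file1.txt", tofile="file2.txt",
))

print(''.join(diff_lines), end='')
```

Lines starting with "-" come from the first file and lines starting with "+" come from the second.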

I became interested in similarity checking while collecting news and running into some challenges.

To filter out unnecessary information, I experimented with techniques like cosine similarity and Manhattan distance. However, I wasn’t aware that Python’s standard library already included a module designed for this kind of comparison.

Recently, I decided that summarizing with an LLM would suffice, so that’s what I’m using now.

For reference, when analyzing server commands or database queries, comparing similarity by mapping their words into a vector space often yields better results than using an LLM.
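
As a rough illustration of the vector-space idea, the sketch below turns each query into a bag-of-words count vector and compares the vectors with cosine similarity. The whitespace tokenization, the cosine_similarity helper, and the sample queries are all assumptions of mine, not a description of any particular production setup.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-tokenized word-count vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())

    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


# Hypothetical queries that differ only in the final literal.
query1 = "SELECT name FROM users WHERE id = 1"
query2 = "SELECT name FROM users WHERE id = 2"
query3 = "DROP TABLE orders"

similar_score = cosine_similarity(query1, query2)
different_score = cosine_similarity(query1, query3)
print(f"Similar queries: {similar_score:.3f}")
print(f"Unrelated queries: {different_score:.3f}")
```

Unlike SequenceMatcher, this measure ignores word order entirely, which can be an advantage for commands whose arguments appear in varying positions.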

By Mark
