
Matching strings is easy, right up until it isn’t. Most programming languages give you a double-equals operator or maybe an eq method; we will focus on Ruby here. Put in two strings and you get back true or false.
irb(main):001> 'easy' == 'easy'
=> true
This works great on two strings carefully typed into the REPL!
What happens when we start getting strings from other sources? We might need to pull in strings from a command line interface. We might have a string from a form we’ve built in our web app. We lump many of these strings into the broad category of user input. We build validation into our forms and use trusted methods to escape strings before querying databases with them or displaying them back to the user.
These scenarios introduce a lot more opportunities for subtle differences.
irb(main):002> 'easy' == 'Easy'
=> false
Case sensitivity, line break conventions, and the optional Oxford comma can leave strings that mean the same thing failing strict equality.
Non-printing characters like the byte order mark can create strings that look identical but are not. Some editors automatically convert straight quotes into smart quotes or two hyphens into an em dash; others leave them alone.
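A couple of examples I cooked up in irb show how sneaky these lookalikes can be: a byte order mark is invisible on screen, and a smart quote is easy to miss.
irb(main):003> "\u{FEFF}easy" == 'easy'
=> false
irb(main):004> 'it’s' == "it's"
=> false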
We might have strings parsed from a cell in an Excel spreadsheet. We might even get our strings from the mandatory registration questions returned by an API associated with a webinar. In these cases the user has moved on since they typed their input. We no longer have control of the feedback cycle to ask for corrections.
“Close enough only counts in horseshoes and hand grenades” is a saying I heard often growing up. Most of the time it was quoted at me, it meant I needed to try harder or be more exact. With strings, we have to try harder to be less exact.
The cool thing to do right now is to send your strings off to a Large Language Model (LLM) and have it tell you whether they are the same. There is even a gem for it. But using billions or trillions of parameters is throwing an awful lot of computing power at a relatively mundane task, and the cost in time and tokens only adds up the more matches you need to make.
Where the GPTs really shine is in exploring possible solutions. Without even knowing the correct terminology, I was able to ask Google Gemini and discover two broad categories: Soundex matching and edit distance algorithms.
Soundex matching translates your strings into a series of phonetic codes and then compares the results. It is a really interesting approach, but there doesn’t seem to be a commonly used Ruby library.
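The idea itself is simple enough to sketch by hand, though. Here is a rough, toy version of American Soundex I wrote purely to make the approach concrete; it is not a library and skips plenty of edge cases:

SOUNDEX_CODES = {
  'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
  'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
  'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
  'd' => '3', 't' => '3',
  'l' => '4',
  'm' => '5', 'n' => '5',
  'r' => '6'
}.freeze

def soundex(word)
  letters = word.downcase.scan(/[a-z]/)
  return '' if letters.empty?

  first = letters.first.upcase
  previous_code = SOUNDEX_CODES[letters.first]
  digits = []

  letters.drop(1).each do |letter|
    code = SOUNDEX_CODES[letter]
    if code.nil?
      # Vowels reset the previous code; 'h' and 'w' are simply skipped.
      previous_code = nil unless %w[h w].include?(letter)
      next
    end
    digits << code unless code == previous_code
    previous_code = code
    break if digits.length == 3
  end

  (first + digits.join).ljust(4, '0')
end

soundex('Robert')   # => "R163"
soundex('Rupert')   # => "R163"
soundex('Ashcraft') # => "A261"

Names that sound alike collapse to the same code, which is the whole appeal for fuzzy matching.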
Edit distance algorithms step through both strings and count how many characters would need to change to make them match. Hamming distance requires strings of equal length. Levenshtein distance gives you that count of edits including insertions and deletions.
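To make those two concrete, here is a rough, hand-rolled sketch of both distances. It is my own toy code; a real project would likely reach for a gem instead:

def hamming_distance(a, b)
  # Hamming distance is only defined for strings of equal length.
  raise ArgumentError, 'strings must be the same length' unless a.length == b.length

  a.chars.zip(b.chars).count { |x, y| x != y }
end

def levenshtein_distance(a, b)
  # Classic dynamic programming over insertions, deletions, and substitutions.
  rows = Array.new(a.length + 1) { |i| [i] + Array.new(b.length, 0) }
  (1..b.length).each { |j| rows[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [
        rows[i - 1][j] + 1,       # deletion
        rows[i][j - 1] + 1,       # insertion
        rows[i - 1][j - 1] + cost # substitution
      ].min
    end
  end

  rows[a.length][b.length]
end

hamming_distance('easy', 'Easy')          # => 1
levenshtein_distance('kitten', 'sitting') # => 3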
Jaro-Winkler is my favorite of the distance algorithms. Instead of a raw edit count, its result is a number between 0 and 1. Having a result in a defined range means we can do things like pick a threshold for a good-enough match. With a well-chosen threshold we can stop searching potential matches early or determine that none of the potential matches are close enough.
Best of all, Ruby’s gem command already uses this algorithm to provide “Did you mean?” suggestions on the command line, so we can use it with any normal Ruby installation.
irb(main):007> good_enough = 0.85
=> 0.85
irb(main):008> DidYouMean::JaroWinkler.distance('Typos, case, and oxford commas', 'Typps, Case and oxford commas') > good_enough
=> true
irb(main):009> DidYouMean::JaroWinkler.distance('dog', 'cat') > good_enough
=> false
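Building on that threshold, a small helper can pick the best candidate out of a list of known strings, or give up when nothing is close enough. This is only a sketch; the helper name, the sample questions, and the 0.85 threshold are all made up for illustration:

require 'did_you_mean' # ships with Ruby, no extra gem needed

# Return the candidate closest to the input, or nil if nothing clears the threshold.
def closest_match(input, candidates, threshold: 0.85)
  scored = candidates.map { |candidate| [candidate, DidYouMean::JaroWinkler.distance(input, candidate)] }
  best, score = scored.max_by { |_, distance| distance }
  score && score >= threshold ? best : nil
end

known_questions = [
  'What is your job title?',
  'How did you hear about this webinar?'
]

closest_match('What is your job titel', known_questions)
# => "What is your job title?"
closest_match('Favorite color?', known_questions)
# => nil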
With Jaro-Winkler I was able to find close-enough matches while cross-referencing different spreadsheets. I was also able to match questionnaire questions and responses between a web application and the responses returned by GoTo Webinar’s API. That is close enough for me.
Loved the article? Hated it? Didn’t even read it?
We’d love to hear from you.