It has often been the case in the course of my assignments that I’ve had to do quite some data prep in Excel before I could have a clean dataset to crunch.
If you also need to do this in Excel, below are 10 functions that I have found indispensable.
The datasets I’m going to use to exemplify them are:
2. Game Tweets – a dataset we created for a uni project where we had to monitor the live tweets related to an event of our choice.
So let’s crack on with the functions:
1. left(), right(), mid()
I bet you’ve already heard of these 3 – after all they’re the only way you could extract a substring from another in Excel, which unlike SQL does not support functions like substring()
Left() and Right() are pretty straightforward – they get you the part of the string, which is a specified number of characters from its beginning or end.
This means that they can only be used in their stand-alone form if the part of the string that needs extraction does not vary in length.
For instance, right() works like a charm in the event that you need to extract the unique 8-digit ID associated with a person from a string that starts with his/her name and ends with their ID
What’s a bit trickier is when you need to extract the first part of a string that varies in length (the name in our case). It is in such instances that just left() will not do the trick because of the variation in the number of characters.
A handy resolution in such an instance would be to use a combination of left() and len()
Len() gives you the length of a string including trailing blanks. Its counterpart in SQL Server is datalength()
Mid() on the other hand is a bit more peculiar in the sense that:
• It is the only function out of the three that requires 3 arguments – mid (string, starting position, number of characters)
• It is not widely supported in databases. Neither SQL Server, PostgreSQL, Vertica or Teradata support it. The only instance I am aware of is MySQL
Nevertheless, it’s an absolutely smashing function if the string that needs trimming always:
• starts with the same number of characters
• has a fixed length
2. upper(), lower(), proper()
Although rare, there’s still a good chance that you may end up with a dataset full of strings spelt entirely in capitals.
Not only is this ugly for the eye, but it could also prove problematic if you need to match the dataset to another as lookups wouldn’t recognize the match.
That’s where lower() and upper() come to aid.
Going back to our example of using left() and len() to extract the family name of the interviewer from a string – the underlying reason for doing so, was so that we could show the supervisors a list of the people reporting to them in a nice presentable format.
And that’s exactly where proper() does its magic
3. find(), search()
Both find() and search() find a string within another by identifying its first position. In essence, they’re very similar to instr() in MySQL.
Unlike instr() and its SQL Server and PostgreSQL counterparts however, find() and search() will both return a #VALUE error if a substring matching the pattern was not found.
Hence it’s best that their usage in Excel be always accompanied with iferror()
Did you notice the difference between the results that find() and search() returned in row 1?
This is because unlike find(), search() is case-insensitive
Another big difference between the two is that search() will allow the usage of wild characters
In a nutshell:
|case sensitive||case insensitive|
|does not allow the usage of wild characters||allows usage of wild characters|
4. substitute(), replace()
Both functions essentially replace part of a string with another. Yet, because of their inherent differences in syntax, it is not always that can they be used interchangeably. As you’ll see below replace() requires you to specify the starting position of the string that you want replaced as well as the number of characters that need replacing.
|REPLACE(old_text, start_num, num_chars, new_text)||SUBSTITUTE(text, old_text, new_text, [instance_num])|
The difference in syntax makes replace() a bit trickier to use if:
• the string to be replaced does not always start at the same position
• or varies in length
Replace() is still usable in such cases, but you’ll need to nest two more functions to achieve the same result as substitute()
Those of you also working with SQL – yes, it’s understandable why you might intuitively opt to use replace() in Excel. The true equivalent to the SQL function, however, is substitute()