Cleaning Up Data Using SQL

Sometimes you just need to bring a sledgehammer to drive that stake down. Using the right tool for the job can save incredible amounts of time and frustration.

For example, we have found many people enter "none" on the Allergies field. Or maybe "no known allergies" or "nka" or "nada" or "no" or any other variation thereof. In turn, this adds a value into the Allergy field, and now when their kid checks in, they have an Allergy note on their name tag. So then they come back and ask why we say they have allergies when they said they had "no food allergies" when asked. Sure, you could go and search every record and delete their errant values manually. But if you are a large organization, you'd much rather do this in seconds vs weeks, right?

Enter the power of SQL.

Disclaimer: We are using a "no going back" method here. You need to be 100% sure you know what you are doing. I will guide you and you can even run a "preview" of the changes to be made to ensure you are correct. But know there is no "undo" button. Only the final step is actually changing anything. So you can't do any damage on the first "investigative" SQL commands. I will let you know when you need to really be careful when we get to that step.

So the first thing we are doing is searching for WHAT people are entering as Allergies. I don't want to see 40,000 answers here... just the UNIQUE answers. So if 100 people put "peanut butter" and one guy put "jelly" my results will be one entry of each, not 101 entries. This will let us quickly scan the list to see which we wish to remove. In the case of Allergies, I needed to find WHICH Attribute ID is the allergy value.

1) Open "Admin Tools - General Settings" and go to "Person Attributes"

2) Find the attribute you are looking to edit here. In our case we are looking for Allergy which shows an ID of 676.

Now we are going to run a command to return a list of all UNIQUE values people have entered as Attribute 676.

3) Go to "Admin Tools - Power Tools" and pick "SQL Command"

4) Here you see a demo SQL command that is totally safe to run. It will return the first 10 people in your Database as a result. Click "Run" to see it or if you are good, highlight the text in the SQL Text box and delete it so we can put our command in instead.

5) Copy/Paste the following as your SQL to run, keeping in mind you may need to change your ID value from 676 to whatever ID you are searching for, and you can rename "Allergy Listed" to be whatever column title you want to see there:

SELECT
	COUNT ([value]) AS '# of Entries'
	, [value] AS 'Allergy Listed'
FROM
	[Attribute] a
	INNER JOIN [AttributeValue] av ON av.[AttributeID] = a.[Id]
WHERE
	a.[Id] = 676
GROUP BY
    [value]

After you run your SQL, you can now sort your results by number of matching entries, or alphabetically to see what your database holds. Then make a list of the values you want to change. In our case, we have a lot of random "no" type entries, so my list looks like this:

-
0
n/a
na
-na-
na unknown
nada
nane
nda
nine
nka
nkda
nkw
no
no allergies
no food allergies
no food allergies.
no known
no known allergies
no known allergies.
non
non that we know of
none known
none we are aware of
nonw
noone
nope
not known
nothing
on
unknown
zero

Now just for confirmation, I'm going to search for and show count of these incorrect allergy entries that should be blank. Here I've taken my list above and put it into a query that returns all person attribute 676 entries where the value matches those I have in my list. You can see I've added a few comments (notes) after the double hyphens (--)on some just for my info. You'll want to update these values between the single quotes to be your listed items you wish to remove.

6) Prepare your SQL statement to search for your values. Copy/paste the below SQL and update it as needed.

SELECT
    COUNT ([value]) AS '# of Entries'
	, [value] AS 'Allergy Listed'
FROM
	[Attribute] a
	INNER JOIN [AttributeValue] av ON av.[AttributeID] = a.[Id]
WHERE
	a.[Id] = 676
	AND
	(av.[Value] = 'none'
		OR av.[Value] = '-'
		OR av.[Value] = '0'
		OR av.[Value] = 'n/a'
		OR av.[Value] = 'na'
		OR av.[Value] = '-na-'
		OR av.[Value] = 'na unknown'
		OR av.[Value] = 'nada'
		OR av.[Value] = 'nane' --misspelling of none
		OR av.[Value] = 'nda' --no drug allergies abbreviation
		OR av.[Value] = 'nine' --misspelling of none
		OR av.[Value] = 'nka' --no known allergies abbreviation
		OR av.[Value] = 'nkda' --no known drug allergies abbreviation
		OR av.[Value] = 'nkw' --no known ... misspelling?
		OR av.[Value] = 'no'
		OR av.[Value] = 'no allergies'
		OR av.[Value] = 'no food allergies'
		OR av.[Value] = 'no food allergies.'
		OR av.[Value] = 'no known'
		OR av.[Value] = 'no known allergies'
		OR av.[Value] = 'no known allergies.'
		OR av.[Value] = 'non' --misspelling of none
		OR av.[Value] = 'non that we know of' --misspelling of none
		OR av.[Value] = 'none known'
		OR av.[Value] = 'none we are aware of'
		OR av.[Value] = 'nonw' --misspelling of none
		OR av.[Value] = 'noone' --misspelling of none
		OR av.[Value] = 'nope'
		OR av.[Value] = 'not known'
		OR av.[Value] = 'nothing'
		OR av.[Value] = 'on' --misspelling of no
		OR av.[Value] = 'unknown'
		OR av.[Value] = 'zero')
GROUP BY
    [value]

If all looks good on your result, you can now build a "replacement" SQL command to blank out those pesky entries. Note how this is flanked by "BEGIN TRANSACTION" and "ROLLBACK TRANSACTION" which will give a message result showing how many rows would be affected. As long as those are present, we will NOT be changing data and this command is safe to use.

7) Copy/paste the SQL below (or your statement containing your replacements) into your SQL field. You need to change the "Selection Query" option from "Yes" to "No" now as we are not returning rows of results, rather we want to see the SQL command results (how long it took to run, and how many rows are affected). Once you have your command in, your "Selection Query" set to "No" and you are sure you start and end with the "BEGIN TRANSACTION" and "ROLLBACK TRANSACTION" lines, you are free to hit Run!

BEGIN TRANSACTION
UPDATE
	[AttributeValue]
SET
	[Value] = ''
FROM
	[Attribute] a
	INNER JOIN [AttributeValue] av ON av.[AttributeID] = a.[Id]
WHERE
	a.[Id] = 676
	AND
	(av.[Value] = 'none'
		OR av.[Value] = '-'
		OR av.[Value] = '0'
		OR av.[Value] = 'n/a'
		OR av.[Value] = 'na'
		OR av.[Value] = '-na-'
		OR av.[Value] = 'na unknown'
		OR av.[Value] = 'nada'
		OR av.[Value] = 'nane' --misspelling of none
		OR av.[Value] = 'nda' --no drug allergies abbreviation
		OR av.[Value] = 'nine' --misspelling of none
		OR av.[Value] = 'nka' --no known allergies abbreviation
		OR av.[Value] = 'nkda' --no known drug allergies abbreviation
		OR av.[Value] = 'nkw' --no known ... misspelling?
		OR av.[Value] = 'no'
		OR av.[Value] = 'no allergies'
		OR av.[Value] = 'no food allergies'
		OR av.[Value] = 'no food allergies.'
		OR av.[Value] = 'no known'
		OR av.[Value] = 'no known allergies'
		OR av.[Value] = 'no known allergies.'
		OR av.[Value] = 'non' --misspelling of none
		OR av.[Value] = 'non that we know of' --misspelling of none
		OR av.[Value] = 'none known'
		OR av.[Value] = 'none we are aware of'
		OR av.[Value] = 'nonw' --misspelling of none
		OR av.[Value] = 'noone' --misspelling of none
		OR av.[Value] = 'nope'
		OR av.[Value] = 'not known'
		OR av.[Value] = 'nothing'
		OR av.[Value] = 'on' --misspelling of no
		OR av.[Value] = 'unknown'
		OR av.[Value] = 'zero')
ROLLBACK TRANSACTION

Wow! We have 1,466 BAD ALLERGY entries! Usually this comes from a form requiring a reply or something and people HAVE to put something in. Let's clean that up for now.

THIS IS YOUR DANGER MOMENT! YOU ARE NOW ABOUT TO CHANGE VALUES! Are you certain you have no typos in your SQL command? Are you certain all your single quotes are good (opened and closed around values)? Take time to look it over before moving forward! Do you have a backup of your database? Now may be a great time to grab one. There is NO UNDO other than restoring a backup after you run this.

8) Go ahead and remove the first and last lines (BEGIN TRANSACTION and ROLLBACK TRANSACTION) and hit Run again. You should see an identical output when it finishes with the time being slightly different, but the rows affected being the same.

9) Run it a second time without changing anything and the rows affected should now be 0 as nothing is left!

10) If you want to confirm your results, revisit this SQL from step 5 and pull a new list of unique values... (remember to change your "Selection Query" back to "Yes" first) confirming those you wished to remove are no longer present!

SELECT
	COUNT ([value]) AS '# of Entries'
	, [value] AS 'Allergy Listed'
FROM
	[Attribute] a
	INNER JOIN [AttributeValue] av ON av.[AttributeID] = a.[Id]
WHERE
	a.[Id] = 676
GROUP BY
    [value]

That's all there is to it! You've just done a MASSIVE amount of work in a few minutes. Behold, the power of SQL.