
Mekki MOURADI
Data Analyst
Rabat, Maroc
À Propos
Data Analyst Junior avec une solide expérience en marketing numérique et des compétences approfondies dans l’utilisation de Microsoft Excel, SQL, Tableau Software et Python.
Using SQL String Functions to Clean Data


Liens
Compétences & Outils
Description du projet
Apprendre à manipuler les données via les Fonctions de Chaînes de Caractères :
- LENGTH, LEFT, RIGHT
- UPPER, LOWER
- REPLACE
- TRIM, LTRIM, RTRIM
- CONCATENATION
- SUBSTRING
- STRING AGGREGATION
- COALESCE
Objectifs de l'analyse
- Utiliser les Fonctions de Chaînes de Caractères pour modifier et nettoyer les données.
- LENGTH, LEFT and RIGHT
- Find the Length of the first name of Male Employees
SELECT emp_no, first_name, LENGTH(first_name) AS characters_number
FROM employees
WHERE gender='M'
LIMIT 10;

emp_no | first_name | characters_number |
---|---|---|
10003 | Parto | 5 |
10004 | Chirstian | 9 |
10005 | Kyoichi | 7 |
10008 | Saniya | 6 |
10012 | Patricio | 8 |
10013 | Eberhardt | 9 |
10014 | Berni | 5 |
10015 | Guoxiang | 8 |
10016 | Kazuhito | 8 |
10019 | Lillian | 7 |
- Find the length of the first name of male employees where the length of the first name is greater than 5
SELECT emp_no, LENGTH(first_name) AS characters_number
FROM employees
WHERE gender='M' AND LENGTH(first_name) > 5
LIMIT 10;

emp_no | characters_number |
---|---|
10004 | 9 |
10005 | 7 |
10008 | 6 |
10012 | 8 |
10013 | 9 |
10015 | 8 |
10016 | 8 |
10019 | 7 |
10020 | 6 |
10022 | 6 |
- Retrieve a list of the customer group of all customers
SELECT customer_id, LEFT(customer_id, 2) AS customer_group
FROM customers
LIMIT 10;

customer_id | customer_group |
---|---|
CG-12520 | CG |
DV-13045 | DV |
SO-20335 | SO |
BH-11710 | BH |
AA-10480 | AA |
IM-15070 | IM |
HP-14815 | HP |
PK-19075 | PK |
AG-10270 | AG |
ZD-21925 | ZD |
- Retrieve a lsit of the customer number of all customers
SELECT customer_id, RIGHT(customer_id, 5) AS customer_number
FROM customers
LIMIT 10;

customer_id | customer_number |
---|---|
CG-12520 | 12520 |
DV-13045 | 13045 |
SO-20335 | 20335 |
BH-11710 | 11710 |
AA-10480 | 10480 |
IM-15070 | 15070 |
HP-14815 | 14815 |
PK-19075 | 19075 |
AG-10270 | 10270 |
ZD-21925 | 21925 |
- Retrieve a lsit of the customer number of all customers using length
SELECT customer_id, RIGHT(customer_id, LENGTH(customer_id)-3) AS customer_number
FROM customers
LIMIT 10;

customer_id | customer_number |
---|---|
CG-12520 | 12520 |
DV-13045 | 13045 |
SO-20335 | 20335 |
BH-11710 | 11710 |
AA-10480 | 10480 |
IM-15070 | 15070 |
HP-14815 | 14815 |
PK-19075 | 19075 |
AG-10270 | 10270 |
ZD-21925 | 21925 |
- UPPER and LOWER
- Change the first name of the first employee to uppercase letters
SELECT emp_no, first_name
FROM employees
LIMIT 1;

emp_no | first_name |
---|---|
10001 | Georgi |
-- start a transaction
BEGIN
-- updating to upper case
UPDATE employees
SET first_name = UPPER(first_name)
WHERE emp_no='10001'
-- confirming edits
SELECT emp_no, first_name
FROM employees
WHERE emp_no='10001'
-- ROLLBACK to the previous step to not alter the dataset
ROLLBACK;

emp_no | first_name |
---|---|
10001 | GEORGI |
- REPLACE
- Replace M to Male and F to Female
SELECT first_name,
last_name,
gender,
REPLACE(REPLACE(gender, 'M', 'Male'),'F','Female') AS new_gender
FROM employees
LIMIT 10;

first_name | last_name | gender | new_gender |
---|---|---|---|
Bezalel | Simmel | F | Female |
Parto | Bamford | M | Male |
Chirstian | Koblick | M | Male |
Kyoichi | Maliniak | M | Male |
Anneke | Preusig | F | Female |
Tzvetan | Zielinski | F | Female |
Saniya | Kalloufi | M | Male |
Sumant | Peac | F | Female |
Duangkaew | Piveteau | F | Female |
Mary | Sluis | F | Female |
- SQL is case sensitive so to make it easier, we can use String Functions like LOWER to replace data
SELECT customer_name,
region,
country,
REPLACE(LOWER(country),'united states','US') AS new_country
FROM customers
LIMIT 10;

customer_name | region | country | new_country |
---|---|---|---|
Claire Gute | South | United States | US |
Darrin Van Huff | West | United States | US |
Sean O'Donnell | South | United States | US |
Brosina Hoffman | West | United States | US |
Andrew Allen | South | United States | US |
Irene Maddox | West | United States | US |
Harold Pawlan | Central | United States | US |
Pete Kriz | Central | United States | US |
Alejandro Grove | West | NULL | NULL |
Zuschuss Donatelli | West | United States | US |
- TRIM, LTRIM and RTRIM
TRIM
-- trim all spaces
SELECT TRIM(' ' FROM ' trim this ')
LEFT TRIM
-- trim all spaces in the left
SELECT LTRIM (' trim this ')
--OR
SELECT TRIM(leading ' ' FROM ' trim this ')
RIGHT TRIM
-- trim all spaces in the left
SELECT RTRIM (' trim this ')
--OR
SELECT TRIM(trailing ' ' FROM ' trim this ')
- Trim the () from the bracket_cust_id column
SELECT bracket_cust_id, TRIM(bracket_cust_id,'()') AS cleaned_cust_id
FROM customers
LIMIT 10;

bracket_cust_id | cleaned_cust_id |
---|---|
(CG-12520) | CG-12520 |
(DV-13045) | DV-13045 |
(SO-20335) | SO-20335 |
(BH-11710) | BH-11710 |
(AA-10480) | AA-10480 |
(IM-15070) | IM-15070 |
(HP-14815) | HP-14815 |
(PK-19075) | PK-19075 |
(AG-10270) | AG-10270 |
(ZD-21925) | ZD-21925 |
- CONCATENATION
- Create a new column called full_name from the first and last name of employees
SELECT first_name,
last_name,
(first_name || ' ' || last_name) AS full_name
FROM employees
LIMIT 10;

first_name | last_name | full_name |
---|---|---|
Bezalel | Simmel | Bezalel Simmel |
Parto | Bamford | Parto Bamford |
Chirstian | Koblick | Chirstian Koblick |
Kyoichi | Maliniak | Kyoichi Maliniak |
Anneke | Preusig | Anneke Preusig |
Tzvetan | Zielinski | Tzvetan Zielinski |
Saniya | Kalloufi | Saniya Kalloufi |
Sumant | Peac | Sumant Peac |
Duangkaew | Piveteau | Duangkaew Piveteau |
Mary | Sluis | Mary Sluis |
- Create a new column called address from the city, state and country of customers
SELECT city||', '||state||', '||country) AS address,
city,
state,
country
FROM customers
LIMIT 10;

address | city | state | country |
---|---|---|---|
Henderson, Kentucky, United States | Henderson | Kentucky | United States |
Los Angeles, California, United States | Los Angeles | California | United States |
Fort Lauderdale, Florida, United States | Fort Lauderdale | Florida | United States |
Los Angeles, California, United States | Los Angeles | California | United States |
Concord, North Carolina, United States | Concord | North Carolina | United States |
Seattle, Washington, United States | Seattle | Washington | United States |
Fort Worth, Texas, United States | Fort Worth | Texas | United States |
Madison, Wisconsin, United States | Madison | Wisconsin | United States |
NULL | West Jordan | Utah | NULL |
San Francisco, California, United States | San Francisco | California | United States |
- Create a column called desc_age from the customers name and age
SELECT (customer_name||', '||age||' years old.') AS desc_age,
customer_name,
age
FROM customers
LIMIT 10;

desc_age | customer_name | age |
---|---|---|
Claire Gute, 67 years old. | Claire Gute | 67 |
Darrin Van Huff, 31 years old. | Darrin Van Huff | 31 |
Sean O'Donnell, 65 years old. | Sean O'Donnell | 65 |
Brosina Hoffman, 20 years old. | Brosina Hoffman | 20 |
Andrew Allen, 50 years old. | Andrew Allen | 50 |
Irene Maddox, 66 years old. | Irene Maddox | 66 |
Harold Pawlan, 20 years old. | Harold Pawlan | 20 |
Pete Kriz, 46 years old. | Pete Kriz | 46 |
Alejandro Grove, 18 years old. | Alejandro Grove | 18 |
Zuschuss Donatelli, 66 years old. | Zuschuss Donatelli | 66 |
- SUBSTRING
- Retrieve the ids, names, and groups of customers
SELECT customer_id,
customer_name,
SUBSTRING(customer_id FOR 2) AS customer_group
FROM customers
LIMIT 10;

customer_id | customer_name | customer_group |
---|---|---|
CG-12520 | Claire Gute | CG |
DV-13045 | Darrin Van Huff | DV |
SO-20335 | Sean O'Donnell | SO |
BH-11710 | Brosina Hoffman | BH |
AA-10480 | Andrew Allen | AA |
IM-15070 | Irene Maddox | IM |
HP-14815 | Harold Pawlan | HP |
PK-19075 | Pete Kriz | PK |
AG-10270 | Alejandro Grove | AG |
ZD-21925 | Zuschuss Donatelli | ZD |
- Retrieve the ids and names of the customers in the customer group 'AB'
SELECT customer_id,
customer_name,
SUBSTRING(customer_id FOR 2) AS customer_group
FROM customers
WHERE SUBSTRING(customer_id FOR 2) = 'AB';

customer_id | customer_name | customer_group |
---|---|---|
AB-10060 | Adam Bellavance | AB |
AB-10165 | Alan Barnes | AB |
AB-10255 | Alejandro Ballentine | AB |
AB-10600 | Ann Blume | AB |
AB-10105 | Adrian Barton | AB |
AB-10150 | Aimee Bixby | AB |
AB-10015 | Aaron Bergman | AB |
- Retrieve the ids, names, and customer number of customers in the custom group 'AB'
SELECT customer_id,
customer_name,
SUBSTRING(customer_id FROM 4 FOR 5) AS customer_number
FROM customers
WHERE SUBSTRING(customer_id FOR 2) = 'AB';

customer_id | customer_name | customer_number |
---|---|---|
AB-10060 | Adam Bellavance | 10060 |
AB-10165 | Alan Barnes | 10165 |
AB-10255 | Alejandro Ballentine | 10255 |
AB-10600 | Ann Blume | 10600 |
AB-10105 | Adrian Barton | 10105 |
AB-10150 | Aimee Bixby | 10150 |
AB-10015 | Aaron Bergman | 10015 |
- Retrieve the year of birth for all employees
SELECT emp_no,
birth_date,
SUBSTRING(CAST(birth_date AS VARCHAR) FOR 4) AS year
FROM employees
LIMIT 10;
-- OR
SELECT emp_no,
birth_date,
EXTRACT(YEAR FROM birth_date) AS year
FROM employees
LIMIT 10;

emp_no | birth_date | year |
---|---|---|
10002 | 1964-06-02 | 1964 |
10003 | 1959-12-03 | 1959 |
10004 | 1954-05-01 | 1954 |
10005 | 1955-01-21 | 1955 |
10006 | 1953-04-20 | 1953 |
10007 | 1957-05-23 | 1957 |
10008 | 1958-02-19 | 1958 |
10009 | 1952-04-19 | 1952 |
10010 | 1963-06-01 | 1963 |
10011 | 1953-11-07 | 1953 |
- STRING AGGREGATION
- Retrieve a list of all department numbers for different employees
SELECT emp_no,
STRING_AGG(dept_no,', ') AS departments
FROM dept_emp
GROUP BY emp_no
LIMIT 10;

emp_no | departments |
---|---|
10001 | d005 |
10002 | d007 |
10003 | d004 |
10004 | d004 |
10005 | d003 |
10006 | d005 |
10007 | d008 |
10008 | d005 |
10009 | d006 |
10010 | d004, d006 |
- Retrieve a list of all products that were ordered by a customer from the sales table
SELECT order_id,
STRING_AGG(product_id,' /// ') AS products_ordered
FROM sales
GROUP BY order_id

order_id | products_ordered |
---|---|
CA-2014-100006 | TEC-PH-10002075 |
CA-2014-100090 | OFF-BI-10001597 /// FUR-TA-10003715 |
CA-2014-100293 | OFF-PA-10000176 |
CA-2014-100328 | OFF-BI-10000343 |
CA-2014-100363 | OFF-FA-10000611 /// OFF-PA-10004733 |
CA-2014-100391 | OFF-PA-10001471 |
CA-2014-100678 | OFF-AR-10001868 /// TEC-AC-10000474 /// OFF-EN-10000056 /// FUR-CH-10002602 |
CA-2014-100706 | TEC-AC-10001314 /// FUR-FU-10002268 |
CA-2014-100762 | OFF-PA-10001815 /// OFF-AR-10000380 /// OFF-LA-10003930 /// OFF-PA-10004082 |
CA-2014-100860 | OFF-LA-10001982 |
- COALESCE
- Replace a missing country with the city, state, or no address depending on the data availabe
SELECT customer_name,
country,
city,
state,
COALESCE(country,city,state,'No Address') AS customer_address
FROM customers
LIMIT 10;

customer_name | country | city | state | customer_address |
---|---|---|---|---|
Claire Gute | United States | Henderson | Kentucky | United States |
Darrin Van Huff | United States | Los Angeles | California | United States |
Sean O'Donnell | United States | Fort Lauderdale | Florida | United States |
Brosina Hoffman | United States | Los Angeles | California | United States |
Andrew Allen | United States | Concord | North Carolina | United States |
Irene Maddox | United States | Seattle | Washington | United States |
Harold Pawlan | United States | Fort Worth | Texas | United States |
Pete Kriz | United States | Madison | Wisconsin | United States |
Alejandro Grove | NULL | West Jordan | Utah | West Jordan |
Zuschuss Donatelli | United States | San Francisco | California | United States |