Extract number out of character column and use it to update another column (2024)

A friend told me about the old database he had to work with. There was one file with a field that contained two pieces of information. Any leading characters, A – Z, meant one thing and the numbers within it meant something else. Unfortunately there could be any number of leading characters, the numbers could be three of four long, and there could be characters at the end. The examples he shared with me were:

'A123B'
'ABC123'
'AB4679'
'BC0123D'

His plan was to add a new numeric field to the file, and update it with the numbers extracted from this original field.

"What would be a simple way to do this with SQL?" he asked.

Before I answer that question I need a file and data within it. The DDS code for my test file is:

01 A R TESTFILER02 A ORIGINAL 10A03 A NBR_CODE 5P 0

The file contains two fields:

Line 4: The original alphanumeric field.

Line 5: The equivalent of his new numeric field.

I can use SQL to insert the example data he gave me into the file:

01 INSERT INTO TESTFILE (ORIGINAL)02 VALUES('A123B'),('ABC123'),('AB4679'),('BC0123D') ;

This statement just added values into the first field, ORIGINAL. I can check using the following SQL statement:

01 SELECT * FROM TESTFILE

The results for the above statement is:

ORIGINAL NBR_CODE-------- --------A123B 0ABC123 0AB4679 0BC0123D 0

It is obvious that I will need to substring the "number" from the character strings. The Substring SQL scalar function requires three pieces of information:

01 SUBSTR(ORIGINAL,02 < start of string >,03 < length of string >)

Line 1: The first piece of information is the field name we are substring the information out of, which is ORIGINAL.

Line 2: Start position of the "number" I want to substring out of the field. This could be in different places, therefore, I can use a hardcoded value.

Line 3: As the length of the "number" can vary too I cannot hardcode it too.

The best way I know to look for a range of characters in one statement is to use a Regular Expression, specifically REGEXP_INSTR. It will return to me the first position in the string the desired character(s) are found.

It has a number of parameters. The ones I will be using are:

source-string: The variable, or string, that will be searched. In this example this will be 'ORIGINAL' as that is the field I am looking within.
search-pattern: The pattern I am looking for in the source string.
starting-position: Where to start searching for the pattern.
occurrence: The occurrence I want to check string for the search pattern. In this scenario it will always be the first occurrence.

The first thing I need to find is the position first number in the field. For that I would use:

01 SELECT ORIGINAL, 02 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1)03 FROM TESTFILE

Line 1: I want to display the field I am searching for the pattern in.

Line 2: What this REGEXP_INSTR says is:

ORIGINAL: I am looking within this field.
'[0-9]': This does need to be enclosed within apostrophes/single quotes. I am enclosing within the square brackets ( [ ] ) I am looking to any character that is in the range of zero to nine.
1: Start searching in the first position.
1: Stop when you find the first occurrence of a character in the desired range.

The results are:

ORIGINAL 00002-------- ------A123B 2ABC123 4AB4679 3BC0123D 3

Which has returned the first "numeric" character in the field.

The next thing I need to find is harder. Where is the next alphabetic character?

01 SELECT ORIGINAL,02 REGEXP_INSTR(ORIGINAL,'[A-Z ]',03 (REGEXP_INSTR(ORIGINAL,'[0-9]',1,1)),04 1)05 FROM TESTFILE

Lines 2 - 4: This looks a lot more complicated as I have a REGEXP_INSTR nested within another REGEXP_INSTR. Let me break this line down.

Line 2: This is the start of the REGEXP_INSTR expression. I am looking for any character that is in the range of A – Z or blank.

Line 3: Start position is anywhere after the where the first "numeric" character is found.

Line 4: I am looking for the first occurrence.

The results are:

ORIGINAL 00002-------- ------A123B 5ABC123 7AB4679 7BC0123D 7

In the first result the letter 'B' is in the fifth position.

In the second and third results there are no letters following the "number", but there is a blank which is in the seventh position.

In the fourth result the letter 'D' is in the seventh position.

I think I have enough to be able to construct the REGEXP_INSTR expression to substring the "number" from the field.

01 SELECT ORIGINAL,02 SUBSTR(ORIGINAL,03 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1),04 REGEXP_INSTR(ORIGINAL,'[A-Z ]',05 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1),1) 06 - REGEXP_INSTR(ORIGINAL,'[0-9]',1,1))07 FROM TESTFILE

Line 1: I want to display the content of the ORIGINAL field.

Lines 2 – 6: This is the part that substrings the "number" from ORIGINAL.

Line 2: I am substring-ing from ORIGINAL.

Line 3: The starting place is returned from this REGEXP_INSTR returns the position of the first "numeric" character.

Lines 4 – 6: This is where it gets complicated… my statement calculate the length of the "number". Which it can do:

< length of "number" > = < position of alphabetic or blank character after "number" > - < start position of "number" >

Lines 4 and 5: This part determines the position of the first alphabetic or blank character after the "number". The REGEXP_INSTR scans for first A – Z or blank character that is found after the first part of the "number".

Line 6: From that value the starting position of the number is subtract. This is the length of the number.

When I execute this statement my results are:

ORIGINAL 00002 -------- ------A123B 123ABC123 123AB4679 4679BC0123D 0123

The second column are character values, not numbers.

I now need to update the file NBR_CODE field in the file TESTFILE. I can take the above Select statement and modify it into an Update statement:

01 UPDATE TESTFILE 02 SET NBR_CODE = 03 SUBSTR(ORIGINAL,04 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1),05 REGEXP_INSTR(ORIGINAL,'[A-Z ]',06 REGEXP_INSTR(ORIGINAL,'[0-9]',1,1),1) 07 - REGEXP_INSTR(ORIGINAL,'[0-9]',1,1))

Lines 3 – 7: This substring is the same as the one in the Select statement in line 2 – 6.

Once the above is executed I can check the contents of TESTFILE with the following:

01 SELECT * FROM TESTFILE

The results are:

ORIGINAL NBR_CODE-------- ---------A123B 123ABC123 123AB4679 4679BC0123D 123

The character "numbers" extracted by the substring statement are converted to numbers when the NBR_CODE is updated.

As you can see that this is a somewhat simple statement extracted the number from the character field and was used to update the new field in the file. All of which is what my friend had asked for.

This article was written for IBM i 7.5 TR3 and 7.4 TR9.

Extract number out of character column and use it to update another column (2024)

References