Can Excel Abstract This Type of Data?

October 31, 2019 at 09:34:22
Specs: Windows
I have a large 3000 page word document with data similar to what you see below for customer information.

JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001
555-5555 F01012000 S

To break this down:
[first name][last name][street address][street address 2 - apartment number or P.O. box][town][state][zip]
[number][gender][date of birth] S

Note the 'S' is something my system creates and is meaningless for this task

I need to abstract this data into an excel file that breaks down each field into a different column. Each row is a different customer. Rather than going through 3000 pages and typing all of that information one by one, I am curious if there is some way to have excel abstract this data for me. Maybe paste the data into worksheet 1 and have a VBA abstract to worksheet 2?

The end result excel sheet I need has to be set up like this:

Column A: LAST NAME
Column B: FIRST NAME
Column C: DATE OF BIRTH
Column D: GENDER
Column E: empty column. no data entered
Column F: STREET ADDRESS ROW 1
Column G: STREET ADDRESS ROW 2
Column H: CITY
Column I: STATE
Column J: ZIP
Column K: PHONE

I'm sure there is more I'll need to explain, but I hope this can at least give an idea of my need to see if Excel can abstract data in this manner. If this is not something Excel can do, does anyone have any suggestions on a program that can?

Thank you.


See More: Can Excel Abstract This Type of Data?

Reply ↓  Report •

#1
October 31, 2019 at 11:12:17
We could probably get something close,
but you would still need to edit the final result
as there are multiple variations on each section.

Is the data on Two lines as in you example:
JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001
555-5555 F01012000 S

or Three lines as in you explanation:
[first name][last name][street address][street address 2 - apartment number or P.O. box][town][state][zip]
[number][gender][date of birth] S

Will the name always be just two words John Smith or multiple words like
John A. Smith
John A. Smith Jr.
John Smith Jr.
Both with and without the punctuation?

This also applies to the street address & telephone numbers
1 North Computing Ave.
1 Computing Ave. North West

555-555-5555 (not counting the various European varieties.)

These are just some of the variations that can/will appear.

To begin it would probably be easier if all the data was on One line.
You should be able to do that in Word using Search & Replace.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#2
October 31, 2019 at 11:22:06
The data is on 2 lines. The 3 lines in the example is solely because the text space filled up in the post. To jump to your last point, I can go in and edit the make everything on a single line. It would take time, but not as much time as abstracting each field would.

The variations is something I did think about but wanted to give a general concept first before diving deeper into this. There can indeed be variations in the name or punctuation like you suggested. Phone numbers are all United States based, so we should only be dealing with 10 digit numbers. Another "variation" to consider is that not all of the fields are populated for each customer. For example, some may not have a phone number listed or more commonly may not have a second address line such as an apartment number.

message edited by Steven4321


Reply ↓  Report •

#3
October 31, 2019 at 11:31:29
I assume that you mean "extract", not "abstract". If so, then...

My first question is this: Where did the data in the Word document come from? If you had access to the source data, it might already be "extracted", by which I mean each piece is already in its own field.

My other question is: Can you tell if the Word doc is formatted into columns or perhaps it's Tab delimited? Knowing that might help.

Excel has a Text-To-Columns feature that can separate each string into its own column, either using a fixed width method or delimited by a character, such as a space.

In addition, there are formulas such as LEFT(), MID() and RIGHT() that can extract characters and break up a string.

Finally, It's possible that a Macro could help you accomplish your goal.

That's the good news. The bad news that always comes along with requests to split names and addresses can be illustrated with a few examples.

Are all of your names made up of just 2 words or do you have some like these:

JOHN SMITH
JOHN A. SMITH
JOHN SMITH, JR.
JOHN ALEXANDER ROBERT SMITH

Are all of your addresses made up of a number and 2 words or do you have some like these:

1 COMPUTING AVE
1 COMPUTING TECHNOLOGY AVE
122A SOUTH COMPUTING AVE
122 - 54 COMPUTING AVE N

Then of course, there are those that have an apartment (or Unit) number or letter and those that don't.

How about cities and states? One word or more?

NEW YORK, NY
ALBANY, NY

It's these inconsistencies that make it difficult to employ the above mentioned methods. Determining where a name ends and an address begins can be difficult. Take a look at the formulas found at the following site, which are used to just extract First, Last and Middle names from a cell with only names in it. The formulas are quite complicated. Imagine trying to expand those formulas to fit your needs.

http://www.cpearson.com/excel/First...

If there is anything else that you can tell us about your data, maybe we can come up with something that gets you close.

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code


Reply ↓  Report •

Related Solutions

#4
October 31, 2019 at 11:31:32
I can go in and edit the make everything on a single line. It would take time

In Word,
Select Replace on the Ribbon

In the Find What box it will be either a Paragraph Mark or a Manual Page Break

Click on the More button and in the extended box select Special
In the Special list first select Paragraph Mark, that should get most of your data on one line
if needed try the Manual Page Break.
The Paragraph Mark symbol in the Find What box in ^p the Manual Page Break is ^|

In the Replace with box use a single space.

See how that works for you.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#5
October 31, 2019 at 12:07:58
DerbyDad:

The data is coming from an old DOS based system. My company took over a smaller group that logged this data in some DOS based system. I cannot say for certain what they used.

The document I have now is sort of a report from that system. I'll refer to it as such from now on. I used the word abstract because the report only pulled the information I mentioned above from each person's profile and left behind additional unneeded information. I do acknowledge though that extract would be more appropriate now that I have the report.

With that said, I would assume this report was created with extracted data fields. The file I now have is a .dbf type. There does not seem to be any columns separating the fields.

I can see how this task becomes extremely complicated when you consider multiple variables for each field. I'll also predict the likelihood of a code not being 100% perfect without me providing the exact data, which I unfortunately cannot do. Does the information I just provided make this task a little easier?

mmcconaghy:

I quickly did what you explained and it did get rid of the 2 lines, but it also got rid of every line. So each persons information just continued a long single string that ran for pages.

I'm sure I likely did something wrong. But the point here is that there are easier ways to reduce each person's data to a single line. If we come to a point of determining the worth in investing time with this over me having some people go through the report, then we can revisit this point.

message edited by Steven4321


Reply ↓  Report •

#6
October 31, 2019 at 12:26:41
Sounds like to did both Paragraph Marks & Manual Page breaks.

Try doing just Manual Page breaks first,
then if you have any odd double lines, use the Paragraph Mark on just those lines or manually combine them.

I can't see your doc from where I'm sitting, so we'll have to take this slow, one step at a time.

MIKE

http://www.skeptic.com/

message edited by mmcconaghy


Reply ↓  Report •

#7
October 31, 2019 at 12:31:39
Also, check your data and see if the Sex letter has spaces on both sides, or if it is snugged up against the birth date as in you example.
And, check you Phone numbers, are they mostly 555-5555 or 123-555-555 and do they have other structures, like (123) 555-555 or 555.5555.5555 etc.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#8
October 31, 2019 at 12:43:20
Data is structured like this.

JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001
555-5555 F01012000 S

JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001
555-5555 F01012000 S

JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001
555-5555 F01012000 S

JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001
555-5555 F01012000 S

The sex is either M or F, has a space to the left and is attached with no space to the date of birth. Exactly as shown in the example.

^l = 0 replacements. Does nothing.
^ p = The issue I described above. Everything is put into a single line that continues down the page.


Reply ↓  Report •

#9
October 31, 2019 at 13:01:22
When you display formatting marks, (on the ribbon in the paragraph section, the back wards P symbol)
at the end of your lines do you have an arrow mark or the back wards P?

The little crooked arrow is the Manual Page break, the backwards P the Paragraph Mark.


Try using the More/Special box to insert the mark in the Find What box.
Do not do it manually.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#10
October 31, 2019 at 13:17:08
There is a paragraph mark at the end of each line. Also each person has a mark between then in the blank space.

JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001 paragraph mark
555-5555 F01012000 S paragraph mark
paragraph mark
JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001 paragraph mark
555-5555 F01012000 S paragraph mark

Inserting a paragraph mark in the find what box results in 0 replacements.


Reply ↓  Report •

#11
October 31, 2019 at 13:24:33
I just realized that I never addressed this:

And, check you Phone numbers, are they mostly 555-5555 or 123-555-555 and do they have other structures, like (123) 555-555 or 555.5555.5555 etc.

Mostly 7 characters with a dash between; 555-5555. There are a few that have all 10 characters but those all use another dash as well; 555-555-5555. No parentheses present.


Reply ↓  Report •

#12
October 31, 2019 at 13:57:59
OK, that screws everything up. :-)

Try this, take about 100 or 200 lines of data and Copy / Paste them into a blank Excel sheet.

On the Ribbon,
Use the Paste / Special / Text

That should get each 2 lines of data deposited into their own cells. A1 & A2, A3 & A4, etc

Now we can combine the two rows and get our single line of data.

In cell B1 enter the formula: =A1&" "&A2
Leave cell B2 blank
In cell B3 enter the formula:=A3&" "&A4
Leave cell B4 blank

Select the range of cells B1 thru & including B4

Grab the small square on the lower right of your selected range and drag down.

Select the whole column B and do a Copy / Past / Values

Now delete all of column A

Now to get rid of the blank rows:

Select column A
On the Ribbon
Select Find & Select
Select Go to Special
In the Specials window select Blank, click OK
On the Ribbon
Click Delete

All you blank rows should now be gone.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#13
October 31, 2019 at 16:08:54
re: "Everything is put into a single line that continues down the page."

Do you mean like this?

J
O
H
N

S
M
I
T
H

If so, just expand the width of the cell and it should "flatten out".

For now, I've left your data as 2 lines. You can fix it later if this suggestion works for you. By leaving it as 2 lines, extracting the Phone Number, Gender and DOB is easier

In A2: 555-5555 F01012000 S or 555-555-5555 F01012000 S

Phone Number: =LEFT(A2,FIND(" ",A2)-1) (Assuming no space in phone number)

Gender: =IF(ISERROR(FIND("F",A2)),"M","F")

DOB: Start with this...

=IFERROR(MID(A2,FIND("F",A2)+1,LEN(A2)-FIND(" ",A2)-3),MID(A2,FIND("M",A2)+1,LEN(A2)-FIND(" ",A2)-3))

Then apply this to the result (I'm too lazy to combine them)

Assume result of 01012000 in D2:

=LEFT(D2,2)&"/"&MID(D2,3,2)&"/"&RIGHT(D2,4)

The name and address line will need a lot more, more than I have time for right now. The concept is the same, but the layout is harder to work with.

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code


Reply ↓  Report •

#14
October 31, 2019 at 17:00:58
Once you have everything on one line here are a few formulas to get you started.

Column B: First & Last Name:

=LEFT(A2,MIN(FIND({0,1,2,3,4,5,6,7,8,9},A2&"0123456789"))-2)

Have combined First Name & Last Name,
you can process the file again using the site suggested by DerbyDad to split the names apart.

The formula relies on the Address Starting with a Street Number, so it may not always work.

Columns C & D & E Address section still a work in progress.

Column F: State:
=IF(MID(A2,LEN(A2)-24,1)=" ",MID(A2,LEN(A2)-32,2),MID(A2,LEN(A2)-28,2))

Column G: Zip:
=IF(MID(A2,LEN(A2)-24,1)=" ",MID(A2,LEN(A2)-29,5),MID(A2,LEN(A2)-25,5))

Zip Code is restricted to 5 digits

Column H: Phone Number:
=IF(MID(A2,LEN(A2)-24,1)=" ",MID(A2,LEN(A2)-23,12),MID(A2,LEN(A2)-19,8))

Phone number can be either 8 or 12 digits

Column I: Gender:
=MID(A2,LEN(A2)-11,2)

Column J: DOB:
=DATE(MID(A2,LEN(A2)-5,4),MID(A2,LEN(A2)-7,2),MID(A2,LEN(A2)-9,2))

Column K: the trailing S:
=RIGHT(A2,1)


See how they work on your data.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#15
October 31, 2019 at 17:38:08
Sorry, but the best I have been able to do is combine the Street Address, Apt and City into one cell.
There are just to many variables to break it apart.

Column C: Street, Apt, City:

=MID(A2,MIN(FIND({0,1,2,3,4,5,6,7,8,9},A2&"0123456789")),FIND(",",A2,1)-MIN(FIND({0,1,2,3,4,5,6,7,8,9},A2&"0123456789")))

This formula relies on a comma following the City name, so it may also fail.

As with the Name cell, once you have it broken apart there may be a way to processes it more and get the Street, Apt. and City into their own cells.

I'll keep working on it.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#16
October 31, 2019 at 18:43:36
re: "With that said, I would assume this report was created with extracted data fields. The file I now have is a .dbf type."

In your OP you said it was a Word doc. Do you also have a copy of the dbf file? If so...

Have you tried opening the dbf file directly with Excel?

Try dragging the dbf file onto the Excel app icon. it should open.

Or inside Excel, choose Open, browse to the correct folder and change the file type option to All Files and choose the dbf file.

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code


Reply ↓  Report •

#17
October 31, 2019 at 20:25:33
DerbyDad

I saw that and it didn't even register. :-(
Must be getting old.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#18
November 1, 2019 at 03:43:30
Don't worry...it didn't immediately register with me either.

It'll be interesting to hear the result of trying to open it with Excel.

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code


Reply ↓  Report •

#19
November 1, 2019 at 10:03:37
Windows doesn't recognize Excel as a viable program to open the file with. I can select it though manually.

When doing so and Excel begins to launch I get this message:
The file format and extension of 'FILENAME.dbf' don't match. The file could be corrupted or unsafe. Unless you trust its source, don't open it. Do you want to open it anyway?

If I open it anyway, I get a similar structure to what I open in Word. Everything is in column A. Each person is broken up into two lines, in this case 2 rows, separating the information in the same manner described before.

Cell A1: JOHN SMITH 1 COMPUTING AVE.APT.A1 NEW YORK, NY 10001
Cell A2: 555-5555 F01012000 S

I'll read through/test/respond to your other posts in a few. Heading into a meeting at the moment.


Reply ↓  Report •

#20
November 1, 2019 at 10:32:34
That's too bad, was hoping the dbf file would have more structure or delimiters.

Try all of the suggested postings and see how they work, any problems let us know.
Would recommend that you Copy/Paste from the postings, as opposed to trying to write them by hand.

This is my attempt as solving the Street/Apt/City problem, with one caveat:

=LEFT(SUBSTITUTE(A2,B2&" ",""),FIND(".",SUBSTITUTE(A2,B2&" ",""))-1)

This will separate out the Street address from the Apt/City ONLY if the street address
is followed by a period, as in "Ave." Note the period.

You might get lucky with a fairly large group that end in Ave. St. etc,
but if you get a Way, Drive, Court, etc you'll have to modify them by hand.

Unless DerbyDad03 has some magic up his sleeve.

Good luck and keep us posted on your progress.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

#21
November 1, 2019 at 11:33:38
It might help if we had more examples of the variants, as opposed to us making up our own. Any solution we find for the same repeated example that we've been given probably won't work other variants. Even if we come up for options for variants that we make up, they still might not work for the actual data.

In any case, we're pretty sure that the formulas I posted in #13 work for the Phone, Gender and DOB data, assuming that there are no variants of the layout given for that part. So let's keep that separate for now.

As far as the Name and Address line I applied the Text-To-Columns feature using 2 delimiters: space and period. For the example given, I got this:

	 A        B     C           D	         E	 F	G	 H	 I	J	 K
1	JOHN	SMITH	1	COMPUTING	AVE	APT	A1	NEW	YORK,	NY	10001

From there I could easily concatenate cells to reconstruct the name and address. Obviously, if all of the data looked like that, the concatenation method would be simple. However, since it's not, maybe there is a way to sort/group the variants such that only a few concatenation formulas would be required. Maybe even something like =IF (count of cells = 11, use this concatenation formula),IF(count of cells = 14, use this concatenation formula),IF(Count =etc., etc.)

I doubt we'll ever get away without some manual labor once all of that is done, but at the more we can get Excel to do for us, the better.

So, toss out a few more examples of your data layouts so that we're working with the right stuff.

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code


Reply ↓  Report •

#22
November 1, 2019 at 11:42:32
BTW... a fairly simple way to combine the 2 lines of data is to use OFFSET. Once all of the blank rows are eliminated, put this in B1 and drag it down:

=OFFSET(A$1,ROW()*2-2,0)&" "&OFFSET(A$2,ROW()*2-2,0)

Copy...PasteSpecial...Values

But for now, I'd keep the Names and Address data in one cell =OFFSET(A$1,ROW()*2-2,0) and the Phone, gender, DOB in another =OFFSET(A$2,ROW()*2-2,0) so that they can be worked on separately.

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code


Reply ↓  Report •

#23
November 3, 2019 at 06:11:51
I know there have been lots of responses to this...from Excel experts.. So, I don't wish to diminish their knowledge in any way.

What version of Office do you have? Access 2010 and earlier will open a dbf (old DBase) file.

I can't remember if Office 2013 will open those or not. Regardless of your version of Office, it's worth a shot..

If you can't open it, you could email it to me, and I can give it a shot. Send me a PM if you wish.

Doing the best I can here... And remember, there's always more than one path to success. :)


Reply ↓  Report •

#24
November 4, 2019 at 07:59:23
We use Office 2016. Normally I would take you up on your offer but I cannot as this file includes private information. Hence why I gave examples in basic format without specifics.

I plan on attempting the suggestions today or tomorrow and I will keep you all informed.


Reply ↓  Report •

#25
November 4, 2019 at 09:01:06
re: "Hence why I gave examples in basic format without specifics."

But you could post other generic examples of the variants that make finding a one-size-fits-all solution difficult.

See my response #21 above.

Use fake names, but show the variants

John Smith
Bill A. Jones

Use fake addresses, but show the variants

Address with apartment
Address without apartment

Any variants when it comes to city names? (1 word vs. 2, 3, etc.?)

New York, NY
Albany, NY

etc.

The more we know about what other variants you are working with, the better we can help you find a solution that gets you as close as possible to your goal.

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code


Reply ↓  Report •

#26
November 5, 2019 at 13:53:35
I just sat in on a meeting this afternoon about integrating a few things from the company we took over. One of the topics was this large file we've been discussing here. Two things were decided.

1: Marketing only wants to target a specific age group so now we will only be copying the information of customers who are under 25 years old. I did a spot check of about 100 people and there were only 9 that qualified. So the amount of information we will need to copy is significantly decreased.

2: Given the decreased workload, we've assigned some people to scroll through pages and pull out the customers in that age group. This may still take a fair amount of time, but it will certainly be easier than asking you all to create a code and trying to communicate results.

With that said, I appreciate all of the time you have all spent so far. Part of me wants to see this discussion yield a result and I'm sure you all love a good coding puzzle, but I don't want to waste anyone's time. I'll be sure to come here first the next time I have an office task that is beyond me.

Thank you.


Reply ↓  Report •

#27
November 6, 2019 at 09:35:27
Assuming that your file isn't already sorted by DOB, you might want to try this:

1 - Combine the 2 sets of data for each person into a single row.
2 - Extract the DOB year using this formula on whatever cell the Phone GenderDOB S string ends up in. (replace A2 with the correct cell reference and drag it down)

=RIGHT(IFERROR(MID(A2,FIND("F",A2)+1,LEN(A2)-FIND(" ",A2)-3),MID(A2,FIND("M",A2)+1,LEN(A2)-FIND(" ",A2)-3)),4)

3 - Copy the Year column, then PasteSpecial...Values
4 - Sort the file on the Year column

How To Post Data or Code ---> Click Here Before Posting Data or VBA Code

message edited by DerbyDad03


Reply ↓  Report •

#28
November 6, 2019 at 13:20:19
With your data on one line per cell, starting at A2, these formulas will give you a bit of help. They work from Right to Left, which I think makes them easier to understand.

Column K: the trailing S:
=RIGHT(A2,1)

Column J: DOB:
=DATE(MID(A2,LEN(A2)-5,4),MID(A2,LEN(A2)-7,2),MID(A2,LEN(A2)-9,2))

Column I: Gender:
=MID(A2,LEN(A2)-11,2)

Column H: Phone Number:
=IF(MID(A2,LEN(A2)-24,1)=" ",MID(A2,LEN(A2)-23,12),MID(A2,LEN(A2)-19,8))

Phone number can be either 8 or 12 digits

Column G: Zip:
=IF(MID(A2,LEN(A2)-24,1)=" ",MID(A2,LEN(A2)-29,5),MID(A2,LEN(A2)-25,5))

Zip Code is restricted to 5 digits

Column F: State Abbrev:
=IF(MID(A2,LEN(A2)-24,1)=" ",MID(A2,LEN(A2)-32,2),MID(A2,LEN(A2)-28,2))

2 letter state abbreviation.

MIKE

http://www.skeptic.com/


Reply ↓  Report •

Ask Question