Extract data from messy file

January 31, 2011 at 07:15:13
Specs: Windows 7
Hi friends,

I want to extract really messy data from a text file (also importable in Excel format from the website, see below):

1. First field is the name of the hospital. There are no tags to identify
it, but it comes after a string like "Date of Last Update: 2010-10-28".
The date itself might be different in each string. A blank string can be
added for the first record.
2. Hospital Type: A variable length string but identified by "Hospital
Type:"
3. City, State Zip: All of this is consistently placed on the line before
"ACP Chapter Code"
4. Program Size: A number identified by the preceding string "Program Size:"
5. PGY1: A number identified by the preceding string "PGY1:"
6. Tracks Available: This can have 1-5 values, e.g. Categorical Internal
Medicine, Preliminary Medicine, Osteopathic Internship, Osteopathic
Residency etc.
I want to capture all of them such that there is one column for each.
7. Date last updated: Use the date at the end of the record (rather than
the one preceding the name)

Website with the data: http://www.acponline.org/cgi-bin/re...

/******Sample text file*****/

Alameda County Medical Center Program
Hospital Type: Community/University
Program Director(s): Theodore G. Rose, MD, FACP
Send correspondence to:
Department of Medicine
1411 E. 31st Street
Oakland, CA 94602
ACP Chapter Code: CANO
Phone: (510) 535-7540
Fax: (510) 437-4187
Program Size: 50; PGY1: 19
Tracks
Available
Duration
(Years)
Categorical Internal Medicine
3
Primary Care Internal Medicine
3
Preliminary Medicine
1
Date of Last Update: 2004-07-20
Albany Medical Center Program
Hospital Type: University Hospital
Program Director(s): Alwin Steinmann, MD
Send correspondence to:
Albany Medical Center
Medicine Education A-17
47 New Scotland Avenue
Albany, NY 12208
ACP Chapter Code: NYHV
Phone: (518) 262-5377
Fax: (518) 262-6873
E-mail: ferar@mail.amc.edu
Program Size: 58; PGY1: 17
Tracks
Available
Duration
(Years)
Categorical Internal Medicine
3
Preliminary Medicine
1
Date of Last Update: 2004-07-22
Albert Einstein College of Medicine & Montefiore Medical Center Program
Hospital Type: Community Teaching Hospital
Program Director(s): Sharon Silbiger, MD
Send correspondence to:
Montefiore Medical Center
Department of Medicine
111 East 210th Street
Bronx, NY 10467
ACP Chapter Code: NYD1
Phone: (718) 920-4417
Program Size: 150; PGY1: 50
Tracks
Available
Duration
(Years)
Preliminary Medicine
1
Categorical
3
Primary Care
3
Social Internal Medicine
3
Clinical Scientist
6-7 yrs. Depending if subs
Date of Last Update: 2002-09-19



#1
January 31, 2011 at 13:11:51
You are right; this is messy, which is why I am only handling the hospital name and type for you. I will leave the rest to you.

First, I added a 'Date of Last Update' string to the top of the file so I could get the first hospital name.

Here is the script:

#!/bin/ksh

awk ' {

# line 1 must be the filler line added to the top of the file
if($0 ~ /Date of Last Update/ || NR == 1)
   {
   # next line is hospital name; getline returns 0 at end of file
   if((getline) > 0 && $0 != "")
      print $0
   }

if($0 ~ /Hospital Type:/)
   {
   # split on the colon, then trim the space that follows it
   split($0, arr, ":")
   sub(/^ +/, "", arr[2])
   print arr[2]
   }

} ' datafile.txt
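
The same split idea should carry over to the "Program Size: 50; PGY1: 19" line. Below is a minimal sketch, assuming both numbers always share one line and are separated by a semicolon as in the sample. The function name program_counts and the "|" output separator are made up for illustration, and the carriage-return stripping is only a guess that the file may have Windows line endings.

```shell
#!/bin/sh

# Hypothetical helper: prints "size|pgy1" for every "Program Size:" line.
# Reads the files given as arguments, or standard input if there are none.
program_counts() {
    awk '
    { sub(/\r$/, "") }                 # tolerate Windows line endings

    /Program Size:/ {
        size = ""; pgy1 = ""
        # the line looks like "Program Size: 50; PGY1: 19"
        n = split($0, part, ";")
        for (i = 1; i <= n; i++) {
            # each part looks like "label: value"
            split(part[i], kv, ":")
            gsub(/^[ \t]+|[ \t]+$/, "", kv[1])
            gsub(/^[ \t]+|[ \t]+$/, "", kv[2])
            if (kv[1] == "Program Size") size = kv[2]
            if (kv[1] == "PGY1")         pgy1 = kv[2]
        }
        print size "|" pgy1
    }
    ' "$@"
}
```

Calling program_counts datafile.txt should print one pair per record, 50|19 for the first sample record.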

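Let me give you a hint on obtaining the city, state, zip: It is tempting to search for the:

Send correspondence to

string and grab a fixed number of lines down, but the address block is not the same length in every record (in the sample, the city line is the 3rd line down in the first record and the 4th in the other two). It is safer to save each line as you read it and use the saved one when you reach:

ACP Chapter Code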
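
Here is a rough sketch for the city line, the tracks and the closing date. The number of address lines after "Send correspondence to:" varies in the sample (three in the first record, four in the other two), so this sketch remembers the previous line and takes it as the city line when "ACP Chapter Code:" appears. It assumes that track names and durations strictly alternate after the "(Years)" line, and that every record ends with a "Date of Last Update:" line. The function name location_and_tracks and the "|" separator are made up for illustration.

```shell
#!/bin/sh

# Hypothetical helper: one "name|city, state zip|date|track|track..." row
# per record. Reads the files given as arguments, or standard input.
location_and_tracks() {
    awk '
    BEGIN { expect_name = 1 }          # the file starts with a hospital name
    { sub(/\r$/, "") }                 # tolerate Windows line endings

    /^Date of Last Update:/ {
        # this line closes the current record, so print it
        if (name != "") {
            date = $0
            sub(/^Date of Last Update:[ \t]*/, "", date)
            print name "|" city "|" date tracks
        }
        name = ""; city = ""; tracks = ""
        in_tracks = 0; expect_name = 1
        next
    }

    expect_name {
        if ($0 ~ /^[ \t]*$/) next      # skip blank lines before a name
        name = $0; expect_name = 0
        next
    }

    # the city, state, zip line is the one just before this marker
    /^ACP Chapter Code:/ { city = prev }

    # the track list starts after the "(Years)" header line
    /^\(Years\)[ \t]*$/ { in_tracks = 1; is_name = 1; next }

    in_tracks {
        # assumed layout: track name, then its duration, repeated
        if (is_name) tracks = tracks "|" $0
        is_name = !is_name
        next
    }

    { prev = $0 }
    ' "$@"
}
```

With the sample above, location_and_tracks datafile.txt should start with the row Alameda County Medical Center Program|Oakland, CA 94602|2004-07-20|Categorical Internal Medicine|Primary Care Internal Medicine|Preliminary Medicine. A filler "Date of Last Update" line at the top of the file is not needed here, but it does no harm either.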

