Solved BASH - counting chars and shaping script behavior upon that

December 27, 2011 at 13:07:53
Specs: Linux x86_64
I've somehow managed to obtain lists of people's names.
These names were read this way:

Status #for instance Sir., Mr., Eng., Doc. ... etc.
Name(s) #in case of multiple names, they are listed on different lines, not the same
Surname

These words were appended to a file one by one during the preceding "for" loop.

Now I have a file which is full of names, surnames, and statuses, in the order they were grabbed.

I now have to split the data in the file into contacts, one per line, in the following form:

Status\tName(1)\t...\tName(n)\tSurname\n #\t stands for tab and \n for newline of course

This means I have to make my script recognise where the statuses are located (they're the only fields always made of 2 to 3 chars) and store that line in the Status field. Then it has to look for the next occurrence of a status in the list: everything in between is stored as names, except the last line before the next status, which is stored in the Surname field. Then it moves on to the next contact, until EOF.

I'm pretty confused. I get the concept (and hope I've expressed it clearly enough) but have no idea how to put it into practice, other than trying cat loops with counters within cat loops with counters within cat loops with counters... but I'm getting pretty mad.







#1
December 27, 2011 at 14:42:03
Sorry, but I am not understanding what you want. Maybe if you included a data example it would help.


#2
December 28, 2011 at 04:25:55
Well, for instance, this is the file:

Mr
Franz
Kafka
Dr
Arthur
Conan
Doyle
Miss
Elisabeth
Taylor

And this should be the output:

Mr Franz Kafka
Dr Arthur Conan Doyle
Miss Elisabeth Taylor

No matter how many names someone has. Otherwise I would have split the document every 3 or 4 lines, but that's not the case.
I think that working with the length of each line could be a good starting point, as the statuses (Mr., Miss., Dr., Eng. ... ) are always made of 2 to 3 characters (not counting dots) and could be easily recognized by a machine.



#3
December 28, 2011 at 08:55:47
The number of characters probably won't work. "miss" has 4 characters - not 3. You would have to be sure the first and last names are not 4 characters or less.

Do you know what the different status values are? If so, this script will work:

#!/bin/bash

prstr=""   # the contact being assembled
while IFS= read -r line
do
   # build a comparison key: downshift all letters and delete the period
   key=$(printf '%s' "$line" | tr '[:upper:]' '[:lower:]' | tr -d '.')

   # add to the status list as needed:
   if [[ $key == "mr" || $key == "dr" || $key == "miss" || $key == "sir" || $key == "eng" ]]
   then
      # a status starts a new contact, so print the previous one (if any)
      if [[ -n $prstr ]]
      then
         echo "$prstr"
      fi
      prstr="$line"
      continue
   fi
   # any other line is a name or surname: append it to the current contact
   prstr="${prstr:+$prstr }$line"
done < mydata.txt
# print the last contact
if [[ -n $prstr ]]
then
   echo "$prstr"
fi
# end script
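
Since the original post asks for tab-separated fields, the same grouping could be sketched as a function that reads from standard input. This is only a sketch under assumptions: the name group_contacts is made up for this example, the status list is the same guess as in the script above, and the ${key,,} lowercasing needs bash 4 or later.

```bash
#!/bin/bash

# group_contacts (hypothetical name): read one word per line on stdin and
# print one tab-separated contact per line. Each status starts a new contact.
group_contacts() {
   local line key out="" tab=$'\t'
   while IFS= read -r line || [[ -n $line ]]
   do
      [[ -z $line ]] && continue    # skip blank lines
      key=${line%.}                 # drop a trailing period for the comparison
      case "${key,,}" in            # ${var,,} lowercases (bash 4+)
         mr|dr|miss|sir|eng)        # assumed status list: extend as needed
            [[ -n $out ]] && printf '%s\n' "$out"
            out=$line
            ;;
         *)
            out="${out:+$out$tab}$line"
            ;;
      esac
   done
   [[ -n $out ]] && printf '%s\n' "$out"
   return 0
}
```

It could be called as: group_contacts < mydata.txt > contacts.tsv (contacts.tsv is just a placeholder name).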

If you insist on using the length of the string, this stub might help you:

#!/bin/bash

string="miss"
len=${#string}   # length of the string in characters
if [[ $len -eq 4 ]]
then
   echo "$string"
fi
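
To illustrate why length alone is risky, here is a rough sketch of a length-based test. The helper name is_status_by_length is made up for this example, and the 2 to 4 character range is an assumption chosen so that "Miss" is accepted.

```bash
#!/bin/bash

# is_status_by_length (hypothetical helper): succeed when the word, with
# periods removed, is 2 to 4 characters long.
is_status_by_length() {
   local word=${1//./}     # strip periods, e.g. "Mr." -> "Mr"
   local len=${#word}
   [[ $len -ge 2 && $len -le 4 ]]
}

for w in "Mr." "Miss" "Kafka" "Anna"
do
   if is_status_by_length "$w"
   then
      echo "$w: treated as a status"
   else
      echo "$w: treated as a name"
   fi
done
```

Note that "Anna" passes the test even though it is a first name, which is the problem described above.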



#4
December 28, 2011 at 14:49:02
Ok.
I am starting to think something strange is happening. Although I know software is ALWAYS right.
Before receiving your answer I managed to make the script work with a dummy txt file, and everything worked as expected.
I then tried changing the size of the dummy file and it kept working.

When I finally tried to use the real text files, something went awfully wrong. And weird.

Let me explain what is happening:
As you know, I am manipulating huge text files containing nothing but names, surnames, and statuses. Only a-z and A-Z characters (as far as my eyes and grep could tell).

All the text was grabbed from HTML and manipulated with Unix-based, command-line software.

When I attempt to arrange this data in rows, doing exactly what I previously described in this topic, everything goes fine with the dummy files BUT not with the real ones.

I used grep on the real files but found nothing different from the dummy ones.
I tried feeding in dummy files filled with garbage and non-alpha characters, and the script filters them out and outputs the correct data. But not with the real files.

AND, strangest of all, when I try to view the content of the output file, I see all the data with gedit, but absolutely nothing with cat.

This sounds definitely illogical to me.



#5
December 28, 2011 at 16:01:21
✔ Best Answer
It looks like your real data files are not plain Unix text. Did you move the files over from Windows? In that case, you may have an extra Carriage Return (CR) at the end of each line. Windows uses a Carriage Return/Line Feed pair at the end of each line, while Unix/Linux uses just the LF.

If you view your files using vi and the CR is there, you'll see a control-M, ^M, at the end of each line.

There are shell scripts to remove the CR, or you can use the dos2unix command.
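
As a rough sketch of the shell route, the CR can be detected with grep and removed with tr, both standard tools. The function names has_cr and strip_cr are made up for this example.

```bash
#!/bin/bash

# has_cr (hypothetical name): succeed if the named file contains a CR.
has_cr() {
   grep -q $'\r' "$1"
}

# strip_cr (hypothetical name): copy stdin to stdout without carriage returns.
# Note that tr deletes every CR, not only those at the end of a line.
strip_cr() {
   tr -d '\r'
}
```

For example: has_cr mydata.txt && strip_cr < mydata.txt > mydata.unix.txt (the output file name is a placeholder). Write to a new file rather than redirecting back onto the input, which would truncate it before it is read.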



#6
December 29, 2011 at 05:43:53
You're right, I see plenty of ^M at the end of every line when opening it with vi!

This explains why my dummy, Unix-made texts work while the real ones don't: they were written under Windows!

Huge bow.

