How to extract data from text file with complex structure?

October 23, 2019 at 18:20:53
Specs: Windows 10
Hi! I have a large flat text file with data for thousands of companies. Data for each company follow a similar format, but some companies have more and larger data groups/sections than others. Additionally, some of the data are within a single line, other data spill onto subsequent lines, and still other data are in more of a table format. Example text is below.

I'm wondering how best to locate the data and extract the variable values. My end goals are to crawl the data for every company in my text file (let's say 3333 firms, so I need a loop of some sort); identify and extract all variable values (most follow a colon, but the table-like data do not); append each company's data to a single project dataframe; and export the dataframe to a .csv that I can use for further analyses. I am most familiar with R, but I have some experience with Python using JSON files. Can anyone help? These data are for my dissertation :). Thank you!

COMPANY: +Abc, Inc. Data Last Updated: 04/18/2019

Address: 6 Street Founded: 01/01/2012
Status: Public
- IPO Date: 05/15/2017
Phone: 15555555555
Fax:
Website: www.Abc.com

Business Description: +Abc, Inc. is an online insurance company that serves
small businesses. The Company manages a
search engine.


Industry Class: 2000 / Technology
Industry Major Grp: 1000 / Computer
Industry Minor Grp: 1200 / Internet
Industry Sub-Grp 1: 1200 / Internet
Industry Sub-Grp 2: 1800 / Ecommerce
Industry Sub-Grp 3: 1840 / Finance

EXECUTIVES:

First Name           Last Name                      Job Title
-------------------- ------------------------------ ----------------------------------------
Person A             Alast Name                     Director, Development
Person B             Blastname                      Board Member
Person BC            BClastname                     Director, Operations
Person BCD           BCDlast Name                   Board Member

FINANCE ROUNDS:

Round No. Round Date Company Financing Stage
--------- ---------- ----------------------------------------
1         01/01/2016 Early Stage

Round Date, Est Round Total Amt ($000), Investor
--------------------------------------------------------------------------------
01/01/2018 12065 Imma Partners SA - Unspecified Fund
01/01/2018 12065 Annie Group SA - Unspecified Fund
01/01/2018 12065 Undisclosed Fund
01/01/2018 12065 Undisclosed Fund

PRODUCT/MARKET INFORMATION:

#new page

Product(s): BIGHERO

Competing Against: BBC Company, JAM, Inc.

FINANCIAL OUTCOMES:

Yr Ended Yr Ended
12/31/10 12/31/09

Net Sales 33,123 11,033
Pre-Tax Income 13,216 12,036
Net Income 11,900 8,311
EPS - -
TA 59,633 48,718
SE - -
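
One possible first step, sketched in Python with only the standard library: split the file into one block of lines per company, so that each block can be parsed on its own. This assumes every record begins with a line starting with "COMPANY:", as in the sample above. The function name split_companies and the second company "Xyz Corp." are made up for illustration.

```python
def split_companies(lines):
    """Group the file's lines into one list of lines per company."""
    records, current = [], []
    for line in lines:
        # Assumption: a line starting with "COMPANY:" opens a new record.
        if line.startswith("COMPANY:") and current:
            records.append(current)
            current = []
        current.append(line.rstrip("\n"))
    if current:
        records.append(current)
    return records


# Tiny stand-in for the real file; "Xyz Corp." is an invented second record.
sample_lines = [
    "COMPANY: +Abc, Inc. Data Last Updated: 04/18/2019\n",
    "Status: Public\n",
    "\n",
    "COMPANY: Xyz Corp. Data Last Updated: 01/02/2019\n",
    "Status: Private\n",
]
records = split_companies(sample_lines)
print(len(records))      # number of company blocks found
print(records[1][0])     # first line of the second block
```

With the real file, an open file object could be passed in place of sample_lines, since iterating over a file yields its lines one at a time.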



#1
October 24, 2019 at 01:19:29
"a long essay on a particular subject, especially one written for a university degree or diploma."

So pretty much you're asking us to do some of your homework?

Nope,




#2
October 24, 2019 at 06:00:51
Thanks for your help, hidde663. I'm not a computer science expert or in a computer science program; I'm a social scientist. My "homework" is to propose a theoretical model and test hypotheses by running longitudinal analyses on my data, which I'm fully capable of doing myself. Getting a text file into a usable format for those analyses is what I'm asking for help with; it is beyond my expertise and beyond the expectations for my social science (not computer science) dissertation. I've already looked at other resources and attempted to write code based on them, but their examples have much simpler structures and aren't helpful. I did not come to this community without first taking the initiative to learn how to do this on my own. So if you have a resource in mind that would be more helpful, I'd appreciate you sharing it. TIA

message edited by ineedhelp2019



#3
October 24, 2019 at 11:45:20
It's unclear to me.
What is your input, and what do you want to get as results? Show us with examples!


#4
October 24, 2019 at 13:50:50
Hi Hackoo -- My input is a huge text file (.txt) that I need to crawl. I imagine that I would need 4 .csv files from the text file. Below I provide an example of how the files would be set up. I hope this makes sense... I believe pandas may be helpful for the table-like data, but I am a novice and unfamiliar with Python. I would greatly appreciate any ideas and/or suggestions for resources that match the complexity of my problem.

1) For data following a colon, see this example for maincompany.csv:
(Text Line): COMPANY: +Abc, Inc. Data Last Updated: 04/18/2019
(Text Line): Status: Public
(Text Line): Competing Against: BBC Company, JAM, Inc.

Row 1 (header row) of a .csv would include 4 variables: "CompanyName", "LastUpdated", "Status", "Competitors".
Row 2 would include: "+Abc, Inc.", "04/18/2019", "Public", "BBC Company, JAM, Inc."

Additional rows would be appended for each company in the text file.
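
A minimal sketch of how the colon-delimited fields above might be pulled out in Python, assuming each label ("COMPANY:", "Data Last Updated:", "Status:", "Competing Against:") appears at most once per company and its value sits on the same line. The names PATTERNS and parse_main are illustrative only, and values that spill onto later lines (such as Business Description) are not handled here.

```python
import csv
import io
import re

# Column name -> regex with one capture group. [ \t]* (not \s*) keeps an
# empty field such as "Fax:" from swallowing the following line.
PATTERNS = {
    "CompanyName": r"^COMPANY:[ \t]*(.*?)[ \t]+Data Last Updated:",
    "LastUpdated": r"Data Last Updated:[ \t]*(\S+)",
    "Status": r"^Status:[ \t]*(.*)$",
    "Competitors": r"^Competing Against:[ \t]*(.*)$",
}


def parse_main(record_text):
    """Return one dict (one CSV row) for a single company's text."""
    row = {}
    for column, pattern in PATTERNS.items():
        match = re.search(pattern, record_text, re.MULTILINE | re.IGNORECASE)
        row[column] = match.group(1).strip() if match else ""
    return row


record_text = (
    "COMPANY: +Abc, Inc. Data Last Updated: 04/18/2019\n"
    "Status: Public\n"
    "Fax:\n"
    "Competing Against: BBC Company, JAM, Inc.\n"
)
row = parse_main(record_text)

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(PATTERNS))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

The csv module quotes values that contain commas, so "+Abc, Inc." stays in a single column when the file is read back in R or Python.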

2) For data in a table-like format, here's an example for executives.csv:
(Text Line) First Name Last Name Job Title
(Text Line) Person A Alast Name Director, Development
(Text Line) Person B Blastname Board Member

Row 1 (header row) of a new .csv file would include 4 variables: "CompanyName", "ExecFN", "ExecLN", "ExecJobTitle".
Row 2 would include: "+Abc, Inc.", "Person A", "Alast Name", "Director, Development"
Row 3 would include: "+Abc, Inc.", "Person B", "Blastname", "Board Member"

Additional rows would be appended for each executive listed, for each company.
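
For table-like sections such as this one, one option is to read the column positions off the dashed rule line and slice every data row at those positions. The sketch below assumes the real file keeps its columns aligned under the dashes (20, 30, and 40 characters wide in the executives table) and that a blank line ends the table. parse_fixed_width is a made-up helper name.

```python
import re


def parse_fixed_width(rule_line, data_lines):
    """Cut each data line at the column positions given by the dashed rule."""
    spans = [m.span() for m in re.finditer(r"-+", rule_line)]
    rows = []
    for line in data_lines:
        if not line.strip():
            break  # assumption: a blank line ends the table
        cells = [line[start:end].strip() for start, end in spans[:-1]]
        # Let the last column run to the end of the line.
        cells.append(line[spans[-1][0]:].strip())
        rows.append(cells)
    return rows


# Rebuild aligned sample lines (20-, 30-, and 40-character columns).
rule = "-" * 20 + " " + "-" * 30 + " " + "-" * 40
table = [
    f"{'Person A':<20} {'Alast Name':<30} Director, Development",
    f"{'Person B':<20} {'Blastname':<30} Board Member",
    "",
]
company = "+Abc, Inc."
rows = [[company] + cells for cells in parse_fixed_width(rule, table)]
for r in rows:
    print(r)
```

Slicing by position, instead of splitting on spaces, is what lets multi-word values such as "Alast Name" and "Director, Development" stay intact.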

3) For financing.csv, I would want to connect multiple "tables." Here's an example:
(Text Line) Round No. Round Date Company Financing Stage
(Text Line) 1 01/01/2016 Early Stage
(Text Line) Round Date, Est Round Total Amt ($000), Investor
(Text Line) 01/01/2018 12065 Imma Partners SA - Unspecified Fund

Row 1 (header row) of a new .csv file would include 6 variables: "CompanyName", "RoundNo", "RoundDate", "CompFinancingStage", "RoundTotalAmt", "Investor".
Row 2 would include: "+Abc, Inc.", "1", "01/01/2016", "Early Stage", "12065", "Imma Partners SA - Unspecified Fund"
Row 3 would include: "+Abc, Inc.", "1", "01/01/2016", "Early Stage", "12065", "Annie Group SA - Unspecified Fund"

Additional rows would be added for each company's rounds and for each round investor.
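
Connecting the two tables needs a shared key, and the thread does not state which one the file uses. The sketch below assumes the investor rows link to a round through a matching Round Date and leaves the round fields blank when nothing matches. The sample dates in this thread are not fully consistent, so this assumption would need checking against the real file. join_rounds is an illustrative name, and the inputs are assumed to be already-parsed rows.

```python
def join_rounds(company, rounds, investors):
    """Attach round number and stage to every investor row."""
    # Assumption: Round Date is the key shared by the two tables.
    by_date = {round_date: (round_no, stage)
               for round_no, round_date, stage in rounds}
    joined = []
    for round_date, amount, investor in investors:
        round_no, stage = by_date.get(round_date, ("", ""))
        joined.append([company, round_no, round_date, stage, amount, investor])
    return joined


# Rows as they might come out of a table parser; dates chosen to match.
rounds = [["1", "01/01/2018", "Early Stage"]]
investors = [
    ["01/01/2018", "12065", "Imma Partners SA - Unspecified Fund"],
    ["01/01/2018", "12065", "Annie Group SA - Unspecified Fund"],
]
financing_rows = join_rounds("+Abc, Inc.", rounds, investors)
for row in financing_rows:
    print(row)
```

Each output row follows the column order "CompanyName", "RoundNo", "RoundDate", "CompFinancingStage", "RoundTotalAmt", "Investor".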

4) For financialoutcomes.csv, again more table-like, here's an example:
(Text Line): Yr Ended Yr Ended
(Text Line): 12/31/10 12/31/09
(Text Line): Net Sales 33,123 11,033
(Text Line): Pre-Tax Income 13,216 12,036
(Text Line): Net Income 11,900 8,311
(Text Line): EPS - -
(Text Line): TA 59,633 48,718
(Text Line): SE - -

Row 1 (header row) of a new .csv would include 8 variables: "CompanyName", "YrEnd", "NetSales", "PretaxIncome", "NetIncome", "EPS", "TotAssets", "ShareEquity".
Row 2 would include: "+Abc, Inc.", "12/31/10", "33123", "13216", "11900", "na", "59633", "na"
Row 3 would include: "+Abc, Inc.", "12/31/09", "11033", "12036", "8311", "na", "48718", "na"

Additional rows would be appended for each YrEnd for each company.
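
This block has years as columns and measures as rows, so it needs transposing to get one row per year end. A rough Python sketch follows, assuming the second non-blank line lists the year-end dates and every later line ends with exactly one value per date. A row with a missing (blank) value would break that assumption. LABELS and parse_outcomes are illustrative names.

```python
# Row label in the text -> column name wanted in financialoutcomes.csv.
LABELS = {
    "Net Sales": "NetSales",
    "Pre-Tax Income": "PretaxIncome",
    "Net Income": "NetIncome",
    "EPS": "EPS",
    "TA": "TotAssets",
    "SE": "ShareEquity",
}


def parse_outcomes(company, lines):
    """Turn the year-by-column block into one dict per fiscal year end."""
    lines = [ln for ln in lines if ln.strip()]   # drop blank lines
    dates = lines[1].split()                     # second line holds the dates
    rows = [{"CompanyName": company, "YrEnd": d} for d in dates]
    for line in lines[2:]:
        tokens = line.split()
        # Assumption: the last len(dates) tokens are values, the rest the label.
        label = " ".join(tokens[:-len(dates)])
        values = tokens[-len(dates):]
        column = LABELS.get(label)
        if column is None:
            continue  # skip rows this sketch does not know about
        for row, value in zip(rows, values):
            row[column] = "na" if value == "-" else value.replace(",", "")
    return rows


block = [
    "Yr Ended Yr Ended",
    "12/31/10 12/31/09",
    "",
    "Net Sales 33,123 11,033",
    "Pre-Tax Income 13,216 12,036",
    "Net Income 11,900 8,311",
    "EPS - -",
    "TA 59,633 48,718",
    "SE - -",
]
outcome_rows = parse_outcomes("+Abc, Inc.", block)
for row in outcome_rows:
    print(row)
```

Because the values are taken from the end of each line, this works even when whitespace between columns is irregular, as long as no value contains a space.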

message edited by ineedhelp2019



#5
October 25, 2019 at 23:36:21
Is the data separated by a comma, a tab, or any other unique separator?


#6
October 26, 2019 at 07:33:30
Hi Isabella8, some of the data are separated by colons, but not all. Other data are in a more table-like structure with spaces as separators, but some of those values include multiple words and commas within the value I want to extract. I've tried to provide a representative example of the types of text formats I encounter in this file in my initial post and a follow-up post. Please let me know if these are confusing. This is my first time ever posting on a site like this, but I will attempt to correct or better explain as best I can. Thanks!
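
To illustrate the space-separator problem: if the columns in the real file are padded with at least two spaces while words inside a value are separated by a single space (an assumption that may not hold everywhere), a row can be split on runs of two or more spaces. split_on_gaps is a made-up name for this sketch.

```python
import re


def split_on_gaps(line):
    """Split a table row wherever two or more spaces separate values."""
    return re.split(r" {2,}", line.strip())


cells = split_on_gaps("Person A            Alast Name           Director, Development")
print(cells)
```

This breaks down when a value fills its whole column and only one space separates it from the next value, so slicing by fixed column positions is likely the safer choice where a dashed rule line is available.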

message edited by ineedhelp2019

