I'm hoping for a little help on how to align complex data in excel 2007. In my 1st spreadsheet I have a very large list of GeneIDs arranged in column A with several other columns of data (columns B-T). What I would like to do is add an extra column(s) (U-V) which has COG data in. The COG spreadsheet has GeneID and COG data only, but there are additional or missing GeneIDs in this 2nd spread sheet. Additionally there may be more than 1 row with the same GeneID but a different COG. e.g.
GeneID COG
JW0012 J
JW0013 K
JW0013 L
JW0016 C
JW0017 C
Is there a way to get the align the GeneIDs so that any differences between spreadsheet 1 and spreadsheet 2 results in the addition or deletion of rows from spreadsheet 2 only (it's important that there are not additional rows in spreadsheet 1) and so that if there is more than 1 COG it is added to the same row, but different column as the GeneID?Any help would be really appreciated as it would save me going through some 20,000 data points.
Hmmm - the easiest way i think for me would be to sort the two spreadsheets - first by GeneID then COG.
Once that is done, copy and paste the two GeneID columns into a new blank worksheet, separated by an empty column.
Then use this formula in the first empty slot between the two rows:
=countif(f:f;a2)f being the two columns the data is in, a2 being the first empty slot. This will compare the two rows, and will let you know if there are matches. If there is a match, the a2 columns will display a "1", if there is no match a "0" will display.. All of the ones with a "1" you can safely delete, and there is a mtach, all of the "0", you can copy over to the worksheet you are planning on using, as there is no match.
Hi, I do have a solution that uses formulas.
The reason for using formulas is that the data can be extended or updated without having to re-sort / match or delete dataThe solution appears to meet your criteria in that, the original data remains 'as is' and no blank rows are added.
The COG data is added on the matching line, and any number of duplicate matching gene ID's with COG's will occur on the same row (starting at column U)
The down side of the solution is that it requires one user defined function - so there is a bit of visual basic to copy and paste, and with 20,000 rows of data, all the copies of the user defined function plus all the regular functions in the formulas will significantly slow re-calculation of the worksheet.
I tested this on 5000 Gene ID/COG pairs with some missing Gene ID's and numerous duplicates (up to 12 for some Gene ID's)
There were 100 unique Gene ID's on the main worksheet - they were in random order and the numbers were not sequential.I placed the GeneID/COG pairs in Sheet2 of the main workbook - searching and linking to a second workbook would have added an enormous overhead to the calculation.
The ID's were in column A and the COG'c in column B
They started in row 2 and ended on row 5000
Where you see Sheet2!$A$2:$B$5000, adjust the range as required - although making it larger than currently required allows for extra pairs to be added.
ID / COG pairs do not need to be sorted into order.To add the user defined function do the following:
Right-click the name tab of the worksheet containing the original data
Select View Code
In the Visual Basic window that opens there will be a 'Project Explorer' pane on the left (if not visible, select 'View' from the Visual Basic menu bar and then click 'Project Explorer'
Your worksheet name should be highlighted, below the workbook name
Right-click the worksheet name and select 'Insert' and click Module (not Class module)
Double click the new module - typically 'Module1' under the Modules folder
Paste the following into the large Visual Basic code window:Option Explicit Public Function FindInstV( _ SearchRange As Range, _ SearchValue As String, _ Optional SearchInstance As Integer = 2, _ Optional HOffset As Integer = 0, _ Optional MatchYorN As Boolean = True) _ As Variant '****************************************************************** ' Finds the Nth instance of the search string in the range specified ' Optional Search Instance - default is 2 ' Optional horizontal offset - default is 0 '****************************************************************** Dim InstFind As Object Dim strFrstFnd As String Dim rngNxtAddr As Range Dim rngLastAddr As Range Dim intFindCnt As Integer Dim dblFndNxtRw As Double Dim intFndNxtCl As Integer '****************************************************************** On Error GoTo ErrHnd If SearchInstance = 0 Then GoTo ErrHnd '****************************************************************** 'search forward only With SearchRange intFindCnt = 0 'first search - start After last cell in range, i.e., starts at first cell 'this is a quirk of the find method which starts after the active cell or 'after the first cell in the range Set InstFind = .Find(What:=SearchValue, _ After:=.Cells(SearchRange.Rows.Count, _ SearchRange.Columns.Count), _ LookIn:=xlValues, _ LookAt:=xlWhole, _ SearchOrder:=xlByRows, _ SearchDirection:=xlNext, _ MatchCase:=MatchYorN) If Not InstFind Is Nothing Then 'if an instance found then test if searchinstance =1. 'If >1 go on to test some more If SearchInstance = 1 Then intFindCnt = 1 Set rngLastAddr = InstFind Else intFindCnt = 1 strFrstFnd = InstFind.Address Set rngNxtAddr = InstFind Do 'to re-use Find we need to specify that it starts 'after the last found location dblFndNxtRw = rngNxtAddr.Row - SearchRange.Row + 1 intFndNxtCl = rngNxtAddr.Column - SearchRange.Column + 1 Set rngLastAddr = rngNxtAddr Set InstFind = .Find(What:=SearchValue, _ After:=.Cells(dblFndNxtRw, intFndNxtCl), _ LookIn:=xlValues, _ LookAt:=xlWhole, _ SearchOrder:=xlByRows, _ SearchDirection:=xlNext, _ MatchCase:=MatchYorN) If Not InstFind Is Nothing Then intFindCnt = intFindCnt + 1 Set rngNxtAddr = InstFind End If Loop While Not InstFind Is Nothing And _ InstFind.Address <> strFrstFnd And _ intFindCnt < SearchInstance 'get out of loop if no more instances of search value found, 'or we have looped back to the start 'or we have found the required number of instances 'if loop stopped for required instance found, 'keep the latest address If intFindCnt = SearchInstance Then Set rngLastAddr = rngNxtAddr End If End If End With '****************************************************************** 'If search value not found in range, or the requested instance 'is not found - return an error 'else return the value found using column offset. If intFindCnt = 0 Or intFindCnt < SearchInstance Or _ InstFind.Address = strFrstFnd Then FindInstV = CVErr(xlErrValue) Else 'return the value found with offset if required FindInstV = rngLastAddr.Offset(0, HOffset).Value End If Exit Function '****************************************************************** ErrHnd: FindInstV = CVErr(xlErrNA) End Function
This user defined function is modified version of one I already use - a rather quick 'hack', so some of the description lines may not be 100% accurate. Also I haven't tested it for all types of error handling.
Click Save from the Visual Basic Menu bar
Click Alt + f11 to get back to the main Excel window.On Sheet1 with the original data you need at least one row above the first row containing Gene ID data
This row is used to specify the instance of GeneID/COG pairs.
If you always have less than 12 GeneID/COG pairs for one Gene ID, you can cut back on the number of instances - using excess adds lots of CPU overhead!Gene ID data starts on row 3.
Lets say that you are using row 2 for the instance numbers:
In cell U2 enter this formula:=INT(COLUMN()/2)-9The result should be 1
Select cells U2 and V2 and copy
Select cells W2 to AR2 and Paste
AQ2 will show 12In cell U3 enter this:
=IF(COUNTIF(Sheet2!$A$2:$A$5000,$A3)<U$2,"",findinstV(Sheet2!$A$2:$A$5000,$A3,U$2,0))
In cell V3 enter this:=IF(COUNTIF(Sheet2!$A$2:$A$5000,$A3)<U$2,"",findinstV(Sheet2!$A$2:$A$5000,$A3,U$2,1))Select cells U3 and V3, and Copy
Select cells W3 to AR3 and pasteYou should now have all Gene ID/COG pairs for the Gene ID in cell A3
Once this row works, just select U3 to AR3 and drag the formulas down as many rows as you have data.
For large numbers of rows it is quicker to Copy the cells U3 to AR3 and then with the shift key held down, select U3 to the last required row in column AR, and with all these selected, PasteThere will be a delay as the formulas calculate. The UDF is written in Visual Basic, so it is much slower than built-in functions.
Hopefully you have the matching COG data.
As this is somewhat complicated, I suggest starting with a copy of your workbook, or make a copy of the workbook before you start.
As the UDF is stored in the workbook, do not have two copies of the workbook open at once - even if they have different names as you will get a duplicate error for the UDF name 'FindInstV'
If you choose to go down this path and if you run into difficulties, please reply to the post with as much information about what doesn't work and what you have done, as possible.
Regards