MS excel line up 2 columns

Sony / Sz583n
June 16, 2010 at 05:21:10
Specs: Windows Vista
I'm hoping for a little help on how to align complex data in excel 2007.

In my 1st spreadsheet I have a very large list of GeneIDs arranged in column A with several other columns of data (columns B-T). What I would like to do is add an extra column(s) (U-V) which has COG data in. The COG spreadsheet has GeneID and COG data only, but there are additional or missing GeneIDs in this 2nd spread sheet. Additionally there may be more than 1 row with the same GeneID but a different COG. e.g.

JW0012 J
JW0013 K
JW0013 L
JW0016 C
JW0017 C

Is there a way to get the align the GeneIDs so that any differences between spreadsheet 1 and spreadsheet 2 results in the addition or deletion of rows from spreadsheet 2 only (it's important that there are not additional rows in spreadsheet 1) and so that if there is more than 1 COG it is added to the same row, but different column as the GeneID?

Any help would be really appreciated as it would save me going through some 20,000 data points.

See More: MS excel line up 2 columns

Report •

June 21, 2010 at 11:45:49
Hmmm - the easiest way i think for me would be to sort the two spreadsheets - first by GeneID then COG.
Once that is done, copy and paste the two GeneID columns into a new blank worksheet, separated by an empty column.
Then use this formula in the first empty slot between the two rows:

f being the two columns the data is in, a2 being the first empty slot. This will compare the two rows, and will let you know if there are matches. If there is a match, the a2 columns will display a "1", if there is no match a "0" will display.. All of the ones with a "1" you can safely delete, and there is a mtach, all of the "0", you can copy over to the worksheet you are planning on using, as there is no match.

Report •

June 21, 2010 at 16:09:11

I do have a solution that uses formulas.
The reason for using formulas is that the data can be extended or updated without having to re-sort / match or delete data

The solution appears to meet your criteria in that, the original data remains 'as is' and no blank rows are added.

The COG data is added on the matching line, and any number of duplicate matching gene ID's with COG's will occur on the same row (starting at column U)

The down side of the solution is that it requires one user defined function - so there is a bit of visual basic to copy and paste, and with 20,000 rows of data, all the copies of the user defined function plus all the regular functions in the formulas will significantly slow re-calculation of the worksheet.

I tested this on 5000 Gene ID/COG pairs with some missing Gene ID's and numerous duplicates (up to 12 for some Gene ID's)
There were 100 unique Gene ID's on the main worksheet - they were in random order and the numbers were not sequential.

I placed the GeneID/COG pairs in Sheet2 of the main workbook - searching and linking to a second workbook would have added an enormous overhead to the calculation.
The ID's were in column A and the COG'c in column B
They started in row 2 and ended on row 5000
Where you see Sheet2!$A$2:$B$5000, adjust the range as required - although making it larger than currently required allows for extra pairs to be added.
ID / COG pairs do not need to be sorted into order.

To add the user defined function do the following:
Right-click the name tab of the worksheet containing the original data
Select View Code
In the Visual Basic window that opens there will be a 'Project Explorer' pane on the left (if not visible, select 'View' from the Visual Basic menu bar and then click 'Project Explorer'
Your worksheet name should be highlighted, below the workbook name
Right-click the worksheet name and select 'Insert' and click Module (not Class module)
Double click the new module - typically 'Module1' under the Modules folder
Paste the following into the large Visual Basic code window:

Option Explicit

Public Function FindInstV( _
SearchRange As Range, _
SearchValue As String, _
Optional SearchInstance As Integer = 2, _
Optional HOffset As Integer = 0, _
Optional MatchYorN As Boolean = True) _
As Variant
' Finds the Nth instance of the search string in the range specified
' Optional Search Instance - default is 2
' Optional horizontal offset - default is 0
Dim InstFind As Object
Dim strFrstFnd As String
Dim rngNxtAddr As Range
Dim rngLastAddr As Range
Dim intFindCnt As Integer
Dim dblFndNxtRw As Double 
Dim intFndNxtCl As Integer 
On Error GoTo ErrHnd
If SearchInstance = 0 Then GoTo ErrHnd
'search forward only
    With SearchRange
        intFindCnt = 0
        'first search - start After last cell in range, i.e., starts at first cell
        'this is a quirk of the find method which starts after the active cell or
        'after the first cell in the range
        Set InstFind = .Find(What:=SearchValue, _
                    After:=.Cells(SearchRange.Rows.Count, _
                    SearchRange.Columns.Count), _
                    LookIn:=xlValues, _
                    LookAt:=xlWhole, _
                    SearchOrder:=xlByRows, _
                    SearchDirection:=xlNext, _
        If Not InstFind Is Nothing Then
            'if an instance found then test if searchinstance =1.
            'If >1 go on to test some more
            If SearchInstance = 1 Then
                intFindCnt = 1
                Set rngLastAddr = InstFind
                intFindCnt = 1
                strFrstFnd = InstFind.Address
                Set rngNxtAddr = InstFind
                    'to re-use Find we need to specify that it starts
                    'after the last found location
                    dblFndNxtRw = rngNxtAddr.Row - SearchRange.Row + 1
                    intFndNxtCl = rngNxtAddr.Column - SearchRange.Column + 1
                    Set rngLastAddr = rngNxtAddr
                    Set InstFind = .Find(What:=SearchValue, _
                                After:=.Cells(dblFndNxtRw, intFndNxtCl), _
                                LookIn:=xlValues, _
                                LookAt:=xlWhole, _
                                SearchOrder:=xlByRows, _
                                SearchDirection:=xlNext, _
                    If Not InstFind Is Nothing Then
                        intFindCnt = intFindCnt + 1
                        Set rngNxtAddr = InstFind
                    End If
                Loop While Not InstFind Is Nothing And _
                                InstFind.Address <> strFrstFnd And _
                                intFindCnt < SearchInstance
                'get out of loop if no more instances of search value found,
                'or we have looped back to the start
                'or we have found the required number of instances
                'if loop stopped for required instance found,
                'keep the latest address
                If intFindCnt = SearchInstance Then Set rngLastAddr = rngNxtAddr
            End If
        End If
    End With
'If search value not found in range, or the requested instance
'is not found - return an error
'else return the value found using column offset.
If intFindCnt = 0 Or intFindCnt < SearchInstance Or _
                InstFind.Address = strFrstFnd Then
    FindInstV = CVErr(xlErrValue)
    'return the value found with offset if required
    FindInstV = rngLastAddr.Offset(0, HOffset).Value
End If
Exit Function
FindInstV = CVErr(xlErrNA)
End Function

This user defined function is modified version of one I already use - a rather quick 'hack', so some of the description lines may not be 100% accurate. Also I haven't tested it for all types of error handling.
Click Save from the Visual Basic Menu bar
Click Alt + f11 to get back to the main Excel window.

On Sheet1 with the original data you need at least one row above the first row containing Gene ID data
This row is used to specify the instance of GeneID/COG pairs.
If you always have less than 12 GeneID/COG pairs for one Gene ID, you can cut back on the number of instances - using excess adds lots of CPU overhead!

Gene ID data starts on row 3.

Lets say that you are using row 2 for the instance numbers:
In cell U2 enter this formula:

The result should be 1
Select cells U2 and V2 and copy
Select cells W2 to AR2 and Paste
AQ2 will show 12

In cell U3 enter this:


In cell V3 enter this:

Select cells U3 and V3, and Copy
Select cells W3 to AR3 and paste

You should now have all Gene ID/COG pairs for the Gene ID in cell A3

Once this row works, just select U3 to AR3 and drag the formulas down as many rows as you have data.
For large numbers of rows it is quicker to Copy the cells U3 to AR3 and then with the shift key held down, select U3 to the last required row in column AR, and with all these selected, Paste

There will be a delay as the formulas calculate. The UDF is written in Visual Basic, so it is much slower than built-in functions.

Hopefully you have the matching COG data.

As this is somewhat complicated, I suggest starting with a copy of your workbook, or make a copy of the workbook before you start.

As the UDF is stored in the workbook, do not have two copies of the workbook open at once - even if they have different names as you will get a duplicate error for the UDF name 'FindInstV'

If you choose to go down this path and if you run into difficulties, please reply to the post with as much information about what doesn't work and what you have done, as possible.


Report •

Related Solutions

Ask Question