Need help with UDF (user-defined function)

Experts:

I need some help with creating a user-defined function (UDF) that will scan through the following lines of text.  (Please note that I have thousands of lines like these vs. just 13 lines).  

*******Begin of example lines *****************************************

Ackoff RL (1974a). Bla bla bla...
Ackoff RL and Gharajedaghi J (1996). Bla bla bla...
Ackoff R.L. and Sasieni M.W. (1968). Bla bla bla...
Ackoff R.L. (1999) Bla bla bla...
Ackoff Russell L. et al. (1976) Bla bla bla...
Russel L Ackoff (1981). Bla bla bla...
Anupindi R, Chopra S, Deshmukh SD, Van Mieghem JA, Zemel E (2006) Bla bla bla...
Arthur, W.B. (1999), "Bla bla bla..."
Anderson, P. (1999), 'Bla bla bla...'
Roundy S., M. Strasser, and P. K. Wright, Bla bla bla...
Julie Ann Stuart, Jane C. Ammons and Laura J. Turbini (1999) Bla bla bla...
[17] Kaplan, R.S. (1983), "Bla bla bla...
22. Kauffman S (1995). Bla bla bla...


*******End of example lines *****************************************

The following two (2) UDFs are needed:

1. UDF "SingleName" -- where single name exists in cell A1, UDF shall extract the following:

Ackoff RL                    Ackoff RL
Ackoff R.L.            Ackoff RL  
Ackoff Russell L.      Ackoff RL
Russel L Ackoff      Ackoff RL
Arthur, W.B.            Arthur WB
Anderson, P.            Anderson, P
Kaplan, R.S.              Kaplan RS
Kauffman S            Kauffman S


UDF "MultipleNames" -- where multiple names exists in cell A1, UDF shall extract the following:

Ackoff RL and Gharajedaghi J                                                            Ackoff RL et al.
Ackoff R.L. and Sasieni M.W.                                                            Ackoff RL et al.
Anupindi R, Chopra S, Deshmukh SD, Van Mieghem JA, Zemel E            Anupindi R et al.
Roundy S., M. Strasser, and P. K. Wright                                                Roundy S et al.
Julie Ann Stuart, Jane C. Ammons and Laura J. Turbini                          Stuart JA et al. **

** Note: This one maybe more difficult to determine.

Please see attached XLS for additional details (incl. an example UDF which extracts other information).

Thank you in advance,
EEH
User-Defined-Function.xls
ExpExchHelpAnalystAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

oleggoldCommented:
0
Learn SQL Server Core 2016

This course will introduce you to SQL Server Core 2016, as well as teach you about SSMS, data tools, installation, server configuration, using Management Studio, and writing and executing queries.

oleggoldCommented:
and done
go to vb,Module 1 code paste this:
Function GetTextOnly(S As String) As String
Dim X As Long
GetTextOnly = Space(Len(S))
For X = 1 To Len(S)
If Mid(S, X, 1) Like "[!0-9]" Then Mid(GetTextOnly, X) = Mid(S, X, 1)
Next
GetTextOnly = WorksheetFunction.Trim(GetTextOnly)
End Function
0
oleggoldCommented:
or per Your design
Function SingleName(S As String) As String
Dim X As Long
SingleName = Space(Len(S))
For X = 1 To Len(S)
If Mid(S, X, 1) Like "[!0-9]" Then Mid(SingleName, X) = Mid(S, X, 1)
Next
SingleName = WorksheetFunction.Trim(SingleName)
End Function

Open in new window

0
oleggoldCommented:
here's the file
User-Defined-Function.xls
0
aikimarkCommented:
I had a loooong comment in your question that explains the attached file.  By the time I finished the comment, you had deleted the question and I lost ALL of my explanatory text. :-(

I used regular expressions to do parsing, editing (Find/Replace), and some of the the filtering.
Citations.xls
0
ExpExchHelpAnalystAuthor Commented:
oleggold:

Thank you for the prompt response... the output I'm getting, however, is not exactly what I'm after.

Please refer to the XLS and view the following cell ranges for "clean" examples:

1. C26:C33 (clean single names)
2. F38:F42 (clean multiple names)

Thank you in advance for any additional information that would produce the desired output.

EEH
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

I apologize for that... I thought I had included too much information; thus, no one had responded as of ~1 hour ago.

I've looked at your XLS... did you use functions/UDFs to extract the information?  

Any chance you could please provide me more details on how to trim the information?

Again, sorry for the inconvenience.    (I didn't know one could delete a questions once it's being "worked" on).

EEH
0
oleggoldCommented:
i think you'd better off with 1 function then 2 , here's what i propose ,tweaked gettext only function, may need some debugging:
Function GetTextOnly(S As String) As String
Dim X As Long
Dim h As Long
Dim i As String
GetTextOnly = Space(Len(S))
h = Len(S)
If Mid(S, X, 1) Like "and" Then Mid(GetTextOnly, X) = Mid(i, X, 1)
h = Len(i)
Next
For X = 1 To h
If Mid(S, X, 1) Like "[!0-9]" Then Mid(GetTextOnly, X) = Mid(S, X, 1)
Next
GetTextOnly = WorksheetFunction.Trim(GetTextOnly)

GetTextOnly = WorksheetFunction.Trim(GetTextOnly)
End Function

Open in new window

Function GetTextOnly(S As String) As String
Dim X As Long
Dim h As Long
Dim i As String
GetTextOnly = Space(Len(S))
h = Len(S)
If Mid(S, X, 1) Like "and" Then Mid(GetTextOnly, X) = Mid(i, X, 1)
h = Len(i)
Next
For X = 1 To h
If Mid(S, X, 1) Like "[!0-9]" Then Mid(GetTextOnly, X) = Mid(S, X, 1)
Next
GetTextOnly = WorksheetFunction.Trim(GetTextOnly)

GetTextOnly = WorksheetFunction.Trim(GetTextOnly)
End Function

Open in new window

0
ExpExchHelpAnalystAuthor Commented:
oleggold:

I saw your posted XLS... again, it doesn't provide the information as desired.   The cell range to be cleaned up is in A1:A13.

Cells below row 15 are examples of the "before" and "after".  Essentially, they're already "clean".  

EEH
0
ExpExchHelpAnalystAuthor Commented:
oleggold:

I just noticed your additional two proposed functions.   They contain syntax errors and don't execute.

Did you test them in Excel before posting them?   (Btw, I use Excel 2003... not sure if that makes a difference).

EEH
0
aikimarkCommented:
@EEH

If you delete a question before anyone has posted a comment, it happens immediately.  This was a timing issue where you closed the question between the time I started working on it and when I clicked the SUBMIT button.

First, I ran this regexp pattern against the text file:
(.*?)\((\d{4})\)(?:\. |, | |\.|,)(.*)

I applied the following pattern to the authors values to pull out name words of length 3
or more: \b([A-Z]\w{3,})\b

I added the AuthorName column header and then applied the advanced filter to get a list of unique name values.  I sorted this.  You will use this later.

I removed the period characters from the names column and used RegExp to change those items with an " and " to be of the form _____ et al.
Regexp pattern: (.*?) and .*
Replace pattern: \1 et al.

For the Parsed_2 worksheet, I applied a filter to show rows that contain a comma, followed by four or more characters.

Somewhere along the way, I trimmed the text of leading/trailing spaces.

=============
That was where I was going to start the conversation.  You still have a lot of editing to do.  Fortunately, you can use the unique author name list to filter the big list and do more clean-up activities.
0
ExpExchHelpAnalystAuthor Commented:
oleggold:

I've scanned through the original posts where you included hyperlinks.

I'm sorry... you merely copied pasted some functions (from other experts) in -- as far as I'm concerned -- into the posts.  

As mentioned before, they don't address the underlying problem.  

As a result, there's now plenty of activity and other users -- who actually would have read the problem -- may not chime in now.

I believe I have to repost this question.   I kindly request to only chime in once you've actually determined a solution that works.    

Thank you,
EEH
0
aikimarkCommented:
@EEH

Do not repost this question.  Duplicate questions are not allowed.
0
aikimarkCommented:
to help other participating experts, I've posted the text file from the other (now deleted) question
Citations.txt
0
aikimarkCommented:
@EEH

If you need help with the question title and text, please let me know.
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

How did you run the regexp pattern?   Is that via a UDF?

I was about to award you the point (previous post) and reopen a new thread.   This one includes too much other information which does not address the original question (in this post).

While UDF would be the perfect scenario (e.g., the "PublicationYear") extract, I'll certainly entertain any manual steps given the many different potential scenarios.

Again, thanks for you additional feedback on this.

EEH
0
ExpExchHelpAnalystAuthor Commented:
Ok... I won't post the question again.  

Any chances your extraction patterns could be added into a UDF?

EEH
0
aikimarkCommented:
Any chances your extraction patterns could be added into a UDF?
Yes.

At this point in the problem, it is better to do things manually and get to the point where the data transformation process can be automated with VBA code.
0
ExpExchHelpAnalystAuthor Commented:
Ok... I'm still not what to do with the following, e.g.:

First, I ran this regexp pattern against the text file:
(.*?)\((\d{4})\)(?:\. |, | |\.|,)(.*)


Also, for a slightly different question, please see the following post:
http://www.experts-exchange.com/Software/Office_Productivity/Office_Suites/MS_Office/Excel/Q_28254870.html

Again, thanks for your help thus far.
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Again, I found the formula on the internet... see below.

http://excel.bigresource.com/Find-Text-String-Value-Between-Quotes-E2NXRK3A.html

Please reopen the question.

Thanks,
EEH
0
aikimarkCommented:
@EEH

Let us discuss the state of the data in the worksheets I added to your workbook.

* How will you determine the 'gold standard' for author name?
I would suggest that you populate a couple of columns in the unique name worksheet with the single-author and primary-author values.
Example:
AuthorNames	Sole         	Primary
Ackoff    	Ackoff, RL	Ackoff, RL et al.

Open in new window

* I noticed that "Roundy" is missing from my unique authors list.  Do you see any other problems?  How many entries do you have that resemble the "Roundy" entry?
The problem with then "Roundy" entry is that it doesn't fit the pattern that the other items have, with the date in parenthesis, located between the author and the title.  So, we have identified another problem to be addressed.
Citations-2.xls
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

I appreciate your sending the XLS.

This contains clean data but I still can't seem to identify the process (e.g., formulas, etc.) for getting it clean.

I'll continuously have to reapply those functions to additional hundreds of citations.   Thus, I need to come up w/ a repeatable process for doing so.

Please advise as to how you got this clean in this particular format.

Thanks,
EEH
0
aikimarkCommented:
@EEH

Please continue with the discussion of the questions/items in my prior comment http:#a39537389

I am discussing your data and the required transformation processes.  I can only comment on the data you have posted (in the now deleted question) and not the complete data.

As more of my questions are answered, I will probably have enough confidence in automating that part of the total conversion.  Until then, discuss.  All of my work on this question has been in the area of analysis, regexp pattern applications, and some Excel manual processes to reflect the state of the data.

It is your responsibility to provide me feedback and, possibly, other files.  Do not expect a solution before you have helped me understand the problem and solution criteria.

It might be simpler to post the full citation txt file, rather than the extract.  That way, I would be asking you a more complete set of questions about the data.
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

I appreciate your continued assistance.   As requested, I'm attaching 3 additional files:

1. "FullListOfCitations.txt"  -- this includes the original sources (copied/pasted from publications).

2. "ResultsFromSeminAutomatedExtract.xls" -- this is what the "end product" should look like.   This XLS was created using semi-automated process... with plenty of manual steps in between.   It took a while to create this XLS.   Given that I'll have to repeat with a few additional documents, I need to automate it.   Btw, given the semi-automated/manual processes involved, it includes errors.

3. "UDF 'Year.xls" -- based on earlier research, the extracting function for the publication year works great.


Is there anything else that you'd need.   Like we say "help you to help me".

Thank you again,
EEH
FullListOfCitations.txt
ResultsFromSeminAutomatedExtract.xls
UDF--Year-.xls
0
aikimarkCommented:
1. What about these entries?

AFI ...
Accounting in the Digital Economy, Oxford University Press, Oxford.
Air Force Instruction (AFI) 36-2629,
Allenna Leonard, “The Viable System Model and Its Application to Complex Organizations,” Systemic Practice and Action Research 22, no. 4 (2009): 227–29, doi:10.1007/s11213009- 9126-z.
BBC News ...
Environmental management
Fairtlough, G., The Three Ways of Getting Things Done: Hierarchy, Heterarchy & Responsible Autonomy in Organizations. Triarchy Press, 2005.
Further Reading:
Go¨ ran Broman, John Holmberg and Karl-Henrik Robo¨ rt (2000) Simplicity Without Reduction: Thinking Upstream Towards the Sustainable Society; 30(3):13–25.
ID ...
Ibid ...
Readiness Management Group ...
2. According to the citations.xls workbook you posted, you wanted a comma between the primary/sole author last name and the author initials.  In this most recent workbook, there is no comma separator.
3. Found duplicate entries (see Leonard A)
4. How to distinguish between "Ashby R", "Ashby W", and "Ashby WR" ?
5. Year values not in parenthesis
6. Year values followed by a letter (not shown in your results)
7. quotes are missing from your results
8. multiple year pairs: 1981/1999 for example
9. Year value not between author and title
10. Need to watch out for different authors with same last name: "Johnson HT" and "Johnson M"
11. Need to watch out for short last names, like "Ng"

===============
The approach I usually take is to identify as many patterns as possible.  These patterns are used to parse the lines and to separate/identify the known-pattern data from the unknown pattern and unusual data.

Then, I try to find sets of these pattern-deviant rows that share an identifiable pattern or that can be transformed into a known pattern easily.

Hopefully, you won't be left with many rows to process manually.

The data transformation steps can largely be automated, since most primary/sole authors have unique last names.
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Thanks for the continued help.    I truly appreciate it.

Before I respond to your questions, allow me to provide you an updated XLS with -- with 85% accuracy extracts "publication year" and "publication title".  

What it does not include is anything extract the "author" or "author et al.".   Based on your latest posting, most of your questions address the author scenario... so that's great as it will add the "missing pieces" to the existing XLS.

Ok... now to your questions:

1.  Extreme outliers (see attached JPG) miss most of the 3 elements [author(s), pub year, title)].  If possible, let's ignore them for right now.

2. Which file (of the 2 xls) did you refer to.    Ideally, it will always be [Lastname, 2 initials of firstname without "."] or [Lastname, 2 initials of firstname without "." et al.]

3. Duplicate entries... yes, I may have multiple "Leonard A".... totally expected.

4. Variations of, e.g., Ashby.    After the extraction, I'll later on (manually) do a find/replace so that variations of the same author will be removed.   Most important piece is to extract them into the right format.   The other piece falls into "quality assurance".

5. Pls see attached XLS.   This function "PublicationYear" extracts the first instance of 4-digit values.   Out of nearly 2,000 records, only 12 were incorrect (i.e., pub reference vs. actual year).

6. See #5

7. Unless a title uses an apostrophy (e.g., Jack's Wonderful Journey), I don't want any double/single quotes before/after the title.

8. See #5

9. See #5

10. Understood... this will require human-in-the-loop... as long as the author value is extracted into its own cell, I'll be ok

11. Yes, not sure how to address short last names such as "Ng".   I'm sure they're truly the exception though.  

In summary, I'll be totally thrilled if the cleanup function (focusing on authors) will give me the ~85-90% solution.    There's too much unique cases in this data set (and maybe new ones in future data sets) that would make it too time-consuming for automating this.

Again, thanks in advance for your continued assistance.

EEH
Citation-Extract-with-UDFs----St.xls
Outlier----Ignore.jpg
0
aikimarkCommented:
I found two patterns that covered most of the entries.  If you look at the attached workbook, you will see this code.  The worksheets have been populated and I manually applied  a filter to the Authors list to show author values that do not contain a space.

When prompted, you should select the text file.

You will want to walk through these values and the other Author values to make sure they contain the correct last name and initials format.  The code uses the data from the first encounter of an author last name value.
Therefore, you will see Ackoff R and will probably want to change that to Ackoff RL
Option Explicit

Public Sub Q_28254706_Import()
    Dim wksGood As Worksheet, wksBad As Worksheet
    Dim rngGood As Range, rngBad As Range
    Dim intFN As Integer
    Dim strLine As String, strAuthors As String, strPrimary As String
    Dim oRE_Line As Object
    Dim oRE_Author As Object
    Dim oRE_Primary As Object
    Dim oMatches_Line As Object, oMatch_Line As Object
    Dim oMatches_Author As Object, oMatch_Author As Object
    Dim oMatches_Primary As Object, oMatch_Primary As Object
    Dim oDicUnique As Object
    Dim boolGoodMatch As Boolean
    Dim oDicLinePatterns As Object
    Dim vItem As Variant
    
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Show
    End With
    If Application.FileDialog(msoFileDialogFilePicker).SelectedItems.Count = 0 Then
        MsgBox "No files selected"
        Exit Sub
    End If
    
    Set oRE_Line = CreateObject("vbscript.regexp")
    Set oRE_Author = CreateObject("vbscript.regexp")
    Set oRE_Primary = CreateObject("vbscript.regexp")
    
    oRE_Author.Global = True
    oRE_Author.Pattern = "\b([A-Z]\w{2,})\b"
    oRE_Line.Global = True
    oRE_Primary.Global = True
    Set wksGood = Worksheets("Good")
    Set rngGood = wksGood.Range("A2")
    Set wksBad = Worksheets("Bad")
    Set rngBad = wksBad.Range("A1")
    Set oDicUnique = CreateObject("scripting.dictionary")
    Set oDicLinePatterns = CreateObject("scripting.dictionary")
    oDicLinePatterns.Add "(.*?) (?:[(]{0,1})(\d{4}[a-z]{0,1})(?:[)]{0,1})(?:[.,]{0,1}) (.{10,})", Array(0, 1, 2)
    oDicLinePatterns.Add "(.*?), (.*?) [(]{0,1}(\d{4})[)]{0,1}.{0,1}$", Array(0, 2, 1)
    
    intFN = FreeFile
    Open Application.FileDialog(msoFileDialogFilePicker).SelectedItems(1) For Input As #intFN
    Application.ScreenUpdating = False
    Do
        Line Input #intFN, strLine
        boolGoodMatch = False
        For Each vItem In oDicLinePatterns
            oRE_Line.Pattern = vItem
            If oRE_Line.test(strLine) Then
                Set oMatches_Line = oRE_Line.Execute(strLine)
                strAuthors = oMatches_Line(0).submatches(oDicLinePatterns(vItem)(0))
                wksGood.Range(rngGood, rngGood.Offset(0, 2)).Value = _
                                    Array(strAuthors, _
                                        oMatches_Line(0).submatches(oDicLinePatterns(vItem)(1)), _
                                        oMatches_Line(0).submatches(oDicLinePatterns(vItem)(2)))
                Set rngGood = rngGood.Offset(1)
                Set oMatches_Author = oRE_Author.Execute(strAuthors)
                For Each oMatch_Author In oMatches_Author
                    If strAuthors Like oMatch_Author.Value & "*" Then
                        If oDicUnique.exists(oMatch_Author.Value) Then
                        Else
                            oRE_Primary.Pattern = "(" & oMatch_Author.Value & "[,]{0,1} [^,. ]*)"
                            If oRE_Primary.test(strAuthors) Then
                                Set oMatches_Primary = oRE_Primary.Execute(strAuthors)
                                strPrimary = oMatches_Primary(0)
                            Else
                                strPrimary = oMatch_Author.Value
                            End If
                            oDicUnique.Add oMatch_Author.Value, strPrimary
                        End If
                    End If
                Next
                boolGoodMatch = True
                Exit For
            End If
        Next
        If boolGoodMatch Then
        Else
            rngBad.Value = strLine
            Set rngBad = rngBad.Offset(1)
        End If
    Loop Until EOF(intFN)
    Set rngGood = Worksheets("Authors").Range("A2")
    For Each vItem In oDicUnique
        rngGood.Value = vItem
        rngGood.Offset(0, 1).Value = oDicUnique(vItem)
        Set rngGood = rngGood.Offset(1)
    Next
    rngGood.Sort key1:=rngGood.Cells(1, 1), header:=xlYes
    Worksheets("Authors").Range(Worksheets("Authors").Range("A2"), Worksheets("Authors").Range("A2").End(xlDown)).Offset(, 2).FormulaR1C1 = "=RC[-1]&"" et al."""
    Application.ScreenUpdating = True
    Close intFN
End Sub

Open in new window

Q-28254706.xls
0
aikimarkCommented:
What remains is Update code that will iterate through the cleaned up entries, using the author entries you supplied.  The Update code will update the items with the appropriate Sole/Primary value.
0
ExpExchHelpAnalystAuthor Commented:
Wow... that's unbelievably cool!!!   Thousand thanks for helping me out on this.

Allow me to recap the process (for future cleanup processes using a different input file).

1a. "Good" tab: with the exception of the first row, delete all content
1b. "Authors" tab: with the exception of the first row, delete all content
1c. "Bad" tab: delete all content

2. For future cleanup (with a different data set), add all citations to be cleaned to, e.g., "CitationCleanup.txt"

2. Run macro and select "CitationCleanup.txt"

3. Based on current data file (total of 1948 records), 1901 records ("Good" tab) are separated into [author(s), [year], and [title].  That's great!

4. Based on current data file (total of 1948 records), 87 records are on the "Bad" tab.  Both combined gives me a total of 1948 records.  I'll manually check those 87 records... no problem.

5. Based on current data file, there are 885 records on teh "Authors" tab.  They're broken up into 3 columns.  That's good.

********

Now, here's what I'm not entirely clear on:

a. In the end, the "Good" tab will become the source for input into a database (allowing to run queries etc.).

b. The "Good" tab (column A), however, still has the "unclean" authors in column A.   Meaning, it contains, e.g., "Arcaute, E., Christensen, K., Dahl, T., Espinosa, A. and Sendova, A." (row 67).

Question:

c. Based on b., how do I end up with a clean column A (from tab "Authors") given that either "Arcaute" or "Arcaute, E" (row 25) cannot be matched to "Arcaute, E., Christensen, K., Dahl, T., Espinosa, A. and Sendova, A." (row 67)?

d. Also, on tab "Authors", how do I know that a record should use either the "sole" author or the "primary" author?

Again, thousand thanks to you already.   'Just need some additional clarifications on question c and d.

Finally... is there any chance you could please add some comments to the XLS and then re-post the file?

Cheers,
EEH
0
aikimarkCommented:
Until we get through with testing, manually clear the contents of
*the second through last rows in the Good worksheet,
* column A in the Bad worksheet,
* A2 through the last used cell in the Authors worksheet

As I said in my prior comment, there is another routine that will update the values in column A of the Good worksheet.

What is required for that update process is to have a correct list of author values.  You will edit the values in the first two columns of the Authors worksheet (column B editing should only be required for the vast majority of the items.  You may need to add/delete items in column A to compensate for omissions in the Import routine.  For instance, you have some authors with double last names, starting with "von".  You will need to create entries for these authors and remove the existing "Von" entry.

You also have some authors sharing the same last name. (i.e. Zhu J and Zhu Z) You will need to prevent the Update routine from trying to change these items.  Clear the "Zhu" text from column A in the Authors worksheet.

I'm using the file that you posted earlier (FullListOfCitations.txt) and see 1988 lines/records.  This is different than your 1948 lines.

The formula in column C of the Authors worksheet will automatically evaluate a correct primary author value.  You should not need to change that formula.  If you do, please let me know, since I populate those cells' formulae in the Import routine.
0
aikimarkCommented:
Your text file is similar to mine, since I get 96% Good and you're getting 97% Good.
0
ExpExchHelpAnalystAuthor Commented:
Ok... so, if I'm hearing you right, the current content of the Authors tab will be used to populate some form of "lookup list" (different tab).

If so, I'll review the current 850 authors (columns A and B) and provide an updated list later on.

Btw, I do have 1988 as the total.   The 1948 was a type.   Good tab included 1901 records... Bad tab 87 records.

I'll try to post that authors list within the hour or so.

EEH
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Per previous post, I've cleaned up the authors list.  Renamed tab to "Authors_Lookup" (wanted to make sure if won't get wiped out).

Again, I'm very excited about the existing automation.    If this lookup table (not sure if that's really the envisioned process) will help further cleanup column A in the "Good" tab.

Please let me know if you require additional information.

Cheers,
EEH
AuthorsLookup.xls
0
aikimarkCommented:
@EEH

What do the cells/rows with the yellow background indicate?

What about the name differentiation for the "von ____" authors?
von Bertalanffy
von Hayek
von Hippel
von Krogh

One of the things I had considered was a routine that would allow you to see a filtered list of the items in the Good worksheet that contained that name.  If you mapped a keyboard shortcut to that routine, it might make your editing a bit easier.

Thank you for your effort on the Authors lookup workbook.
0
aikimarkCommented:
There is another way of setting the correct author name data, which you might find easier...

The Import routine can supply just the initials for the column B values in the Authors worksheet, rather than the last name and initials as it is currently doing.  Your editing would be simplified, since you would only need to worry about the initials data/text.
0
ExpExchHelpAnalystAuthor Commented:
Away from desk right now.  

Yellow cells = organization... We potentially can exclude them.

Don't remember all authors off the top of me head.  Bertallanffy, I believe, does not have the "von" as prefix.  Need to verify via find in input file.  I'll let you know later on.

At any rate, I'd be ok to concatenate (e.g., vonHippel)... Assuming it works.

WRT import routine.... I'm all for the easy solution.  However, somehow I must have the full author's lastname as part of the final product.  For instance, SB for 'Stafford Beer' wouldn't work. So, 'Beer, S' is much preferred.

Thank you again for helping me out.

EEH
0
aikimarkCommented:
I've reconfigured the Authors worksheet and tweaked the code that populates it.

I've also created a showfiltered() routine and mapped it to Ctrl+Shift+A
When this macro runs, you will see all the Author cells containing the text in the active cell in the Authors worksheet.  You must select a cell in column A before invoking this routine.

When you are ready to move to another author last name, just press the Enter key.
You should have Caps Lock in effect when you are editing the initials values.

You have several items where the author name is in firstname lastname order.  If we have a lot of these items and no duplicates, then you might change the column C and column D entries for those authors.  Otherwise, it would be simpler to just do those manually.

If an item skipped, then just delete the last name text.  If that causes confusion, we can figure some other way of indicating that the code should skip processing that name.
Q-28254706.xls
0
ExpExchHelpAnalystAuthor Commented:
Thanks for the latest posting of the XLS.   I'm still not entirely clear about the entire process.

Here's what I've done:
- Tab "Good": Deleted all content except first row
- Tab "Authors": Delete all content in columns A and B.  Left formulas in C and D untouched.
- Tab "Bad": Delete all content

Ran the macro and imported data from the original file.

Here's what I see:
1. Tab "Good": Column A (authors) still contains a bunch of "junk"
2. Tab "Authors":  I noticed the new column B (initials)
3. Tab "Bad": No problem... I can manually review those 87 records.

Here are my questions:
1. Based on earlier "author" cleanup, it appears that it's not taken into consideration.  

For example, on the author's tab, let's look at the following rows:
- Row 10:  It should be "Papadopoulos, Agis" vs. "Agis, Papadopoulos"
- Row 21:  It should be "Goodman, Ann" vs. "Ann, Goodman"

I understand that this is due to the article's author referencing "firstname lastname" (vs. lastname, firstname).    However, based on earlier's request to clean up the author file, I somehow thought an of those errors would be resolved.    Or what was the purpose of cleaning up the authors?

2. I'm still not clear as to how we can get a clean column A. in the "Good" tab.   "Good" contains 1901 garbled authors.   How can the clean authors in tab "Authors" be utilized to update column A.

Am I missing something fundamental?  Btw, I do really appreciate your help on this.   I think we're so very close to an almost perfect solution.    The only thing that seems to be missing is having "good" authors on the "Good" tab.   ;)

Makes sense?

EEH
0
aikimarkCommented:
While it is possible for me to use your cleaned-up names, we are dealing with a data-transform process that starts with your text file and produces the worksheet with parsed and cleaned values.  If this is a one-time process, then I can certainly use the authorslookup.xls file.

Do I ignore the yellowed cells?

================
In the attached workbook, I've added an Update routine that prompts you to select a cell in the cleaned up Authors worksheet as well as a cell in your parsed data (hopefully the output of the Import routine).  Any Author last name cells with a yellow background color are ignored.
Public Sub Q_28254706_Update()
    Dim wksSrc As Worksheet, wksTgt As Worksheet
    Dim rngSrc As Range, rngTgt As Range
    Dim rng As Range
    Dim lngSoleCol As Long, lngPrimaryCol As Long
    Dim strLastname As String, strAuthors As String, strPrimary As String
    Dim oRE_Author As Object, oRE_OtherStuff As Object
    Dim oMatches_Author As Object, oMatch_Author As Object
    Dim oMatches_OtherStuff As Object, oMatch_OtherStuff As Object
    Dim oDicLookup As Object
    On Error Resume Next
    Set rngSrc = Application.InputBox("Please click on any cell in the cleaned-up Authors worksheet", , , , , , , 8)
    If rngSrc Is Nothing Then
        MsgBox "No source selected"
        Exit Sub
    End If
    Set rngTgt = Application.InputBox("Please click on any cell in the imported data worksheet", , , , , , , 8)
    If rngTgt Is Nothing Then
        MsgBox "No target selected"
        Exit Sub
    End If
    On Error GoTo 0
    Set wksSrc = rngSrc.Worksheet
    Set wksTgt = rngTgt.Worksheet
    Set rngSrc = wksSrc.Range(wksSrc.Cells(1, 1), wksSrc.Cells(1, 1).End(xlToRight))
    For Each rng In rngSrc.Cells
        Select Case rng.Value
            Case "Sole"
                lngSoleCol = rng.Column
            Case "Primary"
                lngPrimaryCol = rng.Column
        End Select
    Next
    Set rngSrc = wksSrc.Range(wksSrc.Cells(2, 1), wksSrc.Cells(wksSrc.UsedRange.Rows.Count, 1))
    Set oDicLookup = CreateObject("scripting.dictionary")
    For Each rng In rngSrc.Cells
        If rng.Interior.Color = vbYellow Then
        Else
            If oDicLookup.exists(rng.Value) Then
            Else
                oDicLookup.Add rng.Value, rng.Row
            End If
        End If
    Next
    
    Set oRE_Author = CreateObject("vbscript.regexp")
    oRE_Author.Global = True
    oRE_Author.Pattern = "(\w+)(.*)"
    Set oRE_OtherStuff = CreateObject("vbscript.regexp")
    oRE_OtherStuff.Global = True
    oRE_OtherStuff.Pattern = "\w"
    
    Set rngTgt = wksTgt.Range(wksTgt.Cells(2, 1), wksTgt.Cells(2, 1).End(xlDown))
    For Each rng In rngTgt.Cells
        strAuthors = rng.Value
        If oRE_Author.test(strAuthors) Then
            Set oMatches_Author = oRE_Author.Execute(strAuthors)
            strLastname = oMatches_Author(0).submatches(0)
            Set oMatches_OtherStuff = oRE_OtherStuff.Execute(oMatches_Author(0).submatches(1))
            If oDicLookup.exists(strLastname) Then
                'is this a sole or primary author?
                If oMatches_OtherStuff.Count > 3 Then
                    rng.Value = wksSrc.Cells(oDicLookup(strLastname), lngPrimaryCol)
                Else
                    rng.Value = wksSrc.Cells(oDicLookup(strLastname), lngSoleCol)
                End If
'                Stop
            End If
        End If
    Next
End Sub

Open in new window

Q-28254706.xls
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Thanks for posting the latest file.   Allow me to address your questions first.

1. "one-time process?" -- no, I will have to repeat this process several times a month as new citations will be added to the source file.

2. "yellow boxes?" -- ideally, citations by organization would also be cleaned.   Given that some of them have a long string, e.g., "Office of the Vice Chairman of the Chiefs of Staff", this make be rather difficult.   So, as the most important sources are the actual author's, we potentially can exclude the ones marked in yellow.

Follow-up questions:
- For testing purposes, I've removed all content (as previously described)
- Reimported the date from the source file.  
- Click on the "Update" function which prompted me to "click on any cell in the cleaned-up Authors" worksheet.  

I'm not sure what the "update" function does?   For example, we still have author "Ann Goodman" where "Ann" is actually viewed as the lastname (vs. firstname).

Again, I think we're almost there.   The firstname vs. lastname scenario, I potentially could even fix (manually).   Not ideal because it's not easy to recognize that a string (Ann) is placed incorrectly.  

The most important thing (as far as I'm concerned) though is to have column A (Good) reflect clean authors (without all the garbled information).

Is that doable?

EEH
0
aikimarkCommented:
Use the cleaned up authorlookup workbook when prompted by the update routine.

I've been commenting on the name clean-up for some time.  Unless the source of the entries that have the firstname lastname entries can be identified, you will need to manually change either the source file or the imported author cells.  If you are going to change the imported author cells manually, you probably don't need to worry about the update routine for those entries.  Likewise, you will probably want to exclude those imported items that have different authors with the same last name.
0
ExpExchHelpAnalystAuthor Commented:
Ok... I think this will work....

Is there any chance you could add some comments to the actual VBA.   That'll allow me to get "smart" on the coding.    

This would also help with documenting the process... something I'm tasked to do.

Again, thanks in advance!!   This is truly an awesome solution.  

EEH
0
aikimarkCommented:
for the dominant patterned lines (those like the Ackoff rows), I found these 47 candidates that might be firstname lastname order.  However, this list includes the double last name entries for the van_X/von_X authors.
First Names
Agis
Alvarez
Andrew
Ann
Asian
Australian
Australian
Australian
Birrer
Canadian
Carol
Chialin
Cognitive
Dietz
Dominique
Dries
Forest
Glen
Irina
Javier
Julian
Julie
Khairul
Kumar
Lucas
Martin
Michael
Michael
Oei
Qinghua
Rene
Ribbers
Rupesh
Stern
Stuart
Stuart
TPM
Tran
Von
Von
William
Zouwen
van
van
von
von
von
zur

Open in new window

Last Names
Papadopoulos, Avraam Karagiannidis 
Gil M. J. , J. Burgos Jime´nez, J. J. Ce´spedes Lorente 
King and Michael Lenox 
Goodman 
Case Res J 14(2):187–232 AG 
Bureau of Statistics 
National Audit Office 
Publics Service Commission 
FAJ 
Policy Research Network 
Prahinski, Canan Kocabasoglu 
Chen 
Edge 
JLG 
Bollinger, Jacques Pictet 
sen, P.,Glasbergen, P. & Verdaas, C. 
Reinhardt 
Dowell, Stuart Hart and Bernard Yeung 
Maslennikova and David Foley 
Gonza´ lez-Benito, O ´ scar Gonza´ lez-Benito 
Lowe, David Lewis 
Ann Stuart, Jane C. Ammons and Laura J. Turbini 
Akmaliah A, Mohd Fuaad S 
Rajaram and Charles J. Corbett 
HCJ, Baroudi J 
MJC 
Lenox, Andrew King and John Ehrenfeld 
Lenox, Andrew King and John Ehrenfeld 
JLH, van Hemmen JGT, Falkenberg ED, Brinkkemper S 
Zhu, Joseph Sarkis, James J. Cordeiro, Kee-Hung Lai 
Caldentey and Susana Mondschein 
PMA, Schoo K-C 
Kumar Pati, Prem Vrat, Pradeep Kumar 
Review Report 
Hart, Matthew Arnold and Rob Day 
Hart, Matthew Arnold and Rob Day 
Biotech 
CIC 
Foerster H 
Foerster H 
McDonough and Michael Braungart 
Jvd 
Aalst MK, Cannon T, Burton I 
der Heijden K 
Bertalanffy L 
Bertalanffy L 
Bertalanffy L 
Muehlen M, Recker J 

Open in new window

0
aikimarkCommented:
what process did you go through to create the cleaned up authors list?

did you try the Ctrl+Shift+A (invoked) macro on the imported Authors worksheet?
0
ExpExchHelpAnalystAuthor Commented:
Wow... can't believe that your coding can detect when a string is potentially a first name.   Is this based on "lastname comma firstname" vs. (firstname lastname).... w/o comma?

Are you planning on updating the code to "reshuffle" those after the import?   Although I don't know how, it certainly would be awesome as the final product would be much more accurate.

Also, I forgot to mention/ask... is there any way that column C (in Good) will contain only the actual title.   Right now, it includes the title followed by non-critical information such as Journal names, page numbers etc.

- While many title are enclosed in double/single parentheses, not all of them are...
- Other patterns include commas and periods... and maybe be others.

Doable?

EEH
0
aikimarkCommented:
The candidate list of firstname lastname items was done strictly by looking at the length of the first two words before the parenthesized date value.  I didn't look at any of the date-at-the-end lines, non-parenthesized date lines, or dates-with-letters lines.  Since the results aren't highly reliable, it is best if we think about such a candidate list as a pre-process aid to your manual clean-up efforts prior to running the Update routine.

is there any way that column C (in Good) will contain only the actual title...Doable?

While it might be 'doable', it isn't advised.  Take a look at the Titles column in the Good worksheet of the attached workbook.  I did a text-to-column operation on the text, using a period character as the delimiter.  While most of the results are satisfactory, it isn't perfect.  In fact, it might cause you more work to fix the split data.

After the period delimiter split, I executed a few Find/Replace commands, to try and eliminate the quotes and smart quote characters.
Q-28254706.xls
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
ExpExchHelpAnalystAuthor Commented:
Copy that... I'll follow your recommendation.  

I close this posting now... again, thank you so much for your most valuable help on this activity.   I'll make "life" much easier now.   ;)

Any chance you might add some comments (I know... I've asked before) to the VBA?

EEH
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Your help on this effort is truly appreciated.   Your solution is most excellent and elegant.   It'll save many, many hours and will result in more reliable results.

Again, thousand thanks!!!  ;)

EEH
0
aikimarkCommented:
If you are satisfied that we have minimized the amount of manual work in the clean-up process, I will add some comments to the code and post it here.
0
ExpExchHelpAnalystAuthor Commented:
Yes... I'm definitely most satisfied w/ the clean-up process.    Again, for offering up to post the XLS w/ comments later on.

Very much appreciated.

EEH
0
aikimarkCommented:
I have commented the code in this version of the workbook.
Q-28254706.xls
0
aikimarkCommented:
@EEH

Did you try  Ctrl+Shift+A
?
0
ExpExchHelpAnalystAuthor Commented:
Thanks for posting the XLS with comments... very much appreciated.

No, please remind me... what's the "Ctrl+Shift+A" for again?

EEH
0
aikimarkCommented:
go to the authors worksheet, populated by the import routine, select any cell in column A, and press Ctrl+Shift+A

you will be shown the unique author values that contain that author's last name.

The Initial cell and the next author's name cell are selected.

You can change or supply the initials for the author or leave it as it is.

Press the Enter key to go to the next author.
0
aikimarkCommented:
@EEH

Did you test the showfiltered results (Ctrl+Shift+A)?  If so, was it helpful?  If not helpful, what would make it useful for you?
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Yes, I saw it... it seemed a bit confusing though.   Maybe I'm not fully clear of the capability.

EEH
0
aikimarkCommented:
Maybe I'm not fully clear of the capability.
For each of the last name candidates in the first column, the routine will display a unique list of author values that contain that last name.  The active cell will be set to the Initials column (B).

You may change the initials for that author or continue with the next author by just pressing the Enter key.  The showfiltered() routine has automatically selected the next author cell.

Note: The list of author values can also include data that should cause you to avoid that author's name in the Update() processing.  Such reasons include (and not limited to):
* two or more different authors with the same last name
* firstname lastname order
* ineligble author last name, such as "AFI" and a business/institute name
* doubled last name
* the author lastname isn't really a last name (mostly because of the doubled last name pattern I described with the Von ___ and Van ___ authors)

I also played with some code that would automatically invoke the showfiltered() routine when you selected an initials cell in column B.  It is a bit easier to use than pressing Ctrl+Shift+A for every author.

Also, the mechanism the current Update routine uses to exclude an author would probably need to be changed.

=============
I had posted this because you said that there would be several such clean-up processes.  I don't see any way to avoid the manual inspection and correction of the initials as well as selective prevention of some updates.
0
ExpExchHelpAnalystAuthor Commented:
Thanks... I'll give it another look.   I appreciate your sharing the additional information.

EEH
0
ExpExchHelpAnalystAuthor Commented:
aikimark:

Ok... I now see the benefit of this function.   It definitely adds some value.  

Thank you for providing more clarification on this.   'Much obliged.

EEH
0
aikimarkCommented:
@EEH

You're welcome.  Post another question and put a link to that new question in a new comment in this thread.

As a result of preparing a regular expression presentation at a local user group, I think the title splitting and author initials (showfiltered) inspection/cleanup activity can be tweaked.
0
ExpExchHelpAnalystAuthor Commented:
@aikimark:

Hope all is well... again, I appreciate your assistance with this question.

I've now posted an entirely different question on EE.   This also, however, requires help with "pattern analysis".    I believe you indicated that much of your work has been involved with regexp analysis.   Granted, this new question is slightly different but it still requires analysis of data patterns.

Just wanted to share in case you're up for the challenge.   URL is as follows:
http://www.experts-exchange.com/Software/Office_Productivity/Office_Suites/MS_Office/Excel/Q_28272746.html

Cheers,
EEH
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft Excel

From novice to tech pro — start learning today.