Identify Special Char in JAVA and Shell Script

enthuguy
Hi Experts,

Could you help me how to find below special characters "&#141;&#26;&#26;" in JAVA and Shell Script please.

This appears just after the first name of a person. Since we are not identifying this, it slips through our process and fails at the end of the complete process. If we could identify this as part of the first parsing step, it would save a lot of reprocessing time, defects, etc.


This is what I see in a XML file
Firstname  Lastname


Same above character in a CSV file after the complete process failed
Firstname � Lastname
Distinguished Expert 2017

Commented:
Please clarify; the info you posted appears to be ASCII encoded.

In a shell script you can use cat -v file

In this case, an image of what you are talking about would help.
ASCII 141
If I'm not mistaken, ASCII 26 ....
Distinguished Expert 2017

Commented:
It might be better for you to outline what it is you are trying to achieve.

Are you looking for pattern replacement?
Top Expert 2016

Commented:
you could use a regex and only allow A-Z and a-z

Author

Commented:
Thanks arnold and David for your quick replies.


What I'm trying to achieve is to run a script every day on the file I'm receiving, identify any line with this special char, and report it.
In XML special characters are escaped with &<chars>;
http://xml.silmaril.ie/specials.html

So you're looking at specially encoded characters:
#141
#26
#26

https://www.alt-codes.net/
lists 141 as an i with an accent and 26 as a control key (right arrow).

It looks like some mis-encoded data.  Perhaps somebody was trying to encode a Unicode character here?

You may just want to replace any similar character with "?" so this becomes ???.

Author

Commented:
Pearson, what would happen / appear when I replace with ?
Top Expert 2016

Commented:
what would happen / appear when I replace with ?
There's no way for us to know that. A question mark is not a good candidate in that it's a shell-special character in Unix though
Could you help me how to find below special characters "&#141;&#26;&#26;" in JAVA and Shell Script please.
Java what? Java and shell scripts are quite different things, but more particularly those characters look like data, so i wonder why data is being mixed with code?
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
What you're asking is very easy. Just use Perl along with POSIX-style regular expressions to find umlaut + other related characters.

I suspect there's much more to your question.

These characters are correct, meaning they are part of people's names.

Once you've found them, nothing you can do to modify them... well... you can change &#141; to a lower case "i" character.

Many people will find this offensive in our hyper sensitive, politically correct climate right now. I know this because usually people will complain about you munging up their name.

Tip: Rather than finding these, better to modify your code (whatever is breaking during processing of these characters) so your code works.

So you will no longer have any consideration of finding these characters, only processing all characters correctly.
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
Suggestion: Attach one of your XML files, along with the transforms you'll be making (like &#141 -> lower case "i").

Likely someone has some XML parsing/replacing code around to handle this...

Or you can just use sed or awk for replacements.

To actually find the string you're looking for, grep will work well or in Java, you'll just use a string search to find the characters.
Top Expert 2016

Commented:
These characters are correct, meaning they are part of people's names.
Why would control character &#26; be in someone's name though?
Distinguished Expert 2017

Commented:
It is a substitute (SUB) character; much depends on what is entered and then converted on the entry side.
Potentially it can be stripped, but before doing anything the impact, if any, has to be assessed,
or whether there is some reliance on these control ...

Author

Commented:
Yes, I'm wondering why this special char would be part of someone's name.

Also, the reason for identifying it is to report back to the source system, so they can analyse and fix the source if they can. If it is legitimate, then we would have to handle it in our code so that it does not error out going forward.

Will try to upload sample xml but i would have to do lots of masking before that.
Top Expert 2016

Commented:
Will try to upload sample xml but i would have to do lots of masking before that.
It doesn't necessarily have to be a full file, just one that shows the full structure
Distinguished Expert 2017

Commented:
a simple regex looking for &#\d+;
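
A minimal Python sketch of that regex check, assuming Python 3 and that the data arrives as text lines containing decimal references such as &#141;. The helper name suspicious_refs and the choice of which code points count as suspicious (C0 controls other than tab/LF/CR, DEL, and the C1 range) are assumptions for illustration, not something the source system defines.

```python
import re

# Decimal numeric character references such as &#141; or &#26;
NUMERIC_REF = re.compile(r"&#(\d+);")


def suspicious_refs(line):
    """Return the code points of numeric references that are control characters.

    Tab (9), LF (10) and CR (13) are treated as harmless; everything else
    below 32, plus 127 (DEL) and 128-159 (C1 controls), is reported.
    """
    hits = []
    for match in NUMERIC_REF.finditer(line):
        code_point = int(match.group(1))
        if code_point in (9, 10, 13):
            continue
        if code_point < 32 or 127 <= code_point <= 159:
            hits.append(code_point)
    return hits


print(suspicious_refs("Firstname &#141;&#26;&#26; Lastname"))  # [141, 26, 26]
print(suspicious_refs("Pen&#227;"))                            # []
```

A reference like &#227; is left alone here because it is an ordinary printable letter; only the control-range values are flagged for reporting.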

Author

Commented:
btw, any suggestion on these chars pls. they are from a text file and in between names again :)

<8d>^Z^Z

Author

Commented:
How to grep this on a text file please.
<8d>^Z^Z

Top Expert 2016

Commented:
They are probably depictions of control characters. You're saying it's literally that in the text file? If so
sed 's/<8d>^Z^Z//g' inputfile.txt

should remove that sequence

Author

Commented:
Please check the uploaded images. This is how it appears: "a question mark inside a diamond".

I don't know how to grep for it, so I had to grep by name.
If I open the same file in vi I see this: <8d>^Z^Z

Is it possible to grep this symbol?
Or is it possible to view the file as a background process, check the char and identify?

Please suggest
966BB81F-9FB1-4F8D-B4CA-ED5138798BF.jpeg
Top Expert 2016

Commented:
The question mark inside a diamond is simply an indication that whatever is trying to show the character that is being encoded cannot do so.
Grep/sed? Don't describe the code you think you need or the steps. Describe the GOAL

Author

Commented:
Goal is

I would like to automate the process of identifying these special chars and extracting a few words from the same line/entry, so I get an alert every day if a special char is found.
Top Expert 2016

Commented:
OK, that's better! Shall think about that. Do you know the character encoding being used?

Author

Commented:
Thanks CEHJ,
sorry no :(
Top Expert 2016

Commented:
What does the following give at the command line?
file inputfile.txt

Author

Commented:
file inputfile.txt
inputfile.txt: cannot open (No such file or directory)
Top Expert 2016

Commented:
Well you need to put the actual name of your real file in there

Author

Commented:
:) sorry

inputfile.txt: ASCII text, with very long lines, with CRLF line terminators
Distinguished Expert 2017

Commented:
It is a text file; not sure what else you expected?
XML.

One option is to modify whatever script you run so that it accepts the ASCII-encoded &#\d+; as a valid entry.
Top Expert 2016

Commented:
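Try the following. Its return code will be the number of control characters in the file (capped at 255, the largest exit status a shell can report). If you want to see output then uncomment the penultimate line

#!/bin/bash
if [ $# -eq 0 ]; then
    echo "Usage: $(basename "${0}") <input file for which to detect the number of control chars>"
    exit 1
fi
input_file=$1
# Keep only control bytes: C0 (except tab, LF, CR), DEL and C1. Octal \037 is 0x1F, so ^Z (\032) is covered
num_c_chars=$( LC_ALL=C tr -d -c '\000-\010\013-\014\016-\037\177-\237' <"$input_file" | wc -c )
#echo $num_c_chars
exit $(( num_c_chars > 255 ? 255 : num_c_chars ))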

Author

Commented:
Thanks so much CEHJ, checking

Author

Commented:
sorry, it returns 0
Even though the inputfile had the symbol
Top Expert 2016

Commented:
Well i've done some testing. Can you please post an edited version of the input file? e.g. all you need do is extract a short section around the control character(s) in question, obfuscating if necessary, retest with that and if it fails, post it here

Author

Commented:
cat sample.txt
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
Distinguished Expert 2017

Commented:
Look at cat -v of the same thing.

Author

Commented:
Cool...good trick

M-oM-?M-=
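
The cat -v notation above can be turned back into the underlying bytes. The sketch below assumes the usual cat -v conventions (M- means the high bit is set, ^X means a control character, ^? means DEL) and Python 3; cat_v_to_bytes is a hypothetical helper written for illustration, and it would misread a line that contains a literal "^" or "M-" in the data.

```python
def cat_v_to_bytes(text):
    """Convert cat -v style notation (M-x, ^X, ^?) back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        high = 0
        if text.startswith("M-", i):
            high = 0x80  # M- prefix: the byte has its high bit set
            i += 2
        if text.startswith("^", i) and i + 1 < len(text):
            ch = text[i + 1]
            # ^? is DEL (0x7F); ^X is the control character X xor 0x40
            out.append(high | (0x7F if ch == "?" else ord(ch) ^ 0x40))
            i += 2
        else:
            out.append(high | ord(text[i]))
            i += 1
    return bytes(out)


raw = cat_v_to_bytes("M-oM-?M-=")
print(raw.hex())                      # efbfbd
print(raw.decode("utf-8") == "\ufffd")  # True
```

The three bytes EF BF BD are the UTF-8 encoding of U+FFFD, the replacement character. That suggests the sample file already holds a literal replacement character rather than the original 8d byte, and none of those three bytes falls in the control ranges counted by the tr script, which would be consistent with it reporting 0.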

Author

Commented:
not able to grep :(
Distinguished Expert 2017

Commented:
Please clarify?
Try using egrep if/when you use regex patterns.
A single statement like "no, that did not work" is ambiguous when multiple people provide different suggestions on what to try.

Author

Commented:
HI arnold,

I did try
1. grep with multiple combinations
2. Partly search
3. with escape char
4. grep with different switches -F
5. egrep again with part of the strings
6. using python (from this link http://www.fileformat.info/info/unicode/char/fffd/index.htm)
7. UTF-8 hex / binary

so far no luck with my little knowledge :(

thx all for trying to help me

Author

Commented:
However, I was trying in Python like this, but I'm still not there. I'm looking for the best way to store this symbol for the string comparison... suggestions please.

>>> str = "Firstname � SomeNameo  GPO BOX 0074"
>>> "GPO" in str
True
>>> "FirstName" in str
False
>>> "Firstname" in str
True
>>> "� " in str
True
>>> "0074" in str
True
>>>

Author

Commented:
Searching word 'TEST'

cat sample.txt
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
SAMPLE
SAMPLE TEST 1
SAMPLE TEST 2
SAMPLE TEST 3
SAMPLE TEST 4
SAMPLE TEST 5

check.py
# Open the file with read only permit
f = open('sample.txt')
# use readline() to read the first line
line = f.readline()
diamond = 'TEST'
# use the read line to read further.
# If the file is not empty keep reading one line
# at a time, till the file is empty
while line:
    # in python 2+
    # print line
    # in python 3 print is a builtin function, so
    # print(line)
    if diamond in line:
        print "Found " + diamond
    else:
        print "Not found!"
    # use readline() to read the next line
    line = f.readline()
f.close()

Output
Not found!
Not found!
Found TEST
Found TEST
Found TEST
Found TEST
Found TEST

Author

Commented:
Searching "�"

# Open the file with read only permit
f = open('sample.txt')
# use readline() to read the first line
line = f.readline()
# diamond = 'TEST'
diamond = '�'
# use the read line to read further.
# If the file is not empty keep reading one line
# at a time, till the file is empty
while line:
    # in python 2+
    # print line
    # in python 3 print is a builtin function, so
    # print(line)
    if diamond in line:
        print "Found " + diamond
    else:
        print "Not found!"
    # use readline() to read the next line
    line = f.readline()
f.close()

Error assigning this special char
python check.py
  File "check.py", line 6
SyntaxError: Non-ASCII character '\xef' in file check.py on line 6, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Author

Commented:
Nearly there I guess

cat sample.txt
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
SAMPLE
SAMPLE TEST 1
SAMPLE TEST 2
SAMPLE TEST 3
SAMPLE TEST 4
SAMPLE TEST 5
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA

cat check.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Open the file with read only permit
f = open('sample.txt')
# use readline() to read the first line
line = f.readline()
diamond = '�'
# use the read line to read further.
# If the file is not empty keep reading one line
# at a time, till the file is empty
while line:
    # in python 2+
    # print line
    # in python 3 print is a builtin function, so
    # print(line)
    if diamond in line:
        print "Found " + diamond
    else:
        print "Not found!"
    # use readline() to read the next line
    line = f.readline()
f.close()

./check.py
Found �
Not found!
Not found!
Not found!
Not found!
Not found!
Not found!
Found �
Found �
Found �

Could you advise/suggest how to improve this, please?
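
One possible direction, sketched under assumptions: Python 3 rather than the Python 2 used in check.py, input read as raw bytes so that no encoding has to be guessed, and a hypothetical helper find_suspect_lines that reports the line number, byte offset and a short snippet for every line containing either a control byte or any non-ASCII byte. The byte ranges are a judgment call, not a rule from the source system.

```python
import re

# Control bytes other than tab, LF and CR, plus DEL and everything above ASCII
SUSPECT = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]")


def find_suspect_lines(lines, context=30):
    """Return (line_number, byte_offset, snippet) for each line with a suspect byte.

    `lines` is any iterable of bytes objects, e.g. a file opened in "rb" mode.
    """
    hits = []
    for number, raw in enumerate(lines, start=1):
        match = SUSPECT.search(raw)
        if match:
            # Show the start of the line; undecodable bytes are displayed as U+FFFD
            snippet = raw.rstrip(b"\r\n")[:context].decode("ascii", errors="replace")
            hits.append((number, match.start(), snippet))
    return hits


sample = [
    b"Firstname \xef\xbf\xbd SomeNameo  GPO BOX 0074\r\n",
    b"SAMPLE TEST 1\r\n",
    b"Firstname \x8d\x1a\x1a Lastname\r\n",
]
for number, offset, snippet in find_suspect_lines(sample):
    print(f"line {number}, byte {offset}: {snippet}")
```

For a real file the call would be along the lines of `with open("sample.txt", "rb") as fh: hits = find_suspect_lines(fh)`, and a daily job could send an alert whenever `hits` is non-empty.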

Author

Commented:
finally the grep way

grep -P '[^\x00-\x7f]' sample.txt

Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA
Firstname � SomeNameo  GPO BOX 0074                            NEWYORK CITY                              NEWYORK                          USA                                                   SAMPLE DATA SAMPLE DATA                                    SAMPLE DATA SAMPLE DATA SAMPLE DATA

Top Expert 2016

Commented:
It's not helpful to paste unfortunately.
You need to attach examples as files.

Author

Commented:
Yeah CEHJ, I understand. sorry about that. Thanks for trying to help me out though. Good people in EE :)

but anything I upload/attach is scanned and sometimes its creates an alert to my manager
Top Expert 2016

Commented:
grep -P '[^\x00-\x7f]' sample.txt

That i'm afraid is inadequate since you're 'allowing' all sorts of characters that absolutely should not be present in an xml file
e.g. &#26; (one you originally reported)

If you can't attach, please do the following:
tr 'A-Za-z' 'x' < input.txt | xxd >out.txt

and then paste the contents of the file 'out.txt' into code tags. You'll notice that all letters are replaced by 'x'
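
If tr or xxd is not available, roughly the same masked dump can be produced in Python. This is a sketch assuming Python 3; masked_hexdump is a hypothetical helper, and its output layout is a simplified offset/hex/text format rather than xxd's exact format.

```python
def masked_hexdump(data, width=16):
    """Return hex-dump lines for `data` with every ASCII letter replaced by 'x'."""
    masked = bytes(
        0x78 if (0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A) else b
        for b in data
    )
    lines = []
    for offset in range(0, len(masked), width):
        chunk = masked[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as '.' in the text column
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  {text_part}")
    return lines


for line in masked_hexdump(b"Ab \xef\xbf\xbd\n"):
    print(line)
# 00000000  78 78 20 ef bf bd 0a  xx ....
```

Digits, punctuation and every non-ASCII or control byte pass through unchanged, so the problem bytes stay visible while the names themselves are hidden.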
Distinguished Expert 2017

Commented:
It seems you are getting the hang of how to search for a pattern.
If the variations you are looking for are open-ended, you can use a negative check:
look for a name without any variances.
^[A-Za-z' \-]+$ if true, the name contains only plain characters, spaces, hyphens and '; else, the name contains some variance.
It is easier to look for what you expect and kick out variances, as a uniform approach, versus having to explicitly identify all the current variances and look for them specifically.
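
A minimal Python sketch of that allow-list idea, assuming Python 3 and that the name field has already been extracted from the line; is_plain_name is a hypothetical helper. Note that legitimate accented names are also reported as variances with this pattern, so a hit means "needs a look", not "is wrong".

```python
import re

# Same character class as the pattern above: letters, apostrophe, space, hyphen
PLAIN_NAME = re.compile(r"[A-Za-z' \-]+")


def is_plain_name(name):
    """True if the whole name consists only of the expected plain characters."""
    return PLAIN_NAME.fullmatch(name) is not None


print(is_plain_name("O'Brien-Smith"))              # True
print(is_plain_name("Firstname \ufffd Lastname"))  # False
print(is_plain_name("Pen\u00e3"))                  # False
```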
Top Expert 2016

Commented:
Already did that way back HERE
The question is why it's returning zero when there are non-printing characters. In order to find out why, real data must be provided (see my last comment)
Distinguished Expert 2017

Commented:
The asker's examples deal with looking for the variances.
Because of the nature of the data that is being dealt with, the asker does not post a sample.

It is unclear to me whether it would be sufficient to modify the script running into this issue, as it seems validation should be done in the front end rather than after the fact. Or what action, if any, can be taken:
stripping, replacing, interpreting.. The mode of data entry that includes control characters.

If the issue is with searching for these records once added. .. And this is the reason for the normalization...
Top Expert 2016

Commented:
Because of the nature of the data that is being dealt with, the asker does not post a sample.
Again, i gave the solution to that problem HERE
That can't possibly leave any revealing data and removes the need for attachments
Distinguished Expert 2017

Commented:
My point is that the asker has been provided with additional information, such as considerations one should take into account in their approach.
IMHO, it is of little importance to list all the current variances

This options provided might have the asker alter/adjust/modify......

Possibly find approach that takes the different items here in addressing the .......
Distinguished Expert 2017

Commented:
@CEHJ,

You could have very well provided a solution.
Top Expert 2016

Commented:
Yes, but i am told at least one control character has been undetected. We need to find out why
Distinguished Expert 2017

Commented:
This is my suggestion, instead of looking for what was not found, look for what we expect and everything else will be covered no matter how infrequent that might be or what rare event it could be.
i.e. the first name and last name with spaces, dashes and '; anything different should be looked at...
Top Expert 2016

Commented:
We expect (afaik) zero to infinity control chars mixed in with legitimate content
Theoretically, apart from standard control chars such as newline,carriage return, tab etc. there is really no place for anything exotic and non-printing (such as at the top of this question) in an xml file and anything present should be flagged up. Of course removal can use similar techniques
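
The rule described here matches the Char production of the XML 1.0 specification, which allows tab, LF, CR, and then U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. A sketch assuming Python 3 and already-decoded text; the helper names are hypothetical. Under that production U+001A (&#26;) is not allowed at all, while U+008D (&#141;) is technically permitted, so flagging the C1 range as well is a stricter local policy rather than an XML requirement.

```python
def is_xml10_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def illegal_xml_chars(text):
    """Return (index, code_point) for every character XML 1.0 does not allow."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if not is_xml10_char(ord(ch))]


print(illegal_xml_chars("Pen \u00e3\x1a\x1a"))  # [(5, 26), (6, 26)]
print(is_xml10_char(0x8D))                      # True
```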
Distinguished Expert 2017

Commented:
No disagreement there. The application in question is what might provide for these entries. E.g. a web portal where the user controls the input encoding and the app takes in the characters as they come: a person has an apostrophe in their name, but on input the ' comes in as a curly quote; or a person has an a in the name, but it is entered as ã, and the processing might convert that into &#227;
instead of Pena  you will have Pen &#227;&#26;&#26;
I think everyone has provided different approaches to address the items raised by the asker, and the asker, based on their Python and grep examples, has the information to implement the approach best suited to their case.
Top Expert 2016

Commented:
instead of Pena  you will have Pen &#227;&#26;&#26;
I'm not actually clear what would cause the two end characters ...
Distinguished Expert 2017

Commented:
I think it deals with the backspace to clear out the extra space in the encoding (speculation on my part) &#26; is a substitute (^Z)
Without access ..
One would usually think that Penã would be transmitted as Pen&#227; but in the XML encoding it comes in as Pen &#227;&#26;&#26;

Author

Commented:
thanks again everyone. Without EE, i may not achieve things as quickly as I can.

hats off to everyone who spends their precious time to contribute and help others :)
Commented:
for now this is ok

grep -P '[^\x00-\x7f]' sample.txt
Top Expert 2016

Commented:
I don't understand that - it ignores all sorts of control characters
