Let me preface this by saying I'm a complete amateur when it comes to RegEx and only started a few days ago. I'm trying to solve a problem formatting a file and have hit a hitch with a particular type of data. The input file is structured like this:
Two words,Word,Word,Word,"Number, number"
What I need to do is format it like this...
"Two words","Word","Word","Word","Number, number"
I have had a RegEx pattern of
s/,/","/g
working, except it also replaces the comma in the already quoted Number, number section, which causes the field to separate and breaks the file. Essentially, I need to modify my pattern to replace a comma with "," [quote comma quote], but only when that comma isn't followed by a space. Note that the other fields will never have a space following the comma, only the delimited number list.
I managed to write up
s/,[A-Za-z0-9]/","/g
which, while matching the appropriate strings, would replace the comma AND the following letter. I have heard of backreferences and think that might be what I need to use? My understanding was that
s/(,)[A-Za-z0-9]\b
should work, but it doesn't.
Anyone have an idea?
-
s/,([^ ])/","$1/
will match a "," followed by a "not-a-space", capturing the not-a-space, then replace the whole match with "," followed by the captured character. Depending on which regex engine you're using, you might write \1 or something else instead of $1.
If you're using Perl or otherwise have access to a regex engine with negative lookahead,
s/,(?! )/","/
(a "," not followed by a space) works. In either version, add the g flag (as in your own attempts) if every such comma on the line should be replaced rather than just the first.
Your input looks like CSV, though, and if it actually is, you'd be better off parsing it with a real CSV parser rather than with regexes. There are a lot of other odd corner cases to worry about.
IL : I'd like to do it through the CSV parser but I'm being given these files as they are, without my having any say in the actual format. So I'm stuck fiddling with RegEx.
IL : s/,(?! )/","/ worked perfectly, thanks. I'm using Perl so I can run a script against the files as I'm sent them. Saves opening it in a parser and working with it there. Besides, I wanted to learn Perl and RegEx anyway, so two birds with one stone. Thanks for your help :).
From ephemient -
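For anyone who wants to try the two substitutions from ephemient's answer outside Perl, here is a minimal sketch using Python's re module, which also supports negative lookahead. The sample line is the one from the question and the variable names are only illustrative. Python writes the backreference as \1, and re.sub replaces every match by default, like Perl's g flag.

```python
import re

line = 'Two words,Word,Word,Word,"Number, number"'

# Capture-group version: match a comma plus one non-space character,
# then put the captured character back after the inserted quotes.
with_group = re.sub(r',([^ ])', r'","\1', line)

# Lookahead version: match only the comma, and only when the next
# character is not a space. Nothing after the comma is consumed.
with_lookahead = re.sub(r',(?! )', '","', line)

print(with_group)
print(with_lookahead)
```

On this sample line both calls print the same text, and the comma inside "Number, number" is left alone. The comma in front of the already-quoted last field is followed by a quote rather than a space, so it is rewritten as well, and no quote is added at the very start of the line; real files may need a little extra handling for those two spots.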
My experience has been that this is not a great use of regexes. As already said, CSV files are better handled by real CSV parsers. You didn't tag a language, so it's hard to tell, but in Perl, I use Text::CSV_XS or DBD::CSV (which lets me use SQL to access a CSV file as if it were a table and which, of course, uses Text::CSV_XS under the covers). Far simpler than rolling my own, and far more robust than using regexes.
From Tanktalus -
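To illustrate the parser-based approach Tanktalus describes, here is a rough sketch using Python's standard csv module in place of Text::CSV_XS. It assumes the input really is ordinary CSV; the function name quote_all_fields is made up for this example.

```python
import csv
import io

def quote_all_fields(text):
    """Re-emit CSV text with every field wrapped in double quotes."""
    out = io.StringIO()
    # QUOTE_ALL tells the writer to quote every field it emits.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    # The reader already understands quoted fields that contain commas.
    for row in csv.reader(io.StringIO(text)):
        writer.writerow(row)
    return out.getvalue()

sample = 'Two words,Word,Word,Word,"Number, number"\n'
print(quote_all_fields(sample), end="")
```

For the sample line this prints "Two words","Word","Word","Word","Number, number", since the parser keeps the quoted number list together as one field before the writer re-quotes everything.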
This question is similar to: Replace patterns that are inside delimiters using a regular expression call.
This could work:
s/"([^"]*)"|([^",]+)/"$1$2"/g
ephemient : Heh, that accomplishes what Isaac wants instead of what he asked for :) You could be a little fancier, and handle CSV's quote escaping too... but there's not much point to handling it with a regex when pre-built CSV parsers can do better.
IL : There were two main reasons I went with doing it this way. One, I wanted to learn Perl, and then RegEx seemed like it could solve this problem. Second, I'm being handed these files regularly and being able to just run a script against them saves me a bunch of time.
From MizardX -
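MizardX's single-pattern approach can be sketched in Python as follows. This is an approximate translation, not the original Perl: the names FIELD and quote_fields are invented here, and a replacement function stands in for "$1$2" because only one of the two groups takes part in any given match.

```python
import re

# Either a quoted field (group 1 holds its contents) or a run of
# characters containing neither quotes nor commas (group 2).
FIELD = re.compile(r'"([^"]*)"|([^",]+)')

def quote_fields(line):
    # The group that did not take part in the match is None,
    # so fall back to an empty string before re-quoting.
    return FIELD.sub(
        lambda m: '"' + (m.group(1) or m.group(2) or '') + '"', line
    )

print(quote_fields('Two words,Word,Word,Word,"Number, number"'))
```

On the sample line this yields the fully quoted form the question asks for. Two limits of the sketch: an empty unquoted field (as in a,,b) stays unquoted because the pattern never matches an empty string, and doubled quotes inside a field are not handled, which is the escaping case ephemient mentions.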
Looks like you're using Sed.
While your pattern seems to be a little inconsistent, I'm assuming you'd like every item separated by commas to have quotations around it. Otherwise, you're looking at areas of computational complexity regular expressions are not meant to handle.
Through sed, your command would be:
sed 's/[ \"]*,[ \"]*/\", \"/g'
Note that you'll still have to put double quotes at the beginning and end of the string.
From Robert Elwell
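To see what the sed substitution above does to the sample line from the question, here is an approximate Python equivalent. It assumes sed treats \" inside the bracket expression as a plain double quote; the variable names are illustrative.

```python
import re

line = 'Two words,Word,Word,Word,"Number, number"'

# Same idea as the sed command: a comma together with any spaces or
# double quotes on either side of it is replaced by quote, comma,
# space, quote.
result = re.sub(r'[ "]*,[ "]*', '", "', line)
print(result)
```

This prints Two words", "Word", "Word", "Word", "Number", "number". The comma inside the quoted number list matches as well, so that field ends up split in two. That is consistent with the assumption stated in the answer (every comma-separated item gets its own quotes) but differs from the requirement in the question.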