Text File Parsing with Flatworm and Substring Hacks
Recently I was implementing a new project at work where I had to read and write a bunch of fixed-width files for communicating with a new vendor. My first thought was that I would have to use Java’s String.substring() method to pull out the individual fields. This is ugly because you end up with stuff like
final int FIELD1_START_POS = 10;
final int FIELD1_END_POS = 20;
When you have a few hundred fields, this is exceptionally ugly. Luckily for my code and my sanity, I asked how best to do this on StackOverflow and was pointed to the great little library Flatworm. Flatworm allows you to create an XML descriptor for the file you need to parse, then it reads the file into a plain Java bean for you. It also takes care of parsing the data if needed, converting it to the correct type, stripping unwanted characters, etc. Very, very useful indeed.
Aside: Of course you could also use regular expressions to pull out the data but I don’t see where that would give any advantage. You still have to encode what the field looks like, where it starts, something of that nature. It just feels even more brittle to me. There’s also the more complete route of using lexers, parsers, etc. I don’t know enough about that process to see how it would benefit me in this particular case. Maybe it would, I don’t know for sure. But from what I do know, it seems like overkill and not a big benefit.
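For comparison, here's a rough sketch of what the regular expression route might look like for a line holding a check number, a date, and a dollar total. Everything in it is my own illustration: the class name, the pattern, and the group names are assumptions, and the pattern only covers a MM/dd/yyyy date followed by a plain amount with two decimals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: class, pattern and group names are assumptions, not project code.
class RegexLineSketch {

    // Check number, then a MM/dd/yyyy date, then a dollar amount, separated by padding.
    private static final Pattern CHECK_LINE = Pattern.compile(
            "\\s*(?<number>\\d+)\\s+(?<date>\\d{2}/\\d{2}/\\d{4})\\s+\\$(?<total>\\d+\\.\\d{2})\\s*");

    // Returns {number, date, total}, or null when the line does not match the pattern.
    static String[] parse(String line) {
        Matcher m = CHECK_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("number"), m.group("date"), m.group("total") };
    }

    public static void main(String[] args) {
        String line = "     12345          02/01/2010     $123.45     ";
        String[] fields = parse(line);
        System.out.println(String.join(" | ", fields));
    }
}
```

Notice that the pattern still has to spell out what every field looks like, which is the point made above: the layout knowledge doesn't go away, it just moves into the regex. An amount with a thousands separator, for example, would need a different pattern.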
Fast forward to now, I have another project where I need to read and write text files. “Ah ha!” I say. “I’ll use Flatworm again.” “Nope,” says the universe. Unfortunately the file I need to read runs into a limitation of Flatworm. The file has lines where the data starts on column 10, but then it’s a name that could be any length. Instead of padding out the line to a fixed width, the line simply ends after the name. Flatworm has no way of handling this. I considered hacking Flatworm to handle this condition (and I still might do this as I think it’s useful) but I wanted to try something else first. What I ended up with was better than my first example, I think, but not quite as cool as Flatworm.
Here’s a mockup of the file I’m working with for reference (. is a blank space)
.....12345......................................02/01/2010.....$123.45.....
.....One hundred twenty-three and forty-five cents
..........Matt Grommes
..........98765..........1 Test St..............123.45
Here’s the first version of the parser code I had
check.setCheckDate( new Date(checkLines[0].substring(91, checkLines[0].indexOf("$"))) );
check.setCheckTotal( checkLines[0].substring( checkLines[0].indexOf("$")+1, checkLines[0].length()) );
check.setAmountWords( checkLines[1].substring(10, checkLines[1].length()) );
check.setPayee( checkLines[3].substring( 10, checkLines[3].length()) );
This isn’t optimal because there is a ton of duplicated code, plus the short lines that tripped up Flatworm are a problem here too: substring() throws an exception if a line ends before the index you ask for. I ended up making a new function called getLineValue()
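private static String getLineValue(String line, int beginIndex, int endIndex) {
    String value = "";
    // endIndex 0 is just a shortcut to EndOfLine
    if (endIndex == 0) {
        endIndex = line.length();
    }
    // Only extract when the requested range actually fits inside the line;
    // this also covers a line that ends before beginIndex and an indexOf() result of -1
    if (beginIndex >= 0 && beginIndex <= endIndex && endIndex <= line.length()) {
        value = line.substring(beginIndex, endIndex).trim();
    }
    return value;
}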
This is of course just a wrapper around String.substring() but it lets me do some extra checks and have extra logic like using 0 for endIndex to indicate “go to end of line”.
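Here's a small standalone sketch of how a helper like this behaves on lines of different lengths. The class name and sample lines are made up for illustration, and the copy of the helper below bakes in one assumption of mine: a range that doesn't fit inside the line returns an empty string rather than throwing.

```java
// Sketch only: a standalone copy of the helper so the example compiles by itself.
class LineValueDemo {

    static String getLineValue(String line, int beginIndex, int endIndex) {
        // endIndex 0 is a shortcut for "go to end of line"
        if (endIndex == 0) {
            endIndex = line.length();
        }
        // Assumption: a range that does not fit in the line yields "" instead of an exception
        if (beginIndex < 0 || beginIndex > endIndex || endIndex > line.length()) {
            return "";
        }
        return line.substring(beginIndex, endIndex).trim();
    }

    public static void main(String[] args) {
        // Ten spaces of padding, then a name of any length with no trailing padding
        String payeeLine = String.format("%10s", "") + "Matt Grommes";

        System.out.println(getLineValue(payeeLine, 10, 0));            // whole name, to end of line
        System.out.println(getLineValue(payeeLine, 10, 14));           // first four characters of the name
        System.out.println(getLineValue("     ", 10, 0).isEmpty());    // line ends before column 10
        System.out.println(getLineValue(payeeLine, 10, 99).isEmpty()); // endIndex past the end of the line
    }
}
```

Returning an empty string for a bad range keeps the calling code short, but it also hides malformed lines. If a missing field should stop processing, throwing an exception there would be the safer choice.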
Here’s the modified version, using the new function.
check.setCheckDate( new Date(getLineValue(checkLines[0], 91, checkLines[0].indexOf("$"))) );
check.setCheckTotal( getLineValue(checkLines[0], checkLines[0].indexOf("$")+1, 0) );
check.setAmountWords( getLineValue(checkLines[1], 10, 0) );
check.setPayee( getLineValue(checkLines[3], 10, 0) );
This looks a lot better to my eye, without as much extra code cluttering things up. It’s a lot clearer what I’m doing since you don’t have to pay attention to a bunch of substring() and length() calls. This is only about 1/5 of the total lines of parsing code so hopefully you can see how much better this looks over the course of the whole method. See the Aside above for thoughts on some other ways of doing this.
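One more possible cleanup, sketched under my own assumptions: the Date(String) constructor used above has been deprecated for a long time, and indexOf("$") returns -1 when the dollar sign is missing. The class and method names below are hypothetical, and the MM/dd/yyyy format is a guess based on the mockup rather than a documented layout.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: class and method names are illustrative, not the project's code.
class CheckHeaderSketch {

    // Assumed date layout, based on the 02/01/2010 value in the mockup
    private static final DateTimeFormatter CHECK_DATE = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    // The date field runs from dateStart up to the "$" that introduces the total.
    static LocalDate parseCheckDate(String line, int dateStart) {
        int dollar = line.indexOf('$');
        if (dollar < dateStart) {
            throw new IllegalArgumentException("No $ found after column " + dateStart);
        }
        return LocalDate.parse(line.substring(dateStart, dollar).trim(), CHECK_DATE);
    }

    // The total is everything after the "$", with the padding trimmed off.
    static BigDecimal parseCheckTotal(String line) {
        int dollar = line.indexOf('$');
        if (dollar < 0) {
            throw new IllegalArgumentException("No $ found in line");
        }
        return new BigDecimal(line.substring(dollar + 1).trim());
    }

    public static void main(String[] args) {
        // 5 spaces, check number padded to 10, date padded to 15, then "$" and the padded total
        String line = String.format("%5s%-10s%-15s$%-11s", "", "12345", "02/01/2010", "123.45");
        System.out.println(parseCheckDate(line, 15));
        System.out.println(parseCheckTotal(line));
    }
}
```

Using BigDecimal for the total is a design choice on my part so that money amounts don't pick up floating point rounding. The original code keeps the total as a String, which is also fine if it's only being passed along.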
This wasn’t a big project and there may be better ways of going about it but I was pretty happy with how this ended up. I like seeing less code so when there are ways of cutting extra things out, it’s a win.
Thanks to the couple of redditors that made comments about this post. I’m always looking to get better at this so constructive criticism is welcome.