Walt Mankowski on 2 Apr 2010 14:44:26 -0700
On Fri, Apr 02, 2010 at 04:04:36PM -0400, John Karr wrote:
> My preprocessor is pretty good right now (I crunch 153 fields down to 67,
> split them across 6 smaller tables, and make a bunch of corrections). The
> main error I get is duplicate records within the dump, which is fixable.
> But no matter how tight I make my preprocessor, I know the data source
> will find a new error to throw at me. And while I can ignore a few bad
> records out of 10 million that don't import, 1 bad record can stop an
> entire county like Philadelphia or Allegheny from importing (and if I
> break it into arbitrary batches of 10,000, what about the other 9,999
> records?), so I strongly prefer an import method that isn't broken by a
> few bad records.

Could you split the input by county or zip code and then bulk import each
file separately? That way you'd at least have a smaller file to try to fix.

If not, it seems to me the best method (suggested already) is to do a bulk
load into a test database. When you find errors, fix them there and try
again. Presumably the bulk load is fast, so you could go through several
rounds of debugging and fixing errors in the time it's currently taking you
to load it a record at a time. Then, once the data is clean, loading it into
production should be easy.

Walt
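For illustration, here is a minimal Python sketch of the split-by-county
idea. The input file name, the tab delimiter, and the county being in the
first column are assumptions about the dump format, not details from the
thread; adjust them to match the real layout.

    #!/usr/bin/env python3
    # Split a delimited voter dump into one file per county so that each
    # county can be bulk-loaded (and debugged) separately.
    # Assumptions: tab-delimited input named voters.txt, county in column 0.

    import csv

    writers = {}  # county name -> (file handle, csv writer)

    with open("voters.txt", newline="") as src:
        reader = csv.reader(src, delimiter="\t")
        for row in reader:
            county = row[0].strip()
            if county not in writers:
                fh = open("voters_%s.txt" % county, "w", newline="")
                writers[county] = (fh, csv.writer(fh, delimiter="\t"))
            writers[county][1].writerow(row)

    for fh, _writer in writers.values():
        fh.close()

Each per-county output file can then be bulk-imported on its own, so a bad
record only blocks the county it belongs to rather than the whole load.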