Walt Mankowski on 2 Apr 2010 14:44:26 -0700



Re: [PLUG] postgres data loading


On Fri, Apr 02, 2010 at 04:04:36PM -0400, John Karr wrote:
> My preprocessor is pretty good right now (I crunch 153 fields down to 67
> that are split across 6 smaller tables, and I make a bunch of
> corrections). The main error I get is duplicate records within the dump,
> which is fixable. But no matter how tight I make my preprocessor, I know
> the data source will find a new error to throw at me. If a few records
> out of 10 million don't import I can ignore the problem, but 1 bad record
> can stop an entire county like Philadelphia or Allegheny from importing
> (and if I break it into arbitrary batches of 10,000, what about the other
> 9,999 records?), so I strongly prefer an import method that isn't broken
> by a few bad records.
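
On the duplicates within the dump, a quick dedup pass before the load
usually does it.  Here's a rough Python sketch; the key columns, the
delimiter, and the file names are placeholders, not anything from your
real pipeline:

    import csv

    def dedupe(in_path, out_path, key_cols=(0,), delimiter='\t'):
        # Keep only the first occurrence of each key within the dump.
        seen = set()
        with open(in_path, newline='') as src, \
             open(out_path, 'w', newline='') as dst:
            reader = csv.reader(src, delimiter=delimiter)
            writer = csv.writer(dst, delimiter=delimiter)
            for row in reader:
                key = tuple(row[i] for i in key_cols)
                if key not in seen:
                    seen.add(key)
                    writer.writerow(row)

    dedupe('county_dump.txt', 'county_dump.deduped.txt')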

Could you split the input by county or zip code and then bulk import
each file separately?  That way you'd at least have a smaller file to
try to fix.
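
Something along these lines is roughly what I have in mind.  It's only a
sketch: the county column position, delimiter, table name, and connection
string are all guesses, not anything from your actual schema.

    import csv
    import psycopg2

    COUNTY_COL = 3        # hypothetical position of the county field
    DELIM = '\t'

    def split_by_county(dump_path):
        # Write one file per county; return the county names.
        outfiles = {}
        with open(dump_path, newline='') as src:
            for row in csv.reader(src, delimiter=DELIM):
                county = row[COUNTY_COL]
                if county not in outfiles:
                    outfiles[county] = open(county + '.txt', 'w', newline='')
                csv.writer(outfiles[county], delimiter=DELIM).writerow(row)
        for f in outfiles.values():
            f.close()
        return sorted(outfiles)

    def load_counties(counties):
        conn = psycopg2.connect('dbname=voters')   # hypothetical database
        for county in counties:
            try:
                # COPY the county file in one shot; a failure only rolls
                # back this county's transaction, not the whole run.
                with conn, conn.cursor() as cur, open(county + '.txt') as f:
                    cur.copy_from(f, 'voters', sep=DELIM)
            except psycopg2.Error as err:
                print('%s failed, fix its file and reload it: %s'
                      % (county, err))
        conn.close()

    load_counties(split_by_county('statewide_dump.txt'))

The nice part is that a bad record only kills the COPY for its own
county, and you know exactly which file to go dig through.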

If not, it seems to me the best method (suggested already) is to do a
bulk load into a test database.  When you find errors, fix them there
and try again.  Presumably the bulk load is fast, so you could go
through several rounds of debugging and fixing errors in the time it
currently takes to load the data a record at a time.  Then when it's
ready, loading it into production should be easy.
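
The retry loop against the test database could be as dumb as this; the
table, connection string, and file name here are placeholders:

    import psycopg2

    def try_bulk_load(path, dsn='dbname=voters_test',
                      table='voters_staging'):
        # One COPY attempt; True on a clean load, False on any error.
        conn = psycopg2.connect(dsn)
        try:
            with conn, conn.cursor() as cur, open(path) as f:
                cur.copy_from(f, table, sep='\t')
            return True
        except psycopg2.Error as err:
            # COPY stops at the first bad record, and its message usually
            # names the offending line, so that line can be fixed directly.
            print('load of %s failed: %s' % (path, err))
            return False
        finally:
            conn.close()

    # Re-run against the test database until the file goes in cleanly;
    # each round only costs one fast COPY.
    while not try_bulk_load('philadelphia.txt'):
        input('fix the reported line, then press Enter to retry ')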

Walt

