John Karr on 2 Apr 2010 13:06:30 -0700


Re: [PLUG] postgres data loading

My preprocessor is pretty good right now (I crunch 153 fields down to 67
that are split across 6 smaller tables, and make a bunch of corrections);
the main error I still get is duplicate records within the dump, which is
fixable. But no matter how tight I make the preprocessor, I know the data
source will find a new error to throw at me. If a few records out of 10
million don't import I can ignore the problem, but 1 bad record can stop an
entire county like Philadelphia or Allegheny from importing (and if I break
the load into arbitrary batches of 10,000, what about the other 9,999
records in a failed batch?), so I strongly prefer an import method that
isn't broken by a few bad records.
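For example (a rough sketch using DBD::Pg; the next_record() reader, the
connection details, and the table layout are all placeholders), a savepoint
around each row inside a batched transaction lets one bad record roll back
only itself while the rest of the batch commits:

    use strict;
    use warnings;
    use DBI;

    # Rough sketch: batch the inserts in one transaction, but guard each
    # row with a savepoint so a single bad record rolls back only itself.
    # Connection string, table, and columns are placeholders.
    my $dbh = DBI->connect('dbi:Pg:dbname=voters', 'user', 'password',
                           { AutoCommit => 0, RaiseError => 1 });

    my $sth = $dbh->prepare(
        'INSERT INTO voters (id, name, county) VALUES (?, ?, ?)');

    my @bad;       # failed rows, kept for later inspection
    my $count = 0;

    while (my $row = next_record()) {    # next_record() is hypothetical
        $dbh->pg_savepoint('row');
        eval { $sth->execute(@$row) };
        if ($@) {
            $dbh->pg_rollback_to('row'); # undo just this one row
            push @bad, $row;
        }
        $dbh->pg_release('row');
        $dbh->commit if ++$count % 10_000 == 0;  # commit in batches
    }
    $dbh->commit;
    $dbh->disconnect;

That should get most of the speed of batched transactions without letting
one bad record take the other 9,999 down with it.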

I have experience with a non-free SQL implementation that has a bulk
operations mode that would help speed up the type of insert I'm doing, and
bulk import tools that can skip bad records. Because this is for an
organization with limited funds, my choices are Postgres and MySQL, and for
some reason I really dislike working with MySQL (if this is solvable in
MySQL but not Postgres then I'll have to reconsider my choice of db).

By the way, the data is a static copy, so once imported it won't change
until the next reload. In that other SQL Server I could always leave the db
in bulk mode (disabling transaction logging).
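Since the data never changes between reloads, another idea (again only a
sketch; the staging and target tables, file name, and dump format are
invented) is to COPY the raw dump into an all-text staging table, which
should almost never reject a line, then move rows into the real table with
INSERT ... SELECT so anything that fails validation just stays behind:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=voters', 'user', 'password',
                           { AutoCommit => 1, RaiseError => 1 });

    # Staging table where every column is text, so COPY only chokes on
    # truly malformed lines, never on type errors.  Names are invented,
    # and this assumes the dump is tab-delimited (COPY's default format).
    $dbh->do('CREATE TABLE staging (id text, name text, county text)');

    $dbh->do('COPY staging FROM STDIN');
    open my $fh, '<', 'county_dump.txt' or die "county_dump.txt: $!";
    while (my $line = <$fh>) {
        $dbh->pg_putcopydata($line);
    }
    $dbh->pg_putcopyend();
    close $fh;

    # Move only the rows that pass validation; anything the WHERE clause
    # rejects simply stays in staging for inspection.  The checks shown
    # are only examples.
    $dbh->do(q{
        INSERT INTO voters (id, name, county)
        SELECT id::int, name, county
          FROM staging
         WHERE id ~ '^[0-9]+$'
           AND county <> ''
    });

Dropping indexes on the target table before the INSERT ... SELECT and
recreating them afterward should speed the load up further.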

-----Original Message-----
[] On Behalf Of Eric
Sent: Friday, April 02, 2010 3:20 PM
To: Philadelphia Linux User's Group Discussion List
Subject: Re: [PLUG] postgres data loading


What kind of errors are causing you problems?  If the errors are
frequently of the same type (for example: characters in an integer
field or an invalid date) then perhaps you could pre-process the data
with a script (suggest: Perl).  I have frequently had to import data
from disparate sources and have almost always had to pre-process the
data to protect against this kind of thing.  Typically, I throw the
records that cannot be automatically corrected into a separate "bad"
file so that they can be manually scanned and then re-processed.
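A skeleton of that kind of filter (the delimiter and the checks here are
only examples; the field count comes from your data) might look like:

    use strict;
    use warnings;

    # Split an incoming dump into records that look loadable and records
    # that need a human.  Delimiter and checks are only examples.
    open my $good, '>', 'clean.txt' or die "clean.txt: $!";
    open my $bad,  '>', 'bad.txt'   or die "bad.txt: $!";
    my %seen;    # catch duplicate ids within the same dump

    while (my $line = <>) {
        chomp $line;
        my @f = split /\t/, $line, -1;
        if (@f == 153                  # expected field count
            && $f[0] =~ /^\d+$/        # first field: numeric id
            && !$seen{$f[0]}++) {      # drop duplicates
            print {$good} "$line\n";
        } else {
            print {$bad} "$line\n";
        }
    }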


John Karr wrote:
> I'm working on a database that involves periodic replacement of a 10
> million record database from an external source.
> The postgres bulk load (copy from) is very sensitive to errors: a single
> bad record crashes the entire operation, so my import adds records one at
> a time so I can deal with or ignore failures.
> On my test machine it takes nearly a day to load the database this way. I
> tried using transactions, but only got about a 30% speed boost, at the
> cost of a single error crashing the whole batch.
> Any ideas on how to do this load faster? Normally this load would occur
> about once every 3 months, but in times of peak activity it could be
> weekly, and it wouldn't be acceptable to have the application down for a
> day for every load.

#  Eric Lucas
#                "Oh! I have slipped the surly bonds of Earth
#                 And danced the skies on laughter-silvered wings...
#                                        -- John Gillespie Magee Jr
