John Karr on 2 Apr 2010 13:06:30 -0700
My preprocessor is pretty good right now (I crunch 153 fields down to 67, split them across 6 smaller tables, and make a bunch of corrections). The main error I still get is duplicate records within the dump, which is fixable. But no matter how tight I make the preprocessor, I know the data source will find a new error to throw at me. A few records out of 10 million failing to import is a problem I can ignore, but 1 bad record can stop an entire county like Philadelphia or Allegheny from importing (and if I break the load into arbitrary batches of 10,000, what happens to the other 9,999 records?). So I strongly prefer an import method that isn't broken by a few bad records.

I have experience with a non-free SQL implementation that has a bulk-operations mode that would speed up the kind of insert I'm doing, and bulk import tools that can skip bad records. Because this is for an organization with limited funds, my choices are postgres and mysql, and for some reason I really dislike working with mysql (if this is solvable in mysql but not postgres, then I must reconsider the choice of db). By the way, the data is a static copy, so once imported it won't change until the next reload. In that other SQL Server I could always leave the db in bulk mode (disabling transaction logging).

-----Original Message-----
From: plug-bounces@lists.phillylinux.org [mailto:plug-bounces@lists.phillylinux.org] On Behalf Of Eric
Sent: Friday, April 02, 2010 3:20 PM
To: Philadelphia Linux User's Group Discussion List
Subject: Re: [PLUG] postgres data loading

John:

What kind of errors are causing you problems? If the errors are frequently of the same type (for example, characters in an integer field or an invalid date), then perhaps you could pre-process the data with a script (suggest: perl). I have frequently had to import data from disparate sources and have almost always had to pre-process it to protect against this kind of thing. Typically, I throw the records that cannot be automatically corrected into a separate "bad" file so that they can be manually scanned and then re-processed.

Eric

John Karr wrote:
> I'm working on a database that involves periodic replacement of a 10 million
> record database from an external source.
>
> The postgres bulk load (copy from) is very sensitive to errors and aborts
> the entire operation on a single error, so my import script adds records one
> at a time so I can deal with or ignore failures.
>
> On my test machine it takes nearly a day to load the database this way. I
> tried using transactions, but only got about a 30% speed boost, at the cost
> of a single error in a batch crashing the whole batch.
>
> Any ideas on how to do this load faster? Normally this load would occur
> about once every 3 months, but in times of peak activity it could be weekly,
> and it wouldn't be acceptable to have the application down for a day for
> every load.

--
# Eric Lucas
#
# "Oh, I have slipped the surly bond of earth
#  And danced the skies on laughter-silvered wings...
#  -- John Gillespie Magee Jr
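
A minimal sketch of the kind of pre-processing pass Eric describes: auto-correct what can be fixed, drop duplicates within the dump, and divert anything unfixable to a separate "bad" file for manual review. Eric suggests perl; this sketch is in Python, and the three columns, their positions, and the tab delimiter are made-up assumptions for illustration (the real dump has 153 fields).

    #!/usr/bin/env python3
    """Pre-processing sketch: fix what can be fixed, drop in-dump duplicates,
    and push unfixable records to a "bad" file for manual review.
    Column layout and the tab delimiter are assumptions for illustration."""

    import csv
    import re

    DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2}$')   # assumed target date format

    def clean(row):
        """Return a corrected row, or raise ValueError if it cannot be fixed."""
        rec_id = row[0].strip()
        if not rec_id.isdigit():
            raise ValueError('non-numeric record id: %r' % rec_id)
        reg_date = row[2].strip()
        if not DATE_RE.match(reg_date):
            raise ValueError('bad date: %r' % reg_date)
        return [rec_id, row[1].strip().upper(), reg_date]

    def preprocess(src, good_path, bad_path):
        seen = set()                               # duplicate detection by id
        with open(src, newline='') as fin, \
             open(good_path, 'w', newline='') as fgood, \
             open(bad_path, 'w') as fbad:
            reader = csv.reader(fin, delimiter='\t')
            writer = csv.writer(fgood, delimiter='\t', lineterminator='\n')
            for row in reader:
                try:
                    fixed = clean(row)
                except (ValueError, IndexError) as err:
                    # unfixable: keep the raw record plus the reason
                    fbad.write('\t'.join(row) + '\t# ' + str(err) + '\n')
                    continue
                if fixed[0] in seen:               # duplicate within the dump
                    continue
                seen.add(fixed[0])
                writer.writerow(fixed)

    if __name__ == '__main__':
        preprocess('dump.txt', 'clean.txt', 'bad.txt')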
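On the "batches of 10,000" worry: one way to keep a single bad record from taking down the other 9,999 is to give each row its own SAVEPOINT inside the batch transaction, so only the offending row is rolled back. A sketch, assuming the psycopg2 driver and a made-up voters table; the per-row savepoint adds overhead, so it trades back some of the speed gained from batching.

    """Batch-with-savepoints sketch: rows go in batches of 10,000, but each
    row gets its own SAVEPOINT, so a bad row is rolled back alone and the
    other 9,999 still commit.  Table, columns, and DSN are made up."""

    import psycopg2

    INSERT = ("INSERT INTO voters (rec_id, county, reg_date) "
              "VALUES (%s, %s, %s)")

    def load(rows, dsn, batch_size=10000):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        bad = []
        for i, row in enumerate(rows, 1):
            cur.execute('SAVEPOINT rec')
            try:
                cur.execute(INSERT, row)
            except psycopg2.DatabaseError:
                cur.execute('ROLLBACK TO SAVEPOINT rec')   # undo this row only
                bad.append(row)
            else:
                cur.execute('RELEASE SAVEPOINT rec')
            if i % batch_size == 0:
                conn.commit()                      # one transaction per batch
        conn.commit()
        conn.close()
        return bad                                 # candidates for the bad file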
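And on getting closer to COPY speed without losing a whole load to one record: COPY the cleaned file in chunks, and drop down to row-at-a-time inserts only for a chunk that COPY rejects. Clean chunks load at COPY speed, and a bad record costs only its own chunk the slow path. Again a sketch on psycopg2, with an assumed table name, column list, chunk size, and tab-delimited layout.

    """COPY-in-chunks sketch: bulk-load the cleaned file with COPY and fall
    back to per-row inserts only for a chunk that COPY rejects.  Table name,
    columns, chunk size, and file layout are assumptions."""

    import io
    import psycopg2

    COLUMNS = ('rec_id', 'county', 'reg_date')     # assumed target columns

    def copy_chunks(path, dsn, chunk_size=10000):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        with open(path) as fin:
            while True:
                chunk = [line for _, line in zip(range(chunk_size), fin)]
                if not chunk:
                    break
                try:
                    cur.copy_from(io.StringIO(''.join(chunk)), 'voters',
                                  sep='\t', columns=COLUMNS)
                    conn.commit()                  # whole chunk in at COPY speed
                except psycopg2.DatabaseError:
                    conn.rollback()
                    # slow path for this chunk only: skip just the bad records
                    # (the cleaned file has exactly three tab-separated fields)
                    for line in chunk:
                        try:
                            cur.execute(
                                "INSERT INTO voters (rec_id, county, reg_date) "
                                "VALUES (%s, %s, %s)",
                                line.rstrip('\n').split('\t'))
                            conn.commit()
                        except psycopg2.DatabaseError:
                            conn.rollback()
        conn.close()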