csv-to-sqlite 2.0 - memory improvements

When I first published my small csv-to-sqlite script over three years ago, I didn't really expect anyone to use it. But it seems to have gained a few fans over the years. One thing that drew a few complaints was its memory usage - one user reported needing 30GB of RAM to import a 3GB .csv file. Yeah, that really does sound a bit... excessive. To put it very mildly.

Of course, the main problem here was that I didn't really expect anyone to use it on files that big. And since I'd already been reading the entire file anyway (to determine the column types), I thought I might as well store it all in memory and then just dump it into the database.

Well, that is no longer the case. I've done some tests with a ~500MB .csv file - and where it required ~4GB of memory before, it now makes do with just ~15MB. The data are now read and inserted in batches of 10,000 rows at a time.
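
For illustration, here's a minimal sketch of that kind of batched import, using Python's standard csv and sqlite3 modules. The function name, the table handling and the batch constant are my own assumptions for the example, not the tool's actual code:

```python
import csv
import sqlite3
from itertools import islice

BATCH_SIZE = 10_000  # rows passed to each executemany() call

def import_csv(csv_path, db_path, table):
    """Stream a CSV file into SQLite in fixed-size batches."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        headers = next(reader)
        conn = sqlite3.connect(db_path)
        cols = ', '.join(f'"{h}"' for h in headers)
        placeholders = ', '.join('?' for _ in headers)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
        while True:
            # Only BATCH_SIZE rows are ever held in memory at once.
            batch = list(islice(reader, BATCH_SIZE))
            if not batch:
                break
            conn.executemany(insert_sql, batch)
        conn.commit()
        conn.close()
```

The point is simply that memory usage now depends on the batch size, not on the size of the input file.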

However, the restructuring required to make this work also meant a breaking change to the CLI.

New type detection

Where there were two kinds of type detection before, there are three now.

  • none: as before, this just types every column as string. No processing is done.
  • full: also as before, this reads the entire file, making sure every value will fit into the proposed column type. Since the file is no longer stored in memory, the input file is read twice in this mode - once to determine the types, and a second time to insert the data into the DB. (I can't insert the data during the first pass, because the column types needed to create the table are only known once the whole file has been read.)
  • quick: the new default. In this mode, the column types are based on the values in the first data row. If you know that's good enough for your data, there's no need to read the entire file. This is quick and, in most cases, sufficient. (A rough sketch of both approaches follows this list.)
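
To make the difference more concrete, here's a rough sketch of how this kind of type inference could work. The helper names and the INTEGER/REAL/TEXT widening scheme are my own assumptions for the example, not the actual implementation:

```python
import csv

def guess_type(value):
    # Map a single CSV value to a SQLite column type.
    try:
        int(value)
        return 'INTEGER'
    except ValueError:
        pass
    try:
        float(value)
        return 'REAL'
    except ValueError:
        return 'TEXT'

def infer_column_types(csv_path, mode='quick'):
    # 'quick' looks only at the first data row; 'full' scans every row
    # and widens a column's type whenever a value doesn't fit the current guess.
    rank = {'INTEGER': 0, 'REAL': 1, 'TEXT': 2}
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        headers = next(reader)
        types = None
        for row in reader:
            guesses = [guess_type(v) for v in row]
            if types is None:
                types = guesses
            else:
                types = [a if rank[a] >= rank[b] else b
                         for a, b in zip(types, guesses)]
            if mode == 'quick':
                break
        return dict(zip(headers, types or ['TEXT'] * len(headers)))
```

With the inferred types in hand, the CREATE TABLE statement from the earlier sketch would simply include them (e.g. "id" INTEGER, "name" TEXT); in full mode that can only happen after the scan, which is why the file ends up being read twice.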

Other than that, the CLI remains the same. I suspect most users won't even notice the change. Well, except for the fact that you now shouldn't be running out of memory. 🙂