CSV Importing Encoding (invalid byte sequence in UTF-8)

This seemed like it would be a simple problem to me:

invalid byte sequence in UTF-8  

So simple: it's only a one-line error! No. Not quite. It took some time to finally conclude that manipulating Files and Tempfiles in Rails is not as straightforward as you would expect.

Now, the ultimate solution in our case didn't end up being brain surgery. I still had to ask myself, "Why is this not something handled within the core of Ruby or Rails?" Anyways...

The Problem

In the next few days @Healthify is launching with a new client (yay!), but this also means we're cramming to get everything just right for them, as is human nature. In this particular instance of cramming, we were importing a lot of new information into our system using our new importing module. We had until Friday to get this information in, but in similar situations in the past @dleve123 and I have spent all night typing furiously, so we planned to do a sample import on Tuesday.

Tuesday rolls around and BAM! The first import does nothing. What? Well, it didn't throw an error, so to Papertrail we go:

#staging server stack trace
invalid byte sequence in UTF-8  

Well, that's encouraging: a simple error (not!).
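
If you've never seen this message before, here is a rough way to reproduce it outside of Rails. This is only a sketch: the sample row is made up, and the regex match simply stands in for whatever parsing step trips over the bad byte in a real import.

```ruby
# Hypothetical sample row: "café" with the é stored as the single
# Latin-1 byte 0xE9, which is not a valid UTF-8 sequence.
raw = "name,city\ncaf\xE9,Paris\n".dup.force_encoding(Encoding::UTF_8)

# Ruby can already tell that these bytes are wrong for UTF-8.
looks_valid = raw.valid_encoding?

# Matching a regex against the broken string is enough to raise.
error_message = begin
  raw =~ /,/
  nil
rescue ArgumentError => e
  e.message
end

puts "valid UTF-8? #{looks_valid}"
puts "error: #{error_message}"
```

The string is tagged as UTF-8 but its bytes don't agree, which is the same situation you end up in when a file in some other encoding gets read as UTF-8.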

The Solution

Google/StackOverflow/RubyDocs/etc. didn't make this terribly simple to resolve. All the pieces were there, but for some reason no one had a simple solution for importing a CSV file encoded in something other than UTF-8. After a lot of trial and error, I created this little method to take a file and return a copy of it with the non-UTF-8 characters removed:

require 'tempfile'

def convert_to_utf8_encoding(original_file)
  original_string = original_file.read
  final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '') # If you'd rather invalid characters be replaced with something else, do so here.
  final_file = Tempfile.new('import') # No need to save a real File
  final_file.write(final_string) # Write the cleaned-up contents
  final_file.close # Don't forget me
  final_file # Hand back the Tempfile; read it again via final_file.path
end

Two things I would simply note:

  1. It might not seem necessary, but be sure to close the file. I'm not used to dealing with Files or Tempfiles, but apparently they behave just like a file you have open in a GUI: what you've written isn't guaranteed to be on disk until you close it.
  2. Assuming you haven't messed with your default encoding, you shouldn't need to specify your desired encoding when writing the file. Simply specify how you'd like to handle the encoding-confused characters.
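
To make the second point a bit more concrete, here is a small sketch of the choices you have for the bad bytes. It uses `String#scrub` (Ruby 2.1+), which is not the call the method above makes but does the same kind of clean-up on a single string. The sample string and the guess that the source was ISO-8859-1 are assumptions for illustration only.

```ruby
# Hypothetical sample: "café" with the é stored as the lone Latin-1 byte 0xE9.
bytes = "caf\xE9"

as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)

dropped = as_utf8.scrub('')  # bad bytes removed        => "caf"
marked  = as_utf8.scrub('?') # bad bytes made visible   => "caf?"

# Assumption: the file really was ISO-8859-1. Telling Ruby so lets it
# transcode the byte into a proper UTF-8 "é" instead of discarding it.
kept = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts dropped
puts marked
puts kept
```

Dropping characters is the blunt option. If you happen to know where the file came from, transcoding from its real encoding loses nothing, but that is rarely something an importer can count on.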


Testing in this case was rather simple. We test most of our importing module with RSpec fixtures that are actually imported. I just ran a test with a file that contained a random Arabic character.
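
A stripped-down version of that kind of check, without RSpec, might look something like the sketch below. The fixture contents are hypothetical (one stray byte, 0xC7, that isn't valid UTF-8 on its own), and the method is repeated with its write and return steps filled in as an assumption so the snippet runs on its own.

```ruby
require 'csv'
require 'tempfile'

# The conversion method, repeated here so the snippet is self-contained.
# The write and return steps are an assumption about the full version.
def convert_to_utf8_encoding(original_file)
  original_string = original_file.read
  final_string = original_string.encode(invalid: :replace, undef: :replace, replace: '')
  final_file = Tempfile.new('import')
  final_file.write(final_string)
  final_file.close
  final_file
end

# Hypothetical fixture: a two-line CSV containing one non-UTF-8 byte.
fixture = Tempfile.new(['fixture', '.csv'])
fixture.binmode
fixture.write("name,city\nAmal\xC7,Cairo\n".b)
fixture.close

# Read the fixture as if it were UTF-8, the way the import would.
converted = File.open(fixture.path, 'r:UTF-8') { |f| convert_to_utf8_encoding(f) }
cleaned = File.read(converted.path, encoding: 'UTF-8')
rows = CSV.parse(cleaned)

puts cleaned.valid_encoding?
p rows
```

The two things worth asserting are that the converted contents are valid UTF-8 and that the CSV parser gets through every row.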

Hope this helps someone!