
PostgreSQL Non-Latin Characters


BrandonHeat (Member)

I've been working on a project that requires a simple search form using PHP and PostgreSQL, with the data imported from XML files. The tricky part is that the XML files have to contain Latin, Cyrillic, Korean and Japanese characters. I figured that if I used UTF-8 encoding for both the XML/HTML pages and the database, everything would work fine. The non-Latin characters look garbled when I view them directly in the database, but they display correctly on the page.

The problem is with searching. When I search for an English title, or anything in Latin characters, it works fine, but when I enter a Cyrillic, Japanese or Korean search string, I get no results at all. Any idea why that happens and how I can fix it?


spyware (Banned)

You probably need to enforce UTF-8 encoding on the page where you search, and/or convert the search string to UTF-8 in PHP before querying.
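A rough sketch of the "convert the string" step, written in Python for illustration rather than PHP. The helper name `to_utf8_text` and the windows-1251 fallback are assumptions for the example, not code from this thread.

```python
def to_utf8_text(raw: bytes, fallback: str = "cp1251") -> str:
    """Decode raw form input as UTF-8, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is windows-1251,
        # a common legacy encoding for Cyrillic text.
        return raw.decode(fallback)


# Both byte sequences end up as the same text, ready to be sent to the database as UTF-8.
from_utf8 = to_utf8_text("Домът".encode("utf-8"))
from_cp1251 = to_utf8_text("Домът".encode("cp1251"))
print(from_utf8 == from_cp1251)
```

Guessing the fallback encoding is fragile, so it is safer to know what the form actually submits whenever that is possible.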


BrandonHeat (Member)

I had already enforced UTF-8 on the search page and thought that would be enough, but converting the string in PHP actually did the trick. Thanks.


Edit: It appears this isn't over yet, and things got even weirder. Japanese and Korean searches now work as well, but Cyrillic works only if I copy the word directly from the XML file; if I type it on the keyboard I get no results. This doesn't make any sense to me, and I'm even more confused now. Any idea how to solve that?

P.S Here's the title copied from the XML: Eндивaл Дoмът нa Хaoсa Here's the same title typed using the keyboard: Ендивал Домът на Хаоса


spyware (Banned)

It might be that the XML editor you're using sanitizes the data when it displays or saves it, i.e. while those two strings look the same, the data you copy is different.

Try echoing the copied and typed string in PHP, and see how they differ (paste back the results here if you like).
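Echoing alone may not reveal the difference if the two strings only look alike, so it can help to dump the code point of every character. Here is a minimal sketch in Python (not PHP); the sample strings are hypothetical and written with escapes so the difference is explicit, and the function names are made up for the example.

```python
import unicodedata


def describe(text):
    """List each character with its code point and Unicode name."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


def diff_positions(a, b):
    """Return (index, char_a, char_b) where the strings differ (up to the shorter length)."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical sample strings, not the poster's actual data:
copied = "E\u043d\u0434"      # Latin "E" followed by two Cyrillic letters
typed = "\u0415\u043d\u0434"  # Cyrillic "Е" followed by the same two letters

for index, a, b in diff_positions(copied, typed):
    print(index, describe(a)[0], describe(b)[0])
```

If the output shows names from different scripts at the same position, the strings contain look-alike letters and no encoding conversion will make them match.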


BrandonHeat (Member)

I get the same result when echoing them: copied one - Eндивaл Дoмът нa Хaoсa, typed one - Ендивал Домът на Хаоса. I tried comparing them online using http://www.textdiff.com/, and the result is that they are 100% different. I'm not really sure why that is. Even if they are using different encodings (like windows-1251 and UTF-8), I convert both of them to UTF-8 before searching, so that shouldn't be a problem. I'm really at a loss here.

SOLVED: In the end it was just bad luck, I guess. When testing Cyrillic I was always searching for the first entry and didn't try the others because I figured they wouldn't work either. It turned out they did, and the first one was the only one that wasn't working. Thinking back, I typed all of the others by hand, but for this one I was lazy and copied the title from another site, which was obviously using a different encoding. I was converting it to UTF-8 anyway, but I guess the conversion didn't work properly. It doesn't matter now: I updated the XML entry by typing the title myself, and everything works.
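For anyone who hits the same thing with pasted titles: one possible cause, which this thread does not confirm, is a title that mixes look-alike letters from two alphabets. The Python sketch below flags such entries so they can be retyped. The function names and sample titles are hypothetical, and the script is guessed from the first word of each letter's Unicode name, which is only a heuristic.

```python
import unicodedata


def scripts_in(text):
    """Return the set of script labels (first word of each letter's Unicode name)."""
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def mixed_script_titles(titles):
    """Return the titles whose letters come from more than one script."""
    return [t for t in titles if len(scripts_in(t)) > 1]


# Hypothetical titles: the second one has a Latin "o" between Cyrillic letters.
titles = ["\u0414\u043e\u043c", "\u0414o\u043c", "Home"]
print(mixed_script_titles(titles))
```

Japanese titles normally combine kana and kanji, so a check like this is only meaningful for the Latin and Cyrillic entries.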