Encode content before calling LWP #27

Open
wants to merge 1 commit into
base: master

@ksmadsen ksmadsen commented Nov 4, 2014

Hi,

This issue caught us today. I hope this patch, which includes a test case, is acceptable.

When LWP encodes a HASH-ref, as is done in generic_solr_request, it uses the
URI module to build the www-form-urlencoded content from that HASH-ref.

The URI module tries to deduce the desired target charset from the utf8 flag
on each string, which means that the strings sent to Solr are not UTF-8 encoded
unless the caller has either decoded them into character strings or encoded
them as UTF-8 bytes.

This is confusing: the string "\xc6", for example, is passed on to Solr as a
UTF-8 encoded Æ when it goes through the add method, but the same string
results in an error from Solr when it is used with the search method.
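The flag dependence can be seen without Solr at all. The following is a minimal sketch, assuming a reasonably recent URI release; form_query is a hypothetical helper written for this illustration, not part of LWP or WebService::Solr. It builds the query string the same way a form HASH-ref is serialised, once for a byte string and once for the same character with the utf8 flag switched on.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: serialise a single form field through URI's
# query_form, which is what LWP relies on for HASH-ref content.
sub form_query {
    my ($value) = @_;
    my $uri = URI->new('http:');
    $uri->query_form( q => $value );
    return $uri->query;
}

my $bytes = "\xc6";       # one character, utf8 flag off
my $chars = "\xc6";
utf8::upgrade($chars);    # same character, utf8 flag on

# The two strings compare equal in Perl, yet they can escape differently.
print "flag off: ", form_query($bytes), "\n";
print "flag on:  ", form_query($chars), "\n";
```

With the flag off I would expect the single byte to be percent-escaped as-is, which is not valid UTF-8 on the Solr side; with the flag on, the character should be escaped as its two-byte UTF-8 sequence.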

To fix this issue, encode all of the parameter strings in
generic_solr_request before passing them on to LWP. This aligns the charset
behaviour of generic_solr_request with that of _send_update.

Note: This will break applications that already encode strings to UTF-8 before
calling WebService::Solr's generic_solr_request, search or auto_suggest.
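As a rough illustration of the idea (not the code in this patch), the sketch below encodes every key and value of a parameter hash to UTF-8 bytes before it would be handed to LWP. encode_params is a hypothetical name, and the handling of array-ref values is an assumption about how multi-valued parameters might be passed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not the actual patch: return a copy of the
# parameter hash with every key and value encoded to UTF-8 bytes.
sub encode_params {
    my ($params) = @_;
    my %encoded;
    for my $key ( keys %$params ) {
        my $value = $params->{$key};
        $encoded{ encode( 'UTF-8', $key ) } =
            ref($value) eq 'ARRAY'
            ? [ map { encode( 'UTF-8', $_ ) } @$value ]   # multi-valued parameter
            : encode( 'UTF-8', $value );
    }
    return \%encoded;
}

my $params = encode_params( { q => "\xc6", fq => ["type:\xe6"] } );

# Print the bytes of the encoded query value in hex.
print join( ' ', map { sprintf '%02X', ord } split //, $params->{q} ), "\n";
```

This also shows why the note above matters: a string that the caller has already encoded to UTF-8 would be encoded a second time by such a helper.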

@petdance
Owner

petdance commented Nov 5, 2014

I can't even begin to assess if this is something that we should put in. Encodings are far from my strong suit.

@petdance petdance added the Bug label Dec 10, 2016
@tonycoz

tonycoz commented Jun 14, 2019

Something like this is needed. For example:

#!perl
use strict;
use warnings;
use WebService::Solr;

binmode STDOUT, ":utf8";

my $text = "alpha\xFF";

# default managed schema includes id and _text_
my $doc =
  {
    id => "abc",
    _text_ => $text,
  };

my $solr = WebService::Solr->new("http://127.0.0.1:8983/solr/test");
$solr->ping
  or die;

$solr->add($doc)
  or die;

use Data::Dumper;
my $resp = $solr->search($text);
$resp->is_success
  or die Dumper($resp->content);

results in:

$VAR1 = {
          'error' => {
                       'metadata' => [
                                       'error-class',
                                       'org.apache.solr.common.SolrException',
                                       'root-error-class',
                                       'org.apache.solr.common.SolrException'
                                     ],
                       'msg' => 'URLDecoder: Invalid character encoding detected after position 10 of query string / form data (while parsing as UTF-8)',
                       'code' => 400
                     }
        };

Note that the text is being encoded correctly during indexing because _send_update() encodes the document before sending:

    HTTP::Headers->new( Content_Type => 'text/xml; charset=utf-8' ),
    '<?xml version="1.0" encoding="UTF-8"?>' . encode( 'utf8', "$xml" )

while search() doesn't.

@karenetheridge

related - libwww-perl/URI#48
