wikipedia: increase REXML entity expansion limit during XML parsing #199

Draft: wants to merge 1 commit into base: master
15 changes: 14 additions & 1 deletion lib/datasets/wikipedia.rb
@@ -48,11 +48,16 @@ def each(&block)
       open_data do |input|
         listener = ArticlesListener.new(block)
         parser = REXML::Parsers::StreamParser.new(input, listener)
-        parser.parse
+        with_increased_entity_expansion_text_limit do
+          parser.parse
+        end
       end
     end

     private
+
+    ENTITY_EXPANSION_TEXT_LIMIT = 1_342_177_280
+
     def base_name
       "#{@language}wiki-latest-#{type_in_path}.xml.bz2"
     end
@@ -80,6 +85,14 @@ def type_in_path
       end
     end

+    def with_increased_entity_expansion_text_limit
+      default_limit = REXML::Security.entity_expansion_text_limit
+      REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
+      yield
+    ensure
+      REXML::Security.entity_expansion_text_limit = default_limit
+    end
+
     class ArticlesListener
       include REXML::StreamListener

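The save-and-restore pattern used by `with_increased_entity_expansion_text_limit` can be tried in isolation. The following is a minimal sketch, not the code from this change: the helper name `with_entity_expansion_text_limit`, its `limit` parameter, and the simulated failure are assumptions made for illustration. Only `REXML::Security.entity_expansion_text_limit` and its setter come from the diff.

```ruby
require "rexml/document"

# Hypothetical helper (not part of this pull request): temporarily raises
# REXML's entity expansion text limit for the duration of the block and
# restores the previous value afterwards, even if the block raises.
def with_entity_expansion_text_limit(limit)
  previous = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  yield
ensure
  REXML::Security.entity_expansion_text_limit = previous
end

before = REXML::Security.entity_expansion_text_limit
inside = nil

begin
  with_entity_expansion_text_limit(1_342_177_280) do
    # The raised limit is visible while the block runs.
    inside = REXML::Security.entity_expansion_text_limit
    # Stand-in for a parser error, to show that `ensure` still restores.
    raise "simulated parse failure"
  end
rescue RuntimeError
  # Swallowed here only so the sketch can inspect the restored value.
end

after = REXML::Security.entity_expansion_text_limit
puts "before=#{before} inside=#{inside} after=#{after}"
```

Because the limit is a process-wide setting on `REXML::Security`, the raised value is also visible to any other REXML parsing that runs concurrently in the same process while the block is active.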