Commit

Recognize and after nesting (#22)
wvengen committed Jan 19, 2024
1 parent 0904d14 commit b25c064
Showing 3 changed files with 22 additions and 1 deletion.
1 change: 1 addition & 0 deletions data/test-samples-parsed
@@ -16,3 +16,4 @@ Halfvolle yoghurt met L rhamnosus Gorbach & Goldin, L acidophilus en B lactis, 2
Water, champignon˄1 12%, schouderham˄2 2,5%, plantaardige olie, TARWEBLOEM, gemodificeerd zetmeel, magere MELKPOEDER, zout, bieslook 10,5%, gistextract (bevat GERST), aroma, champignonsapconcentraat˄1 0,1% , fructose, mineraalzout (kalium), uienpoeder, knoflook, witte wijnextract, balsamicoazijn (wijnazijn, druivenmost), ˄1op duurzame wijze geteeld., ˄2Beter Leven keurmerk 1 ster. Kan ei, soja, selderij, mosterd bevatten.
Tomaat~ 84% (tomaat, tomatenpuree), wortel, ui, ROOM, suiker, Italiaanse KAAS (Parmigiano Reggiano BOB◊ 1,1%, Grana Padano BOB◊ 0,8% (bevat lysozym van EI)), extra olijfolie verkregen bij de eerste persing, rijstzetmeel, basilicum, peterselie, knoflook, maïszetmeel, tijm, MELKWEI, aroma, zwarte peper, zuurteregelaar: citroenzuur, tomat~ 84% (tomaat, tomatenpuree) , wortel, ui, ROOM, suiker, italiaanse KAAS (Parmigiano Reggiano BOB◊ 1,1%, Grana Padano BOB◊ 0,8% (bevat lysozym van EI)), ~ op duurzame wijze geteeld., ◊ Beschermde Oorsprongsbenaming.
Wraphapje mozzarella-tomaat: 36% tomatenwrap , 29% half zongedroogde tomaat (27% tomaat, zonnebloemolie, knoflook, zout, oregano, marjolein, peterselie), 20% mozzarella , 15% groene pesto . , Wraphapje geitenkaas-beenham: 41% geitenkaas , 33% wrap , 22% beenham , 4% honing. Allergie-informatie: bevat tarwe (gluten), lactose, melkeiwit, ei, cashewnoot, geitenmelkeiwit. Gemaakt in een bedrijf waar ook pinda's en andere noten worden verwerkt.
+Wheat Flour [with Calcium, Iron, Niacin (B3) and Thiamin (B1)] and Wholemeal Wheat Flour, Water, Yeast, Vegetable Oils (Sunflower, Rapeseed and Sustainable Palm in varying proportions), Salt, Wheat Gluten, Malted Barley Flour, Emulsifiers: E471, E472e, Soya Flour, Preservative: Calcium Propionate, Flavouring, Flour Treatment Agent: Ascorbic Acid (Vitamin C)
15 changes: 15 additions & 0 deletions lib/food_ingredient_parser/loose/scanner.rb
@@ -4,6 +4,7 @@ module FoodIngredientParser::Loose
   class Scanner

     SEP_CHARS = "|;,.".freeze
+    AND_SEP_RE = /\A\s*(and|en|und)\s+/i.freeze
     MARK_CHARS = "¹²³⁴⁵ᵃᵇᶜᵈᵉᶠᵍªº⁽⁾†‡⁺•°▪◊#^˄*~".freeze
     PREFIX_RE = /\A\s*(ingredients(\s*list)?|contains|ingred[iï][eë]nt(en)?(declaratie)?|bevat|dit zit er\s?in|samenstelling|zutaten)\b\s*[:;.]?\s*/i.freeze
     NOTE_RE = /\A\b(dit product kan\b|deze verpakking kan\b|kan sporen\b.*?\bbevatten\b|voor allergenen\b|allergenen\b|allergie[- ]informatie(\s*:|\b)|E\s*=|gemaakt in\b|geproduceerd in\b|bevat mogelijk\b|kijk voor meer\b|allergie-info|in de fabriek\b|in dit bedrijf\b|voor [0-9,.]+ (g\.?|gr\.?|ram|ml).*\bis [0-9,.]+ (g\.?|gr\.?|ram|ml).*\bgebruikt\b)/i.freeze
@@ -75,6 +76,11 @@ def scan_iteration_standard
       elsif ")]".include?(c) # close nesting
         add_child
         close_parent
+        # after bracket check for 'and' to not lose text
+        if is_and_sep?(@i+1)
+          @i += and_sep_len(@i+1)
+          add_child
+        end
       elsif is_notes_start? # usually a dot marks the start of notes
         close_all_ancestors
         @iterator = :notes
@@ -148,6 +154,15 @@ def is_sep?(chars: SEP_CHARS)
       chars.include?(c) && @s[@i-1..@i+1] !~ /\A\d.\d\z/
     end

+    def is_and_sep?(i = @i)
+      and_sep_len(i) > 0
+    end
+
+    def and_sep_len(i = @i)
+      m = @s[i..-1].match(AND_SEP_RE)
+      m ? m.offset(0).last : 0
+    end
+
     def is_mark?(i = @i)
       mark_len(i) > 0 && @s[i..i+1] !~ /\A°[CF]/
     end
7 changes: 6 additions & 1 deletion lib/food_ingredient_parser/strict/grammar/ingredient.treetop
@@ -5,7 +5,12 @@ module FoodIngredientParser::Strict::Grammar
     include IngredientColoned

     rule ingredient
-      ws* ( ingredient_nested / ingredient_coloned / ingredient_simple_with_amount )
+      ws*
+      (
+        ingredient_nested ( ws* and ws+ ingredient )? /
+        ingredient_coloned /
+        ingredient_simple_with_amount
+      )
     end

   end
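
The separator check added to the loose scanner can be tried in isolation. The following is a minimal standalone sketch, not the scanner itself: `AND_SEP_RE` is copied from the diff, while the two-argument `and_sep_len(s, i)` helper, the nil guard, and the shortened sample string are assumptions made so the snippet runs without the gem.

```ruby
# Regex copied from the diff: optional whitespace, "and"/"en"/"und", then whitespace.
AND_SEP_RE = /\A\s*(and|en|und)\s+/i.freeze

# Hypothetical standalone variant of the scanner's and_sep_len:
# returns the length of an "and" separator starting at index i of s, or 0.
# The `|| ""` guard (not in the original) covers an index past the end of s.
def and_sep_len(s, i)
  m = (s[i..-1] || "").match(AND_SEP_RE)
  m ? m.offset(0).last : 0
end

s = "Wheat Flour [with Calcium] and Wholemeal Wheat Flour, Water"
close = s.index("]")               # position of the closing bracket
len   = and_sep_len(s, close + 1)  # => 5 (" and ")
rest  = s[(close + 1 + len)..-1]   # => "Wholemeal Wheat Flour, Water"

and_sep_len("x) EN y", 2)          # => 4, matching is case-insensitive
and_sep_len("x) android", 2)       # => 0, "and" must be followed by whitespace
```

Because the regex requires whitespace after the conjunction, words that merely begin with "and", "en" or "und" are not treated as separators in this sketch.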