
w2vp

    1. Downloaded and unpacked the dataset.

    2. wc -l yelp_academic_dataset_review.json (how many lines does the dataset have?) -> 8021122

    3. Split the dataset with "split -l 350000 yelp_academic_dataset_review.json" -> 23 files of ~250 MB each

    4. Determined the data structure (one JSON object per line) -> {"review_id":"qCMDfOjWdoyNE-oU3h9DKg","user_id":"JtmLdyw4k1xV78jgjhKM_w","business_id":"vdR_vmmgfI56bwSZYYHtXg","stars":5.0,"useful":0,"funny":0,"cool":0, "text":"Tasty burgers and sandwiches..Staff were super friendly including the owner..I was just visiting the area and found this place yelp and thought would give this place a try..highly recommended!","date":"2019-08-01 06:51:51"}

    5. Trial run for w2v -> file containing only the review text, one review per line

    6. grep -o -P '(?<=text":").*(?=",)' yelp_academic_dataset_review.json > review_text_only.txt

    7. Convert to lowercase (GNU sed; -E is needed so that the parentheses form a capture group): sed -E 's/(.*)/\L\1/' regextest.txt > output.txt
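
    Steps 5 to 7 could also be done in a single pass with a JSON parser instead of grep and sed. The following is only a minimal sketch, not the method used in these notes: the names review_texts and convert are hypothetical, and it assumes that every line is a complete JSON object with a "text" field, as in the sample from step 4. Because json.loads decodes escapes such as \n and \", the sketch also collapses embedded line breaks so that each review stays on one line.

```python
import json
import re


def review_texts(lines):
    """Yield the lowercased "text" field of each JSON line, one review per item.

    Assumption: every non-empty line is one JSON object with a "text" key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line)["text"]
        # json.loads turns escaped \n into real newlines; collapse every
        # run of whitespace into a single space to keep one review per line.
        yield re.sub(r"\s+", " ", text).strip().lower()


def convert(src_path, dst_path):
    """Stream a JSON-lines review file into a plain-text file (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for text in review_texts(src):
            dst.write(text + "\n")


# Small in-memory sample; the doubled backslash puts a literal \n escape
# into the JSON string, just as it appears in the dataset.
sample = [
    '{"review_id":"a","stars":5.0,'
    '"text":"Tasty burgers..\\nHighly Recommended!",'
    '"date":"2019-08-01 06:51:51"}',
]
print(list(review_texts(sample)))  # ['tasty burgers.. highly recommended!']
```

    With the file names from the steps above, a call would look like convert("yelp_academic_dataset_review.json", "review_text_only.txt"). The file is read line by line, so this should also work on the unsplit dataset.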

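    8. Remove special characters.
       Remove newline characters: sed -i 's/\\n/ /g' small-samlple.txt (assuming the literal \n escape sequences left over from the JSON strings are meant; sed reads its input line by line, so an unescaped \n in the pattern never matches)
       Remove everything that is not a word character or a space: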