  • README.md 1.03 KiB
    1. Downloaded and unpacked the dataset.

    2. wc -l yelp_academic_dataset_review.json (how many lines the dataset has) -> 8021122

    3. Split the dataset with "split -l 350000 yelp_academic_dataset_review.json" -> 23 files of about 250 MB each

    4. Determined the data structure (one JSON object per line) -> {"review_id":"qCMDfOjWdoyNE-oU3h9DKg","user_id":"JtmLdyw4k1xV78jgjhKM_w","business_id":"vdR_vmmgfI56bwSZYYHtXg","stars":5.0,"useful":0,"funny":0,"cool":0, "text":"Tasty burgers and sandwiches..Staff were super friendly including the owner..I was just visiting the area and found this place yelp and thought would give this place a try..highly recommended!","date":"2019-08-01 06:51:51"}

    5. Trial run with w2v (word2vec) -> a file containing only the "review text", one review per line

    6. Extract the review text: grep -o -P '(?<=text":").*(?=",)' yelp_academic_dataset_review.json > review_text_only.txt

    7. Convert to lowercase: sed -e 's/\(.*\)/\L\1/' regextest.txt > output.txt (\L is a GNU sed extension)
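
Steps 6 and 7 can also be sketched in Python instead of grep and sed. This is a minimal sketch, not the method used above: it assumes every input line is one complete JSON object with a "text" field, as in the record shown in step 4. The function names `extract_review_texts` and `write_review_texts` are hypothetical. Because `json.loads` decodes escape sequences, an escaped `\n` inside a review becomes a real line break, so the sketch replaces line breaks with spaces to keep one review per line.

```python
import json


def extract_review_texts(lines):
    """Yield the lowercased "text" field of each JSON line, one review per item."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Decoded reviews may contain real line breaks; flatten them to spaces
        # so that each review stays on a single output line.
        text = record["text"].replace("\r", " ").replace("\n", " ")
        yield text.lower()


def write_review_texts(src_path, dst_path):
    """Stream src_path line by line and write one lowercased review per line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for text in extract_review_texts(src):
            dst.write(text + "\n")


# Example call, using the file names from steps 2 and 6:
# write_review_texts("yelp_academic_dataset_review.json", "review_text_only.txt")
```

Parsing the JSON avoids relying on the field order that the lookahead in step 6 depends on, and the file is read as a stream, so the 8021122 lines never have to fit in memory at once.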

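    8. Remove special characters. First replace the escaped newline sequences (the literal characters \n in the review text) with spaces: sed -i 's/\\n/ /g' small-samlple.txt
       Then remove everything that is not a word character or a space: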